All you need to know about using Elasticsearch in Python.

In this article, we are going to talk about how to use Elasticsearch in Python. As a data engineer, you may need to create Elasticsearch documents in Python with some scripts. As a software engineer, when you design your API in Python, you would need to make REST API calls to Elasticsearch to fetch the data. Therefore, if you are using Elasticsearch in your work or plan to learn it, this article can be useful for you.

If you haven’t installed Elasticsearch yet, you can go to the official website of Elasticsearch and download and install Elasticsearch and Kibana on your computer.

Alternatively, if you don’t want to install Elasticsearch and Kibana on your computer, you can use Docker containers for local development. If you haven’t installed Docker and Docker Compose yet, you can download them from the official Docker website and install them accordingly. I highly recommend using Docker containers for local development because they are platform-independent, avoid library dependency issues on your computer, and are thus easy to maintain. You can use this docker-compose file to start Elasticsearch and Kibana directly.

version: "3.8"
services:
  elasticsearch:
    image: elasticsearch:7.12.0
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - 9200:9200

  kibana:
    image: kibana:7.12.0
    environment:
      - "ELASTICSEARCH_URL=http://elasticsearch:9200"
      - "SERVER_NAME=127.0.0.1"
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch

volumes:
  esdata:
    driver: local

This docker-compose file specifies the docker images to use and other basic configurations. Importantly, it creates a persistent volume for Elasticsearch so that your data can persist when you restart the containers.

To start the Elasticsearch and Kibana services, which you can think of as Docker containers running in the background, copy the content above into a file named docker-compose.yml and run this command:

docker compose up -d

Note that you need to use the -f option to specify the name of the docker-compose file if it is not named docker-compose.yml.
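
For example, assuming you saved the file under the hypothetical name elasticsearch-kibana.yml, the command would be:

docker compose -f elasticsearch-kibana.yml up -d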

You can check the status of the services with this command:

docker compose ps

To check the logs of the services:

docker compose logs -f

When the services are running properly, you can start to use Elasticsearch in Python.


To run the commands in this article, you can use the Python console directly. Alternatively, you can run them in IPython, Jupyter Notebook, or Spyder. With these tools, you get command autocompletion and can keep track of the commands you have used.

To use the Elasticsearch module in Python, you need to install the elasticsearch package with pip:

python -m pip install "elasticsearch>7,<8"

On Windows, you may need to run this command to enable pip:

python -m ensurepip

Then open your favorite Python IDE and we can start to work with Elasticsearch in Python.

First, we need to create an Elasticsearch client:

from elasticsearch import Elasticsearch

es_client = Elasticsearch()

Here, we didn’t specify anything to create the client and all default settings are used. In your practical work, you would need to specify the host, user name, and password to create a valid client. To simulate this case, in the local development environment, you can also create the client by specifying all the default settings:

es_client = Elasticsearch(
    "localhost:9200",
    http_auth=["elastic", "changeme"],
)

elastic, changeme, 9200 are the default user name, password, and port for Elasticsearch.

You can also create the client in this way:

es_client = Elasticsearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=["elastic", "changeme"],
)

Here hosts is a list of nodes, or a single node we should connect to. A node should be a dictionary ({"host": "localhost", "port": 9200}), or a string in the format of host[:port] which will be translated to a dictionary automatically. Most of the time we would only connect to a single node and it’s more convenient to use the previous format. http_auth is a list or tuple where the first element is the user name and the second one is the password.
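
For example, here is a minimal sketch that mixes the two formats and connects to two local nodes (the second node on port 9201 is purely an assumption for illustration):

es_client = Elasticsearch(
    hosts=[
        {"host": "localhost", "port": 9200},
        "localhost:9201",  # string format, translated to a dictionary automatically
    ],
    http_auth=("elastic", "changeme"),
)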

If you want to have more advanced settings for authentication, you can check the official documentation.
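
For example, with version 7.x of the elasticsearch package you can pass an API key and TLS options when creating the client. This is only a sketch: the API key values and the certificate path are placeholders you would replace with your own.

es_client = Elasticsearch(
    hosts=[{"host": "localhost", "port": 9200}],
    api_key=("my_api_key_id", "my_api_key_secret"),  # placeholder credentials
    use_ssl=True,
    verify_certs=True,
    ca_certs="/path/to/ca.crt",  # placeholder path to your CA certificate
)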


To work with indices, we need to use IndicesClient. To create an index client, we need to pass in the Elasticsearch client created above:

from elasticsearch.client import IndicesClient

es_index_client = IndicesClient(es_client)

Before we create an index, we need to define its settings and mappings. Settings and mappings are not strictly required to create an index. In practical usage, however, you should always define them, because they make your search engine more robust, more efficient, and more powerful. In this article, we will use this demo configuration:

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                },
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                },
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                },
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                },
            },
        }
    },
}

If you want to be an expert in Elasticsearch, you would need to know more about the settings and mappings for an index.

In this example, we define the number of replicas for our index, which makes no difference in a local single-node environment, but in production, multiple replicas can improve availability and fault tolerance.

Besides, we define the fields for our document in the mappings section. Elasticsearch supports dynamic mapping, which means we don’t need to define the field types in advance and Elasticsearch will create them automatically. However, we should define the mapping explicitly whenever possible. It is better to be explicit about the mapping than implicit. The more you know about your data, the more robust the search engine can be.

Finally, we define an ngram filter and analyzer in the settings section, which support searching by partial input (autocompletion), as will be demonstrated later.

To create an Elasticsearch index with the above settings, run:

es_index_client.create(index="laptops-demo", body=configurations)
  • index: The name of the index.
  • body: The configuration for the index (settings and mappings).

If created successfully, in the console you can see:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'laptops-demo'}
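
You can also test how the custom ngram_analyzer tokenizes a piece of text with the analyze API of the index client. A minimal sketch:

es_index_client.analyze(
    index="laptops-demo",
    body={"analyzer": "ngram_analyzer", "text": "EliteBook"},
)
# The response contains the generated tokens: el, eli, elit, elite, ...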

We can also check the created index in Kibana (http://localhost:5601). Go to the Dev Tools window and run the following queries to check the settings and mappings of the created index:

GET _cat/indices
GET laptops-demo/_settings
GET laptops-demo/_mapping

Back in Python, we can create an alias for our index with the following command. You can use an alias to access an index just like the index itself.

es_index_client.put_alias(index="laptops-demo", name="laptops")

There can be multiple aliases for an index and there can be multiple indices with the same alias, which can be useful to group relevant indices together.
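
If you later reindex your data into a new index, you can also switch an alias to the new index atomically with update_aliases. A sketch, assuming the new index is named laptops-demo-v2:

es_index_client.update_aliases(
    body={
        "actions": [
            {"remove": {"index": "laptops-demo", "alias": "laptops"}},
            {"add": {"index": "laptops-demo-v2", "alias": "laptops"}},
        ]
    }
)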

To get the aliases of an index:

es_index_client.get_alias(
    index="laptops",
    allow_no_indices=True,
    ignore_unavailable=True
)
# {'laptops-demo': {'aliases': {'laptops': {}}}}

To get all the indices with the same alias, just specify the alias name as the index name:

es_index_client.get_alias(
    index="laptops",
    allow_no_indices=True,
    ignore_unavailable=True
)
# {'laptops-demo': {'aliases': {'laptops': {}}}}
  • allow_no_indices=True: No error will be raised if a wildcard expression or alias resolves to no concrete indices.
  • ignore_unavailable=True: No error will be raised if the specified index or alias does not exist.

If you want, you can delete an index with the index client:

es_index_client.delete(index="laptops-demo", ignore=404)
  • ignore=404: If the index to be deleted does not exist, no error will be raised.

You can also delete an alias for an index:

es_index_client.delete_alias(index="laptops-demo", name="laptops")
# {'acknowledged': True}

Now that we have an index created with proper settings and mappings, we can start to add documents to it. To create documents in Python, we need to use the client (es_client) created at the beginning of this article. To create a single document manually, we can use the index method of the client:

doc = {
    "id": 1,
    "name": "HP EliteBook Model 1",
    "brand": "HP",
    "price": 38842.00,
    "attributes": [
        {"attribute_name": "cpu", "attribute_value": "Intel Core i7"},
        {"attribute_name": "memory", "attribute_value": "8GB"},
        {"attribute_name": "storage", "attribute_value": "256GB"},
    ],
}

es_client.index(index="laptops-demo", id=1, body=doc)

I always prefer to check the results in Kibana, because the index name, field names, and commands can be auto-completed and formatted. Besides, the results are also nicely formatted for easy readability. In Kibana, run:

GET laptops-demo/_doc/1

Of course, you can also check the result in Python if you prefer:

es_client.get(index="laptops-demo", id=1)
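
Similarly, you can update or delete a single document with the client. A minimal sketch (the new price is just an example value; the document will be re-created by the bulk indexing below anyway):

# Update one field of the document with id 1.
es_client.update(index="laptops-demo", id=1, body={"doc": {"price": 35000.00}})

# Delete the document with id 1.
es_client.delete(index="laptops-demo", id=1)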

So far you have learned how to create a single Elasticsearch document in Python. However, Python is not that useful if you just want to create one or two documents. Kibana can be more useful if you just want to do CRUD operations on a couple of documents manually. The real power of Python is batch processing. When you have a large number of documents to create, you can write a script to do it.

Suppose you have a CSV feed file for the laptops which need to be indexed. You can download the demo CSV file from this link. To create documents in bulk, you need to use the bulk method of the client. The format to be used is the same as the bulk API:

{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "create" : { "_index" : "test", "_id" : "2" } }
{ "field1" : "value3" }
{ "update" : {"_index" : "test", "_id" : "1" } }
{ "doc" : {"field2" : "value2"} }
{ "delete" : { "_index" : "test", "_id" : "2" } }
  • Both the index and create actions create a new document and expect a source on the next line. The difference is that create fails if a document with the same ID already exists in the target, while index adds or replaces a document as necessary.
  • update updates an existing document and expects the fields to be updated on the next line.
  • delete deletes a document and does not expect a source on the next line.

To create documents in bulk in Python, we need to read the data from the CSV file and convert it into the format that the bulk API expects. We can use the following code to read the data, convert it, and create the documents:

import csv
import json

columns = ["id", "name", "price", "brand", "cpu", "memory", "storage"]
index_name = "laptops-demo"

with open("laptops_demo.csv", "r") as fi:
    reader = csv.DictReader(
        fi, fieldnames=columns, delimiter=",", quotechar='"'
    )

    # This skips the first row which is the header of the CSV file.
    next(reader)

    actions = []
    for row in reader:
        action = {"index": {"_index": index_name, "_id": int(row["id"])}}
        doc = {
            "id": int(row["id"]),
            "name": row["name"],
            "price": float(row["price"]),
            "brand": row["brand"],
            "attributes": [
                {"attribute_name": "cpu", "attribute_value": row["cpu"]},
                {"attribute_name": "memory", "attribute_value": row["memory"]},
                {
                    "attribute_name": "storage",
                    "attribute_value": row["storage"],
                },
            ],
        }
        actions.append(json.dumps(action))
        actions.append(json.dumps(doc))

    with open("laptops_demo.json", "w") as fo:
        fo.write("\n".join(actions))

    es_client.bulk(body="\n".join(actions))

Key points:

  • The csv module reads the CSV file and returns the result as dictionaries.
  • The json module converts the Python dictionaries to JSON strings, which are required by the bulk API. You can read this article to learn more about Python dictionaries, JSON, and the related caveats.
  • We are using the index action to create the documents. The index action can add or replace a document as necessary. Therefore, you can run the code multiple times and will get the same result.
  • For each index action, there should be a document immediately after it. The document should be formatted according to the mappings defined at the beginning of this article.
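
Alternatively, instead of building the newline-delimited payload yourself, you can use the bulk helper from the elasticsearch.helpers module, which accepts plain Python dictionaries. A minimal sketch, assuming docs is a list of document dictionaries shaped like the doc built earlier in this article:

from elasticsearch.helpers import bulk

# Each action is a plain dictionary; the metadata fields start with an underscore.
helper_actions = [
    {"_index": index_name, "_id": doc["id"], "_source": doc} for doc in docs
]
bulk(es_client, helper_actions)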

After you run the code, you can check the results in Kibana:

GET laptops-demo/_search

You can see that the 200 documents have been created successfully:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 200,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "laptops-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "id" : "1",
          "name" : "HP EliteBook Model 1",
          "price" : "38842",
          "attributes" : [
            {
              "attribute_name" : "cpu",
              "attribute_value" : "Intel Core i7"
            },
...

Now that all the documents have been added to our Elasticsearch index, we can search for them based on different conditions.

For example, let’s search for all Apple laptops. In Kibana, the query to use is:

GET laptops-demo/_search
{
  "query": {
    "match": {
      "name": "Apple"
    }
  }
}

The corresponding Python code is:

search_query = {
    "query": {
        "match": {
            "name": "Apple"
        }
    }
}

es_client.search(index="laptops-demo", body=search_query)

You can see the same result in the Python console as in Kibana, but I think you would agree that the results in Kibana are more readable.
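
If you prefer to stay in Python, you can make the response more readable by pretty-printing it with the json module, since the response is a plain dictionary:

import json

response = es_client.search(index="laptops-demo", body=search_query)
print(json.dumps(response, indent=2))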

Finally, let’s do an interesting search. Since we use an ngram filter and analyzer for the name field, we can do a search-as-you-type, or autocompletion, search; that is, we can search with a query that is only part of the indexed text. For example:

search_query = {
    "query": {
        "match": {
            "name.ngrams": "Appl"
        }
    }
}

es_client.search(index="laptops-demo", body=search_query)

With either query, Apple or Appl, you get the same results. This is the power of ngrams in Elasticsearch, which can be really helpful in many scenarios.
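
Since the attributes field is mapped as a nested type, you can also search laptops by their attributes with a nested query. A sketch that looks for laptops whose cpu attribute mentions i7 (the query values are just examples):

search_query = {
    "query": {
        "nested": {
            "path": "attributes",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"attributes.attribute_name": "cpu"}},
                        {"match": {"attributes.attribute_value": "i7"}},
                    ]
                }
            },
        }
    }
}

es_client.search(index="laptops-demo", body=search_query)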


In this article, you have learned the following knowledge about Elasticsearch and Kibana:

  • How to start Elasticsearch and Kibana with Docker.
  • How to set basic configurations for an Elasticsearch index, including the settings and mappings.
  • How to create an Elasticsearch index, aliases, and documents in Python.
  • How to write Python code to create Elasticsearch documents in bulk.
  • How to query documents in Kibana and Python.

Elasticsearch is a really powerful search engine and is very widely used. The focus of this article is creating Elasticsearch documents with Python in bulk. We have barely scratched the surface of Elasticsearch search queries. I will have a dedicated article to introduce the basic and advanced queries of Elasticsearch for searching and analysis. The queries used in Elasticsearch’s native query DSL are the same as those used in Python. Therefore, after you have mastered the native queries in Kibana, you can use them freely in Python.

