IBM Cloud Docs
Use Elasticsearch vector search capabilities

Objectives

In this tutorial, you deploy an instance of Databases for Elasticsearch and use it to store vector representations of images that you are then able to search to find similarities with new, unseen, images.

These vector representations, known as embeddings, are created using machine learning algorithms. Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way that humans learn, gradually improving in accuracy. By using statistical methods, algorithms are trained to make classifications or predictions, and to uncover key insights in data mining projects.

These learning algorithms are known as "models". In this tutorial, we use one such model: OpenAI's CLIP (Contrastive Language-Image Pre-Training), a neural network trained on a wide variety of (image, text) pairs.

Traditionally, finding similarities between images, something that is relatively straightforward to the human eye, has been difficult for a computer to do. Machine Learning has transformed this field of search.

See the other tutorials in this Elasticsearch machine learning series.

Databases for Elasticsearch is a paid-for service, so following this tutorial will incur charges.

Getting productive

To begin the provisioning process, install the required productivity tools: Terraform, which is used to provision the database, and Python 3 with pip, which is used to run the scripts.

You also need a dataset of images that provide the corpus for doing similarity search. For example, you might have a dataset of car images or of bird images. You typically need thousands of these images. We cannot provide one here because of copyright restrictions on images.

In this tutorial, you do not upload the images themselves to your database. The vector representations are calculated locally, and only those are uploaded. In a real-life scenario, you would probably store the images somewhere (like an Object Storage bucket) and keep a reference to each image's location alongside its vector representation, for retrieval.

Obtain an API key

Create an IBM Cloud API key that enables Terraform to provision infrastructure into your account. You can create up to 20 API keys.

For security reasons, the API key is only available to be copied or downloaded at the time of creation. If the API key is lost, you must create a new API key.

Clone the project

Clone the project from the GitHub repository.

git clone https://github.com/IBM/elasticsearch-ml-vector-search-tutorial.git

Install the Elasticsearch cluster

  1. Navigate into the terraform folder of the cloned project.

    cd elasticsearch-ml-vector-search-tutorial/terraform
    
  2. On your machine, create a document that is named terraform.tfvars, with the following fields:

     ibmcloud_api_key = "<your_api_key_from_step_1>"
     region = "<your_region>"
     elastic_password = "<make-up-a-password>"
    

    The terraform.tfvars document contains variables that you might want to keep secret, so it is excluded from the public GitHub repository.

  3. Install the infrastructure with the following command:

    terraform init
    terraform apply --auto-approve
    
  4. Finally, export the database access URL to your terminal environment. It will be required by subsequent steps.

     terraform output --json
     export ES_URL="<the url value obtained from the output>"
    

Install dependencies

You need to install some Python dependencies:

    pip3 install elasticsearch
    pip3 install Pillow
    pip3 install imgbeddings
    pip3 install requests

Generate vector embeddings for your images

Create a folder called images at the root of the project folder structure. Inside it, create one or more folders with different images. For example, if you have a dataset of cars then you may want to create folders for different types of car, for example fordescort and fordcortina. This is not strictly necessary (all images could go in a folder called cars), but organizing folders may make it easier to identify search matches later on.
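The folder convention above can be sketched as a small traversal, in which each image's parent folder name doubles as a human-readable description. This is an illustrative sketch, not the actual create.py code; the `list_images` helper name is an assumption.

```python
from pathlib import Path

def list_images(root: str) -> list[tuple[str, str]]:
    """Return (file_path, description) pairs for every image under root.
    The description is the name of the folder that contains the image."""
    exts = {".jpg", ".jpeg", ".png"}
    return [
        (str(p), p.parent.name)
        for p in sorted(Path(root).rglob("*"))
        if p.suffix.lower() in exts
    ]
```

For example, a file at `images/fordescort/1.jpg` would yield the pair `("images/fordescort/1.jpg", "fordescort")`, which is why per-category folders make search matches easier to interpret later.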

You are ready to run the create.py script. In the root of the project type:

python3 create.py

This script creates an Elasticsearch index called images in your Databases for Elasticsearch database.
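If you want to build a comparable index by hand, a mapping along these lines would support vector search. This is a hedged sketch, not the actual create.py code: the `file_path` and `desc` field names match the sample search response later in this tutorial, but the `embedding` field name and the 768-dimension default are assumptions based on the imgbeddings package's CLIP model.

```python
def images_mapping(dims: int = 768) -> dict:
    """Build an index mapping with a dense_vector field for kNN search.
    The 768-dim default assumes the imgbeddings CLIP model; adjust it to
    match the length of the vectors you actually generate."""
    return {
        "properties": {
            "embedding": {  # assumed name for the vector field
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
            "file_path": {"type": "keyword"},
            "desc": {"type": "keyword"},
        }
    }

# Creating the index requires a live cluster, for example:
# from elasticsearch import Elasticsearch
# es = Elasticsearch(os.environ["ES_URL"])
# es.indices.create(index="images", mappings=images_mapping())
```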

Then, it cycles through the images folder and, for each image it finds, it creates a set of embeddings using the open source imgbeddings Python package, which uses the open source CLIP model from OpenAI. With those embeddings and a small amount of metadata (file path and file ID), the script creates a document that is then uploaded to the Elasticsearch index.
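The per-image work could look roughly like this. The imgbeddings calls in the comment are the package's documented API, but the `build_doc` helper and the SHA-256 file ID are illustrative assumptions (the 64-character hex `_id` in the sample output below is consistent with such a digest).

```python
import hashlib
from pathlib import Path

def build_doc(file_path: str, embedding: list[float]) -> tuple[str, dict]:
    """Build an (id, document) pair for one image. The document stores the
    embedding plus a little metadata; the id is a hash of the file path."""
    doc_id = hashlib.sha256(file_path.encode()).hexdigest()
    doc = {
        "embedding": embedding,  # the CLIP vector for this image
        "file_path": file_path,
        "desc": Path(file_path).parent.name,  # the containing folder name
    }
    return doc_id, doc

# Generating the embedding and uploading requires the model and a live
# cluster, for example:
# from PIL import Image
# from imgbeddings import imgbeddings
# ibed = imgbeddings()
# vec = ibed.to_embeddings(Image.open(file_path))[0].tolist()
# doc_id, doc = build_doc(file_path, vec)
# es.index(index="images", id=doc_id, document=doc)
```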

Depending on the size of your dataset, this process could take multiple hours to complete.

Search your dataset

You are now ready to test your new dataset. To do this you need to find another image of a car (if you are using cars) that is not part of your original dataset. Save this image (for example, myimage.jpg) to the root of the project.

Run the search.py script, passing in the image name:

python3 search.py myimage.jpg

The script generates a set of embeddings for the supplied image by using the same algorithm as before. It then performs a k-nearest neighbor (kNN) search on the dataset to find the closest match to the supplied image, and returns the details of that match.
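In Elasticsearch 8.x, such a search can be expressed with the top-level `knn` search option. This sketch only builds the request body; the `embedding` field name and the parameter defaults are assumptions, not necessarily what search.py uses.

```python
def knn_query(vector: list[float], k: int = 1, candidates: int = 100) -> dict:
    """Build the knn option for an approximate nearest neighbor search."""
    return {
        "field": "embedding",          # assumed dense_vector field name
        "query_vector": vector,        # embedding of the query image
        "k": k,                        # how many nearest matches to return
        "num_candidates": candidates,  # per-shard candidates to consider
    }

# Against a live cluster, this would be sent as, for example:
# resp = es.search(index="images", knn=knn_query(vec),
#                  fields=["file_path", "desc"], source=False)
```

A larger `num_candidates` makes the approximate search more accurate at the cost of speed.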

    {
      "took": 4947,
      "timed_out": false,
      "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
      "hits": {
        "total": { "value": 1, "relation": "eq" },
        "max_score": 0.97320974,
        "hits": [
          {
            "_index": "images",
            "_id": "5c910de5357cb9a3f1b43e6618b141afa6666bfca8676269d5e10a14e1688819",
            "_score": 0.97320974,
            "fields": {
              "file_path": ["./images/Bald_Eagle/26897.jpg"],
              "desc": ["Bald_Eagle"]
            }
          }
        ]
      }
    }

You can repeat this process with other images.

Image similarity, not bird similarity

What you have built is more of an image similarity search than a bird (or car) similarity search. The vector search algorithm analyzes the whole image and looks for matches. An image of a warbler sitting on a tree branch might be more similar to an image of a tree sparrow sitting on a tree branch than to an image of a warbler in flight. Nevertheless, it is a powerful tool for finding relationships between objects, something that until recently was difficult, if not impossible, for a computer to do.

Tear down your infrastructure

Your Databases for Elasticsearch incurs charges. After you finish this tutorial, you can remove all the infrastructure by going to the terraform directory of the project and using the command:

terraform destroy

Next steps

If you are ready to explore further, you can use Databases for Elasticsearch not only to store vector embeddings but also to generate them. That is the subject of the next tutorial in this series.