
Filtering in Qdrant


This topic will explore the concept of filtering in Qdrant. We will work with a collection of vector data and apply different filtering strategies to query data based on various attributes.

The setup

First, we have to install the Qdrant client and run the Qdrant Docker image. The client is installed with pip:

pip install qdrant-client

To run the Docker image, use the following command:

docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant

Let’s start by importing the required modules and instantiating the Qdrant client.

import json
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

COLLECTION_NAME = 'filter_showcase'
QDRANT_HOST = 'localhost'
QDRANT_PORT = 6333
VECTOR_DIMENSION = 1536

client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)

Next, we will create the collection with vector configuration:

from qdrant_client.models import Distance, VectorParams

if not client.collection_exists(collection_name=COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_DIMENSION, distance=Distance.COSINE),
    )

Now, we can upload the data into Qdrant. You can use this script to create the collection and load the dataset. We will work with a synthetic dataset to showcase some of the main filters. A single row of the dataset has the following format:

{
  "id": 1,
  "datetime": "2021-05-14T08:23:45Z",
  "text": "Exploring the advancements in artificial intelligence and machine learning.",
  "category": "Technology",
  "tags": [
    "AI",
    "Machine Learning",
    "Innovation"
  ],
  "value": 75.6,
  "is_active": true,
  "embedding": [] # an OpenAI embedding with 1536 elements
}
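The loading script itself is not reproduced here. As a rough sketch of how one row might be prepared for upload, the helper below splits a row into the three parts a Qdrant point needs: an ID, a vector, and a payload. The function name row_to_point_fields is hypothetical and not part of qdrant-client. The sketch also assumes that point IDs are random UUIDs and that the row's own id stays in the payload, which would match the later examples that read point.payload.get("id").

```python
import json
import uuid

VECTOR_DIMENSION = 1536


def row_to_point_fields(row: dict) -> dict:
    """Split one dataset row into the id / vector / payload parts of a point."""
    vector = row["embedding"]
    if len(vector) != VECTOR_DIMENSION:
        raise ValueError(f"expected {VECTOR_DIMENSION} values, got {len(vector)}")
    # Everything except the embedding is kept as filterable payload.
    payload = {key: value for key, value in row.items() if key != "embedding"}
    # Assumption: a random UUID is used as the point ID.
    return {"id": str(uuid.uuid4()), "vector": vector, "payload": payload}


# A stand-in row with a dummy embedding (real rows carry OpenAI embeddings).
raw = json.dumps({
    "id": 1,
    "datetime": "2021-05-14T08:23:45Z",
    "category": "Technology",
    "tags": ["AI", "Machine Learning", "Innovation"],
    "value": 75.6,
    "is_active": True,
    "embedding": [0.0] * VECTOR_DIMENSION,
})

fields = row_to_point_fields(json.loads(raw))
print(sorted(fields["payload"]))
```

Each resulting dictionary can then be wrapped as PointStruct(**fields) and sent in a list with client.upsert(collection_name=COLLECTION_NAME, points=[...]).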

The clauses

The first clause we will consider is must. This clause behaves like an AND condition: it retrieves only the points that match all of the provided criteria. Here, it selects only the points where category is "Technology" and is_active is True:

from qdrant_client.http.models import FieldCondition, Filter, MatchValue

must_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="Technology")),
        FieldCondition(key="is_active", match=MatchValue(value=True)),
    ]
)

result_must = client.scroll(collection_name=COLLECTION_NAME, scroll_filter=must_filter)

print([point.payload.get("id") for point in result_must[0]])
# Output: [20, 1, 17]

Another clause, must_not, works in a similar manner but negates the specified conditions: a point is returned only if it matches none of them. Note that the example above uses client.scroll(), a client method that lists the points matching a given filter without searching for similar embeddings.
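client.scroll() returns a tuple of two elements, a page of points and an offset for the next page, which is why the example indexes the result with [0]. By default a page holds at most 10 points. The sketch below shows one way to walk through every page. The iter_scroll helper and the FakeClient stand-in are hypothetical names introduced only so that the example runs without a server. With a real QdrantClient the offset is a point ID rather than a list index.

```python
from typing import Any, Iterator


def iter_scroll(client: Any, collection_name: str, scroll_filter: Any = None,
                page_size: int = 10) -> Iterator[Any]:
    """Yield every point matching the filter, one page at a time."""
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            scroll_filter=scroll_filter,
            limit=page_size,
            offset=offset,
        )
        yield from points
        if offset is None:  # no further pages
            break


class FakeClient:
    """Stand-in for QdrantClient so the sketch runs without a server."""

    def __init__(self, items):
        self.items = items

    def scroll(self, collection_name, scroll_filter=None, limit=10, offset=None):
        # The fake uses list indexes as offsets; Qdrant uses point IDs.
        start = offset or 0
        page = self.items[start:start + limit]
        next_offset = start + limit if start + limit < len(self.items) else None
        return page, next_offset


all_ids = list(iter_scroll(FakeClient(list(range(25))), "filter_showcase"))
print(len(all_ids))  # 25
```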

The should clause operates like an OR condition, returning points that match at least one of the given criteria. This gives more flexibility, since a point only has to satisfy one of the conditions. In this case, points with category "Technology" or with is_active set to True are selected.

should_filter = Filter(
    should=[
        FieldCondition(key="category", match=MatchValue(value="Technology")),
        FieldCondition(key="is_active", match=MatchValue(value=True)),
    ]
)
result_should = client.scroll(
    collection_name=COLLECTION_NAME, scroll_filter=should_filter
)
print([point.payload.get("id") for point in result_should[0]])
# Output: [10, 14, 15, 20, 4, 11, 13, 3, 6, 7]
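To make the difference between the clauses concrete, here is a small pure-Python imitation of the logic. It does not use Qdrant at all. It is a simplified model, assuming the clauses are combined with AND when several are given, and it handles only exact matches on scalar fields. The helper names and the four sample payloads are made up for illustration.

```python
from typing import Callable, Optional, Sequence

Condition = Callable[[dict], bool]


def field_equals(key: str, value) -> Condition:
    """Mimic FieldCondition(key=..., match=MatchValue(value=...))."""
    return lambda payload: payload.get(key) == value


def passes(payload: dict,
           must: Optional[Sequence[Condition]] = None,
           should: Optional[Sequence[Condition]] = None,
           must_not: Optional[Sequence[Condition]] = None) -> bool:
    """Apply must (all), should (any) and must_not (none) to one payload."""
    if must is not None and not all(cond(payload) for cond in must):
        return False
    if should is not None and not any(cond(payload) for cond in should):
        return False
    if must_not is not None and any(cond(payload) for cond in must_not):
        return False
    return True


is_tech = field_equals("category", "Technology")
is_active = field_equals("is_active", True)

# Made-up payloads, not rows from the tutorial dataset.
rows = [
    {"id": 1, "category": "Technology", "is_active": True},
    {"id": 2, "category": "Technology", "is_active": False},
    {"id": 3, "category": "Health", "is_active": True},
    {"id": 4, "category": "Health", "is_active": False},
]

print([r["id"] for r in rows if passes(r, must=[is_tech, is_active])])    # [1]
print([r["id"] for r in rows if passes(r, should=[is_tech, is_active])])  # [1, 2, 3]
print([r["id"] for r in rows if passes(r, must_not=[is_tech])])           # [3, 4]
```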

Meta: a closer look at main classes

In the previous section, we imported and used Filter(), FieldCondition(), MatchValue(), so let’s see these and some other classes in more detail.

The primary purpose of the Filter() class is to limit search to vectors that meet specific criteria based on their payload data and combine multiple conditions to create complex filtering logic for searches.

In general, the Filter() class can include multiple clauses (such as must and must_not in the example below) and is used with the following syntax:

combined_filter = Filter(
    must=[
        FieldCondition(...)
    ],
    must_not=[
        FieldCondition(...)
    ],
    ...
)

FieldCondition() specifies a condition on a payload field, so that filtered searches return only results that match specific criteria in addition to vector similarity. Its two main parameters are:

key: The name of the payload field to check (string).
match: The type of match operation to perform. This is set to one of the Match… classes from the qdrant_client.http.models module, such as MatchValue(), which performs exact matching on field values (useful for filtering by discrete categories, tags, status values, or any other field where you need an exact match rather than a range or partial match).

Besides MatchValue(), there are classes for other search types, such as MatchText() for full-text search, Range() for filtering numeric values with >, <, ≥, ≤, and DatetimeRange() for working with dates. Range() and DatetimeRange() are passed through the range parameter of FieldCondition() rather than match. We will consider those in the following sections.
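For orientation, each of these classes corresponds to a small JSON object in the filter format of the Qdrant REST API. The sketch below writes the four condition types as plain Python dictionaries and combines them into one filter body. The variable names are hypothetical, and the particular combination of clauses is only an illustration.

```python
import json

# Each dictionary mirrors one FieldCondition.
exact_match = {"key": "category", "match": {"value": "Technology"}}    # MatchValue
full_text = {"key": "text", "match": {"text": "learning"}}             # MatchText
numeric_range = {"key": "value", "range": {"gte": 80.0, "lte": 90.0}}  # Range
datetime_range = {"key": "datetime", "range": {"gt": "2021-11-11T00:00:00Z"}}  # DatetimeRange

# Clauses are the top-level keys of the filter object.
filter_body = {
    "must": [exact_match, numeric_range, datetime_range],
    "must_not": [full_text],
}

print(json.dumps(filter_body, indent=2))
```

Seeing the plain structure can help when reading filters in logs or when calling the REST API directly instead of going through the Python client.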

The text match

Let’s take a look at MatchText(), a class for filtering and retrieving points based on the textual content of a specified field. Here we switch from .scroll() to .query_points(). Since no query vector is passed, the call does not rank results by embedding similarity and simply returns points that satisfy the filter:

from qdrant_client.models import MatchText

match_text_filter = Filter(
    must=[FieldCondition(key="text", match=MatchText(text="learning"))]
)

search_result = client.query_points(
    collection_name=COLLECTION_NAME, query_filter=match_text_filter, limit=5
).points

print(search_result)

Working with ranges

Similarly, we can query either numeric ranges or time ranges. The dataset we are working with stores the datetime field in the RFC 3339 format, which is the format Qdrant expects for datetime values and which makes time filtering possible. If you wish to enable time-based searches in your application, convert your timestamps to RFC 3339 first.
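As an illustration of such a conversion, the sketch below uses only the Python standard library to turn a Unix timestamp, or a naive timestamp string, into an RFC 3339 string in UTC. The helper names are hypothetical, and the second function assumes the input is already in UTC, which may not hold for your data.

```python
from datetime import datetime, timezone

RFC3339_UTC = "%Y-%m-%dT%H:%M:%SZ"


def epoch_to_rfc3339(seconds: float) -> str:
    """Convert a Unix timestamp to an RFC 3339 string in UTC."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime(RFC3339_UTC)


def naive_to_rfc3339(text: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Convert a naive timestamp string, assumed to be UTC, to RFC 3339."""
    moment = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return moment.strftime(RFC3339_UTC)


print(epoch_to_rfc3339(1620980625))             # 2021-05-14T08:23:45Z
print(naive_to_rfc3339("2021-05-14 08:23:45"))  # 2021-05-14T08:23:45Z
```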

We’ll use Range() to find all points whose value field is between 80 and 90, boundaries included, since gte (greater than or equal to) and lte (less than or equal to) are used:

from qdrant_client.models import Range

select_by_value = Filter(
    must=[
        FieldCondition(
            key="value",
            # gte/lte make both bounds inclusive; gt and lt are left unset
            range=Range(gte=80.0, lte=90.0),
        )
    ]
)

search_result = client.query_points(
    collection_name=COLLECTION_NAME, query_filter=select_by_value, limit=5
).points

print(search_result)
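The four bounds can be read as independent checks, where a bound set to None is simply ignored. The pure-Python sketch below imitates that reading without calling Qdrant. The in_range helper is a hypothetical name, and apart from 75.6 the sample values are made up.

```python
from typing import Optional


def in_range(value: float,
             gt: Optional[float] = None, gte: Optional[float] = None,
             lt: Optional[float] = None, lte: Optional[float] = None) -> bool:
    """Return True when value satisfies every bound that is not None."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


values = [75.6, 80.0, 85.2, 90.0, 90.1]
print([v for v in values if in_range(v, gte=80.0, lte=90.0)])  # [80.0, 85.2, 90.0]
print([v for v in values if in_range(v, gt=80.0, lt=90.0)])    # [85.2]
```

The second call shows what changes with strict bounds: the boundary values 80.0 and 90.0 drop out.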

Suppose we want to find the points starting from November 11th, 2021 (without an upper bound):

from qdrant_client.models import DatetimeRange

time_based_filter = Filter(
    must=[
        FieldCondition(
            key="datetime",
            # gte includes midnight of November 11th itself; no upper bound is set
            range=DatetimeRange(gte="2021-11-11T00:00:00Z"),
        )
    ]
)


search_result = client.query_points(
    collection_name=COLLECTION_NAME, query_filter=time_based_filter
).points

print(search_result)

Conclusion

As a result, you are now familiar with the following aspects:

Qdrant supports various filtering strategies to query vector data based on attributes alongside vector similarity searches.
Filter() combines multiple conditions with clauses like must (AND), must_not (negation), and should (OR) to create filtering logic.
FieldCondition() specifies a condition on a payload field, with a key parameter for the field name and a match or range parameter for the operation.
MatchValue() performs exact matching on field values, useful for filtering by categories. MatchText() filters based on textual content within specified fields, supporting full-text search capabilities.
Range() filters allow querying numeric values using comparison operators (greater than, less than, etc.) to find values within specific boundaries. DatetimeRange() supports time-based filtering using RFC 3339 format timestamps, enabling queries with specific time constraints.
The client's .scroll() method lists items matching a given filter without considering vector similarity, while .query_points() runs filtered queries and can also rank results by embedding similarity.
