
Introduction to vector search


In this topic, you'll learn about different types of vector search methods: how to convert content into searchable vectors, organize them for quick retrieval, and rank results effectively. You will also learn how to combine several search methods and select the best results.

Overview

Embedding models transform text into vectors. These vectors capture the semantic meaning of the text so that similar documents are located near each other. Each text unit (for example, a word, a sentence, or a document) is mapped to a point in a multi-dimensional space. When you perform a search, your query is also converted to a vector by the same model that was used to encode the original documents. The system then finds vectors in the space that are nearest to your query vector.

In essence, word embeddings are mapped to points in a vector space, as in the example below:

[Illustration: word embeddings plotted as points in a vector space]

Vector search comes in three main types: dense, sparse, and hybrid.

  • Dense vector search uses embeddings produced by language models (such as BERT or OpenAI's text embedding models) to capture the meaning of text. Think of it like understanding that "car" and "automobile" mean the same thing, even though they're different words. This method is good at finding relevant matches even when exact keywords don't match.

  • Sparse vector search works with a different kind of vector that encodes which words are present in the text, without focusing on their underlying meaning as much as dense embeddings do. It's like creating an inventory of words in a document, noting how often each appears and how unique it is across the collection (this is the idea behind TF-IDF weighting). This approach is suitable for scenarios where exact keyword matches are important, such as technical documentation or legal documents.

  • Hybrid vector search combines both approaches. Dense embeddings capture the context of the query and sparse embeddings put more emphasis on the exact keywords. This combination often provides more accurate results than using either method alone.
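To make the combination concrete, here is a minimal sketch of one simple fusion strategy: a weighted sum of a dense (semantic) similarity score and a sparse (keyword) score. The weight alpha and the score values below are hypothetical; production systems often use more elaborate fusion methods such as reciprocal rank fusion.

```python
# Minimal sketch of hybrid score fusion: blend a dense (semantic) similarity
# score with a sparse (keyword) score via a weighted sum. The weight alpha
# and the example scores are hypothetical.

def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.5) -> float:
    """Blend semantic and keyword relevance into a single ranking score."""
    return alpha * dense_score + (1 - alpha) * sparse_score

# A document that matches the query's meaning but not its exact wording...
print(hybrid_score(dense_score=0.91, sparse_score=0.20))  # 0.555
# ...versus one that matches keywords exactly but is semantically weaker.
print(hybrid_score(dense_score=0.40, sparse_score=0.95))  # 0.675
```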

Preparing the Dataset

Before creating vectors from the documents you want to search over, typical NLP text preprocessing techniques are applied:

  • Common words such as "the," "is," and "at" are removed because they do not add significant meaning.

  • Text is lowercased and punctuation is handled to ensure consistency.

Example:

  • Initial Text: "The quick brown fox jumps over the lazy dog."

  • After Cleaning: "quick brown fox jumps lazy dog"

This preprocessing makes the subsequent embedding generation more effective by focusing on the meaningful components of the sentence.
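A minimal sketch of this cleaning step is shown below, assuming a small hand-picked stop-word list; real pipelines usually take tokenization and stop-word lists from a library such as NLTK or spaCy.

```python
import string

# Illustrative subset only; real stop-word lists contain hundreds of entries.
STOP_WORDS = {"the", "is", "at", "a", "over"}

def preprocess(text: str) -> str:
    # Lowercase, strip punctuation, and drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(preprocess("The quick brown fox jumps over the lazy dog."))
# quick brown fox jumps lazy dog
```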

Naïve Chunking with Overlap

The next step involves creating more manageable text fragments. For long documents, splitting text into smaller pieces, or chunks, is important for maintaining context and ensuring better embedding quality. Additionally, all embedding models have a context limit: they cannot encode text that contains more than a fixed number of tokens, and will truncate whatever lies beyond that limit. Chunking keeps each piece within the limit while preserving enough context for a meaningful embedding.

Naïve chunking is a method of dividing text into smaller parts. It involves splitting the text into segments, sometimes with overlapping words at the boundaries to preserve context. This overlap ensures that important information is not lost when moving from one chunk to the next, which is useful for understanding relationships between consecutive sentences or phrases.

For example, consider the following text: "Machine learning models require a lot of data to train effectively. Preprocessing the data is crucial for achieving high accuracy."

When chunked with a size of 10 words and an overlap of 3 words, it would look like this:

  • Chunk 1: "Machine learning models require a lot of data to train"

  • Chunk 2: "data to train effectively. Preprocessing the data is crucial for"

  • Chunk 3: "is crucial for achieving high accuracy."

The overlapping words, such as "data to train," help maintain continuity and context between the chunks, ensuring that the meaning is preserved throughout the process. This method is simple yet effective for handling long texts while retaining information.
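Here is a minimal word-level implementation of naïve chunking with overlap that reproduces the example above:

```python
def chunk_text(text: str, chunk_size: int = 10, overlap: int = 3) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

text = ("Machine learning models require a lot of data to train effectively. "
        "Preprocessing the data is crucial for achieving high accuracy.")
for i, chunk in enumerate(chunk_text(text), 1):
    print(f"Chunk {i}: {chunk}")
```

Note that real pipelines usually chunk by tokens or characters rather than words, and often prefer to split on sentence boundaries first.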

Embedding Generation

After preprocessing, each text chunk is transformed into a dense vector representation. This process is often done using models like BERT.

We convert text chunks into vectors: each sentence or chunk is processed by a neural network to produce a vector (a series of numbers) that captures its semantic meaning. If you want to implement hybrid search, a similar conversion is performed on the same chunks, but with different models (such as the previously mentioned TF-IDF, although newer and better sparse models are available).

The embedding vectors are saved in the database for subsequent indexing. Indexing is needed to perform approximate search in the vector space: with exact search, each time a query is sent, its vector is compared to every other vector in the collection to find the most similar ones, which is too slow for large collections in real applications. Here is an example of a vector (a toy three-dimensional one; real embeddings typically have hundreds of dimensions):

"The cat sat on the mat.": [-0.2398944, 0.24637009, -0.5362498]

Let's see the embeddings visualized:

[Illustration: chunk embeddings visualized as points in a vector space]

Query Processing

When a search query is submitted, the system processes it by converting the query into a vector and then comparing it to the vectors in the index.

  1. Convert the search query (e.g., "quick brown fox jumps lazy dog") into its vector representation using the same embedding model.

  2. The next step is to find the nearest neighbors. Typically, this is done with approximate methods that allow the system to find similar vectors without computing the distance between the query vector and every vector in the collection (one popular approach is HNSW, or the Hierarchical Navigable Small World index). Common metrics for similarity include cosine similarity and Euclidean distance (see the sketch after this list).

  3. Then, the nearest neighbors are selected: the top-k most similar vectors are returned as the search result, where k is the number of documents you want to retrieve.
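As an illustration, here is a brute-force version of this search using cosine similarity in NumPy. Real systems replace the full scan with an approximate index such as HNSW (available in libraries like hnswlib or Faiss), but the interface is the same: a query vector goes in, the indices of the top-k most similar vectors come out. The random vectors below are stand-ins for real embeddings.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity is the dot product of L2-normalized vectors.
    query = query_vec / np.linalg.norm(query_vec)
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = docs @ query
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar documents

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(100, 384))  # stand-in for stored chunk embeddings
query_vec = rng.normal(size=384)        # stand-in for the encoded query
print(top_k(query_vec, doc_vecs, k=3))
```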


Conclusion

In summary, vector search is an evolution from traditional keyword matching. Through a series of steps—data preprocessing, embedding generation, indexing, and query processing—the system builds an efficient search mechanism. This approach not only improves search accuracy in scenarios where meaning matters but also opens doors to advanced applications like recommendation systems and conversational AI.
