
Overview of PGVector


PostgreSQL is one of the most widely used relational database engines. As the name suggests, relational databases store data and the relationships between them. Unfortunately, when working with high-dimensional data, like vector embeddings, relational databases won't suffice. That is where vector databases come in.

Vector databases excel at storing and retrieving embeddings—numerical representations of data such as text, images, or audio. This makes them perfect for modern AI and machine learning applications. pgvector is the next best thing. It adds these capabilities to the PostgreSQL database engine we all know and love.

Let's dive in and see how this works!

Hello pgvector

pgvector is an extension for PostgreSQL that provides support for storing vector data types and performing operations on them. This allows you to meet the needs of modern AI and ML workloads. Such workloads require vector-based operations, such as similarity search. With this extension, you can store and manage embeddings alongside your other data.

Since it is an extension for PostgreSQL, you need to have PostgreSQL installed. You can grab the installer for your platform on their website. Once installed and configured, the next step is to install the pgvector extension. The installation procedure varies by platform. You can check the various installation methods in the project's README on GitHub for the most up-to-date instructions.

The easiest way to install pgvector is with Docker. Both the database engine and the extension can be installed with a single command across platforms, though you do need Docker installed. You might also want to check out how to use PostgreSQL with the pgAdmin client on Docker.

Once you have everything installed, you need to enable the extension in your databases before you can use it. This is easily done with the following SQL command using your database client:

CREATE EXTENSION IF NOT EXISTS vector;

You must run this once in every database where you want to use the extension. If you no longer need it, you can disable it by running:

DROP EXTENSION vector; 

This is the output of the two commands on a sample database hyper-db:

Output of running the two vector commands

Running these commands programmatically can be done with various PostgreSQL clients and libraries such as Python's psycopg2 library.
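For example, here is a minimal sketch of enabling the extension from Python with psycopg2. The connection string details (database name, user, password) are placeholders you would replace with your own:

```python
# SQL to enable the pgvector extension (idempotent thanks to IF NOT EXISTS)
ENABLE_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"

def enable_pgvector(dsn):
    """Enable the pgvector extension on the database identified by dsn."""
    import psycopg2  # imported inside so the module loads without psycopg2 installed
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(ENABLE_SQL)
    conn.close()

# Example usage (requires a running PostgreSQL server with pgvector installed):
# enable_pgvector("dbname=hyper-db user=postgres password=secret host=localhost")
```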

First steps

Now that you have everything set up, it is time to store some data. Imagine you are working on a recommendation module for an e-commerce application. You want the platform to perform fast similarity searches and recommend products based on user queries or previous behavior. You decide to leverage pgvector in your solution.

The first step is to create a table to store the product's information. This table should include a column for vector data using the special data type provided by pgvector. Each data type supports a different number of dimensions:

  • vector — up to 2,000 dimensions;

  • halfvec — up to 4,000 dimensions;

  • bit — up to 64,000 dimensions;

  • sparsevec — up to 1,000 nonzero dimensions.

You will also add columns for any additional product information:

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name TEXT,
    description TEXT,
    embedding vector(384),   -- e.g., a 384-dimensional vector from the Hugging Face all-MiniLM-L6-v2 sentence transformer model
    category TEXT,
    price DECIMAL(10, 2)
);

Let's focus on the embedding column. As part of our data ingestion pipeline, we need to generate embeddings for all our products based on their descriptions.

Here is the code snippet:
import requests, os
from dotenv import load_dotenv
from retry import retry

# Load environment variables from a .env file
load_dotenv()

# Retrieve the API key from environment variables
API_KEY = os.getenv("HUGGINGFACE_API_KEY")

# Define the API URL for the Hugging Face feature extraction pipeline
api_url = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2"

# Set the headers for the API request, especially the authorization token
headers = {"Authorization": f"Bearer {API_KEY}"}

@retry(tries=3, delay=10)
def generate_embedding(text):
    """
    Generate embeddings for the given text using the Hugging Face API.

    Args:
        text (str): The input text to generate the embeddings for.

    Returns:
        list or None: The embeddings as a list of 384 dimensions if successful, None otherwise.
    """
    # Make a POST request to the Hugging Face API with the input text
    res = requests.post(api_url, headers=headers, json={"inputs": text, "options": {"wait_for_model": True}})

    # Check if the request was successful
    if res.ok:
        # Parse the JSON response
        output = res.json()
        # Check if the output is a list of 384 dimensions
        if isinstance(output, list) and len(output) == 384:
            return output
    else:
        # Print the status code if the request failed
        print("API request failed with status code:", res.status_code)

    # Return None if the embedding generation failed
    return None

# Example usage: Generate an embedding for the text "Noise-canceling headphones."
embeddings = generate_embedding("Noise-canceling headphones.")
print(embeddings if embeddings else "Embedding generation failed.")

In this snippet, we are sending the product description to the feature extraction API from Hugging Face. This allows us to convert text into its vector representations. The API key is loaded from a .env file and used to construct the request with proper authorization. We'll retry up to three times with a 10-second delay if the model isn't ready. Finally, we print the embeddings if successfully retrieved.

You can also use OpenAI's embeddings endpoint to generate embeddings. On either platform, you need to sign up and generate an API key to use these capabilities.

If all goes well, we get a 384-dimension vector representation for the given description:

[-0.12030839920043945, 0.08137407898902893, ..., 0.0500892736017704]

Once we have the embeddings, we then insert them into the database along with the other product information:

INSERT INTO products (product_name, description, embedding, category, price)
VALUES
  ('Wireless Bluetooth Headphones', 
   'High-quality wireless Bluetooth headphones with noise cancellation and long battery life.', 
   '[-0.12030839920043945, 0.08137407898902893, ..., 0.0500892736017704]', -- Embeddings generated from model
   'Electronics', 
   99.99
  );
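The same insert can be done programmatically. This is a sketch assuming psycopg2 and the embeddings generated earlier; pgvector accepts vector values as string literals like '[0.1,0.2,...]', and the helper names here are my own:

```python
def to_pgvector(embedding):
    """Format a list of floats as a pgvector string literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

INSERT_SQL = """
INSERT INTO products (product_name, description, embedding, category, price)
VALUES (%s, %s, %s, %s, %s);
"""

def insert_product(dsn, name, description, embedding, category, price):
    import psycopg2  # imported inside so the module loads without psycopg2 installed
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Parameterized query: values are passed separately, never string-formatted
            cur.execute(INSERT_SQL, (name, description, to_pgvector(embedding),
                                     category, price))
    conn.close()

# Example usage (requires a running server and an embedding from the model):
# insert_product("dbname=hyper-db user=postgres", "Wireless Bluetooth Headphones",
#                "High-quality wireless Bluetooth headphones...", embedding,
#                "Electronics", 99.99)
```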

Here is a subset of products in our database:

Screenshot showing ten products in the hyper-db database.

You can also download the products.json file containing this data and import it into your database.

Querying

Now that the vectors have made it into our database, we can use them to calculate the similarity between products. For this, we use normal SELECT statements combined with pgvector's distance operators in the ORDER BY clause, so the results are sorted with the most closely related items first. pgvector supports several distance operators. Here is a summary of them along with their respective index operator classes (discussed in the next section):

  • <-> — Euclidean (L2) distance (vector_l2_ops): the straight-line distance between two vectors;

  • <#> — negative inner product (vector_ip_ops): the negated dot product, i.e., the product of the magnitudes of two vectors and the cosine of the angle between them, with the sign flipped;

  • <=> — cosine distance (vector_cosine_ops): computed as 1 minus cosine similarity; since cosine similarity ranges from -1 to 1, cosine distance ranges from 0 to 2;

  • <+> — L1 distance (vector_l1_ops): the sum of the absolute differences between the coordinates of two vectors;

  • <~> — Hamming distance (bit_hamming_ops): the number of positions at which two bit vectors differ;

  • <%> — Jaccard distance (bit_jaccard_ops): one minus the ratio of the intersection to the union of two bit sets.
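To build intuition for what the three most common operators compute, here is a small pure-Python sketch (the vectors are made up for illustration):

```python
import math

def l2_distance(a, b):
    # Euclidean distance (<->): straight-line distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # Negative inner product (<#>): the negated dot product
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # Cosine distance (<=>): 1 minus cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

a, b = [1.0, 2.0], [2.0, 3.0]
print(round(l2_distance(a, b), 4))          # 1.4142
print(negative_inner_product(a, b))         # -8.0
print(round(cosine_distance(a, b), 4))      # 0.0077
```

Note that a smaller value always means "more similar" for these operators, which is why ORDER BY with them ascending returns the closest matches first.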

Let's say a user searches for a new product like "noise-canceling headphones." We first retrieve the 384-dimension vectors for this description:

[-0.01649387739598751, ..., 0.10808290541172028] -- embeddings from model for the query

Then, we query the database for products with similar descriptions:

SELECT product_id, product_name, embedding
FROM products
ORDER BY embedding <=> '[-0.01649387739598751, ..., 0.10808290541172028]'  -- Sort by the query embedding
LIMIT 5; -- Only return five products

Note that we used the cosine distance operator (<=>). The other operators can be used in the same way. Here is the output of the above command:

A subset of products related to electronics.

As you can see, the five most closely related products are returned. Conversely, imagine a user searches for a product like "frying pan with heat-resistant handle and even heat distribution." Again, we retrieve the embeddings for this sentence and run the query. As expected, the five most relevant products are returned:

A subset of products related to kitchenware.

Queries performed in this way will sort items based on how close their descriptions are in the vector space.
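The same query can be issued from Python. Here is a sketch, again assuming psycopg2 and a products table like the one above; it orders products by cosine distance to the query embedding:

```python
SIMILARITY_SQL = """
SELECT product_id, product_name
FROM products
ORDER BY embedding <=> %s
LIMIT %s;
"""

def find_similar_products(dsn, query_embedding, limit=5):
    """Return the `limit` products whose embeddings are closest to the query."""
    import psycopg2  # imported inside so the module loads without psycopg2 installed
    # pgvector accepts vectors as string literals like '[0.1,0.2,...]'
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SIMILARITY_SQL, (vector_literal, limit))
            rows = cur.fetchall()
    conn.close()
    return rows

# Example usage (requires a running server and a query embedding from the model):
# for pid, name in find_similar_products("dbname=hyper-db user=postgres", query_embedding):
#     print(pid, name)
```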

Indexes

Without indexes, pgvector performs an exact search by comparing the query vector to all vectors. This ensures accuracy but can be slow for large datasets. Also, for applications like recommendation systems, you might only need related items rather than exact matches. To improve query performance, you can create indexes on your data.

These indexes perform an approximate nearest neighbor (ANN) search instead of an exact search: only a subset of the vector space, rather than the entire space, is searched, trading some recall for speed. Additionally, indexing the columns you filter on (for example, category) can further reduce scan work.

pgvector supports the following index types:

  • Inverted File with Flat Compression (IVFFlat);

  • Hierarchical Navigable Small Worlds (HNSW).

IVFFlat partitions the dataset into multiple lists using a K-means clustering technique. When creating the index, you specify how many partitions are created with the lists option. Here, we are creating an IVFFlat index for the L2 distance (vector_l2_ops) function:

CREATE INDEX products_embedding_ivf_index
ON products
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100); -- create a hundred partitions

During searches, the relevant partition is identified first. Then, a local search is performed within that partition to find the closest match. The probes parameter specifies how many partitions are searched during a query. A higher value improves recall by examining more partitions, at the expense of query performance. The default is one probe but can be modified:

SET ivfflat.probes = 10;

On the other hand, HNSW builds a hierarchical graph where each node represents a data point, and edges connect nodes that are close to each other. During searches, HNSW starts at the top level and navigates through the graph by moving through nodes with the smallest distance to the query vector. It gradually works its way down to lower levels for a more precise search. It leverages the structure of the graph to reduce the number of comparisons needed.

Here's a command to create an HNSW index for the L2 distance function:

CREATE INDEX products_embedding_hnsw_index
ON products
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 200);

The m parameter sets the maximum number of connections (neighbors) per node, and ef_construction sets the size of the candidate list considered when building the HNSW graph. Higher values provide more accurate searches, but building the index, as well as insertions, will take longer.

Now, our table has three indexes (including the primary key):

Screenshot showing three indexes: primary key, ivfflat and hnsw

ivfflat is faster to build and has a lower memory footprint than hnsw. However, hnsw has better query performance.

Best practices

Unfortunately, PostgreSQL is not a vector-first database and may lag behind dedicated vector databases in performance. Adhering to some best practices can help close the gap. Let's review some of them.

First, tuning the database itself using tools like PgTune can improve query performance. These key parameters ensure that PostgreSQL operates efficiently:

  • shared_buffers — the amount of memory PostgreSQL uses for caching; a common starting point is around 25% of system RAM;

  • work_mem — the amount of memory available for sorting and joins; increasing this reduces disk-based operations for large queries;

  • maintenance_work_mem — the memory available for index creation and maintenance; increasing it speeds up index builds.

Indexing large tables can be resource-intensive. Use concurrent indexing to avoid locking the table and keep the system responsive. Here's a command to build an HNSW index concurrently:

CREATE INDEX CONCURRENTLY products_embedding_hnsw_index
ON products
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 200);

For the IVFFlat index, the number of lists and probes play a significant role in performance. More lists improve accuracy but require more memory. More probes improve accuracy at the cost of longer query times. Monitor your database to balance these settings for the most optimal performance in your specific use case.
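The pgvector README suggests a rule of thumb for lists: roughly rows / 1000 for up to a million rows, and sqrt(rows) beyond that. A small helper (the function name is my own) makes this concrete:

```python
import math

def suggested_lists(row_count):
    # Heuristic from the pgvector README: rows/1000 up to 1M rows, sqrt(rows) beyond
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))

print(suggested_lists(100_000))    # 100 partitions for a 100k-row table
print(suggested_lists(4_000_000))  # 2000 partitions for a 4M-row table
```

Treat these as starting points and adjust based on the recall and latency you observe.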

Finally, as your dataset grows, scaling the database itself is needed to maintain good performance. This involves:

  • Vertical scaling — adding more CPU, RAM, and disk;

  • Horizontal scaling — using partitioning and sharding to distribute the load across servers.

This ensures the system can handle more queries and larger datasets without any bottlenecks.

You can view published benchmarks to see how pgvector performs under different workloads.

Conclusion

pgvector extends the capabilities of PostgreSQL to store and perform operations on vector embeddings. To use these capabilities, you need to:

  • Install the pgvector extension and enable it for your database;

  • Use the special data types provided by pgvector for the column that will store embeddings;

  • Use API endpoints from Hugging Face or OpenAI to generate embeddings for your data, then insert them into the database;

  • When performing queries, leverage pgvector's operators, such as L2 distance and cosine distance, to perform similarity searches;

  • Adhere to recommended best practices to improve performance.
