
Text chunking techniques


One of the most important applications of large language models (LLMs) is Retrieval-Augmented Generation (RAG), which enhances model outputs by integrating external knowledge. At the core of RAG, chunking enables efficient retrieval by breaking text into meaningful segments that preserve context and improve accuracy. In this topic, we will examine how chunking works within RAG, exploring different chunking techniques and their impact on retrieval efficiency and response quality.

The Role of Chunking in RAG

As we know, Retrieval-Augmented Generation (RAG) helps large language models generate more accurate and relevant responses by retrieving information from external sources. It works in three phases:

  1. Retrieval: When a query is received, RAG searches a knowledge base (e.g., documents, databases) to find relevant information.

  2. Augmentation: The retrieved data is added as context for the model.

  3. Generation: The model uses both the query and the additional context to create a more accurate response.
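To make these three phases concrete, here is a minimal, self-contained sketch of the retrieve-augment-generate loop. The tiny knowledge base, the word-overlap retriever, and the generate stub are illustrative stand-ins, not a real RAG stack:

knowledge_base = [
    "The cat sat on a mat.",
    "Paris is the capital of France.",
]

def retrieve(query, docs):
    # Naive retrieval: return the document sharing the most words with the query
    query_words = set(query.lower().split())
    return max(docs, key=lambda d: len(query_words & set(d.lower().split())))

def generate(prompt):
    # Stand-in for an actual LLM call
    return f"Answer based on: {prompt!r}"

query = "Where did the cat sit?"
context = retrieve(query, knowledge_base)           # 1. Retrieval
prompt = f"Context: {context}\nQuestion: {query}"   # 2. Augmentation
print(generate(prompt))                             # 3. Generation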

Since RAG depends on retrieving relevant information, the way data is stored and accessed plays a key role. Chunking organizes data into structured segments, improving both retrieval speed and response accuracy: before retrieval, chunking splits large texts from the knowledge base into smaller, meaningful chunks, making it easier and faster to find relevant information.


Chunking also helps LLMs stay within their token limits by breaking large texts into smaller, meaningful sections. Since LLMs can only process a fixed number of tokens at a time, retrieving entire documents may exceed this limit, causing important information to be cut off. By pre-chunking the data, we ensure each retrieved section is small enough to fit within the token constraints while still containing meaningful and complete information.
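As a rough illustration of this constraint, we can check whether a chunk fits a given budget before sending it to the model. The TOKEN_LIMIT value and the word-token approximation below are simplifying assumptions; real LLM tokenizers (e.g., BPE-based ones) count tokens differently:

import nltk

nltk.download('punkt_tab')

TOKEN_LIMIT = 8  # assumed context budget (illustrative)

def fits_token_limit(chunk, limit=TOKEN_LIMIT):
    # Approximate the token count with word tokens
    return len(nltk.word_tokenize(chunk)) <= limit

print(fits_token_limit("The cat sat on a mat."))  # True: 7 tokens
print(fits_token_limit("There was a cat. The cat sat. The cat sat on a mat."))  # False: 16 tokens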

Without chunking, the model might miss important details or retrieve incomplete or irrelevant information, reducing accuracy. Well-structured chunks also preserve context and logical connections, making large texts easier to manage. Consequently, this leads to faster and more precise retrieval, improving the overall performance of RAG systems.

Chunking Techniques

There are various text chunking techniques, each suited to different text structures and use cases. In this section, we will explore five common chunking techniques and implement them using the LangChain framework and two popular NLP libraries: NLTK and spaCy. NLTK is great for basic text processing and techniques such as fixed-size chunking and sentence-based chunking, but it lacks advanced linguistic analysis. In contrast, spaCy provides pre-trained models for tokenization, dependency parsing, and more, making it ideal for document-structured chunking that requires sentence segmentation and embeddings.

Fixed-Size Chunking

Fixed-size chunking breaks text into equal parts based on a set character or token limit. It ensures predictable chunk sizes and is useful when working with strict processing constraints, such as API token limits. It can be implemented as follows using NLTK:

First, let’s do the imports needed for this topic and load the tokenizer:

import nltk
from langchain.text_splitter import RecursiveCharacterTextSplitter
import spacy

nltk.download('punkt_tab')

Next, we define a function fixed_size_chunking that splits the input text into chunks of a specified size (chunk_size), with an optional overlap (overlap) to retain context between chunks. The text is tokenized using nltk.word_tokenize, and the chunks are created by slicing the tokenized words:

def fixed_size_chunking(text, chunk_size, overlap):
    words = nltk.word_tokenize(text) # Tokenize the text into words
    chunks = []
    i = 0
    # Loop to create chunks of specified size with overlap
    while i < len(words):
        chunk = words[i:i + chunk_size] # Slice words for chunk
        chunks.append(" ".join(chunk)) # Join words into a chunk
        i += chunk_size - overlap # Update index with overlap
    return chunks

We can call fixed_size_chunking() with the following parameters:

text = "There was a cat. The cat sat. The cat sat on a mat."
chunk_size = 5
overlap = 1 # overlap size definition

chunks = fixed_size_chunking(text, chunk_size, overlap)
print(chunks)

The result of fixed-size chunking will be:

['There was a cat .', '. The cat sat .', '. The cat sat on', 'on a mat .']

This shows how the text is split into chunks of up to 5 tokens, with an overlap of 1 token between consecutive chunks. However, as observed, fixed-size chunking may split sentences unnaturally and break phrases in the middle, leading to a loss of context.

Recursive Character Chunking

Recursive character chunking splits text at natural delimiters such as paragraph breaks, newlines, and spaces until the chunks fit within a size limit, keeping the text readable and meaningful. It is ideal for unstructured text and mixed-length datasets, preserving the flow of sentences and paragraphs. It can be implemented using LangChain as follows.

We can load a sample text:

text = """There was a cat.

The cat sat.

The cat sat on a mat."""

Here, we will use the RecursiveCharacterTextSplitter to split the text. This function first attempts to split the text at the most meaningful delimiter (paragraph breaks). Note that the chunk_size parameter is specified in characters, not tokens. If the chunks exceed the size limit, it falls back to smaller delimiters such as newlines and spaces:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=25, chunk_overlap=0)

Then, we can call the chunking function with our sample text:

documents = text_splitter.create_documents([text])
for doc in documents:
    print(doc)

The result of running the code will be:

page_content='There was a cat.'
page_content='The cat sat.'
page_content='The cat sat on a mat.'

This method does not guarantee equal chunk sizes, however, and if the text lacks clear breaks, it may create chunks that cut off in the middle of sentences. This can affect readability and make the text harder to understand.
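The fallback order is configurable via the separators parameter. Here is a brief sketch making the hierarchy explicit (the list below mirrors the splitter's documented defaults):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """There was a cat.

The cat sat.

The cat sat on a mat."""

# Try paragraph breaks first, then newlines, then spaces, then single characters
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=25,
    chunk_overlap=0,
)

for doc in text_splitter.create_documents([text]):
    print(doc.page_content)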

Sentence Chunking

Sentence chunking breaks text into separate sentences, each containing a complete idea. This helps AI understand and process the text better, making it useful for tasks such as answering questions, summarizing, and translating. It can be implemented as follows using NLTK.

We define a function sentence_chunking that splits the input text into individual sentences. The text is processed using nltk.sent_tokenize, which detects sentence boundaries based on punctuation. Each sentence is preserved as a complete thought, and the function returns a list of these sentences:

def sentence_chunking(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

Then, we can define the example text for chunking:

text = "There was a cat. The cat sat. The cat sat on a mat."

Lastly, we can call the chunking function with our sample text and print the resulting chunks:

chunks = sentence_chunking(text)
print(chunks)

The result of running the code will be:

['There was a cat.', 'The cat sat.', 'The cat sat on a mat.']

This demonstrates how the text is split into 3 individual sentences, preserving the structure of each statement. However, sentence chunking struggles with sub-sections or paragraphs, leading to inconsistent chunk sizes due to varying sentence lengths. Additionally, it may mishandle abbreviations (e.g., "Dr.") and split sentences in the wrong place.
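One way to reduce such splitting errors with NLTK is to register known abbreviations with the Punkt tokenizer. A minimal sketch; the single abbreviation added here is just an example:

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Tell Punkt that "dr" is an abbreviation, not a sentence boundary
punkt_params = PunktParameters()
punkt_params.abbrev_types.add('dr')

tokenizer = PunktSentenceTokenizer(punkt_params)
print(tokenizer.tokenize("Dr. Smith has a cat. The cat sat on a mat."))
# Expected: ['Dr. Smith has a cat.', 'The cat sat on a mat.']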

Document-Structured Chunking

The document-structured chunking technique splits text by its layout, such as paragraphs and headings, to keep things organized. It is useful for structured documents such as articles and reports, helping maintain topic flow and making it easier to process. This technique can be implemented as follows:

Firstly, we need to load the pre-trained model:

nlp = spacy.load("en_core_web_sm")

Then, we define the document_structure_chunking function, which chunks a document based on its structure by splitting it at double newlines (paragraph breaks); the spaCy model is used afterwards to count the sentences in each chunk. When the code is executed, the output displays the individual chunks of the document, each representing a section, along with the number of sentences contained in each chunk:

def document_structure_chunking(text):
    chunks = []
    # Split the document at blank lines (paragraph boundaries)
    for paragraph in text.split('\n\n'):
        cleaned_para = paragraph.strip()
        if cleaned_para:
            chunks.append(cleaned_para)
    return chunks

Next, we can define the sample text for chunking:

text = """Title: Cat

Intro: There was a cat.

Body: The cat sat.

Conclusion: The cat sat on a mat."""

Lastly, call the document_structure_chunking() function and print the result:

chunks = document_structure_chunking(text)
print("Document Chunks:", chunks)
if chunks:
    for i, chunk in enumerate(chunks, 1):
        doc = nlp(chunk)
        sentences = [sent.text.strip() for sent in doc.sents]
        print(f"Chunk {i} ({len(sentences)} sentences): {sentences}")

The result of running the code will be:

Document Chunks: ['Title: Cat', 'Intro: There was a cat.', 'Body: The cat sat.', 'Conclusion: The cat sat on a mat.']
Chunk 1 (1 sentences): ['Title: Cat']
Chunk 2 (1 sentences): ['Intro: There was a cat.']
Chunk 3 (1 sentences): ['Body: The cat sat.']
Chunk 4 (1 sentences): ['Conclusion: The cat sat on a mat.']

As observed, this technique requires well-structured text with clear paragraph markers. If the text lacks consistent formatting, it may not split properly, resulting in chunks that are either too large or too small.
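For documents with explicit markup, LangChain also provides structure-aware splitters. Below is a brief sketch using MarkdownHeaderTextSplitter; the sample Markdown and the header mapping are illustrative:

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = """# Cat

## Intro

There was a cat.

## Body

The cat sat on a mat."""

# Split on level-1 and level-2 headings, keeping them as chunk metadata
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)

for doc in splitter.split_text(markdown_text):
    print(doc.metadata, "->", doc.page_content)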

Semantic Chunking

Semantic chunking is a method that breaks text into meaningful sections based on the content and relationships between words. It improves understanding and accuracy in tasks such as searching for information. Instead of splitting text by size or structure, it groups sentences based on their meaning using embedding clustering. When combined with syntactic parsing, which focuses on sentence structure, it helps ensure the text is both logically connected and grammatically correct, making it more useful for chatbots and search engines. It can be implemented with LangChain as follows:

First, we start by importing the necessary libraries for connecting to the LiteLLM API and for handling text embeddings and chunking:

import os  # For handling file paths and environment variables
from dotenv import load_dotenv  # For loading environment variables from a .env file
import openai  # Library to connect with the LiteLLM API
from langchain_experimental.text_splitter import SemanticChunker  # For semantic text chunking
from langchain_openai.embeddings import OpenAIEmbeddings  # For text embeddings

Here, we configure the API key and the base URL for the LiteLLM service by loading them from a .env file for security:

# Load environment variables from .env file
load_dotenv()

# Read the API key and base URL from the environment
API_KEY: str | None = os.environ.get("OPENAI_API_KEY")
BASE_URL: str | None = os.environ.get("BASE_URL")

Then, we initialize the OpenAIEmbeddings object to convert text into embeddings using the custom API endpoint:

embeddings = OpenAIEmbeddings(
    openai_api_key=API_KEY,
    openai_api_base=BASE_URL
)

Also, we can define the example text for chunking:

text = "There was a cat. The cat sat. The cat sat on a mat."

The SemanticChunker in the langchain_experimental library splits text into chunks based on semantic differences between sentences. It calculates these differences using embeddings and sets "breakpoints" when the difference between two sentences exceeds a certain threshold. This threshold is controlled by the breakpoint_threshold_type parameter. If the difference between the embeddings of two consecutive sentences goes beyond this threshold, the chunker inserts a breakpoint and splits the text.

The breakpoint_threshold_type parameter determines how the threshold for splitting is calculated. The following are the available methods:

  1. percentile uses a percentile-based threshold for chunking.

  2. standard_deviation sets breakpoints based on standard deviation from the mean semantic difference.

  3. interquartile uses the interquartile range (IQR) to determine breakpoints.

  4. gradient sets breakpoints based on changes in the gradient of semantic differences between consecutive sentences.

In our code, we define different chunking methods to see how they perform on the sample text:

chunking_methods = [
    ("Percentile", "percentile"),
    ("Standard Deviation", "standard_deviation"),
    ("Interquartile", "interquartile"),
    ("Gradient", "gradient")
]

Then, we create a function semantic_chunking that splits text into chunks based on the specified method (breakpoint_threshold_type) and prints the results. This function also leverages the breakpoint_threshold_amount parameter to control chunking strictness and min_chunk_size to prevent excessively small chunks. Higher values of breakpoint_threshold_amount result in fewer chunks (less strict splitting), while lower values lead to more chunks (stricter splitting):

def semantic_chunking(method_name, method_type):
    print(f"\\nChunks using {method_name} method:")
    
    text_splitter = SemanticChunker(
        embeddings, 
        breakpoint_threshold_type=method_type,
        breakpoint_threshold_amount=0.5,
        min_chunk_size=5
    )
    
    # Split the text into chunks
    docs = text_splitter.create_documents([text])
    
    # Display chunking results
    for i, doc in enumerate(docs):
        print(f"Chunk {i+1}: {doc.page_content}")

    print(f"Number of chunks using {method_name} method: {len(docs)}")

Lastly, we loop over the chunking methods, applying each one to the text and printing the results:

for method_name, method_type in chunking_methods:
    semantic_chunking(method_name, method_type)

The expected output of the code will be:

Chunks using Percentile method:
Chunk 1: There was a cat.
Chunk 2: The cat sat. The cat sat on a mat.
Number of chunks using Percentile method: 2

Chunks using Standard Deviation method:
Chunk 1: There was a cat.
Chunk 2: The cat sat. The cat sat on a mat.
Number of chunks using Standard Deviation method: 2

Chunks using Interquartile method:
Chunk 1: There was a cat.
Chunk 2: The cat sat. The cat sat on a mat.
Number of chunks using Interquartile method: 2

Chunks using Gradient method:
Chunk 1: There was a cat. The cat sat. The cat sat on a mat.
Number of chunks using Gradient method: 1

Besides being time-consuming, this chunking technique may overlook important formatting cues such as headings, paragraphs, and lists. These elements are essential in research papers, contracts, and technical documentation.

Chunk Enrichment

After chunking, each chunk undergoes an enrichment process to improve search accuracy and retrieval efficiency. This process consists of two key stages: text cleaning and metadata augmentation.

Text cleaning ensures that the content is structured, consistent, and free of unnecessary noise. This step involves:

  • Converting all text to lowercase for consistency.

  • Eliminating common stop words (e.g., "the," "and," "is") that add little meaning, reducing vector dimensionality.

  • Fixing misspellings to improve text matching and prevent search errors.

  • Expanding contractions (e.g., changing "can't" to "cannot") and abbreviations to ensure consistency and clarity.

  • Removing unnecessary symbols or Unicode characters to minimize noise.

For example, given the sample text "There was a cat. The cat sat. The cat sat on a mat.", applying these cleaning steps (lowercasing, stop-word removal, punctuation stripping) could transform it to "cat cat sat cat sat mat".
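A minimal sketch of these first steps using NLTK's stop-word list; spelling correction and contraction expansion are left out for brevity:

import string

import nltk
from nltk.corpus import stopwords

nltk.download('punkt_tab')
nltk.download('stopwords')

def clean_chunk(text):
    # Lowercase for consistency, then tokenize
    tokens = nltk.word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    # Drop stop words and punctuation to reduce noise
    return " ".join(t for t in tokens if t not in stop_words and t not in string.punctuation)

print(clean_chunk("There was a cat. The cat sat. The cat sat on a mat."))
# cat cat sat cat sat mat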

Once the text is cleaned, metadata is added to provide more context and improve search efficiency. This step includes:

  • Creating short summaries and titles for quick reference.

  • Identifying important terms and named entities for precise filtering.

  • Generating rephrased text to capture different ways users might search for the same information.

  • Identifying potential questions the chunk can answer.

  • Storing source and language details for filtering and citation.

For our example, these metadata fields (a short title and summary, key terms, possible questions, rephrasings, and source details) would be attached to the cleaned chunk.
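As a sketch of what the enriched result might look like, we can attach such fields to a LangChain Document as metadata. All field values below are hand-written stand-ins; in practice they would be generated by an LLM or an NLP pipeline:

from langchain.schema import Document

enriched_chunk = Document(
    page_content="cat cat sat cat sat mat",
    metadata={
        "title": "A cat on a mat",                        # short title
        "summary": "A cat sits on a mat.",                # quick-reference summary
        "key_terms": ["cat", "mat", "sat"],               # important terms
        "possible_questions": ["Where did the cat sit?"],
        "rephrased": "A cat was sitting on a mat.",
        "source": "sample.txt",                           # for filtering and citation
        "language": "en",
    },
)
print(enriched_chunk.metadata["title"])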

These enhancements improve search results by making it easier to find relevant information.

The Impact of Chunking Quality

Chunk quality and size significantly impact retrieval accuracy and efficiency in RAG systems. Well-structured chunks maintain semantic context, ensuring relevant information is retrieved. However, imbalanced chunking can lead to biases: small chunks (100–300 tokens) improve keyword matching but may fragment context, while large chunks (500+ tokens) enhance coherence but risk overlooking critical details and increasing computational costs. To address these issues, effective chunking applies structured techniques to divide text and enriches cleaned chunks with metadata for better semantic relevance.

Conclusion

Chunking is an essential process in retrieval-augmented generation (RAG) systems: it structures text for efficient retrieval while preserving semantic context. Chunking techniques such as fixed-size, recursive, and others can be implemented using the LangChain framework and libraries such as spaCy or NLTK. When combined with enrichment strategies, effective chunking enhances retrieval accuracy, reduces biases, and improves overall system performance.
