When working with LLMs, you need to connect the model to your data sources. Developers often use a retrieval-augmented generation (RAG) pipeline for this purpose. Although the concept is simple, creating a reliable, production-level RAG system involves complex engineering. Fortunately, a dedicated framework like LlamaIndex can help simplify this process. Let's examine LlamaIndex's main components, its high-level interfaces for quick development, and its advanced features for creating autonomous agents.
What is LlamaIndex?
LlamaIndex is an open-source, data-centric toolkit for building, optimizing, and deploying RAG and agentic applications. It provides all the necessary components and tools to streamline RAG pipeline implementation. With LlamaIndex, you can build context-aware LLM applications.
Instead of building data connectors, chunking logic, and retrieval strategies from scratch, LlamaIndex offers a rich library of pre-built, optimized components. This helps you move from a proof-of-concept to a production-grade application with less effort. You can focus on your application's unique logic instead of the underlying plumbing.
LlamaIndex focuses on providing both ease of use and extensive customization options. You can use its high-level APIs to create a complete RAG pipeline in just a few lines of code. Beneath this simplicity lies a modular and fully transparent architecture. Every component—data loaders, text splitters, retrievers, response synthesizers—is an independent module that you can swap, configure, or extend. This allows you to start simple and gradually refine your pipeline as your needs grow.
A key strength of LlamaIndex is its large and active ecosystem. The framework includes hundreds of integrations, serving as a central hub for your entire LLM stack. You get data connectors for almost any data source, including files (PDFs, markdown), services (Notion, Slack, Salesforce), and databases (Postgres, MongoDB). It also integrates smoothly with many vector stores for indexing and supports nearly any LLM. You can even use it in conjunction with other frameworks like LangChain.
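For example, pulling data from a web page instead of local files is mostly a matter of swapping in a different reader. Here is a minimal sketch, assuming you have installed the llama-index-readers-web package (the URL is a placeholder):

from llama_index.readers.web import SimpleWebPageReader

# Load a web page into Document objects, just like SimpleDirectoryReader does for files
documents = SimpleWebPageReader().load_data(
    urls=["https://example.com/some-article"]
)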
The core workflow
The best way to understand LlamaIndex is to see its entire workflow in action. The framework's high-level API simplifies the two main stages of a RAG workflow—ingestion and querying—into a few lines of code. It also provides components to enhance every step for highly efficient RAG pipelines.
Let's explore the "Hello, World!" of LlamaIndex by building a complete, functional RAG query engine. Then, we'll examine the abstractions involved. First, ensure your environment is set up (we're using a local Ollama model, but you can use any model):
$ pip install llama-index llama-index-llms-ollama

import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
# Create a 'data' directory for our documents
os.makedirs("data", exist_ok=True)
with open("data/strategy.txt", "w") as f:
    f.write("The Q4 strategy focuses on expanding into the European market.")
# Ingestion: Load data into 'Document' objects
# A 'Document' is a generic container around any data source.
documents = SimpleDirectoryReader("./data").load_data()
# Ingestion: Build an 'Index' from the documents
# The 'VectorStoreIndex' handles chunking, embedding, and storage.
index = VectorStoreIndex.from_documents(documents)
# Querying: Create a 'QueryEngine'
# The engine is the primary interface for asking questions.
query_engine = index.as_query_engine(llm=Ollama(model="llama3:8b"))
# Querying: Execute a query
response = query_engine.query("What is the Q4 strategy?")
print(response)
# Expanding into the European market.

In the example above, we start by creating a data/ directory and adding a text file called strategy.txt. Then, we use several abstractions to create the pipeline:
The Document is the first object you'll encounter. It serves as a simple container for your data, whether it comes from a text file, PDF, or database. It stores both the content and metadata.
The SimpleDirectoryReader is a connector that reads files and creates a list of Document objects. You can also use other data connectors from LlamaHub to connect to other data sources.
An Index is a data structure that stores your Documents in an easily retrievable way. The VectorStoreIndex is the most common type (there are others). It processes your documents into vector embeddings and stores them for efficient similarity search.
The QueryEngine is your main interface for Q&A. It is a stateless object that takes a query, runs the entire RAG pipeline (retrieval and synthesis), and returns a response. The .as_query_engine() method is a high-level API that builds a robust query workflow for you; a query engine is most commonly created from an index or a retriever.
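One practical note on the Index before we continue: by default, the VectorStoreIndex above lives only in memory, so every run re-embeds your documents. Below is a minimal sketch of persisting it to disk and reloading it later (the ./storage directory is an arbitrary choice):

from llama_index.core import StorageContext, load_index_from_storage

# Persist the index (nodes, embeddings, metadata) to a local directory
index.storage_context.persist(persist_dir="./storage")

# Later: rebuild the same index from disk instead of re-embedding everything
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)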
Your code may not work yet if you haven't set up the default embedding model's credentials. We'll cover this in the next section.
Key abstractions: nodes, LLMs, and embeddings
While the high-level API is useful, true customization comes from understanding the components it orchestrates. The most fundamental unit of data in any Index is the Node. Let's manually build our ingestion pipeline to see how Documents transform into Nodes and how we can explicitly configure the models. But first, let's install more packages:
$ pip install llama-index-embeddings-cohere python-dotenv

import dotenv
import os
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.cohere import CohereEmbedding
dotenv.load_dotenv()
# Settings: define the core 'LLM' and embedding model abstractions
Settings.llm = Ollama(model="llama3:8b")
Settings.embed_model = CohereEmbedding(api_key=os.getenv("COHERE_AI_API_KEY"))
# Ingestion: use a 'NodeParser' to split 'Documents' into 'Nodes'
node_parser = SentenceSplitter(chunk_size=512)
text = "Your long document text containing complex information..."
documents = [Document(text=text)]
nodes = node_parser.get_nodes_from_documents(documents)
# Ingestion: create the index from your custom-processed nodes
index = VectorStoreIndex(nodes)
print(f"Built an index from {len(nodes)} Nodes.") # Built an index from 1 Nodes.In the above snippet, we use the following components:
The Settings object is a global context that lets you configure application-wide defaults (like the core model abstractions). If needed, you can override a global setting by passing the desired option locally:
from llama_index.llms.openai import OpenAI  # requires the llama-index-llms-openai package (bundled with llama-index)

Settings.llm = Ollama(model="llama3:8b")
openai_query_engine = index.as_query_engine(llm=OpenAI(
    model="gpt-4o-mini"
))

Now, this specific query engine would use OpenAI's model instead of the globally configured Ollama model.
The LLM is the abstraction for a language model. LlamaIndex integrates with dozens of model providers, and setting Settings.llm defines the default model for all synthesis and generation tasks. Here, we use a local Ollama model for generation and a Cohere model for embeddings; you can get a free API key from Cohere for the latter.
The embedding model abstraction plays the same role for embedding providers. These models convert text into numerical vectors. Setting Settings.embed_model ensures that your document Nodes and your queries are embedded with the same model, although you can still override this locally.
The Node is the basic unit of data in LlamaIndex. Each Node contains a chunk of text plus metadata, and it is what gets embedded and stored in the Index. Building your index from Nodes gives you maximum control over how your data is chunked.
A NodeParser (like SentenceSplitter, JSONNodeParser, and others) takes Document objects and splits them into smaller, more manageable Nodes.
Now, try running the previous snippet with settings for your model provider. It should work as expected.
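To see what the SentenceSplitter actually produced, you can also inspect the resulting Node objects directly. A quick sketch, continuing from the snippet above:

# Each Node carries a chunk of text, metadata, and a stable ID
for node in nodes:
    print(node.node_id)
    print(node.metadata)
    print(node.get_content()[:100])  # first 100 characters of the chunk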
The query workflow: retrievers and postprocessors
The query process is also a customizable workflow with several components. The two most important ones are the Retriever and NodePostprocessor. A retriever fetches nodes from the index, and a node postprocessor filters or re-ranks those nodes before sending them to the LLM.
Let's build a custom query engine to implement a "retrieve-then-rerank" strategy, a common pattern for improving accuracy. We'll use a longer file on Hyperskill's core philosophy, so download it and place it in the data/ folder of your current project directory. You'll also need to install additional packages and get another API key, this time from Anthropic (though you can use any provider):
$ pip install llama-index-postprocessor-cohere-rerank
$ pip install llama-index-llms-anthropic

import dotenv
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.response.pprint_utils import pprint_response
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import Settings
dotenv.load_dotenv()
# you can swap in different model providers as needed
Settings.embed_model = CohereEmbedding(api_key=os.getenv("COHERE_AI_API_KEY")) # same as before
Settings.llm = Anthropic(api_key=os.getenv("CLAUDE_API_KEY"), model="claude-3-7-sonnet-20250219")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Get a retriever: A 'Retriever' is responsible for fetching nodes from an index.
retriever = index.as_retriever(similarity_top_k=5)
# Define a postprocessor: A 'NodePostprocessor' refines a list of retrieved nodes.
reranker = CohereRerank(
    api_key=os.getenv("COHERE_AI_API_KEY"),
    top_n=2
)
# Build a query engine: Assemble the query workflow manually.
# The 'RetrieverQueryEngine' is a common implementation.
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker]
)
# Execute query
response = query_engine.query("Summarize Hyperskill's top 2 tenets in two concise paragraphs!")
pprint_response(response, show_source=True)

Code output
Hyperskill's first core tenet is "Active Learning over Passive Absorption," which rejects the notion that merely reading or watching lectures is sufficient for learning to code.
The second fundamental tenet is "Project-based Learning as Contextual Anchor," which recognizes that true growth comes from building complete projects rather than solving isolated problems...
________________________________________________________________________________________________________________________________________________________________________________________________________
Source Node 1/2
Node ID: f1bd8ab0-b425-4cd0-ac6b-c10f5b1292e1
Similarity: 0.9996244
Text: # Hyperskill: Core Philosophy *(A conceptual sketch)* --- ## Introduction & High-Level Vision **Hyperskill** is an online learning platform for programming and software engineering that
emphasizes active, project-based learning rather than passive consumption. It positions itself as "reimagining education for the AI era," integrating intell...
________________________________________________________________________________________________________________________________________________________________________________________________________
Source Node 2/2
Node ID: 420c5135-f774-4801-b191-16679242911e
Similarity: 0.9347534
Text: This helps internalize decomposition skills. - Reverse decomposition: learners see how large systems can be broken into smaller parts (in reverse). --- ## Pros, Risks & Tradeoffs No
philosophy is perfect. Let's be clear-eyed: here are strengths *and* possible weaknesses. ### Strengths - **Deeper retention & transfer** — Because learners app...

Here are the key components used:
The Retriever fetches Nodes from an Index based on a query. You can configure its behavior, such as the number of nodes to fetch (similarity_top_k). LlamaIndex offers many retriever types for different strategies.
The NodePostprocessor sits in the middle of the query workflow. It takes the list of nodes returned by the retriever and can filter, re-order, or transform them before they reach the LLM. LlamaIndex provides numerous postprocessors for various use cases.
The CohereRerank postprocessor uses Cohere's reranking models to keep only the most semantically relevant nodes.
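Because the retriever and the postprocessor are independent modules, you can also run them outside a query engine, which is useful for debugging what actually reaches the LLM. A rough sketch, reusing the retriever and reranker defined above:

# Fetch the top-5 candidate nodes for a query
retrieved_nodes = retriever.retrieve("What is Hyperskill's core philosophy?")

# Manually apply the reranker to keep only the two most relevant nodes
reranked_nodes = reranker.postprocess_nodes(
    retrieved_nodes,
    query_str="What is Hyperskill's core philosophy?"
)
for node_with_score in reranked_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:80])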
Stateful conversations with chat engines
The query engine we saw earlier is stateless—it has no memory of past interactions. For conversational applications, you need to manage state. The chat engine is a stateful abstraction designed specifically for this purpose, managing the conversation history for you.
Let's transform our index into a conversational chatbot that can answer follow-up questions.
import os
import dotenv
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.anthropic import Anthropic
dotenv.load_dotenv()
Settings.embed_model = CohereEmbedding(api_key=os.getenv("COHERE_AI_API_KEY"))
Settings.llm = Anthropic(api_key=os.getenv("CLAUDE_API_KEY"), model="claude-3-7-sonnet-20250219")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Create a ChatEngine, a stateful interface for conversational Q&A
chat_engine = index.as_chat_engine()
# The chat engine manages the conversation state internally
response = chat_engine.chat("Summarize Hyperskill's top 2 tenets in two concise paragraphs!")
print(response)
# Ask a follow-up question. The engine uses its internal state (history) to understand context.
follow_up_response = chat_engine.chat("How do they achieve the first key tenet?")
print(follow_up_response)
# The chat history is an object that can be inspected or cleared
print(chat_engine.chat_history)
chat_engine.reset()  # Clears the state

Code output
INFO - Condensed question: Summarize Hyperskill's top 2 tenets in two concise paragraphs!
# Hyperskill's Top 2 Tenets
**Active Learning over Passive Absorption**: Hyperskill firmly believes that simply reading or watching lectures is insufficient for learning to code. Instead...
**Project-based Learning as Contextual Anchor**: At Hyperskill's core is the conviction that genuine learning happens when building real things, not just solving isolated problems. The platform...
INFO - Condensed question: Standalone question: How does Hyperskill achieve its tenet of "Active Learning over Passive Absorption"?
Hyperskill achieves its "Active Learning over Passive Absorption" tenet through several practical implementations:
- Every topic in the curriculum includes not just theoretical content, but also immediate quizzes and hands-on coding tasks that...
- The platform provides immediate feedback through automated testing of code submissions, showing learners exactly where they succeeded or failed...
- The system is designed so that theory units are closely followed by practical exercises, creating a tight integration between learning concepts and applying them...

Here, we used the following components:
The ChatEngine is a stateful, high-level interface for conversation. It combines a Retriever, a ResponseSynthesizer, and a Memory module. Its main purpose is to manage the conversation state (the history) and use it to process new messages, enabling natural follow-up questions.
The chat engine shows how LlamaIndex provides abstractions that handle state for you. This allows you to build more complex, interactive applications without manually managing the dialogue history.
The ResponseSynthesizer is the core component that manages the final synthesis stage. Its main job is to take the user's query and the retrieved Nodes and produce a single, coherent answer using the LLM.
The Memory module is a specialized component that maintains conversational context. It stores the ordered sequence of messages exchanged during the dialogue. This history is what lets stateful engines like the ChatEngine understand follow-up questions and generate context-aware responses.
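If you need more control over the conversation state, you can configure the memory and chat mode explicitly instead of relying on the defaults of .as_chat_engine(). A sketch, with an arbitrary token limit:

from llama_index.core.memory import ChatMemoryBuffer

# Cap how much history is kept in the LLM's context window
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",  # rewrite follow-ups, then retrieve with context
    memory=memory
)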
Autonomous workflows with agents and tools
The most advanced applications require LLMs to do more than answer questions; they need them to reason and take action. An Agent is an autonomous entity that uses an LLM as its reasoning brain to choose from a set of Tools to accomplish a complex goal. This represents a shift from a fixed RAG workflow to a dynamic, LLM-driven workflow.
Let's build an agent that can both look up information and perform a calculation.
import asyncio
import os
import dotenv
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.anthropic import Anthropic
dotenv.load_dotenv()
Settings.embed_model = CohereEmbedding(api_key=os.getenv("COHERE_AI_API_KEY"))
Settings.llm = Anthropic(api_key=os.getenv("CLAUDE_API_KEY"), model="claude-3-7-sonnet-20250219")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# 1. Define tools: A 'Tool' is a capability offered to an agent.
# Tool 1: A query engine for our data
query_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="philosophy_tool",
    description="Provides information on Hyperskill's core philosophy."
)
# Tool 2: A simple Python function
def add_numbers(a: float, b: float) -> float:
    """Adds two numbers together."""
    return a + b
calculator_tool = FunctionTool.from_defaults(fn=add_numbers)
# 2. Create an agent: The 'Agent' is the orchestrator that decides which tools to use.
agent = ReActAgent(
    tools=[query_tool, calculator_tool],
    verbose=True  # Shows the agent's thought process
)
async def main():
    # 3. Run the agentic workflow
    response = await agent.run("What is Hyperskill's top core tenet? Also, what is 20 + 22?")
    print(str(response))

if __name__ == "__main__":
    asyncio.run(main())
"""
Hyperskill's top core tenet is "Active Learning over Passive Absorption." This principle emphasizes...
As for your second question, 20 + 22 = 42.
"""

Key abstractions introduced here are:
A Tool is a well-defined interface for a specific capability. It combines a function with a name and description that the agent's LLM can understand. LlamaIndex provides helpers like QueryEngineTool and FunctionTool to create them easily.
An Agent is the central orchestrator. It operates in a loop (like ReAct's thought -> action -> observation loop) to break down complex tasks, select appropriate tools, execute them, and synthesize the final answer. It represents the most advanced type of workflow in LlamaIndex.
A Workflow is the end-to-end sequence of operations that processes input to generate a final response. The specific arrangement of components defines it—from a simple pipeline with a Retriever and ResponseSynthesizer to the dynamic reasoning loop of an Agent. It represents the complete application logic that brings all core abstractions together to function as a cohesive system.
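One practical detail worth underlining: the tool's name and description are what the agent's LLM reads when deciding which tool to call. As a sketch, here is a hypothetical second math tool with an explicit name and description (FunctionTool otherwise infers them from the function name and docstring):

from llama_index.core.tools import FunctionTool

def multiply_numbers(a: float, b: float) -> float:
    """Multiplies two numbers together."""
    return a * b

# The explicit name and description guide the agent's tool selection
multiply_tool = FunctionTool.from_defaults(
    fn=multiply_numbers,
    name="multiply_tool",
    description="Multiplies two numbers and returns the product."
)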
Conclusion
We have built several increasingly sophisticated applications while exploring the core features of LlamaIndex. We examined how raw data is stored in Documents and processed into Nodes. We configured customizable LLM and embedding model objects, which function as the reasoning and embedding engines of your application.
We advanced from the stateless QueryEngine with its customizable Retriever workflow to the stateful ChatEngine that manages conversational memory. We then created an autonomous Agent that uses Tools to run dynamic, multi-step workflows. These core building blocks give you the foundation and flexibility to move beyond simple prototypes and build robust, production-grade LLM-powered applications.
We have only scratched the surface of the LlamaIndex ecosystem, as it contains many more components and integrations for building applications. For all available components and features, make sure to check the official documentation and API reference.