
Local development with Ollama

Developing applications powered by LLMs typically involves using cloud-based APIs. However, this raises concerns about data privacy, latency, and costs. What if you could run these models directly on your own infrastructure, giving you complete control over your data and development process? This is exactly what Ollama solves.

What is Ollama?

Ollama is an open-source tool that simplifies the process of downloading, setting up, and running LLMs directly on your hardware. It manages model weights, configurations, and dependencies, letting you interact with state-of-the-art models like Llama 3, Phi-3, and Mistral locally through simple commands. This local-first approach offers key benefits, including better data privacy, lower latency, and complete offline functionality.

Ollama uses an efficient client-server architecture. A lightweight server runs in the background on your machine to manage models and process requests. You can interact with this server through a command-line interface (CLI) or by making API calls from your applications.

Under the hood, Ollama uses highly optimized backends such as llama.cpp to run model inference efficiently, even on modest consumer hardware. It makes smart use of the resources available on your machine and supports GPU acceleration on NVIDIA and AMD cards, as well as Apple Metal on macOS, so you get the best performance your hardware can deliver.

A standout feature of Ollama is its large and growing library of open-source models. The platform gives you easy access to many pre-trained models, each designed for different tasks and hardware requirements. This variety helps you experiment and find the ideal model for your needs without being tied to one provider.

Ollama goes beyond just running existing models; it also enables customization. Through a simple configuration file called a Modelfile, you can adjust a model's behavior, tune parameters such as temperature, and shape its personality. As we'll see later, this customizability gives you precise control over a model's outputs.

Ollama helps developers and researchers innovate freely. By eliminating the need for cloud services, it's ideal for projects that require data security, cost efficiency, and offline capabilities. Ollama handles the technical complexities so you can focus on creating innovative, intelligent applications.

Setting up your Ollama environment

Before running LLMs on your machine, you need to set up the Ollama environment. The process is straightforward, regardless of your operating system. Ollama officially supports macOS, Windows, and Linux, providing a consistent experience across all platforms.

The installation package sets up the Ollama binary and the background server that manages your models. Windows users simply download an installer and follow the prompts. macOS users download the app and drag it into the Applications folder. On Linux, a single command is all you need:

$ curl -fsSL https://ollama.com/install.sh | sh

Once installed, you can verify that everything works correctly by opening a new terminal window and running the command ollama --version or ollama -v. A successful installation will show the installed version of Ollama:

$ ollama -v
ollama version is 0.12.5

With the Ollama service running in the background, your environment is ready. You can now download models from the library and interact with them directly from the command line, which we'll explore in the next sections.
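
If you ever want to confirm that the background server itself is reachable, you can send a quick request to its default address, http://localhost:11434; it replies with a short status message:

$ curl http://localhost:11434/
Ollama is running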

Running models with Ollama

With your Ollama environment set up, let's dive into the exciting part: running and interacting with LLMs. Ollama's command-line interface (CLI) makes this process intuitive, similar to tools like Docker. The main command you'll use is ollama run. This command downloads a model if it's not already on your system and starts an interactive chat session with it in your terminal.

Open your terminal and type the following command:

$ ollama run llama3:8b

When you run the llama3 model for the first time, Ollama will "pull" it from the official library and show a progress bar as it downloads the files:

pulling manifest
pulling 6a0746a1ec1a: 100% ▕████████████████████████████████████████▏ 4.7 GB
pulling 4fa551d4f938: 100% ▕████████████████████████████████████████▏  12 KB
pulling 8ab4849b038c: 100% ▕████████████████████████████████████████▏ 254 B
pulling 577073ffcc6c: 100% ▕████████████████████████████████████████▏ 110 B
pulling 3f8eb4da87fa: 100% ▕████████████████████████████████████████▏ 485 B
verifying sha256 digest
writing manifest
success

After the download finishes, it loads the model into memory and gives you a prompt to start chatting. You can ask questions, request summaries, or ask it to write code:

>>> Hey there. 
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

>>> Send a message (/? for help)

This interactive session helps you quickly test a model's capabilities and understand its personality and performance.

Ollama offers many models to choose from, each with different sizes, capabilities, and instruction-tuning. You can see the full list on the Ollama website. Each model has a name and an optional tag (like llama3:8b for the 8-billion parameter version). If you don't specify a tag, Ollama uses the latest tag by default.

You can try phi3 if you want a smaller, lighter-weight model. If you're working on coding tasks, codellama is a great choice.

Managing models

Beyond running models, the Ollama CLI offers several useful commands for managing your local model library. To view a list of all your downloaded models, use the ollama list command. This shows a table with each model's name, ID, size, and last modification date.

If you want to download a model without starting a chat session immediately, use ollama pull <model_name>. This helps you pre-load models for future use. To free up disk space, remove any models you don't need with ollama rm <model_name>. You can also check detailed information about a specific local model using ollama show <model_name>. This command shows the model's metadata, including its license, capabilities, and parameters.

If you want to copy a model, use ollama cp <source_model> <new_model_name>. This is perfect for creating a duplicate that you intend to customize. These management commands give you complete control over your local AI toolkit, helping you maintain an organized and efficient development environment.
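
Put together, a short housekeeping session might look like this (phi3 is only an example; substitute any model you have on disk):

$ ollama pull phi3             # download a model without starting a chat
$ ollama list                  # see every local model, its size, and when it was modified
$ ollama show phi3             # inspect the model's license, capabilities, and parameters
$ ollama cp phi3 phi3-custom   # duplicate it as a starting point for customization
$ ollama rm phi3-custom        # remove the copy to free up disk space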

Integrating models into your applications

While interacting with models through the command line works well for quick tests, you eventually have to integrate them into your applications. Ollama provides a REST API that lets you programmatically interact with models running locally. To make this process easier, the Ollama team offers an official Python library, ollama-python, which serves as a convenient wrapper around this API. You can also use frameworks like LangChain or LlamaIndex to interact with Ollama models.
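
For a quick sanity check, you can call the REST API directly. By default it listens on http://localhost:11434, and the /api/generate endpoint accepts a JSON payload with a model name and prompt (setting "stream": false returns one complete JSON response). For most applications, though, the Python library shown below is more convenient:

$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'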

First, install the library using pip:

$ pip install ollama

After installation, you can import it and use it in your Python scripts. Before running any code, make sure the Ollama service is running in the background on your machine. The library communicates with this background process, so it must be active to handle your requests.

Let's look at a basic example—generating text. You can use the ollama.generate function to send a prompt to a model and get a single, complete response. Simply specify the model you want to use and your prompt. The function returns a dictionary containing the response and other useful metadata:

import ollama

response = ollama.generate(
  model='llama3:8b',
  prompt='Why is the sky blue?'
)

print(response['response'])
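
Both generate and chat also accept an optional options dictionary for per-request tuning. The keys below, temperature and num_predict, are standard Ollama options; the specific values are only illustrative:

import ollama

response = ollama.generate(
  model='llama3:8b',
  prompt='Write a haiku about the ocean.',
  options={
    'temperature': 0.3,  # lower values make the output more predictable
    'num_predict': 64,   # cap the number of tokens generated
  }
)

print(response['response'])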

For creating conversational applications like chatbots, the ollama.chat function is perfect. This function allows you to send a list of messages, each with a role (user, assistant, or system) and content. This structure helps the model maintain context throughout a conversation, leading to more coherent and relevant responses:

import ollama

response = ollama.chat(
  model='llama3:8b',
  messages=[
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Why is the sky blue?'},
  ]
)

print(response['message']['content'])
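
To carry a conversation across several turns, keep appending the assistant's reply and the next user message to the messages list before calling chat again. Here is a minimal sketch; the follow-up question is just an example:

import ollama

messages = [
  {'role': 'system', 'content': 'You are a helpful assistant.'},
  {'role': 'user', 'content': 'Why is the sky blue?'},
]

# First turn
reply = ollama.chat(model='llama3:8b', messages=messages)
print(reply['message']['content'])

# Keep the assistant's answer in the history, then ask a follow-up
messages.append({'role': 'assistant', 'content': reply['message']['content']})
messages.append({'role': 'user', 'content': 'Explain that like I am five.'})

reply = ollama.chat(model='llama3:8b', messages=messages)
print(reply['message']['content'])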

A key feature of the Ollama Python library is its streaming support. Instead of waiting for the complete response, you can process it in real-time as the model generates it, word by word. Set stream=True in the generate or chat function call to get a generator that provides response chunks as they become available. Streaming is essential for creating responsive user experiences in interactive applications, as it provides immediate feedback rather than making users wait.

Here is the chat example from above, rewritten to stream its output:

import ollama

response = ollama.chat(
    model='llama3:8b',
    stream=True,
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Why is the sky blue?'},
    ]
)

for chunk in response:
    if 'message' in chunk and 'content' in chunk['message']:
        print(chunk['message']['content'], end='', flush=True)

Customizing models with Modelfile

Ollama's extensive library of pre-built models covers many use cases. However, you might want to create a custom version of a model for your specific needs. This is where the Modelfile comes in. A Modelfile is a simple, text-based configuration file that serves as a blueprint for creating a new custom model. It's similar to a Dockerfile for container images. It lets you define a model's characteristics, such as its base model, system prompt, and operating parameters, in a clear and repeatable way.

The syntax of a Modelfile is straightforward. It consists of instructions, each on a new line, followed by arguments. The most important instruction is FROM, which specifies the base model for your custom model. You can use any model you've already pulled from the Ollama library. For example, FROM llama3:latest tells Ollama to use the latest Llama 3 model as the foundation for your new creation. This instruction ensures your custom model inherits all the capabilities of its parent.

After defining the base model, you can customize its behavior. A key instruction is SYSTEM. This lets you set a permanent system message that shapes the model's personality and responses. You could make the model respond like a pirate or act as an expert in a specific field. This system prompt automatically appears at the start of the conversation, ensuring the model keeps your desired persona without needing to repeat the instruction in every prompt:

# Use Llama 3 as the base model
FROM llama3:8b

# Set the system message to define the model's personality
SYSTEM """
You are a helpful and witty AI assistant named 'CodeBot'.
You always include a fun fact about programming in every response.
"""

Besides setting a personality, the Modelfile lets you adjust the model's parameters using the PARAMETER instruction. For example, you can set the temperature to control the randomness of the output (lower values make it more predictable). You can also set num_ctx to define the context window size, among other available parameters. These parameter adjustments can greatly affect the model's performance and creativity, helping you optimize it for specific tasks.
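
For example, adding lines like these to the CodeBot Modelfile makes its answers more deterministic and widens its context window (the values are only illustrative):

PARAMETER temperature 0.3
PARAMETER num_ctx 4096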

After you create your Modelfile (saved as Modelfile without an extension), you can build your custom model using the ollama create command. To create a model named codebot, run:

$ ollama create codebot -f ./Modelfile

Ollama will process your instructions and create a new, complete model based on your settings. This new model will appear in your ollama list, ready for use with ollama run like any other model. This process helps you build a collection of specialized, reusable models perfectly suited for your projects.
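
Because a custom model behaves like any other local model, you can also call it from the Python library exactly as before; here codebot is the model created above:

import ollama

response = ollama.chat(
  model='codebot',
  messages=[
    {'role': 'user', 'content': 'What is a linked list?'},
  ]
)

print(response['message']['content'])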

Conclusion

We have explored Ollama for local development with AI models. We discovered that Ollama is a streamlined tool that simplifies running LLMs locally, offering needed privacy, speed, and control. We walked through the setup process across Windows, macOS, and Linux, showing you how to get your environment ready in seconds. You learned how to use the intuitive Ollama CLI to run, manage, and interact with a vast library of open-source models like Llama 3 and Phi-3.

Then, we looked at application development by integrating Ollama with Python. Using the official ollama library, you saw how to perform text generation and create conversational experiences. This enabled you to embed AI capabilities directly into your projects.

Finally, we explored the core of customization with the Modelfile. You learned how to build your own specialized models by setting a base model, defining a persistent SYSTEM prompt, and adjusting PARAMETERs to create AI assistants suited for any task.
