Building Your Own Local LLM: A Hands-On Journey

Why Build Your Own LLM Setup?

If you’re reading this, you’ve probably used ChatGPT, Claude, or another AI assistant. They’re incredibly powerful, but have you ever paused to think about what happens to your data when you hit “send”? Every query, every piece of code you share, every business idea you brainstorm—it all gets processed on someone else’s servers.

This isn’t just about privacy paranoia. It’s about understanding and controlling the technology that’s rapidly becoming essential to how we work and think.

Learning by Building

I’m a firm believer that the best way to understand something is to build it yourself. Reading documentation is great, watching tutorials helps, but nothing beats getting your hands dirty with actual code. If you’re like me—someone who needs to do to truly understand—then this series is for you.

Over the next few posts, we’ll embark on a journey from zero to a fully functional, private AI assistant running entirely on your local machine. No cloud dependencies, no data leaving your computer, just you and your own personal LLM.

So let's get started.

To have a ChatGPT-like assistant running locally, we need two main components: a model (LLM) and an interface (the way we talk to it).

  • So what is an LLM?
    • For decades, we’ve used NLP for sentiment analysis, text classification, named entity recognition, and machine translation. Each task required its own specialized model, carefully trained on labeled datasets.
    • Then in 2017, the Transformer architecture changed everything. Instead of processing text word-by-word, Transformers could analyze entire sentences at once using “attention mechanisms”—essentially learning which words relate to each other regardless of their position.
    • Big tech companies realized they could train massive Transformers on a simple task—predict the next word—using trillions of words from the internet. These Large Language Models (LLMs) weren’t explicitly taught grammar or facts, but by learning to predict text really well, they implicitly learned to understand and generate human-like responses. OpenAI’s GPT, Google’s PaLM, Meta’s Llama—they all follow this approach, spending millions on compute to train these models.
    • The good news is that many of these models are now open-sourced! Meta releases Llama, OpenAI just released GPT-OSS (their first open-weight model since GPT-2!), Google shares Gemma, and so on. We can download and use models that cost millions to train, completely free.
  • But how do we actually get and run these models?
    • You can’t just download a model file and double-click it. These models are essentially huge arrays of numbers (tensors) saved in formats like GGUF or SafeTensors. To run them, you need:
      • A runtime that can load these weights into memory
      • An inference engine to process your input through the neural network
      • A way to handle tokenization (converting text to numbers and back)
      • Optimization for your hardware (CPU vs GPU, memory management)
    • There are many ways to do this, but I'm going to focus on two of them:
  • Option 1: Use Ollama
    • Think of it as “Docker for LLMs”
    • Runs as a local server (written in Go, using llama.cpp under the hood)
    • You interact with it via REST API or command line
      • Pros: Dead simple installation, handles all optimization automatically, manages models for you, great model library
      • Cons: Less flexibility, abstracted away from the actual model files, limited to models in their registry
  • Option 2: Download directly from Hugging Face
    • Get raw model files from huggingface.co (the GitHub of AI models)
    • Use Python libraries like Transformers or llama.cpp-python
      • Pros: Complete control, access to any model, can modify and fine-tune, understand exactly what’s happening
      • Cons: Complex setup, manage CUDA/PyTorch yourself, handle memory management, write your own inference code, deal with tokenizers
  • For our first project, we’re using Ollama because it lets us focus on building rather than fighting with dependencies. Once Ollama is installed, it runs a lightweight server on your machine (default port 11434). When you chat with the model, your Python code sends HTTP requests to this local server, which handles all the heavy lifting of running the neural network and returns generated text.
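
To make that concrete, here's a minimal sketch of what a single request to that local server looks like at the HTTP level. It assumes Ollama is already running, the requests library is installed, and a model named llama3.2:1b has been pulled (we'll do both in the steps below); the chat client we build later uses the ollama Python package instead of raw HTTP.

import requests  # assumes: pip install requests

# Ollama's local server listens on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2:1b",  # any model you've already pulled
        "messages": [
            {"role": "user", "content": "Say hello in one short sentence."}
        ],
        "stream": False,  # ask for one JSON response instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()

# The reply text lives under message.content in the response JSON.
print(resp.json()["message"]["content"])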

Alright, now that we have enough context, let's build.

📦 Get the complete code: Download the full project from GitHub

Step 1: Installing Ollama

First, let’s get Ollama up and running. Installation is straightforward:

macOS/Linux: Visit ollama.com/download or run curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download

Once installed, verify it's working by running ollama --version in your terminal (ollama list shows which models you've downloaded).

Step 2: Downloading Your First Model

Let's start with the 1B-parameter version of Llama 3.2, a small but capable model. Open your terminal and run:

ollama pull llama3.2:1b

Step 3: Test Drive Your Model

Before we write any Python code, let’s make sure everything works:

ollama run llama3.2:1b

You’ll get an interactive prompt. Try asking it something:

Tell me a joke about programming

Type /bye (or hit Ctrl+D) to exit when you're done playing around.

Step 4: Building Our Python Chat Interface

Now, let’s build a Python script to interact with our model.

git clone https://github.com/anirudh83/llm-blog-series.git
cd llm-blog-series/blog1

Install the Python dependency:

pip install ollama

Step 5: Running Your Local Chat Agent

That’s it! Run the chat interface:

python3 local_chat.py

You now have a fully functional, private ChatGPT alternative running on your machine. Try having a conversation:

You: What's the capital of France?
Assistant: The capital of France is Paris.

Understanding What’s Happening

Let’s break down the key components:

  • Ollama Service (Server)
    • Acts as a local LLM runtime.
    • Loads the model (e.g., Llama 3.2) into memory and serves it via an HTTP API.
    • Handles all low-level details: tokenization, model execution, hardware acceleration (e.g., GPU if available), and memory optimization.
    • Think of it as a daemonized inference engine—your Python code talks to it over localhost:11434.
  • Python Client (local_chat.py)
    • Implements a terminal-based client for chatting with Ollama.
    • Calls ollama.chat() with the full conversation history; under the hood, the library sends the HTTP request to the local server.
    • Streams and prints the model's response in real time (word by word).
    • Provides a clean abstraction (LocalLLM class) for managing prompts, history, temperature, and model switching.
  • Conversation History
    • Maintains context across messages using a list of dicts:
      • {'role': 'user' | 'assistant' | 'system', 'content': '…'}
    • This history is sent on every request so the model can generate coherent, context-aware replies.
    • You can clear, inspect, or save this history anytime.
  • Streaming Responses
    • By default, responses are streamed as tokens (chunks of text) arrive.
    • This mimics the ChatGPT typing effect and improves UX.
    • Implemented via a simple loop over ollama.chat(..., stream=True); a minimal version of that loop is sketched below.
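
Putting those pieces together, the heart of the client is a loop like the one below. This is a stripped-down sketch of the same pattern, not the exact code from local_chat.py, and it assumes the ollama package is installed and llama3.2:1b has been pulled.

import ollama

MODEL = "llama3.2:1b"
# Conversation history: a growing list of role/content dicts.
history = [{"role": "system", "content": "You are a concise, helpful assistant."}]

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        break

    history.append({"role": "user", "content": user_input})

    # Stream the reply chunk by chunk and accumulate it for the history.
    print("Assistant: ", end="", flush=True)
    reply = ""
    for chunk in ollama.chat(model=MODEL, messages=history, stream=True):
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        reply += piece
    print()

    history.append({"role": "assistant", "content": reply})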

Customizing Your Experience

Want to experiment? The code is designed to be hackable:

  • System Prompts: Make the model behave differently (be a pirate, a teacher, or a code reviewer); see the sketch after this list
  • Temperature Control: Adjust creativity vs consistency in responses
  • Model Switching: Try different models for different tasks on the fly
  • Conversation Management: Save important chats, clear history when needed
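
As a rough sketch of the first three knobs (the exact options in local_chat.py may be wired differently): a system prompt is just the first message in the list, temperature is passed through Ollama's options dict, and switching models means changing the model argument.

import ollama

messages = [
    {"role": "system", "content": "You are a grumpy pirate who reviews Python code."},
    {"role": "user", "content": "Review this line: print('hello world')"},
]

response = ollama.chat(
    model="llama3.2:1b",           # swap in "codellama", "mistral", etc. to switch models
    messages=messages,
    options={"temperature": 0.2},  # lower = more consistent, higher = more creative
)
print(response["message"]["content"])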

Try different models for different tasks:

  • codellama: Specialized for code generation
  • mistral: Good balance of speed and capability
  • gemma2: Google’s efficient open model
  • phi3: Microsoft’s tiny but mighty model

Performance Tips

  • Model Selection: Start with smaller models (1B-3B parameters) to test your setup
  • Quantization: Ollama uses 4-bit quantization by default, balancing quality and speed
  • Context Length: Keep conversations focused—long histories slow down responses (see the trimming sketch below)
  • GPU Acceleration: If you have an NVIDIA GPU, Ollama will automatically use it
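
One simple way to act on the context-length tip is to cap how much history you send with each request. The helper below is purely illustrative (trim_history and MAX_MESSAGES are names made up for this sketch, not part of the repo); it keeps any system prompt plus only the most recent messages.

MAX_MESSAGES = 8  # how many recent user/assistant messages to keep

def trim_history(history):
    """Keep any system prompt plus only the most recent messages."""
    system = [m for m in history if m["role"] == "system"]
    recent = [m for m in history if m["role"] != "system"][-MAX_MESSAGES:]
    return system + recent

# Use it right before each request, e.g.:
# ollama.chat(model=MODEL, messages=trim_history(history), stream=True)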

What’s Next?

Congratulations! You’ve just built your own private AI assistant. But this is just the beginning. In upcoming posts, we’ll explore:

  • How to make your LLM understand and work with your specific documents and data
  • Building more sophisticated applications that integrate with your daily workflow
  • Training techniques to specialize your model for particular tasks
  • Creating multi-agent systems where different models collaborate
  • Deploying your local LLM for family or team use while maintaining privacy

The beauty of running models locally is that you’re not limited by API rate limits or costs. You can experiment freely, fail fast, and learn continuously.

Get the complete code and detailed setup instructions at: github.com/anirudh83/llm-blog-series/tree/main/blog1
