Building a Distributed AI Inference Cluster in My Homelab

How I used llama.cpp’s RPC backend to run a real distributed LLM job across three machines — mirroring the architecture used in production data centers.


Introduction

I’ve been exploring AI infrastructure from an IT and systems engineering angle — not the math and model research side, but the platform side. How do you actually run these models at scale? What does the hardware stack look like? How do cloud providers like Google, Anthropic, and OpenAI serve millions of inference requests per day?

To answer those questions properly, I decided to stop reading about it and start building it. I put together a three-node homelab cluster and ran a genuinely distributed LLM inference job — one prompt, split across multiple machines, exactly the way it’s done in data centers. Just at a slightly smaller scale.

This post documents what I built, how I built it, and what I learned. My goal is to demonstrate that distributed AI infrastructure is not magic reserved for hyperscalers — the same architectural patterns are reproducible at home with the right software stack.


The Hardware

Each of my three nodes runs an AMD Threadripper 3995WX, a 64-core workstation CPU, with 250 GB of DDR4 RAM. That's 750 GB of combined system RAM across the cluster, which turns out to be the most important number in this build.

The GPUs in these machines are consumer-grade cards from different generations and vendors, which creates compatibility challenges for GPU-based distributed inference. Rather than fight that battle, I focused on CPU-based distributed inference using system RAM as the primary resource. This is a legitimate architectural choice: when you have enough RAM, you can run very large models without touching the GPU at all.

Node       CPU                  RAM     GPU
mj0ezx03   Threadripper 3995WX  250 GB  NVIDIA GTX 1080 (8 GB)
mj0edeh6   Threadripper 3995WX  250 GB  NVIDIA GTX 980 Ti (6 GB)
mj0ezwz    Threadripper 3995WX  250 GB  AMD RX 5700 XT (8 GB)

What Is Distributed Inference?

Before diving into the setup, it’s worth explaining what distributed inference actually means and why it matters.

Large language models contain billions of parameters — numerical weights stored in memory that define the model’s behavior. A model like Llama 3.1 8B has, as the name suggests, 8 billion parameters. At 16-bit precision that’s roughly 16 GB of memory just to hold the weights. Larger models like Llama 70B require around 140 GB at full precision, or ~35 GB at 4-bit quantization.
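As a sanity check on those numbers, the arithmetic is just parameter count times bytes per parameter. A quick awk one-liner reproduces them:

```shell
# Weight memory = parameter count x bytes per parameter
awk 'BEGIN {
  printf "Llama 8B  at FP16  (2 B/param)  : %g GB\n", 8e9  * 2   / 1e9
  printf "Llama 8B  at 4-bit (0.5 B/param): %g GB\n", 8e9  * 0.5 / 1e9
  printf "Llama 70B at FP16  (2 B/param)  : %g GB\n", 70e9 * 2   / 1e9
  printf "Llama 70B at 4-bit (0.5 B/param): %g GB\n", 70e9 * 0.5 / 1e9
}'
```

Note this counts weights only; the KV cache and activations need additional memory on top.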

When a model is too large to fit on a single machine’s GPU VRAM, it gets sharded — split across multiple devices. Each device holds a portion of the model’s layers. During inference, the computation flows through each shard in sequence, passing activations between nodes via the network interconnect.
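To make "each device holds a portion of the model's layers" concrete: Llama 3.1 8B has 32 transformer layers, so a naive even split across three workers yields three contiguous shards. (llama.cpp decides the actual split itself, weighing each backend's available memory; this sketch only illustrates the idea.)

```shell
# Naive contiguous layer split: 32 layers over 3 workers
awk 'BEGIN {
  layers = 32; workers = 3; start = 0
  for (w = 0; w < workers; w++) {
    # Distribute the remainder across the first (layers % workers) workers
    n = int(layers / workers) + (w < layers % workers ? 1 : 0)
    printf "worker %d: layers %d-%d (%d layers)\n", w, start, start + n - 1, n
    start += n
  }
}'
```

During inference, worker 0's output activations become worker 1's input, and so on down the chain.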

In production data centers, this happens across NVIDIA H100 GPUs connected by InfiniBand at 200+ Gb/s. In my homelab, it happens across Threadripper CPUs connected by gigabit Ethernet. The software architecture is identical. The speed is not.


The Software Stack: llama.cpp RPC

llama.cpp is an open-source inference engine that runs large language models efficiently on consumer hardware. It supports quantized model formats (GGUF) that dramatically reduce memory requirements, and it includes an RPC backend that enables distributing a model across multiple machines over a standard TCP network.

The RPC architecture works like this:

[ Client Node ]
  llama-server or llama-cli
  |
  |-- RPC --> [ Worker Node 1 ] rpc-server (holds model layers 0-N)
  |-- RPC --> [ Worker Node 2 ] rpc-server (holds model layers N+1-M)
  |-- RPC --> [ Worker Node 3 ] rpc-server (holds model layers M+1-end)

The client node orchestrates inference, distributing the model’s layers across however many RPC workers are available. Each worker runs its share of the computation on its local CPU, holding its layers in local system RAM.


Step 1: Download llama.cpp

The prebuilt binaries are available on the llama.cpp releases page.

For Ubuntu x64 CPU-only inference (which is what we want for cross-vendor distributed inference), download the CPU build:

cd ~
wget https://github.com/ggml-org/llama.cpp/releases/download/b8412/llama-b8412-bin-ubuntu-x64.tar.gz
tar -xzf llama-b8412-bin-ubuntu-x64.tar.gz
cd llama-b8412

Do this on all three nodes. The CPU build has no GPU driver dependencies — no CUDA, no ROCm — so it works identically regardless of which GPU is physically in the machine.

Note on GPU builds: The Linux release page includes a ROCm build for AMD GPUs but no CUDA build (the prebuilt CUDA binaries are Windows-only). For this distributed CPU RAM experiment, the CPU build is the correct choice regardless.


Step 2: Start the RPC Server on Each Worker Node

The rpc-server binary listens for incoming connections from a client and executes model layer computations using local resources. Start it on each of the three nodes:

cd ~/llama-b8412
LD_LIBRARY_PATH=. ./rpc-server -H 0.0.0.0 -p 50052

The LD_LIBRARY_PATH=. tells the OS to look in the current directory for the shared library files (.so files) that ship alongside the binary. Without this, the dynamic linker won’t find them.

The -H 0.0.0.0 flag tells the server to listen on all network interfaces, not just localhost, so other nodes on the LAN can connect.

You should see output similar to:

load_backend: loaded RPC backend from /home/user/llama-b8412/libggml-rpc.so
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!

Security note: The warning is real. The RPC server has no authentication. Only run this on a trusted private network, never exposed to the internet.
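To keep the workers up across reboots without babysitting three terminals, a systemd unit is a natural next step. This is a sketch I'd suggest rather than something from my original setup; the install path and user are assumptions:

```ini
# /etc/systemd/system/llamacpp-rpc.service (path and user are assumptions)
[Unit]
Description=llama.cpp RPC worker
After=network-online.target

[Service]
User=user
WorkingDirectory=/home/user/llama-b8412
Environment=LD_LIBRARY_PATH=/home/user/llama-b8412
ExecStart=/home/user/llama-b8412/rpc-server -H 0.0.0.0 -p 50052
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it on each node with sudo systemctl enable --now llamacpp-rpc.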


Step 3: Download a Model

I used Meta’s Llama 3.1 8B Instruct model in IQ2_M quantization format, which reduces the model to approximately 3 GB while preserving reasonable output quality. You can download quantized GGUF models from Hugging Face:

# Example - adjust the URL for whichever model you choose
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-IQ2_M.gguf

Place the model file on the primary node (the one that will run llama-server). The worker nodes do not need a copy of the model file — they receive layer data from the primary node at runtime.


Step 4: Run Distributed Inference

With rpc-server running on all three nodes, start the inference server on the primary node, pointing it at all three RPC workers:

cd ~/llama-b8412
LD_LIBRARY_PATH=. ./llama-server \
  --model ~/Meta-Llama-3.1-8B-Instruct-IQ2_M.gguf \
  --rpc 192.168.2.117:50052,192.168.2.118:50052,192.168.2.119:50052 \
  --host 0.0.0.0 \
  --port 8080

Replace the IP addresses with your actual node IPs. The --rpc flag accepts a comma-separated list of host:port pairs. llama.cpp will distribute the model’s layers across all listed RPC backends automatically.
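Before launching, it's worth confirming each worker is actually listening. A quick check using bash's built-in /dev/tcp (substitute your own endpoints):

```shell
# Probe each rpc-server endpoint before starting llama-server
for ep in 192.168.2.117:50052 192.168.2.118:50052 192.168.2.119:50052; do
  host=${ep%:*}
  port=${ep#*:}
  # /dev/tcp/<host>/<port> opens a TCP connection; timeout caps the wait
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$ep: reachable"
  else
    echo "$ep: NOT reachable - check rpc-server and firewall"
  fi
done
```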


Step 5: Verify It’s Working

Check that the server is healthy:

curl http://192.168.2.117:8080/health

Expected response:

{"status":"ok"}

Send a real inference request using the OpenAI-compatible API:

curl http://192.168.2.117:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "user", "content": "Explain distributed computing in one paragraph."}
    ]
  }'

If you receive a coherent response, congratulations — you just ran a distributed LLM inference job across multiple machines.
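The raw JSON response is verbose. To pull out just the assistant's text, pipe it through jq (assuming jq is installed); here is the filter applied to a trimmed sample of the response shape:

```shell
# chat/completions nests the reply under choices[0].message.content
response='{"choices":[{"message":{"role":"assistant","content":"Distributed computing splits one job across many machines."}}]}'
echo "$response" | jq -r '.choices[0].message.content'
```

In practice, append | jq -r '.choices[0].message.content' to the curl command above.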


Step 6: Add a Web UI with Open WebUI

The llama-server ships with a basic built-in web UI, but it can be unreliable in prebuilt binaries. For a proper ChatGPT-style interface, run Open WebUI as a Docker container pointed at your llama-server API:

docker run -d \
  -p 3000:8080 \
  -e OPENAI_API_KEY=none \
  -e OPENAI_API_BASE_URL=http://192.168.2.117:8080/v1 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Visit http://192.168.2.117:3000 in your browser. You now have a polished, self-hosted AI chat interface running on top of your distributed inference cluster.


Architecture Diagram

                    ┌─────────────────────────────────┐
                    │         Open WebUI              │
                    │    (Docker, port 3000)          │
                    └────────────────┬────────────────┘
                                     │ HTTP
                    ┌────────────────▼────────────────┐
                    │         llama-server            │
                    │    (mj0ezwz, port 8080)         │
                    │    Model: Llama 3.1 8B IQ2_M    │
                    └──────┬──────────┬────────────┬──┘
                           │ RPC      │ RPC        │ RPC
               ┌───────────▼───┐  ┌───▼────────┐  ┌▼────────────┐
              │  rpc-server   │  │ rpc-server │  │ rpc-server  │
              │   mj0ezwz     │  │  mj0ezx03  │  │  mj0edeh6   │
              │  250 GB RAM   │  │ 250 GB RAM │  │ 250 GB RAM  │
              └───────────────┘  └────────────┘  └─────────────┘

Total distributed RAM available to the model: 750 GB across 3 nodes.


How This Maps to Production Data Centers

This homelab setup is architecturally equivalent to what runs in production, just at a different scale and with slower interconnects.

Component              My Homelab                    Production Data Center
Compute nodes          3x Threadripper workstations  Thousands of GPU servers
Memory per node        250 GB DDR4                   80 GB HBM3 (H100 VRAM)
Interconnect           1 GbE                         InfiniBand NDR (400 Gb/s)
Distributed framework  llama.cpp RPC                 vLLM, TensorRT-LLM, custom
Orchestration          Manual                        Kubernetes + KubeFlow / Ray
Model format           GGUF quantized                FP16 / BF16 full precision

The key insight is that the pattern is the same: a model is sharded across multiple nodes, compute happens locally on each shard, and activations are passed between nodes over the network. The software doing this in my lab (llama.cpp) uses the same conceptual collective operations as production frameworks.


Limitations and What I’d Do Differently

Interconnect is the real bottleneck. On 1 GbE Ethernet, passing activations between nodes is hundreds of times slower than what a real cluster uses. This makes distributed inference across separate machines genuinely slow — expect 1-5 tokens per second rather than 50+. For a proof of concept this is fine, but it illustrates exactly why InfiniBand exists.
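To put numbers on that gap, compare raw link bandwidth alone (ignoring latency, which makes things even worse for the chatty per-token transfers pipelined inference requires):

```shell
# Raw bandwidth: gigabit Ethernet vs InfiniBand NDR
awk 'BEGIN {
  gbe = 1e9   / 8   # 1 Gb/s   -> bytes/s
  ndr = 400e9 / 8   # 400 Gb/s -> bytes/s
  printf "1 GbE : %g MB/s\n", gbe / 1e6
  printf "NDR IB: %g GB/s\n", ndr / 1e9
  printf "ratio : %gx\n",     ndr / gbe
}'
```

A 400x raw-bandwidth gap before latency even enters the picture, which is exactly where "hundreds of times slower" comes from.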

GPU vendor fragmentation. My three GPUs are two NVIDIA cards and one AMD card. No mainstream distributed GPU inference framework supports mixing CUDA and ROCm backends in a single job. This forced a CPU-only approach, which is slower than GPU inference but sidesteps the compatibility problem entirely.


Next Steps

The natural progression from here involves two areas:

Upgrading the interconnect. Used InfiniBand hardware (Mellanox ConnectX-3, QDR/FDR generation) is surprisingly affordable on eBay — HCAs around $15-30 each, a used switch around $100-300. Replacing gigabit Ethernet with 40-56 Gb/s InfiniBand would put this cluster in the same category of interconnect technology used in real GPU clusters, just an older generation. That’s the single upgrade that would have the biggest impact on distributed inference performance.

Moving up the stack. The next layer of complexity involves deploying this behind a proper orchestration layer: Kubernetes (K3s at homelab scale) with KServe for model serving, Ray for distributed job scheduling, Prometheus and Grafana for observability, and MLflow for model versioning. These are the tools that actually run AI infrastructure in production, and all of them are open source and runnable on this hardware.


Conclusion

Distributed AI inference is not magic. The architectural patterns are open, the software is open source, and the hardware requirements — while demanding at the high end — are reproducible at homelab scale with enough RAM and the right interconnect.

What I’ve built here is a legitimate MVP of a distributed inference cluster. It uses the same sharding concepts, the same RPC communication patterns, and the same API surface as production systems. The numbers are smaller and the network is slower, but the fundamentals are identical.

For anyone interested in AI infrastructure, MLOps, or HPC systems engineering as a career path: you don’t need a data center budget to learn this stuff. You need curiosity, some used server hardware, and a willingness to read error messages carefully.

Did you use AI to write this blog post?

Of course! (With the exception of this one paragraph.) In fact, drafting this post was one of the first jobs I ran on the cluster once it was working, which makes it one of my first distributed, self-hosted LLM outputs. Though the post is AI-generated, the steps I took were certainly real and are documented above!


Built on Ubuntu 24, llama.cpp b8412, Llama 3.1 8B Instruct IQ2_M. All three nodes are AMD Threadripper 3995WX with 250 GB DDR4 RAM.