Running Local AI Models with Ollama: Infrastructure Requirements by Model Size (7B to 70B+)
A deep, practical guide to running local LLMs with Ollama, including how model size affects RAM/VRAM, CPU/GPU requirements, quantization choices, and real-world example setups from 7B to 70B+.
Running an AI model locally has shifted from “weekend curiosity” to a practical engineering choice: you get lower latency, stronger data control, predictable costs, and the ability to run workflows even when cloud access is unavailable. Ollama has become one of the simplest ways to do this on macOS, Linux, and Windows, largely because it wraps proven inference engines (notably llama.cpp-based runtimes and model packaging conventions) into a workflow that feels like docker run, but for LLMs.
This guide focuses on the part people most often get wrong: matching the model and quantization to your infrastructure. “Can my machine run a 13B model?” is rarely the right question; the right question is “Can my machine run this model, at this context length, with this quantization, at a throughput I consider usable?”
Below is a practical, infrastructure-first explanation—with examples—of how to plan local model deployments with Ollama from 7B through 70B+, and how CPU, RAM, GPU VRAM, quantization, and context length interact.
What Ollama is doing under the hood (and why it matters)
Ollama pulls a model package (commonly GGUF-based for llama.cpp-compatible execution) and allocates memory for:
- Model weights (dominant factor, affected by parameter count and quantization),
- KV cache (grows with context length and model size),
- Runtime overhead (tokenization, buffers, threads, GPU kernels, etc.).
When you choose a model in Ollama (for example, a Llama- or Mistral-family variant), you’re implicitly choosing:
- Parameter count (7B, 8B, 13B, 14B, 33B, 34B, 70B, etc.).
- Quantization (Q2/Q3/Q4/Q5/Q8, or similar).
- Context length (e.g., 4k, 8k, 16k, 32k), which drives KV cache memory.
- Execution path (CPU-only or GPU-accelerated, depending on platform and build).
Those choices determine whether the model fits and whether it runs fast enough to be useful.
Model size vs. memory: the first-order approximation
A rough way to think about weight memory is:
- FP16 weights: ~2 bytes per parameter (plus overhead); a 70B model would need ~140 GB for weights alone.
- Quantized weights: fewer bits per parameter, commonly:
  - Q4 ≈ 4 bits/param (0.5 bytes/param),
  - Q5 ≈ 5 bits/param (0.625 bytes/param),
  - Q8 ≈ 8 bits/param (1 byte/param).
This is not exact because real quantization formats add scales/zeros and block metadata, but it’s close enough to plan infrastructure.
Rule of thumb for weights only (very approximate):
- 7B at Q4: ~4–5 GB
- 13B at Q4: ~8–10 GB
- 33B/34B at Q4: ~18–22 GB
- 70B at Q4: ~35–45 GB
Then add KV cache and runtime overhead.
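The weights-only numbers above can be sanity-checked with quick arithmetic. The sketch below ignores quantization block metadata (scales/zeros), so real GGUF files typically run somewhat larger:

```shell
# Weights-only memory estimate: parameters * bits-per-weight / 8.
# Ignores quantization block metadata, so actual GGUF files are
# usually ~10% larger than this number.
params_b=70   # parameter count in billions
bits=4        # Q4-class quantization

gb=$(( params_b * bits / 8 ))   # decimal GB of weights
echo "~${gb} GB of weights for a ${params_b}B model at ~${bits} bits/param"
```

That lands at the low end of the ~35–45 GB figure for 70B at Q4; metadata, KV cache, and runtime overhead make up the rest.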
The KV cache: why context length can break “it fits” assumptions
Even if your weights fit in RAM/VRAM, large context can push you over the edge. KV cache scales with:
- number of layers,
- hidden size,
- context length,
- batch/parallelism,
- precision used for KV (often FP16/FP32 variants, sometimes optimized).
You don’t need the exact formula to plan; you need the consequence:
- Doubling context length (e.g., 4k → 8k) roughly doubles KV cache memory.
- Moving from a 7B to a 70B model increases KV cache needs substantially because the model is wider/deeper.
If you want long-context summarization (8k–32k), KV cache becomes a primary infrastructure driver, not just weights.
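To make the scaling concrete, here is a back-of-envelope KV-cache calculation in the same spirit. The architecture numbers (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache entries) are assumptions matching a Llama-3-8B-style model, not values read from any specific checkpoint:

```shell
# Approximate KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * bytes-per-element, per token of context.
layers=32; kv_heads=8; head_dim=128; bytes_per_elem=2   # FP16 entries
ctx=8192

per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
total_mib=$(( per_token * ctx / 1024 / 1024 ))

echo "KV cache: ${per_token} bytes/token, ${total_mib} MiB at ${ctx} context"
```

Roughly 1 GiB at 8k context for an 8B-class model. Double the context and the cache doubles; larger models multiply layers and heads on top of that.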
CPU-only vs GPU-accelerated: choosing the right bottleneck
CPU-only inference
CPU-only is viable and often pleasant for:
- 7B/8B models at Q4/Q5,
- 13B at Q4 on a strong desktop CPU,
- “chat at human pace” usage where 5–20 tok/s is acceptable.
CPU-only becomes frustrating for:
- 34B and above unless you accept very low throughput,
- long contexts (KV cache bandwidth),
- multi-user setups.
CPU inference bottlenecks are typically:
- memory bandwidth,
- vectorization (AVX2/AVX-512),
- thread scheduling.
GPU-accelerated inference
GPU acceleration is usually about:
- achieving higher token throughput,
- keeping latency low at higher context lengths,
- running bigger models (34B/70B) at usable speeds.
But it introduces a hard constraint:
- VRAM capacity determines how much of the model + KV cache can reside on GPU.
If VRAM is insufficient, runtimes can offload some layers to GPU and keep the rest in RAM (hybrid offload). This can work well, but performance varies with PCIe bandwidth and how much data ends up ping-ponging between RAM and VRAM.
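As a sketch of how hybrid offload is controlled, Ollama surfaces the llama.cpp layer-offload knob as the num_gpu parameter; the layer count and model name below are illustrative, and you would tune the count until weights plus KV cache fit in VRAM:

```
# Modelfile sketch: send 40 of the model's layers to the GPU and keep
# the remainder in system RAM. The count is illustrative -- raise it
# until you hit VRAM limits.
FROM llama3.1:70b
PARAMETER num_gpu 40
```

Build it with ollama create llama70-hybrid -f Modelfile (the name is illustrative) and run the new model as usual.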
Infrastructure tiers by model size (practical targets)
The following tiers assume typical chat/instruction use with moderate context (4k–8k) unless noted.
Tier A: 7B–8B models (best “default local LLM” class)
Typical models: Llama 3.x 8B class, Mistral 7B class, similar.
Recommended hardware (comfortable):
- RAM: 16 GB minimum; 32 GB preferred if you multitask or use long contexts.
- CPU: modern 6–12 cores.
- GPU (optional): 8–12 GB VRAM is nice but not required.
Quantization guidance:
- Q4: fastest and smallest, usually solid quality for chat and tools.
- Q5: noticeable quality bump for some tasks; more memory.
- Q8: closer to FP16 behavior, heavier footprint—often not necessary for casual use.
What “good” performance looks like:
- CPU-only: often usable.
- GPU: very responsive.
This is the size range where Ollama feels “effortless” on mainstream hardware.
Tier B: 13B–14B models (quality jump, heavier memory)
Typical models: Llama 2/3 13B class, Qwen-class 14B variants, etc.
Recommended hardware:
- RAM: 32 GB (realistic minimum for smooth work).
- GPU: 12–16 GB VRAM for high GPU offload; 24 GB is excellent.
- CPU: strong single-thread + good memory bandwidth matters.
Quantization guidance:
- Q4 is common and workable.
- Q5 is a good sweet spot if you have RAM/VRAM headroom.
- Avoid overly aggressive quantizations (Q2/Q3) unless you must fit tight constraints.
This tier is often the best “local-only” tradeoff if you want a clear quality bump but don’t want to redesign your machine around it.
Tier C: 33B–34B models (serious local inference territory)
Typical models: 33B/34B class (varies by family).
Recommended hardware:
- RAM: 64 GB recommended.
- GPU: 24 GB VRAM minimum if you want meaningful GPU acceleration; 48 GB+ is excellent.
- CPU-only: possible but usually slow.
Quantization guidance:
- Q4 is typically the entry point.
- Q5 may be too large for many single-GPU setups once KV cache is included.
This is where “it loads” and “it’s usable” diverge dramatically. If your use-case is interactive chat, GPU offload is often the difference between acceptable and not.
Tier D: 70B+ models (high-end, multi-GPU or very large VRAM)
Typical models: Llama 70B class, similar.
Recommended hardware:
- RAM: 128 GB if you plan to run large quantizations and long contexts; 64 GB can work for certain quantizations but becomes tight quickly.
- GPU: 48 GB VRAM (single) can work for Q4-class quantizations with careful KV-cache management; otherwise multi-GPU becomes attractive.
- CPU-only: generally impractical for interactive use.
Quantization guidance:
- Q4 variants are common for feasibility.
- Q5/Q6 can be excellent but often push you into multi-GPU or huge VRAM territory.
- Plan context length conservatively if VRAM is your constraint.
If you’re targeting 70B, you’re doing infrastructure planning in the same way you would for a small on-prem service, not a desktop tool.
Quantization choices in practice (Q4 vs Q5 vs Q8)
Quantization is not only about “will it fit?”; it affects:
- response quality (especially reasoning consistency, instruction following, and niche domains),
- speed (smaller weights can be faster due to cache and bandwidth),
- stability under long contexts.
A practical heuristic for local Ollama usage:
- Q4: best starting point. Quality is usually “good enough” and it fits more machines.
- Q5: use when you have spare memory and care about quality. Often a noticeable uplift for 13B+.
- Q8: use when you’re troubleshooting quality regressions or need maximal fidelity; otherwise it’s heavy.
If you’re on the edge of fit, dropping from Q5 to Q4 often fixes it immediately, while preserving more quality than jumping to a smaller parameter count.
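In practice you select quantization by tag when pulling. Tag naming varies per model in the Ollama library, so treat these exact tags as illustrative and check the model's library page if one doesn't resolve:

```shell
# Pull specific quantizations instead of the default tag. These follow
# the common K-quant naming convention and are illustrative.
ollama pull llama3.1:8b-instruct-q4_K_M   # smallest/fastest baseline
ollama pull llama3.1:8b-instruct-q5_K_M   # more memory, often better quality
ollama pull llama3.1:8b-instruct-q8_0     # near-FP16 fidelity, heavy
```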
Context length planning: choose it intentionally
A common failure mode is choosing a big model and then quietly expecting 16k or 32k context to “just work.” Long context requires:
- more KV cache memory,
- more compute per token,
- more time to prefill (processing the prompt).
Two practical patterns:
- Interactive chat: 4k–8k is usually plenty if you summarize and manage conversation.
- Document work (summarization/QA): consider smaller models with longer context, or chunking + retrieval (RAG), rather than forcing a huge model into massive context.
If you are building workflows around documents, it’s often better engineering to:
- chunk text,
- embed into a vector store,
- retrieve relevant passages,
- feed the model a smaller, focused context window.
That pipeline usually beats brute-forcing 32k contexts on a large model.
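When you do commit to a specific window, pin it explicitly so KV-cache memory stays predictable. In Ollama the context window is a Modelfile parameter (the value here is illustrative):

```
# Modelfile sketch: fix the context window at 8k instead of relying on
# the model's default.
FROM llama3.1:8b
PARAMETER num_ctx 8192
```

Build it with ollama create llama8b-8k -f Modelfile (the name is illustrative); requests against that model then budget KV cache for 8k tokens.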
Example setups (realistic, not aspirational)
Example 1: Laptop/dev machine, “I want local chat and coding help”
Target: 7B/8B Q4/Q5, 4k–8k context
- RAM: 16–32 GB
- GPU: optional; Apple Silicon unified memory works well; on discrete GPU, 8–12 GB VRAM is plenty
- Expected experience: responsive chat, solid tool use, reasonable coding assistance
Ollama usage:
ollama pull llama3.1:8b
ollama run llama3.1:8b
Example 2: Desktop workstation, “I want better reasoning without cloud”
Target: 13B/14B Q4/Q5, 8k context
- RAM: 32–64 GB
- GPU: 16–24 GB VRAM recommended
- Expected experience: strong local assistant, better long-form responses, better instruction reliability
Ollama usage:
ollama pull qwen2.5:14b
ollama run qwen2.5:14b
Example 3: Prosumer GPU box, “I want 34B-class locally”
Target: 34B Q4, 4k–8k context
- RAM: 64 GB
- GPU: 24 GB VRAM minimum; 48 GB improves headroom and context stability
- Expected experience: excellent answers, heavier prefill, but finally “big model” quality locally
Operational note: keep other GPU memory consumers low (browsers, electron apps, other GPU services), because fragmentation and background VRAM usage can cause sudden OOM failures.
Example 4: On-prem service, “70B for a small team”
Target: 70B Q4, moderate context, multi-user
- RAM: 128 GB
- GPU: 48 GB+ VRAM or multi-GPU
- Expected experience: high quality; engineering needed around concurrency, batching, and prompt discipline
Deployment note: local serving should include:
- a reverse proxy,
- request limits,
- logging with redaction,
- model warm-up procedures,
- and observability (latency/token metrics).
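As a sketch of the reverse-proxy and request-limit pieces, assuming nginx in front of a default Ollama daemon on port 11434 (zone name, port, and limits are illustrative, not a hardened production config):

```
# Rate-limit and proxy requests to a local Ollama daemon.
limit_req_zone $binary_remote_addr zone=ollama:10m rate=2r/s;

server {
    listen 8080;

    location /api/ {
        limit_req zone=ollama burst=5;
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;   # long generations stream slowly
    }
}
```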
Serving models with Ollama (API-first workflows)
Ollama is often used as a local service rather than an interactive CLI. A typical pattern is:
- start the Ollama daemon,
- call the local HTTP API from your app.
Example request (chat completion style varies by version/model, but the core idea is stable):
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"prompt": "Explain the tradeoffs between Q4 and Q5 quantization for local inference.",
"stream": false
}'
From an infrastructure perspective, API usage introduces two immediate considerations:
- Concurrency: two users chatting can double KV cache pressure and saturate compute.
- Latency variance: long prompts cause “prefill spikes,” so p95 latency can look worse than average.
If you’re building internal tools, shape prompts and limit context sizes explicitly to keep tail latencies under control.
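Concurrency can also be capped at the daemon itself. Ollama reads limits from environment variables (documented in the project's FAQ); the values below are illustrative:

```shell
# Cap per-model parallel requests and resident models; each parallel slot
# multiplies KV-cache memory, so align these with your RAM/VRAM budget.
export OLLAMA_NUM_PARALLEL=2        # concurrent requests per loaded model
export OLLAMA_MAX_LOADED_MODELS=1   # models kept in memory at once
echo "parallel=${OLLAMA_NUM_PARALLEL} max_loaded=${OLLAMA_MAX_LOADED_MODELS}"
```

Restart the daemon (ollama serve, or your service manager) so the new limits take effect.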
Storage and model management considerations
Model files are large, and you’ll likely pull multiple variants while testing.
Practical planning:
- 7B Q4: a few GB
- 13B Q4/Q5: ~10–15 GB class
- 34B Q4: ~18–22 GB class
- 70B Q4: ~35–45 GB class
Give yourself:
- at least 100–200 GB free if you intend to iterate,
- fast SSD (NVMe) to reduce load times and improve swap behavior (though swapping during inference is still a bad time).
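Housekeeping is worth doing deliberately, since test variants accumulate fast. The tag in the rm example is illustrative:

```shell
# Inspect and prune the local model store.
ollama list                           # installed models and their sizes
ollama rm llama3.1:8b-instruct-q8_0   # drop a variant you no longer use
du -sh ~/.ollama/models               # total footprint (default Linux/macOS path)
```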
Operating system and accelerator backend notes
Ollama’s experience differs by platform because GPU backends differ:
- macOS (Apple Silicon): unified memory means “VRAM vs RAM” is less rigid, but you still have a finite pool; large models compete with everything else in memory. Performance can be very good due to Metal acceleration paths.
- Linux (NVIDIA): CUDA ecosystem typically offers the most predictable performance for local inference on consumer GPUs.
- Linux (AMD): ROCm support varies by GPU and distro; feasibility depends on driver stack maturity.
- Windows: workable, but GPU acceleration and driver/runtime behavior can be more variable; many power users prefer WSL2 for consistency.
The key infrastructure point: don’t only spec “GPU TFLOPs.” For LLM inference, VRAM capacity and memory bandwidth are often more important than raw compute.
Troubleshooting “it’s slow” vs “it doesn’t fit”
Symptoms of not fitting (memory pressure)
- process killed / OOM errors,
- dramatic slowdowns when the OS starts swapping,
- GPU out-of-memory during prompt prefill (often at longer contexts),
- load succeeds but first long prompt fails.
Fixes:
- lower quantization (Q5 → Q4),
- reduce context length,
- reduce concurrency,
- ensure other apps aren’t consuming GPU memory,
- move up a hardware tier (more RAM/VRAM).
Symptoms of “it fits but it’s slow”
- low tokens/sec despite no OOM,
- high CPU usage with poor throughput,
- GPU underutilized because too little is offloaded.
Fixes:
- increase GPU offload (when possible),
- pick a smaller model class,
- prefer Q4 for speed if quality remains acceptable,
- reduce context length and prompt size,
- ensure correct backend (CUDA/Metal) is being used by your environment.
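A quick first diagnostic for the "fits but slow" case is checking how much of the model actually landed on the GPU:

```shell
# List loaded models; the PROCESSOR column reports the CPU/GPU split,
# e.g. "100% GPU" versus a mixed CPU/GPU percentage.
ollama ps
```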
External sources (references)
- Ollama documentation: https://ollama.com/docs
- Ollama GitHub repository (implementation details, issues, platform notes): https://github.com/ollama/ollama
- llama.cpp (GGUF format, quantization, CPU/GPU backends): https://github.com/ggerganov/llama.cpp
- GGUF specification and tooling context (via llama.cpp docs/issues): https://github.com/ggerganov/llama.cpp/tree/master/docs
- Quantization concepts and formats (community-maintained discussion in llama.cpp repo/issues): https://github.com/ggerganov/llama.cpp/issues
- NVIDIA CUDA overview (for understanding the GPU software stack): https://developer.nvidia.com/cuda-zone
- Apple Metal overview (macOS GPU acceleration background): https://developer.apple.com/metal/