Select your model size, budget, and priorities. Get a ranked list of hardware options in real time.
| Option | Class | Memory | 34B speed | Price |
|---|---|---|---|---|
| NVIDIA RTX 5090 | Enthusiast GPU | 32 GB | ~50 tok/s | ~$2,000 |
| AMD RX 7900 XTX | Consumer GPU | 24 GB | ~20 tok/s | ~$700–900 |
| Cloud inference (RunPod / Together.ai / Replicate) | Cloud API | Any | ~50 tok/s (API) | Pay per use |
| NVIDIA DGX Spark (GB10) | Personal supercomputer | 128 GB | ~30 tok/s | ~$3,000 |

**NVIDIA RTX 5090.** Fastest single-GPU option up to 34B. The CUDA ecosystem means every inference tool works out of the box.

**AMD RX 7900 XTX.** Best budget entry point. Great for 7B–13B models; limited to smaller quantized models beyond that.

**Cloud inference.** No upfront cost, any model size, scale to zero. The only real trade-off: data leaves your device.

**NVIDIA DGX Spark (GB10).** Best value for 70B+ models. The 128 GB of unified memory is the key differentiator: no model swapping, no compromises.
The most important metric for local LLM inference is memory bandwidth, not VRAM or clock speed. Bandwidth determines how many tokens per second you generate. The B200's 8,000 GB/s bandwidth delivers over 700 tok/s on a 7B model — while a 960 GB/s consumer GPU caps out around 80 tok/s on the same model.
VRAM is the ceiling, bandwidth is the speed. You need enough memory to load the model, then bandwidth determines how fast it runs. A 70B model at Q4 quantization requires roughly 40 GB — which immediately rules out any single GPU with less than 48 GB.
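These numbers follow from a back-of-the-envelope model: each generated token must stream every weight through the memory bus once, so peak decode speed is roughly bandwidth divided by model size. A minimal sketch, assuming FP16 weights (real throughput is lower once KV-cache reads and compute overhead are counted):

```python
def peak_tok_per_s(bandwidth_gb_s: float, params_b: float,
                   bytes_per_param: float = 2.0) -> float:
    """Rough upper bound on decode speed: every token streams all weights once."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# 7B model at FP16 = 14 GB of weights
print(f"B200 (8,000 GB/s):      ~{peak_tok_per_s(8000, 7):.0f} tok/s")
print(f"Consumer GPU (960 GB/s): ~{peak_tok_per_s(960, 7):.0f} tok/s")
```

The ratio between the two results tracks the real-world gap quoted above, even though absolute figures shift with quantization and batching.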
| Use case | Best pick | Why |
|---|---|---|
| Getting started | Cloud (RunPod) | $0 upfront, any model size, no setup |
| Budget local inference | RTX 5090 | Best speed/price on 7B–34B models |
| 70B models on a budget | DGX Spark GB10 | 128 GB at $3k — nothing else comes close |
| Fastest possible | 2× RTX 5090 | 3,584 GB/s aggregate, 64 GB split via tensor parallelism |
| Apple ecosystem / zero config | Mac Studio M2 Ultra | Silent, 192 GB, llama.cpp native |
| Production multi-user serving | NVIDIA B200 | 8,000 GB/s, purpose-built for inference |
**How much memory does a model need?** It depends on the model size and quantization. A 7B model at Q4 needs ~4 GB, 13B needs ~8 GB, 34B needs ~20 GB, 70B needs ~40 GB, and 200B+ needs ~120 GB. Add a buffer of 2–4 GB for the OS and other processes.
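Those figures come from bits-per-weight arithmetic. A rough estimator, assuming Q4_K_M's effective ~4.5 bits per weight (the exact figure varies by quant mix), lands within a couple of GB of the sizes above:

```python
def weights_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory for the weights alone (excludes KV cache and OS buffer)."""
    return params_b * bits_per_weight / 8

for size in (7, 13, 34, 70):
    print(f"{size}B @ Q4: ~{weights_gb(size):.0f} GB")
```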
**Does bandwidth matter more than memory size?** Yes. Once the model fits in memory, tokens per second is almost entirely determined by bandwidth. A GPU with 32 GB and 1,792 GB/s will generate tokens much faster than one with 48 GB and 600 GB/s.
**Can I run local LLMs on a Mac?** Yes, and it works exceptionally well. Apple Silicon uses unified memory shared between CPU and GPU, so the full 192 GB on a Mac Studio M2 Ultra is available for model weights. llama.cpp has excellent Metal (Apple GPU) support with near-native performance.
**What does quantization do?** Quantization reduces model weights from 16-bit to 4-bit or 8-bit numbers. This shrinks memory requirements by 2–4× with minimal quality loss. Q4_K_M is the most common balance point — it cuts a 70B model from ~140 GB to ~40 GB.
**Which inference software should I use?** llama.cpp is the most widely used inference engine and runs on all platforms. Ollama wraps it with a simple CLI and API. LM Studio provides a desktop GUI. vLLM is the standard for production server deployments.
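As a concrete starting point, the Ollama path looks like this (the model tag is an example; substitute any tag from the Ollama library):

```shell
# Pull and chat with a quantized model. Ollama bundles llama.cpp,
# so CUDA/Metal offload is handled automatically where available.
ollama run llama3.1:8b

# The same model is also served over a local HTTP API on port 11434:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello"}'
```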
Private AI deployments — on-prem or cloud — are one of our core practices. We'll match the right hardware to your models, compliance requirements, and budget.
Get a free infrastructure review