LLM Inference Sizing

Right-size your GPU fleet.

Focused tools for engineers and financial analysts evaluating LLM inference infrastructure.

Quick

Quick sizing

3 questions. Get GPU count, cost comparison, and a shareable summary.

Start →
Advanced

Full calculator

All inputs, HF model lookup, optimisation tiers (FP8, prefix cache, llm-d), cost model, formulas.

Open →
Compare
📊

GPU Explorer

Compare NVIDIA and AMD GPUs by VRAM, throughput, and price. Interactive bubble chart.

Explore →
Economics
📈

Hybrid savings

Cloud vs on-prem vs dedicated node costs. Drag the split slider to find your crossover point.

Calculate →
Routing

Fleet routing

Model the economics of semantic routing. Compare API spend vs self-hosted. Calculate payback period.

Analyse →
Compare
📌

Saved results

Compare workloads side by side. Export to PDF, Excel, or Google Sheets.

View →
Why this exists

Every time someone asked me how many GPUs they needed to serve an LLM, I opened the same spreadsheet and ran the same math. KV cache per user, available pool after weights, replica count, cost per hour — it took 20 minutes to explain.

I built gpu.calc so that same conversation takes 3 questions instead of 20 minutes. The hard math runs automatically. The assumptions are visible and editable. The result is shareable.
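
For readers who want that spreadsheet math spelled out, here is a minimal sketch of it in TypeScript. The constants mirror the Llama 3.1 8B on H100 80GB example used throughout this page; the usable-VRAM fraction and overhead figure are assumptions, and the calculator's own reservation model differs slightly, so treat this as illustrative rather than gpu.calc's implementation.

    // Back-of-envelope sizing sketch: the same spreadsheet math, not gpu.calc's exact model.
    // Constants mirror the Llama 3.1 8B on H100 80GB example shown in Quick sizing.
    const GPU_VRAM_GB = 80;         // H100 80GB
    const USABLE_FRACTION = 0.95;   // fraction of VRAM the runtime can allocate (assumed)
    const WEIGHTS_GB = 16;          // 8B parameters × 2 bytes (FP16)
    const OVERHEAD_GB = 2;          // runtime overhead (assumed)
    const KV_MB_PER_TOKEN = 0.32;   // KV cache growth per token (GQA, FP16 KV)

    function sizeFleet(users: number, contextTokens: number, dollarsPerGpuHour: number) {
      const kvPerUserGB = (KV_MB_PER_TOKEN * contextTokens) / 1024;
      const kvPoolGB = GPU_VRAM_GB * USABLE_FRACTION - WEIGHTS_GB - OVERHEAD_GB;
      const usersPerReplica = Math.floor(kvPoolGB / kvPerUserGB);
      const gpusNeeded = Math.ceil(users / usersPerReplica);  // TP=1: one GPU per replica
      const monthlyCloudCost = gpusNeeded * dollarsPerGpuHour * 730;
      return { kvPerUserGB, kvPoolGB, usersPerReplica, gpusNeeded, monthlyCloudCost };
    }

    // 30 concurrent users, 8K context, $3.00/GPU-hr on-demand:
    console.log(sizeFleet(30, 8192, 3.0));
    // → ≈2.56 GB KV per user and 2 GPUs for the example shown on this page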

Vikas Grover, AI Infrastructure · LLM Inference

Quick sizing
What model are you serving?
Pick a preset or enter a HuggingFace model ID
🔑 Have a gated model? Paste your HF token
How many people at the same time?
Peak concurrent users — not total users
Or type exact:
How long are your conversations?
This determines how much GPU memory each user needs
Or enter exact tokens (K) — supports up to 1M+
Where do you want to run this?
Sets your cost model default
Does this look right?
Review your answers before we calculate
🤖
Model
Llama 3.1 8B · GQA · FP16
HuggingFace ID or search
🔑 Have a gated model? Paste your HF token
Manual override
8B
Attention type
Precision
👥
Users & context
30 users · 8K ctx · 10% prefix
Peak concurrent users
30
1 · 10 · 100 · 1K · 10K · 100K
Context window
8K
1K · 8K · 32K · 128K · 512K · 1M+
Prefix cache hit rate
10%
Inference mode
vLLM · standard · TP=1
Framework
Mode
Tensor parallel
KV cache precision
💰
Cost inputs
Cloud · GCP $3.00/hr · no discount
Deployment
Cloud provider
Reserved discount
Guidance estimates only — validate with benchmarks before procurement. Actual requirements vary with framework overhead, batch patterns, and hardware.
REAL-TIME INFERENCE ONLY
Llama 3.1 8B · 30 users · 8K ctx · TP=1
2
H100 GPUs to get started
8B · 30 users · H100 80GB · 8K ctx · TP=1
ONE GPU — what's inside · See why →
Weights 16 GB
KV cache 34 GB
Overhead 2 GB
Free 28 GB
Weight memory
16 GB
+2.0 GB overhead
Avail for KV
54.7 GB
per replica (95% PA)
KV per user
2.56 GB
0.320 MB/tok · 8K ctx
Users / replica
21
2 replicas × 1 GPU
GPUs needed
2
H100 80GB · 2 replicas
fits ✓
Monthly cost
☁ Cloud
$163K
GCP · on-demand
⬛ On-prem
hardware · amortised
VRAM utilisation
67%
33% headroom remaining
H100 fits ✓
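
Reading the panel as a single calculation (all figures are from the card above): 0.320 MB/token × 8,192 tokens ≈ 2.56 GB of KV cache per user; 54.7 GB available for KV ÷ 2.56 GB ≈ 21 users per replica; and ⌈30 ÷ 21⌉ = 2 replicas, each on one GPU at TP=1, so 2× H100 80GB.
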
KV HEADROOM — users per replica at different context lengths
NODE TOPOLOGY
OPTIMISE
1
GPU choice
H100 80GB · default
GPU tier
NVIDIA
AMD
2
Reduce memory (quantization)
FP16 weights · FP16 KV
3
KV cache efficiency
vLLM paging ON · 10% prefix · GQA
10%
4
Scale (users & context)
30 users · 8K ctx
30
1 · 100 · 10K · 100K
8K
1K · 32K · 128K · 1M+
5
Cost model
Cloud · GCP $3.00/hr · no discount

Edit cost inputs in the left panel → Cost inputs section
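
To illustrate how the memory levers in the panel above move the per-user KV footprint, here is a rough sketch. The halving for FP8 KV follows from FP8 storing one byte per value instead of two; applying the prefix cache hit rate as a straight per-user discount is a simplifying assumption (the calculator may account for shared prefixes differently), so validate against its output.

    // Effect of the quantisation and KV-efficiency levers on per-user KV cache.
    // FP8 KV stores one byte per value instead of two, halving bytes per token.
    // Treating the prefix cache hit rate as a per-user discount is an assumption.
    function kvPerUserGB(
      contextTokens: number,
      kvMbPerTokenFp16: number,     // 0.320 for the Llama 3.1 8B (GQA) example
      kvPrecision: "fp16" | "fp8",
      prefixHitRate: number         // 0.10 = 10% of tokens served from a shared prefix
    ): number {
      const byteScale = kvPrecision === "fp8" ? 0.5 : 1.0;
      const uniqueTokens = contextTokens * (1 - prefixHitRate);
      return (kvMbPerTokenFp16 * byteScale * uniqueTokens) / 1024;
    }

    console.log(kvPerUserGB(8192, 0.32, "fp16", 0.10).toFixed(2)); // ≈ 2.30
    console.log(kvPerUserGB(8192, 0.32, "fp8", 0.10).toFixed(2));  // ≈ 1.15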

📊 Save this sizing to Google Sheets. Export your full config, GPU count, KV cache breakdown, and 5-year cost as a spreadsheet row.

Saved Results

Compare workloads side by side. Save from the Advanced calculator.

Hybrid savings

All numbers editable · click any blue value to change it

YOUR WORKLOAD: 2 · 16 · $25K · 30% · 5 yrs
CLOUD PROVIDER (on-demand)
DEDICATED NODE (flat rate)
Reserved node, always billed 24/7. Cheaper than on-demand at >25% util.
$/mo = per node. GPUs/node is configurable. CoreWeave 8×H100 ~$3.09/GPU/hr flat.
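
The ">25% util" rule of thumb above falls out of comparing a flat node rate with usage-billed on-demand hours. A quick sketch, using the CoreWeave flat rate quoted above and an assumed on-demand rate (substitute your provider's figure):

    // Break-even utilisation between a flat-rate dedicated node and on-demand GPUs.
    // A dedicated node is billed 24/7; on-demand is billed only for busy hours,
    // so dedicated wins once utilisation exceeds (flat rate / on-demand rate).
    const FLAT_PER_GPU_HR = 3.09;       // CoreWeave 8×H100 flat rate quoted above
    const ON_DEMAND_PER_GPU_HR = 12.0;  // assumed on-demand H100 rate (replace with your provider's)

    const breakEvenUtil = FLAT_PER_GPU_HR / ON_DEMAND_PER_GPU_HR;
    console.log(`Dedicated node is cheaper above ${(breakEvenUtil * 100).toFixed(0)}% utilisation`);
    // → ~26% with these example rates
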
⚡ Calibrated estimate
Modeled using benchmark-calibrated inference profiles
MODEL
PRESET MODELS
OR LOOKUP BY HF ID
Llama 3.1 8B
INFERENCE INPUTS
HARDWARE & BACKEND
Tensor parallel (?)
Pipeline parallel (?)
📄 Estimator request payload
Sent to estimator API. Edit before running.
database_mode defaults to HYBRID. Edits here override the form.
🔍 Estimator response & debug
RAW JSON RESPONSE
— not run yet —
PARSED FIELDS
— not run yet —
Calibrated estimates are modeled using benchmark-calibrated inference profiles and should be validated with workload-specific benchmarking before procurement.

Calibrated estimate

All numbers editable · results update on re-run

No estimate yet
Configure model and hardware on the left, then click Run estimate

GPU Explorer

LLM inference planning — compare GPU generations by memory, bandwidth, and cost efficiency

NVIDIA · AMD
Preset:
X axis: VRAM (GB) · Y axis: Throughput Index · Bubble size = cost efficiency (smaller = better)
💡 Top-right = high throughput. Smaller bubble = better cost efficiency. Top-left = cheaper but lower performance.
Throughput Index is a planning metric derived from memory bandwidth, VRAM, and architecture generation. It enables relative GPU comparison — not exact model throughput.
Inference performance depends on model architecture (GQA vs MHA), sequence length, batching, and inference backend (vLLM, TensorRT-LLM, etc.).
Cluster scaling — GPU-level performance does not directly translate to cluster-level throughput. Interconnect (NVLink, InfiniBand) and workload distribution impact scaling.
Pricing reflects indicative on-prem GPU estimates and may vary significantly by provider, region, and purchasing model. Validate all figures with vendors before procurement.
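
The exact weighting behind the Throughput Index is not published here. Purely to illustrate how a composite planning metric can be assembled from memory bandwidth, VRAM, and architecture generation, the following hypothetical sketch normalises each factor to an H100 reference; every constant in it is an assumption, not gpu.calc's formula.

    // Hypothetical composite planning index (not the formula gpu.calc uses).
    // Normalises memory bandwidth and VRAM to an H100 80GB reference and applies
    // an architecture-generation multiplier; useful only for relative comparison.
    function throughputIndex(bandwidthGBs: number, vramGB: number, genMultiplier: number): number {
      const REF_BANDWIDTH_GBS = 3350;  // H100 SXM HBM3, ~3.35 TB/s
      const REF_VRAM_GB = 80;
      return 100 * (bandwidthGBs / REF_BANDWIDTH_GBS) * Math.sqrt(vramGB / REF_VRAM_GB) * genMultiplier;
    }

    console.log(throughputIndex(3350, 80, 1.0).toFixed(0)); // H100 reference → 100
    console.log(throughputIndex(2039, 80, 0.8).toFixed(0)); // A100 80GB, assumed 0.8 gen factor → ~49
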
Sources: NVIDIA product pages, AMD product pages, published benchmarks, market reports. © gpu.calc
🔒

Admin only

Enter your admin token to view analytics

Set token once: localStorage.setItem('admin_token','yourkey')

VG
Vikas Grover
AI Infrastructure · LLM Inference
Building open-source AI infra tooling. Follow for LLM inference, GPU sizing, and platform engineering content.
Quick question
What are you using this for? (1 tap, no email needed)
Great — want Vikas to review your specific situation?
Just exploring — skip
💲
API pricing
edit prices below · select model per tier above
Updated March 2026 — edit when prices change
FLEET ROUTING

Semantic routing economics

HOW YOUR TRAFFIC ROUTES
ANNUAL COST IF ALL TRAFFIC WENT TO ONE MODEL:
YOUR CURRENT API SPEND
API cost / year
On-prem cost / year (amortised 5yr)
GPUs needed (at 10% concurrency + 20% buffer)
Annual savings
Should you self-host?
Engineering cost: $ one-time
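
A minimal sketch of the payback calculation behind these fields. GPU sizing follows the "10% concurrency + 20% buffer" rule above; per-GPU capacity, GPU cost per hour, and the example figures are assumptions to replace with your own numbers.

    // Payback period for self-hosting a tier of routed traffic (illustrative only).
    function paybackMonths(
      apiCostPerYear: number,      // current API spend for this tier
      totalUsers: number,
      engineeringCost: number,     // one-time engineering cost
      concurrency = 0.10,          // 10% of users active simultaneously
      buffer = 1.20,               // 20% capacity buffer
      usersPerGpu = 20,            // assumed concurrent users a single GPU can serve
      gpuCostPerHour = 3.0         // assumed amortised on-prem or cloud $/GPU-hr
    ): number {
      const gpusNeeded = Math.ceil((totalUsers * concurrency * buffer) / usersPerGpu);
      const selfHostPerYear = gpusNeeded * gpuCostPerHour * 24 * 365;
      const annualSavings = apiCostPerYear - selfHostPerYear;
      return annualSavings > 0 ? (engineeringCost / annualSavings) * 12 : Infinity;
    }

    // $400K/yr API spend, 1,000 users, $50K engineering effort:
    console.log(paybackMonths(400_000, 1000, 50_000).toFixed(1)); // ≈ 2.5 months
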
PER-TIER BREAKDOWN
Concurrency: 10% · 100 simultaneous users
Tier · API model · Traffic · Queries/day · API cost/yr · Self-host GPUs · Tokens (in/out) · vs all-Sonnet
5-YEAR COST COMPARISON