LLM Inference Sizing

Right-size your GPU fleet.

Focused tools for engineers and financial analysts evaluating LLM inference infrastructure.

Quick

Quick sizing

3 questions. Get GPU count, cost comparison, and a shareable summary.

Start →
Advanced

Full calculator

All inputs, HF model lookup, optimisation tiers (FP8, prefix cache, llm-d), cost model, formulas.

Open →
Compare
📊

GPU Explorer

Compare NVIDIA and AMD GPUs by VRAM, throughput, and price. Interactive bubble chart.

Explore →
Economics
📈

Hybrid savings

Cloud vs on-prem vs dedicated node costs. Drag the split slider to find your crossover point.

Calculate →
Routing

Fleet routing

Model semantic routing economics. Compare API spend vs self-hosted. Calculate payback period.

Analyse →
Compare
📌

Saved results

Compare workloads side by side. Export to PDF, Excel, or Google Sheets.

View →
Why this exists

Every time someone asked me how many GPUs they needed to serve an LLM, I opened the same spreadsheet and ran the same math. KV cache per user, available pool after weights, replica count, cost per hour — it took 20 minutes to explain.

I built gpu.calc so that conversation takes 3 questions instead of 20 minutes. The hard math runs automatically. The assumptions are visible and editable. The result is shareable.

Vikas Grover, AI Infrastructure · LLM Inference

Quick sizing
What model are you serving?
Pick a preset or enter a HuggingFace model ID
🔑 Have a gated model? Paste your HF token
How many people at the same time?
Peak concurrent users — not total users
Or type exact:
How long are your conversations?
This determines how much GPU memory each user needs
Or enter tokens (K): K tokens — supports up to 1M+
Where do you want to run this?
Sets your cost model default
Does this look right?
Review your answers before we calculate
🤖
Model
Llama 3.1 8B · GQA · FP16
HuggingFace ID or search
🔑 Have a gated model? Paste your HF token
Manual override — Parameters (B)
8B
Attention type
Weight precision
KV cache precision (independent from weights)
⚙️
GPU
H200 · 141 GB
NVIDIA
AMD
Tensor parallel (powers of 2 only)
👥
Users & context
1000 users · 4K ctx · 20% active
Peak users
1K
Active concurrency (% generating at any instant)
20%
Chat/API: 10-20% · Batch: 100% · Interactive: 30-50%
Avg input tokens (ISL) — avg prompt length
Avg output tokens (OSL) — avg response length
avg_ctx = ISL + OSL/2 = 2,100 tokens per slot (≈ 1,575 words)
⚙ Max context window (model limit)
Model's architectural max — NOT used for KV sizing.
Using max ctx (e.g. 128K) instead of avg ISL inflates GPU count 10-100×.
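
As a rough illustration of the note above, the sketch below reproduces the 2,100-token figure and shows how far sizing on the maximum window would inflate the per-slot context. The ISL and OSL values (2,000 and 200) are assumptions chosen to match the displayed total, and the 0.75 words-per-token ratio is inferred from the 1,575-word figure; the page confirms neither.

```python
def avg_ctx(isl: int, osl: int) -> float:
    """Average tokens held in the KV cache per slot: the full prompt plus,
    on average, half of the response generated so far."""
    return isl + osl / 2


ISL, OSL = 2000, 200        # assumed values; they reproduce the 2,100 shown
MAX_CTX = 128 * 1024        # a 128K architectural limit, as in the example
WORDS_PER_TOKEN = 0.75      # inferred from 1,575 words / 2,100 tokens

ctx = avg_ctx(ISL, OSL)               # 2100.0 tokens per slot
words = ctx * WORDS_PER_TOKEN         # 1575.0 words
inflation = MAX_CTX / ctx             # how much max-ctx sizing overstates KV

print(f"avg_ctx={ctx:.0f} tokens, ~{words:.0f} words, "
      f"max-ctx sizing inflates KV by ~{inflation:.0f}x")
```

With these assumed inputs the inflation factor lands at roughly 62×, inside the 10-100× range quoted above.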
Prefix cache hit rate
0%
Cache type
Persistent: prefix hit reduces compute (TTFT), NOT VRAM
Serving
vLLM · standard · TP=auto
Framework
vLLM Memory Budget
Version
gpu_mem_util default 0.90
act_peak GB
estimated
From vLLM log: PyTorch activation peak memory takes X.XX GiB
Mode
💰
Cost inputs
Cloud · GCP $3.00/hr
Deployment
Cloud provider
Reserved discount
Guidance estimates only — validate with benchmarks before procurement.
REAL-TIME INFERENCE ONLY
Qwen3-8B · 1000 users · 4K ctx · TP=1
GPUs needed (memory floor)
fits ✓
Replicas
copies of model
VRAM used
per replica
Monthly cost
Memory floor:
⚠ The + means throughput GPUs must be added on top of the memory floor. See the Throughput tab for the full answer.
ONE REPLICA — what's in VRAM (TP=1)
Weights
KV cache
Overhead
Free
How we got there — click any layer to see the math
1. Model weights
2. KV cache
3. GPU capability
4. Memory fit & replicas
5. Cost
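
The five layers above can be sketched as a memory-floor calculation. This is a minimal illustration, not the calculator's actual formulas: it uses the inputs shown in the panels (Llama 3.1 8B, FP16, H200 141 GB, 1,000 users at 20% active, 2,100 tokens per slot, gpu_mem_util 0.90, $3.00/hr), the model's published architecture values, and two labeled assumptions (a 2 GB activation peak and 730 hours per month).

```python
import math

# Layer 1 inputs: Llama 3.1 8B, FP16 weights and FP16 KV cache.
PARAMS = 8e9                            # "8B" as shown on the page
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128 # from the public model config (GQA)
WEIGHT_BYTES = 2
KV_BYTES = 2

# Inputs shown in the panels.
GPU_VRAM_GB = 141                       # H200
GPU_MEM_UTIL = 0.90
PEAK_USERS, ACTIVE_PCT = 1000, 20
AVG_CTX = 2100                          # tokens per slot
PRICE_PER_GPU_HR = 3.00
TP = 1

# Assumptions not taken from the page.
ACT_PEAK_GB = 2.0                       # hypothetical activation peak
HOURS_PER_MONTH = 730

# 1. Model weights
weights_gb = PARAMS * WEIGHT_BYTES / 1e9
# 2. KV cache: K and V tensors for every layer and KV head
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES
kv_gb_per_user = kv_bytes_per_token * AVG_CTX / 1e9
# 3. GPU capability: usable pool after weights and activation overhead
pool_gb = GPU_VRAM_GB * GPU_MEM_UTIL - weights_gb - ACT_PEAK_GB
# 4. Memory fit & replicas
users_per_replica = math.floor(pool_gb / kv_gb_per_user)
active_users = math.ceil(PEAK_USERS * ACTIVE_PCT / 100)
replicas = math.ceil(active_users / users_per_replica)
gpus = replicas * TP
# 5. Cost
monthly_cost = gpus * PRICE_PER_GPU_HR * HOURS_PER_MONTH

print(f"weights={weights_gb:.1f} GB, KV/user={kv_gb_per_user:.3f} GB, "
      f"users/replica={users_per_replica}, GPUs={gpus}, ${monthly_cost:,.0f}/mo")
```

Under these assumptions a single H200 holds the 200 active users with room to spare. That is a memory floor only; it says nothing about whether one GPU meets latency or throughput targets.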
What changes things
Sensitivity to each input
KV HEADROOM — users per replica at different context lengths
0%
📊 Save this sizing to Google Sheets
Export config, GPU count, KV breakdown, and 5-year cost.

Saved Results

Compare workloads side by side. Save from the Advanced calculator.

Hybrid savings

All numbers editable · click any blue value to change it

YOUR WORKLOAD
2
16
$25K
30%
5 yrs
CLOUD PROVIDER (on-demand)
DEDICATED NODE (flat rate)
Reserved node, always billed 24/7. Cheaper than on-demand at >25% util.
$/mo is priced per node; GPUs per node is configurable. Example: CoreWeave 8×H100 at ~$3.09/GPU/hr flat.
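
To make the flat-rate arithmetic concrete, here is a small sketch that converts a per-GPU hourly rate into a per-node monthly figure and estimates the utilisation at which a flat-rate node beats on-demand. The 730 hours-per-month convention and both function names are assumptions for illustration. The break-even threshold depends on the rates you enter, so it will not necessarily equal the 25% quoted above.

```python
HOURS_PER_MONTH = 730  # assumed billing convention (8,760 h / 12)


def node_monthly_cost(rate_per_gpu_hr: float, gpus_per_node: int,
                      hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of one dedicated node billed 24/7."""
    return rate_per_gpu_hr * gpus_per_node * hours


def breakeven_utilisation(flat_rate_per_gpu_hr: float,
                          on_demand_rate_per_gpu_hr: float) -> float:
    """Fraction of hours a GPU must be busy before the always-billed flat
    rate is cheaper than paying on-demand only for the hours used."""
    return flat_rate_per_gpu_hr / on_demand_rate_per_gpu_hr


# The CoreWeave example from the text: 8 GPUs at ~$3.09/GPU/hr.
coreweave_node = node_monthly_cost(3.09, 8)
print(f"8xH100 node: ${coreweave_node:,.0f}/mo")
```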
📈 Throughput Sizing
TTFT · TPOT · GPU count · replicas · TP selection
🤖
Model
gpt-oss-20b
Preset
HF model ID
Precision
⚙ Model params (editable)
Total params (B)
Active params (B)
Dense params (B)
Expert params (B)
Num experts
Active experts
Layers
KV heads
Head dim
Hidden dim
Is MoE?
⚙️
Hardware
H200 · TP auto
GPU
Tensor Parallel
⚙ GPU params (editable)
BF16 TFLOPS
Mem BW (TB/s)
VRAM (GB)
MFU prefill
BW base (η)
NVLink (GB/s)
PCIe (GB/s)
📊
Workload
ISL=9000 · OSL=50 · 80%
Input tokens (ISL)
Output tokens (OSL)
Concurrency
Batch size
Prefix cache hit %
↳ Eff. ISL: 1,800 tokens (prefix skips compute, not storage)
🎯
Target
100M req/day · 30% buf
Requests / day
= 1,157.4 req/sec
Infrastructure buffer %
Serving mode
🔬
Advanced
MFU · KV efficiency · calibration
Cal. multiplier (?)
Max batched tokens
Weight BW util
IB penalty (TP>8)
Disagg queue factor
Routing eff. (?)
Throughput eff.

📈 Throughput Sizing

TTFT · TPOT · GPU count · TP selection · vLLM vs llm-d

Directional estimate only. TPOT calibrated from AIC silicon data (gpt-oss-20b H100/H200 TP=8, ±5%). GPU count uses latency-aware sizing — differs from max-throughput benchmarks. TTFT and GPU count are typically accurate to ±30–50% vs real deployments. Validate with your own benchmarks before procurement. All fields are editable — override with your measured values.
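
The two derived workload figures shown in the panel (an effective ISL of 1,800 tokens and 1,157.4 req/sec) can be reproduced with the sketch below. The function names are illustrative. How the infrastructure buffer is applied on top of the request rate is not stated on the page, so it is left out.

```python
SECONDS_PER_DAY = 86_400


def effective_isl(isl: int, prefix_hit_pct: float) -> float:
    """Prompt tokens that still need prefill compute after prefix-cache hits.
    Cached tokens skip compute but still occupy KV storage."""
    return isl * (100 - prefix_hit_pct) / 100


def requests_per_second(requests_per_day: float) -> float:
    """Average request rate, assuming traffic is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


eff = effective_isl(9000, 80)            # ISL=9000 with an 80% prefix hit rate
rps = requests_per_second(100_000_000)   # 100M req/day

print(f"effective ISL = {eff:,.0f} tokens, {rps:,.1f} req/sec")
```

Real traffic is rarely flat across the day, so the average rate is a lower bound on what the fleet must absorb at peak.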
📈
Configure and calculate
Set your model, GPU, and workload on the left, then click Calculate.
⚡ Performance
AIC-powered · GPU sizing · real silicon data
🤖
Model
gpt-oss-20b
Preset models
Or lookup HF model ID
gpt-oss-20b
⚙️
Hardware & backend
H100 SXM · vLLM · TP=8
GPU
Backend
Tensor parallel
DB mode
⚙ Quant override (advanced)
MoE quant
Backend ver.
📊
Workload
ISL=9000 · OSL=50 · 80% prefix
↳ Effective ISL: 1,800 tokens
🎯
Target & SLA
100M req/day · TTFT ≤500ms
= 1,157.4 req/sec
SLA constraint
💰
Cost inputs
Cloud · GCP $3.20/hr · no discount
Cloud provider
Reserved discount
On-prem GPU price ($K)
$30K
Amortisation
5 yrs
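
As a rough guide to how the on-prem inputs compare with the cloud rate, the sketch below spreads the GPU purchase price over the amortisation period. It is a simplification under stated assumptions: straight-line amortisation, 24/7 use, and no power, cooling, hosting, or staffing costs. The calculator's own cost model may include terms not shown here.

```python
HOURS_PER_YEAR = 8_760


def amortised_hourly(price_usd: float, years: float) -> float:
    """Straight-line hardware cost per GPU-hour, assuming 24/7 use."""
    return price_usd / (years * HOURS_PER_YEAR)


on_prem_hr = amortised_hourly(30_000, 5)   # $30K GPU over 5 years
cloud_hr = 3.20                            # GCP rate shown in the panel

print(f"on-prem hardware: ${on_prem_hr:.2f}/GPU-hr vs cloud ${cloud_hr:.2f}/GPU-hr")
```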
📄
Request & response
View raw API calls
Powered by NVIDIA AIConfigurator. Estimates — validate with benchmarking before procurement.

⚡ Performance

GPU sizing · AIC silicon data · agg vs disagg

Configure & run estimate
Set your workload and target on the left. Click Run estimate to query AIC for agg and disagg configs — GPU sizing auto-derived.

GPU Explorer

LLM inference planning — compare GPU generations by memory, bandwidth, and cost efficiency

[Interactive bubble chart. Vendor filter: NVIDIA, AMD. X axis: VRAM (GB). Y axis: Throughput Index. Bubble size: cost efficiency (smaller = better).]
💡 Top-right = high throughput. Smaller bubble = better cost efficiency. Top-left = cheaper but lower performance.
Throughput Index is a planning metric derived from memory bandwidth, VRAM, and architecture generation. It enables relative GPU comparison — not exact model throughput.
Inference performance depends on model architecture (GQA vs MHA), sequence length, batching, and inference backend (vLLM, TensorRT-LLM, etc.).
Cluster scaling — GPU-level performance does not directly translate to cluster-level throughput. Interconnect (NVLink, InfiniBand) and workload distribution impact scaling.
Pricing reflects indicative on-prem GPU estimates and may vary significantly by provider, region, and purchasing model. Validate all figures with vendors before procurement.
Sources: NVIDIA product pages, AMD product pages, published benchmarks, market reports. © gpu.calc
Vikas Grover
AI Infrastructure · LLM Inference
Building open-source AI infra tooling. Follow for LLM inference, GPU sizing, and platform engineering content.
💲
API pricing
edit prices below · select model per tier above
Updated March 2026 — edit when prices change
FLEET ROUTING

Semantic routing economics

HOW YOUR TRAFFIC ROUTES
ANNUAL COST IF ALL TRAFFIC WENT TO ONE MODEL:
YOUR CURRENT API SPEND
API cost / year
On-prem cost / year (amortised 5yr)
GPUs needed (at 10% concurrency + 20% buffer)
Annual savings
Should you self-host?
Engineering cost: $ one-time
PER-TIER BREAKDOWN
Concurrency: 10% · 100 simultaneous users
Tier | API model | Traffic | Queries/day | API cost/yr | Self-host GPUs | Tokens (in/out) | vs all-Sonnet
5-YEAR COST COMPARISON