LLM Inference Sizing

Right-size your GPU fleet.

Focused tools for engineers and financial analysts evaluating LLM inference infrastructure.

Quick

▶

Quick sizing

3 questions. Get GPU count, cost comparison, and a shareable summary.

Start →

Advanced

⚙

Full calculator

All inputs, HF model lookup, optimisation tiers (FP8, prefix cache, llm-d), cost model, formulas.

Open →

Compare

📊

GPU Explorer

Compare NVIDIA and AMD GPUs by VRAM, throughput, and price. Interactive bubble chart.

Explore →

Economics

📈

Hybrid savings

Cloud vs on-prem vs dedicated node costs. Drag the split slider to find your crossover point.

Calculate →

Routing

☰

Fleet routing

Model semantic routing economics. Compare API spend vs self-hosted. Calculate payback period.

Analyse →

Compare

📌

Saved results

Compare workloads side by side. Export to PDF, Excel, or Google Sheets.

View →

Why this exists

Every time someone asked me how many GPUs they needed to serve an LLM, I opened the same spreadsheet and ran the same math. KV cache per user, available pool after weights, replica count, cost per hour — it took 20 minutes to explain.

I built gpu.calc so that conversation takes 3 questions instead of 20 minutes. The hard math runs automatically. The assumptions are visible and editable. The result is shareable.

— Vikas Grover, AI Infrastructure · LLM Inference

🤖

Model

Llama 3.1 8B · GQA · FP16

►

HuggingFace ID or search

🔑 Have a gated model? Paste your HF token

Manual override — Parameters (B)

Params 8B

Attention type

Weight precision

KV cache precision (independent from weights)

⚙️

GPU

H200 · 141 GB

►

NVIDIA

AMD

Tensor parallel (powers of 2 only)

👥

Users & context

1000 users · 4K ctx · 20% active

►

Peak users

Users 1K

Active concurrency (% generating at any instant)

Active % 20%

Chat/API: 10-20% · Batch: 100% · Interactive: 30-50%

Avg input tokens (ISL) — avg prompt length

Avg output tokens (OSL) — avg response length

avg_ctx = ISL + OSL/2 = 2,100 tokens per slot ≈ 1,575 words
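A minimal sketch of the avg_ctx line above. The ISL and OSL values are hypothetical, chosen only because they reproduce the 2,100-token figure; the 0.75 words-per-token ratio is inferred from 1,575 / 2,100. The OSL/2 term reflects that a response grows from zero to OSL tokens while it is generated, so on average about half of it is resident in the KV cache.

```python
def avg_ctx(isl: int, osl: int) -> float:
    """Average tokens resident in the KV cache for one active slot."""
    return isl + osl / 2


def approx_words(tokens: float, words_per_token: float = 0.75) -> float:
    """Rough token-to-word conversion (ratio inferred from the figures above)."""
    return tokens * words_per_token


# Hypothetical inputs: any ISL/OSL pair with ISL + OSL/2 = 2,100 behaves the same.
tokens = avg_ctx(isl=2000, osl=200)
words = approx_words(tokens)
print(f"{tokens:,.0f} tokens per slot ≈ {words:,.0f} words")
```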

⚙ Max context window (model limit)

Model's architectural max: NOT used for KV sizing.
Sizing KV cache on max ctx (e.g. 128K) instead of avg_ctx (ISL + OSL/2) inflates GPU count 10-100×.
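To see where that inflation comes from, here is a sketch of the usual per-token KV cache formula (keys plus values, for every layer and KV head). The layer, head, and head-dim values are the published Llama 3.1 8B configuration; the 2,100-token average is the avg_ctx figure shown above. This is an illustration of the arithmetic, not necessarily the calculator's exact code path.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV cache bytes for one token: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Llama 3.1 8B with GQA and an FP16 KV cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128,
                               bytes_per_elem=2)

AVG_CTX = 2_100        # tokens per slot, from avg_ctx = ISL + OSL/2
MAX_CTX = 128 * 1024   # 128K architectural limit

avg_slot_gb = per_token * AVG_CTX / 1e9
max_slot_gb = per_token * MAX_CTX / 1e9
inflation = max_slot_gb / avg_slot_gb

print(f"{per_token / 1024:.0f} KiB per token")
print(f"avg slot: {avg_slot_gb:.3f} GB, max-ctx slot: {max_slot_gb:.2f} GB")
print(f"sizing on max ctx inflates KV per user by about {inflation:.0f}x")
```

With these inputs the ratio lands inside the 10-100× range quoted above; shorter average contexts push it higher.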

Prefix cache hit rate

Hit rate 0%

Cache type

Persistent: prefix hit reduces compute (TTFT), NOT VRAM

⚡

Serving

vLLM · standard · TP=auto

►

Framework

vLLM Memory Budget

Version

gpu_mem_util (vLLM gpu_memory_utilization) default 0.90

act_peak GB

estimated

From vLLM log: PyTorch activation peak memory takes X.XX GiB
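A simplified sketch of how the memory budget turns into a replica count: take the share of VRAM that vLLM is allowed to use, subtract weights and the activation peak, and divide what is left by the KV cache per slot. The function name and the 2.0 GB activation peak are hypothetical; the other inputs follow the defaults shown on this page (H200 141 GB, 8B params in FP16, 1000 users at 20% active, 2,100 tokens per slot at 131,072 bytes per token). Real deployments also lose memory to CUDA graphs, fragmentation, and non-PyTorch allocations, which this ignores.

```python
import math


def memory_floor(vram_gb: float, gpu_mem_util: float, weights_gb: float,
                 act_peak_gb: float, kv_gb_per_slot: float,
                 active_users: int) -> tuple[float, int, int]:
    """Return (KV pool in GB, slots per replica, replicas needed)."""
    kv_pool_gb = vram_gb * gpu_mem_util - weights_gb - act_peak_gb
    if kv_pool_gb <= 0:
        raise ValueError("weights and activations do not fit; raise TP or pick a larger GPU")
    slots = math.floor(kv_pool_gb / kv_gb_per_slot)
    if slots < 1:
        raise ValueError("KV pool too small for a single slot")
    replicas = math.ceil(active_users / slots)
    return kv_pool_gb, slots, replicas


pool, slots, replicas = memory_floor(
    vram_gb=141,               # H200
    gpu_mem_util=0.90,         # default shown above
    weights_gb=16.0,           # 8B params x 2 bytes (FP16)
    act_peak_gb=2.0,           # hypothetical; read the real value from the vLLM log
    kv_gb_per_slot=0.2752512,  # 2,100 tokens x 131,072 bytes
    active_users=200,          # 1000 users x 20% active
)
print(f"KV pool {pool:.1f} GB -> {slots} slots per replica -> {replicas} replica(s)")
```

This is only the memory floor. Latency targets can require more GPUs than memory alone suggests.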

Mode

💰

Cost inputs

Cloud · GCP $3.00/hr

►

Deployment

Cloud provider

Reserved discount

Guidance estimates only — validate with benchmarks before procurement.

REAL-TIME INFERENCE ONLY

Qwen3-8B · 1000 users · 4K ctx · TP=1

GPUs needed (memory floor)

—

calculating...

fits ✓

Replicas

—

copies of model

VRAM used

—

per replica

Monthly cost

▼

—

Memory floor: —

⚠ The + means throughput GPUs come on top of the memory floor. See the Throughput tab for the full answer.

ONE REPLICA — what's in VRAM (TP=1)

Weights

KV cache

Overhead

Free

How we got there — click any layer to see the math

1. Model weights
2. KV cache
3. GPU capability
4. Memory fit & replicas
5. Cost

What changes things

Sensitivity to each input

KV HEADROOM — users per replica at different context lengths

📊 Save this sizing to Google Sheets: export config, GPU count, KV breakdown, and 5-year cost.

Saved Results

Compare workloads side by side. Save from the Advanced calculator.

Hybrid savings

All numbers editable · click any blue value to change it

YOUR WORKLOAD

GPUs needed: 2

GPUs running: 16

HW price ($K): $25K

On-prem GPU

Avg utilisation: 30%

Amortisation: 5 yrs
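As a rough illustration of how these inputs combine, the sketch below spreads the hardware price over the amortisation period and then divides by utilisation to get a cost per useful GPU-hour. It covers hardware only; power, hosting, networking, and staff are left out, and the function names are hypothetical rather than the calculator's own.

```python
HOURS_PER_YEAR = 8760


def onprem_hourly(hw_price_usd: float, amortisation_years: float) -> float:
    """Hardware cost per wall-clock GPU-hour, straight-line amortisation."""
    return hw_price_usd / (amortisation_years * HOURS_PER_YEAR)


def cost_per_utilised_hour(hourly_usd: float, utilisation: float) -> float:
    """Cost per GPU-hour that actually serves traffic."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_usd / utilisation


hourly = onprem_hourly(hw_price_usd=25_000, amortisation_years=5)
effective = cost_per_utilised_hour(hourly, utilisation=0.30)
print(f"${hourly:.3f}/GPU/hr wall-clock, ${effective:.2f}/GPU/hr at 30% utilisation")
```

Low utilisation is what makes idle on-prem capacity expensive: the same card costs roughly three times more per useful hour at 30% than at full load.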

CLOUD PROVIDER (on-demand)

DEDICATED NODE (flat rate)

Reserved node, always billed 24/7. Cheaper than on-demand at >25% util.

$/mo = per node. GPUs/node is configurable. CoreWeave 8×H100 ~$3.09/GPU/hr flat.
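The utilisation threshold follows from a simple break-even: a flat-rate node is billed for every hour, on-demand only for the hours used, so the flat rate wins once utilisation exceeds flat rate divided by on-demand rate. In the sketch below the $3.09 flat rate is the figure quoted above, while the $12.36 on-demand rate is hypothetical, picked only because it reproduces a 25% threshold.

```python
def breakeven_utilisation(flat_rate_hr: float, on_demand_rate_hr: float) -> float:
    """Utilisation above which an always-billed node beats on-demand pricing."""
    if on_demand_rate_hr <= 0:
        raise ValueError("on-demand rate must be positive")
    return flat_rate_hr / on_demand_rate_hr


def monthly_cost(rate_hr: float, utilisation: float, always_billed: bool,
                 hours_per_month: float = 730) -> float:
    """Monthly cost per GPU; flat-rate nodes ignore utilisation."""
    billed_hours = hours_per_month if always_billed else hours_per_month * utilisation
    return rate_hr * billed_hours


FLAT = 3.09        # $/GPU/hr, quoted above
ON_DEMAND = 12.36  # $/GPU/hr, hypothetical

threshold = breakeven_utilisation(FLAT, ON_DEMAND)
print(f"dedicated is cheaper above {threshold:.0%} utilisation")
for util in (0.10, 0.25, 0.50):
    flat = monthly_cost(FLAT, util, always_billed=True)
    od = monthly_cost(ON_DEMAND, util, always_billed=False)
    print(f"util {util:.0%}: dedicated ${flat:,.0f}/mo vs on-demand ${od:,.0f}/mo")
```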

📈 Throughput Sizing

TTFT · TPOT · GPU count · replicas · TP selection

🤖

Model

gpt-oss-20b

►

Preset

HF model ID

Precision

⚙ Model params (editable)

Total params (B)

Active params (B)

Dense params (B)

Expert params (B)

Num experts

Active experts

Layers

KV heads

Head dim

Hidden dim

Is MoE?

⚙️

Hardware

H200 · TP auto

►

GPU

Tensor Parallel

⚙ GPU params (editable)

BF16 TFLOPS

Mem BW (TB/s)

VRAM (GB)

MFU prefill

BW base (η)

NVLink (GB/s)

PCIe (GB/s)
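For orientation, a common first-order roofline estimate uses exactly these GPU parameters: prefill is treated as compute-bound (TTFT scales with FLOPs over achieved TFLOPS), and decode as bandwidth-bound (TPOT scales with weight bytes over achieved memory bandwidth). The sketch below is that textbook approximation, not the calculator's calibrated model. Every number in it is a placeholder, and it ignores attention FLOPs, KV cache reads, batching, and tensor-parallel communication.

```python
def ttft_seconds(active_params: float, prompt_tokens: int,
                 peak_tflops: float, mfu_prefill: float) -> float:
    """Compute-bound prefill: about 2 FLOPs per active parameter per token."""
    flops = 2 * active_params * prompt_tokens
    return flops / (peak_tflops * 1e12 * mfu_prefill)


def tpot_seconds(active_params: float, bytes_per_param: float,
                 mem_bw_tbs: float, bw_efficiency: float) -> float:
    """Bandwidth-bound decode at batch 1: read the active weights once per token."""
    weight_bytes = active_params * bytes_per_param
    return weight_bytes / (mem_bw_tbs * 1e12 * bw_efficiency)


# Placeholder inputs, for illustration only.
ttft = ttft_seconds(active_params=3.6e9, prompt_tokens=1_800,
                    peak_tflops=989, mfu_prefill=0.40)
tpot = tpot_seconds(active_params=3.6e9, bytes_per_param=2,
                    mem_bw_tbs=4.8, bw_efficiency=0.60)
print(f"TTFT ≈ {ttft * 1000:.1f} ms, TPOT ≈ {tpot * 1000:.2f} ms")
```

Measured values usually come out slower than this bound, which is why the MFU and bandwidth-efficiency fields are editable.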

📊

Workload

ISL=9000 · OSL=50 · 80% prefix

►

Input tokens (ISL)

Output tokens (OSL)

Concurrency

Batch size

Prefix cache hit %

↳ Eff. ISL: 1,800 tokens (prefix skips compute, not storage)
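The effective ISL is the share of the prompt that still has to be prefilled after prefix cache hits. A minimal sketch with the values shown in this panel (ISL 9000, 80% hit rate); the function name is hypothetical. The percentage is kept as an integer so the arithmetic stays exact.

```python
def effective_isl(isl: int, prefix_hit_pct: int) -> float:
    """Prompt tokens that still need prefill compute after prefix cache hits."""
    if not 0 <= prefix_hit_pct <= 100:
        raise ValueError("prefix_hit_pct must be between 0 and 100")
    return isl * (100 - prefix_hit_pct) / 100


eff = effective_isl(isl=9000, prefix_hit_pct=80)
print(f"Eff. ISL: {eff:,.0f} tokens")

# Compute shrinks, storage does not: each request still needs KV entries
# for all 9,000 prompt tokens, so memory sizing keeps using the full ISL.
```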

🎯

Target

100M req/day · 30% buf

►

Requests / day

= 1,157.4 req/sec
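The conversion behind that figure is requests per day divided by 86,400 seconds. The sketch below also applies the 30% buffer from the panel summary as a simple multiplier, which is an assumption about how the buffer is used; it also assumes traffic is spread evenly over the day, so a peaky workload needs a higher target.

```python
SECONDS_PER_DAY = 86_400


def req_per_sec(requests_per_day: float) -> float:
    """Average request rate, assuming traffic is flat across the day."""
    return requests_per_day / SECONDS_PER_DAY


def with_buffer(rate: float, buffer_pct: float) -> float:
    """Provisioning target after adding an infrastructure buffer (assumed multiplicative)."""
    return rate * (1 + buffer_pct / 100)


rps = req_per_sec(100_000_000)
target = with_buffer(rps, buffer_pct=30)
print(f"{rps:,.1f} req/sec average, {target:,.1f} req/sec with 30% buffer")
```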

Infrastructure buffer %

Serving mode

🔬

Advanced

MFU · KV efficiency · calibration

►

Cal. multiplier (?)

Max batched tokens

Weight BW util

IB penalty (TP>8)

Disagg queue factor

Routing eff. (?)

Throughput eff.

📈 Throughput Sizing

TTFT · TPOT · GPU count · TP selection · vLLM vs llm-d

⚠

Directional estimate only. TPOT calibrated from AIC silicon data (gpt-oss-20b H100/H200 TP=8, ±5%). GPU count uses latency-aware sizing — differs from max-throughput benchmarks. TTFT and GPU count are typically accurate to ±30–50% vs real deployments. Validate with your own benchmarks before procurement. All fields are editable — override with your measured values.

📈

Configure and calculate

Set your model, GPU, and workload on the left, then click Calculate.

⚡ Performance

AIC-powered · GPU sizing · real silicon data

🤖

Model

gpt-oss-20b

►

Preset models

Or lookup HF model ID

gpt-oss-20b

⚙️

Hardware & backend

H100 SXM · vLLM · TP=8

►

GPU

Backend

Tensor parallel

DB mode

⚙ Quant override (advanced)

MoE quant

Backend ver.

📊

Workload

ISL=9000 · OSL=50 · 80% prefix

►

Input tokens (ISL)

Output tokens (OSL)

Prefix reuse % (?)

↳ Effective ISL: 1,800 tokens

Batch / concurrency

🎯

Target & SLA

100M req/day · TTFT ≤500ms

►

Requests / day

= 1,157.4 req/sec

Infra buffer %

SLA constraint

TTFT threshold (ms)

💰

Cost inputs

Cloud · GCP $3.20/hr · no discount

►

Cloud provider

Reserved discount

On-prem GPU price ($K)

Per GPU $30K

Amortisation

Years 5 yrs

📄

Request & response

View raw API calls

►

LAST REQUEST

— not run yet —

RAW RESPONSE

— not run yet —

Powered by NVIDIA AIConfigurator. Estimates — validate with benchmarking before procurement.

⚡ Performance

GPU sizing · AIC silicon data · agg vs disagg

⚡

Configure & run estimate

Set your workload and target on the left. Click Run estimate to query AIC for aggregated (agg) and disaggregated (disagg) configs; GPU sizing is derived automatically.

GPU Explorer

LLM inference planning — compare GPU generations by memory, bandwidth, and cost efficiency

NVIDIA AMD

Preset:

Axes: X = VRAM (GB) · Y = Throughput Index · Bubble size = cost efficiency (smaller = better)

💡 Top-right = high throughput. Smaller bubble = better cost efficiency. Top-left = cheaper but lower performance.

Throughput Index is a planning metric derived from memory bandwidth, VRAM, and architecture generation. It enables relative GPU comparison — not exact model throughput.
Inference performance depends on model architecture (GQA vs MHA), sequence length, batching, and inference backend (vLLM, TensorRT-LLM, etc.).
Cluster scaling — GPU-level performance does not directly translate to cluster-level throughput. Interconnect (NVLink, InfiniBand) and workload distribution impact scaling.
Pricing reflects indicative on-prem GPU estimates and may vary significantly by provider, region, and purchasing model. Validate all figures with vendors before procurement.
Sources: NVIDIA product pages, AMD product pages, published benchmarks, market reports. © gpu.calc

🔒

Admin only

Enter your admin token to view analytics

Set token once: localStorage.setItem('admin_token','yourkey')

💲

API pricing

edit prices below · select model per tier above

►

Updated March 2026 — edit when prices change

FLEET ROUTING

Semantic routing economics

HOW YOUR TRAFFIC ROUTES

ANNUAL COST IF ALL TRAFFIC WENT TO ONE MODEL:

YOUR CURRENT API SPEND

—

API cost / year

—

On-prem cost / year (amortised 5yr)

—

GPUs needed (at 10% concurrency + 20% buffer)

—

Annual savings

Should you self-host?

Engineering cost: $ one-time

PER-TIER BREAKDOWN

Concurrency: 10% · 100 simultaneous users

Tier	API model	Traffic	Queries/day	API cost/yr	Self-host GPUs	Tokens (in/out)	vs all-Sonnet

5-YEAR COST COMPARISON
