LLM Inference Sizing

Right-size your GPU fleet.

Focused tools for engineers and financial analysts evaluating LLM inference infrastructure.

Quick

Quick sizing

3 questions. Get GPU count, cost comparison, and a shareable summary.

Start →
Advanced

Full calculator

All inputs, HF model lookup, optimisation tiers (FP8, prefix cache, llm-d), cost model, formulas.

Open →
Compare
📊

GPU Explorer

Compare NVIDIA and AMD GPUs by VRAM, throughput, and price. Interactive bubble chart.

Explore →
Economics
📈

Hybrid savings

Cloud vs on-prem vs dedicated node costs. Drag the split slider to find your crossover point.

Calculate →
Routing

Fleet routing

Model the economics of semantic routing. Compare API spend vs self-hosted. Calculate payback period.

Analyse →
Compare
📌

Saved results

Compare workloads side by side. Export to PDF, Excel, or Google Sheets.

View →
Why this exists

Every time someone asked me how many GPUs they needed to serve an LLM, I opened the same spreadsheet and ran the same math. KV cache per user, available pool after weights, replica count, cost per hour — it took 20 minutes to explain.

I built gpu.calc so that same conversation takes 3 questions instead of 20 minutes. The hard math runs automatically. The assumptions are visible and editable. The result is shareable.
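
For readers who want that spreadsheet math spelled out, here is a minimal sketch of it in TypeScript. The constants mirror the Llama 3.1 8B on H100 80GB example used throughout this page; the usable-VRAM fraction and overhead figure are assumptions, and the calculator's own reservation model differs slightly, so treat this as illustrative rather than gpu.calc's implementation.

    // Back-of-envelope sizing sketch: the same spreadsheet math, not gpu.calc's exact model.
    // Constants mirror the Llama 3.1 8B on H100 80GB example shown in Quick sizing.
    const GPU_VRAM_GB = 80;         // H100 80GB
    const USABLE_FRACTION = 0.95;   // fraction of VRAM the runtime can allocate (assumed)
    const WEIGHTS_GB = 16;          // 8B parameters × 2 bytes (FP16)
    const OVERHEAD_GB = 2;          // runtime overhead (assumed)
    const KV_MB_PER_TOKEN = 0.32;   // KV cache growth per token (GQA, FP16 KV)

    function sizeFleet(users: number, contextTokens: number, dollarsPerGpuHour: number) {
      const kvPerUserGB = (KV_MB_PER_TOKEN * contextTokens) / 1024;
      const kvPoolGB = GPU_VRAM_GB * USABLE_FRACTION - WEIGHTS_GB - OVERHEAD_GB;
      const usersPerReplica = Math.floor(kvPoolGB / kvPerUserGB);
      const gpusNeeded = Math.ceil(users / usersPerReplica);  // TP=1: one GPU per replica
      const monthlyCloudCost = gpusNeeded * dollarsPerGpuHour * 730;
      return { kvPerUserGB, kvPoolGB, usersPerReplica, gpusNeeded, monthlyCloudCost };
    }

    // 30 concurrent users, 8K context, $3.00/GPU-hr on-demand:
    console.log(sizeFleet(30, 8192, 3.0));
    // → ≈2.56 GB KV per user and 2 GPUs for the example shown on this page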

Vikas Grover, AI Infrastructure · LLM Inference

Quick sizing
What model are you serving?
Pick a preset or enter a HuggingFace model ID
🔑 Have a gated model? Paste your HF token
How many people at the same time?
Peak concurrent users — not total users
Or type exact:
How long are your conversations?
This determines how much GPU memory each user needs
Or enter exact tokens (K) — supports up to 1M+
Where do you want to run this?
Sets your cost model default
Does this look right?
Review your answers before we calculate
🤖
Model
Llama 3.1 8B · GQA · FP16
HuggingFace ID or search
🔑 Have a gated model? Paste your HF token
Manual override
8B
Attention type
Precision
👥
Users & context
30 users · 8K ctx · 10% prefix
Peak concurrent users
30
1 · 10 · 100 · 1K · 10K · 100K
Context window
8K
1K · 8K · 32K · 128K · 512K · 1M+
Prefix cache hit rate
10%
Inference mode
vLLM · standard · TP=1
Framework
Mode
Tensor parallel
KV cache precision
💰
Cost inputs
Cloud · GCP $3.00/hr · no discount
Deployment
Cloud provider
Reserved discount
Guidance estimates only — validate with benchmarks before procurement. Actual requirements vary with framework overhead, batch patterns, and hardware.
REAL-TIME INFERENCE ONLY
Llama 3.1 8B · 30 users · 8K ctx · TP=1
2
H100 GPUs to get started
8B · 30 users · H100 80GB · 8K ctx · TP=1
ONE GPU — what's inside · See why →
Weights 16 GB
KV cache 34 GB
Overhead 2 GB
Free 28 GB
Weight memory
16 GB
+2.0 GB overhead
Avail for KV
54.7 GB
per replica (95% PA)
KV per user
2.56 GB
0.320 MB/tok · 8K ctx
Users / replica
21
2 replicas × 1 GPU
GPUs needed
2
H100 80GB · 2 replicas
fits ✓
Monthly cost
☁ Cloud
$163K
GCP · on-demand
⬛ On-prem
hardware · amortised
VRAM utilisation
67%
33% headroom remaining
H100 fits ✓
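
Reading the panel as a single calculation (all figures are from the card above): 0.320 MB/token × 8,192 tokens ≈ 2.56 GB of KV cache per user; 54.7 GB available for KV ÷ 2.56 GB ≈ 21 users per replica; and ⌈30 ÷ 21⌉ = 2 replicas, each on one GPU at TP=1, so 2× H100 80GB.
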
KV HEADROOM — users per replica at different context lengths
NODE TOPOLOGY
OPTIMISE
1
GPU choice
H100 80GB · default
GPU tier
NVIDIA
AMD
2
Reduce memory (quantization)
FP16 weights · FP16 KV
3
KV cache efficiency
vLLM paging ON · 10% prefix · GQA
10%
4
Scale (users & context)
30 users · 8K ctx
30
1 · 100 · 10K · 100K
8K
1K · 32K · 128K · 1M+
5
Cost model
Cloud · GCP $3.00/hr · no discount

Edit cost inputs in the left panel → Cost inputs section
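
To illustrate how the memory levers in the panel above move the per-user KV footprint, here is a rough sketch. The halving for FP8 KV follows from FP8 storing one byte per value instead of two; applying the prefix cache hit rate as a straight per-user discount is a simplifying assumption (the calculator may account for shared prefixes differently), so validate against its output.

    // Effect of the quantisation and KV-efficiency levers on per-user KV cache.
    // FP8 KV stores one byte per value instead of two, halving bytes per token.
    // Treating the prefix cache hit rate as a per-user discount is an assumption.
    function kvPerUserGB(
      contextTokens: number,
      kvMbPerTokenFp16: number,     // 0.320 for the Llama 3.1 8B (GQA) example
      kvPrecision: "fp16" | "fp8",
      prefixHitRate: number         // 0.10 = 10% of tokens served from a shared prefix
    ): number {
      const byteScale = kvPrecision === "fp8" ? 0.5 : 1.0;
      const uniqueTokens = contextTokens * (1 - prefixHitRate);
      return (kvMbPerTokenFp16 * byteScale * uniqueTokens) / 1024;
    }

    console.log(kvPerUserGB(8192, 0.32, "fp16", 0.10).toFixed(2)); // ≈ 2.30
    console.log(kvPerUserGB(8192, 0.32, "fp8", 0.10).toFixed(2));  // ≈ 1.15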

📊 Save this sizing to Google Sheets. Export your full config, GPU count, KV cache breakdown, and 5-year cost as a spreadsheet row.

Saved Results

Compare workloads side by side. Save from the Advanced calculator.

Hybrid savings

All numbers editable · click any blue value to change it

YOUR WORKLOAD: 2 · 16 · $25K · 30% · 5 yrs
CLOUD PROVIDER (on-demand)
DEDICATED NODE (flat rate)
Reserved node, always billed 24/7. Cheaper than on-demand at >25% util.
$/mo = per node. GPUs/node is configurable. CoreWeave 8×H100 ~$3.09/GPU/hr flat.
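
The ">25% util" rule of thumb above falls out of comparing a flat node rate with usage-billed on-demand hours. A quick sketch, using the CoreWeave flat rate quoted above and an assumed on-demand rate (substitute your provider's figure):

    // Break-even utilisation between a flat-rate dedicated node and on-demand GPUs.
    // A dedicated node is billed 24/7; on-demand is billed only for busy hours,
    // so dedicated wins once utilisation exceeds (flat rate / on-demand rate).
    const FLAT_PER_GPU_HR = 3.09;       // CoreWeave 8×H100 flat rate quoted above
    const ON_DEMAND_PER_GPU_HR = 12.0;  // assumed on-demand H100 rate (replace with your provider's)

    const breakEvenUtil = FLAT_PER_GPU_HR / ON_DEMAND_PER_GPU_HR;
    console.log(`Dedicated node is cheaper above ${(breakEvenUtil * 100).toFixed(0)}% utilisation`);
    // → ~26% with these example rates
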
⚡ Calibrated estimate
Modeled using benchmark-calibrated inference profiles
MODEL
PRESET MODELS
OR LOOKUP BY HF ID
Llama 3.1 8B
INFERENCE INPUTS
HARDWARE & BACKEND
Tensor parallel (?)
Pipeline parallel (?)
📄 Estimator request payload
Sent to estimator API. Edit before running.
database_mode defaults to HYBRID. Edits here override the form.
🔍 Estimator response & debug
RAW JSON RESPONSE
— not run yet —
PARSED FIELDS
— not run yet —
Calibrated estimates are modeled using benchmark-calibrated inference profiles and should be validated with workload-specific benchmarking before procurement.

Calibrated estimate

All numbers editable · results update on re-run

No estimate yet
Configure model and hardware on the left, then click Run estimate

GPU Explorer

LLM inference planning — compare GPU generations by memory, bandwidth, and cost efficiency

NVIDIA · AMD
Preset:
X axis: VRAM (GB) · Y axis: Throughput Index · Bubble size = cost efficiency (smaller = better)
💡 Top-right = high throughput. Smaller bubble = better cost efficiency. Top-left = cheaper but lower performance.
Throughput Index is a planning metric derived from memory bandwidth, VRAM, and architecture generation. It enables relative GPU comparison — not exact model throughput.
Inference performance depends on model architecture (GQA vs MHA), sequence length, batching, and inference backend (vLLM, TensorRT-LLM, etc.).
Cluster scaling — GPU-level performance does not directly translate to cluster-level throughput. Interconnect (NVLink, InfiniBand) and workload distribution impact scaling.
Pricing reflects indicative on-prem GPU estimates and may vary significantly by provider, region, and purchasing model. Validate all figures with vendors before procurement.
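
The exact weighting behind the Throughput Index is not published here. Purely to illustrate how a composite planning metric can be assembled from memory bandwidth, VRAM, and architecture generation, the following hypothetical sketch normalises each factor to an H100 reference; every constant in it is an assumption, not gpu.calc's formula.

    // Hypothetical composite planning index (not the formula gpu.calc uses).
    // Normalises memory bandwidth and VRAM to an H100 80GB reference and applies
    // an architecture-generation multiplier; useful only for relative comparison.
    function throughputIndex(bandwidthGBs: number, vramGB: number, genMultiplier: number): number {
      const REF_BANDWIDTH_GBS = 3350;  // H100 SXM HBM3, ~3.35 TB/s
      const REF_VRAM_GB = 80;
      return 100 * (bandwidthGBs / REF_BANDWIDTH_GBS) * Math.sqrt(vramGB / REF_VRAM_GB) * genMultiplier;
    }

    console.log(throughputIndex(3350, 80, 1.0).toFixed(0)); // H100 reference → 100
    console.log(throughputIndex(2039, 80, 0.8).toFixed(0)); // A100 80GB, assumed 0.8 gen factor → ~49
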
Sources: NVIDIA product pages, AMD product pages, published benchmarks, market reports. © gpu.calc
🔒

Admin only

Enter your admin token to view analytics

Set token once: localStorage.setItem('admin_token','yourkey')

VG
Vikas Grover
AI Infrastructure · LLM Inference
Building open-source AI infra tooling. Follow for LLM inference, GPU sizing, and platform engineering content.
Quick question
What are you using this for? (1 tap, no email needed)
Great — want Vikas to review your specific situation?
Just exploring — skip
💲
API pricing
edit prices below · select model per tier above
Updated March 2026 — edit when prices change
FLEET ROUTING

Semantic routing economics

HOW YOUR TRAFFIC ROUTES
ANNUAL COST IF ALL TRAFFIC WENT TO ONE MODEL:
YOUR CURRENT API SPEND
API cost / year
On-prem cost / year (amortised 5yr)
GPUs needed (at 10% concurrency + 20% buffer)
Annual savings
Should you self-host?
Engineering cost: $ one-time
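
A minimal sketch of the payback calculation behind these fields. GPU sizing follows the "10% concurrency + 20% buffer" rule above; per-GPU capacity, GPU cost per hour, and the example figures are assumptions to replace with your own numbers.

    // Payback period for self-hosting a tier of routed traffic (illustrative only).
    function paybackMonths(
      apiCostPerYear: number,      // current API spend for this tier
      totalUsers: number,
      engineeringCost: number,     // one-time engineering cost
      concurrency = 0.10,          // 10% of users active simultaneously
      buffer = 1.20,               // 20% capacity buffer
      usersPerGpu = 20,            // assumed concurrent users a single GPU can serve
      gpuCostPerHour = 3.0         // assumed amortised on-prem or cloud $/GPU-hr
    ): number {
      const gpusNeeded = Math.ceil((totalUsers * concurrency * buffer) / usersPerGpu);
      const selfHostPerYear = gpusNeeded * gpuCostPerHour * 24 * 365;
      const annualSavings = apiCostPerYear - selfHostPerYear;
      return annualSavings > 0 ? (engineeringCost / annualSavings) * 12 : Infinity;
    }

    // $400K/yr API spend, 1,000 users, $50K engineering effort:
    console.log(paybackMonths(400_000, 1000, 50_000).toFixed(1)); // ≈ 2.5 months
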
PER-TIER BREAKDOWN
Concurrency: 10% · 100 simultaneous users
Tier · API model · Traffic · Queries/day · API cost/yr · Self-host GPUs · Tokens (in/out) · vs all-Sonnet
5-YEAR COST COMPARISON