[000_calculator]

diff. LLM inference simulator

Estimate model fit, prefill latency, and decode throughput across practical GPU choices.

Need GPUs to run this? Spin up GPU instances on RunPod and benchmark your model against this estimate.
Inputs
  Models: one or more models to simulate
  GPU: accelerator type and count (default 1)
  Inference: model size, quantization, and prompt tokens for prefill (256, 512, 1k, 2k, or 4k; default 512)
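
Conceptually, the form reduces to a small config. A minimal sketch in Python; the field names and defaults are illustrative assumptions, not the tool's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    # Illustrative mirror of the form fields; names and defaults are
    # assumptions, not the calculator's real schema.
    model: str                # model to simulate (one entry per selection)
    params_b: float           # model size, billions of parameters
    bits_per_weight: int      # quantization: 16 (fp16), 8 (int8), 4 (int4)
    num_gpus: int = 1         # GPU count
    prompt_tokens: int = 512  # prefill length: 256, 512, 1024, 2048, 4096

cfg = SimConfig(model="example-70b", params_b=70, bits_per_weight=4)
```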

Calculation
  Model size (GB)
  Fits in VRAM (yes/no)
  Prefill time (ms)
  Decode throughput (tok/s)
  Recommendation

Run a simulation to size the setup.


Select models and hardware, then simulate to get a practical read on fit, prompt latency, and decode speed.

VRAM

Model weights must fit in aggregate GPU memory, with headroom for the KV cache and activations, before serving is realistic.
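
A minimal sketch of the fit check, assuming weights dominate and using a ~20% allowance for KV cache and runtime overhead (the allowance is an assumption, not the calculator's published formula):

```python
def weight_memory_gb(params_b: float, bits_per_weight: int) -> float:
    """Weight footprint: parameters x bytes per parameter, in GB."""
    return params_b * 1e9 * (bits_per_weight / 8) / 1e9

def fits_in_vram(params_b: float, bits_per_weight: int,
                 num_gpus: int, vram_per_gpu_gb: float,
                 overhead: float = 1.2) -> bool:
    """Compare weights (plus an assumed ~20% allowance for KV cache
    and activations) against aggregate VRAM across all GPUs."""
    needed = weight_memory_gb(params_b, bits_per_weight) * overhead
    return needed <= num_gpus * vram_per_gpu_gb

# Example: 70B parameters at 4-bit on a single 80 GB GPU
print(weight_memory_gb(70, 4))     # 35.0 GB of weights
print(fits_in_vram(70, 4, 1, 80))  # True (42 GB needed vs 80 GB available)
```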

Prefill

Prompt processing is compute-bound: latency grows with model size and prompt length, and shrinks with the total FLOPS available.
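
A common roofline estimate charges roughly 2 FLOPs per parameter per prompt token and divides by achievable compute. A hedged sketch; the 50% utilization factor is an assumption, not a measured number:

```python
def prefill_time_ms(params_b: float, prompt_tokens: int,
                    gpu_tflops: float, num_gpus: int = 1,
                    efficiency: float = 0.5) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token,
    divided by achievable (not peak) compute."""
    flops_needed = 2 * params_b * 1e9 * prompt_tokens
    flops_per_s = gpu_tflops * 1e12 * num_gpus * efficiency
    return flops_needed / flops_per_s * 1e3

# Example: 70B model, 512-token prompt, one GPU rated at 400 TFLOPS
print(round(prefill_time_ms(70, 512, 400)))  # ~358 ms
```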

Decode

Generation speed is usually constrained by memory bandwidth, since each generated token re-reads the full weight set from HBM.
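
Under that bandwidth-bound assumption, tok/s is roughly usable bandwidth divided by the bytes of weights. A sketch; the 70% bandwidth-efficiency factor is an assumption:

```python
def decode_tok_per_s(params_b: float, bits_per_weight: int,
                     bandwidth_gb_per_s: float, num_gpus: int = 1,
                     efficiency: float = 0.7) -> float:
    """Bandwidth-bound estimate: every generated token streams the full
    weight set once; efficiency is an assumed fraction of peak bandwidth."""
    weight_bytes = params_b * 1e9 * (bits_per_weight / 8)
    usable = bandwidth_gb_per_s * 1e9 * num_gpus * efficiency
    return usable / weight_bytes

# Example: 70B at 4-bit on one GPU with 3,350 GB/s peak bandwidth
print(round(decode_tok_per_s(70, 4, 3350)))  # ~67 tok/s
```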

Select one or more models and run a simulation to see concurrent streamed output…