KV Cache Sizing
How "64K context" turns into bytes on the wire — the formula, worked examples for Llama-3.1-70B, TP sharding math, and a cross-reference to LMCache's calculator.
This is the reference for "what does c=128, 64K mean in bytes?" and
"how does that divide across the GPUs?" — the size math behind every
context-size table in our docs.
The first thing to settle: 64K is tokens, not bytes
The "context" axis in inference benchmark tables is the prompt length
in tokens. When a row reads 64K, it means 65,536 tokens of input
context. This is unrelated to KiB / MiB / GiB.
A token is a vocabulary unit the tokenizer emits. For Llama-3.1 the tokenizer's vocabulary is 128,256 entries; English text averages roughly 0.75 words per token (about 1.3 tokens per word). So a "64K context" prompt is on the order of 50,000 English words, about a 200-page book.
What that turns into in bytes depends on the model. For Llama-3.1-70B at BF16 it's 20 GiB (about 21.5 GB) of KV cache per session. The walkthrough is below.
What is "KV cache" and why does it grow with context?
A transformer at inference does two things per token: (1) compute its query/key/value vectors and attend over all prior tokens, and (2) stash that token's K and V vectors in a per-layer buffer so future tokens can attend back without re-running the projection.
Step 2 is the KV cache. It grows monotonically: every prompt or generated token adds one (K, V) pair per layer per KV head (with grouped-query attention there are fewer KV heads than query heads). At 64K tokens the cache holds 65,536 (K, V) entries for every layer and KV head. Context length maps to memory linearly: each token buys a fixed-size slot.
The formula, with every term named
For a batch size of 1 (one session), context length N tokens:
    KV bytes per session = 2 × N × num_layers × num_kv_heads × head_dim × bytes_per_elem

Where:
- 2: K and V are stored separately
- N: tokens of context
- num_layers: e.g., 80 for 70B
- num_kv_heads: after GQA reduction
- head_dim: e.g., 128
- bytes_per_elem: 2 for BF16/FP16

Per token (i.e., per 1 token of context):

    KV bytes per token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_elem

This is the value LMCache's calculator publishes per model. Plugging in Llama-3.1-70B's parameters:

    2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KiB per token (≈ 0.31 MiB)

Independent confirmation: LMCache's KV Cache Calculator publishes exactly this number for Llama-3.1-70B at BF16. VMware's LLM Inference Sizing and Performance Guidance derives the same per-token figure and uses it to compute "40 GB of KV cache for a single Llama-3.1-70B request at 128K context."
Common pitfall — the 2× error. Some references write the formula as 2 × num_layers × num_kv_heads × head_dim and arrive at half the correct number.
They are conflating the two factors of 2 — either dropping K/V separation or
assuming 1 byte/element. The right formula at BF16 has two factors of 2:
one for K+V being separate caches, one for bytes_per_elem = 2. If your
number disagrees with LMCache's calculator by exactly 2×, this is the cause.
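The formula can be written as a small Python helper. This is a minimal sketch: the names kv_bytes_per_token, kv_bytes_per_session, and LLAMA_3_1_70B are illustrative and not taken from LMCache or any inference engine, and the parameter values are the Llama-3.1-70B figures used in this section.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token, summed over all layers and KV heads."""
    k_and_v = 2  # K and V are stored separately
    return k_and_v * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_session(context_tokens: int, num_layers: int, num_kv_heads: int,
                         head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one session (batch size 1) of `context_tokens` tokens."""
    return context_tokens * kv_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_elem)


# Illustrative constant: Llama-3.1-70B at BF16.
LLAMA_3_1_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**LLAMA_3_1_70B)
print(per_token)          # 327680 bytes = 320 KiB

# The 2x pitfall: dropping one of the two factors of 2 gives exactly half.
half = 2 * 80 * 8 * 128
print(per_token / half)   # 2.0
```

Keeping bytes_per_elem as an explicit argument makes the second factor of 2 visible, which is the term the pitfall above tends to lose.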
Worked example — Llama-3.1-70B
Architecture, from the Hugging Face config.json:
| Field | Value |
|---|---|
| num_hidden_layers | 80 |
| num_key_value_heads | 8 (GQA; fewer than the 64 attention heads) |
| head_dim | 128 |
| dtype | bfloat16 (2 bytes/element) |
Per-token KV (total across all GPUs)
    2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KiB

Per-session KV at common context lengths
| Context | Tokens | KV per session |
|---|---|---|
| 4K | 4,096 | 1.25 GiB |
| 16K | 16,384 | 5.00 GiB |
| 32K | 32,768 | 10.00 GiB |
| 64K | 65,536 | 20.00 GiB |
| 128K | 131,072 | 40.00 GiB |
These are total bytes for one session's KV cache, summed across all GPUs holding pieces of it.
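The table can be regenerated with a few lines of Python. This is a sketch; session_gib and session_gb are hypothetical helper names, and the per-token constant is the 327,680 bytes computed above. It prints both GiB (binary) and GB (decimal), since the two differ by about 7% at this scale.

```python
PER_TOKEN_BYTES = 2 * 80 * 8 * 128 * 2   # Llama-3.1-70B at BF16
GIB = 1024 ** 3                          # binary gigabyte
GB = 1000 ** 3                           # decimal gigabyte


def session_gib(context_tokens: int) -> float:
    """Per-session KV cache size in GiB."""
    return context_tokens * PER_TOKEN_BYTES / GIB


def session_gb(context_tokens: int) -> float:
    """Per-session KV cache size in decimal GB."""
    return context_tokens * PER_TOKEN_BYTES / GB


for label, tokens in [("4K", 4_096), ("16K", 16_384), ("32K", 32_768),
                      ("64K", 65_536), ("128K", 131_072)]:
    print(f"{label:>4} {tokens:>8,} {session_gib(tokens):6.2f} GiB "
          f"{session_gb(tokens):6.2f} GB")
```

For the 64K row this gives 20.00 GiB, which is 21.47 GB in decimal units.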
Tensor-parallel sharding
When the tensor-parallel degree is TP, the KV heads are split across the
TP GPUs along the head axis. The constraint is num_kv_heads % TP == 0,
so for Llama-3.1-70B (8 KV heads) valid TP values are 1, 2, 4, 8.
At TP=8, each GPU holds exactly 1 KV head. Per-GPU per-token KV is therefore:
    2 × num_layers × (num_kv_heads / TP) × head_dim × bytes_per_elem
    = 2 × 80 × (8 / 8) × 128 × 2
    = 40,960 bytes = 40 KiB per token, per GPU

Per-session at 64K context:

    65,536 × 40 KiB = 2,621,440 KiB = 2.5 GiB per GPU

Across 8 GPUs in the TP group, that's 2.5 × 8 = 20 GiB total per session, the same number as before, just split eight ways. The
arithmetic is consistent: TP shards the KV cache; it does not
change total bytes.
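The sharding arithmetic, including the divisibility constraint, can be sketched as follows. The function name kv_bytes_per_token_per_gpu is hypothetical, and the sketch assumes KV heads are split evenly with no replication.

```python
def kv_bytes_per_token_per_gpu(num_layers: int, num_kv_heads: int, head_dim: int,
                               bytes_per_elem: int, tp: int) -> int:
    """Per-GPU KV bytes for one token when KV heads are split across `tp` GPUs."""
    if tp < 1 or num_kv_heads % tp != 0:
        raise ValueError(f"num_kv_heads={num_kv_heads} is not divisible by TP={tp}")
    return 2 * num_layers * (num_kv_heads // tp) * head_dim * bytes_per_elem


GIB = 1024 ** 3
for tp in (1, 2, 4, 8):
    per_gpu = 65_536 * kv_bytes_per_token_per_gpu(80, 8, 128, 2, tp)
    # The per-GPU share shrinks with TP; the total across the group does not.
    print(f"TP={tp}: {per_gpu / GIB:5.2f} GiB per GPU, "
          f"{tp * per_gpu / GIB:.0f} GiB total")
```

Every row reports the same 20 GiB total, with the per-GPU share falling from 20 GiB at TP=1 to 2.5 GiB at TP=8.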
Adding concurrency
A load generator with --concurrency C keeps C in-flight requests at
all times. Each request has its own KV cache, so the aggregate live
KV state at any moment is:
    total live KV = C × KV_per_session

For example, at c=128 and 64K context:

    128 × 20 GiB = 2,560 GiB = 2.5 TiB of live KV state across the cluster

Compare to an 8× H200 HBM budget (8 × 141 GB = 1,128 GB ≈ 1.03 TiB total, roughly 0.9 TiB usable for KV after about 140 GB of BF16 weights for the 70B model). 2.5 TiB ≫ 0.9 TiB, so most sessions cannot fit in HBM and must either be evicted (offload tier required) or recomputed on next access. This is the regime where an external KV state tier earns its keep; the arithmetic, not any specific benchmark, is the source of the requirement.
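A sketch of the same comparison in Python. It rests on two assumptions that are not measurements: the weights take about 140 GB (70e9 parameters at 2 bytes each), and all remaining HBM goes to KV cache, which ignores activations and runtime overhead and so gives an optimistic upper bound. The helper names are hypothetical.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def live_kv_bytes(concurrency: int, kv_per_session: int) -> int:
    """Aggregate KV state for `concurrency` in-flight sessions."""
    return concurrency * kv_per_session


def resident_sessions(hbm_bytes: int, weight_bytes: int, kv_per_session: int) -> int:
    """Upper bound on sessions whose KV fits in HBM next to the weights."""
    return max(0, hbm_bytes - weight_bytes) // kv_per_session


kv_per_session = 20 * GIB      # Llama-3.1-70B, 64K context, BF16
hbm = 8 * 141 * 10**9          # 8x H200 at 141 GB each (decimal GB)
weights = 140 * 10**9          # assumption: ~70e9 params x 2 bytes

print(live_kv_bytes(128, kv_per_session) / TIB)          # 2.5
print(resident_sessions(hbm, weights, kv_per_session))   # 46
```

Under these assumptions at most 46 of the 128 sessions can be resident in HBM at once; the rest must be offloaded or recomputed.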
Where MemKV pays off
MemKV's benefit comes from one place: skipping prefill compute on the GPU. A KV state read off MemKV displaces the matching prefill that would otherwise run. The bigger the prefill that's avoided, and the more often the same prefix would have been recomputed, the bigger the win.
The gating condition is prefix reuse. If almost every prompt is unique, MemKV has nothing to serve back — every request still has to prefill from scratch on the GPU, and offloading the resulting KV state only adds I/O without removing compute. In that regime there is no real benefit to a remote KV tier.
Workloads where MemKV pays off:
- Long-context prompts where prefill is the dominant cost (RAG over large documents, codebases pasted into context, transcripts).
- Prefix reuse across requests (shared system prompts, multi-turn conversations, agent loops, batch evaluation over the same context).
- Cluster traffic that exceeds the per-GPU HBM prefix cache, so prefixes get evicted and the next hit would otherwise recompute.
Workloads where MemKV adds little:
- Mostly-unique prompts with no shared prefix across requests.
- Short prompts where prefill is cheap to begin with.
- Working sets that fit entirely in HBM and never get evicted.
How "64K context" travels into MemKV
The KV cache is stored as fixed-size blocks. Recent vLLM defaults to a
block size of 16 tokens (configurable via --block-size). For
Llama-3.1-70B at TP=8, one vLLM block, per GPU, is:
    16 tokens × 40 KiB/token = 640 KiB per block per GPU

KVBM groups consecutive blocks into a single NIXL transfer, and NIXL's
makeXferReq merges contiguous descriptors before they reach the
MemKV plugin. The effective transfer size that lands on the wire is
multi-MB to tens of MB, well into the high-bandwidth region of
MemKV's read curve.
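The block arithmetic can be sketched in Python. The helper names are hypothetical, and the 16 MiB transfer size is only an illustrative target inside the "multi-MB to tens of MB" range, not a measured value.

```python
import math

KIB = 1024
MIB = 1024 ** 2


def block_bytes_per_gpu(block_tokens: int, per_token_per_gpu: int) -> int:
    """Bytes one KV block occupies on a single GPU."""
    return block_tokens * per_token_per_gpu


def blocks_per_session(context_tokens: int, block_tokens: int) -> int:
    """Number of blocks for one session, rounded up to whole blocks."""
    return math.ceil(context_tokens / block_tokens)


block = block_bytes_per_gpu(16, 40 * KIB)   # TP=8: 40 KiB per token per GPU
print(block // KIB)                         # 640 (KiB)
print(blocks_per_session(65_536, 16))       # 4096 blocks per GPU at 64K

# Illustrative target: contiguous blocks needed to fill a 16 MiB transfer.
print(math.ceil(16 * MIB / block))          # 26
```

At 4,096 blocks of 640 KiB each, one GPU's share of a 64K session is again 2.5 GiB.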
Bulk return: RC for control, DC for data
A session's KV cache is not one object on MemKV — it is many slabs,
each holding one KV block. When the engine asks to recall a session,
the plugin packs up to max_batch_size keys into one BatchRead
control message (default 64).
Two transports are involved, and the distinction matters:
- Control plane (RC). The BatchRead/BatchWrite request and its response ride a per-connection Reliable Connection QP. One round-trip carries up to max_batch_size keys.
- Data plane (DC). Inside the server's handling of one batch, each per-key data transfer is an RDMA WRITE/READ over Dynamically Connected transport. The server holds a pool of DC initiators and addresses the client's single DC target endpoint per work request. The KV bytes themselves ride DC, not RC. DC scales O(N) QPs across peers (one DCT per node) instead of O(N²) per-peer RC, which is what lets MemKV sustain near-line-rate fabric utilization without per-client QP setup.
On non-mlx5 hardware where DC is unavailable, the data plane falls back to RC transparently. Same wire protocol, same plugin code path.
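As a rough sketch of what this means for one recall: assuming one key per 16-token block (the one-slab-per-block layout described above) and the default max_batch_size of 64, the number of BatchRead control messages for one GPU's share of a session follows from two ceiling divisions. The function name control_messages_per_recall is hypothetical, and the sketch does not model KVBM grouping blocks into larger transfers.

```python
import math


def control_messages_per_recall(context_tokens: int, block_tokens: int = 16,
                                max_batch_size: int = 64) -> int:
    """Estimated BatchRead messages to recall one GPU's share of a session."""
    keys = math.ceil(context_tokens / block_tokens)   # assumption: one key per KV block
    return math.ceil(keys / max_batch_size)           # keys packed per control message


print(control_messages_per_recall(65_536))   # 64
```

The data for those keys still moves over DC; this count covers only the RC control round-trips.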
Knobs that control batching and parallelism
Within one NIXL postXfer call, batches to the same server are
dispatched sequentially on a single thread/connection. Batches to
different servers run in parallel. KVBM's parallel postXfer calls
(different threads on the engine side) take different connections from
the per-server pool, so the maximum number of in-flight batches per
server is bounded by the connection pool size.
| Knob | Default | Configured via | What it changes |
|---|---|---|---|
| max_batch_size | 64 | MEMKV_MAX_BATCH_SIZE env / MEMKV_CONFIG yaml | Ops per BatchRead/BatchWrite control message |
| num_connections | 8 | MEMKV_NUM_CONNECTIONS env / MEMKV_CONFIG yaml | Per-server RC connection pool size; bounds in-flight batches per server |
| Client DCI pool | 256 | client-side, fixed | Concurrent DC data transfers fanned out within a batch on the client |
| Server DCI pool | 256 | rdma.num_dcis in /etc/memkv/config.yaml | Server-side DCI pool; size to expected fan-in across clients |
Tune these — don't crank them. Larger max_batch_size reduces
control-plane round-trips but increases the encoded message size, CPU
encode/decode work, and per-batch state held under the connection mutex. The
runtime auto-falls-back to per-op transfers when the encoded message exceeds
the control buffer cap, but you'll feel that fallback as a sudden latency
cliff. Same with num_connections: more connections add pinned MR + DCI
bookkeeping per server. The right values depend on the engine's offload
parallelism, average per-key payload size, and fabric round-trip latency.
Start at the defaults, measure the per-batch throughput debug log line, and
adjust one knob at a time.
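When reasoning about a change to these knobs, it can help to compute the bound they imply. The sketch below is a planning helper, not the plugin's own configuration code: it reads the two environment variables named in the table with their documented defaults, and how MemKV itself parses them (or the MEMKV_CONFIG yaml) is not shown here. load_knobs and max_inflight_keys_per_server are hypothetical names.

```python
import os


def load_knobs(env=None) -> dict:
    """Read the two batching knobs, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "max_batch_size": int(env.get("MEMKV_MAX_BATCH_SIZE", 64)),
        "num_connections": int(env.get("MEMKV_NUM_CONNECTIONS", 8)),
    }


def max_inflight_keys_per_server(knobs: dict) -> int:
    # At most one batch in flight per pooled connection,
    # each carrying up to max_batch_size keys.
    return knobs["num_connections"] * knobs["max_batch_size"]


knobs = load_knobs({})                       # empty mapping, so defaults apply
print(max_inflight_keys_per_server(knobs))   # 512
```

If each key is one 640 KiB block (an assumption about the workload, not a property of MemKV), the defaults allow at most 512 × 640 KiB = 320 MiB of requested data outstanding per server.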
For a representative point — 70B BF16 at TP=8, 64K context, c=128 — that translates into roughly:
- Per request: 20 GiB total (about 21.5 GB), 2.5 GiB per GPU, read from MemKV
- Aggregate: 2.5 TiB (about 2.75 TB) of live KV state, most of which spills to MemKV once HBM is exhausted
- On the wire: max_batch_size-key control messages over RC, with the per-key data transfers fanned out over DC; multiple batches in flight across the connection pool
None of which has anything to do with "64 kilobytes."
Reference numbers for other Llama-family models
For quick comparison (per-token total at BF16, no TP sharding):
| Model | Layers | KV heads | Head dim | Per token |
|---|---|---|---|---|
| Llama-3.1-8B | 32 | 8 | 128 | 128 KiB |
| Llama-3.1-70B | 80 | 8 | 128 | 320 KiB |
| Llama-3.1-405B | 126 | 8 | 128 | 504 KiB |
| gpt-oss-120b (MoE) | 36 | 8 | 64 | 72 KiB (half the layers cache full attention; the rest are sliding window; see config) |
The same formula applies, just with the model's config.json values
substituted. Verify any model against LMCache's KV calculator before
quoting a number.
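The per-token column can be reproduced from the three architecture columns. This is a sketch with the table's values hard-coded; MODELS and per_token_kib are illustrative names. It applies the full-attention formula to every layer, so for models with sliding-window layers the result is the per-token figure from the table, not a statement about how much of it stays resident.

```python
MODELS = {
    # name: (num_layers, num_kv_heads, head_dim), copied from the table above
    "Llama-3.1-8B": (32, 8, 128),
    "Llama-3.1-70B": (80, 8, 128),
    "Llama-3.1-405B": (126, 8, 128),
    "gpt-oss-120b": (36, 8, 64),
}
BYTES_PER_ELEM = 2  # BF16


def per_token_kib(num_layers: int, num_kv_heads: int, head_dim: int) -> float:
    """Per-token KV size in KiB, total across GPUs (no TP sharding)."""
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEM / 1024


for name, cfg in MODELS.items():
    print(f"{name:<16} {per_token_kib(*cfg):6.0f} KiB")
```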
TL;DR for cross-references
When you see a benchmark cell like c=128, 64K:
- 64K = 65,536 input tokens (not bytes, not anything else).
- c=128 = 128 simultaneous in-flight requests (--concurrency 128 on the load generator).
- The bytes follow from 2 × N × num_layers × num_kv_heads × head_dim × bytes_per_elem for one session, multiplied by c for aggregate.
- For any specific model, run that arithmetic with the model's num_layers, num_kv_heads, head_dim, and dtype, or paste it into the LMCache calculator.
Sources
- LMCache — KV Cache Size Calculator — canonical per-model KV per-token reference
- VMware Cloud Foundation Blog — LLM Inference Sizing and Performance Guidance
- Spheron — GPU Memory Requirements for LLMs: VRAM Calculator
- AWS Neuron — Training Llama-3.1-70B with Tensor Parallelism — TP sharding constraints
- JAX Scaling Book — Serving LLaMA 3-70B on TPUs — KV-head sharding strategy
- Hugging Face — meta-llama/Llama-3.1-70B-Instruct config.json — the architectural source of truth