KV Cache Sizing

How "64K context" turns into bytes on the wire — the formula, worked examples for Llama-3.1-70B, TP sharding math, and a cross-reference to LMCache's calculator.

This is the reference for "what does c=128, 64K mean in bytes?" and "how does that divide across the GPUs?" — the size math behind every context-size table in our docs.

The first thing to settle: 64K is tokens, not bytes

The "context" axis in inference benchmark tables is the prompt length in tokens. When a row reads 64K, it means 65,536 tokens of input context. This is unrelated to KiB / MiB / GiB.

A token is a vocabulary unit the tokenizer emits. For Llama-3.1 the tokenizer's vocabulary is 128,256 entries; English averages roughly 0.75 words per token (about 1.3 tokens per word). So a "64K context" prompt is on the order of 50,000 English words, about a 200-page book.

What that turns into in bytes depends on the model. For Llama-3.1-70B at BF16 it's 20 GiB of KV cache per session. The walkthrough is below.

What is "KV cache" and why does it grow with context?

A transformer at inference does two things per token: (1) compute its query/key/value vectors and attend over all prior tokens, and (2) stash that token's K and V vectors in a per-layer buffer so future tokens can attend back without re-running the projection.

Step 2 is the KV cache. It grows monotonically: every prompt or generated token adds one (K, V) pair per layer per KV head. At 64K tokens the cache holds 65,536 (K, V) pairs for every layer and KV head. Context length maps to memory linearly: each token buys a fixed-size slot.
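A toy model makes the linear growth concrete. This is a minimal sketch that only counts bytes; `ToyKVCache` and the tiny layer and head sizes are illustrative assumptions, not how any real engine lays out its cache.

```python
from dataclasses import dataclass


@dataclass
class ToyKVCache:
    """Counts KV-cache bytes for one session; stores no real tensors."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # BF16/FP16
    tokens: int = 0

    @property
    def slot_bytes(self) -> int:
        # Fixed cost of one token: K and V (factor 2) in every layer and KV head.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.bytes_per_elem

    def append_token(self) -> None:
        # Each prompt or generated token claims exactly one more slot.
        self.tokens += 1

    @property
    def total_bytes(self) -> int:
        return self.tokens * self.slot_bytes


# Deliberately tiny, made-up dimensions so the numbers are easy to follow.
cache = ToyKVCache(num_layers=2, num_kv_heads=2, head_dim=4)
sizes = []
for _ in range(4):
    cache.append_token()
    sizes.append(cache.total_bytes)

# Every step grows by the same amount, so size is linear in token count.
deltas = [b - a for a, b in zip(sizes, sizes[1:])]
print(sizes, deltas)
```

The constant step between successive sizes is the fixed-size slot each token buys.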

The formula, with every term named

For a batch size of 1 (one session), context length N tokens:

KV bytes per session = 2 × N × num_layers × num_kv_heads × head_dim × bytes_per_elem
                       │   │       │              │             │            └── 2 for BF16/FP16
                       │   │       │              │             └────────────── e.g., 128
                       │   │       │              └─────────────────────────── after GQA reduction
                       │   │       └──────────────────────────────────────── e.g., 80 for 70B
                       │   └─────────────────────────────────────────────── tokens of context
                       └─────────────────────────────────────────────────── K and V are stored separately

Per token (i.e., per 1 token of context):

KV bytes per token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_elem

This is the value LMCache's calculator publishes per model. Plugging Llama-3.1-70B's parameters:

2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KiB per token (≈ 0.31 MiB)

Independent confirmation: LMCache's KV Cache Calculator publishes exactly this number for Llama-3.1-70B at BF16. VMware's LLM Inference Sizing and Performance Guidance derives the same per-token figure and uses it to compute "40 GB of KV cache for a single Llama-3.1-70B request at 128K context."

Common pitfall — the 2× error. Some references write the formula as 2 × num_layers × num_kv_heads × head_dim and arrive at half the correct number. They are conflating the two factors of 2 — either dropping K/V separation or assuming 1 byte/element. The right formula at BF16 has two factors of 2: one for K+V being separate caches, one for bytes_per_elem = 2. If your number disagrees with LMCache's calculator by exactly 2×, this is the cause.
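The per-token formula and the 2× pitfall can be checked in a few lines. This is a minimal sketch; the function name `kv_bytes_per_token` is an illustrative choice, not an LMCache or vLLM API.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes added by one token, summed over all layers and KV heads."""
    k_and_v = 2  # K and V are stored as separate caches
    return k_and_v * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Llama-3.1-70B at BF16 (2 bytes per element).
correct = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# The pitfall: one factor of 2 goes missing (modelled here as 1 byte per element).
halved = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=1)

print(correct, correct // 1024, correct // halved)
```

If a quoted figure matches `halved` rather than `correct`, one of the two factors of 2 was dropped.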

Worked example — Llama-3.1-70B

Architecture, from the Hugging Face config.json:

Field	Value
`num_hidden_layers`	80
`num_key_value_heads`	8 (GQA — fewer than the 64 attention heads)
`head_dim`	128
dtype	bfloat16 (2 bytes/element)

Per-token KV (total across all GPUs)

2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KiB

Per-session KV at common context lengths

Context	Tokens	KV per session
4K	4,096	1.25 GiB
16K	16,384	5.00 GiB
32K	32,768	10.00 GiB
64K	65,536	20.00 GiB
128K	131,072	40.00 GiB

These are total bytes for one session's KV cache, summed across all GPUs holding pieces of it.
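The table above can be regenerated from the per-token figure. A short sketch, assuming only the Llama-3.1-70B BF16 parameters already given; the helper name is illustrative.

```python
KIB = 1024
GIB = 1024 ** 3

PER_TOKEN_BYTES = 2 * 80 * 8 * 128 * 2  # Llama-3.1-70B at BF16: 327,680 bytes


def kv_bytes_per_session(context_tokens: int, per_token_bytes: int = PER_TOKEN_BYTES) -> int:
    """Total KV bytes for one session (batch size 1), summed across all GPUs."""
    return context_tokens * per_token_bytes


table = {}
for label in ("4K", "16K", "32K", "64K", "128K"):
    tokens = int(label[:-1]) * KIB  # "64K" means 64 * 1024 tokens
    table[label] = kv_bytes_per_session(tokens) / GIB
    print(f"{label:>4}  {tokens:>7,}  {table[label]:6.2f} GiB")
```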

Tensor-parallel sharding

When the tensor-parallel degree is TP, the KV heads are split across the TP GPUs along the head axis. The constraint is num_kv_heads % TP == 0, so for Llama-3.1-70B (8 KV heads) valid TP values are 1, 2, 4, 8.

At TP=8, each GPU holds exactly 1 KV head. Per-GPU per-token KV is therefore:

2 × num_layers × (num_kv_heads / TP) × head_dim × bytes_per_elem
2 × 80         × (8 / 8)             × 128      × 2
= 40,960 bytes = 40 KiB per token, per GPU

Per-session at 64K context:

65,536 × 40 KiB = 2,621,440 KiB = 2.5 GiB per GPU

Across 8 GPUs in the TP group, that's 2.5 × 8 = 20 GiB total per session — the same number as before, just split eight ways. The arithmetic is consistent: TP shards the KV cache; it does not change total bytes.
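The sharding arithmetic, including the divisibility constraint, can be sketched as follows. The function name is illustrative, and the even head split is the assumption stated in this section rather than a description of any particular engine's sharding code.

```python
def kv_bytes_per_token_per_gpu(num_layers: int, num_kv_heads: int, head_dim: int,
                               tp: int, bytes_per_elem: int = 2) -> int:
    """Per-GPU KV bytes for one token when KV heads are split evenly across tp GPUs."""
    if tp < 1 or num_kv_heads % tp != 0:
        raise ValueError(f"tp={tp} does not evenly divide num_kv_heads={num_kv_heads}")
    return 2 * num_layers * (num_kv_heads // tp) * head_dim * bytes_per_elem


CONTEXT_TOKENS = 65_536  # 64K
GIB = 1024 ** 3

per_gpu = {tp: kv_bytes_per_token_per_gpu(80, 8, 128, tp) for tp in (1, 2, 4, 8)}
for tp, per_token in per_gpu.items():
    session_per_gpu = CONTEXT_TOKENS * per_token / GIB
    # Per-GPU share times tp recovers the same total for every valid tp.
    print(f"TP={tp}: {per_token // 1024} KiB/token/GPU, "
          f"{session_per_gpu:.2f} GiB/GPU, {session_per_gpu * tp:.0f} GiB total")
```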

Adding concurrency

A load generator with --concurrency C keeps C in-flight requests at all times. Each request has its own KV cache, so the aggregate live KV state at any moment is:

total live KV = C × KV_per_session

For example, at c=128 and 64K context:

128 × 20 GiB = 2,560 GiB = 2.5 TiB of live KV state across the cluster

Compare to an 8× H200 HBM budget: 8 × 141 GB = 1,128 GB ≈ 1.03 TiB total, of which roughly 0.9 TiB is left for KV after the 70B model's BF16 weights (about 140 GB). 2.5 TiB ≫ 0.9 TiB, so most sessions cannot fit in HBM and must either be evicted (offload tier required) or recomputed on next access. This is the regime where an external KV state tier earns its keep; the arithmetic, not any specific benchmark, is the source of the requirement.
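The comparison can be turned into a rough count of how many sessions stay resident. This is a back-of-the-envelope sketch: the weight footprint (70e9 parameters at 2 bytes each) is an assumption, and activations, fragmentation, and runtime overhead are ignored, so the result is an optimistic upper bound.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

KV_PER_SESSION = 65_536 * 327_680        # 64K tokens x 320 KiB/token = 20 GiB
CONCURRENCY = 128

HBM_BYTES = 8 * 141 * 10**9              # 8x H200, 141 GB (decimal) each
WEIGHT_BYTES = 70 * 10**9 * 2            # assumption: ~70e9 params at 2 bytes each

total_live_kv = CONCURRENCY * KV_PER_SESSION
usable_for_kv = HBM_BYTES - WEIGHT_BYTES  # optimistic: nothing else occupies HBM
sessions_in_hbm = usable_for_kv // KV_PER_SESSION

print(f"live KV: {total_live_kv / TIB:.2f} TiB")
print(f"sessions resident in HBM (upper bound): {sessions_in_hbm} of {CONCURRENCY}")
```

Even under these generous assumptions, well under half of the 128 sessions can hold their KV cache in HBM at once.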

Where MemKV pays off

MemKV's benefit comes from one place: skipping prefill compute on the GPU. A KV state read off MemKV displaces the matching prefill that would otherwise run. The bigger the prefill that's avoided, and the more often the same prefix would have been recomputed, the bigger the win.

The gating condition is prefix reuse. If almost every prompt is unique, MemKV has nothing to serve back — every request still has to prefill from scratch on the GPU, and offloading the resulting KV state only adds I/O without removing compute. In that regime there is no real benefit to a remote KV tier.

Workloads where MemKV pays off:

Long-context prompts where prefill is the dominant cost (RAG over large documents, codebases pasted into context, transcripts).
Prefix reuse across requests (shared system prompts, multi-turn conversations, agent loops, batch evaluation over the same context).
Cluster traffic that exceeds the per-GPU HBM prefix cache, so prefixes get evicted and the next hit would otherwise recompute.

Workloads where MemKV adds little:

Mostly-unique prompts with no shared prefix across requests.
Short prompts where prefill is cheap to begin with.
Working sets that fit entirely in HBM and never get evicted.

How "64K context" travels into MemKV

The KV cache is stored as fixed-size blocks. Recent vLLM defaults to a block size of 16 tokens (configurable via --block-size). For Llama-3.1-70B at TP=8, one vLLM block, per GPU, is:

16 tokens × 40 KiB/token = 640 KiB per block per GPU
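The block size and the number of blocks one 64K session occupies on each GPU follow directly. A small sketch with illustrative names; it assumes the block size of 16 and the TP=8 per-token figure from above.

```python
import math

KIB = 1024
GIB = 1024 ** 3

BLOCK_TOKENS = 16                    # vLLM --block-size
PER_TOKEN_PER_GPU = 40 * KIB         # Llama-3.1-70B, BF16, TP=8


def blocks_for_context(context_tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Blocks needed to hold a context; the last block may be partially filled."""
    return math.ceil(context_tokens / block_tokens)


block_bytes = BLOCK_TOKENS * PER_TOKEN_PER_GPU
num_blocks = blocks_for_context(65_536)
print(block_bytes // KIB, num_blocks, num_blocks * block_bytes / GIB)
```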

KVBM groups consecutive blocks into a single NIXL transfer, and NIXL's makeXferReq merges contiguous descriptors before they reach the MemKV plugin. The effective transfer size that lands on the wire is multi-MB to tens of MB, well into the high-bandwidth region of MemKV's read curve.

Bulk return: RC for control, DC for data

A session's KV cache is not one object on MemKV; it is many slabs, each holding one KV block. When the engine asks to recall a session, the plugin packs up to max_batch_size keys (default 64) into one BatchRead control message.
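The packing step can be illustrated with a simple chunking helper. This is a hypothetical sketch, not the plugin's code: `pack_batches` and the key names are invented for illustration, and the key count assumes each slab holds one 16-token block.

```python
from typing import Iterator, List, Sequence


def pack_batches(keys: Sequence[str], max_batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_batch_size keys."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    for start in range(0, len(keys), max_batch_size):
        yield list(keys[start:start + max_batch_size])


# Hypothetical key names. 4,096 keys is what a 64K-token session yields per GPU
# if each slab holds one 16-token block.
keys = [f"session-0/block-{i}" for i in range(4096)]
batches = list(pack_batches(keys))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Under that assumption, recalling one session takes 64 control round-trips per GPU at the default batch size; larger slabs would reduce the count proportionally.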

Two transports are involved, and the distinction matters:

Control plane (RC). The BatchRead / BatchWrite request and its response ride a per-connection Reliable Connection QP. One round-trip carries up to max_batch_size keys.
Data plane (DC). Inside the server's handling of one batch, each per-key data transfer is an RDMA WRITE/READ over Dynamically Connected transport. The server holds a pool of DC initiators and addresses the client's single DC target endpoint per work request. The KV bytes themselves ride DC, not RC. DC scales O(N) QPs across peers (one DCT per node) instead of O(N²) per-peer RC, which is what lets MemKV sustain near-line-rate fabric utilization without per-client QP setup.

On non-mlx5 hardware where DC is unavailable, the data plane falls back to RC transparently. Same wire protocol, same plugin code path.

Knobs that control batching and parallelism

Within one NIXL postXfer call, batches to the same server are dispatched sequentially on a single thread/connection. Batches to different servers run in parallel. KVBM's parallel postXfer calls (different threads on the engine side) take different connections from the per-server pool, so the maximum number of in-flight batches per server is bounded by the connection pool size.

Knob	Default	Configured via	What it changes
`max_batch_size`	64	`MEMKV_MAX_BATCH_SIZE` env / `MEMKV_CONFIG` yaml	Ops per `BatchRead`/`BatchWrite` control message
`num_connections`	8	`MEMKV_NUM_CONNECTIONS` env / `MEMKV_CONFIG` yaml	Per-server RC connection pool size; bounds in-flight batches per server
Client DCI pool	256	client-side, fixed	Concurrent DC data transfers fanned out within a batch on the client
Server DCI pool	256	`rdma.num_dcis` in `/etc/memkv/config.yaml`	Server-side DCI pool; size to expected fan-in across clients
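The two client-side knobs combine into a simple ceiling on outstanding work per server. The sketch below reads the environment variable names from the table and falls back to the listed defaults; that the values are plain integers is an assumption, and `read_knob` is a hypothetical helper, not part of the plugin.

```python
import os
from typing import Mapping

DEFAULTS = {"MEMKV_MAX_BATCH_SIZE": 64, "MEMKV_NUM_CONNECTIONS": 8}


def read_knob(name: str, env: Mapping[str, str] = os.environ) -> int:
    """Return the knob from the environment, or its documented default."""
    raw = env.get(name)
    return DEFAULTS[name] if raw is None else int(raw)  # assumes integer values


def max_inflight_keys_per_server(env: Mapping[str, str] = os.environ) -> int:
    # At most one batch per pooled connection, each carrying up to max_batch_size keys.
    return read_knob("MEMKV_NUM_CONNECTIONS", env) * read_knob("MEMKV_MAX_BATCH_SIZE", env)


print(max_inflight_keys_per_server({}))                                # defaults only
print(max_inflight_keys_per_server({"MEMKV_MAX_BATCH_SIZE": "128"}))   # one knob changed
```

This is only the bound implied by the connection pool; actual in-flight work also depends on how many parallel postXfer calls the engine issues.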

Tune these; don't crank them. Larger max_batch_size reduces control-plane round-trips but increases the encoded message size, CPU encode/decode work, and per-batch state held under the connection mutex. The runtime automatically falls back to per-op transfers when the encoded message exceeds the control buffer cap, but you'll feel that fallback as a sudden latency cliff. The same goes for num_connections: each extra connection adds pinned MR and DCI bookkeeping per server. The right values depend on the engine's offload parallelism, average per-key payload size, and fabric round-trip latency. Start at the defaults, watch the per-batch throughput line in the debug log, and adjust one knob at a time.

For a representative point — 70B BF16 at TP=8, 64K context, c=128 — that translates into roughly:

Per request: 20 GiB total / 2.5 GiB per GPU read from MemKV
Aggregate: 2.5 TiB of live KV state; whatever does not fit in HBM spills to MemKV
On the wire: max_batch_size-key control messages over RC, with the per-key data transfers fanned out over DC; multiple batches in flight across the connection pool

None of which has anything to do with "64 kilobytes."

Reference numbers for other models

For quick comparison (per-token total at BF16, no TP sharding):

Model	Layers	KV heads	Head dim	Per token
Llama-3.1-8B	32	8	128	128 KiB
Llama-3.1-70B	80	8	128	320 KiB
Llama-3.1-405B	126	8	128	504 KiB
gpt-oss-120b (MoE)	36	8	64	72 KiB by the formula; an upper bound, because only half the layers cache full attention and the rest use a sliding window (see config)

The same formula applies, just with the model's config.json values substituted. Verify any model against LMCache's KV calculator before quoting a number.
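Substituting config.json values can be sketched as a function over a config dictionary. The field names follow the Hugging Face config fields listed earlier; the fallback to hidden_size / num_attention_heads when `head_dim` is missing is an assumption about how such configs are commonly laid out, and the dictionaries below hold only the fields the formula needs.

```python
def kv_bytes_per_token_from_config(config: dict, bytes_per_elem: int = 2) -> int:
    """Per-token KV bytes from Hugging Face style config fields (no TP sharding)."""
    head_dim = config.get("head_dim")
    if head_dim is None:
        # Assumption: derive head_dim when the config does not state it explicitly.
        head_dim = config["hidden_size"] // config["num_attention_heads"]
    return (2 * config["num_hidden_layers"] * config["num_key_value_heads"]
            * head_dim * bytes_per_elem)


# Values copied from the table above. The formula treats every layer as
# caching every token, so the gpt-oss figure ignores its sliding-window layers.
CONFIGS = {
    "Llama-3.1-8B":   {"num_hidden_layers": 32,  "num_key_value_heads": 8, "head_dim": 128},
    "Llama-3.1-70B":  {"num_hidden_layers": 80,  "num_key_value_heads": 8, "head_dim": 128},
    "Llama-3.1-405B": {"num_hidden_layers": 126, "num_key_value_heads": 8, "head_dim": 128},
    "gpt-oss-120b":   {"num_hidden_layers": 36,  "num_key_value_heads": 8, "head_dim": 64},
}

per_token_kib = {name: kv_bytes_per_token_from_config(cfg) // 1024
                 for name, cfg in CONFIGS.items()}
print(per_token_kib)
```

In practice the dictionary would come from `json.load` on the model's config.json; the result is still worth checking against the LMCache calculator.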

TL;DR for cross-references

When you see a benchmark cell like c=128, 64K:

64K = 65,536 input tokens (not bytes, not anything else).
c=128 = 128 simultaneous in-flight requests (--concurrency 128 on the load generator).
The bytes follow from 2 × N × num_layers × num_kv_heads × head_dim × bytes_per_elem for one session, multiplied by c for aggregate.
For any specific model, run that arithmetic with the model's num_layers, num_kv_heads, head_dim, and dtype — or paste it into the LMCache calculator.

Sources

LMCache — KV Cache Size Calculator — canonical per-model KV per-token reference
VMware Cloud Foundation Blog — LLM Inference Sizing and Performance Guidance
Spheron — GPU Memory Requirements for LLMs: VRAM Calculator
AWS Neuron — Training Llama-3.1-70B with Tensor Parallelism — TP sharding constraints
JAX Scaling Book — Serving LLaMA 3-70B on TPUs — KV-head sharding strategy
Hugging Face — meta-llama/Llama-3.1-70B-Instruct config.json — the architectural source of truth
