RoCEv2 setup runbook

End-to-end setup for lossless RoCEv2 — switch PFC/ECN/DSCP, host DCB, jumbo-frame MTU, and the MemKV settings that match them — plus verification and troubleshooting.

MemKV moves KV blocks over RDMA. On Ethernet that means RoCEv2, and RoCEv2 only performs when the whole path — switch, host NIC, and MemKV — agrees on three things: a lossless priority (PFC + ECN), a DSCP/SL marking that lands traffic on that priority, and an MTU every hop can carry. Get one wrong and RDMA either rides the best-effort queue (slow, lossy) or drops every data packet outright (retry-exhausted, then a silent downgrade to TCP).

This runbook configures all three end to end. It assumes Mellanox/NVIDIA mlx5 hardware (ConnectX-5 or later) on Linux; adapt the switch steps to your vendor.

The individual MemKV knobs (gid_index, traffic_class, service_level, mtu) are described field-by-field in Configuration. This page is the operational sequence that ties them to the fabric.

The three things that must agree

| Layer | Lossless priority | Marking | MTU |
| --- | --- | --- | --- |
| Switch | PFC on the RoCE priority | trust DSCP, classify to queue | port MTU ≥ payload + headers |
| Host (NIC) | PFC + ECN on the priority | DSCP→prio map, trust dscp | netdev MTU (jumbo) |
| MemKV | service_level = prio | traffic_class = DSCP << 2 | mtu ≤ port active MTU |

The conventional RoCE marking is DSCP 26 → priority 3. MemKV's auto-detect prefers exactly that pairing when it finds it, so use it unless your fabric already standardized on another lossless priority.
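
The DSCP-to-traffic_class relationship in the table is plain bit arithmetic: the DSCP occupies the upper six bits of the ToS byte, and the two low bits are left for ECN. A minimal sketch follows; the function names are illustrative and not part of MemKV or any NIC tool.

```shell
# Convert a DSCP value (0..63) to the ToS byte that traffic_class expects.
dscp_to_tos() {
    echo $(( $1 << 2 ))
}

# Recover the DSCP from a ToS byte by dropping the two low (ECN) bits.
tos_to_dscp() {
    echo $(( $1 >> 2 ))
}
```

For the conventional marking, `dscp_to_tos 26` prints 104, which is the value to pin as traffic_class when setting it by hand.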

Why MTU is the common failure

RoCE negotiates an IB-style path MTU — one of 256, 512, 1024, 2048, or 4096 bytes — derived from the link's L2 MTU minus headroom for headers. A standard 1500-byte Ethernet link negotiates a RoCE active MTU of 1024. To get the full 4096 path MTU you need jumbo frames (an L2 MTU of ~4200+; in practice set 9000) on every hop.
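
As a rough sketch of that derivation: subtract the header headroom from the L2 MTU, then take the largest IB MTU size that still fits. The 96-byte headroom used here is an assumption (it is the figure the Linux RDMA core is understood to subtract, and it is consistent with the ~4200 threshold above); check your own driver before relying on it. The function name is hypothetical.

```shell
# Header headroom subtracted from the L2 MTU. ASSUMPTION: 96 bytes.
ROCE_HEADROOM=96

# Print the largest RoCE path MTU that fits in the given L2 (netdev) MTU.
# Usage: roce_path_mtu <l2_mtu>
roce_path_mtu() {
    payload=$(( $1 - $ROCE_HEADROOM ))
    for size in 4096 2048 1024 512 256; do
        if [ "$payload" -ge "$size" ]; then
            echo "$size"
            return 0
        fi
    done
    echo "L2 MTU $1 is too small for any RoCE path MTU" >&2
    return 1
}
```

Under that assumption, `roce_path_mtu 1500` prints 1024 and `roce_path_mtu 9000` prints 4096, matching the two cases described above.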

MemKV defaults to mtu: 4096. If the fabric only negotiated 1024, a 4096-byte data packet overflows the frame and is dropped; the requester exhausts its retries (IBV_WC_RETRY_EXC_ERR, completion status 12) and the client falls back to TCP. This only shows up cross-node: a single-node loopback path never puts a packet on the wire, so it passes locally and fails in the cluster.

MemKV refuses to start when the configured mtu exceeds the port's negotiated active MTU, naming both values, rather than letting the data path stall. So you must either raise the fabric to carry 4096 (below) or set mtu down to match.

Switch configuration

Configure the RoCE priority lossless and trust DSCP so host markings survive the hop. On NVIDIA Spectrum (Cumulus Linux) the essentials are:

# Enable PFC on priority 3 (the RoCE priority)
nv set qos pfc default-global pfc-priority 3
nv set qos pfc default-global state enable

# Trust the incoming DSCP marking and map DSCP 26 -> switch priority 3
nv set qos mapping default-global trust dscp
nv set qos mapping default-global dscp 26 switch-priority 3

# Enable ECN (WRED) on the RoCE traffic class for congestion signalling
nv set qos congestion-control default-global congestion-mode ecn
nv set qos congestion-control default-global traffic-class 3

# Jumbo frames on every RoCE-facing port
nv set interface swp1-swp32 link mtu 9216
nv config apply

PFC and ECN must be enabled fabric-wide, on every switch in the RoCE path, not just the leaf the host attaches to. A single hop that drops the pause frames or strips the DSCP turns the path lossy and RDMA throughput collapses.

Host (NIC) configuration

Mark RoCE traffic, make the priority lossless, enable ECN, and raise the MTU. Use the netdev backing your HCA (find it with rdma link or ibdev2netdev); the examples below assume mlx5_0 is backed by enp1s0f0.

Trust DSCP and map DSCP 26 to priority 3

mlnx_qos -i enp1s0f0 --trust dscp
mlnx_qos -i enp1s0f0 --dscp2prio set,26,3

Make priority 3 lossless with PFC

# one 0/1 flag per priority 0..7, in order: enable PFC on priority 3 only
mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0

Enable ECN for the RoCE priority

echo 1 > /sys/class/net/enp1s0f0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3

Raise the MTU to enable a 4096 RoCE path MTU

ip link set enp1s0f0 mtu 9000

These commands are not persistent. Wire them into your NIC bring-up (a systemd unit, network-scripts, netplan, or the NVIDIA mlnx-en/openibd config) so they survive a reboot.
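
One way to make the steps repeatable is to generate them from a single function and run the result from whatever bring-up hook you use. The sketch below only prints the commands already shown in this section; it does not execute them. The function name and its argument order are hypothetical.

```shell
# Print (do not run) the host bring-up commands for one netdev.
# Usage: roce_host_cmds <netdev> <dscp> <priority> <mtu>
roce_host_cmds() {
    dev=$1
    dscp=$2
    prio=$3
    mtu=$4

    # Build the per-priority PFC list: 1 at <priority>, 0 elsewhere.
    pfc=""
    i=0
    while [ "$i" -le 7 ]; do
        if [ "$i" -eq "$prio" ]; then flag=1; else flag=0; fi
        pfc="${pfc}${pfc:+,}${flag}"
        i=$(( i + 1 ))
    done

    echo "mlnx_qos -i $dev --trust dscp"
    echo "mlnx_qos -i $dev --dscp2prio set,$dscp,$prio"
    echo "mlnx_qos -i $dev --pfc $pfc"
    echo "echo 1 > /sys/class/net/$dev/ecn/roce_np/enable/$prio"
    echo "echo 1 > /sys/class/net/$dev/ecn/roce_rp/enable/$prio"
    echo "ip link set $dev mtu $mtu"
}
```

Review the output of `roce_host_cmds enp1s0f0 26 3 9000` first; once it matches what you ran by hand, pipe it to `sh` as root from your bring-up unit.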

MemKV configuration

With the fabric lossless and jumbo-framed, MemKV's auto-detect resolves the right values on its own. Leave gid_index, traffic_class, and service_level at their default of 0 and it reads the DSCP→prio map and PFC mask from the NIC's DCB config, then stamps the matching DSCP and SL on egress.

rdma:
  device: mlx5_0
  gid_index: 0 # 0 = auto-detect the routable RoCEv2 GID
  traffic_class: 0 # 0 = auto-detect the lossless DSCP (e.g. 104 for DSCP 26)
  service_level: 0 # 0 = take the priority paired with the DSCP (e.g. 3)
  mtu: 4096 # requires jumbo frames end-to-end (above)

Pin them explicitly when auto-detect can't see your DCB config (some virtualized or non-mlx5 NICs) — traffic_class is the ToS byte, i.e. DSCP << 2 (DSCP 26 → 104), and service_level is the priority (3):

rdma:
  gid_index: 3 # a routable RoCEv2 GID entry (see verification below)
  traffic_class: 104 # DSCP 26 << 2
  service_level: 3 # matches the lossless PFC priority
  mtu: 4096

The client has its own mtu setting (MEMKV_RDMA_MTU, default 4096), which must agree with the server's rdma.mtu and with the fabric. If you cannot enable jumbo frames, lower both ends to 1024; a mismatch between the two ends, or between either end and the link, is refused at startup.
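
The agreement rule can be checked by hand before a rollout. The sketch below is an operator-side preflight, not MemKV's own startup check; the function name and the error wording are hypothetical.

```shell
# Return 0 when client and server MTU match and fit the port's active MTU.
# Usage: mtu_agrees <client_mtu> <server_mtu> <active_mtu>
mtu_agrees() {
    if [ "$1" -ne "$2" ]; then
        echo "client mtu $1 != server mtu $2" >&2
        return 1
    fi
    if [ "$2" -gt "$3" ]; then
        echo "configured mtu $2 exceeds active MTU $3" >&2
        return 1
    fi
    return 0
}
```

For example, `mtu_agrees "${MEMKV_RDMA_MTU:-4096}" 4096 1024` fails on a 1500-byte fabric, which is the case where both ends should be lowered to 1024.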

Two markings matter independently on mlx5. The DSCP (traffic_class) classifies the packet on the wire, but the egress scheduling class follows the SL (service_level), not the DSCP, so the SL must equal the lossless priority or traffic leaves on the best-effort queue even with the right DSCP. Likewise, gid_index must point at a RoCEv2 GID: GID table index 0 on mlx5 is the RoCEv1 entry, which carries no IP header and therefore no DSCP, so it can never reach the lossless queue. Leaving these fields at 0 avoids both traps, because a configured 0 means auto-detect rather than a literal value of 0.

Verification

Confirm the negotiated RoCE path MTU

ibv_devinfo -d mlx5_0 | grep -E 'active_mtu|state'
# active_mtu: 4096 (5)   <- jumbo working; 1024 (3) means MTU too low

If this shows 1024 (3) after setting a 9000 netdev MTU, a hop in the path is still at 1500 — recheck the switch port MTU and any bond/VLAN sub-interface.
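
To turn the MTU check into a pass/fail step, the active MTU can be pulled out of the ibv_devinfo output and compared against the value you intend to configure. This is a sketch that assumes the `active_mtu:` line format shown above; the function name is hypothetical.

```shell
# Read `ibv_devinfo` output on stdin, print the first port's active MTU,
# and fail if it is below the wanted value.
# Usage: ibv_devinfo -d <device> | require_active_mtu <wanted_mtu>
require_active_mtu() {
    want=$1
    got=$(awk '$1 == "active_mtu:" { print $2; exit }')
    if [ -z "$got" ]; then
        echo "no active_mtu line found" >&2
        return 2
    fi
    if [ "$got" -lt "$want" ]; then
        echo "active_mtu $got is below $want" >&2
        return 1
    fi
    echo "$got"
}
```

Run it as `ibv_devinfo -d mlx5_0 | require_active_mtu 4096`; a non-zero exit means the port cannot carry the default rdma.mtu.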

Confirm a routable RoCEv2 GID exists

show_gids mlx5_0
# look for RoCE v2 rows with a v4 (::ffff:a.b.c.d) IP — that index is what
# auto-detect picks; pin it as gid_index if you set it by hand
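
If you want the index without reading the table by eye, a small awk filter can pick the first RoCE v2 row that has an IPv4 address. This sketch assumes the usual show_gids column order (DEV, PORT, INDEX, GID, IPv4, VER, DEV) with the IPv4 column left blank on link-local rows; verify that against your own output before relying on it. It is not necessarily the same selection logic MemKV's auto-detect uses, and the function name is hypothetical.

```shell
# Read `show_gids` output on stdin and print the index of the first
# RoCE v2 row that carries an IPv4 address; exit 1 if there is none.
# ASSUMED columns: DEV PORT INDEX GID IPv4 VER DEV
first_rocev2_ipv4_gid() {
    awk 'NF == 7 && $6 == "v2" { print $3; found = 1; exit }
         END { if (!found) exit 1 }'
}
```

Usage: `show_gids mlx5_0 | first_rocev2_ipv4_gid`.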

Confirm DSCP trust and the lossless priority

mlnx_qos -i enp1s0f0
# trust: dscp ; pfc enabled on prio 3 ; dscp 26 -> prio 3

Confirm traffic actually rides the lossless queue

This is the definitive check — counters on the RoCE priority must move while best-effort (prio 0) stays flat during an RDMA transfer:

ethtool -S enp1s0f0 | grep -E 'prio3|prio0' > before.txt
# run a MemKV read/write workload, then:
ethtool -S enp1s0f0 | grep -E 'prio3|prio0' > after.txt
diff before.txt after.txt
# tx_prio3_bytes must climb; tx_prio0_bytes should not. If prio0 moves and
# prio3 doesn't, the SL/DSCP isn't landing traffic on the lossless queue.
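
Reading a diff of large counters is error-prone, so it can help to print the per-counter increase directly. The sketch below works on the before.txt and after.txt snapshots taken above and assumes the `name: value` line format that ethtool -S prints; the function name is hypothetical.

```shell
# Print how much one `ethtool -S` counter grew between two snapshots.
# Usage: counter_delta <counter_name> <before_file> <after_file>
counter_delta() {
    awk -v name="$1:" '
        $1 == name && FNR == NR { before = $2; next }  # first file
        $1 == name              { after = $2 }         # second file
        END { printf "%.0f\n", after - before }
    ' "$2" "$3"
}
```

`counter_delta tx_prio3_bytes before.txt after.txt` should print a large positive number after a transfer, while the same call for tx_prio0_bytes should stay at or near zero.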

Confirm raw RDMA bandwidth between two nodes

# server node
ib_write_bw -d mlx5_0 -x 3 --report_gbits
# client node
ib_write_bw -d mlx5_0 -x 3 --report_gbits <server-ip>
# -x 3 selects a RoCEv2 GID index; expect near line rate with no retries

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Startup error: configured MTU exceeds the port's active MTU | rdma.mtu larger than the link negotiated | Enable jumbo frames end-to-end so the port negotiates 4096, or set mtu to the active value |
| IBV_WC_RETRY_EXC_ERR / status 12, then downgrading engine to inline-bulk TCP | Path MTU exceeds the link MTU; data packets dropped (only cross-node) | Same as above: fix the MTU on both ends and the fabric |
| RDMA works but throughput is low and lossy | Traffic on the best-effort queue (no PFC, or wrong SL/DSCP/GID) | Verify tx_prio3 moves (above); ensure gid_index is RoCEv2 and SL matches the PFC priority |
| no RoCEv2 GID found warning, gid_index 0 | NIC has no RoCEv2 GID, or it wasn't detected | Confirm RoCEv2 GIDs with show_gids; pin gid_index to a routable v2 entry |
| Auto-detect leaves traffic_class=0 with a warning | MemKV couldn't read the NIC's DCB config | Pin traffic_class (DSCP << 2) and service_level explicitly |

A loopback or single-node test never puts a packet on the wire, so MTU and PFC misconfiguration pass locally and only surface cross-node. Always validate on a multi-node path before declaring the fabric production-ready.