RoCEv2 setup runbook
End-to-end setup for lossless RoCEv2 — switch PFC/ECN/DSCP, host DCB, jumbo-frame MTU, and the MemKV settings that match them — plus verification and troubleshooting.
MemKV moves KV blocks over RDMA. On Ethernet that means RoCEv2, and RoCEv2 only performs when the whole path — switch, host NIC, and MemKV — agrees on three things: a lossless priority (PFC + ECN), a DSCP/SL marking that lands traffic on that priority, and an MTU every hop can carry. Get one wrong and RDMA either rides the best-effort queue (slow, lossy) or drops every data packet outright (retry-exhausted, then a silent downgrade to TCP).
This runbook configures all three end to end. It assumes Mellanox/NVIDIA mlx5 hardware (ConnectX-5 or later) on Linux; adapt the switch steps to your vendor.
The individual MemKV knobs (gid_index, traffic_class, service_level,
mtu) are described field-by-field in
Configuration. This page is the operational sequence
that ties them to the fabric.
The three things that must agree
| Layer | Lossless priority | Marking | MTU |
|---|---|---|---|
| Switch | PFC on the RoCE priority | trust DSCP, classify to queue | port MTU ≥ payload + headers |
| Host (NIC) | PFC + ECN on the priority | DSCP→prio map, trust dscp | netdev MTU (jumbo) |
| MemKV | service_level = prio | traffic_class = DSCP << 2 | mtu ≤ port active MTU |
The conventional RoCE marking is DSCP 26 → priority 3. MemKV's auto-detect prefers exactly that pairing when it finds it, so use it unless your fabric already standardized on another lossless priority.
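The marking arithmetic in the table can be sketched in a few lines of shell. The ToS byte carries the DSCP in its upper six bits, so traffic_class is the DSCP shifted left by two. The function names here are illustrative helpers, not part of MemKV or any NIC tool.

```sh
# Convert between a DSCP value and the ToS byte that MemKV calls traffic_class.
# The low two bits of the ToS byte are the ECN field, hence the shift by 2.
dscp_to_tclass() { echo $(( $1 << 2 )); }
tclass_to_dscp() { echo $(( $1 >> 2 )); }

dscp_to_tclass 26     # prints 104
tclass_to_dscp 104    # prints 26
```

For the conventional pairing this gives traffic_class 104 alongside service_level 3.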
Why MTU is the common failure
RoCE negotiates an IB-style path MTU — one of 256, 512, 1024, 2048, or 4096 bytes — derived from the link's L2 MTU minus headroom for headers. A standard 1500-byte Ethernet link negotiates a RoCE active MTU of 1024. To get the full 4096 path MTU you need jumbo frames (an L2 MTU of ~4200+; in practice set 9000) on every hop.
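A minimal sketch of that derivation follows, assuming a fixed 96-byte header allowance. That constant is an assumption modeled on how Linux RoCE drivers commonly reserve room for RoCE headers. It is not a value taken from this runbook, and the exact figure may differ by driver. roce_path_mtu and ROCE_HDR_ALLOWANCE are illustrative names.

```sh
# Assumed header allowance subtracted from the L2 MTU (hypothetical constant).
ROCE_HDR_ALLOWANCE=96

# Print the largest IB-style path MTU that fits in the given L2 MTU.
roce_path_mtu() (
    avail=$(( $1 - ROCE_HDR_ALLOWANCE ))
    for size in 4096 2048 1024 512 256; do
        if [ "$avail" -ge "$size" ]; then
            echo "$size"
            exit 0
        fi
    done
    echo "L2 MTU $1 is too small for any RoCE path MTU" >&2
    exit 1
)

roce_path_mtu 1500    # prints 1024
roce_path_mtu 9000    # prints 4096
```

Under this assumption, any L2 MTU from 4192 upward yields 4096, which is consistent with the ~4200+ figure above.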
MemKV defaults mtu: 4096. If the fabric only negotiated 1024, a 4096-byte
data packet overflows the frame and is dropped; the requester exhausts its
retries (IBV_WC_RETRY_EXC_ERR, completion status 12) and the client falls
back to TCP. This only shows up cross-node — a single-node loopback path
never puts a packet on the wire, so it passes locally and fails in the cluster.
MemKV refuses to start when the configured mtu exceeds the port's negotiated
active MTU, naming both values, rather than letting the data path stall. So you
must either raise the fabric to carry 4096 (below) or set mtu down to match.
Switch configuration
Configure the RoCE priority lossless and trust DSCP so host markings survive the hop. On NVIDIA Spectrum (Cumulus Linux) the essentials are:
# Enable PFC on priority 3 (the RoCE priority)
nv set qos pfc default-global pfc-priority 3
nv set qos pfc default-global state enable
# Trust the incoming DSCP marking and map DSCP 26 -> switch priority 3
nv set qos mapping default-global trust dscp
nv set qos mapping default-global dscp 26 switch-priority 3
# Enable ECN (WRED) on the RoCE traffic class for congestion signalling
nv set qos congestion-control default-global congestion-mode ecn
nv set qos congestion-control default-global traffic-class 3
# Jumbo frames on every RoCE-facing port
nv set interface swp1-swp32 link mtu 9216
nv config apply
PFC and ECN must be enabled fabric-wide, on every switch in the RoCE path, not just the leaf the host attaches to. A single hop that drops the pause frames or strips the DSCP turns the path lossy, and RDMA throughput collapses.
Host (NIC) configuration
Mark RoCE traffic, make the priority lossless, enable ECN, and raise the MTU.
Use the netdev backing your HCA (find it with rdma link or
ibdev2netdev); mlx5_0 ↔ enp1s0f0 below.
Trust DSCP and map DSCP 26 to priority 3
mlnx_qos -i enp1s0f0 --trust dscp
mlnx_qos -i enp1s0f0 --dscp2prio set,26,3
Make priority 3 lossless with PFC
# one 0/1 flag per priority 0..7, comma-separated: enable PFC on priority 3 only
mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0
Enable ECN for the RoCE priority
echo 1 > /sys/class/net/enp1s0f0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
Raise the MTU to enable a 4096 RoCE path MTU
ip link set enp1s0f0 mtu 9000
These commands are not persistent. Wire them into your NIC bring-up (a systemd
unit, network-scripts, netplan, or the NVIDIA mlnx-en/openibd config) so
they survive a reboot.
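One way to wire this into a bring-up script is to generate the command list from the interface, DSCP, and priority, then review it before running it. The sketch below only prints the commands shown above and executes nothing. pfc_list and roce_host_cmds are hypothetical helper names, and the MTU is fixed at 9000 as in the step above.

```sh
# Build the per-priority flag list that mlnx_qos --pfc takes (priorities 0..7).
pfc_list() (
    prio=$1
    out=""
    i=0
    while [ "$i" -le 7 ]; do
        if [ "$i" -eq "$prio" ]; then flag=1; else flag=0; fi
        out="${out}${out:+,}${flag}"
        i=$(( i + 1 ))
    done
    echo "$out"
)

# Print (do not run) the host commands for one interface, DSCP, and priority.
roce_host_cmds() (
    iface=$1; dscp=$2; prio=$3
    echo "mlnx_qos -i $iface --trust dscp"
    echo "mlnx_qos -i $iface --dscp2prio set,$dscp,$prio"
    echo "mlnx_qos -i $iface --pfc $(pfc_list "$prio")"
    echo "echo 1 > /sys/class/net/$iface/ecn/roce_np/enable/$prio"
    echo "echo 1 > /sys/class/net/$iface/ecn/roce_rp/enable/$prio"
    echo "ip link set $iface mtu 9000"
)

roce_host_cmds enp1s0f0 26 3
```

Printing rather than executing keeps the script safe to run on any host. Once the output looks right, it can be piped to a root shell from whatever bring-up mechanism you use.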
MemKV configuration
With the fabric lossless and jumbo-framed, MemKV's auto-detect resolves the
right values on its own — leave the defaults (0) and it reads the DSCP→prio
map and PFC mask from the NIC's DCB config, then stamps the matching DSCP and
SL on egress.
rdma:
  device: mlx5_0
  gid_index: 0       # 0 = auto-detect the routable RoCEv2 GID
  traffic_class: 0   # 0 = auto-detect the lossless DSCP (e.g. 104 for DSCP 26)
  service_level: 0   # 0 = take the priority paired with the DSCP (e.g. 3)
  mtu: 4096          # requires jumbo frames end-to-end (above)
Pin them explicitly when auto-detect can't see your DCB config (some
virtualized or non-mlx5 NICs) — traffic_class is the ToS byte, i.e.
DSCP << 2 (DSCP 26 → 104), and service_level is the priority (3):
rdma:
  gid_index: 3         # a routable RoCEv2 GID entry (see verification below)
  traffic_class: 104   # DSCP 26 << 2
  service_level: 3     # matches the lossless PFC priority
  mtu: 4096
The client carries the same mtu (MEMKV_RDMA_MTU, default 4096) and
must agree with the server's rdma.mtu and the fabric. Lower both ends to
1024 if you cannot enable jumbo frames — a mismatch between ends or against
the link is refused at startup.
Two markings matter independently on mlx5: the DSCP (traffic_class)
classifies the packet on the wire, but the egress scheduling class follows
the SL (service_level), not the DSCP — so the SL must equal the lossless
priority or traffic leaves on the best-effort queue even with the right DSCP.
Likewise, gid_index must point at a RoCEv2 GID: index 0 on mlx5 is the
RoCEv1 entry, which carries no IP header and therefore no DSCP, so it can never
reach the lossless queue. Auto-detect (0) avoids both traps.
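When pinning gid_index by hand, the choice can be scripted by filtering the GID table for the first RoCE v2 row that has an IPv4 address. The sketch below assumes the whitespace-separated column layout that show_gids typically prints (DEV, PORT, INDEX, GID, IPv4, VER, DEV). The sample rows are made up for illustration, and first_rocev2_ipv4_gid is a hypothetical helper, not MemKV's own auto-detect logic.

```sh
# Read a show_gids-style table on stdin and print the first GID index that is
# RoCE v2 and has an IPv4 address. Rows without an IPv4 have only 6 fields.
first_rocev2_ipv4_gid() {
    awk 'NF == 7 && $6 == "v2" { print $3; found = 1; exit }
         END { if (!found) exit 1 }'
}

# Illustrative input; on a real host use: show_gids mlx5_0 | first_rocev2_ipv4_gid
first_rocev2_ipv4_gid <<'EOF'
DEV     PORT  INDEX  GID                                      IPv4      VER  DEV
mlx5_0  1     0      fe80:0000:0000:0000:0ac0:ebff:fe12:3456            v1   enp1s0f0
mlx5_0  1     1      fe80:0000:0000:0000:0ac0:ebff:fe12:3456            v2   enp1s0f0
mlx5_0  1     2      0000:0000:0000:0000:0000:ffff:0a00:0001  10.0.0.1  v1   enp1s0f0
mlx5_0  1     3      0000:0000:0000:0000:0000:ffff:0a00:0001  10.0.0.1  v2   enp1s0f0
EOF
```

The function exits non-zero when no such row exists, which is the case where the NIC has no routable RoCEv2 GID to pin.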
Verification
Confirm the negotiated RoCE path MTU
ibv_devinfo -d mlx5_0 | grep -E 'active_mtu|state'
# active_mtu: 4096 (5) <- jumbo working; 1024 (3) means MTU too low
If this shows 1024 (3) after setting a 9000 netdev MTU, a hop in the path is
still at 1500 — recheck the switch port MTU and any bond/VLAN sub-interface.
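The same comparison MemKV makes at startup can be run ahead of time as a pre-flight check. This sketch parses the active_mtu line from ibv_devinfo output and compares it with the mtu you intend to configure. active_mtu_from_devinfo and mtu_fits are illustrative names, and the sample input is hand-written rather than captured from a real port.

```sh
# Extract the numeric active MTU from ibv_devinfo output on stdin.
active_mtu_from_devinfo() {
    awk '$1 == "active_mtu:" { print $2; exit }'
}

# Succeed when the configured mtu ($1) fits within the port active MTU ($2).
mtu_fits() {
    if [ "$1" -le "$2" ]; then
        return 0
    fi
    echo "configured mtu $1 exceeds port active MTU $2" >&2
    return 1
}

# Illustrative input; on a real host use: ibv_devinfo -d mlx5_0 | active_mtu_from_devinfo
active=$(printf '\tstate:\t\tPORT_ACTIVE (4)\n\tactive_mtu:\t\t1024 (3)\n' | active_mtu_from_devinfo)
mtu_fits 4096 "$active" || echo "lower mtu to $active or enable jumbo frames"
```

Running this on each node before starting MemKV shows which hosts still need the jumbo-frame change.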
Confirm a routable RoCEv2 GID exists
show_gids mlx5_0
# look for RoCE v2 rows with a v4 (::ffff:a.b.c.d) IP — that index is what
# auto-detect picks; pin it as gid_index if you set it by hand
Confirm DSCP trust and the lossless priority
mlnx_qos -i enp1s0f0
# trust: dscp ; pfc enabled on prio 3 ; dscp 26 -> prio 3
Confirm traffic actually rides the lossless queue
This is the definitive check — counters on the RoCE priority must move while best-effort (prio 0) stays flat during an RDMA transfer:
ethtool -S enp1s0f0 | grep -E 'prio3|prio0' > before.txt
# run a MemKV read/write workload, then:
ethtool -S enp1s0f0 | grep -E 'prio3|prio0' > after.txt
diff before.txt after.txt
# tx_prio3_bytes must climb; tx_prio0_bytes should not. If prio0 moves and
# prio3 doesn't, the SL/DSCP isn't landing traffic on the lossless queue.
Confirm raw RDMA bandwidth between two nodes
# server node
ib_write_bw -d mlx5_0 -x 3 --report_gbits
# client node
ib_write_bw -d mlx5_0 -x 3 --report_gbits <server-ip>
# -x 3 selects a RoCEv2 GID index; expect near line rate with no retries
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Startup error: configured MTU exceeds the port's active MTU | rdma.mtu larger than the link negotiated | Enable jumbo frames end-to-end so the port negotiates 4096, or set mtu to the active value |
| IBV_WC_RETRY_EXC_ERR / status 12, then downgrading engine to inline-bulk TCP | Path MTU exceeds the link MTU; data packets dropped (only cross-node) | Same as above: fix the MTU on both ends and the fabric |
| RDMA works but throughput is low and lossy | Traffic on the best-effort queue (no PFC, or wrong SL/DSCP/GID) | Verify tx_prio3 moves (above); ensure gid_index is RoCEv2 and SL matches the PFC priority |
| no RoCEv2 GID found warning, gid_index 0 | NIC has no RoCEv2 GID, or it wasn't detected | Confirm RoCEv2 GIDs with show_gids; pin gid_index to a routable v2 entry |
| Auto-detect leaves traffic_class=0 with a warning | MemKV couldn't read the NIC's DCB config | Pin traffic_class (DSCP << 2) and service_level explicitly |
A loopback or single-node test never puts a packet on the wire, so MTU and PFC misconfiguration pass locally and only surface cross-node. Always validate on a multi-node path before declaring the fabric production-ready.
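For repeated multi-node validation, the lossless-queue check from the Verification section can be reduced to a small script that computes counter deltas from the saved before.txt and after.txt files. This is a sketch under the assumption that ethtool -S prints lines of the form name: value. counter_value, counter_delta, and report_prio_deltas are hypothetical helpers.

```sh
# Print the value of one counter from a saved `ethtool -S` dump (0 if absent).
counter_value() (
    v=$(awk -v name="$1:" '$1 == name { print $2; exit }' "$2")
    echo "${v:-0}"
)

# counter_delta NAME BEFORE_FILE AFTER_FILE
counter_delta() {
    echo $(( $(counter_value "$1" "$3") - $(counter_value "$1" "$2") ))
}

# Report both deltas; succeed only if the RoCE priority actually moved.
report_prio_deltas() (
    lossless=$(counter_delta tx_prio3_bytes "$1" "$2")
    best_effort=$(counter_delta tx_prio0_bytes "$1" "$2")
    echo "tx_prio3_bytes delta: $lossless"
    echo "tx_prio0_bytes delta: $best_effort"
    [ "$lossless" -gt 0 ]
)

# Usage, after capturing the two dumps around a cross-node workload:
#   report_prio_deltas before.txt after.txt
```

The exit status only asserts that priority 3 moved. Judge the prio0 delta by eye, since unrelated best-effort traffic such as SSH can make it non-zero on a healthy host.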