Performance Tuning

This guide covers OS-level and hardware tuning for MinIO AIStor, with specific recommendations for high-bandwidth deployments using 100G/400G networks and NVMe storage.

For benchmarking tools to validate tuning changes, see Benchmarking.

Quick start with tuned

MinIO provides a tuned profile that applies CPU, memory, filesystem, and network settings:

sudo mkdir -p /usr/lib/tuned/minio/
sudo cp tuned.conf /usr/lib/tuned/minio/
sudo tuned-adm profile minio
sudo reboot

After reboot, verify the profile is active:

tuned-adm active

CPU

  • governor = performance: Locks CPUs at maximum frequency; powersave adds latency from frequency scaling.
  • force_latency = 1 (microsecond): Prevents deep C-states; CPUs stay in C0/C1 for instant wake-up.
  • energy_perf_bias = performance: Tells the hardware to prefer performance over power savings.
  • min_perf_pct = 100: Forces the Intel P-state driver to run at maximum performance.

Apply manually without tuned:

for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu"
done
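For reference, the CPU portion of a tuned profile applying these settings could look like the following. This is an illustrative sketch; the tuned.conf shipped by MinIO may differ.

```ini
# /usr/lib/tuned/minio/tuned.conf (illustrative excerpt)
[main]
summary=MinIO AIStor performance tuning

[cpu]
governor=performance
force_latency=1
energy_perf_bias=performance
min_perf_pct=100
```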

Memory

  • transparent_hugepage = madvise: Enables THP only for applications that request it; the always setting causes compaction stalls.
  • vm.swappiness = 0: Swap only under extreme memory pressure; MinIO AIStor servers should have enough RAM.
  • vm.vfs_cache_pressure = 50: Keeps inode/dentry caches longer, reducing XFS metadata re-reads.
  • vm.dirty_background_ratio = 3: Starts background writeback at 3% of RAM, preventing sudden flush bursts.
  • vm.dirty_ratio = 10: Forces synchronous writeback at 10% of RAM.
  • vm.max_map_count = 524288: MinIO AIStor memory-maps many files concurrently.
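To persist the vm.* values without tuned, a sysctl drop-in works; note that transparent_hugepage is not a sysctl and is set via /sys/kernel/mm/transparent_hugepage/enabled or a boot parameter. A sketch (the filename is arbitrary):

```ini
# /etc/sysctl.d/90-minio-memory.conf (hypothetical filename)
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
vm.max_map_count = 524288
```

Apply with sudo sysctl --system.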

XFS filesystem

  • fs.xfs.xfssyncd_centisecs = 72000: Delays the XFS sync daemon interval to 12 minutes; MinIO AIStor manages its own fsync calls.

Scheduler

  • kernel.sched_migration_cost_ns = 5000000: Raises the threshold for migrating tasks between CPUs from 0.5 ms to 5 ms, reducing cache thrashing on NUMA systems.
  • kernel.numa_balancing = 1: Allows the kernel to migrate pages closer to the CPU accessing them.
  • kernel.hung_task_timeout_secs = 85: Prevents false hung-task warnings during heavy NVMe I/O.

Network

Network settings have the largest impact for high-bandwidth deployments.

TCP buffer sizes

sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 67108864"

TCP throughput is bounded by window_size / RTT (bandwidth-delay product). At 400 Gbps with 0.1ms RTT, a single stream needs approximately 5 MB of buffer. The 64 MB maximum provides headroom for longer paths and many concurrent streams.
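The bandwidth-delay product above can be verified with simple shell arithmetic (illustrative numbers for a 400 Gbps link with 0.1 ms RTT):

```shell
# BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8
bw_bps=400000000000      # 400 Gbps
rtt_us=100               # 0.1 ms expressed in microseconds
bdp_bytes=$(( bw_bps * rtt_us / 1000000 / 8 ))
echo "$bdp_bytes"        # 5000000 bytes, i.e. ~5 MB per stream
```

Longer RTTs scale the requirement linearly, which is why the maximums leave generous headroom.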

Do not reduce the TCP buffer maximums back toward the old ~4 MB defaults on 100G or faster networks; those defaults were sized for the 10G era and bottleneck individual TCP streams.

Low-latency settings

  • net.core.busy_read = 50: Busy-polls on read() for 50 µs before sleeping; lower latency at the cost of CPU.
  • net.core.busy_poll = 50: Busy-polls on poll()/select() for 50 µs.
  • net.ipv4.tcp_timestamps = 1: Keeps timestamps enabled for accurate RTT estimation and CUBIC congestion recovery.
  • net.ipv4.tcp_slow_start_after_idle = 0: Keeps the congestion window warm on idle connections, preventing throughput drops after brief pauses.

Connection handling

  • net.core.netdev_max_backlog = 250000: Queue size for incoming packets when the CPU cannot keep up; prevents drops at 400G.
  • net.ipv4.ip_local_port_range = 1024 65535: Approximately 64K ephemeral ports instead of the default 28K.
  • net.ipv4.tcp_fin_timeout = 15: Reclaims FIN_WAIT2 connections after 15 s instead of 60 s.
  • net.ipv4.tcp_mtu_probing = 1: Detects and works around MTU black holes.
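The sysctl -w changes above are lost at reboot. If not using the tuned profile, the buffer, latency, and connection settings can be persisted together in one drop-in, for example:

```ini
# /etc/sysctl.d/90-minio-network.conf (hypothetical filename)
net.core.wmem_max = 67108864
net.core.rmem_max = 67108864
net.ipv4.tcp_rmem = 4096 1048576 67108864
net.ipv4.tcp_wmem = 4096 1048576 67108864
net.core.busy_read = 50
net.core.busy_poll = 50
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.core.netdev_max_backlog = 250000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_mtu_probing = 1
```

Apply with sudo sysctl --system.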

NIC configuration

These settings must be applied separately from the tuned profile. Test each change independently to measure the impact on your workload.

Jumbo frames (MTU)

Enable jumbo frames on all data NICs:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ip link set $NIC mtu 9000

Standard 1500-byte frames create excessive per-packet overhead at 400G line rate. A single 400 Gbps link receiving 1500-byte frames processes approximately 33 million packets per second. With 9000-byte jumbo frames, this drops to approximately 5.5 million, reducing interrupt rate and CPU overhead by 6x.
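The packet rates quoted above follow directly from frame size (ignoring Ethernet preamble and interframe gap, which make real rates slightly lower):

```shell
# pps = link bits per second / (frame bytes * 8 bits)
link_bps=400000000000                     # 400 Gbps line rate
pps_1500=$(( link_bps / (1500 * 8) ))     # standard frames
pps_9000=$(( link_bps / (9000 * 8) ))     # jumbo frames
echo "1500B: $pps_1500 pps  9000B: $pps_9000 pps"
```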

The switch must also support jumbo frames on all ports connecting to MinIO AIStor servers and clients. Configure the switch MTU to 9100 or higher and use MTU 9000 on the hosts.

Verify end-to-end MTU works:

ping -M do -s 8972 -c 3 <remote-data-ip>

Flow control (pause frames)

Disable Ethernet flow control (IEEE 802.3x pause frames) on all data NICs:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -A $NIC rx off tx off

Pause frames cause head-of-line blocking: when a NIC sends a TX pause frame, the switch pauses all traffic to that port, not just the congested flow. TCP already handles congestion control per-flow, making Ethernet-level pause frames redundant and harmful for TCP storage traffic.

Verify flow control is off:

ethtool -a $NIC

Ring buffers

Maximize NIC ring buffer depth:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -G $NIC rx 8192 tx 8192

Default ring buffer sizes (typically 1024) are too small for 100G/400G line rate. Check current and maximum sizes with ethtool -g $NIC.

On Mellanox/NVIDIA ConnectX NICs, changing private flags (for example, dropless_rq or rx_cqe_compress) triggers an internal NIC reset that silently resets ring buffers back to defaults. Always re-apply ring buffer settings after any private flag change.

IRQ affinity and NUMA

Each NIC should have its interrupts pinned to the NUMA node closest to its PCIe slot. Cross-NUMA interrupt handling adds memory access latency on every packet.

Check which NUMA node a NIC belongs to:

cat /sys/class/net/$NIC/device/numa_node

On multi-NIC servers, ensure each NIC’s IRQs stay on its local NUMA node. A 400G NIC processing packets on a remote NUMA node loses measurable throughput to cross-socket memory traffic.

Large receive offload (LRO)

Enable LRO to let the NIC aggregate multiple incoming TCP segments into larger buffers before passing them to the kernel:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -K $NIC lro on

If ethtool -k shows large-receive-offload: off [fixed], the NIC hardware does not support LRO. In that case, ensure GRO (Generic Receive Offload) is enabled instead.
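The ip and ethtool settings in this section do not survive a reboot. One way to persist them, assuming systemd-networkd .link files are available (systemd v246 or later; the filename and MAC address below are placeholders), is:

```ini
# /etc/systemd/network/10-data-nic.link (hypothetical path; match your NIC's MAC)
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
MTUBytes=9000
RxFlowControl=no
TxFlowControl=no
RxBufferSize=8192
TxBufferSize=8192
LargeReceiveOffload=yes
```

Alternatives include a udev rule or a oneshot systemd service that re-runs the ethtool commands.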

Interrupt coalescing

Use fixed interrupt coalescing instead of adaptive coalescing:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -C $NIC adaptive-rx off adaptive-tx off \
    rx-usecs 128 tx-usecs 64 \
    rx-frames 256 tx-frames 256

Adjust based on workload:

  • Lower latency: Reduce rx-usecs to 32-64, rx-frames to 64-128
  • Higher throughput: Increase rx-usecs to 256, rx-frames to 512

Congestion control

Use CUBIC (the Linux default). BBR’s bandwidth probing cycles cause throughput oscillation when many concurrent flows share the same high-bandwidth link.

Switch configuration

Configure the switch connecting MinIO AIStor servers for maximum throughput:

  • Jumbo Frames: Configure MTU 9100+ on all switch ports connected to MinIO AIStor servers and clients.
  • Multi-queue scheduling: Keep TC-to-queue mappings active. Without them, all traffic is forced through a single queue.
  • Disable PFC and flow control: PFC is designed for RoCE/RDMA lossless fabrics and is counterproductive for TCP.
  • Buffer allocation: Use the default “lossy” buffer profile. Lossless profiles waste buffer space when PFC is disabled.

Monitor switch port counters for TX_DRP, RX_DRP, and per-queue distribution to detect issues.

Connection tracking (nf_conntrack)

On dedicated storage servers, disable connection tracking entirely by unloading and blacklisting the nf_conntrack module:

sudo modprobe -r nf_conntrack
echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/no-conntrack.conf

If firewall rules require it, configure aggressive timeout settings:

sudo sysctl -w net.netfilter.nf_conntrack_max=800000
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=20

Bootloader settings

These settings require a reboot to take effect.

  • skew_tick=1: Staggers timer interrupts across CPUs to avoid thundering-herd wakeups.
  • intel_iommu=off: Disables the IOMMU (VT-d/DMAR) to remove DMA translation overhead on every NVMe and NIC transfer.

For AMD systems, use amd_iommu=off instead of intel_iommu=off.

IOMMU is useful for VM passthrough (VT-d) and device isolation. On bare-metal storage servers running only MinIO AIStor, disable it.
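On GRUB-based systems, one common way to add these parameters is via /etc/default/grub; paths and regeneration tooling vary by distro, so treat this as a sketch:

```shell
# /etc/default/grub: append the parameters to the existing GRUB_CMDLINE_LINUX value
GRUB_CMDLINE_LINUX="skew_tick=1 intel_iommu=off"
# Then regenerate the config, e.g. on RHEL-family systems:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```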

MinIO AIStor settings

Connection limits

Set MINIO_MAX_IDLE_CONNS_PER_HOST to tune the maximum number of idle and active internode HTTP connections. Increase or decrease this value to adjust concurrency between nodes.
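For example, in the server's environment file (both the path and the value 2048 are illustrative; tune against your workload and node count):

```ini
# /etc/default/minio (path varies by install)
MINIO_MAX_IDLE_CONNS_PER_HOST=2048
```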

O_DIRECT

The MINIO_API_ODIRECT setting controls whether MinIO AIStor bypasses the OS page cache for reads and writes:

  • on (default): O_DIRECT for both reads and writes.
  • read: O_DIRECT for reads only.
  • write: O_DIRECT for writes only.
  • off: Disables O_DIRECT entirely. The page cache can then grow unbounded, causing memory pressure and potential out-of-memory conditions. Never set off in production.

Thread pressure monitoring

The MINIO_API_THREAD_PRESSURE_CHECK and related settings monitor goroutine usage and return HTTP 429 from health endpoints when thread pressure exceeds the critical threshold.

Validation

After applying the profile, verify key settings:

# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# IOMMU (after reboot)
dmesg | grep -i iommu

# TCP buffers
sysctl net.core.rmem_max net.core.wmem_max

# THP
cat /sys/kernel/mm/transparent_hugepage/enabled

# Flow control (per data NIC)
ethtool -a $NIC | grep -E 'RX:|TX:'

# Ring buffers (per data NIC)
ethtool -g $NIC | grep -A4 'Current'

# LRO (per data NIC)
ethtool -k $NIC | grep large-receive-offload

# NIC packet drops (should be zero or near-zero after tuning)
ethtool -S $NIC | grep -E 'rx_discards_phy|rx_out_of_buffer|tx_pause_ctrl_phy'