Performance Tuning

This guide covers OS-level and hardware tuning for MinIO AIStor, with specific recommendations for high-bandwidth deployments using 100G/400G networks and NVMe storage.

For benchmarking tools to validate tuning changes, see Benchmarking.

Quick start with tuned

MinIO provides a tuned profile that applies CPU, memory, filesystem, and network settings:

sudo mkdir -p /usr/lib/tuned/minio/
sudo cp tuned.conf /usr/lib/tuned/minio/
sudo tuned-adm profile minio
sudo reboot

After reboot, verify the profile is active:

tuned-adm active

CPU

  • governor = performance: Locks CPUs at maximum frequency; powersave adds latency from frequency scaling.
  • force_latency = 1 (microsecond): Prevents deep C-states; CPUs stay in C0/C1 for instant wake-up.
  • energy_perf_bias = performance: Tells the hardware to prefer performance over power savings.
  • min_perf_pct = 100: Forces the Intel P-state driver to run at maximum performance.

Apply manually without tuned:

for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu"
done
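For reference, the CPU portion of a tuned profile applying these settings could look like the following. This is an illustrative sketch; the tuned.conf shipped by MinIO may differ.

```ini
# /usr/lib/tuned/minio/tuned.conf (illustrative excerpt)
[main]
summary=MinIO AIStor performance tuning

[cpu]
governor=performance
force_latency=1
energy_perf_bias=performance
min_perf_pct=100
```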

Memory

  • transparent_hugepage = madvise: Enables THP only for applications that request it; the always setting causes compaction stalls.
  • vm.swappiness = 0: Swap only under extreme memory pressure; MinIO AIStor servers should have enough RAM.
  • vm.vfs_cache_pressure = 50: Keeps inode/dentry caches longer, reducing XFS metadata re-reads.
  • vm.dirty_background_ratio = 3: Starts background writeback at 3% of RAM, preventing sudden flush bursts.
  • vm.dirty_ratio = 10: Forces synchronous writeback at 10% of RAM.
  • vm.max_map_count = 524288: MinIO AIStor memory-maps many files concurrently.
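To persist the vm.* values without tuned, a sysctl drop-in works; note that transparent_hugepage is not a sysctl and is set via /sys/kernel/mm/transparent_hugepage/enabled or a boot parameter. A sketch (the filename is arbitrary):

```ini
# /etc/sysctl.d/90-minio-memory.conf (hypothetical filename)
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
vm.max_map_count = 524288
```

Apply with sudo sysctl --system.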

XFS filesystem

  • fs.xfs.xfssyncd_centisecs = 72000: Delays the XFS sync daemon interval to 12 minutes; MinIO AIStor manages its own fsync calls.

Scheduler

  • kernel.sched_migration_cost_ns = 5000000: Raises the threshold for migrating tasks between CPUs from 0.5 ms to 5 ms, reducing cache thrashing on NUMA systems.
  • kernel.numa_balancing = 1: Allows the kernel to migrate pages closer to the CPU accessing them.
  • kernel.hung_task_timeout_secs = 85: Prevents false hung-task warnings during heavy NVMe I/O.

Network

Network settings have the largest impact for high-bandwidth deployments.

TCP buffer sizes

sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 67108864"

TCP throughput is bounded by window_size / RTT (bandwidth-delay product). At 400 Gbps with 0.1ms RTT, a single stream needs approximately 5 MB of buffer. The 64 MB maximum provides headroom for longer paths and many concurrent streams.
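The bandwidth-delay product above can be verified with simple shell arithmetic (illustrative numbers for a 400 Gbps link with 0.1 ms RTT):

```shell
# BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8
bw_bps=400000000000      # 400 Gbps
rtt_us=100               # 0.1 ms expressed in microseconds
bdp_bytes=$(( bw_bps * rtt_us / 1000000 / 8 ))
echo "$bdp_bytes"        # 5000000 bytes, i.e. ~5 MB per stream
```

Longer RTTs scale the requirement linearly, which is why the maximums leave generous headroom.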

Do not reduce the TCP buffer maximums back toward the old ~4 MB defaults on 100G or faster networks; those defaults were sized for the 10G era and bottleneck individual TCP streams.

Low-latency settings

  • net.core.busy_read = 50: Busy-polls on read() for 50 µs before sleeping; lower latency at the cost of CPU.
  • net.core.busy_poll = 50: Busy-polls on poll()/select() for 50 µs.
  • net.ipv4.tcp_timestamps = 1: Keeps timestamps enabled for accurate RTT estimation and CUBIC congestion recovery.
  • net.ipv4.tcp_slow_start_after_idle = 0: Keeps the congestion window warm on idle connections, preventing throughput drops after brief pauses.

Connection handling

  • net.core.netdev_max_backlog = 250000: Queue size for incoming packets when the CPU cannot keep up; prevents drops at 400G.
  • net.ipv4.ip_local_port_range = 1024 65535: Approximately 64K ephemeral ports instead of the default 28K.
  • net.ipv4.tcp_fin_timeout = 15: Reclaims FIN_WAIT2 connections after 15 s instead of 60 s.
  • net.ipv4.tcp_mtu_probing = 1: Detects and works around MTU black holes.
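The sysctl -w changes above are lost at reboot. If not using the tuned profile, the buffer, latency, and connection settings can be persisted together in one drop-in, for example:

```ini
# /etc/sysctl.d/90-minio-network.conf (hypothetical filename)
net.core.wmem_max = 67108864
net.core.rmem_max = 67108864
net.ipv4.tcp_rmem = 4096 1048576 67108864
net.ipv4.tcp_wmem = 4096 1048576 67108864
net.core.busy_read = 50
net.core.busy_poll = 50
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.core.netdev_max_backlog = 250000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_mtu_probing = 1
```

Apply with sudo sysctl --system.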

NIC configuration

These settings must be applied separately from the tuned profile. Test each change independently to measure the impact on your workload.

Jumbo frames (MTU)

Enable jumbo frames on all data NICs:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ip link set $NIC mtu 9000

Standard 1500-byte frames create excessive per-packet overhead at 400G line rate. A single 400 Gbps link receiving 1500-byte frames processes approximately 33 million packets per second. With 9000-byte jumbo frames, this drops to approximately 5.5 million, reducing interrupt rate and CPU overhead by 6x.
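The packet rates quoted above follow directly from frame size (ignoring Ethernet preamble and interframe gap, which make real rates slightly lower):

```shell
# pps = link bits per second / (frame bytes * 8 bits)
link_bps=400000000000                     # 400 Gbps line rate
pps_1500=$(( link_bps / (1500 * 8) ))     # standard frames
pps_9000=$(( link_bps / (9000 * 8) ))     # jumbo frames
echo "1500B: $pps_1500 pps  9000B: $pps_9000 pps"
```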

The switch must also support jumbo frames on all ports connecting to MinIO AIStor servers and clients. Configure the switch MTU to 9100 or higher and use MTU 9000 on the hosts.

Verify end-to-end MTU works:

ping -M do -s 8972 -c 3 <remote-data-ip>

Flow control (pause frames)

Disable Ethernet flow control (IEEE 802.3x pause frames) on all data NICs:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -A $NIC rx off tx off

Pause frames cause head-of-line blocking: when a NIC sends a TX pause frame, the switch pauses all traffic to that port, not just the congested flow. TCP already handles congestion control per-flow, making Ethernet-level pause frames redundant and harmful for TCP storage traffic.

Verify flow control is off:

ethtool -a $NIC

Ring buffers

Maximize NIC ring buffer depth:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -G $NIC rx 8192 tx 8192

Default ring buffer sizes (typically 1024) are too small for 100G/400G line rate. Check current and maximum sizes with ethtool -g $NIC.

On Mellanox/NVIDIA ConnectX NICs, changing private flags (for example, dropless_rq or rx_cqe_compress) triggers an internal NIC reset that silently resets ring buffers back to defaults. Always re-apply ring buffer settings after any private flag change.

IRQ affinity and NUMA

Each NIC should have its interrupts pinned to the NUMA node closest to its PCIe slot. Cross-NUMA interrupt handling adds memory access latency on every packet.

Check which NUMA node a NIC belongs to:

cat /sys/class/net/$NIC/device/numa_node

On multi-NIC servers, ensure each NIC’s IRQs stay on its local NUMA node. A 400G NIC processing packets on a remote NUMA node loses measurable throughput to cross-socket memory traffic.

Large receive offload (LRO)

Enable LRO to let the NIC aggregate multiple incoming TCP segments into larger buffers before passing them to the kernel:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -K $NIC lro on

If ethtool -k shows large-receive-offload: off [fixed], the NIC hardware does not support LRO. In that case, ensure GRO (Generic Receive Offload) is enabled instead.
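The ip and ethtool settings in this section do not survive a reboot. One way to persist them, assuming systemd-networkd .link files are available (systemd v246 or later; the filename and MAC address below are placeholders), is:

```ini
# /etc/systemd/network/10-data-nic.link (hypothetical path; match your NIC's MAC)
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
MTUBytes=9000
RxFlowControl=no
TxFlowControl=no
RxBufferSize=8192
TxBufferSize=8192
LargeReceiveOffload=yes
```

Alternatives include a udev rule or a oneshot systemd service that re-runs the ethtool commands.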

Interrupt coalescing

Use fixed interrupt coalescing instead of adaptive coalescing:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -C $NIC adaptive-rx off adaptive-tx off \
    rx-usecs 128 tx-usecs 64 \
    rx-frames 256 tx-frames 256

Adjust based on workload:

  • Lower latency: Reduce rx-usecs to 32-64, rx-frames to 64-128
  • Higher throughput: Increase rx-usecs to 256, rx-frames to 512

Congestion control

Use CUBIC (the Linux default). BBR’s bandwidth probing cycles cause throughput oscillation when many concurrent flows share the same high-bandwidth link.

Switch configuration

Configure the switch connecting MinIO AIStor servers for maximum throughput:

  • Jumbo Frames: Configure MTU 9100+ on all switch ports connected to MinIO AIStor servers and clients.
  • Multi-queue scheduling: Keep TC-to-queue mappings active. Without them, all traffic is forced through a single queue.
  • Disable PFC and flow control: PFC is designed for RoCE/RDMA lossless fabrics and is counterproductive for TCP.
  • Buffer allocation: Use the default “lossy” buffer profile. Lossless profiles waste buffer space when PFC is disabled.

Monitor switch port counters for TX_DRP, RX_DRP, and per-queue distribution to detect issues.

Connection tracking (nf_conntrack)

On dedicated storage servers, disable connection tracking entirely by unloading and blacklisting the nf_conntrack module:

sudo modprobe -r nf_conntrack
echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/no-conntrack.conf

If firewall rules require it, configure aggressive timeout settings:

sudo sysctl -w net.netfilter.nf_conntrack_max=800000
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=20

Bootloader settings

These settings require a reboot to take effect.

  • skew_tick=1: Staggers timer interrupts across CPUs to avoid thundering-herd wakeups.
  • intel_iommu=off: Disables the IOMMU (VT-d/DMAR) to remove DMA translation overhead on every NVMe and NIC transfer.

For AMD systems, use amd_iommu=off instead of intel_iommu=off.

IOMMU is useful for VM passthrough (VT-d) and device isolation. On bare-metal storage servers running only MinIO AIStor, disable it.
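On GRUB-based systems, one common way to add these parameters is via /etc/default/grub; paths and regeneration tooling vary by distro, so treat this as a sketch:

```shell
# /etc/default/grub: append the parameters to the existing GRUB_CMDLINE_LINUX value
GRUB_CMDLINE_LINUX="skew_tick=1 intel_iommu=off"
# Then regenerate the config, e.g. on RHEL-family systems:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```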

MinIO AIStor settings

Connection limits

Set MINIO_MAX_IDLE_CONNS_PER_HOST to tune the maximum number of idle and active internode HTTP connections. Increase or decrease this value to adjust concurrency between nodes.
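For example, in the server's environment file (both the path and the value 2048 are illustrative; tune against your workload and node count):

```ini
# /etc/default/minio (path varies by install)
MINIO_MAX_IDLE_CONNS_PER_HOST=2048
```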

O_DIRECT

The MINIO_API_ODIRECT setting controls whether MinIO AIStor bypasses the OS page cache for reads and writes:

  • on (default): O_DIRECT for both reads and writes.
  • read: O_DIRECT for reads only.
  • write: O_DIRECT for writes only.
  • off: Disables O_DIRECT entirely. The page cache can then grow unbounded, causing memory pressure and potential out-of-memory conditions. Never set off in production.

Thread pressure monitoring

The MINIO_API_THREAD_PRESSURE_CHECK and related settings monitor goroutine usage and return HTTP 429 from health endpoints when thread pressure exceeds the critical threshold.

Validation

After applying the profile, verify key settings:

# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# IOMMU (after reboot)
dmesg | grep -i iommu

# TCP buffers
sysctl net.core.rmem_max net.core.wmem_max

# THP
cat /sys/kernel/mm/transparent_hugepage/enabled

# Flow control (per data NIC)
ethtool -a $NIC | grep -E 'RX:|TX:'

# Ring buffers (per data NIC)
ethtool -g $NIC | grep -A4 'Current'

# LRO (per data NIC)
ethtool -k $NIC | grep large-receive-offload

# NIC packet drops (should be zero or near-zero after tuning)
ethtool -S $NIC | grep -E 'rx_discards_phy|rx_out_of_buffer|tx_pause_ctrl_phy'