Performance Tuning
This guide covers OS-level and hardware tuning for MinIO AIStor, with specific recommendations for high-bandwidth deployments using 100G/400G networks and NVMe storage.
For benchmarking tools to validate tuning changes, see Benchmarking.
Quick start with tuned
MinIO provides a tuned profile that applies CPU, memory, filesystem, and network settings:
sudo mkdir -p /usr/lib/tuned/minio/
sudo cp tuned.conf /usr/lib/tuned/minio/
sudo tuned-adm profile minio
sudo reboot
After reboot, verify the profile is active:
tuned-adm active
CPU
| Setting | Value | Reason |
|---|---|---|
| `governor` | `performance` | Locks CPUs at maximum frequency. `powersave` adds latency from frequency scaling. |
| `force_latency` | `1` (microsecond) | Prevents deep C-states. CPUs stay in C0/C1 for instant wake-up. |
| `energy_perf_bias` | `performance` | Tells the hardware to prefer performance over power savings. |
| `min_perf_pct` | `100` | Forces the Intel P-state driver to run at maximum performance. |
Apply manually without tuned:
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee "$cpu"
done
Memory
| Setting | Value | Reason |
|---|---|---|
| `transparent_hugepage` | `madvise` | THP only for applications that request it. `always` causes compaction stalls. |
| `vm.swappiness` | `0` | Never swap. MinIO AIStor servers should have enough RAM. |
| `vm.vfs_cache_pressure` | `50` | Keeps inode/dentry caches longer, reducing XFS metadata re-reads. |
| `vm.dirty_background_ratio` | `3` | Starts background writeback at 3% of RAM. Prevents sudden bursts. |
| `vm.dirty_ratio` | `10` | Forces synchronous writeback at 10% of RAM. |
| `vm.max_map_count` | `524288` | MinIO AIStor memory-maps many files concurrently. |
XFS filesystem
| Setting | Value | Reason |
|---|---|---|
| `fs.xfs.xfssyncd_centisecs` | `72000` | Delays the XFS sync daemon to 12 minutes. MinIO AIStor manages its own fsync calls. |
Scheduler
| Setting | Value | Reason |
|---|---|---|
| `kernel.sched_migration_cost_ns` | `5000000` | Raises the threshold for migrating tasks between CPUs from 0.5 ms to 5 ms. Reduces cache thrashing on NUMA systems. |
| `kernel.numa_balancing` | `1` | Allows the kernel to migrate pages closer to the CPU accessing them. |
| `kernel.hung_task_timeout_secs` | `85` | Prevents false hung-task warnings during heavy NVMe I/O. |
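Without tuned, the same scheduler values can be applied as sysctls (persist them via `/etc/sysctl.d/`):

```shell
# Apply the scheduler settings from the table above without tuned.
sudo sysctl -w kernel.sched_migration_cost_ns=5000000
sudo sysctl -w kernel.numa_balancing=1
sudo sysctl -w kernel.hung_task_timeout_secs=85
```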
Network
Network settings have the largest impact for high-bandwidth deployments.
TCP buffer sizes
sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 67108864"
TCP throughput is bounded by window_size / RTT (bandwidth-delay product).
At 400 Gbps with 0.1ms RTT, a single stream needs approximately 5 MB of buffer.
The 64 MB maximum provides headroom for longer paths and many concurrent streams.
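The per-stream sizing above can be checked with a quick bandwidth-delay product calculation (numbers taken from the text; substitute your own link speed and RTT):

```shell
# Bandwidth-delay product: bytes of buffer one stream needs to keep the pipe full.
awk 'BEGIN {
  bw_bits_per_s = 400e9   # 400 Gbps link
  rtt_s         = 0.0001  # 0.1 ms round-trip time
  bdp_bytes = bw_bits_per_s * rtt_s / 8
  printf "%.0f bytes (~%.1f MB)\n", bdp_bytes, bdp_bytes / 1e6
}'
# prints: 5000000 bytes (~5.0 MB)
```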
Low-latency settings
| Setting | Value | Reason |
|---|---|---|
| `net.core.busy_read` | `50` | Busy-polls on `read()` for 50 µs before sleeping. Reduces latency at the cost of CPU. |
| `net.core.busy_poll` | `50` | Busy-polls on `poll()`/`select()` for 50 µs. |
| `net.ipv4.tcp_timestamps` | `1` | Keeps timestamps enabled for accurate RTT estimation and CUBIC congestion recovery. |
| `net.ipv4.tcp_slow_start_after_idle` | `0` | Keeps the congestion window warm on idle connections. Prevents throughput drops after brief pauses. |
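The same low-latency values can be applied directly as sysctls:

```shell
# Apply the low-latency settings from the table above.
sudo sysctl -w net.core.busy_read=50
sudo sysctl -w net.core.busy_poll=50
sudo sysctl -w net.ipv4.tcp_timestamps=1
sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
```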
Connection handling
| Setting | Value | Reason |
|---|---|---|
| `net.core.netdev_max_backlog` | `250000` | Queue size for incoming packets when the CPU cannot keep up. Prevents drops at 400G. |
| `net.ipv4.ip_local_port_range` | `1024 65535` | Approximately 64K ephemeral ports instead of the default 28K. |
| `net.ipv4.tcp_fin_timeout` | `15` | Reclaims FIN_WAIT2 connections after 15 s instead of 60 s. |
| `net.ipv4.tcp_mtu_probing` | `1` | Detects and works around MTU black holes. |
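The connection-handling values can likewise be applied as sysctls:

```shell
# Apply the connection-handling settings from the table above.
sudo sysctl -w net.core.netdev_max_backlog=250000
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w net.ipv4.tcp_fin_timeout=15
sudo sysctl -w net.ipv4.tcp_mtu_probing=1
```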
NIC configuration
These settings must be applied separately from the tuned profile.
Test each change independently to measure the impact on your workload.
Jumbo frames (MTU)
Enable jumbo frames on all data NICs:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
ip link set $NIC mtu 9000
Standard 1500-byte frames create excessive per-packet overhead at 400G line rate. A single 400 Gbps link receiving 1500-byte frames processes approximately 33 million packets per second. With 9000-byte jumbo frames, this drops to approximately 5.5 million, reducing interrupt rate and CPU overhead by 6x.
The switch must also support jumbo frames on all ports connecting to MinIO AIStor servers and clients. Configure the switch MTU to 9100 or higher and use MTU 9000 on the hosts.
Verify end-to-end MTU works:
ping -M do -s 8972 -c 3 <remote-data-ip>
Flow control (pause frames)
Disable Ethernet flow control (IEEE 802.3x pause frames) on all data NICs:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
ethtool -A $NIC rx off tx off
Pause frames cause head-of-line blocking: when a NIC sends a TX pause frame, the switch pauses all traffic to that port, not just the congested flow. TCP already handles congestion control per-flow, making Ethernet-level pause frames redundant and harmful for TCP storage traffic.
Verify flow control is off:
ethtool -a $NIC
Ring buffers
Maximize NIC ring buffer depth:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
ethtool -G $NIC rx 8192 tx 8192
Default ring buffer sizes (typically 1024) are too small for 100G/400G line rate.
Check current and maximum sizes with ethtool -g $NIC.
On some NICs, changing a private flag (for example, dropless_rq or rx_cqe_compress) triggers an internal NIC reset that silently resets ring buffers back to defaults.
Always re-apply ring buffer settings after any private flag change.
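For example, assuming a driver that exposes the rx_cqe_compress flag mentioned above (list the flags your driver actually supports with `ethtool --show-priv-flags $NIC`):

```shell
# Toggling a private flag can silently reset ring buffers; re-apply them afterwards.
ethtool --set-priv-flags $NIC rx_cqe_compress on
ethtool -G $NIC rx 8192 tx 8192
ethtool -g $NIC | grep -A4 'Current'   # confirm the sizes stuck
```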
IRQ affinity and NUMA
Each NIC should have its interrupts pinned to the NUMA node closest to its PCIe slot. Cross-NUMA interrupt handling adds memory access latency on every packet.
Check which NUMA node a NIC belongs to:
cat /sys/class/net/$NIC/device/numa_node
On multi-NIC servers, ensure each NIC’s IRQs stay on its local NUMA node. A 400G NIC processing packets on a remote NUMA node loses measurable throughput to cross-socket memory traffic.
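A minimal sketch of pinning a NIC's IRQs to its local node, assuming the driver names its interrupts after the interface (common but not universal) and that irqbalance is stopped so it cannot undo the pinning:

```shell
# Pin every IRQ whose name contains the NIC's interface name to the CPUs
# on the NIC's local NUMA node.
sudo systemctl stop irqbalance
node=$(cat /sys/class/net/$NIC/device/numa_node)
cpus=$(cat /sys/devices/system/node/node${node}/cpulist)
for irq in $(awk -v nic="$NIC" '$0 ~ nic {sub(":","",$1); print $1}' /proc/interrupts); do
  echo "$cpus" | sudo tee /proc/irq/$irq/smp_affinity_list
done
```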
Large receive offload (LRO)
Enable LRO to let the NIC aggregate multiple incoming TCP segments into larger buffers before passing them to the kernel:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
ethtool -K $NIC lro on
If ethtool -k shows large-receive-offload: off [fixed], the NIC hardware does not support LRO.
In that case, ensure GRO (Generic Receive Offload) is enabled instead.
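GRO is implemented in software, so it is available even when the hardware cannot do LRO:

```shell
# Fall back to GRO when LRO is fixed-off, then confirm it is active.
ethtool -K $NIC gro on
ethtool -k $NIC | grep generic-receive-offload
```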
Interrupt coalescing
Use fixed interrupt coalescing instead of adaptive coalescing:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
ethtool -C $NIC adaptive-rx off adaptive-tx off \
rx-usecs 128 tx-usecs 64 \
rx-frames 256 tx-frames 256
Adjust based on workload:
- Lower latency: Reduce `rx-usecs` to 32-64, `rx-frames` to 64-128
- Higher throughput: Increase `rx-usecs` to 256, `rx-frames` to 512
Congestion control
Use CUBIC (the Linux default). BBR’s bandwidth probing cycles cause throughput oscillation when many concurrent flows share the same high-bandwidth link.
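To confirm CUBIC is active (and set it explicitly if another algorithm was configured):

```shell
# Check the active congestion control algorithm; CUBIC is the Linux default.
sysctl net.ipv4.tcp_congestion_control
sudo sysctl -w net.ipv4.tcp_congestion_control=cubic
```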
Switch configuration
Configure the switch connecting MinIO AIStor servers for maximum throughput:
- Jumbo Frames: Configure MTU 9100+ on all switch ports connected to MinIO AIStor servers and clients.
- Multi-queue scheduling: Keep TC-to-queue mappings active. Without these, all traffic forces through a single queue.
- Disable PFC and flow control: PFC is designed for RoCE/RDMA lossless fabrics and is counterproductive for TCP.
- Buffer allocation: Use the default “lossy” buffer profile. Lossless profiles waste buffer space when PFC is disabled.
Monitor switch port counters for TX_DRP, RX_DRP, and per-queue distribution to detect issues.
Connection tracking (nf_conntrack)
On dedicated storage servers, disable connection tracking entirely by unloading the nf_conntrack module and preventing it from loading at boot:
sudo modprobe -r nf_conntrack
echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/no-conntrack.conf
If firewall rules require it, configure aggressive timeout settings:
sudo sysctl -w net.netfilter.nf_conntrack_max=800000
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=20
Bootloader settings
These settings require a reboot to take effect.
| Setting | Reason |
|---|---|
| `skew_tick=1` | Staggers timer interrupts across CPUs to avoid thundering-herd wakeups. |
| `intel_iommu=off` | Disables the IOMMU (VT-d/DMAR) to remove DMA translation overhead on every NVMe and NIC transfer. |
For AMD systems, use amd_iommu=off instead of intel_iommu=off.
IOMMU is useful for VM passthrough (VT-d) and device isolation. On bare-metal storage servers running only MinIO AIStor, disable it.
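One way to add the kernel arguments on RHEL-family systems is with grubby; on Debian/Ubuntu, append them to `GRUB_CMDLINE_LINUX` in `/etc/default/grub` and run `update-grub` instead:

```shell
# Add the bootloader arguments from the table above (Intel example;
# use amd_iommu=off on AMD systems), then reboot to apply.
sudo grubby --update-kernel=ALL --args="skew_tick=1 intel_iommu=off"
sudo reboot
```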
MinIO AIStor settings
Connection limits
Set MINIO_MAX_IDLE_CONNS_PER_HOST to tune the maximum number of idle and active internode HTTP connections.
Increase or decrease this value to adjust concurrency between nodes.
O_DIRECT
The MINIO_API_ODIRECT setting controls whether MinIO AIStor bypasses the OS page cache for reads and writes.
The default is on (O_DIRECT for both reads and writes).
You can also set it to read (O_DIRECT for reads only) or write (O_DIRECT for writes only).
Setting it to off disables O_DIRECT entirely, which can cause the page cache to grow unbounded, leading to memory pressure and potential out-of-memory conditions.
Do not set this to off in production.
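For example, in the server's environment file (the `on` value shown is the default, spelled out for clarity):

```shell
# Keep O_DIRECT enabled for both reads and writes (the default).
# Alternatives: read (reads only), write (writes only); avoid off in production.
MINIO_API_ODIRECT=on
```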
Thread pressure monitoring
The MINIO_API_THREAD_PRESSURE_CHECK and related settings monitor goroutine usage and return HTTP 429 from health endpoints when thread pressure exceeds the critical threshold.
Validation
After applying the profile, verify key settings:
# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# IOMMU (after reboot)
dmesg | grep -i iommu
# TCP buffers
sysctl net.core.rmem_max net.core.wmem_max
# THP
cat /sys/kernel/mm/transparent_hugepage/enabled
# Flow control (per data NIC)
ethtool -a $NIC | grep -E 'RX:|TX:'
# Ring buffers (per data NIC)
ethtool -g $NIC | grep -A4 'Current'
# LRO (per data NIC)
ethtool -k $NIC | grep large-receive-offload
# NIC packet drops (should be zero or near-zero after tuning)
ethtool -S $NIC | grep -E 'rx_discards_phy|rx_out_of_buffer|tx_pause_ctrl_phy'