Performance Tuning
This guide covers OS-level and hardware tuning for MinIO AIStor, with specific recommendations for high-bandwidth deployments using 100G/400G networks and NVMe storage.
For benchmarking tools to validate tuning changes, see Benchmarking.
Quick start with tuned
MinIO provides a tuned profile that applies CPU, memory, filesystem, and network settings.
Create the profile directory and write the profile file:
sudo mkdir -p /usr/lib/tuned/minio/
sudo tee /usr/lib/tuned/minio/tuned.conf > /dev/null <<'EOF'
[main]
summary=Maximum server performance for MinIO AIStor
[vm]
transparent_hugepage=madvise
[sysfs]
/sys/kernel/mm/transparent_hugepage/defrag=defer+madvise
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none=0
[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100
[sysctl]
fs.xfs.xfssyncd_centisecs=72000
kernel.numa_balancing=1
# Avoid swap except under severe memory pressure
vm.swappiness=0
vm.vfs_cache_pressure=50
# Start writeback at 3% memory
vm.dirty_background_ratio=3
# Force writeback at 10% memory
vm.dirty_ratio=10
# Quite a few memory map areas may be consumed
vm.max_map_count=524288
# Default is 500000 = 0.5ms, increasing to 5ms reduces
# unnecessary task migrations between CPUs on NUMA systems
kernel.sched_migration_cost_ns=5000000
# Increase hung task timeout for heavy I/O workloads
kernel.hung_task_timeout_secs=85
# TCP buffer sizes sized for high-bandwidth links (100G/400G)
# BDP at 400Gbps with 0.1ms RTT = ~5MB, so 64MB max provides
# headroom for larger RTTs and concurrent streams
net.core.wmem_max=67108864
net.core.rmem_max=67108864
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=2097152
net.ipv4.tcp_rmem="4096 1048576 67108864"
net.ipv4.tcp_wmem="4096 1048576 67108864"
net.ipv4.tcp_mem="8388608 12582912 16777216"
# Network backlog and connection queues
net.core.netdev_max_backlog=250000
net.core.somaxconn=65535
net.core.netdev_budget=600
net.core.netdev_budget_usecs=4000
# Busy polling for low latency
net.core.busy_read=50
net.core.busy_poll=50
# Disable SYN cookies on trusted networks
net.ipv4.tcp_syncookies=0
net.ipv4.tcp_max_syn_backlog=65535
# Keep timestamps enabled for accurate RTT estimation and SACK recovery.
# Without timestamps, CUBIC cannot distinguish late arrivals from losses,
# causing connections to stay degraded after congestion events.
net.ipv4.tcp_timestamps=1
# Enable selective acknowledgements and window scaling
net.ipv4.tcp_sack=1
net.ipv4.tcp_window_scaling=1
# Allocate more socket buffer space for TCP window
net.ipv4.tcp_adv_win_scale=1
# Disable RFC2861 slow-start-after-idle to keep cwnd warm
# on persistent connections
net.ipv4.tcp_slow_start_after_idle=0
# Don't cache TCP metrics from previous connections
net.ipv4.tcp_no_metrics_save=1
# Allow reuse of TIME_WAIT sockets
net.ipv4.tcp_tw_reuse=1
# Enable MTU probing to handle path MTU issues
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_base_mss=1280
# Disable IPv6 on dedicated storage networks
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
# Reclaim connections faster
net.ipv4.tcp_fin_timeout=15
# Maximize available ephemeral port range
net.ipv4.ip_local_port_range=1024 65535
[bootloader]
cmdline=skew_tick=1 intel_iommu=off amd_iommu=off iommu=pt
EOF
sudo tuned-adm profile minio
sudo reboot
The profile disables IPv6 host-wide (net.ipv6.conf.all/default/lo.disable_ipv6=1) and disables TCP SYN cookies (net.ipv4.tcp_syncookies=0).
If your deployment uses IPv6 addressing, remove the three disable_ipv6 lines before applying the profile.
See Behavior-changing defaults for the trade-offs of these and other settings.
After reboot, verify the profile is active:
tuned-adm active
CPU
| Setting | Value | Reason |
|---|---|---|
| governor | performance | Locks CPUs at maximum frequency. The powersave governor adds latency from frequency scaling. |
| force_latency | 1 (microsecond) | Prevents deep C-states. CPUs stay in C0/C1 for instant wake-up. |
| energy_perf_bias | performance | Tells the hardware to prefer performance over power savings. |
| min_perf_pct | 100 | Forces the Intel P-state driver to run at maximum performance. |
Apply manually without tuned:
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee "$cpu"
done
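To confirm that the loop above took effect on every core, a small check can count the CPUs whose governor is not performance. This is a minimal sketch rather than part of the MinIO tooling: the function name count_non_performance and the CPU_ROOT override (useful for dry runs against a copied sysfs tree) are illustrative assumptions.

```shell
# Count CPUs whose cpufreq governor is not "performance".
# CPU_ROOT is a hypothetical override for dry runs; it defaults to the real sysfs path.
count_non_performance() {
    local root="${CPU_ROOT:-/sys/devices/system/cpu}"
    local count=0 file
    for file in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$file" ] || continue   # cpufreq may be absent, for example in VMs
        if [ "$(cat "$file")" != "performance" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

echo "CPUs not using the performance governor: $(count_non_performance)"
```

A result of 0 means every CPU that exposes cpufreq is using the performance governor.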
Memory
| Setting | Value | Reason |
|---|---|---|
| transparent_hugepage | madvise | Enables THP only for applications that request it. The always setting causes compaction stalls. |
| vm.swappiness | 0 | Avoids swapping except under severe memory pressure. MinIO AIStor servers should have enough RAM. |
| vm.vfs_cache_pressure | 50 | Keeps inode/dentry caches longer, reducing XFS metadata re-reads. |
| vm.dirty_background_ratio | 3 | Start background writeback at 3% of RAM. Prevents sudden bursts. |
| vm.dirty_ratio | 10 | Force synchronous writeback at 10% of RAM. |
| vm.max_map_count | 524288 | MinIO AIStor memory-maps many files concurrently. |
| transparent_hugepage/defrag | defer+madvise | Defers THP compaction so allocation does not stall on direct reclaim. |
| transparent_hugepage/khugepaged/max_ptes_none | 0 | Prevents khugepaged from inflating memory by collapsing sparse pages into hugepages. |
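To get a feel for what the vm.dirty_background_ratio and vm.dirty_ratio percentages mean in absolute terms, the arithmetic can be sketched as below. This is an approximation under stated assumptions: the kernel applies the ratios to available (free plus reclaimable) memory rather than total RAM, so the results are upper bounds, and the 512 GiB server size is only an example.

```shell
# Approximate a writeback threshold in MiB for a given amount of memory.
# Treat the result as an upper bound: the kernel uses available memory, not total RAM.
dirty_threshold_mib() {
    local mem_mib="$1" ratio_pct="$2"
    echo $(( mem_mib * ratio_pct / 100 ))
}

mem_mib=524288   # example only: a server with 512 GiB of RAM
echo "background writeback starts near $(dirty_threshold_mib "$mem_mib" 3) MiB of dirty pages"
echo "writers are throttled near $(dirty_threshold_mib "$mem_mib" 10) MiB of dirty pages"
```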
XFS filesystem
| Setting | Value | Reason |
|---|---|---|
| fs.xfs.xfssyncd_centisecs | 72000 | Delays XFS sync daemon to 12 minutes. MinIO AIStor manages its own fsync calls. |
Scheduler
| Setting | Value | Reason |
|---|---|---|
| kernel.sched_migration_cost_ns | 5000000 | Increases the threshold for migrating tasks between CPUs from 0.5ms to 5ms. Reduces cache thrashing on NUMA systems. |
| kernel.numa_balancing | 1 | Allows the kernel to migrate pages closer to the CPU accessing them. |
| kernel.hung_task_timeout_secs | 85 | Prevents false hung-task warnings during heavy NVMe I/O. |
Network
Network settings have the largest impact for high-bandwidth deployments.
Behavior-changing defaults
The tuned profile assumes a dedicated, trusted storage network and changes two host-wide behaviors that admins should be aware of:
| Setting | Value | Effect |
|---|---|---|
| net.ipv6.conf.all.disable_ipv6 (and default, lo) | 1 | Disables IPv6 on the host. Intended for IPv4-only storage networks. |
| net.ipv4.tcp_syncookies | 0 | Disables SYN cookies. Removes SYN-flood protection in exchange for lower connection-setup overhead. |
If your deployment uses IPv6 addressing, remove the net.ipv6.conf.*.disable_ipv6 lines from tuned.conf before applying the profile.
net.ipv4.tcp_syncookies=0 is only appropriate on isolated or trusted storage networks.
SYN cookies defend against SYN-flood denial-of-service attacks; disabling them assumes the storage network is not exposed to untrusted clients.
Leave SYN cookies enabled if the server is reachable from untrusted networks.
TCP buffer sizes
sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 67108864"
Single-stream TCP throughput is bounded by window_size / RTT, so the socket buffer must hold at least the bandwidth-delay product (bandwidth × RTT) to fill the link.
At 400 Gbps with 0.1 ms RTT, a single stream needs approximately 5 MB of buffer.
The 64 MB maximum provides headroom for longer paths and many concurrent streams.
The profile also sets net.ipv4.tcp_mem="8388608 12582912 16777216" (low/pressure/high pages) so the kernel does not throttle overall TCP memory usage before the per-socket buffers above are reached.
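The buffer arithmetic above can be reproduced for other link speeds and round-trip times. The sketch below is illustrative only: the function names are made up for this example, RTT is passed in microseconds to keep the math in integers, and the tcp_mem conversion assumes 4 KiB pages, which is typical on x86_64 but not universal.

```shell
# Bandwidth-delay product in bytes.
# Gbps * 1000 * microseconds yields bits in flight; divide by 8 for bytes.
bdp_bytes() {
    local gbps="$1" rtt_us="$2"
    echo $(( gbps * 1000 * rtt_us / 8 ))
}

# Convert a tcp_mem value (counted in pages) to GiB.
# The 4096-byte default page size is an assumption; pass another size if needed.
tcp_mem_gib() {
    local pages="$1" page_size="${2:-4096}"
    echo $(( pages * page_size / 1024 / 1024 / 1024 ))
}

echo "400G, 0.1 ms RTT: $(bdp_bytes 400 100) bytes"
echo "100G, 1 ms RTT:   $(bdp_bytes 100 1000) bytes"
echo "tcp_mem high:     $(tcp_mem_gib 16777216) GiB"
```

With 4 KiB pages, the three tcp_mem values in the profile correspond to 32, 48, and 64 GiB, which in practice leaves the per-socket limits as the effective bound on most servers.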
Low-latency settings
| Setting | Value | Reason |
|---|---|---|
| net.core.busy_read | 50 | Busy-poll on read() for 50us before sleeping. Reduces latency at cost of CPU. |
| net.core.busy_poll | 50 | Busy-poll on poll()/select() for 50us. |
| net.ipv4.tcp_timestamps | 1 | Keeps timestamps enabled for accurate RTT estimation and CUBIC congestion recovery. |
| net.ipv4.tcp_slow_start_after_idle | 0 | Keeps congestion window warm on idle connections. Prevents throughput drops after brief pauses. |
Connection handling
| Setting | Value | Reason |
|---|---|---|
| net.core.netdev_max_backlog | 250000 | Queue size for incoming packets when the CPU cannot keep up. Prevents drops at 400G. |
| net.core.somaxconn | 65535 | Raises the maximum accept queue depth for listening sockets. |
| net.core.netdev_budget | 600 | Packets processed per softirq poll. Higher values improve throughput at 400G. |
| net.core.netdev_budget_usecs | 4000 | Time budget per softirq poll, paired with netdev_budget. |
| net.ipv4.tcp_tw_reuse | 1 | Allows reuse of TIME_WAIT sockets for new outbound connections. |
| net.ipv4.ip_local_port_range | 1024 65535 | Approximately 64K ephemeral ports instead of the default 28K. |
| net.ipv4.tcp_fin_timeout | 15 | Reclaim FIN_WAIT2 connections after 15s instead of 60s. |
| net.ipv4.tcp_mtu_probing | 1 | Detects and works around MTU black holes. |
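The "64K instead of 28K" figure for net.ipv4.ip_local_port_range follows from the size of each range. A short sketch of the arithmetic, assuming the usual Linux default range of 32768 to 60999 (the helper name is illustrative):

```shell
# Number of ports in an inclusive range.
port_count() {
    local low="$1" high="$2"
    echo $(( high - low + 1 ))
}

echo "default 32768-60999: $(port_count 32768 60999) ports"
echo "tuned   1024-65535:  $(port_count 1024 65535) ports"
```

Because the widened range starts at 1024, outbound connections can be assigned port numbers that a local service also expects to listen on. The kernel's net.ipv4.ip_local_reserved_ports sysctl can exclude specific ports from ephemeral allocation if that becomes a problem.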
NIC configuration
These settings must be applied separately from the tuned profile.
Test each change independently to measure the impact on your workload.
Jumbo frames (MTU)
Enable jumbo frames on all data NICs:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ip link set "$NIC" mtu 9000
Standard 1500-byte frames create excessive per-packet overhead at 400G line rate. A single 400 Gbps link receiving 1500-byte frames processes approximately 33 million packets per second. With 9000-byte jumbo frames, this drops to approximately 5.5 million, cutting the packet rate, and the interrupt and CPU overhead that scales with it, by 6x.
The switch must also support jumbo frames on all ports connecting to MinIO AIStor servers and clients. Configure the switch MTU to 9100 or higher and use MTU 9000 on the hosts.
Verify end-to-end MTU works:
ping -M do -s 8972 -c 3 <remote-data-ip>
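The packet-rate figures and the 8972-byte ping payload both come from simple arithmetic, sketched below. The helper names are illustrative, the packet rate ignores Ethernet preamble and inter-frame gap, and the payload calculation assumes IPv4 with a 20-byte header and no options.

```shell
# Packets per second at line rate for a given frame size
# (ignores preamble, inter-frame gap, and other framing overhead).
packets_per_second() {
    local gbps="$1" frame_bytes="$2"
    echo $(( gbps * 1000000000 / (frame_bytes * 8) ))
}

# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header minus the 8-byte ICMP header.
ping_payload() {
    local mtu="$1"
    echo $(( mtu - 20 - 8 ))
}

echo "1500-byte frames at 400G: $(packets_per_second 400 1500) pps"
echo "9000-byte frames at 400G: $(packets_per_second 400 9000) pps"
echo "ping -s value for MTU 9000: $(ping_payload 9000)"
```

If the ping with -M do fails while a smaller payload succeeds, some hop on the path is still using a smaller MTU.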
Flow control (pause frames)
Disable Ethernet flow control (IEEE 802.3x pause frames) on all data NICs:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -A "$NIC" rx off tx off
Pause frames cause head-of-line blocking: when a NIC sends a TX pause frame, the switch pauses all traffic to that port, not just the congested flow. TCP already handles congestion control per-flow, making Ethernet-level pause frames redundant and harmful for TCP storage traffic.
Verify flow control is off:
ethtool -a $NIC
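For scripted checks across many hosts, the ethtool -a output can be parsed rather than read by eye. The following is a sketch under the assumption that the output uses the common "RX:" and "TX:" line format; the function name pause_frames_off is made up for this example, and the sample text stands in for real ethtool output.

```shell
# Succeeds only if both RX and TX pause are reported as "off".
# Reads `ethtool -a` output from stdin so it can be run on saved output too.
pause_frames_off() {
    awk '
        $1 == "RX:" || $1 == "TX:" { seen++; if ($2 != "off") bad++ }
        END { if (seen == 2 && bad == 0) exit 0; exit 1 }
    '
}

# Sample output used for illustration; on a host, pipe `ethtool -a "$NIC"` instead.
sample='Pause parameters for eth0:
Autonegotiate:	off
RX:		off
TX:		off'

if printf '%s\n' "$sample" | pause_frames_off; then
    echo "flow control is off"
else
    echo "flow control is still enabled"
fi
```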
Ring buffers
Maximize NIC ring buffer depth:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -G "$NIC" rx 8192 tx 8192
Default ring buffer sizes (typically 1024) are too small for 100G/400G line rate.
Check current and maximum sizes with ethtool -g $NIC.
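Not every NIC supports 8192 descriptors, so one option is to read the hardware maximums from ethtool -g and use those instead of a fixed number. The sketch below assumes the usual "Pre-set maximums" / "Current hardware settings" layout of ethtool -g output; the function name ring_maximums and the sample text are illustrative.

```shell
# Print "<rx_max> <tx_max>" from `ethtool -g` output on stdin.
ring_maximums() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "cur" }
        section == "max" && $1 == "RX:" { rx = $2 }
        section == "max" && $1 == "TX:" { tx = $2 }
        END { if (rx != "" && tx != "") print rx, tx; else exit 1 }
    '
}

# Sample output used for illustration; on a host, pipe `ethtool -g "$NIC"` instead.
sample='Ring parameters for eth0:
Pre-set maximums:
RX:		8192
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	n/a
RX Jumbo:	n/a
TX:		1024'

maxes=$(printf '%s\n' "$sample" | ring_maximums)
echo "would run: ethtool -G \$NIC rx ${maxes% *} tx ${maxes#* }"
```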
Mellanox/NVIDIA ConnectX private flags
Mellanox/NVIDIA ConnectX NICs expose private flags via ethtool --show-priv-flags $NIC.
Two flags are commonly suggested for performance but are not recommended for TCP storage workloads.
Set both to off:
| Flag | Recommendation | Reason |
|---|---|---|
| dropless_rq | off | Without flow control (PFC or 802.3x pause), the NIC cannot signal backpressure upstream, so preventing receive-buffer drops instead causes internal stalls that reduce throughput. |
| rx_cqe_compress | off | Per-packet CQE decompression adds CPU overhead on every received packet, and that cost outweighs the benefit for bulk TCP transfers. |
Verify both flags are off:
ethtool --show-priv-flags $NIC | grep -E 'dropless_rq|rx_cqe_compress'
# Expected: both off
Any ethtool --set-priv-flags change triggers an internal NIC reset that briefly drops all connections on that NIC and silently resets ring buffers back to defaults.
Plan these changes during maintenance windows and always re-apply ring buffer settings afterward.
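Because the flag change and the ring buffer re-apply belong together, they can be wrapped in one step. This is a sketch, not a MinIO-provided script: the run and disable_priv_flags helpers, the APPLY switch, and the eth0 placeholder are assumptions for illustration. It prints the commands by default and only executes them when APPLY=1 is set.

```shell
# Print commands instead of running them unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Turn both private flags off, then restore ring sizes, because the NIC
# reset triggered by the flag change returns the rings to their defaults.
disable_priv_flags() {
    local nic="$1" rx="${2:-8192}" tx="${3:-8192}"
    run ethtool --set-priv-flags "$nic" dropless_rq off rx_cqe_compress off
    run ethtool -G "$nic" rx "$rx" tx "$tx"
}

# eth0 is only a placeholder for when NIC is not set.
disable_priv_flags "${NIC:-eth0}"
```

Run it as root with APPLY=1 during a maintenance window once the printed commands look right for the NIC in question.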
IRQ affinity and NUMA
Each NIC should have its interrupts pinned to the NUMA node closest to its PCIe slot. Cross-NUMA interrupt handling adds memory access latency on every packet.
Check which NUMA node a NIC belongs to:
cat /sys/class/net/$NIC/device/numa_node
On multi-NIC servers, ensure each NIC’s IRQs stay on its local NUMA node. A 400G NIC processing packets on a remote NUMA node loses measurable throughput to cross-socket memory traffic.
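One way to keep a NIC's interrupts local is to write the CPU list of its NUMA node into each IRQ's smp_affinity_list. The sketch below only prints the commands it would run. It assumes the NIC exposes its interrupts under device/msi_irqs in sysfs, and the function name irq_affinity_plan and the SYS_ROOT override are illustrative. Vendor-supplied affinity scripts, where available, may be a better fit.

```shell
# Print one "echo <cpulist> > /proc/irq/<n>/smp_affinity_list" line per NIC IRQ.
# SYS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
irq_affinity_plan() {
    local nic="$1" sys="${SYS_ROOT:-/sys}"
    local node cpus irq
    node=$(cat "$sys/class/net/$nic/device/numa_node") || return 1
    if [ "$node" -lt 0 ]; then
        echo "no NUMA node reported for $nic" >&2
        return 1
    fi
    cpus=$(cat "$sys/devices/system/node/node$node/cpulist") || return 1
    for irq in "$sys/class/net/$nic/device/msi_irqs"/*; do
        [ -e "$irq" ] || continue
        echo "echo $cpus > /proc/irq/${irq##*/}/smp_affinity_list"
    done
}

# Only runs when NIC is set, for example NIC=ens1f0.
if [ -n "${NIC:-}" ]; then
    irq_affinity_plan "$NIC" || true
fi
```

If irqbalance is running, it may rewrite these affinities, so either stop it or configure it to leave the data NICs alone before applying the printed commands as root.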
Large receive offload (LRO)
Enable LRO to let the NIC aggregate multiple incoming TCP segments into larger buffers before passing them to the kernel:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -K "$NIC" lro on
If ethtool -k shows large-receive-offload: off [fixed], the NIC hardware does not support LRO.
In that case, ensure GRO (Generic Receive Offload) is enabled instead.
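The LRO-or-GRO decision described above can be automated by inspecting the ethtool -k output. A minimal sketch, assuming the standard "feature: state [fixed]" line format; the function name receive_offload_choice and the sample text are illustrative.

```shell
# Decide which receive offload to rely on, from `ethtool -k` output on stdin.
# Prints "gro" when LRO is fixed off in hardware, otherwise "lro".
receive_offload_choice() {
    awk -F': *' '
        $1 == "large-receive-offload" { lro = $2 }
        END {
            if (lro == "") exit 1
            if (lro ~ /^off \[fixed\]/) print "gro"; else print "lro"
        }
    '
}

# Sample output used for illustration; on a host, pipe `ethtool -k "$NIC"` instead.
sample='Features for eth0:
generic-receive-offload: on
large-receive-offload: off [fixed]'

echo "use: $(printf '%s\n' "$sample" | receive_offload_choice)"
```

When the result is gro, `ethtool -K $NIC gro on` enables it if it is not already on.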
Interrupt coalescing
Use fixed interrupt coalescing instead of adaptive coalescing:
NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -C "$NIC" adaptive-rx off adaptive-tx off \
    rx-usecs 128 tx-usecs 64 \
    rx-frames 256 tx-frames 256
Adjust based on workload:
- Lower latency: Reduce rx-usecs to 32-64, rx-frames to 64-128
- Higher throughput: Increase rx-usecs to 256, rx-frames to 512
Congestion control
Use CUBIC (the Linux default). BBR’s bandwidth probing cycles cause throughput oscillation when many concurrent flows share the same high-bandwidth link.
Switch configuration
Configure the switch connecting MinIO AIStor servers for maximum throughput:
- Jumbo Frames: Configure MTU 9100+ on all switch ports connected to MinIO AIStor servers and clients.
- Multi-queue scheduling: Keep TC-to-queue mappings active. Without these, all traffic forces through a single queue.
- Disable PFC and flow control: PFC is designed for RoCE/RDMA lossless fabrics and is counterproductive for TCP.
- Buffer allocation: Use the default “lossy” buffer profile. Lossless profiles waste buffer space when PFC is disabled.
Monitor switch port counters for TX_DRP, RX_DRP, and per-queue distribution to detect issues.
Connection tracking (nf_conntrack)
On dedicated storage servers, connection tracking should ideally be disabled by not loading the nf_conntrack module:
sudo modprobe -r nf_conntrack
echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/no-conntrack.conf
If firewall rules require it, configure aggressive timeout settings:
sudo sysctl -w net.netfilter.nf_conntrack_max=800000
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=20
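To see which of the two cases applies on a given host, check whether the module is currently loaded. This sketch reads /proc/modules directly. The function name module_loaded and the MODULES_FILE override are illustrative, and a kernel with connection tracking built in rather than modular would not show up here.

```shell
# Succeeds if the named kernel module appears in the loaded-module list.
# MODULES_FILE is a hypothetical override for dry runs; it defaults to /proc/modules.
module_loaded() {
    local name="$1" list="${MODULES_FILE:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { if (found) exit 0; exit 1 }' "$list"
}

if module_loaded nf_conntrack; then
    echo "nf_conntrack is loaded: unload it or apply the timeout settings"
else
    echo "nf_conntrack is not loaded"
fi
```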
Bootloader settings
These settings require a reboot to take effect.
| Setting | Reason |
|---|---|
| skew_tick=1 | Staggers timer interrupts across CPUs to avoid thundering-herd wakeups. |
| intel_iommu=off | Disables IOMMU (VT-d/DMAR) to remove DMA translation overhead on every NVMe and NIC transfer. |
| iommu=pt | Sets IOMMU passthrough mode so devices bypass DMA remapping where the IOMMU remains active. |
For AMD systems, use amd_iommu=off instead of intel_iommu=off.
IOMMU is useful for VM passthrough (VT-d) and device isolation. On bare-metal storage servers running only MinIO AIStor, disable it.
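After the reboot, the running kernel's command line shows whether the bootloader parameters were actually picked up. A minimal sketch: the function name cmdline_has and the CMDLINE_FILE override are illustrative, and parameters are matched as whole words so that iommu=off does not match intel_iommu=off.

```shell
# Succeeds if every given parameter appears as a whole word on the kernel command line.
# CMDLINE_FILE is a hypothetical override for dry runs; it defaults to /proc/cmdline.
cmdline_has() {
    local file="${CMDLINE_FILE:-/proc/cmdline}" param
    for param in "$@"; do
        tr ' ' '\n' < "$file" | grep -qxF -- "$param" || return 1
    done
}

if cmdline_has skew_tick=1 iommu=pt; then
    echo "bootloader parameters are active"
else
    echo "bootloader parameters are missing (reboot pending?)"
fi
```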
MinIO AIStor settings
Connection limits
Set MINIO_MAX_IDLE_CONNS_PER_HOST to tune the maximum number of idle and active internode HTTP connections.
Increase or decrease this value to adjust concurrency between nodes.
O_DIRECT
The MINIO_API_ODIRECT setting controls whether MinIO AIStor bypasses the OS page cache for reads and writes.
The default is on (O_DIRECT for both reads and writes).
You can also set it to read (O_DIRECT for reads only) or write (O_DIRECT for writes only).
Setting it to off disables O_DIRECT entirely, which can cause the page cache to grow unbounded, leading to memory pressure and potential out-of-memory conditions.
Do not set this to off in production.
Thread pressure monitoring
The MINIO_API_THREAD_PRESSURE_CHECK and related settings monitor goroutine usage and return HTTP 429 from health endpoints when thread pressure exceeds the critical threshold.
Validation
After applying the profile, verify key settings:
# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# IOMMU (after reboot)
dmesg | grep -i iommu
# TCP buffers
sysctl net.core.rmem_max net.core.wmem_max
# THP
cat /sys/kernel/mm/transparent_hugepage/enabled
# Flow control (per data NIC)
ethtool -a $NIC | grep -E 'RX:|TX:'
# Ring buffers (per data NIC)
ethtool -g $NIC | grep -A4 'Current'
# LRO (per data NIC)
ethtool -k $NIC | grep large-receive-offload
# NIC packet drops (should be zero or near-zero after tuning)
ethtool -S $NIC | grep -E 'rx_discards_phy|rx_out_of_buffer|tx_pause_ctrl_phy'
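The spot checks above can be extended into a drift check that compares live sysctl values with the ones the profile sets. The following is a sketch under stated assumptions: the function name check_sysctls and the PROC_SYS_ROOT override are illustrative, only a few keys are listed as examples, and values are read from /proc/sys with whitespace normalized because the kernel separates multi-value settings with tabs.

```shell
# Compare live sysctl values with expected "key=value" lines read from stdin.
# PROC_SYS_ROOT is a hypothetical override for dry runs; it defaults to /proc/sys.
check_sysctls() {
    local root="${PROC_SYS_ROOT:-/proc/sys}" key want have path status=0
    while IFS='=' read -r key want; do
        [ -n "$key" ] || continue
        path="$root/$(echo "$key" | tr . /)"
        if [ ! -r "$path" ]; then
            echo "MISSING $key"
            status=1
            continue
        fi
        # Collapse tabs and newlines to single spaces, then trim the tail.
        have=$(tr -s '[:space:]' ' ' < "$path" | sed 's/ *$//')
        if [ "$have" = "$want" ]; then
            echo "ok $key=$have"
        else
            echo "DIFF $key: expected '$want', found '$have'"
            status=1
        fi
    done
    return "$status"
}

check_sysctls <<'EOF' || echo "some settings differ from the profile"
net.core.rmem_max=67108864
net.core.wmem_max=67108864
net.ipv4.tcp_rmem=4096 1048576 67108864
vm.swappiness=0
EOF
```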