Performance Tuning

This guide covers OS-level and hardware tuning for MinIO AIStor, with specific recommendations for high-bandwidth deployments using 100G/400G networks and NVMe storage.

For benchmarking tools to validate tuning changes, see Benchmarking.

Quick start with tuned

MinIO provides a tuned profile that applies CPU, memory, filesystem, and network settings.

Create the profile directory and write the profile file:

sudo mkdir -p /usr/lib/tuned/minio/
sudo tee /usr/lib/tuned/minio/tuned.conf > /dev/null <<'EOF'
[main]
summary=Maximum server performance for MinIO AIStor

[vm]
transparent_hugepage=madvise

[sysfs]
/sys/kernel/mm/transparent_hugepage/defrag=defer+madvise
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none=0

[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100

[sysctl]
fs.xfs.xfssyncd_centisecs=72000
kernel.numa_balancing=1

# Do not use swap at all
vm.swappiness=0
vm.vfs_cache_pressure=50

# Start writeback at 3% memory
vm.dirty_background_ratio=3
# Force writeback at 10% memory
vm.dirty_ratio=10

# Quite a few memory map areas may be consumed
vm.max_map_count=524288

# Default is 500000 = 0.5ms, increasing to 5ms reduces
# unnecessary task migrations between CPUs on NUMA systems
kernel.sched_migration_cost_ns=5000000

# Increase hung task timeout for heavy I/O workloads
kernel.hung_task_timeout_secs=85

# TCP buffer sizes sized for high-bandwidth links (100G/400G)
# BDP at 400Gbps with 0.1ms RTT = ~5MB, so 64MB max provides
# headroom for larger RTTs and concurrent streams
net.core.wmem_max=67108864
net.core.rmem_max=67108864
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=2097152
net.ipv4.tcp_rmem="4096 1048576 67108864"
net.ipv4.tcp_wmem="4096 1048576 67108864"
net.ipv4.tcp_mem="8388608 12582912 16777216"

# Network backlog and connection queues
net.core.netdev_max_backlog=250000
net.core.somaxconn=65535
net.core.netdev_budget=600
net.core.netdev_budget_usecs=4000

# Busy polling for low latency
net.core.busy_read=50
net.core.busy_poll=50

# Disable SYN cookies on trusted networks
net.ipv4.tcp_syncookies=0
net.ipv4.tcp_max_syn_backlog=65535

# Keep timestamps enabled for accurate RTT estimation and SACK recovery.
# Without timestamps, CUBIC cannot distinguish late arrivals from losses,
# causing connections to stay degraded after congestion events.
net.ipv4.tcp_timestamps=1

# Enable selective acknowledgements and window scaling
net.ipv4.tcp_sack=1
net.ipv4.tcp_window_scaling=1

# Allocate more socket buffer space for TCP window
net.ipv4.tcp_adv_win_scale=1

# Disable RFC2861 slow-start-after-idle to keep cwnd warm
# on persistent connections
net.ipv4.tcp_slow_start_after_idle=0

# Don't cache TCP metrics from previous connections
net.ipv4.tcp_no_metrics_save=1

# Allow reuse of TIME_WAIT sockets
net.ipv4.tcp_tw_reuse=1

# Enable MTU probing to handle path MTU issues
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_base_mss=1280

# Disable IPv6 on dedicated storage networks
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

# Reclaim connections faster
net.ipv4.tcp_fin_timeout=15

# Maximize available ephemeral port range
net.ipv4.ip_local_port_range=1024 65535

[bootloader]
cmdline=skew_tick=1 intel_iommu=off amd_iommu=off iommu=pt
EOF
sudo tuned-adm profile minio
sudo reboot

Warning: This profile disables IPv6 host-wide (net.ipv6.conf.all/default/lo.disable_ipv6=1) and disables TCP SYN cookies (net.ipv4.tcp_syncookies=0). If your deployment uses IPv6 addressing, remove the three disable_ipv6 lines before applying the profile. See Behavior-changing defaults for the trade-offs of these and other settings.

After reboot, verify the profile is active:

tuned-adm active

CPU

| Setting | Value | Reason |
| --- | --- | --- |
| governor | performance | Locks CPUs at maximum frequency. powersave adds latency from frequency scaling. |
| force_latency | 1 (microsecond) | Prevents deep C-states. CPUs stay in C0/C1 for instant wake-up. |
| energy_perf_bias | performance | Tells the hardware to prefer performance over power savings. |
| min_perf_pct | 100 | Forces Intel P-state driver to run at maximum performance. |

Apply manually without tuned:

for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu"
done
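
The loop above writes the governor for every CPU, so it can be worth confirming that every CPU accepted it, not only cpu0. The following is a minimal sketch of such a check. The function name count_non_performance and its optional sysfs-root argument are illustrative assumptions, not part of MinIO AIStor or tuned.

```shell
# Count CPUs whose cpufreq governor is not "performance".
# The optional first argument is a hypothetical override of the sysfs root
# (default /sys) so the check can be tried against a test directory.
count_non_performance() {
    local sys_root="${1:-/sys}" f count=0
    for f in "$sys_root"/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # skip CPUs without a cpufreq entry
        [ "$(cat "$f")" = "performance" ] || count=$((count + 1))
    done
    echo "$count"
}

# Usage: count_non_performance    (0 means every CPU reports "performance")
```

A non-zero result suggests that some CPUs did not pick up the setting, for example because the write was rejected or another service changed the governor afterward.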

Memory

| Setting | Value | Reason |
| --- | --- | --- |
| transparent_hugepage | madvise | THP only for applications that request it. always causes compaction stalls. |
| vm.swappiness | 0 | Never swap. MinIO AIStor servers should have enough RAM. |
| vm.vfs_cache_pressure | 50 | Keeps inode/dentry caches longer, reducing XFS metadata re-reads. |
| vm.dirty_background_ratio | 3 | Start background writeback at 3% of RAM. Prevents sudden bursts. |
| vm.dirty_ratio | 10 | Force synchronous writeback at 10% of RAM. |
| vm.max_map_count | 524288 | MinIO AIStor memory-maps many files concurrently. |
| transparent_hugepage/defrag | defer+madvise | Defers THP compaction so allocation does not stall on direct reclaim. |
| transparent_hugepage/khugepaged/max_ptes_none | 0 | Prevents khugepaged from inflating memory by collapsing sparse pages into hugepages. |

XFS filesystem

| Setting | Value | Reason |
| --- | --- | --- |
| fs.xfs.xfssyncd_centisecs | 72000 | Delays XFS sync daemon to 12 minutes. MinIO AIStor manages its own fsync calls. |

Scheduler

| Setting | Value | Reason |
| --- | --- | --- |
| kernel.sched_migration_cost_ns | 5000000 | Increases the threshold for migrating tasks between CPUs from 0.5ms to 5ms. Reduces cache thrashing on NUMA systems. |
| kernel.numa_balancing | 1 | Allows the kernel to migrate pages closer to the CPU accessing them. |
| kernel.hung_task_timeout_secs | 85 | Prevents false hung-task warnings during heavy NVMe I/O. |

Network

Network settings have the largest impact for high-bandwidth deployments.

Behavior-changing defaults

The tuned profile assumes a dedicated, trusted storage network and changes two host-wide behaviors that admins should be aware of:

| Setting | Value | Effect |
| --- | --- | --- |
| net.ipv6.conf.all.disable_ipv6 (and default, lo) | 1 | Disables IPv6 on the host. Intended for IPv4-only storage networks. |
| net.ipv4.tcp_syncookies | 0 | Disables SYN cookies. Removes SYN-flood protection in exchange for lower connection-setup overhead. |

Note: The profile disables IPv6 host-wide. If your deployment uses IPv6 addressing (MinIO AIStor supports both IPv4 and IPv6), remove the three net.ipv6.conf.*.disable_ipv6 lines from tuned.conf before applying the profile.

Warning: net.ipv4.tcp_syncookies=0 is only appropriate on isolated or trusted storage networks. SYN cookies defend against SYN-flood denial-of-service attacks; disabling them assumes the storage network is not exposed to untrusted clients. Leave SYN cookies enabled if the server is reachable from untrusted networks.

TCP buffer sizes

sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 1048576 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 1048576 67108864"

TCP throughput is bounded by window_size / RTT (bandwidth-delay product). At 400 Gbps with 0.1ms RTT, a single stream needs approximately 5 MB of buffer. The 64 MB maximum provides headroom for longer paths and many concurrent streams.
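
The bandwidth-delay product arithmetic above can be reproduced for other link speeds and round-trip times. This is a minimal sketch, and the helper name bdp_bytes is an illustrative assumption. It takes the link speed in Gbps and the RTT in microseconds and prints the per-stream buffer requirement in bytes.

```shell
# Bandwidth-delay product in bytes.
#   $1 = link speed in Gbps
#   $2 = round-trip time in microseconds
bdp_bytes() {
    # 1 Gbps = 10^9 bits/s = 125 bytes per microsecond
    echo $(( $1 * 125 * $2 ))
}

bdp_bytes 400 100     # 400 Gbps, 0.1 ms RTT -> 5000000  (about 5 MB)
bdp_bytes 400 1000    # 400 Gbps, 1 ms RTT   -> 50000000 (about 50 MB)
```

The second call shows why the 64 MB maximum leaves headroom: at a 1 ms RTT, a single 400 Gbps stream already needs roughly 50 MB.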

The profile also sets net.ipv4.tcp_mem="8388608 12582912 16777216" (low/pressure/high pages) so the kernel does not throttle overall TCP memory usage before the per-socket buffers above are reached.
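
Because tcp_mem is expressed in memory pages rather than bytes, the values are easier to reason about once converted. The sketch below assumes a hypothetical helper named pages_to_gib. It uses the system page size from getconf unless a page size is passed explicitly as the second argument.

```shell
# Convert a page count (as used by net.ipv4.tcp_mem) to whole GiB.
#   $1 = number of pages
#   $2 = page size in bytes (optional, defaults to the system page size)
pages_to_gib() {
    local pages="$1" page_size="${2:-$(getconf PAGESIZE)}"
    echo $(( pages * page_size / 1024 / 1024 / 1024 ))
}

# With 4096-byte pages, the profile's low/pressure/high values are:
pages_to_gib 8388608 4096     # -> 32
pages_to_gib 12582912 4096    # -> 48
pages_to_gib 16777216 4096    # -> 64
```

On hosts with less RAM than these thresholds, the limits are effectively never reached, so it may be worth checking them against installed memory.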

Do not reduce the TCP buffer maximums below the profile's 64 MB on 100G or faster networks. The older 4 MB values were sized for 10G and bottleneck individual TCP streams.

Low-latency settings

| Setting | Value | Reason |
| --- | --- | --- |
| net.core.busy_read | 50 | Busy-poll on read() for 50us before sleeping. Reduces latency at cost of CPU. |
| net.core.busy_poll | 50 | Busy-poll on poll()/select() for 50us. |
| net.ipv4.tcp_timestamps | 1 | Keeps timestamps enabled for accurate RTT estimation and CUBIC congestion recovery. |
| net.ipv4.tcp_slow_start_after_idle | 0 | Keeps congestion window warm on idle connections. Prevents throughput drops after brief pauses. |

Connection handling

| Setting | Value | Reason |
| --- | --- | --- |
| net.core.netdev_max_backlog | 250000 | Queue size for incoming packets when the CPU cannot keep up. Prevents drops at 400G. |
| net.core.somaxconn | 65535 | Raises the maximum accept queue depth for listening sockets. |
| net.core.netdev_budget | 600 | Packets processed per softirq poll. Higher values improve throughput at 400G. |
| net.core.netdev_budget_usecs | 4000 | Time budget per softirq poll, paired with netdev_budget. |
| net.ipv4.tcp_tw_reuse | 1 | Allows reuse of TIME_WAIT sockets for new outbound connections. |
| net.ipv4.ip_local_port_range | 1024 65535 | Approximately 64K ephemeral ports instead of the default 28K. |
| net.ipv4.tcp_fin_timeout | 15 | Reclaim FIN_WAIT2 connections after 15s instead of 60s. |
| net.ipv4.tcp_mtu_probing | 1 | Detects and works around MTU black holes. |

NIC configuration

These settings must be applied separately from the tuned profile. Test each change independently to measure the impact on your workload.

Jumbo frames (MTU)

Enable jumbo frames on all data NICs:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ip link set "$NIC" mtu 9000

Standard 1500-byte frames create excessive per-packet overhead at 400G line rate. A single 400 Gbps link receiving 1500-byte frames processes approximately 33 million packets per second. With 9000-byte jumbo frames, this drops to approximately 5.5 million, reducing interrupt rate and CPU overhead by 6x.
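
The packet-rate figures above follow from dividing the link rate by the frame size. This minimal sketch uses the same simplification as the text and ignores Ethernet preamble, headers, and inter-frame gap. The helper name packets_per_second is an illustrative assumption.

```shell
# Approximate packets per second at line rate.
#   $1 = link speed in Gbps
#   $2 = frame size in bytes
packets_per_second() {
    echo $(( $1 * 1000000000 / 8 / $2 ))
}

packets_per_second 400 1500    # -> 33333333 (about 33 million)
packets_per_second 400 9000    # -> 5555555  (about 5.5 million)
```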

The switch must also support jumbo frames on all ports connecting to MinIO AIStor servers and clients. Configure the switch MTU to 9100 or higher and use MTU 9000 on the hosts.

Verify end-to-end MTU works:

ping -M do -s 8972 -c 3 <remote-data-ip>
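
The 8972-byte payload in the ping command is the 9000-byte MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. If the hosts use a different MTU, the payload can be derived the same way. The helper name ping_payload is an illustrative assumption, and the calculation assumes IPv4 without IP options.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
#   $1 = interface MTU in bytes
ping_payload() {
    # 20 bytes IPv4 header (no options) + 8 bytes ICMP header
    echo $(( $1 - 20 - 8 ))
}

ping_payload 9000    # -> 8972
ping_payload 1500    # -> 1472
```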

Flow control (pause frames)

Disable Ethernet flow control (IEEE 802.3x pause frames) on all data NICs:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -A "$NIC" rx off tx off

Pause frames cause head-of-line blocking: when a NIC sends a TX pause frame, the switch pauses all traffic to that port, not just the congested flow. TCP already handles congestion control per-flow, making Ethernet-level pause frames redundant and harmful for TCP storage traffic.

Verify flow control is off:

ethtool -a $NIC

Ring buffers

Maximize NIC ring buffer depth:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -G "$NIC" rx 8192 tx 8192

Default ring buffer sizes (typically 1024) are too small for 100G/400G line rate. Check current and maximum sizes with ethtool -g $NIC.
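
Not every NIC supports 8192 descriptors, so the maximum reported by ethtool -g can be used instead of a fixed number. The following is a minimal sketch. The helper name ring_max is an illustrative assumption, and the parsing assumes the usual ethtool -g layout with a "Pre-set maximums:" section followed by "Current hardware settings:".

```shell
# Read `ethtool -g` output on stdin and print the pre-set maximum
# RX and TX ring sizes in the form expected by `ethtool -G`.
ring_max() {
    awk '
        /^Pre-set maximums:/          { in_max = 1; next }
        /^Current hardware settings:/ { in_max = 0 }
        in_max && $1 == "RX:"         { rx = $2 }
        in_max && $1 == "TX:"         { tx = $2 }
        END                           { print "rx", rx, "tx", tx }
    '
}

# Usage (the command substitution is left unquoted on purpose so it
# splits into separate arguments):
#   sudo ethtool -G "$NIC" $(ethtool -g "$NIC" | ring_max)
```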

Mellanox/NVIDIA ConnectX private flags

Mellanox/NVIDIA ConnectX NICs expose private flags via ethtool --show-priv-flags $NIC. Two flags are commonly suggested for performance but are not recommended for TCP storage workloads. Set both to off:

| Flag | Recommendation | Reason |
| --- | --- | --- |
| dropless_rq | off | Without flow control (PFC or 802.3x pause), the NIC cannot signal backpressure upstream, so preventing receive-buffer drops instead causes internal stalls that reduce throughput. |
| rx_cqe_compress | off | Per-packet CQE decompression adds CPU overhead on every received packet, and that cost outweighs the benefit for bulk TCP transfers. |

Verify both flags are off:

ethtool --show-priv-flags $NIC | grep -E 'dropless_rq|rx_cqe_compress'
# Expected: both off

Warning: Any ethtool --set-priv-flags change triggers an internal NIC reset that briefly drops all connections on that NIC and silently resets ring buffers back to defaults. Plan these changes during maintenance windows and always re-apply ring buffer settings afterward.

IRQ affinity and NUMA

Each NIC should have its interrupts pinned to the NUMA node closest to its PCIe slot. Cross-NUMA interrupt handling adds memory access latency on every packet.

Check which NUMA node a NIC belongs to:

cat /sys/class/net/$NIC/device/numa_node

On multi-NIC servers, ensure each NIC’s IRQs stay on its local NUMA node. A 400G NIC processing packets on a remote NUMA node loses measurable throughput to cross-socket memory traffic.
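
One way to keep a NIC's interrupts on its local node is to write the device's local CPU list into each of its IRQs. The sketch below only prints the commands it would run, so the plan can be reviewed first. The function name plan_irq_affinity and its optional sysfs-root argument are illustrative assumptions. It relies on the standard sysfs entries local_cpulist and msi_irqs under the NIC's PCI device.

```shell
# Print the commands that would pin a NIC's MSI/MSI-X interrupts to the
# CPUs local to its PCIe device. Dry run only: nothing is written.
#   $1 = interface name
#   $2 = sysfs root (optional, hypothetical override, default /sys)
plan_irq_affinity() {
    local nic="$1" sys_root="${2:-/sys}"
    local dev="$sys_root/class/net/$nic/device"
    local cpus irq
    cpus=$(cat "$dev/local_cpulist") || return 1
    for irq in "$dev"/msi_irqs/*; do
        [ -e "$irq" ] || continue      # device exposes no MSI IRQs
        echo "echo $cpus > /proc/irq/$(basename "$irq")/smp_affinity_list"
    done
}

# Usage: plan_irq_affinity "$NIC"
```

The printed lines must run as root to take effect. If the irqbalance service is running, it may rewrite these affinities later, so it is usually stopped or configured to leave the NIC's IRQs alone.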

Large receive offload (LRO)

Enable LRO to let the NIC aggregate multiple incoming TCP segments into larger buffers before passing them to the kernel:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -K "$NIC" lro on

If ethtool -k shows large-receive-offload: off [fixed], the NIC hardware does not support LRO. In that case, ensure GRO (Generic Receive Offload) is enabled instead.
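
The LRO-or-GRO decision described above can be scripted by inspecting the ethtool -k output. This is a minimal sketch, and the helper name choose_rx_offload is an illustrative assumption. It prints gro when LRO is reported as fixed off, and lro otherwise.

```shell
# Read `ethtool -k` output on stdin and print which receive offload
# to enable: "gro" if LRO is fixed off in hardware, otherwise "lro".
choose_rx_offload() {
    if grep -qF 'large-receive-offload: off [fixed]'; then
        echo gro
    else
        echo lro
    fi
}

# Usage:
#   sudo ethtool -K "$NIC" "$(ethtool -k "$NIC" | choose_rx_offload)" on
```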

Interrupt coalescing

Use fixed interrupt coalescing instead of adaptive coalescing:

NIC=$(ip -o addr show | grep '<data-subnet>' | awk '{print $2}')
sudo ethtool -C "$NIC" adaptive-rx off adaptive-tx off \
    rx-usecs 128 tx-usecs 64 \
    rx-frames 256 tx-frames 256

Adjust based on workload:

  • Lower latency: Reduce rx-usecs to 32-64, rx-frames to 64-128
  • Higher throughput: Increase rx-usecs to 256, rx-frames to 512

Congestion control

Use CUBIC (the Linux default). BBR’s bandwidth probing cycles cause throughput oscillation when many concurrent flows share the same high-bandwidth link.

Switch configuration

Configure the switch connecting MinIO AIStor servers for maximum throughput:

  • Jumbo Frames: Configure MTU 9100+ on all switch ports connected to MinIO AIStor servers and clients.
  • Multi-queue scheduling: Keep TC-to-queue mappings active. Without these, all traffic forces through a single queue.
  • Disable PFC and flow control: PFC is designed for RoCE/RDMA lossless fabrics and is counterproductive for TCP.
  • Buffer allocation: Use the default “lossy” buffer profile. Lossless profiles waste buffer space when PFC is disabled.

Monitor switch port counters for TX_DRP, RX_DRP, and per-queue distribution to detect issues.

Connection tracking (nf_conntrack)

On dedicated storage servers, connection tracking should ideally be disabled by not loading the nf_conntrack module:

sudo modprobe -r nf_conntrack
echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/no-conntrack.conf

If firewall rules require it, configure aggressive timeout settings:

sudo sysctl -w net.netfilter.nf_conntrack_max=800000
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=20

Bootloader settings

These settings require a reboot to take effect.

| Setting | Reason |
| --- | --- |
| skew_tick=1 | Staggers timer interrupts across CPUs to avoid thundering-herd wakeups. |
| intel_iommu=off | Disables IOMMU (VT-d/DMAR) to remove DMA translation overhead on every NVMe and NIC transfer. |
| iommu=pt | Sets IOMMU passthrough mode so devices bypass DMA remapping where the IOMMU remains active. |

For AMD systems, use amd_iommu=off instead of intel_iommu=off.

IOMMU is useful for VM passthrough (VT-d) and device isolation. On bare-metal storage servers running only MinIO AIStor, disable it.
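
Since these parameters only apply after a reboot, it can help to confirm that they reached the running kernel's command line. The sketch below assumes a hypothetical helper named missing_boot_params. It takes a command-line string followed by the expected parameters and prints any that are absent.

```shell
# Print each expected kernel parameter that is missing from a command line.
#   $1    = kernel command line (for example the contents of /proc/cmdline)
#   $2... = parameters that should be present
missing_boot_params() {
    local cmdline="$1" param
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;           # present as a whole word
            *) echo "$param" ;;
        esac
    done
}

# Usage:
#   missing_boot_params "$(cat /proc/cmdline)" skew_tick=1 intel_iommu=off iommu=pt
```

Empty output means every listed parameter is active.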

MinIO AIStor settings

Connection limits

Set MINIO_MAX_IDLE_CONNS_PER_HOST to tune the maximum number of idle and active internode HTTP connections. Increase or decrease this value to adjust concurrency between nodes.

O_DIRECT

The MINIO_API_ODIRECT setting controls whether MinIO AIStor bypasses the OS page cache for reads and writes. The default is on (O_DIRECT for both reads and writes). You can also set it to read (O_DIRECT for reads only) or write (O_DIRECT for writes only). Setting it to off disables O_DIRECT entirely, which lets the page cache grow unbounded and can lead to memory pressure and out-of-memory conditions. Do not set it to off in production.

Thread pressure monitoring

The MINIO_API_THREAD_PRESSURE_CHECK and related settings monitor goroutine usage and return HTTP 429 from health endpoints when thread pressure exceeds the critical threshold.

Validation

After applying the profile, verify key settings:

# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# IOMMU (after reboot)
dmesg | grep -i iommu

# TCP buffers
sysctl net.core.rmem_max net.core.wmem_max

# THP
cat /sys/kernel/mm/transparent_hugepage/enabled

# Flow control (per data NIC)
ethtool -a $NIC | grep -E 'RX:|TX:'

# Ring buffers (per data NIC)
ethtool -g $NIC | grep -A4 'Current'

# LRO (per data NIC)
ethtool -k $NIC | grep large-receive-offload

# NIC packet drops (should be zero or near-zero after tuning)
ethtool -S $NIC | grep -E 'rx_discards_phy|rx_out_of_buffer|tx_pause_ctrl_phy'