Kernel and Network Stack Tuning for Minimal Latency
When building HFT systems for algorithmic trading, every microsecond counts. The kernel and network stack are the stage crew moving the packets from the wire to your strategy code — if they fumble, your execution timing (and P&L) suffers.
- Goal (this screen): give practical knobs you can change safely and a tiny C++ experiment that demonstrates why CPU affinity and polling vs. kernel wakeups matter. You're coming from Java/C/Python/JS — think of `isolcpus` and IRQ affinity like telling the OS "don't interrupt my star player during the buzzer-beater".
Quick mental model (ASCII)
```
[NIC] --hw-ts--> (NIC ring RX) --> (NIC IRQ) --> [Kernel softirq / NAPI] --> [socket / user app]
                                       |
                                       v
                                  (CPU core)
```
Important places to tune:
- IRQ affinity — bind NIC interrupts to specific CPU cores by writing to `/proc/irq/<irq>/smp_affinity`, or use `irqbalance` carefully.
- `isolcpus` — kernel boot parameter to isolate cores from the scheduler (good for dedicating cores to latency-sensitive threads).
- PREEMPT / real-time kernels — `CONFIG_PREEMPT` and `CONFIG_PREEMPT_RT` reduce scheduling latency.
- RX/TX ring sizes — `ethtool -g <iface>` shows and `ethtool -G <iface> rx <count> tx <count>` adjusts NIC buffers.
- Offloads — disable GRO/GSO/TSO for accurate per-packet timing with `ethtool -K <iface> gro off gso off tso off`.
- Socket & kernel knobs — `net.core.rmem_max`, `net.core.netdev_max_backlog`, `net.core.busy_poll`, and `SO_BUSY_POLL` for polling sockets.
Why this matters in HFT terms:
- Polling (busy-spin) is like having a guard constantly watching the scoreboard — you pay CPU (power) for ultra-low and deterministic latency.
- Kernel wakeups (condvars, epoll) are energy efficient but introduce jitter — like waiting for the PA announcer to tell you the buzzer sounded.
Practical safe-testing rules:
- Test on a dedicated lab box (do not change kernel settings on prod network appliances).
- Keep a remote admin session and a recovery plan (rescue kernel, reboot). Use `sysctl -w` for transient changes.
- Record baselines before each change. Use `ethtool -T`, `ptp4l -m` (if PTP), `tcpdump -tt`, `perf record` / `perf top`.
Commands you will use often:
- Check timestamping/offloads: `ethtool -T eth0`, `ethtool -k eth0`
- Resize rings: `ethtool -G eth0 rx 4096 tx 512`
- Disable offloads: `ethtool -K eth0 gro off gso off tso off`
- Pin an IRQ to a CPU mask: `echo 2 > /proc/irq/<irq>/smp_affinity` (mask is hex; be careful)
- Transient sysctl: `sysctl -w net.core.busy_poll=50`
Tiny experiment (run locally)
Below is a C++ program that simulates a simple producer (market-data) and consumer (strategy) pair and measures notification latency in three scenarios:
- unpinned threads (default scheduler)
- pinned to the same core (bad)
- pinned to different cores (good)
This will help you reason about isolcpus and thread pinning effects. It includes both condition_variable (kernel wake) and polling (busy-spin) modes. Try it on a multi-core Linux VM and change the CPU numbers (or run with isolcpus= kernel param) to see the difference.
Note: This is a simulation — it doesn't change kernel IRQ routing or NIC offloads. Run real network tests separately with pktgen and ethtool once you're comfortable.
Challenge: Run the program, then:
- Change `prod_cpu`/`cons_cpu` values to match cores on your machine (try `0` and `1`).
- Switch between `use_polling = true` and `false`.
- Observe mean and max latencies. Relate improvements to what you'd expect if you used `isolcpus` and bound the NIC IRQ to a nearby core.
Now the code — save as `main.cpp`, compile with `g++ -O2 -std=c++17 -pthread main.cpp -o tune_test`, and run `./tune_test`.
```cpp
#include <chrono>
#include <cstdint>
#include <pthread.h>
#include <thread>

using namespace std;
using namespace std::chrono;

// Pin a std::thread to a CPU core (returns true on success).
bool pin_thread_to_cpu(std::thread &t, int cpu) {
    if (cpu < 0) return true; // -1 means leave unpinned
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    return rc == 0;
}

struct Results {
    double mean_ns;
    uint64_t max_ns;
};
```

