Profiling, Performance Optimization and Vectorization
Why this matters for HFT engineers (beginner-friendly)
- In algorithmic trading you often process millions of ticks per second. Small inefficiencies in loops or data layout become huge latency and throughput problems.
- Think of optimization like a fast break in basketball: you want the ball (data) to move in straight lines with no unnecessary stops. Poor data layout is like zig-zag dribbling that wastes time.
Quick ASCII visuals — memory layout and cache friendliness
AoS (Array of Structs) — awkward for per-field hot loops:
[ Tick{price,size,ts} ][ Tick{price,size,ts} ] [ Tick{...} ]
^ accessing price touches other fields too
SoA (Struct of Arrays) — cache-friendly when you only need one field:
prices: [p0, p1, p2, p3, ...]
sizes: [s0, s1, s2, s3, ...]
ts: [t0, t1, t2, t3, ...]
^ sequential memory for prices -> better L1/L2 prefetching
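The two diagrams above map directly onto C++ types. A minimal sketch of both layouts and a per-field hot loop over each (the `TickBook` name and helper functions are illustrative, not part of the benchmark in this article):

```cpp
#include <cstddef>
#include <vector>

// AoS: one struct per tick; reading only `price` still drags
// `size` and `ts` into the cache line alongside it.
struct Tick { double price; double size; long ts; };

// SoA: one contiguous array per field; a loop over prices
// touches only price data, so every cached byte is useful.
struct TickBook {
    std::vector<double> prices;
    std::vector<double> sizes;
    std::vector<long>   ts;
};

double sum_prices_aos(const std::vector<Tick>& ticks) {
    double s = 0.0;
    for (const Tick& t : ticks) s += t.price;  // strided access: 24-byte steps
    return s;
}

double sum_prices_soa(const TickBook& book) {
    double s = 0.0;
    for (double p : book.prices) s += p;       // unit-stride access: 8-byte steps
    return s;
}
```

Both functions compute the same number; the SoA version simply walks memory in the pattern the prefetcher likes best.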
Core concepts to remember
- Profilers: use `perf` (Linux) and Intel VTune to find hot functions and cache-miss hotspots.
- Algorithmic choices: a better algorithm beats micro-optimizations (O(n) vs O(n log n)).
- Data layout: SoA often outperforms AoS for tight numeric loops.
- Memory pools: avoid frequent small allocations; use pools/arenas to reduce allocator overhead and fragmentation.
- Vectorization (SIMD): modern compilers auto-vectorize loops when the code is simple and memory-aliasing is clear. You can also use intrinsics later.
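The memory-pool bullet above can be made concrete with a toy bump arena. This is a hedged sketch, not a production allocator; the `Arena` class and its method names are invented here for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bump arena: one big upfront allocation, then O(1) pointer bumps.
// There is no per-object free; the whole arena is reset at once. This
// sidesteps malloc/free overhead and fragmentation in a hot path.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buf_(bytes), used_(0) {}

    // align must be a power of two (true for alignof results).
    void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > buf_.size()) return nullptr;             // arena exhausted
        used_ = p + n;
        return buf_.data() + p;
    }

    void reset() { used_ = 0; }  // e.g., once per batch of ticks

private:
    std::vector<std::uint8_t> buf_;
    std::size_t used_;
};
```

In a trading loop you would typically size the arena for a worst-case message batch and `reset()` it once per batch rather than freeing individual objects.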
Commands & profiling tips (beginner-safe)
- Quick perf run: `sudo perf record -F 99 -- ./main`, then `sudo perf report --stdio`.
- See whether the compiler vectorized a loop (GCC/Clang): compile with `-O3 -ftree-vectorize -fopt-info-vec` and check the messages.
- Use `perf stat -e cache-references,cache-misses ./main` to get cache-miss rates.
How this ties to languages you know
- From Python/NumPy: moving inner loops to contiguous `numpy` arrays or C++ gives big speedups. In Python, prefer `numpy` vector ops to Python loops.
- From Java: HotSpot's JIT can auto-vectorize; the same principles (contiguous arrays, simple loops) apply.
- From C/JS: memory layout and cache behavior still matter; in JS, typed arrays are faster for tight numeric loops.
Try this interactive C++ microbenchmark (below). It demonstrates:
- Generating reproducible `ticks` (deterministic RNG), similar to replaying market data.
- Two implementations of a simple moving-average workload: AoS vs SoA.
- Timings using `std::chrono` and a small checksum to keep results honest.
Compile hints (try these locally):
- g++ -O3 -march=native -std=c++17 main.cpp -o main
- Run `perf stat -e cache-references,cache-misses ./main` and compare cache misses between AoS and SoA.
- If you're curious about vectorization, add `-fopt-info-vec` (GCC/Clang) to see which loops the compiler vectorized.
Beginner challenges to try after running the code
- Re-run with different `N` (e.g., 100k, 10M). How does ops/sec scale? Which version scales better?
- Compile with and without `-O3`. Inspect perf differences and run `objdump -d` to see the generated assembly.
- Implement the same logic in Python with `numpy` arrays; compare runtime and think about why `numpy` can sometimes match C++ for vectorizable ops.
- (Stretch) Try rewriting the hot loop with explicit intrinsics (`<immintrin.h>`) and compare, but only after you're comfortable with the auto-vectorized result.
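For the stretch challenge, here is a hedged sketch of what an explicit-SIMD price sum might look like. The AVX2 path is guarded so the same file builds everywhere, and the structure (4-wide accumulator plus scalar remainder loop) is one common pattern, not the only one:

```cpp
#include <cstddef>
#include <vector>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Sum a contiguous price array. With AVX2, process 4 doubles per
// iteration; without it, fall back to the scalar loop the compiler
// would likely auto-vectorize anyway at -O3.
double sum_prices(const std::vector<double>& prices) {
    std::size_t i = 0;
    double total = 0.0;
#if defined(__AVX2__)
    __m256d acc = _mm256_setzero_pd();
    for (; i + 4 <= prices.size(); i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(&prices[i]));
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < prices.size(); ++i)  // remainder (or everything, if no AVX2)
        total += prices[i];
    return total;
}
```

Note that lane-wise accumulation reorders the floating-point additions, so results can differ from the scalar loop in the last bits; benchmark with a checksum tolerance rather than exact equality.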
Small motivational analogy: just as Kobe Bryant was at his best taking the direct straight-to-the-basket path, think of an SoA layout plus a vectorized loop as the most direct path your code can take: fewer stops, fewer wasted cycles.
Now run the example below and experiment with the challenges above. The code is self-contained and prints timings and simple checksums so you can verify correctness while measuring performance.
```cpp
#include <chrono>
#include <random>
#include <vector>

using namespace std;
using steady = chrono::steady_clock;

struct Tick {
    double price;
    double size;
    long ts;
};

// Generate deterministic ticks so runs are reproducible
void gen_ticks_aos(vector<Tick>& ticks, size_t N) {
    mt19937_64 rng(42);
    uniform_real_distribution<double> price_d(100.0, 101.0);
    uniform_real_distribution<double> size_d(1.0, 10.0);
    ticks.resize(N);
    for (size_t i = 0; i < N; ++i) {
        ticks[i].price = price_d(rng);
        ticks[i].size = size_d(rng);
        ticks[i].ts = static_cast<long>(i);
    }
}
```


