Systems & building

The low-latency stack

∞ structural

Reviewed 4 June 2026. As of 2026: a permanent feature of the market, not an edge that decays.

Kernel bypass, busy-polling, cache discipline, NUMA pinning: the engineering that shaves microseconds. Necessary for speed plays, overkill for most microstructure strategies.

The idea

Figure DG-STACK (annotated diagram): The hot path, and the tail that hurts.

What this shows. The hot path is the one tight line from packet in to order out; everything slow (logging, allocation) is pushed to a cold path that never blocks it, through a lock-free queue. Disciplined software lands a median of a few microseconds, but the engineering really lives in the tail. The p99 event, during a fast market, is the one you lose, so you optimise the tail, not the average.

What is the "hot path", and why is everything else kept off it?

The hot path is the critical sequence of code that runs on every market-data event that might lead to a trade: decode the packet, update the book, evaluate the signal, check risk, send the order. Everything that is not on it (logging, memory allocation, file I/O, admin, monitoring) is pushed to a cold path off the critical thread, because anything the hot path waits on becomes part of your latency.

The intuition is a relay sprinter who must hand off the baton the instant it arrives. Anything that makes them pause (tying a shoelace, checking their phone) costs the race. The hot path is the sprinter; the cold path is everything you do between races, arranged so it never interrupts one.

So the hot path is single-purpose and stripped down: no allocations (memory is pre-allocated in pools), no system calls, no locks, no logging that touches disk. Every piece of state it needs is already in memory and ideally already in cache. The cold path handles everything else on separate threads and cores (writing logs, persisting fills, updating dashboards, any kind of housekeeping) and it communicates with the hot path through a lock-free queue so the hot path never blocks. This discipline is the organising principle of the whole system architecture: the feed handler, book builder, strategy and order gateway are arranged so the latency-critical work is a tight, predictable line and everything else is asynchronous.
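The "no allocations" rule is usually met with a pool whose storage is reserved up front. A minimal sketch of that idea follows; the Order fields, the FixedPool name and its interface are hypothetical, chosen only for illustration, and the pool is assumed to be used from a single thread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical order record; the fields are placeholders for illustration.
struct Order {
    std::uint64_t id;
    std::int64_t  price_ticks;
    std::int32_t  qty;
};

// Fixed-capacity pool: all storage lives inside the object, so acquire()
// and release() never call the allocator. Single-threaded by assumption.
template <typename T, std::size_t N>
class FixedPool {
public:
    FixedPool() {
        // Build the free list once, at start-up, off the hot path.
        for (std::size_t i = 0; i < N; ++i) free_[i] = i;
    }

    // Returns a free slot, or nullptr when the pool is exhausted.
    T* acquire() {
        if (free_count_ == 0) return nullptr;
        return &slots_[free_[--free_count_]];
    }

    // Hands back a slot previously obtained from acquire().
    void release(T* p) {
        const std::size_t idx = static_cast<std::size_t>(p - slots_.data());
        assert(idx < N && free_count_ < N);
        free_[free_count_++] = idx;
    }

    std::size_t available() const { return free_count_; }

private:
    std::array<T, N>           slots_{};
    std::array<std::size_t, N> free_{};
    std::size_t                free_count_ = N;
};
```

Exhaustion is reported with a null pointer rather than a fallback allocation, so a sizing mistake shows up as a visible refusal instead of a silent latency spike. Whether that is the right policy depends on the system.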

Why is event-driven architecture the right shape?

Trading is inherently reactive: you do nothing until an event arrives, whether a quote change, a trade or a timer. An event-driven design models exactly that: a loop that waits for the next event and dispatches it through the hot path. It avoids the overhead of repeatedly scanning the whole world for changes, keeps a single deterministic order of processing, and maps cleanly onto the incoming feed stream.

The core is an event loop: get the next event (a market-data message, an order acknowledgement, a clock tick), dispatch it to its handler, repeat. One thread, one ordering, no surprises about what ran when. That determinism is also what makes the system reproducible enough to backtest faithfully: the simulator replays the same event stream through the same loop, so production and backtest share the strategy code and the processing order.

The events come from the messaging layer (the multicast feed delivers book updates, the order session delivers execution reports, internal timers fire), and each is a small, typed event the loop routes. This is the opposite of a request/response web service: there is no "wait for a database", the system reacts to a stream. The single design constraint that follows is: never block the loop. That is exactly why everything slow lives on the cold path.
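The loop described above can be sketched in a few lines. This is an illustration under assumptions, not a production design: the three event types, the Strategy handler names and run_loop are hypothetical, and a std::queue stands in for the real event source (in production that would be the NIC ring or a lock-free queue, since std::queue allocates).

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <variant>

// Small, typed events (hypothetical fields, for illustration only).
struct BookUpdate { std::int64_t price_ticks; std::int32_t qty; };
struct ExecReport { std::uint64_t order_id; std::int32_t filled_qty; };
struct TimerTick  { std::uint64_t now_ns; };

using Event = std::variant<BookUpdate, ExecReport, TimerTick>;

// One handler per event type. The trace records the order in which
// handlers ran, so the single processing order can be observed.
struct Strategy {
    char trace[16] = {};
    int  n = 0;

    void on(const BookUpdate&) { note('B'); }
    void on(const ExecReport&) { note('E'); }
    void on(const TimerTick&)  { note('T'); }

private:
    void note(char c) { assert(n < 15); trace[n++] = c; }
};

// Takes events strictly in arrival order and dispatches each to its
// handler on the calling thread. Returns the number of events handled.
inline int run_loop(std::queue<Event>& source, Strategy& s) {
    int handled = 0;
    while (!source.empty()) {
        std::visit([&s](const auto& ev) { s.on(ev); }, source.front());
        source.pop();
        ++handled;
    }
    return handled;
}
```

Because the loop only depends on the sequence of events, a simulator can feed a recorded sequence into the same run_loop and the same Strategy, which is the property the backtest argument relies on.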

What is the difference between low latency and low jitter, and why does the tail matter more?

Latency is how long one event takes to process; jitter is how much that time varies. A system with a 5 µs median but an occasional 500 µs spike will lose exactly when it matters most, during the volatile bursts you most want to trade. So HFT engineers obsess over the tail (p99, p99.9, max), not the average. Predictability is the product.

The intuition is a train that is on time on average but randomly forty minutes late: useless for catching a connection, because you plan around the worst you will plausibly hit, not the mean. In trading, the worst case fires precisely during the fast markets where the opportunity (and the adverse selection) is greatest. The median is a vanity metric; the tail is the strategy.

A latency budget is built against a tail percentile, not the mean. The number that decides whether you win the contested trade is the high-percentile latency during a burst, which can be many times the median if jitter is uncontrolled.

L_{p99} = \inf\{\,\ell : \mathbb{P}(\text{latency} \le \ell) \ge 0.99\,\}, \qquad L_{p99} \gg L_{\text{median}} \;\Rightarrow\; \text{jitter problem}
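On recorded samples, the definition above becomes a nearest-rank percentile: sort, then take the smallest sample that at least the requested fraction of samples do not exceed. A minimal sketch, intended for offline analysis rather than the hot path (it copies and sorts the samples); the function name is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample l such that at least a
// fraction q of the samples are <= l. Takes the vector by value because
// it sorts a private copy.
inline double percentile(std::vector<double> samples, double q) {
    assert(!samples.empty() && q > 0.0 && q <= 1.0);
    std::sort(samples.begin(), samples.end());
    const std::size_t rank = static_cast<std::size_t>(
        std::ceil(q * static_cast<double>(samples.size())));
    return samples[rank - 1];
}
```

As an illustrative data set (made up, not measured): 980 events at 5 µs and 20 events at 500 µs give a median of 5 µs and a p99 of 500 µs, while the mean is 14.9 µs, a number that describes neither the typical event nor the spike.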

Jitter comes from many small sources, each an unpredictable spike: OS scheduling (your thread getting pre-empted), interrupts, garbage collection (a reason managed languages are avoided on the hot path), cache misses, page faults, NUMA cross-socket access, contention on a lock. It is fought structurally: pin the hot-path thread to a dedicated core that is isolated from the general scheduler so no other work lands on it, busy-poll instead of sleeping, pre-allocate and pre-fault memory, disable power-saving frequency scaling, keep data local to one NUMA node, and avoid any code path that can occasionally do something expensive. The tail is the trade you lose.
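Pinning is the one item on that list that is a direct API call. The sketch below assumes Linux with glibc and uses sched_getaffinity and sched_setaffinity; the two helper names are hypothetical. It only restricts where the calling thread may run. Keeping other work off that core (for example with the isolcpus boot parameter) is system configuration and is not shown.

```cpp
#include <cassert>
#include <sched.h>  // Linux/glibc: cpu_set_t, sched_getaffinity, sched_setaffinity

// Returns the lowest-numbered CPU the calling thread is currently
// allowed to run on, or -1 on error.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Restricts the calling thread (pid 0 means "this thread") to a single
// CPU. Returns true on success.
inline bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

The hot-path thread would call pin_current_thread once at start-up, before entering its loop. Which core to choose (same NUMA node as the NIC, not a sibling hyperthread of a busy core) is a deployment decision this sketch does not make.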

What is kernel bypass, and why DPDK / Solarflare?

The OS kernel's network stack adds microseconds and jitter to every packet: copies, interrupts, context switches. Kernel bypass lets your application read packets directly from the network card in user space, skipping the kernel. DPDK (a software framework) and Solarflare/Onload (driver plus NIC) are the standard ways to do it, both cutting and stabilising inbound latency.

The intuition: the normal path is like mail through a sorting office, where every letter is handled, queued and handed over with overhead. Kernel bypass is a direct delivery chute from the postman to your desk: fewer hands, far more predictable timing. DPDK maps the NIC's packet rings into your process so you poll them directly, with no kernel, no per-packet interrupt and no copy; Solarflare with Onload transparently accelerates standard sockets onto the same idea, and FPGA NICs go further still. These shave single-digit microseconds and, more importantly, remove a large source of jitter from the inbound multicast feed.

The cost is real: kernel bypass usually means busy-polling a core 100% of the time (it never sleeps waiting for an interrupt), so it burns CPU and power to buy latency. That is a deliberate trade. By 2026 this is mature, commoditised tooling: a kernel-bypass NIC is something you buy, which is part of why the infrastructure is the price of entry, not the edge.

What are busy-polling, lock-free queues and cache-friendliness?

These are the three core hot-path techniques. Busy-polling means the thread loops checking for work instead of sleeping, trading CPU for zero wake-up latency. Lock-free queues pass data between threads without mutexes, so no thread ever blocks. Cache-friendliness lays data out so the CPU finds it in fast L1/L2 cache rather than stalling on main memory.

Busy-polling (busy-wait): a sleeping thread takes microseconds to wake when an event arrives, unacceptable jitter on the hot path. Instead the thread spins, continuously polling the NIC ring and the input queue, so it reacts the instant data appears. The price is a fully-occupied, power-hungry core: worth it on the critical path, nowhere else.

Lock-free queues (typically a single-producer/single-consumer ring buffer) let the hot path hand work to the cold path, and receive order acks, without ever taking a lock. A lock can block the hot path while another thread holds it; a lock-free queue has no holder to wait on, and a well-built SPSC ring is wait-free (every push or pop finishes in a bounded number of steps), which kills a major jitter source.
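A single-producer/single-consumer ring of the kind described can be sketched with two atomic counters. This is a minimal illustration under assumptions: exactly one producer thread and one consumer thread, a power-of-two capacity, and a 64-byte cache line. The class and method names are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Bounded SPSC ring. head_ is written only by the producer, tail_ only
// by the consumer, so neither side ever waits on a lock.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "capacity must be a power of two");

public:
    // Producer thread only. Returns false when the ring is full.
    bool try_push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        buf_[head & (N - 1)] = v;
        // Release: the slot write above becomes visible before the new head.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false when the ring is empty.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = buf_[tail & (N - 1)];
        // Release: the slot read above completes before the slot is freed.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    // Separate cache lines (64 bytes assumed) so the two threads do not
    // fight over one line when each updates its own counter.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    alignas(64) std::array<T, N>         buf_{};
};
```

A busy-polling consumer is then just a loop that calls try_pop again whenever it returns false, without sleeping or yielding. On the hot path a full ring is a design question in its own right: the producer cannot block, so it has to drop, count, or escalate.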

Cache-friendliness matters because the memory hierarchy is brutally uneven: an L1 hit costs about a nanosecond, while a trip to main memory costs roughly a hundred. So you keep hot data small and contiguous (arrays, not pointer-chasing linked structures), avoid false sharing (two cores fighting over one cache line), and lay out the book so a lookup touches as few cache lines as possible. At microsecond budgets, cache misses are a dominant cost.
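One way to get "arrays, not pointer-chasing" for a book side is a dense array indexed by tick offset from a base price. The sketch below is an illustration under assumptions, not a full order book: the Ladder name, the fixed price window and the quantity-only levels are all hypothetical simplifications.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// One side of a book as a contiguous array of quantities. Level i holds
// the resting quantity at price base + i ticks, so a lookup is one index
// computation and sixteen adjacent 4-byte levels share a 64-byte line.
template <std::size_t Levels>
class Ladder {
public:
    explicit Ladder(std::int64_t base_ticks) : base_(base_ticks) {}

    // Sets the quantity at a price. Returns false if the price falls
    // outside the window covered by the array.
    bool set(std::int64_t price_ticks, std::int32_t qty) {
        const std::int64_t off = price_ticks - base_;
        if (off < 0 || off >= static_cast<std::int64_t>(Levels)) return false;
        qty_[static_cast<std::size_t>(off)] = qty;
        return true;
    }

    // Quantity at a price, or 0 if the price is outside the window.
    std::int32_t qty_at(std::int64_t price_ticks) const {
        const std::int64_t off = price_ticks - base_;
        if (off < 0 || off >= static_cast<std::int64_t>(Levels)) return 0;
        return qty_[static_cast<std::size_t>(off)];
    }

    // Lowest price with non-zero quantity (the best ask, if this is the
    // ask side). A linear scan, but over contiguous memory.
    std::optional<std::int64_t> best() const {
        for (std::size_t i = 0; i < Levels; ++i) {
            if (qty_[i] > 0) return base_ + static_cast<std::int64_t>(i);
        }
        return std::nullopt;
    }

private:
    std::int64_t                     base_;
    std::array<std::int32_t, Levels> qty_{};
};
```

The trade is memory and a bounded price window in exchange for predictable access. Re-centring the window when the market moves away from the base price is left out of the sketch.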

The latency gap between cache and main memory is about two orders of magnitude, so at a single-digit-microsecond budget a handful of avoidable cache misses can be the difference between winning and losing the event.

t_{\text{L1}} \approx 1\,\text{ns} \;\lll\; t_{\text{DRAM}} \approx 100\,\text{ns} \quad\Rightarrow\quad \frac{t_{\text{DRAM}}}{t_{\text{L1}}} \approx 100
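As a rough worked instance (the miss count and the budget are illustrative assumptions, not measurements): ten avoidable trips to main memory on a path with a 5 µs budget cost about a fifth of that budget on their own.

n_{\text{miss}} \cdot t_{\text{DRAM}} \approx 10 \times 100\,\text{ns} = 1\,\mu\text{s} \approx 20\%\ \text{of a}\ 5\,\mu\text{s}\ \text{budget}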

These three, plus core-pinning and pre-allocation, are what take competent software from "tens of microseconds" to "single-digit microseconds", the kernel-bypass software tier below the silicon frontier.

C++ or Rust: does language choice matter?

For the hot path you want a compiled, no-garbage-collection language with manual control over memory and layout, which in 2026 means C++ (the incumbent, with the most low-latency libraries and FPGA toolchain support) or Rust (memory-safe with the same control, and a growing share). Managed languages such as Java and C# appear off the hot path or in tolerant tiers, where their tail-latency GC pauses are acceptable.

The non-negotiable is no garbage collector on the hot path. A GC pause is exactly the kind of unpredictable multi-millisecond spike that destroys tail latency. C++ and Rust have deterministic, manual or scoped memory management, so there is no pause to suffer. C++ remains the default: decades of low-latency idioms, the deepest library ecosystem, and the toolchains for FPGA co-design all target it. Rust offers the same zero-cost control with compile-time memory safety, which removes a class of the bugs that are catastrophic in a system that can lose money in microseconds, and its adoption is rising.

Python's place is real but bounded: it is the language of research (backtests, signal exploration, the research-to-production pipeline), not of the live hot path. A common shape is "research in Python, production hot path in C++/Rust", and reconciling the two is its own discipline. What AI changes: LLM assistants are now genuinely useful at writing and optimising this code, whether translating a research prototype into a C++/Rust hot path, suggesting cache-friendly layouts, or spotting an allocation on the hot path. The judgement about what to optimise and how to validate it stays human (see what AI changes for HFT). AI assists the optimisation; it does not change the physics.

How do you actually measure latency?

You timestamp the same event at multiple points (the packet at the NIC, decode done, the order on the wire) and record the distribution, not the average. The gold standard is hardware/NIC timestamping and external wire-capture (a passive tap that timestamps packets independently), because software clocks perturb what they measure. You report the median, p99, p99.9 and max.

Why distributions and not means: as above, the tail is what costs you, and a single mean hides exactly the spikes you care about. Report percentiles and the max, and plot the histogram. For where the time goes, take internal timestamps at each hot-path stage (for example by reading the CPU's timestamp counter with the rdtsc instruction), which gives you the stage-by-stage breakdown; for ground-truth tick-to-trade, an external NIC or switch hardware timestamp, or a passive wire tap, measures the loop without your own instrumentation adding latency or jitter.
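Internal stage timestamps can be sketched as a fixed array of stamps, one per stage boundary. The stage names and the StageStamps type are hypothetical. For portability the sketch reads std::chrono::steady_clock where the text mentions rdtsc; swapping in a calibrated cycle counter is an assumption left out here. In practice the first stamp may also come from a different clock (a NIC hardware timestamp, say), and reconciling clock domains is not shown.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Stage boundaries on the hot path, in processing order (hypothetical names).
enum Stage : std::size_t {
    kPacketIn, kDecoded, kBookUpdated, kSignalDone, kRiskPassed, kOrderOut,
    kStageCount
};

// One timestamp per boundary for a single event. Fixed size, no
// allocation, so marking a stage is a single array store.
struct StageStamps {
    std::array<std::uint64_t, kStageCount> ns{};

    void mark(Stage s, std::uint64_t t_ns) { ns[s] = t_ns; }

    // Time spent between the previous boundary and boundary s.
    std::uint64_t delta(Stage s) const {
        assert(s > kPacketIn && s < kStageCount);
        return ns[s] - ns[s - 1];
    }

    // Packet in to order out, as seen by this clock.
    std::uint64_t tick_to_trade() const { return ns[kOrderOut] - ns[kPacketIn]; }
};

// Monotonic clock reading in nanoseconds (portable stand-in for rdtsc).
inline std::uint64_t now_ns() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
```

The per-stage deltas sum to tick_to_trade by construction, which is what makes the breakdown usable as a budget: every nanosecond of the total is attributed to exactly one stage.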

The trap is measurement that lies. Logging every event on the hot path, or calling an expensive clock, changes the thing you measure, so you sample, use cheap counters, and aggregate off the hot path. This breakdown is precisely the input to the latency budget: once you know where the microseconds go, you decide which stage is worth attacking (kernel bypass, FPGA, or colocation) and which is already small enough to ignore.
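"Cheap counters, aggregated off the hot path" can look like a fixed-bucket histogram: recording a sample is a bucket increment, and percentiles are computed later by whoever reads the counts. A minimal sketch with power-of-two buckets follows; the class name and bucket scheme are assumptions, and it is single-writer (the hot thread owns it; how a snapshot reaches the cold path is not shown).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Histogram with power-of-two buckets: bucket b counts samples in
// [2^b, 2^(b+1)) nanoseconds, and bucket 0 also takes the value 0.
class Log2Histogram {
public:
    static constexpr std::size_t kBuckets = 64;

    // Hot-path side: no allocation, no I/O, no locks.
    void record(std::uint64_t ns) {
        ++counts_[bucket_of(ns)];
        ++total_;
    }

    std::uint64_t total() const { return total_; }

    // Cold-path side: exclusive upper bound, in ns, of the bucket that
    // contains the q-th quantile. Coarse by design (a factor of two).
    std::uint64_t percentile_upper_bound(double q) const {
        assert(total_ > 0 && q > 0.0 && q <= 1.0);
        const double target = q * static_cast<double>(total_);
        std::uint64_t seen = 0;
        for (std::size_t b = 0; b < kBuckets; ++b) {
            seen += counts_[b];
            if (static_cast<double>(seen) >= target) return upper(b);
        }
        return upper(kBuckets - 1);
    }

    // floor(log2(ns)) for ns >= 1, and 0 for ns == 0. A portable loop;
    // a production version would likely use a bit-scan instruction.
    static std::size_t bucket_of(std::uint64_t ns) {
        std::size_t b = 0;
        while (ns > 1) { ns >>= 1; ++b; }
        return b;
    }

private:
    static std::uint64_t upper(std::size_t b) {
        return b >= 63 ? UINT64_MAX : (std::uint64_t{1} << (b + 1));
    }

    std::array<std::uint64_t, kBuckets> counts_{};
    std::uint64_t                       total_ = 0;
};
```

The factor-of-two resolution is deliberately crude. It is enough to see whether the p99 sits in the same bucket as the median or several buckets above it, which is the question the tail argument asks; finer buckets cost more memory and more cache lines touched.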

Worked example

Take a schematic tick-to-trade breakdown for a kernel-bypass software system in colocation, as of 2026 (illustrative ranges, not promises; measure your own, the point is the shape). Tick-to-trade is additive, so the budget is just the sum of the stages from "market-data packet in" to "order out".

Tick-to-trade is the sum of the stage latencies. With disciplined kernel-bypass software in colocation the median lands in the low single-digit microseconds; the p99 is where the engineering actually lives.

T_{\text{tick-to-trade}} = t_{\text{NIC}} + t_{\text{decode}} + t_{\text{book}} + t_{\text{strategy}} + t_{\text{risk}} + t_{\text{gateway}}

Stage by stage, the typical contributions and the lever that attacks each: NIC + kernel bypass (packet to user space) about 0.5–1 µs, the lever being DPDK or Solarflare; decode of the binary protocol about 0.1–0.5 µs, the lever being a fixed-offset layout with no parsing; book update about 0.1–0.5 µs, the lever a cache-friendly book structure; strategy/signal about 0.5–3 µs, the lever tight hot-path code over pre-computed state; pre-trade risk check about 0.1–0.5 µs, in-line with no I/O; and order encode + send about 0.5–1 µs, the lever a binary order protocol over kernel bypass. The total median tick-to-trade lands at roughly 2–7 µs.
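Taking the quoted ranges at face value and adding the low ends and the high ends, in the order NIC, decode, book, strategy, risk, gateway, reproduces the stated total. This is only the arithmetic behind "roughly 2–7 µs" and says nothing about the tail.

T_{\min} \approx 0.5 + 0.1 + 0.1 + 0.5 + 0.1 + 0.5 = 1.8\,\mu\text{s}, \qquad T_{\max} \approx 1 + 0.5 + 0.5 + 3 + 0.5 + 1 = 6.5\,\mu\text{s}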

The lesson the numbers teach is in the gap between the median and the tail. That median is achievable with disciplined software, but the p99 is where the engineering actually lives: uncontrolled, the 99th-percentile event might be 30 µs because a thread got pre-empted or a cache line was cold, and that is the event, during a fast market, that you lose. Core isolation, busy-polling and no GC are what keep the tail close to the median. An FPGA path collapses the decode-plus-strategy-plus-encode portion into the tens-to-hundreds of nanoseconds wire-to-wire, the only way past the software floor, at a cost. Compare your strategy's actual requirement against this before spending a penny on speed; most strategies live well above this floor, where low-jitter software is decisive and far cheaper than silicon. Figures are illustrative and dated to 2026; this is educational only, not investment advice.

Where this fits

↑ Up · building block of Systems & building
↔ Across · composes with Colocation & FPGA
↔ Across · composes with Messaging protocols
↔ Across · composes with Data volume & engineering
→ Apply · makes money in Latency arbitrage
→ Apply · makes money in Equities & futures
⤓ Build / Buy · tool needed Datasets & tools