
Machine learning in HFT

◆still alpha

Reviewed 4 June 2026. As of 2026: a real edge still exists for those who can run it well.

Not a fifth strategy family but a toolkit across all of them. Genuine lift on feature-rich, high-sample problems (microstructure signals, news NLP) and overpromised everywhere else.

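[Interactive figure: "Does a bigger model win?" Drag the model complexity and watch in-sample accuracy, out-of-sample accuracy, inference latency and the verdict change. Input features: order imbalance, OFI, microprice, trade sign, depth slope, spread. Sample reading at moderate complexity: in-sample accuracy 75%, out-of-sample accuracy 63%, inference latency 4.3 µs, verdict "usable edge".]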

What to notice. ML in HFT is short-horizon prediction on microstructure features – and more model is not more edge. Crank complexity: in-sample accuracy keeps rising while out-of-sample peaks early and then decays (the bias–variance trap), and the inference latency climbs until the model is too slow to trade on the hot path at all. The 2026 edge is a well-fit, fast model – and the same models now power the surveillance that catches spoofing and stuffing.

What does machine learning actually do in HFT?

In HFT, machine learning is mostly applied to three jobs: predicting the next short-horizon price move from book state, classifying order flow and trades (sign inference, toxicity), and detecting anomalies and manipulation. It learns patterns from microstructure features that are hard to specify by hand; it does not "find alpha" autonomously, and it rarely runs inside the microsecond trade decision itself.

Intuition first. The order book is a rich, noisy, fast-changing object. Some of its structure (how imbalance precedes the next tick, how flow clusters before a move) is predictable but tediously non-linear. ML is good at exactly that: learning a messy mapping from many weak features to a short-horizon target where hand-coding the rule is impractical. It is a function approximator on microstructure, not an oracle.

It is worth saying plainly up front: ML is a toolkit, not a fifth strategy family alongside market making, arbitrage, directional and event trading. It is a way of building the signals and classifiers those strategies rely on. The honest application areas are the three above, plus a fourth that sits off the trading loop entirely: research and news, where large language models are now real. The brand posture throughout: we sell understanding, never a black box with a P&L promise. The realistic gains are incremental improvements to signals you could partly build by hand, not a step-change. For the wider lens see what AI changes for HFT.

Short-horizon price prediction

The flagship application is predicting the next move over micro-horizons (the next tick to the next few seconds) from the current and recent book state. Models from logistic regression and gradient-boosted trees up to LSTMs, CNNs and order-book transformers map microstructure features to a sign or a small return; the signal then feeds a simple, fast trading rule, not the model itself.

The target is usually the sign or magnitude of the next mid-move, or the next trade direction, over horizons from one event to a few seconds. The closer the horizon to "the next event", the more it overlaps with classical microstructure signals (the microprice and order-flow imbalance) and the more the marginal value of a deep model over a well-built linear one shrinks. A common, sobering finding: a careful linear or logistic model on good features (OFI, imbalance, microprice deviation) captures most of what a deep model does at these horizons. Deep learning's gains are real but often marginal, and bought with far more overfitting risk and latency.

Where deep learning has shown value (the DeepLOB line, Zhang–Zohren–Roberts 2019, and successors) it extracts spatial and temporal structure across book levels that hand-features miss, but it runs off the hot path and is notoriously sensitive to regime and to leakage.

\hat{y}_{t+h} = f_\theta(\text{book}_{t-k:t}), \qquad \text{accuracy}(f_{\text{deep}}) \approx \text{accuracy}(f_{\text{linear}}) + \varepsilon, \;\; \varepsilon \;\text{small}
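As a rough illustration of the linear end of that model range, here is a minimal sketch of a logistic baseline fitted by plain gradient descent and scored on a chronological hold-out. Everything in it is an assumption made for illustration: the data is synthetic, the two features merely stand in for OFI and book imbalance, and the generating weights (0.4 and 0.2), learning rate and sample size are arbitrary.

```python
import math
import random

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_up(w, b, x) -> float:
    """Model probability that the next mid-move is up."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic(X, y, lr=0.5, epochs=100):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_up(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, X, y) -> float:
    """Share of samples where the predicted sign matches the realised sign."""
    hits = sum((predict_up(w, b, xi) > 0.5) == (yi == 1) for xi, yi in zip(X, y))
    return hits / len(y)

# Synthetic data (hypothetical): two standardised features standing in for
# OFI and book imbalance, with a weak but genuine link to the next move.
rng = random.Random(7)
X, y = [], []
for _ in range(4000):
    ofi, imbalance = rng.gauss(0, 1), rng.gauss(0, 1)
    p_up = sigmoid(0.4 * ofi + 0.2 * imbalance)
    X.append([ofi, imbalance])
    y.append(1 if rng.random() < p_up else 0)

# Chronological split: fit on the first 70%, score on the last 30%.
split = int(0.7 * len(X))
w, b = fit_logistic(X[:split], y[:split])
oos_accuracy = accuracy(w, b, X[split:], y[split:])
print(f"weights={w}, out-of-sample accuracy={oos_accuracy:.3f}")
```

At inference time the fitted model is one dot product and a comparison, which is why a baseline of this kind can sit so close to the hot path.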

This prediction work ladders into the order flow and information guide. ML is a way of estimating the same conditional expectations that guide is built on, not a separate source of edge.

Order-flow and trade classification

ML sharpens the classification problems microstructure depends on: inferring whether a trade was buyer- or seller-initiated (beyond the tick rule and Lee–Ready 1991), classifying order-flow type, and labelling flow as informed or uninformed. Better classification feeds every downstream signal (OFI, VPIN, Kyle's lambda), so improving it has leverage across the whole stack.

Trade-sign inference is the canonical case. Rule-based classifiers (the tick rule, Lee–Ready 1991) are good but imperfect, and every signed-flow signal inherits their error. ML classifiers using local book and trade context can cut that error in hard regimes (lots of mid-prints), a small accuracy gain that propagates into more reliable OFI and VPIN. Informed-versus-uninformed labelling overlaps heavily with the toxicity work below; distinguishing toxic (informed) from benign (noise) flow is the market maker's core problem (adverse selection). The honest caveat: the label "was this flow informed?" is only observable ex post, which makes supervised training leak future information unless handled with care.
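For reference, the two rule-based baselines named above can be sketched in a few lines. This is a simplified version under stated assumptions: prices are integer ticks, and the `bids` and `asks` passed in are already the quotes prevailing at each trade (the quote-alignment lag used in practice is left to the caller). The function names are illustrative, not from any library.

```python
from typing import Sequence

def tick_rule(prices: Sequence[int]) -> list[int]:
    """+1 (buy) on an uptick, -1 (sell) on a downtick.

    A zero tick inherits the previous sign; trades before the first
    price change stay 0 (unclassified).
    """
    signs, last_sign = [], 0
    for i, price in enumerate(prices):
        if i > 0:
            if price > prices[i - 1]:
                last_sign = 1
            elif price < prices[i - 1]:
                last_sign = -1
        signs.append(last_sign)
    return signs

def lee_ready(prices: Sequence[int], bids: Sequence[int], asks: Sequence[int]) -> list[int]:
    """Quote rule against the prevailing midpoint, tick rule for mid-prints."""
    fallback = tick_rule(prices)
    signs = []
    for price, bid, ask, tick_sign in zip(prices, bids, asks, fallback):
        mid = (bid + ask) / 2
        if price > mid:
            signs.append(1)          # printed above mid: buyer-initiated
        elif price < mid:
            signs.append(-1)         # printed below mid: seller-initiated
        else:
            signs.append(tick_sign)  # exactly at mid: fall back to the tick rule
    return signs

# Four trades against a static 10000 / 10002 quote; two of them print at mid.
prices = [10002, 10001, 10001, 10000]
bids = [10000] * 4
asks = [10002] * 4
tick_signs = tick_rule(prices)
lr_signs = lee_ready(prices, bids, asks)
print(tick_signs, lr_signs)
```

The mid-prints are the hard cases: the quote rule has nothing to say about them, so their sign comes entirely from the tick-rule fallback. That is the region where a learned classifier with local context has room to help.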

Classification is a better ML fit than raw return prediction: the target (the sign of a printed trade) is cleaner and far more abundant than a return target, which is almost all noise at HFT horizons. ML tends to add more reliable value on the classification jobs.

\text{SNR}\big(\text{trade sign}\big) \;\gg\; \text{SNR}\big(\text{micro-horizon return}\big)

Toxicity, manipulation and anomaly detection

ML is well-suited to detection problems: flagging toxic flow (adverse-selection risk) and spotting manipulation patterns (spoofing, layering, quote stuffing) and anomalies in the tape. These are recognition and surveillance applications: protect the market maker, or support compliance and venue surveillance, never operational manipulation.

Toxicity detection extends VPIN-style ideas (Easley–López de Prado–O'Hara): an ML model on bucketed order-flow and book features can give a market maker an earlier, richer "the flow is turning toxic, widen or pull" signal than a single gauge (PIN and VPIN). This is defensive ML: reducing adverse selection, not predicting direction. Manipulation and anomaly detection is recognition-only: ML classifiers and anomaly detectors flag the signatures of spoofing, layering and quote stuffing for surveillance and compliance. That is the lawful, defensive side of the manipulation guides: how the patterns are caught, never how to run them (spoofing and layering, market manipulation).
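To make the "single gauge" concrete, the following is a deliberately simplified VPIN-style calculation: trades are poured into equal-volume buckets and the gauge is the average absolute buy/sell imbalance over the most recent buckets. It assumes trades arrive already signed (the published VPIN uses bulk volume classification instead), and the bucket size and window are arbitrary illustrative choices.

```python
def volume_bucket_imbalances(trades, bucket_volume):
    """Per-bucket |buy - sell| / bucket_volume for every completed bucket.

    trades: iterable of (sign, size) with sign +1 (buy) or -1 (sell).
    A trade that straddles a bucket boundary is split across buckets.
    """
    imbalances = []
    buy = sell = filled = 0
    for sign, size in trades:
        remaining = size
        while remaining > 0:
            take = min(remaining, bucket_volume - filled)
            if sign > 0:
                buy += take
            else:
                sell += take
            filled += take
            remaining -= take
            if filled >= bucket_volume:
                imbalances.append(abs(buy - sell) / bucket_volume)
                buy = sell = filled = 0
    return imbalances

def vpin(imbalances, window):
    """Mean imbalance over the last `window` buckets, or None if too few."""
    if len(imbalances) < window:
        return None
    return sum(imbalances[-window:]) / window

# Hypothetical signed tape, bucket size 100 shares.
tape = [(+1, 60), (-1, 40), (+1, 150), (+1, 30), (-1, 20)]
buckets = volume_bucket_imbalances(tape, bucket_volume=100)
gauge = vpin(buckets, window=2)
print(buckets, gauge)
```

An ML toxicity model would typically treat a number like this as one input among many bucketed flow and book features rather than as the decision itself.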

The leakage trap recurs here in its sharpest form. Detection labels are often assigned with hindsight (an event later ruled manipulative), so the training data quietly encodes the future. Honest detection systems are validated out-of-time and out-of-sample for exactly this reason, the same discipline the backtesting and simulation guide teaches.
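One way to picture the hindsight problem is to carry two timestamps per example: when the behaviour happened and when its label became known. The sketch below, with a hypothetical `LabeledEvent` record and integer timestamps, only lets a model train on labels that already existed at the training cutoff and tests on events that occur later.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledEvent:
    event_time: int        # when the order-book behaviour occurred
    label_time: int        # when the label (e.g. a ruling) became known
    is_manipulative: bool

def out_of_time_split(events, train_cutoff, test_start):
    """Train on labels known by train_cutoff; test on events from test_start on."""
    if test_start < train_cutoff:
        raise ValueError("test window must not start before the training cutoff")
    train = [e for e in events if e.label_time <= train_cutoff]
    test = [e for e in events if e.event_time >= test_start]
    return train, test

events = [
    LabeledEvent(event_time=10, label_time=40, is_manipulative=True),
    LabeledEvent(event_time=20, label_time=90, is_manipulative=True),   # ruled on late
    LabeledEvent(event_time=30, label_time=35, is_manipulative=False),
    LabeledEvent(event_time=70, label_time=120, is_manipulative=True),
]
train, test = out_of_time_split(events, train_cutoff=50, test_start=60)
print(len(train), len(test))
```

The event at time 20 happened well before the cutoff, but its label did not exist until time 90, so a model trained as of time 50 could not have seen it.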

Features from microstructure

HFT ML lives or dies on its features, and the good ones come from microstructure: order-flow imbalance, book imbalance and the microprice, queue position and depth, trade signs and durations, and recent realised volatility. These encode the short-horizon predictive structure; the model is often the easy part once the features are right.

The canonical feature set: order-flow imbalance (OFI), the single strongest short-horizon predictor; book imbalance and the microprice, an imbalance-weighted fair value predictive of the next move; queue state, depth and your position, governing fill probability; durations and event-time features (clustering and intensity, Engle–Russell ACD 1998); and signed trade flow with realised volatility, feeding toxicity and regime features.
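The first three of these are simple enough to write down. The sketch below assumes top-of-book snapshots of the form `(bid, bid_size, ask, ask_size)`, uses the widely cited top-of-book OFI definition (Cont, Kukanov and Stoikov), and takes the simple size-weighted mid as the microprice; more refined microprice estimators exist, and the function names here are illustrative.

```python
def book_imbalance(bid_size, ask_size):
    """Top-of-book imbalance in [-1, 1]; positive means more resting bid size."""
    return (bid_size - ask_size) / (bid_size + ask_size)

def weighted_mid(bid, ask, bid_size, ask_size):
    """Size-weighted mid: heavy bid size pulls the fair value toward the ask."""
    return (ask * bid_size + bid * ask_size) / (bid_size + ask_size)

def ofi_increment(prev, curr):
    """Order-flow imbalance contribution of one top-of-book update."""
    pb0, qb0, pa0, qa0 = prev
    pb1, qb1, pa1, qa1 = curr
    # Bid side: size added at an equal-or-better bid counts as buying pressure,
    # size lost at an equal-or-worse bid counts against it.
    bid_flow = (qb1 if pb1 >= pb0 else 0) - (qb0 if pb1 <= pb0 else 0)
    # Ask side: the mirror image, measuring selling pressure.
    ask_flow = (qa1 if pa1 <= pa0 else 0) - (qa0 if pa1 >= pa0 else 0)
    return bid_flow - ask_flow

def ofi(snapshots):
    """OFI over a window: the sum of increments between consecutive snapshots."""
    return sum(ofi_increment(a, b) for a, b in zip(snapshots, snapshots[1:]))

# Hypothetical updates: bid size grows, ask size shrinks, then both prices tick up.
book = [
    (100, 500, 101, 400),
    (100, 700, 101, 300),
    (101, 200, 102, 600),
]
print(ofi(book), book_imbalance(700, 300), weighted_mid(100, 101, 700, 300))
```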

Two structural feature pitfalls to name. Stationarity: microstructure features drift with market regime, tick-size regime and venue rules, so a model trained in one regime decays in the next. Leakage: a feature computed with even a microsecond of future information manufactures fake skill, and nanosecond-resolution, irregularly spaced timestamps make this easy to get wrong.

x_t = \phi\big(\text{info available strictly before } t\big), \qquad \text{any peek at } t^{+} \;\Rightarrow\; \text{leakage}
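The "strictly before" condition is easy to violate by one comparison operator. A minimal sketch of an as-of lookup, assuming sorted integer timestamps (nanoseconds, say) and a hypothetical helper name:

```python
from bisect import bisect_left

def last_strictly_before(timestamps, values, t):
    """Most recent value whose timestamp is strictly less than t, else None.

    timestamps must be sorted ascending. bisect_left returns the first index
    whose timestamp is >= t, so index - 1 is the last one strictly before t.
    Using bisect_right here would admit an update stamped exactly at t,
    which is the peek the leakage condition forbids.
    """
    i = bisect_left(timestamps, t)
    return values[i - 1] if i > 0 else None

# Two book updates share the timestamp 200.
update_times = [100, 200, 200, 350]
book_states = ["a", "b", "c", "d"]
print(last_strictly_before(update_times, book_states, 200))  # state "a"
print(last_strictly_before(update_times, book_states, 201))  # state "c"
print(last_strictly_before(update_times, book_states, 100))  # None
```

Ties matter because exchange timestamps are irregular and often repeated: a feature for a decision at time 200 may only use state "a", even though "b" and "c" carry the same stamp.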

Why deep learning is hard at HFT latencies

There is a latency-versus-complexity trade-off: the trade decision on the hot path must execute in sub-microsecond-to-microsecond time, and a deep network's inference is far too slow to sit there. So the standard architecture keeps ML off the hot path: models compute signals near-line or off-line, and a small, fast, deterministic rule (often on an FPGA) acts on them. Complexity and speed are deliberately decoupled.

Stated plainly: accuracy and inference latency trade off, and the hot path has a budget measured in microseconds, often single digits and FPGA-bound. A deep model's forward pass is orders of magnitude too slow to run inside it; you cannot have both maximum model complexity and minimum latency on the same code path.

The resolution is a split architecture: rich models compute signals on slow timescales (milliseconds to overnight); a simple, fast, deterministic function (a threshold, a small linear rule, a lookup table) makes the actual microsecond decision, often in hardware. In HFT, "using ML" usually means informing a fast rule, not being it.

\underbrace{f_\theta(\cdot)}_{\text{off-path: rich, slow}} \;\longrightarrow\; s_t \;\longrightarrow\; \underbrace{\text{act if } s_t \gt \kappa}_{\text{hot path: simple, fast}}, \qquad \ell_{\text{deep}} \gg \ell_{\text{budget}}
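The split can be sketched as a slow model that is evaluated ahead of time over a grid of inputs, leaving the fast side with nothing but a table read. The code below is only an illustration of the shape: `slow_model` is a stand-in (a tanh of book imbalance), the bucket count and threshold are arbitrary, and a real hot path would live in hardware or low-level code rather than Python.

```python
import math

def slow_model(imbalance: float) -> float:
    """Stand-in for a rich off-path model; returns a signal in (-1, 1)."""
    return math.tanh(3.0 * imbalance)

def build_action_table(n_buckets: int, kappa: float) -> list[int]:
    """Off-path: evaluate the slow model at each bucket centre on [-1, 1]
    and store the resulting action (+1 buy, -1 sell, 0 do nothing)."""
    table = []
    for k in range(n_buckets):
        centre = -1.0 + (2 * k + 1) / n_buckets
        s = slow_model(centre)
        table.append(1 if s > kappa else -1 if s < -kappa else 0)
    return table

def hot_path_decision(table: list[int], imbalance: float) -> int:
    """Hot path: quantise the input and read the table; no model call."""
    k = int((imbalance + 1.0) * 0.5 * len(table))
    k = max(0, min(k, len(table) - 1))  # clamp the edges of [-1, 1]
    return table[k]

table = build_action_table(n_buckets=20, kappa=0.5)  # refreshed off-path
print(hot_path_decision(table, 0.8), hot_path_decision(table, 0.0), hot_path_decision(table, -0.8))
```

The design choice is that the table can be rebuilt as often as the slow model allows, while the decision latency stays fixed at one quantisation and one memory read.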

The corollary for model choice: at true HFT latencies the winner is often the simplest model that captures the signal: it runs faster, overfits less, and is easier to validate. Complexity is a hot-path liability, not just an overfitting one (the budget itself lives on colocation and FPGA). Move the horizon out (to seconds, minutes, or the research and portfolio layer) and the latency constraint relaxes; deep models and LLMs become entirely reasonable. The hot-path objection is specific to the high-frequency decision, not to ML in trading generally.

The realistic 2026 picture, including LLMs

In 2026, ML is a mature, standard part of the quant stack: incrementally useful for signals and classifiers, central to surveillance, and increasingly used around trading via LLMs for news parsing, research acceleration and code generation. It is a tool that compounds a good process; it does not rescue a bad one, and it does not make a slow path fast.

Where LLMs actually fit is off the trading loop: machine-readable news and event extraction (parsing filings, releases and headlines into structured signals, feeding news trading), research acceleration (literature, hypothesis generation, code) and tooling (faster harness and feature development). They sit off the latency-critical path by nature: an LLM call is milliseconds-to-seconds, never microseconds.

What AI does not change: it does not win the latency race (latency arbitrage: the wire is bought, not learned), it does not conjure capacity where impact and crowding cap it (capacity and alpha decay), and it does not exempt anyone from the cost stack. It lowers the cost of building and researching (genuinely part of the going independent in 2026 thesis) but it raises the bar too, because everyone has the same tools. The honest one-liner: ML improves a disciplined process at the margin and runs mostly off the hot path; treat any claim that it is a self-contained money machine as a red flag.

Overfitting is the default outcome

At HFT horizons the signal-to-noise ratio is tiny and the data, while abundant, is highly autocorrelated and regime-dependent, so a flexible model will fit noise by default. Overfitting is not an edge case; it is what happens unless you actively prevent it. The discipline that prevents it is the whole content of the backtesting and simulation guide.

Why it is the default, not a risk: returns at micro-horizons are almost all noise, and a model with enough capacity will happily explain the noise of the training period and call it skill. Add the multiple-testing problem (try enough features, models and hyperparameters and the best one looks great by chance) and a beautiful in-sample result is the expected outcome of a sloppy search, not evidence of anything.

Search over enough configurations and the best in-sample result is pure selection. The defences are strict out-of-sample and out-of-time validation, purged and embargoed cross-validation for autocorrelated series, honest multiple-testing accounting (the deflated-Sharpe intuition), and realistic fills and costs.
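Of those defences, purged and embargoed cross-validation is the one most specific to autocorrelated series. A simplified, index-based sketch follows; it assumes sample `i` is labelled using events `i` through `i + label_horizon`, and the function name and parameters are illustrative rather than taken from a library.

```python
def purged_kfold(n_samples, n_splits, label_horizon, embargo):
    """Yield (train_indices, test_indices) over contiguous test blocks.

    Purge: drop training samples before the test block whose label window
    [i, i + label_horizon] reaches into the block.
    Embargo: drop training samples for `embargo` further steps after the
    last test label window has closed.
    """
    fold = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold
        stop = n_samples if k == n_splits - 1 else start + fold
        test = list(range(start, stop))
        train = [
            i for i in range(n_samples)
            if i + label_horizon < start                # label ends before the block
            or i >= stop + label_horizon + embargo      # starts after block + embargo
        ]
        yield train, test

folds = list(purged_kfold(n_samples=100, n_splits=5, label_horizon=3, embargo=2))
train, test = folds[1]  # test block is samples 20..39
print(len(train), test[0], test[-1])
```

For the second fold, samples 17 to 19 are purged because their labels look into the test block, and samples 40 to 44 are held back after it, so training uses 0 to 16 and 45 to 99.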

\max_{i=1\ldots N}\;\widehat{\text{SR}}_i \;\to\; \text{inflated by } \sqrt{2\ln N}\;\;\text{under the null}, \qquad \text{SR}_{\text{deflated}} \ll \widehat{\text{SR}}_{\max}
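A quick simulation makes the selection effect tangible. The sketch below generates pure-noise return series, computes the t-statistic of each mean (a Sharpe estimate scaled by its standard error), and keeps the best of N. The inflation is measured in standard errors, and the square-root expression is a rough large-N guide, not an exact expectation. The trial count, series length and seed are arbitrary.

```python
import math
import random
import statistics

def null_tstat(rng: random.Random, n_obs: int) -> float:
    """t-statistic of the mean of a pure-noise return series (true edge = 0)."""
    returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    return statistics.fmean(returns) / (statistics.stdev(returns) / math.sqrt(n_obs))

def best_of_n(n_trials: int, n_obs: int, seed: int) -> float:
    """Best t-statistic found among n_trials configurations with no skill."""
    rng = random.Random(seed)
    return max(null_tstat(rng, n_obs) for _ in range(n_trials))

n_trials = 200
guide = math.sqrt(2 * math.log(n_trials))   # about 3.26 standard errors
best = best_of_n(n_trials, n_obs=500, seed=1)
print(f"best of {n_trials} null strategies: t = {best:.2f}, guide value = {guide:.2f}")
```

Every strategy here has zero true edge, yet the winner of the search typically sits two to three standard errors above zero, which is the sort of figure that would pass a naive significance test.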

We make no performance promises and sell no signals; the ML tools we would ever sell are infrastructure (clean data, validation harnesses, reference feature pipelines) never an alpha model. Honesty about overfitting is the moat: it is what separates this from every "deep learning prints money" content farm.

Worked example

A short-horizon order-book classifier, illustrative and dated to 2026 (synthetic, to make the overfitting and latency points concrete). Task: predict the sign of the next mid-move over the next 10 book events on a liquid synthetic name. Features: OFI over the last 20 events, book imbalance, microprice deviation, recent realised volatility, and trade-sign run-length, five engineered microstructure features.

A logistic-regression baseline reaches out-of-sample directional accuracy of about 56% (versus a 50% coin-flip); inference is a few nanoseconds (a dot product), so it can sit very near the hot path. Gradient-boosted trees on the same features reach about 57.5%; inference is hundreds of nanoseconds to microseconds, off the hot path, but feasible near-line. A book-level LSTM/CNN on raw 10-level book history reaches about 58.5% when validated honestly out-of-time (a real but modest +2.5 points over the linear model), but inference is tens of microseconds, firmly off the hot path, so its signal must be precomputed and handed to a fast rule.

The overfitting demonstration. Skip out-of-time validation and tune over 200 feature and hyperparameter combinations, keeping the best in-sample: the "best" model shows about 64% in-sample accuracy and a glorious equity curve that collapses to about 50.5% (noise) out-of-time. The apparent "skill" (8 points over the 56% linear baseline, 13.5 points over its own out-of-time result) was the multiple-testing illusion.

64\% \;\xrightarrow[\text{out-of-time}]{\text{honest split}}\; 50.5\%, \qquad \text{honest gain} \approx 58.5\% - 56\% = 2.5\ \text{pts}

The takeaway in one line: the deep model bought roughly 2.5 real points at the cost of much higher latency and far more overfitting exposure; whether that trade is worth it depends entirely on your validation discipline and your latency budget, and on the hot path, the answer is usually to ship the linear model. All figures are synthetic; reproduce the overfitting collapse directly in the backtesting guide by toggling the out-of-sample split and the multiple-testing slider.

Where this fits

↑ Up · building block of Trading strategies
↔ Across · composes with Order-flow information
↔ Across · composes with News trading
↔ Across · composes with What AI changes for HFT
→ Apply · makes money in Crypto market making
→ Apply · makes money in Prediction markets (Polymarket)
⤓ Build / Buy · tool needed Datasets & tools