High-frequency data

Trade-sign inference

∞ structural

Reviewed 4 June 2026. As of 2026: a permanent feature of the market, not an edge that decays.

Most feeds don’t label who was the aggressor, so you infer it. The tick rule signs by price change; Lee–Ready (1991) compares the trade to the prevailing mid. Both are imperfect, and accuracy drops in fast markets.

See it move

[Interactive explorer IX-TICKRULE, "Trade-sign inference: classify the tape": a tape of unlabelled trade prints between 99.986 and 100.012, each awaiting a buy/sell sign.]

What to notice. The raw tape doesn't tell you who was the aggressor. Lee–Ready compares each trade to the prevailing mid: above is buyer-initiated, below is seller-initiated. Run it and the unlabelled prints turn into signed flow you can actually model.

Why isn't the buy/sell side in the data?

A trade is a match between a resting order and an incoming one. The tape records the execution (price, size, time), but most public feeds do not label which side initiated it, that is, which side crossed the spread to take liquidity. Of the two orders that met, one was resting (passive, the maker) and one arrived and crossed (aggressive, the taker). The taker's direction is the "trade sign": a buy if the taker bought (lifted the ask), a sell if the taker sold (hit the bid). The information content is in the taker's choice to demand immediacy. Consolidated tapes and most historical equity data were built to report prices, not intent; the matching engine knows the aggressor but the public feed historically did not carry it. So the single most useful microstructure label is absent and must be reconstructed.

It matters because signed order flow is the input to nearly every order-flow signal: order-flow imbalance, PIN/VPIN toxicity, Kyle's lambda (price impact per unit signed volume), the spread decomposition. Get the sign wrong and every one of those is biased. The 2026 exception: many crypto venues and some modern feeds do publish the taker side per trade, and where you have such a feed you skip this problem. Historical data, consolidated tapes and many venues still do not, so inference remains a core skill.

The trade sign is the taker's direction: +1 if the aggressor lifted the ask, −1 if it hit the bid. The matching engine knows it; most public tapes drop it, so you must infer the label that drives nearly every order-flow signal.

q_t = \begin{cases} +1 & \text{taker bought (lifted ask)} \\ -1 & \text{taker sold (hit bid)} \end{cases} \quad \text{(usually unobserved)}

What is the tick rule?

The tick rule classifies a trade by its price change: a trade printed above the previous trade is a buy (an uptick), below is a sell (a downtick), and on a flat price you carry the last non-zero sign forward (a zero-tick). Aggressive buyers push prices up and aggressive sellers push them down, so the direction of the last price change is a decent proxy for who is currently demanding liquidity. It is crude (it can only react after a price change, classifying the current trade using a past price move) but it works because flow is persistent, and it needs only the trade tape, no quotes, which is why it is the universal fallback.

Its weaknesses are the flip side: it lags, and it struggles in fast markets and at the open. Accuracy on liquid US equities is typically reported around 85% against the true side. Select only the tick rule in the explorer above and watch an uptick classify as a buy and a zero-tick inherit the prior sign.

Sign the trade by whether the price ticked up or down; on a flat tick, inherit the previous sign. Trade prices alone, no quotes; about 85% accurate on liquid US equities.

q_t = \begin{cases} +1 & p_t \gt p_{t-1} \\ -1 & p_t \lt p_{t-1} \\ q_{t-1} & p_t = p_{t-1} \end{cases}
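The case formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tick_rule`, the `initial_sign` seed and the choice to return 0 for prints that cannot yet be signed are all assumptions made for the example.

```python
from typing import List

def tick_rule(prices: List[float], initial_sign: int = 0) -> List[int]:
    """Sign each trade by the direction of the last non-zero price change.

    Returns +1 (buy), -1 (sell), or initial_sign while no price change
    has been seen yet (the first print, or flat prices at the start).
    """
    signs = []
    last_sign = initial_sign   # hypothetical seed for the opening prints
    prev_price = None
    for price in prices:
        if prev_price is not None:
            if price > prev_price:
                last_sign = 1      # uptick: treat as buyer-initiated
            elif price < prev_price:
                last_sign = -1     # downtick: treat as seller-initiated
            # zero-tick: last_sign is carried forward unchanged
        signs.append(last_sign)
        prev_price = price
    return signs

tape = [100.00, 100.01, 100.01, 100.00, 100.00, 100.02]
print(tick_rule(tape))  # [0, 1, 1, -1, -1, 1]
```

The first print stays unsigned because there is no earlier price to compare against; how to seed it (drop it, or carry a sign over from the previous session) is a choice the rule itself does not make.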

What is the quote rule, and what is Lee–Ready (1991)?

The quote rule compares the trade price to the prevailing mid: above the mid is a buy, below is a sell. A trade printed near the ask was probably a buyer lifting the offer; near the bid, a seller hitting it. Comparing the trade to the contemporaneous mid is more direct than the tick rule because it uses the quote, not just the price history. The catch is the ambiguity at $p_t = \text{mid}_t$: a mid-print has no side to read. Lee–Ready (1991) resolves it: apply the quote rule when the trade is away from the mid, and fall back to the tick rule for mid-prints. It is the standard equities trade-sign classifier.

Lee and Ready also specified a trade-quote timing alignment, historically lagging the quote by about 5 seconds to match the trade to the quote in force before it printed, because of reporting delays (a caveat the failure-modes section revisits). The EMO variant (Ellis–Michaely–O'Hara 2000) uses the quote rule only for trades at the bid or ask, and the tick rule for everything in between, a different split that does better on some venues. Lee–Ready is the equities benchmark, typically about 85% accurate overall, better on trades away from the mid, worse on mid-prints and in fast markets. The classifier selector above (tick · quote · Lee–Ready · EMO) compares them on the current regime.

The quote rule signs by which side of the mid the trade printed; Lee–Ready uses it away from the mid and hands mid-prints to the tick rule. It is the standard equities classifier, ~85% accurate.

q_t = \begin{cases} +1 & p_t \gt \text{mid}_t \\ -1 & p_t \lt \text{mid}_t \\ \text{tick rule} & p_t = \text{mid}_t \end{cases}
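A minimal sketch of Lee–Ready and the EMO variant side by side, under stated assumptions: prices are integer ticks (so the mid comparison is exact), each trade already carries its prevailing bid and ask, and the names `lee_ready`, `emo` and the `Trade` tuple layout are invented for the example.

```python
from typing import List, Tuple

# (trade price, prevailing bid, prevailing ask), all in integer ticks (assumed layout)
Trade = Tuple[int, int, int]

def _tick_signs(prices: List[int]) -> List[int]:
    """Tick rule: sign of the last non-zero price change, 0 until one is seen."""
    signs, last, prev = [], 0, None
    for p in prices:
        if prev is not None and p != prev:
            last = 1 if p > prev else -1
        signs.append(last)
        prev = p
    return signs

def lee_ready(trades: List[Trade]) -> List[int]:
    """Quote rule away from the mid, tick rule for mid-prints."""
    ticks = _tick_signs([p for p, _, _ in trades])
    signs = []
    for (price, bid, ask), tick in zip(trades, ticks):
        if 2 * price > bid + ask:      # above the mid
            signs.append(1)
        elif 2 * price < bid + ask:    # below the mid
            signs.append(-1)
        else:                          # exactly at the mid
            signs.append(tick)
    return signs

def emo(trades: List[Trade]) -> List[int]:
    """Quote rule only at the bid or ask, tick rule for everything in between."""
    ticks = _tick_signs([p for p, _, _ in trades])
    signs = []
    for (price, bid, ask), tick in zip(trades, ticks):
        if price == ask:
            signs.append(1)
        elif price == bid:
            signs.append(-1)
        else:
            signs.append(tick)
    return signs

trades = [
    (10002, 10000, 10002),  # at the ask
    (10000, 10000, 10002),  # at the bid
    (10001, 10000, 10002),  # mid-print after an uptick
    (10004, 10000, 10004),  # at the ask
    (10003, 10000, 10004),  # inside the spread, above the mid, after a downtick
]
print(lee_ready(trades))  # [1, -1, 1, 1, 1]
print(emo(trades))        # [1, -1, 1, 1, -1]
```

The two rules disagree only on the last print: it sits above the mid but not at the ask, so Lee–Ready reads the quote and EMO reads the downtick. Which one is right depends on the venue, which is why both need benchmarking against side-flagged data.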

Where do these rules fail?

Trade-sign rules fail on four fronts. On mid-prints, a trade exactly at the mid has no quote-side signal, so the rule falls back to the tick rule (the weakest case). Midpoint-peg and dark executions concentrate exactly here, which makes venues with lots of mid liquidity the hardest. On trade-quote timing, the rule compares the trade to the prevailing quote, but which quote was in force when the trade actually executed? Reporting delays meant Lee–Ready lagged the quote by seconds in 1991-era data; on modern microsecond data the right offset is near-zero or even negative, and using the stale 5-second rule degrades accuracy. Holden–Jacobsen (2014) documented this and the resulting biases in modern equity data, so use the venue's own timestamps and test the alignment. In fast markets, quotes flicker faster than trades report, the "prevailing" quote is ambiguous, and accuracy drops in exactly the high-activity bursts (irregular time) where signed flow matters most. And non-standard executions (auctions, hidden/midpoint orders, odd lots) break the assumptions outright.

The honest framing: no rule is exact. Treat trade-sign as a noisy label with a known, regime-dependent error rate, and propagate that uncertainty into anything you build on it; do not pretend the inferred sign is ground truth. Switch the explorer above to the "lots of mid-prints (hard)" preset and accuracy visibly drops as the confusion matrix fills its off-diagonal.

The inferred sign is a noisy label, not ground truth: its accuracy degrades on mid-prints, in fast markets, and where trade-quote timestamps are misaligned. On microsecond data the classic 5-second Lee–Ready offset actively hurts (Holden–Jacobsen 2014).

\Pr(\hat{q}_t = q_t) \approx 0.85 \;\Rightarrow\; \text{error rate} \approx 15\%,\;\text{worst in fast, mid-heavy regimes}
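The trade-quote timing problem is easy to see in code. The sketch below, assuming quotes sorted by a microsecond timestamp and prices in integer ticks, looks up the quote in force at the trade time minus a configurable lag. The names `prevailing_quote`, `quote_rule_sign` and the `lag_us` parameter are illustrative, not taken from any library.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (timestamp in microseconds, bid, ask); prices in integer ticks (assumed layout)
Quote = Tuple[int, int, int]

def prevailing_quote(quotes: List[Quote], trade_ts_us: int,
                     lag_us: int = 0) -> Optional[Quote]:
    """Last quote stamped at or before trade_ts_us - lag_us, or None if there is none.

    quotes must be sorted by timestamp. A negative lag_us looks forward in time.
    """
    times = [q[0] for q in quotes]          # a real pipeline would precompute this once
    i = bisect_right(times, trade_ts_us - lag_us)
    return quotes[i - 1] if i > 0 else None

def quote_rule_sign(price: int, quote: Optional[Quote]) -> int:
    """+1 above the mid, -1 below, 0 at the mid or when no quote is available."""
    if quote is None:
        return 0
    _, bid, ask = quote
    return (2 * price > bid + ask) - (2 * price < bid + ask)

quotes = [(0, 10000, 10002), (1_000_000, 10001, 10003), (4_000_000, 10002, 10004)]
trade_ts, trade_price = 5_000_000, 10003

for lag in (0, 5_000_000):                  # no lag versus the classic 5-second lag
    q = prevailing_quote(quotes, trade_ts, lag)
    print(lag, q, quote_rule_sign(trade_price, q))
# lag 0         -> (4000000, 10002, 10004), sign 0 (a mid-print, left to the tick rule)
# lag 5 seconds -> (0, 10000, 10002), sign 1
```

The same print is a mid-print under one alignment and a confident buy under the other. Testing the alignment means sweeping `lag_us` and scoring each setting against side-flagged trades where they exist.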

Bulk-volume classification, and why the error propagates

Bulk-volume classification (BVC, Easley–López de Prado–O'Hara) abandons per-trade signing. Instead of labelling each trade, it works on volume bars (business time): the share of a bar's volume treated as buys is the standard-normal CDF of the bar's standardised price change, $Z\!\big((p_{\text{end}} - p_{\text{start}})/\sigma_{\Delta P}\big)$. It trades per-trade precision for a smoother, aggregate buy/sell split that is robust in fast markets, and it is exactly what VPIN consumes.

The real point of this page is propagation. Every signed-flow construct inherits the sign error: order-flow imbalance sums signed volume; VPIN measures the imbalance between buy and sell volume; Kyle's lambda regresses price change on signed volume. A few percent of mislabelled trades bias all of them, and the bias is worst in toxic, fast regimes, which is precisely when those signals are supposed to be most informative. The discipline: when you report an OFI or a VPIN, know which classifier produced its signs and what its error rate is on your regime. The signal is only as clean as the labels under it. This is the maker/taker match of the limit order book seen from the data side; signed flow becomes a quoting edge in order-flow / information-based market making.

BVC estimates the buy fraction of a volume bar from its standardised price change via the normal CDF: smoother than per-trade rules and the engine inside VPIN. Whatever the method, its label error flows straight into OFI, VPIN and Kyle's lambda.

\widehat{V}^{\,\text{buy}} = V \cdot Z\!\left(\frac{p_{\text{end}} - p_{\text{start}}}{\sigma_{\Delta P}}\right) \;\longrightarrow\; \text{OFI, VPIN, } \lambda \ \text{inherit the noise}
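The BVC formula needs nothing beyond the error function. A minimal sketch, assuming equal-volume bars and a caller-supplied `sigma_dp` (the standard deviation of bar-to-bar price changes); the function names are illustrative, and the `vpin` helper uses the usual average of |buy − sell| / V over bars as a simplification.

```python
import math
from typing import List

def normal_cdf(x: float) -> float:
    """Standard-normal CDF, written Z(.) in the formula above."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bvc_buy_volume(volume: float, p_start: float, p_end: float,
                   sigma_dp: float) -> float:
    """Estimated buy volume of one bar: V * Z((p_end - p_start) / sigma_dp)."""
    if sigma_dp <= 0:
        raise ValueError("sigma_dp must be positive")
    return volume * normal_cdf((p_end - p_start) / sigma_dp)

def vpin(buy_volumes: List[float], bar_volume: float) -> float:
    """Mean absolute buy/sell imbalance per bar, as a share of bar volume.

    Sell volume is bar_volume - buy, so |buy - sell| = |2 * buy - bar_volume|.
    """
    n = len(buy_volumes)
    return sum(abs(2.0 * b - bar_volume) for b in buy_volumes) / (n * bar_volume)

V = 10_000.0
flat = bvc_buy_volume(V, 100.00, 100.00, sigma_dp=0.05)   # no price change
up = bvc_buy_volume(V, 100.00, 100.05, sigma_dp=0.05)     # roughly +1 sigma move
print(flat, round(up, 1), round(vpin([flat, up], V), 3))
```

A flat bar splits 50/50, and a one-sigma up move assigns roughly 84% of the bar to buys. No individual trade is ever labelled, which is why mid-prints and quote timing stop mattering here and the estimate of `sigma_dp` starts to.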

Worked example

A synthetic tape with known true signs, as of 2026; reproduce it in the explorer above. Generate an efficient price plus bid-ask bounce, mark each trade as taker-buy or taker-sell (ground truth known), set a controllable share of mid-prints (say 15%) and a tunable spread. Now run each classifier against the truth. The tick rule lands at about 85%, errors concentrating where the price did not change (zero-ticks inheriting a stale sign) and right after the mid reverses. The quote rule reaches about 88% on trades away from the mid, but cannot classify the 15% mid-prints at all. Lee–Ready (quote rule plus tick-rule fallback on mid-prints) comes out about 87% overall, beating the pure tick rule by recovering the away-from-mid trades while still handling mid-prints, just less well.

Then make it hard: raise mid-prints to 40% and add fast-market quote flicker. Every classifier's accuracy falls (Lee–Ready toward about 78%) and the confusion matrix's off-diagonal fills, exactly the regime where you most wanted clean signs.
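The experiment above can be approximated offline with a toy generator. The sketch below is not the explorer's model: the random-walk mid, the one-tick half-spread, the sign persistence and the "stale quote" share (the classifier sees the previous mid) are all assumptions chosen for illustration, so its accuracies should not be expected to match the figures quoted here.

```python
import random
from collections import Counter

def tick_signs(prices):
    """Tick rule: sign of the last non-zero price change, 0 until one is seen."""
    signs, last, prev = [], 0, None
    for p in prices:
        if prev is not None and p != prev:
            last = 1 if p > prev else -1
        signs.append(last)
        prev = p
    return signs

def lee_ready(prices, mids):
    """Quote rule against the observed mid, tick-rule fallback at the mid."""
    ticks = tick_signs(prices)
    return [1 if p > m else -1 if p < m else t
            for p, m, t in zip(prices, mids, ticks)]

def simulate_tape(n, mid_share, stale_share, seed=0,
                  persistence=0.6, move_prob=0.2):
    """Return (prices, observed_mids, true_signs) in integer ticks (toy model)."""
    rng = random.Random(seed)
    mid, sign = 10_000, 1
    prices, seen_mids, truth = [], [], []
    for _ in range(n):
        prev_mid = mid
        if rng.random() < move_prob:                   # efficient price moves one tick
            mid += 1 if rng.random() < 0.5 else -1
        if rng.random() >= persistence:                # taker direction flips
            sign = -sign
        at_mid = rng.random() < mid_share              # midpoint execution
        prices.append(mid if at_mid else mid + sign)   # half-spread of one tick
        stale = rng.random() < stale_share             # misaligned quote timestamp
        seen_mids.append(prev_mid if stale else mid)
        truth.append(sign)
    return prices, seen_mids, truth

def accuracy(pred, truth):
    return sum(a == b for a, b in zip(pred, truth)) / len(truth)

def confusion(pred, truth):
    """Counts keyed by (true sign, predicted sign); predicted 0 means unsigned."""
    return Counter(zip(truth, pred))

for label, mid_share, stale in [("easy", 0.15, 0.05), ("hard", 0.40, 0.30)]:
    prices, mids, truth = simulate_tape(20_000, mid_share, stale, seed=1)
    lr = lee_ready(prices, mids)
    print(label,
          f"tick={accuracy(tick_signs(prices), truth):.3f}",
          f"lee_ready={accuracy(lr, truth):.3f}",
          dict(confusion(lr, truth)))
```

Even in this simplified setting the qualitative pattern holds: Lee–Ready beats the tick rule when quotes are mostly fresh, and raising the mid-print and stale-quote shares pulls its accuracy down and fills the off-diagonal of the confusion counts.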

Feed 87%-accurate signs into an order-flow imbalance and about 13% of signed volume carries the wrong sign, attenuating the OFI signal and biasing its link to the next price move. Part of the measured "alpha" is a label-noise artefact.

\text{accuracy} = 0.87 \;\Rightarrow\; \approx 13\%\ \text{of signed volume mis-signed} \;\Rightarrow\; \text{OFI attenuated, biased}
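How much a 13% error rate attenuates signed flow can be worked out under one strong simplification: if each label is flipped independently of the true sign, the trade size and the price move, correctly signed volume enters with weight +1 and mis-signed volume with weight −1, so the expected surviving fraction is 0.87 − 0.13 = 0.74. The sketch below checks that arithmetic by simulation; the function names and the uniform trade sizes are assumptions for the example, and the independence assumption is precisely what fails in fast, mid-heavy regimes.

```python
import random

def expected_attenuation(accuracy: float) -> float:
    """Surviving share of signed volume under independent label flips: 2a - 1."""
    return 2.0 * accuracy - 1.0

def simulated_attenuation(accuracy: float, n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check: (correctly signed - mis-signed volume) / total volume."""
    rng = random.Random(seed)
    kept, total = 0, 0
    for _ in range(n):
        volume = 1 + int(rng.random() * 100)       # hypothetical trade size, 1..100
        correct = rng.random() < accuracy          # error independent of everything else
        kept += volume if correct else -volume     # a wrong label flips the contribution
        total += volume
    return kept / total

print(round(expected_attenuation(0.87), 2))        # 0.74
print(round(simulated_attenuation(0.87), 2))
```

Under this idealised noise a clean signal is scaled by about 0.74, not 0.87, because every mis-signed share both removes a correct contribution and adds a wrong one. Real errors cluster in the regimes that matter, so the bias is not a constant scale factor that can simply be divided out.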

The downstream cost is the lesson: the inferred sign is a noisy label, and the noise does not vanish when you aggregate. It propagates into every signal built on signed flow, worst in exactly the toxic, fast regimes those signals exist to catch. Reverify accuracy figures against real, side-flagged data before relying on them; these synthetic figures exist to make the mechanism legible. The explorer lets you set the spread, the mid-print frequency and the toxicity, switch classifiers, and read the exact accuracy and confusion matrix against ground truth. Trade-sign inference is the fourth of the statistical traps in high-frequency data, after bid-ask bounce, fat tails and irregular time. What this page implies you need is side-flagged trade data and a reference classifier benchmarked against ground truth (see datasets and tools).

Where this fits

↑ Up · building block of High-frequency data
↔ Across · composes with The limit order book
↔ Across · composes with Order-flow imbalance
↔ Across · composes with PIN / VPIN
→ Apply · makes money in Order-flow information
→ Apply · makes money in Crypto market making
⤓ Build / Buy · tool needed Datasets & tools

Common questions

How do I tell buyer- from seller-initiated trades?

You infer the aggressor, because most feeds do not label it. The tick rule signs a trade by whether its price rose or fell from the prior trade. The Lee–Ready algorithm (1991) compares the trade price to the prevailing midpoint (above mid is buyer-initiated, below is seller-initiated), falling back to the tick rule at the mid. Both are imperfect; accuracy drops in fast or fragmented markets.