High-frequency data

How data is recorded

∞structural

Reviewed 4 June 2026. As of 2026: a permanent feature of the market, not an edge that decays.

Trades, quotes and book updates, each stamped with a sequence number and a clock time. Getting the timestamps and ordering right is the difference between a real backtest and a fiction.

The idea

Figure DG-TAPE: Anatomy of a tape (annotated diagram).

What this shows. A tape is just the matching engine narrating itself: typed messages (ADD, TRADE, DELETE), each carrying a sequence number and up to three timestamps. One marketable buy emits two messages that must be applied atomically, and the trade carries no aggressor flag. Miss a single sequence number and a phantom order never cancels, silently corrupting every later state – which is why gap-free capture is the integrity check.

What are the three message types on a tape?

A tape carries three kinds of message. A trade records an execution: price, size, and sometimes the aggressor side. A quote records a change to the best bid or offer. A book update records an add, modify or cancel at any price level. Trades tell you what happened; quotes and book updates tell you what was available.

Intuition first, because the structure is simpler than the jargon. The matching engine (the limit order book) does only two things all day: it rests and cancels resting orders, and it crosses orders into trades. The tape is just the engine narrating both, message by message, in the exact order they happened. Everything else on this page is bookkeeping on top of that single stream.

A trade (a "print") says an execution occurred: instrument, price, size, timestamp, a trade ID, and frequently a set of condition codes (auction print, odd lot, off-book, late report). Critically, most public trade feeds do not flag the aggressor side (whether the buyer or the seller crossed the spread), and that omission is the entire problem of trade-sign inference. A quote (a BBO update) says the best bid or best ask changed to a new top-of-book price or size; this is the L1 stream. A book update (a depth message) says an order was added, modified or cancelled at some price level; this is the L2/L3 stream, the granular truth from which quotes and even trades can be derived.

A single logical event (one marketable order) can emit several messages at once: a trade print plus the book updates that remove the resting orders it consumed. Your handler must apply them as one atomic step.

\text{marketable order} \;\longrightarrow\; \{\,\text{TRADE},\; \text{DELETE/MODIFY of the resting orders}\,\}
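A minimal sketch of that atomic step. The message layout and field names here are hypothetical, and the grouping key is an assumption: it treats messages that share an exchange timestamp as one engine event, as in this page's examples. Real feeds may mark event boundaries differently, so check the venue's spec.

```python
from itertools import groupby

# Hypothetical layout: (seq, exch_ts_ns, kind, order_id, side, price, size)
tape = [
    (1, 100, "ADD", "B", "ask", 50.01, 100),
    (2, 250, "TRADE", None, None, 50.01, 100),
    (3, 250, "DELETE", "B", None, None, None),
]

def apply_event(book, trades, event):
    """Apply every message of one engine event before anyone reads the book."""
    for _seq, ts, kind, order_id, side, price, size in event:
        if kind == "ADD":
            book[order_id] = (side, price, size)
        elif kind == "DELETE":
            book.pop(order_id, None)
        elif kind == "TRADE":
            trades.append((ts, price, size))

def replay(messages):
    book, trades, observed = {}, [], []
    # Assumption: consecutive messages with the same exchange timestamp
    # belong to the same engine event.
    for ts, event in groupby(messages, key=lambda m: m[1]):
        apply_event(book, trades, list(event))
        # Only publish state at event boundaries, never mid-event.
        observed.append((ts, dict(book), len(trades)))
    return observed

states = replay(tape)
```

Because state is only published at event boundaries, no reader ever sees the trade printed against a still-resting order B.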

What is the difference between L1, L2 and L3 data?

The three levels describe how much of the book a feed exposes. L1 is just the top of book: best bid and ask plus the last trade. L2 is aggregated depth per price level (market-by-price). L3 is order-by-order (market-by-order): every individual order with its own ID and queue position. Deeper feeds are larger, but they let you reconstruct more of the microstructure.

L1 (top of book / BBO) gives the best bid, best ask, their sizes and the last trade: enough to see the price, useless for modelling depth, queueing or fill probability.

L2 (market-by-price, MBP) gives, for each side, the total resting size aggregated at each price level (often the top 5–10 levels). It is the standard input to most microstructure work: you can see depth and model walking the book, but you cannot recover where a specific order sits in the queue.

L3 (market-by-order, MBO) gives every individual order with its own ID, price, size and arrival order, so you can see queue position and reconstruct the book deterministically. That is what you need to study queue value and fill probability.

The honest point, repeated across this atlas: the granularity you can get caps the questions you can ask. Nasdaq TotalView-ITCH and most serious exchange feeds are MBO; most retail "L2" is aggregated MBP and cannot recover the queue. A clean L3/MBO tape is the resource the strategy pages quietly assume, and the thing almost nobody has cleanly.

The same book at three resolutions: L1 shows the touch, L2 the depth per price, L3 the individual orders in queue order. Each coarser level hides detail that only the finer levels can represent.

\text{L1: } (p_b, q_b, p_a, q_a) \;\subset\; \text{L2: } \{(p_k, Q_k)\} \;\subset\; \text{L3: } \{(\text{id}, p, q, \text{seq})\}
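The nesting can be sketched as two one-way aggregations. The record layout, the order IDs and the extra order D are hypothetical; the point is only that L3 collapses to L2 and L2 to L1, while neither step can be reversed.

```python
from collections import defaultdict

# Hypothetical L3 records: (order_id, side, price, size, seq)
l3 = [
    ("A", "bid", 50.00, 300, 1001),
    ("B", "ask", 50.01, 100, 1002),
    ("C", "bid", 49.99, 400, 1003),
    ("D", "bid", 50.00, 200, 1006),  # queues behind A at the same price
]

def to_l2(orders):
    """Aggregate individual orders into total size per (side, price)."""
    levels = defaultdict(int)
    for _oid, side, price, size, _seq in orders:
        levels[(side, price)] += size
    return dict(levels)

def to_l1(levels):
    """Keep only the touch: highest bid and lowest ask with their sizes."""
    best_bid = max(p for (side, p) in levels if side == "bid")
    best_ask = min(p for (side, p) in levels if side == "ask")
    return (best_bid, levels[("bid", best_bid)],
            best_ask, levels[("ask", best_ask)])

l2 = to_l2(l3)
l1 = to_l1(l2)
```

After `to_l2`, the 500 lots at 50.00 no longer record that A is ahead of D, which is exactly the queue information an MBP feed cannot give back.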

What are the three timestamps, and why do they differ?

A single message can carry up to three timestamps: the exchange (matching-engine) timestamp when the event occurred inside the venue, the gateway timestamp when it left the venue's edge, and your own capture timestamp when your handler received it. They differ by transmission and queueing latency, and only the exchange timestamp orders events across instruments correctly.

The intuition is a relay of clocks. The event happens in one clock domain (the engine), travels through wires and gateways (another), and is recorded in yours (a third). Each hop adds delay and jitter; the three stamps bracket where time was spent. The exchange / engine stamp is the causal clock: applied at the moment of the event, it is the only one that correctly orders two events on different instruments, so it is the stamp you use for any cross-instrument analysis (lead-lag, or the trade-sign rules that compare a trade to a contemporaneous quote). The gateway stamp is applied as the message egresses the venue's market-data gateway; its gap to the engine stamp measures internal venue latency. The capture stamp is applied by your feed handler or capture NIC on receipt; its gap to the gateway stamp is your transport latency.

The two gaps between the three stamps decompose total latency into the venue's internal delay and your transport delay, the latter being exactly what the low-latency stack and colocation exist to shrink.

\underbrace{t_{\text{gw}} - t_{\text{exch}}}_{\text{venue internal}} \;+\; \underbrace{t_{\text{cap}} - t_{\text{gw}}}_{\text{your transport}} \;=\; t_{\text{cap}} - t_{\text{exch}}
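As a small illustration of the identity, with hypothetical field names and made-up nanosecond values. It assumes all three clocks are disciplined to the same reference; if they are not, each difference also contains an unknown clock offset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stamps:
    exch_ns: int  # matching-engine timestamp
    gw_ns: int    # market-data gateway egress timestamp
    cap_ns: int   # local capture timestamp

def decompose(s: Stamps) -> dict:
    """Split total latency into venue-internal and transport components."""
    venue = s.gw_ns - s.exch_ns
    transport = s.cap_ns - s.gw_ns
    return {"venue_ns": venue, "transport_ns": transport,
            "total_ns": venue + transport}

# Illustrative values only, not measurements.
sample = Stamps(exch_ns=1_000_000, gw_ns=1_004_200, cap_ns=1_019_700)
parts = decompose(sample)
```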

Serious capture stamps at the network card (PTP-disciplined, often to nanoseconds) rather than in software, because OS scheduling jitter dwarfs the differences you care about. A software time.time() in a Python handler is worthless for microstructure: the jitter it adds is larger than the latency gaps it would measure. Hardware stamping does not shrink the transport gap $t_{\text{cap}} - t_{\text{gw}}$; it only lets you measure it honestly.

Why does clock synchronisation (PTP) matter?

To compare events across machines, venues or instruments, every clock must agree to within microseconds, otherwise you cannot tell which of two near-simultaneous events came first. PTP (IEEE 1588) disciplines clocks to a common reference (often GPS) at sub-microsecond accuracy; NTP's millisecond accuracy is far too coarse, and MiFID II RTS 25 makes the floor a legal requirement.

The danger is manufactured causality. If two clocks drift by a millisecond, a "lead-lag" you measure at the millisecond scale is an artefact of the clocks, not the market: at HFT timescales, unsynchronised clocks invent relationships that are not there. PTP (Precision Time Protocol, IEEE 1588) distributes a GPS-disciplined grandmaster clock across a network with hardware timestamping at each hop, achieving sub-microsecond (often tens-of-nanoseconds) synchronisation. NTP is millisecond-class and adequate only for daily-bar work.

There is also a regulatory floor. MiFID II RTS 25 requires HFT participants' clocks synchronised to UTC within 100 microseconds (1 ms for non-HFT), and the US Consolidated Audit Trail (CAT) sets its own granularity rules; both are 2026 minimums, and competitive shops run far tighter. The takeaway for a reader buying or building a tape: the quality of the timestamps is as important as the prices. A feed with software timestamps and an undisciplined clock cannot support microstructure research, however cheap it is.

The synchronisation budget set against the events you want to order: a 100 µs MiFID floor, sub-microsecond PTP in practice, but events arriving microseconds apart inside a burst, so millisecond-class NTP cannot order them at all.

\Delta_{\text{NTP}} \sim 1\,\text{ms} \;\gg\; \Delta_{\text{MiFID}} = 100\,\mu\text{s} \;\gg\; \Delta_{\text{PTP}} \sim 10\text{–}100\,\text{ns}
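One way to read that inequality chain, as a sketch: two stamps from different clocks can be ordered with confidence only if they differ by more than the combined worst-case clock error. The error budgets below are the orders of magnitude quoted on this page, not measurements, and the function name is hypothetical.

```python
def can_order(t_a_ns: int, t_b_ns: int, err_a_ns: int, err_b_ns: int) -> bool:
    """True if the stamp difference exceeds the combined worst-case clock error."""
    return abs(t_a_ns - t_b_ns) > err_a_ns + err_b_ns

GAP_NS = 5_000            # two events stamped 5 microseconds apart
NTP_ERR_NS = 1_000_000    # ~1 ms
MIFID_ERR_NS = 100_000    # 100 microsecond regulatory floor
PTP_ERR_NS = 100          # upper end of the 10-100 ns range

results = {
    "ntp": can_order(0, GAP_NS, NTP_ERR_NS, NTP_ERR_NS),
    "mifid_floor": can_order(0, GAP_NS, MIFID_ERR_NS, MIFID_ERR_NS),
    "ptp": can_order(0, GAP_NS, PTP_ERR_NS, PTP_ERR_NS),
}
```

Under these assumed budgets only the PTP-class clocks can say which of the two events came first; a clock that merely meets the regulatory floor cannot.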

What are sequence numbers, and why must they be gap-free?

Every message on a feed carries a monotonically increasing sequence number. It lets the receiver detect a dropped or out-of-order packet: if you see sequence 1004 after 1002, you missed 1003 and your book is now wrong. Gap detection plus a recovery mechanism is what makes a reconstructed book trustworthy.

The intuition is that local book state is a running total. The book you maintain is the cumulative sum of every increment; miss one increment and every subsequent state is silently corrupt: a phantom order that never cancels, a level that never clears. The sequence number is the integrity check that catches the miss. Multicast market-data feeds (binary, e.g. ITCH over UDP) number every message per channel, and a gap ( $\text{seq}_{\text{received}} \neq \text{seq}_{\text{expected}}$ ) means packet loss, common on UDP multicast, which triggers recovery.

A reconstructed book is the running sum of every applied increment; a single missed sequence number poisons every later state. Detect the gap, recover the missing increment, or the book diverges silently.

B_n = B_0 + \sum_{k=1}^{n} \Delta_k \qquad\Rightarrow\qquad \text{one missing } \Delta_k \;\Rightarrow\; B_m \text{ wrong for all } m \ge k
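A minimal gap detector for one channel, as a sketch. It assumes sequence numbers increase by exactly one per message, which is the convention used on this page; some venues number packets rather than messages, so the increment rule is venue-specific.

```python
def find_gaps(seqs, start):
    """Return (expected, received) for every forward jump in the sequence."""
    gaps = []
    expected = start
    for seq in seqs:
        if seq > expected:
            gaps.append((expected, seq))  # everything in between was missed
        if seq >= expected:
            expected = seq + 1
        # seq < expected: duplicate or stale message, ignored here
    return gaps

received = [1001, 1002, 1004, 1004, 1005]
gaps = find_gaps(received, start=1001)
```

Here `gaps` reports that 1003 was expected when 1004 arrived; a real handler would stop applying increments at that point and start recovery rather than carry on.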

Recovery comes in two forms. A redundant A/B feed publishes two identical multicast streams; you arbitrate, taking whichever copy of each message arrives first and filling A's gaps from B. Or a snapshot/refresh channel lets you request a fresh book, then replay increments since. Either way, faithful backtesting depends on it: to simulate fills you must replay the exact message sequence the engine produced, in order, which is only possible if your capture is gap-free and sequence-checked. A tape with silent gaps produces a backtest that diverges from reality in ways you cannot see (see backtesting & simulation).
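A/B arbitration can be sketched as "first copy wins, release in sequence order". The tuple layout and payload strings are hypothetical, and a production arbiter would also time out and fall back to the snapshot channel if a sequence number never arrives on either feed.

```python
def arbitrate(arrivals, start):
    """arrivals: (feed, seq, payload) tuples in arrival order across A and B.

    Returns (released, next_seq, still_pending). Messages are released
    strictly in sequence; the first copy of each seq wins.
    """
    next_seq = start
    pending = {}
    released = []
    for _feed, seq, payload in arrivals:
        if seq < next_seq or seq in pending:
            continue  # duplicate from the other feed
        pending[seq] = payload
        while next_seq in pending:
            released.append((next_seq, pending.pop(next_seq)))
            next_seq += 1
    return released, next_seq, sorted(pending)

# Feed A lost seq 2; feed B delivers it slightly later.
arrivals = [
    ("A", 1, "add"), ("B", 1, "add"),
    ("A", 3, "delete"),
    ("B", 2, "trade"), ("B", 3, "delete"),
]
released, next_seq, still_pending = arbitrate(arrivals, start=1)
```

Message 3 from feed A is held back until B supplies 2, so the downstream book never sees the delete before the trade.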

Snapshots vs increments: how do you reconstruct the book?

A feed gives you a periodic snapshot (the full book at an instant) plus a stream of increments (each add, modify or cancel). You build the book by loading the latest snapshot, then applying every increment in sequence. Snapshots bound recovery time; increments carry the live changes. Get the apply order wrong and your book desynchronises.

Think of the snapshot as a save point and the increments as the moves since. You can always rebuild current state from the last save plus the moves, provided you apply them in exact sequence with no gaps. Incremental (delta) feeds are the efficient default: only changes are sent, so bandwidth tracks activity, not book size. Snapshot/refresh is sent periodically or on request, so a late joiner (or a handler that took a gap) can resynchronise without replaying the whole day.

The reconstruction rule: keep the latest snapshot taken at or before now, call its sequence number $s$, then apply, in order, only the increments whose sequence strictly exceeds $s$. Apply a diff that predates the snapshot and you corrupt the book.

B(t) = \text{snapshot}_{s} \;\oplus\; \bigoplus_{\,\text{seq} \,>\, s} \Delta_{\text{seq}}, \qquad s = \max\{\,\text{seq}(S) : S \text{ a snapshot taken at or before } t\,\}

This is the classic crypto bug. Many crypto venues offer a book snapshot over REST (or on WebSocket subscribe) plus a WebSocket diff channel, and you must buffer the diffs, fetch the snapshot, then apply only diffs with sequence greater than the snapshot's. Skipping that handshake (applying diffs that predate the snapshot, or dropping diffs that arrived while the snapshot was being fetched) silently corrupts the book; it is the single most common mistake when recording your own venue. The professional template is ITCH/OUCH-style: Nasdaq's ITCH is a pure incremental MBO feed (add / execute / cancel / delete order messages), and OUCH is the corresponding order-entry protocol. The model (full add/modify/cancel increments with a recovery snapshot) is mirrored by CME MDP 3.0, Eurex EOBI and others. See IX-FIX for decoding the actual message bytes, and consult the venue's feed spec for the exact message set (as of 2026).
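The buffer-then-snapshot handshake can be sketched as follows. Everything about the data layout is an assumption for illustration: diffs are taken to carry an absolute size per price level (size 0 removes the level) and a contiguous per-message sequence number. Real venues differ, for example by using update-ID ranges, so treat this as the shape of the logic rather than any venue's API.

```python
def rebuild(snapshot_seq, snapshot_levels, buffered_diffs):
    """Rebuild an MBP book from a snapshot plus diffs buffered around it.

    snapshot_levels: {(side, price): size}
    buffered_diffs:  iterable of (seq, side, price, size)
    """
    book = dict(snapshot_levels)
    last = snapshot_seq
    for seq, side, price, size in sorted(buffered_diffs):
        if seq <= snapshot_seq:
            continue  # already reflected in the snapshot
        if seq != last + 1:
            raise RuntimeError(f"gap: expected {last + 1}, received {seq}")
        if size == 0:
            book.pop((side, price), None)  # level removed
        else:
            book[(side, price)] = size     # level set to new absolute size
        last = seq
    return book, last

snapshot = {("bid", 50.00): 300, ("ask", 50.01): 100}
diffs = [
    (9, "bid", 50.00, 250),   # predates the snapshot: must be skipped
    (10, "bid", 50.00, 300),  # equals the snapshot seq: must be skipped
    (11, "ask", 50.01, 0),
    (12, "bid", 49.99, 400),
]
book, last_seq = rebuild(10, snapshot, diffs)
```

Applying diff 9 would have rolled the 50.00 bid back to a stale 250, which is the silent corruption described above.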

Worked example

A fragment of a synthetic MBO tape for instrument XYZ, as of 2026: one channel, timestamps in this venue's nanoseconds-since-midnight, $\text{seq}$ monotonic. Sequence 1001 adds 300 lots to the bid at 50.00 (order A, the best bid); 1002 adds 100 to the ask at 50.01 (order B, the best ask, so the spread is one tick); 1003 adds 400 to the bid at 49.99 (order C, the second bid level). Then a marketable buy arrives: 1004 is a TRADE of 100 at 50.01, and 1005 is a DELETE of order B at 50.01. Read what that tape tells you.

The marketable buy crossing the spread produced two messages: the trade print (1004) and the book update removing the resting ask (1005). They share a timestamp to the nanosecond because they are one engine event, so your handler must apply both atomically; treat them as separate and you will momentarily show a trade against a still-resting offer. The trade at 1004 carries no aggressor flag; you would infer it was buyer-initiated because it printed at the ask (the quote rule; see trade-sign inference).
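A sketch of that inference, using the common midpoint form of the quote rule: a print above the prevailing midpoint is classed as a buy, below it as a sell, and at the midpoint it is left unclassified. The function name is hypothetical, and the quote passed in must be the one prevailing just before the trade.

```python
def quote_rule(trade_price, best_bid, best_ask):
    """Classify a trade against the quote prevailing just before it."""
    mid = (best_bid + best_ask) / 2
    if trade_price > mid:
        return "buy"      # printed nearer the ask: buyer-initiated
    if trade_price < mid:
        return "sell"     # printed nearer the bid: seller-initiated
    return "unknown"      # at the midpoint the quote rule is silent

# The print at seq 1004, against the book as it stood after seq 1003.
sign = quote_rule(50.01, best_bid=50.00, best_ask=50.01)
```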

After the trade and the delete, the touch has moved. The ask side is empty until the next ADD, so the top of book is one-sided: the trade did not just print, it consumed the offer.

\text{after seq 1005:}\quad \text{bid } 50.00 \times 300, \qquad \text{ask: empty until next ADD}
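The five messages can be replayed in a few lines to confirm that end state. The tuple layout is hypothetical, timestamps are omitted for brevity, and `touch` relies on there being one order per price level in this example; a real handler would aggregate size across orders at the same price.

```python
# (seq, kind, order_id, side, price, size) for the synthetic XYZ tape
tape = [
    (1001, "ADD", "A", "bid", 50.00, 300),
    (1002, "ADD", "B", "ask", 50.01, 100),
    (1003, "ADD", "C", "bid", 49.99, 400),
    (1004, "TRADE", None, None, 50.01, 100),
    (1005, "DELETE", "B", None, None, None),
]

def replay(messages):
    """Apply ADD/DELETE to an order map and collect TRADE prints."""
    orders, trades = {}, []
    for seq, kind, oid, side, price, size in messages:
        if kind == "ADD":
            orders[oid] = (side, price, size)
        elif kind == "DELETE":
            del orders[oid]
        elif kind == "TRADE":
            trades.append((seq, price, size))
    return orders, trades

def touch(orders):
    """Best bid and best ask as (price, size), or None for an empty side."""
    bids = [(p, q) for s, p, q in orders.values() if s == "bid"]
    asks = [(p, q) for s, p, q in orders.values() if s == "ask"]
    return (max(bids) if bids else None, min(asks) if asks else None)

orders, trades = replay(tape)
best_bid, best_ask = touch(orders)
```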

Now drop a packet. Suppose the next message on the channel is seq 1006 and your capture jumped from 1004 to 1006, missing the DELETE at 1005. You would carry a phantom resting ask B that never cancels: your book shows liquidity that no longer exists, and every later state inherits the error. Gap detection on the missing 1005 ($\text{expected } 1005,\ \text{received } 1006$) is precisely what saves you: it forces a recovery instead of a silent corruption.

The numbers here are illustrative and synthetic; real timestamp granularity, message sets and condition codes are venue-specific, so always read the venue's feed specification (as of 2026). The structure (events, three timestamps, sequence numbers, snapshot plus increments) is invariant; the field widths and codes are what you must look up.

Where this fits

↑ Up · building block of High-frequency data
↔ Across · composes with The limit order book
↔ Across · composes with Irregular time & point processes
↔ Across · composes with Trade-sign inference
→ Apply · makes money in Crypto market making
⤓ Build / Buy · tool needed Datasets & tools