How data is recorded
Trades, quotes and book updates, each stamped with a sequence number and a clock time. Getting the timestamps and ordering right is the difference between a real backtest and a fiction.
The idea
What are the three message types on a tape?
A tape carries three kinds of message. A trade records an execution: price, size, and sometimes the aggressor side. A quote records a change to the best bid or offer. A book update records an add, modify or cancel at any price level. Trades tell you what happened; quotes and book updates tell you what was available.
Intuition first, because the structure is simpler than the jargon. The matching engine (the limit order book) does only two things all day: it rests and cancels orders, and it crosses orders into trades. The tape is just the engine narrating both, message by message, in the exact order they happened. Everything else on this page is bookkeeping on top of that single stream.
A trade (a "print") says an execution occurred: instrument, price, size, timestamp, a trade ID, and frequently a set of condition codes (auction print, odd lot, off-book, late report). Critically, most public trade feeds do not flag the aggressor side (whether the buyer or the seller crossed the spread), which is the entire problem of trade-sign inference. A quote (a BBO update) says the best bid or best ask changed, a new top-of-book price or size; this is the L1 stream. A book update (depth message) says an order was added, modified down, or cancelled at some price level: the L2/L3 stream, the granular truth from which quotes and even trades can be derived.
What is the difference between L1, L2 and L3 data?
The three levels describe how much of the book a feed exposes. L1 is just the top of book: best bid and ask plus the last trade. L2 is aggregated depth per price level (market-by-price). L3 is order-by-order (market-by-order): every individual order with its own ID and queue position. Deeper feeds are larger, but they let you reconstruct more of the microstructure.
L1 (top of book / BBO) gives the best bid, best ask, their sizes and the last trade: enough to see the price, useless for modelling depth, queueing or fill probability. L2 (market-by-price, MBP) gives, for each side, the total resting size aggregated at each price level (often the top 5–10 levels): the standard input to most microstructure work, where you can see depth and model walking the book but cannot recover where a specific order sits in the queue. L3 (market-by-order, MBO) gives every individual order, with its own ID, price, size and arrival, so you can see queue position and reconstruct the book deterministically; this is the level required to study queue value and fill probability.
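The relationship between the three levels can be sketched as a one-way reduction: L3 aggregates down to L2, and L2 truncates down to L1, but not the reverse. The order list, the tuple layout and the function names below are hypothetical, and prices are held as integer cents to avoid float keys.

```python
from collections import defaultdict

# Hypothetical L3 (MBO) resting orders in arrival order:
# (order_id, side, price_in_cents, size)
orders = [
    ("A", "bid", 5000, 300),
    ("C", "bid", 4999, 400),
    ("D", "bid", 5000, 200),  # queued behind A at the same price
    ("B", "ask", 5001, 100),
]

def to_l2(orders):
    """Aggregate order-by-order data into total size per price level (MBP)."""
    levels = {"bid": defaultdict(int), "ask": defaultdict(int)}
    for _order_id, side, price, size in orders:
        levels[side][price] += size
    return {side: dict(by_price) for side, by_price in levels.items()}

def to_l1(l2):
    """Keep only the best bid and best ask with their sizes (BBO)."""
    best_bid = max(l2["bid"])
    best_ask = min(l2["ask"])
    return {"bid": (best_bid, l2["bid"][best_bid]),
            "ask": (best_ask, l2["ask"][best_ask])}

l2 = to_l2(orders)
l1 = to_l1(l2)
print(l2)
print(l1)
```

After aggregation the 50.00 bid level shows 500 lots, and nothing in the L2 view says that A (300) is ahead of D (200); that ordering exists only in the L3 input.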
The honest point, repeated across this atlas: the granularity you can get caps the questions you can ask. Nasdaq TotalView-ITCH and most serious exchange feeds are MBO; most retail "L2" is aggregated MBP and cannot recover the queue. A clean L3/MBO tape is the resource the strategy pages quietly assume, and the thing almost nobody has cleanly.
What are the three timestamps, and why do they differ?
A single message can carry up to three timestamps: the exchange (matching-engine) timestamp when the event occurred inside the venue, the gateway timestamp when it left the venue's edge, and your own capture timestamp when your handler received it. They differ by transmission and queueing latency, and only the exchange timestamp orders events across instruments correctly.
The intuition is a relay of clocks. The event happens in one clock domain (the engine), travels through wires and gateways (another), and is recorded in yours (a third). Each hop adds delay and jitter; the three stamps bracket where time was spent. The exchange / engine stamp is the causal clock: applied at the moment of the event, it is the only one that correctly orders two events on different instruments, so it is the stamp you use for any cross-instrument analysis (lead-lag, or the trade-sign rules that compare a trade to a contemporaneous quote). The gateway stamp is applied as the message egresses the venue's market-data gateway; its gap to the engine stamp measures internal venue latency. The capture stamp is applied by your feed handler or capture NIC on receipt; its gap to the gateway stamp is your transport latency.
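As a rough illustration of how the three stamps bracket where time was spent, the sketch below computes the two gaps described above. The `Stamps` class, its field names and the sample values are assumptions for illustration, and the differences are only meaningful if all three clocks are disciplined to a common reference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stamps:
    """Hypothetical container for the three timestamps, in nanoseconds."""
    engine_ns: int    # applied by the matching engine at the event
    gateway_ns: int   # applied as the message leaves the venue
    capture_ns: int   # applied by your handler or capture NIC

    @property
    def venue_latency_ns(self) -> int:
        # Time spent inside the venue between event and egress.
        return self.gateway_ns - self.engine_ns

    @property
    def transport_latency_ns(self) -> int:
        # Time spent between venue egress and your capture point.
        return self.capture_ns - self.gateway_ns

# Synthetic values: nanoseconds since midnight, around 09:30:00.
msg = Stamps(engine_ns=34_200_000_000_000,
             gateway_ns=34_200_000_004_500,
             capture_ns=34_200_000_021_700)
print(msg.venue_latency_ns, msg.transport_latency_ns)
```

With these made-up stamps the venue accounts for 4,500 ns and transport for 17,200 ns.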
Serious capture stamps at the network card (PTP-disciplined, often to nanoseconds) rather than in software, because OS scheduling jitter dwarfs the differences you care about. A software time.time() in a Python handler is worthless for microstructure: the jitter it adds is larger than the latency gaps it would measure. The capture timestamp is exactly the quantity the low-latency stack and colocation exist to shrink.
Why does clock synchronisation (PTP) matter?
To compare events across machines, venues or instruments, every clock must agree to within microseconds; otherwise you cannot tell which of two near-simultaneous events came first. PTP (IEEE 1588) disciplines clocks to a common reference (often GPS) at sub-microsecond accuracy; NTP's millisecond accuracy is far too coarse, and MiFID II RTS 25 makes the floor a legal requirement.
The danger is manufactured causality. If two clocks drift by a millisecond, a "lead-lag" you measure at the millisecond scale is an artefact of the clocks, not the market: at HFT timescales, unsynchronised clocks invent relationships that are not there. PTP (Precision Time Protocol, IEEE 1588) distributes a GPS-disciplined grandmaster clock across a network with hardware timestamping at each hop, achieving sub-microsecond (often tens-of-nanoseconds) synchronisation. NTP is millisecond-class and adequate only for daily-bar work.
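A toy calculation makes the manufactured causality concrete. The function name, the offsets and the event times below are all hypothetical: event A truly precedes event B by 200 microseconds, and the only thing that changes between runs is how far clock A is from true time.

```python
def observed_order(true_a_ns, true_b_ns, offset_a_ns=0, offset_b_ns=0):
    """Return which event appears first once each clock's error is added."""
    stamp_a = true_a_ns + offset_a_ns
    stamp_b = true_b_ns + offset_b_ns
    if stamp_a < stamp_b:
        return "A leads"
    if stamp_b < stamp_a:
        return "B leads"
    return "tie"

TRUE_A_NS = 1_000_000   # A really happens first
TRUE_B_NS = 1_200_000   # B happens 200 microseconds later

synced = observed_order(TRUE_A_NS, TRUE_B_NS)
ntp_class = observed_order(TRUE_A_NS, TRUE_B_NS, offset_a_ns=1_000_000)  # 1 ms error
ptp_class = observed_order(TRUE_A_NS, TRUE_B_NS, offset_a_ns=100)        # 100 ns error
print(synced, ntp_class, ptp_class, sep=" | ")
```

A millisecond of clock error reverses the apparent lead, while a 100-nanosecond error leaves the true order intact.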
There is also a regulatory floor. MiFID II RTS 25 requires HFT participants' clocks synchronised to UTC within 100 microseconds (1 ms for non-HFT), and the US Consolidated Audit Trail (CAT) sets its own granularity rules; both are 2026 minimums, and competitive shops run far tighter. The takeaway for a reader buying or building a tape: the quality of the timestamps is as important as the prices. A feed with software timestamps and an undisciplined clock cannot support microstructure research, however cheap it is.
What are sequence numbers, and why must they be gap-free?
Every message on a feed carries a monotonically increasing sequence number. It lets the receiver detect a dropped or out-of-order packet: if you see sequence 1004 after 1002, you missed 1003 and your book is now wrong. Gap detection plus a recovery mechanism is what makes a reconstructed book trustworthy.
The intuition is that local book state is a running total. The book you maintain is the cumulative sum of every increment; miss one increment and every subsequent state is silently corrupt: a phantom order that never cancels, a level that never clears. The sequence number is the integrity check that catches the miss. Multicast market-data feeds (binary, e.g. ITCH over UDP) number every message per channel, and a gap (a received sequence number larger than the previous one plus one) means packet loss, which is common on UDP multicast and triggers recovery.
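A minimal sketch of the integrity check, assuming a single channel whose sequence numbers increase by exactly one per message. The function name and the treatment of late or duplicate messages (ignored) are choices made for this illustration, not any venue's rule.

```python
def find_gaps(seqs):
    """Return (first_missing, last_missing) ranges for one channel's stream."""
    gaps = []
    expected = None
    for seq in seqs:
        if expected is not None and seq > expected:
            # Everything from `expected` up to seq - 1 never arrived.
            gaps.append((expected, seq - 1))
        if expected is None or seq >= expected:
            expected = seq + 1
        # seq < expected: duplicate or late message, ignored in this sketch.
    return gaps

clean = find_gaps([1001, 1002, 1003])
lossy = find_gaps([1001, 1002, 1004, 1005])
print(clean, lossy)
```

In a live handler the non-empty result would be the trigger for recovery rather than something to print.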
Recovery comes in two forms. A redundant A/B feed publishes two identical multicast streams; you arbitrate, taking whichever message arrives first and filling A's gaps from B. Or a snapshot/refresh channel lets you request a fresh book, then replay increments since. Either way, faithful backtesting depends on it: to simulate fills you must replay the exact message sequence the engine produced, in order, which is only possible if your capture is gap-free and sequence-checked. A tape with silent gaps produces a backtest that diverges from reality in ways you cannot see (backtesting & simulation).
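The A/B arbitration described above can be sketched as follows, assuming both feeds carry identical payloads under identical sequence numbers and that arrivals from the two feeds are already merged in receive order. The function, the tuple layout and the sample stream are hypothetical.

```python
def arbitrate(arrivals, first_seq):
    """Merge two redundant feeds: first arrival of each sequence number wins.

    arrivals: iterable of (feed_name, seq, payload) in receive order.
    Returns (delivered, stuck), where `stuck` lists sequence numbers still
    held back because an earlier one arrived on neither feed.
    """
    next_seq = first_seq
    pending = {}      # seq -> payload, waiting for earlier messages
    delivered = []
    for _feed, seq, payload in arrivals:
        if seq < next_seq or seq in pending:
            continue  # already received from the other feed
        pending[seq] = payload
        while next_seq in pending:
            delivered.append((next_seq, pending.pop(next_seq)))
            next_seq += 1
    return delivered, sorted(pending)

# Feed A drops 1002; feed B arrives slightly later but is complete.
arrivals = [
    ("A", 1001, "add A"), ("B", 1001, "add A"),
    ("A", 1003, "add C"), ("B", 1002, "add B"),
    ("B", 1003, "add C"), ("A", 1004, "trade"),
    ("B", 1004, "trade"),
]
delivered, stuck = arbitrate(arrivals, first_seq=1001)
print([seq for seq, _ in delivered], stuck)
```

If `stuck` is non-empty when the stream is quiet, both feeds lost the same message and the snapshot/refresh path is the remaining option.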
Snapshots vs increments: how do you reconstruct the book?
A feed gives you a periodic snapshot (the full book at an instant) plus a stream of increments (each add, modify or cancel). You build the book by loading the latest snapshot, then applying every increment in sequence. Snapshots bound recovery time; increments carry the live changes. Get the apply order wrong and your book desynchronises.
Think of the snapshot as a save point and the increments as the moves since. You can always rebuild current state from the last save plus the moves, provided you apply them in exact sequence with no gaps. Incremental (delta) feeds are the efficient default: only changes are sent, so bandwidth tracks activity, not book size. Snapshot/refresh is sent periodically or on request, so a late joiner (or a handler that took a gap) can resynchronise without replaying the whole day.
This is the classic crypto bug. Many crypto venues stream a REST/WS snapshot plus a WebSocket diff channel, and you must buffer the diffs, fetch the snapshot, then apply only diffs with sequence greater than the snapshot's. Skipping that handshake (applying diffs that predate the snapshot, or vice versa) silently corrupts the book, the single most common mistake when recording your own venue. The professional template is ITCH/OUCH-style: Nasdaq's ITCH is a pure incremental MBO feed (add / execute / cancel / delete order messages); OUCH is the matching order-entry protocol. The model (full add/modify/cancel increments with a recovery snapshot) is mirrored by CME MDP 3.0, Eurex EOBI and others. See IX-FIX for decoding the actual message bytes, and consult the venue's feed spec for the exact message set (as of 2026).
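The handshake can be sketched like this, under a deliberately simplified and hypothetical schema: the snapshot carries one sequence number, each diff carries the next consecutive number and an absolute size per price level, and size 0 removes the level. Real venues differ in how they number and batch diffs, so treat the field names as placeholders.

```python
def sync_book(snapshot, buffered_diffs):
    """Apply buffered diffs on top of a snapshot, skipping stale ones.

    snapshot: {"seq": int, "levels": {(side, price): size}}
    buffered_diffs: list of {"seq", "side", "price", "size"} dicts.
    """
    book = dict(snapshot["levels"])
    last = snapshot["seq"]
    for diff in sorted(buffered_diffs, key=lambda d: d["seq"]):
        if diff["seq"] <= last:
            continue  # already reflected in the snapshot
        if diff["seq"] != last + 1:
            raise RuntimeError(f"gap: expected {last + 1}, got {diff['seq']}")
        key = (diff["side"], diff["price"])
        if diff["size"] == 0:
            book.pop(key, None)     # level cleared
        else:
            book[key] = diff["size"]
        last = diff["seq"]
    return book, last

snapshot = {"seq": 10, "levels": {("bid", 5000): 300, ("ask", 5001): 100}}
buffered = [
    {"seq": 9,  "side": "bid", "price": 5000, "size": 999},  # predates snapshot
    {"seq": 10, "side": "bid", "price": 5000, "size": 300},  # predates snapshot
    {"seq": 11, "side": "ask", "price": 5001, "size": 0},
    {"seq": 12, "side": "bid", "price": 4999, "size": 400},
]
book, last_seq = sync_book(snapshot, buffered)
print(book, last_seq)
```

Applying the stale diff at sequence 9 would have overwritten the 50.00 bid with a size the snapshot had already superseded, which is the corruption the handshake exists to prevent.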
Worked example
A fragment of a synthetic MBO tape for instrument XYZ, as of 2026: one channel, timestamps in this venue's nanoseconds-since-midnight, monotonic. Sequence 1001 adds 300 lots to the bid at 50.00 (order A, the best bid); 1002 adds 100 to the ask at 50.01 (order B, the best ask, so the spread is one tick); 1003 adds 400 to the bid at 49.99 (order C, level-2 bid). Then a marketable buy arrives: 1004 is a TRADE of 100 at 50.01, and 1005 is a DELETE of order B at 50.01. Read what that tape tells you.
The marketable buy crossing the spread produced two messages: the trade print (1004) and the book update removing the resting ask (1005). They share a timestamp to the nanosecond because they are one engine event, so your handler must apply both atomically; treat them as separate and you will momentarily show a trade against a still-resting offer. The trade at 1004 carries no aggressor flag; you would infer it was buyer-initiated because it printed at the ask, the quote rule; see trade-sign inference.
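The fragment can be replayed in a few lines to check the reading above. This is a sketch under assumptions: the tuple layout and function names are hypothetical, prices are integer cents, timestamps are omitted, and the quote rule is applied against the book as it stood just before the trade message.

```python
tape = [
    (1001, "ADD",    {"id": "A", "side": "bid", "px": 5000, "qty": 300}),
    (1002, "ADD",    {"id": "B", "side": "ask", "px": 5001, "qty": 100}),
    (1003, "ADD",    {"id": "C", "side": "bid", "px": 4999, "qty": 400}),
    (1004, "TRADE",  {"px": 5001, "qty": 100}),
    (1005, "DELETE", {"id": "B"}),
]

def best(book):
    """Return (best_bid, best_ask) in cents, or None for an empty side."""
    bids = [o["px"] for o in book.values() if o["side"] == "bid"]
    asks = [o["px"] for o in book.values() if o["side"] == "ask"]
    return (max(bids) if bids else None, min(asks) if asks else None)

def replay(tape):
    """Apply each message in sequence; sign trades with the quote rule."""
    book, signs = {}, {}
    for seq, kind, msg in tape:
        if kind == "ADD":
            book[msg["id"]] = {"side": msg["side"], "px": msg["px"], "qty": msg["qty"]}
        elif kind == "DELETE":
            book.pop(msg["id"], None)
        elif kind == "TRADE":
            bid, ask = best(book)  # quote prevailing before the print
            if ask is not None and msg["px"] >= ask:
                signs[seq] = "buy"
            elif bid is not None and msg["px"] <= bid:
                signs[seq] = "sell"
            else:
                signs[seq] = "unknown"
    return book, signs

book, signs = replay(tape)
print(sorted(book), best(book), signs)
```

The replay leaves orders A and C resting, an empty ask side, and the print at 1004 signed as a buy. A production handler would additionally group messages that share an engine timestamp before publishing the book, which this sketch does not attempt.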
Now drop a packet. Suppose your capture lost seq 1005, the delete of order B, and resumed at the next message on the channel, seq 1006. You would carry a phantom resting ask B that never cancels: your book shows liquidity that no longer exists, and every later state inherits the error. Gap detection on the missing 1005 (1006 arrives where 1005 was expected) is precisely what saves you: it forces a recovery instead of a silent corruption. Had you lost the trade at 1004 instead, the book would survive, because 1005 still removes B, but your trade record would be short one print; the same check catches it. The numbers here are illustrative and synthetic; real timestamp granularity, message sets and condition codes are venue-specific, so always read the venue's feed specification (as of 2026). The structure (events, three timestamps, sequence numbers, snapshot plus increments) is invariant; the field widths and codes are what you must look up.