High-frequency data

Data volume & engineering

∞ structural

Reviewed 4 June 2026. As of 2026: a permanent feature of the market, not an edge that decays.

Terabytes per day per market. High-frequency data is enormous, irregularly timed, and unforgiving of naive storage: the engineering is half the battle.

The idea

[Figure DG-FIREHOSE: One instrument to one venue, raw versus stored (annotated diagram)]

What this shows. A single instrument is tens of megabytes of tape a day and a whole venue is hundreds of gigabytes raw, but columnar compression of roughly 8× collapses that to tens of gigabytes stored a day, about ten terabytes a year. So disk is trivially cheap. The real bill is capturing the feed losslessly and paying the venue's market-data licence, which together dwarf storage by orders of magnitude.

How much data does high-frequency trading actually generate?

Reframe high-frequency data from a statistics topic to an engineering one, and the first fact is scale. A daily bar is one row per instrument per day; a tape is one row per event, and on an active instrument events arrive at irregular times, in bursts of thousands per second. The row-count ratio between a tape and a daily-bar series is six to nine orders of magnitude. As 2026 orders of magnitude (reverify against the venue's own published statistics): a top-of-book quote stream for one liquid equity comfortably runs to seven-figure message counts a day; full-depth market-by-order (MBO) data multiplies that; and consolidated US options data (OPRA) is the canonical firehose, quoted in tens of millions of messages per second at peak and double-digit terabytes a day uncompressed. Crypto adds round-the-clock trading with no overnight lull, spread across dozens of venues.

It grew, and keeps growing, for structural reasons: finer tick sizes, more order types, more venues (fragmentation), nanosecond timestamps, and the fact that quote traffic rises super-linearly with participation, because every market maker re-quotes on every input. The volume is not incidental; it is the engineering characteristic of this data. Treat it as a systems problem first and a statistics problem second.

The tape-to-bar row-count ratio is six to nine orders of magnitude: a daily series has one row per instrument-day, a tape has one row per event, and on an active name events arrive in bursts of thousands per second.

\frac{\text{rows}_{\text{tape}}}{\text{rows}_{\text{daily}}} \;\sim\; 10^{6}\text{–}10^{9} \quad\text{per instrument}

Why can't you just use CSV and Pandas?

Most tick queries are "give me the trade prices for this instrument between 14:00 and 14:05". Row storage forces you to read every field of every row in that range; columnar storage lets you read only the price column for only that time slice. The data layout decides whether that query is milliseconds or minutes. CSV is the wrong tool on every axis: it encodes numbers as text (a float costs about 8 bytes binary, 15-plus as text), with no typing, no compression, no indexing and no predicate pushdown. Fine for a few thousand rows; catastrophic for a billion.
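The text-versus-binary point can be checked with a minimal sketch using only the standard library. The price list is synthetic and hypothetical (full-precision doubles, such as computed mid-prices); short round values like 50.0 cost fewer characters as text, so the gap depends on the data.

```python
import struct

def text_size(values):
    """Bytes needed to store the values as comma-separated decimal text."""
    return len(",".join(repr(v) for v in values).encode("ascii"))

def binary_size(values):
    """Bytes needed to store the same values as packed little-endian float64."""
    return len(struct.pack(f"<{len(values)}d", *values))

# Hypothetical, synthetic prices; most carry full double precision.
prices = [50.0 + i / 3.0 for i in range(1000)]

print("binary bytes:", binary_size(prices))  # exactly 8 bytes per value
print("text bytes:  ", text_size(prices))    # digits plus separators
```

The binary size is fixed at 8 bytes per value whatever the value, which is also what lets a reader seek directly to the n-th row instead of parsing everything before it.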

Eager in-memory dataframes struggle for a related reason: loading a day of MBO into Pandas materialises every column in RAM, so you run out of memory before you run out of patience. You either chunk laboriously, or you move to a columnar engine that streams. This is not a preference; it is the difference between a feasible and an infeasible analysis. The format choice is the engineering.

A columnar query reads only the columns and row-ranges it touches; a row store reads everything in range. For a query touching a few (k) of C columns over a fraction (n) of N rows, the bytes-read ratio is the saving.

\frac{\text{bytes}_{\text{columnar}}}{\text{bytes}_{\text{row}}} \;\approx\; \frac{k}{C}\cdot\frac{n}{N}, \qquad k \ll C,\; n \ll N
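As a rough illustration of the ratio above, the sketch below evaluates it for one hypothetical query. It assumes equal-width columns, events spread evenly over the session, and a row store that scans all N rows; none of these hold exactly in practice, so treat the result as an order of magnitude only.

```python
def bytes_read_ratio(k, C, n, N):
    """Approximate columnar/row bytes-read ratio: (k / C) * (n / N).

    Assumptions of this sketch: all columns have equal width, and the row
    store scans all N rows while the columnar store reads only n of them.
    """
    if not (0 < k <= C and 0 < n <= N):
        raise ValueError("need 0 < k <= C and 0 < n <= N")
    return (k / C) * (n / N)

# Hypothetical query: 1 price column out of 10, and a 5-minute slice of a
# 390-minute session (so n / N = 5 / 390 if events were evenly spread).
ratio = bytes_read_ratio(k=1, C=10, n=5, N=390)
print(f"columnar reads about {ratio:.4%} of the bytes of a full row scan")
```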

What storage formats and engines solve this?

Tick data lives in columnar stores, and 2026 offers a tiered menu. Parquet + Arrow + (DuckDB / ClickHouse / Polars) is the open default: columnar files in cloud object storage, queried by an embedded or server-based columnar engine. It is cheap, portable, free of licence fees and scales horizontally, and it is the single biggest reason the storage barrier has fallen. kdb+/q is the incumbent for serious tick work: an in-memory-first columnar time-series database with a terse vector language, exceptional at as-of joins (aligning a trade to the prevailing quote, exactly what trade-sign inference needs) and at real-time streaming analytics. It has historically been expensive and is still the benchmark for latency-sensitive tick analytics.

Arctic / ArcticDB is a Python-native, versioned time-series store (originally Man AHL's, MongoDB-backed; ArcticDB is the modern columnar rewrite). It is popular where the research stack is Python and you want fast, versioned, dataframe-shaped tick storage without q. Raw pcap sits at the capture edge: the network packets are stored losslessly so that you can re-decode the feed exactly later. It is the ground truth, at the cost of size, before you normalise into columns. The decision is a cost / speed / operability triangle: object-store Parquet for cheap reach, kdb+ for low-latency analytics, Arctic for Python-native versioning, pcap for forensic fidelity. Most serious shops use more than one tier.

The format choice trades cost against query latency against operability against fidelity: Parquet on object storage for cheap reach, kdb+ for low-latency analytics, ArcticDB for Python versioning, pcap for lossless ground truth.

\text{pcap (fidelity)} \;\to\; \text{Parquet (cost)} \;\to\; \text{kdb+ (latency)}
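The as-of join mentioned above (aligning each trade to the prevailing quote) can be sketched in plain Python with a binary search. This is an illustrative stand-in, not how kdb+ or any particular engine implements it; the tuple layouts and integer timestamps are hypothetical.

```python
from bisect import bisect_right

def asof_join(trades, quotes):
    """Attach to each trade the most recent quote at or before its timestamp.

    trades: list of (ts, price), sorted by ts
    quotes: list of (ts, bid, ask), sorted by ts
    Returns a list of (ts, price, bid, ask); bid and ask are None when no
    quote has arrived yet.
    """
    quote_ts = [q[0] for q in quotes]
    out = []
    for ts, price in trades:
        i = bisect_right(quote_ts, ts) - 1  # index of last quote with ts <= trade ts
        bid, ask = (quotes[i][1], quotes[i][2]) if i >= 0 else (None, None)
        out.append((ts, price, bid, ask))
    return out

# Hypothetical data; timestamps are arbitrary integer ticks of time.
quotes = [(100, 49.99, 50.01), (105, 50.00, 50.02), (120, 50.01, 50.03)]
trades = [(99, 50.00), (105, 50.02), (110, 50.00), (125, 50.03)]
joined = asof_join(trades, quotes)
for row in joined:
    print(row)
```

Whether a quote with the same timestamp as the trade counts as "prevailing" is a design choice; this sketch includes it, and a stricter version would use `bisect_left` instead.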

How does compression work on tick data?

Tick data compresses extremely well, often 5–20×, because it is repetitive and slowly varying. A column of prices reads "50.00, 50.00, 50.01, 50.01, 50.01, 50.00": store the first value and the deltas, and most of the column becomes near-zero, which a general compressor crushes. Columnar layout is what makes this possible, because like values sit adjacent. The encodings that matter are delta-of-delta on timestamps (monotonic, near-constant spacing within a burst), dictionary encoding on side / condition-code / instrument columns (a handful of distinct values mapped to small integers), run-length encoding on repeated prices and sizes, and bit-packing of small integers. A byte-level codec then goes on top: Zstandard for ratio or LZ4 for decode speed.
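The encodings named above can be sketched in a few lines each. These are toy list-based versions for illustration, not the on-disk layouts of Parquet or any other format; prices are held as integer ticks (5000 standing for 50.00), bit-packing is omitted, and the sample columns are hypothetical.

```python
from itertools import groupby

def delta(xs):
    """First value followed by successive differences."""
    return xs[:1] + [b - a for a, b in zip(xs, xs[1:])]

def undelta(ds):
    """Inverse of delta(): running sum restores the original values."""
    out, acc = [], 0
    for d in ds:
        acc += d
        out.append(acc)
    return out

def delta_of_delta(ts):
    """(first value, first delta, second differences); needs len(ts) >= 2."""
    d = [b - a for a, b in zip(ts, ts[1:])]
    return ts[0], d[0], [b - a for a, b in zip(d, d[1:])]

def dict_encode(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), encoded

def rle(values):
    """Run-length encode as (value, run length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

price_ticks = [5000, 5000, 5001, 5001, 5001, 5000]   # 50.00, 50.00, 50.01, ...
timestamps = [1000, 1010, 1020, 1030, 1041]          # near-constant spacing
sides = ["B", "S", "B", "B", "S"]

print(delta(price_ticks))
print(delta_of_delta(timestamps))
print(dict_encode(sides))
print(rle(price_ticks))
```

Each transform is lossless and cheap to invert, which is why they are applied before the general-purpose codec rather than instead of it.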

The trade-off is CPU against bytes: higher compression saves storage and network but costs CPU on read, so you maximise ratio for archival and favour decode speed on hot query paths. The practical upshot is large: that "terabytes a day" raw figure is often a few hundred gigabytes a day stored, which is exactly what makes the cloud-object-store economics work. Compression is why the barrier fell.

Delta encoding turns a slowly-varying column into a stream of near-zeros, dictionary encoding maps a few distinct categories to small integers, and a general codec crushes the result, compounding to roughly 5–20× on real tick data.

\text{prices} \;\xrightarrow{\;\Delta\;}\; \{0,0,+1,0,0,-1,\dots\} \;\xrightarrow{\;\text{Zstd}\;}\; \tfrac{1}{5}\text{–}\tfrac{1}{20}\ \text{size}
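The delta-then-codec pipeline can be tried end to end with a small sketch. It uses `zlib` from the standard library as a stand-in for Zstandard (a different codec, swapped in here only to avoid a third-party dependency), and a synthetic random-walk price path, so the ratio it prints says nothing about any real feed.

```python
import random
import zlib
from array import array

random.seed(7)  # fixed seed so the synthetic walk is reproducible

# Synthetic price path in integer ticks: mostly unchanged, sometimes +/-1.
ticks = [5000]
for _ in range(100_000):
    ticks.append(ticks[-1] + random.choice([-1, 0, 0, 0, 1]))

raw = array("q", ticks).tobytes()                    # 8-byte signed integers
deltas = [b - a for a, b in zip(ticks, ticks[1:])]   # every delta fits in one byte
delta_bytes = array("q", ticks[:1]).tobytes() + array("b", deltas).tobytes()

raw_z = zlib.compress(raw, 9)
delta_z = zlib.compress(delta_bytes, 9)

def decode(blob):
    """Invert the delta + zlib encoding back to the list of ticks."""
    data = zlib.decompress(blob)
    first = array("q")
    first.frombytes(data[:8])
    steps = array("b")
    steps.frombytes(data[8:])
    out = [first[0]]
    for step in steps:
        out.append(out[-1] + step)
    return out

print("raw bytes:          ", len(raw))
print("zlib on raw:        ", len(raw_z))
print("zlib on delta bytes:", len(delta_z))
```

Storing deltas in one byte works here only because the synthetic steps are bounded by construction; a real encoder needs an escape or a wider type for large jumps.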

What does capturing the feed correctly actually take?

Storage is cheap and getting cheaper; not losing a message at three million messages a second is the hard part. Capture means receiving every message losslessly and in order, timestamping it at the network card, detecting sequence gaps and recovering them, and persisting it without dropping anything under load. A capture that silently drops packets produces a tape with gaps, visible as breaks in the sequence numbers, that corrupt every reconstructed book downstream. Concretely it requires kernel-bypass NICs (so that scheduling delays and socket-buffer overflow in the kernel path do not drop multicast packets under burst), hardware timestamping with PTP clock sync, A/B feed arbitration for gap-free recovery, and enough write throughput and buffering that a storage stall does not back-pressure into packet loss. This is the low-latency stack, colocation and FPGA included, viewed from the data side.
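Gap detection and A/B arbitration can be sketched offline in a few lines. This is a simplified batch version under stated assumptions: messages are hypothetical `(seq, payload)` tuples, sequence numbers increase by one, and both feeds are fully in memory. A production arbiter works on a live stream with bounded buffers and a retransmission request path, none of which is shown.

```python
def arbitrate(feed_a, feed_b):
    """Merge two redundant feeds by sequence number; the first copy seen wins.

    feed_a, feed_b: iterables of (seq, payload), each possibly missing messages.
    Returns (merged, gaps): merged is the de-duplicated list ordered by seq,
    gaps is a list of (first_missing, last_missing) ranges that neither feed
    delivered. Only gaps between received messages are detectable here.
    """
    by_seq = {}
    for seq, payload in list(feed_a) + list(feed_b):
        by_seq.setdefault(seq, payload)
    merged = sorted(by_seq.items())
    gaps = []
    for (prev, _), (cur, _) in zip(merged, merged[1:]):
        if cur != prev + 1:
            gaps.append((prev + 1, cur - 1))
    return merged, gaps

# Hypothetical capture: A lost 3, 6 and 7; B lost 2, 5, 6 and 7.
feed_a = [(1, "a"), (2, "b"), (4, "d"), (5, "e"), (8, "h")]
feed_b = [(1, "a"), (3, "c"), (4, "d"), (8, "h")]
merged, gaps = arbitrate(feed_a, feed_b)
print([seq for seq, _ in merged])
print(gaps)
```

Messages 2, 3 and 5 are recovered because one side carried them; 6 and 7 were lost on both sides, so they surface as a gap that has to be recovered some other way before the book can be trusted.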

This is also where the moat now sits. Open formats commoditised storage and query; they did nothing for lossless capture at the edge, which still requires colocated hardware and careful engineering. The clean tape's value is increasingly in the capture, not the disk.

Why is data volume both a barrier and a business?

The volume is a barrier because clean, complete, correctly captured tick data is genuinely hard and expensive to obtain, store and query, and most aspiring quants never get it. That same difficulty is precisely why a clean dataset is something people pay for. If clean L2/L3 data were trivial to produce, it would be free and there would be no dataset product; it is the volume-driven difficulty (capture, storage, reconstruction, correctness) that makes a clean, reconstructable order-book dataset worth selling (see datasets). Meanwhile the feeds extract rent: exchanges charge substantial market-data fees, which are a recognised cost line and the subject of a live regulatory debate. Understanding the volume problem is therefore understanding why the data layer is defensible and why it dominates your infra budget.

What has shifted by 2026 is which half is hard. The storage economics collapsed (object store plus Parquet plus DuckDB); the capture and licensing economics did not. So the edge moved from "can you afford kdb+" to "can you capture losslessly and clean it correctly", which is more about engineering discipline than budget. That is good news for the solo shop (see going independent in 2026).

Worked example

A back-of-envelope storage sizing for one instrument-day. The figures are illustrative as of 2026; reverify message rates against the venue's published statistics. Take a liquid equity with full-depth MBO and assume about 2,000,000 messages on an active day. A normalised message is roughly 40 bytes (timestamp 8, sequence 8, type 1, side 1, price 8, size 8, ids and flags about 6), so the raw size is $2{,}000{,}000 \times 40\ \text{B} = 80\ \text{MB}$ per instrument-day uncompressed. Scale to a venue of 3,000 active instruments and that is about 240 GB a day raw.

Per instrument-day: 2M messages × ~40 bytes ≈ 80 MB raw. A 3,000-name venue ≈ 240 GB/day raw, which columnar compression at ~8× collapses to ~30 GB/day stored, about 10 TB/year for that venue's full depth.

\underbrace{240\ \text{GB/day}}_{\text{raw}} \;\xrightarrow{\;\div\,8\,(\text{Zstd})\;}\; \underbrace{30\ \text{GB/day}}_{\text{stored}} \;\approx\; 10\ \text{TB/yr}
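The sizing arithmetic above can be reproduced with a short script, which makes it easy to substitute a venue's real figures. The field widths, message count, instrument count and 8× ratio are the illustrative values from the worked example; the choice of 365 calendar days and decimal units (1 GB = 10^9 bytes) is an assumption of this sketch.

```python
# Illustrative normalised message layout from the worked example (bytes).
FIELD_BYTES = {"timestamp": 8, "sequence": 8, "type": 1, "side": 1,
               "price": 8, "size": 8, "ids_and_flags": 6}

def size_venue(messages_per_instrument_day, instruments, compression_ratio,
               days_per_year=365):
    """Back-of-envelope raw and stored sizes in decimal units."""
    msg_bytes = sum(FIELD_BYTES.values())
    raw_instrument = messages_per_instrument_day * msg_bytes
    raw_venue = raw_instrument * instruments
    stored_venue = raw_venue / compression_ratio
    return {
        "message_bytes": msg_bytes,
        "raw_instrument_mb": raw_instrument / 1e6,
        "raw_venue_gb": raw_venue / 1e9,
        "stored_venue_gb": stored_venue / 1e9,
        "stored_year_tb": stored_venue * days_per_year / 1e12,
    }

sizing = size_venue(2_000_000, 3_000, 8)
for name, value in sizing.items():
    print(f"{name}: {value:,.2f}")
```

Passing `days_per_year=252` (a typical count of equity trading days) gives a lower yearly figure, still of the same order of magnitude as the rounded ten terabytes.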

Now the lesson. Ten terabytes a year on commodity object storage costs on the order of tens to a few hundred dollars a month, depending on storage tier. Trivial. The cost is not the disk; it is the capture infrastructure (colocated hardware, redundant feeds) and the market-data licence fees, which dwarf storage by orders of magnitude. The headline "terabytes a day" is real for raw data, but compression and cheap object storage make keeping it easy. The money and the difficulty are in capturing it losslessly and in paying the venue for it. These figures are illustrative and synthetic; real message rates, message widths and fees are venue-specific, so check the venue's published statistics and market-data fee schedule as of 2026. A clean, compressed, reconstructable tick/LOB dataset is precisely what this volume problem makes worth buying rather than building (see datasets and tools).

Where this fits

↑ Up · building block of High-frequency data
↔ Across · composes with The low-latency stack
↔ Across · composes with Messaging protocols
→ Apply · makes money in Crypto market making
→ Apply · makes money in Equities & futures
⤓ Build / Buy · tool needed Datasets & tools

Common questions

Why is high-frequency data hard to work with?

Because it breaks the assumptions slower data lets you make. It is enormous (terabytes per day per market), irregularly timed (events, not clock ticks), heavy-tailed and non-normal, and contaminated by microstructure noise such as bid-ask bounce. Timestamps need care, and naive resampling destroys exactly the signal you want. It demands event-driven, point-process thinking rather than fixed-interval statistics.
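A small sketch of what naive fixed-interval resampling does to irregularly timed events: it keeps only the last event in each clock bin. The event list is hypothetical (a burst of four updates within a microsecond, then a single update 2.5 seconds later), and last-tick sampling is just one common resampling rule chosen here for illustration.

```python
def last_tick_resample(events, bin_ns):
    """Keep only the last (ts, price) in each fixed-width clock bin.

    events: list of (ts_ns, price), sorted by ts_ns. Empty bins are omitted.
    """
    last = {}
    for ts, price in events:
        last[ts // bin_ns] = (ts, price)  # later events overwrite earlier ones
    return [last[b] for b in sorted(last)]

# Hypothetical events with nanosecond timestamps.
events = [(0, 50.00), (200, 50.01), (400, 50.00), (600, 50.01),
          (2_500_000_000, 50.02)]

bars = last_tick_resample(events, bin_ns=1_000_000_000)  # one-second bins
print(bars)
print(f"kept {len(bars)} of {len(events)} events")
```

The whole burst collapses to a single point and the second in between has no observation at all, which is the sense in which clock-time bins discard the event-level structure.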