Backtesting & simulation

∞ structural

Reviewed 4 June 2026. As of 2026: a permanent feature of the market, not an edge that decays.

Toggle look-ahead, survivorship and optimistic fills and watch a fake edge appear. The hardest part of research is not finding an edge. It is not lying to yourself about one.

[Interactive explorer "Conjure a fake edge" (IX-BACKTEST). Toggles: look-ahead bias, survivorship bias, optimistic fills. With every toggle off, the backtest verdict reads "flat / honest" and reality reads "no edge here".]

What to notice. Turn on any of the three classic backtest sins and a flat strategy sprouts a beautiful upward equity curve. None of it survives live.

What is a backtest, and why does it lie so easily?

A backtest replays a strategy over historical data to estimate how it would have performed. It lies easily because every shortcut in the simulation flatters the result in the same direction (toward a higher, smoother return) and because you can keep adjusting the test until it looks good. The honest discipline is to make the simulation pessimistic and to judge it on data you never touched.

Start with the intuition. A backtest is a story you tell about the past, and you control every detail of the telling. Each convenient assumption ("I'd have been filled", "costs are negligible", "I'd have known by then") nudges the story toward profit. Stack a few of them and you have a beautiful equity curve that describes nothing real.

The property that makes this dangerous is an asymmetry: nearly all the common errors bias the result upward. They do not behave like symmetric noise that just as often makes a real strategy look worse than it is; the mistakes are systematically optimistic. So a backtest's default state is "too good", and your job is to fight it back toward reality.

The deepest trap is not any single sin but the process: you tweak, re-run, tweak again. Every adjustment that improves the backtest is a fit to that specific history, and by the time it looks great you have, without noticing, overfit it to the past. This page is the cautionary spine for the whole strategies section: every signal there must survive this page's scepticism before it means anything. In the explorer above, the toy strategy's true edge is exactly zero by construction, so every gain you can produce is a measured artefact of one of the sins below.

What is look-ahead bias?

Look-ahead bias is using information in a decision that you would not actually have had at that moment. It is the single most common and most lethal backtest error. Using a bar's close to decide a trade within that bar, applying a restated figure as of its event date, or using a statistic computed over the whole sample all let the strategy "see the future". It produces spectacular, entirely fictional results.

The classic forms are worth naming so you recognise them in your own code: deciding today's trade using today's close (you do not know the close until the day is over); using a corporate figure as of its event date when it was actually published days later; and computing a normalisation (a mean, a standard deviation, a model fit) over the entire dataset and then using it at every point in time, so the in-sample statistic quietly leaks the future into the past.

Why is it so seductive? A tiny leak produces an enormous, smooth edge, because the strategy is effectively being told the answer. In the explorer above, look-ahead alone lifts the toy strategy's Sharpe from about zero to about three: pure fiction.

The defence is point-in-time, causal evaluation: at every simulated instant $t$, the strategy may use only information that was actually observable by then.

$$\text{decision}_t = f\big(\mathcal{F}_t\big), \qquad \mathcal{F}_t = \{\text{information observable up to and including } t\}$$

Use point-in-time datasets, lag every value to its real availability, and compute statistics on a rolling or expanding window that never peeks ahead. A faithful market-replay simulation enforces this structurally, because it feeds events in time order and nothing later exists yet.
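As a minimal sketch of the difference, assume a hypothetical price series and two hypothetical helpers: `full_sample_zscore` normalises with whole-sample statistics (the leak described above), while `expanding_zscore` uses only the values in $\mathcal{F}_t$. Changing a single future observation alters the leaky signal in the past but leaves the causal one untouched.

```python
from statistics import mean, pstdev


def full_sample_zscore(values):
    """Leaky: normalises every point with the mean/std of the WHOLE series."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def expanding_zscore(values, min_history=2):
    """Causal: the z-score at t uses only values[0..t], i.e. the set F_t."""
    out = []
    for t in range(len(values)):
        window = values[: t + 1]
        sigma = pstdev(window) if len(window) >= min_history else 0.0
        # Too little history (or a flat window) means no usable signal yet.
        out.append((values[t] - mean(window)) / sigma if sigma > 0 else None)
    return out


prices = [100.0, 101.0, 99.5, 102.0, 103.5, 98.0]  # illustrative numbers only
shocked = prices[:-1] + [150.0]  # change only the final (future) observation

# Compare the signals at every EARLIER point in time.
causal_unchanged = expanding_zscore(prices)[:-1] == expanding_zscore(shocked)[:-1]
leaky_unchanged = full_sample_zscore(prices)[:-1] == full_sample_zscore(shocked)[:-1]
print(causal_unchanged, leaky_unchanged)  # True False
```

The same test (perturb the future, check that the past signal does not move) is a cheap way to audit any feature for look-ahead.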

What is survivorship bias?

Survivorship bias is backtesting only on the instruments that survived to today (the universe of currently listed names), silently excluding everything that was delisted, went bankrupt or was acquired. Because the failures are removed, every strategy looks safer and more profitable than it really was, especially anything that buys distressed or falling names.

The intuition: study only the companies that exist today and you have pre-selected for survival. A strategy that "buys the dip" looks brilliant if every name that dipped to zero has been quietly deleted from your dataset: you never see the ones the dip killed.

It bites hardest on mean-reversion and value strategies that lean toward beaten-down names, and on any equity universe assembled from a current index membership applied to the past: the index's past members who were later dropped are simply gone. It also lurks in crypto (dead tokens, defunct venues) and is trivially easy to introduce by accident.

The defence is a point-in-time universe: the set of instruments that actually existed and were tradeable at each date, including the ones that later died, with their full price-to-zero history. Survivor-free databases exist for equities; for crypto you must deliberately retain delisted assets. In the explorer above, drop the synthetic "dead" symbols and the curve improves; restore them and the losses reappear.
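A point-in-time universe can be sketched as a date filter over listing records. The symbols, dates and the `LISTINGS` layout below are hypothetical, chosen only to show how the survivor shortcut differs from the set that was really tradeable on a past date.

```python
from datetime import date

# Hypothetical records: symbol -> (first tradeable day, last tradeable day or None).
LISTINGS = {
    "ALIVE": (date(2015, 1, 2), None),
    "DEAD": (date(2015, 1, 2), date(2019, 6, 28)),  # later delisted
    "LATE": (date(2021, 3, 1), None),               # did not exist yet in 2018
}


def universe_on(day, listings=LISTINGS):
    """Symbols that existed and were tradeable on `day`, including later failures."""
    return sorted(
        sym
        for sym, (start, end) in listings.items()
        if start <= day and (end is None or day <= end)
    )


def survivor_universe(listings=LISTINGS):
    """The biased shortcut: today's listed names applied to every past date."""
    return sorted(sym for sym, (_, end) in listings.items() if end is None)


print(universe_on(date(2018, 1, 2)))  # ['ALIVE', 'DEAD']
print(survivor_universe())            # ['ALIVE', 'LATE']
```

The survivor list is wrong in both directions for 2018: it drops the name that later died and includes one that did not yet exist.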

What are unrealistic fills, and why is queue position the heart of it?

Unrealistic fills assume you traded at prices or sizes you would not actually have got: filling at the mid, always getting the touch, zero slippage, no queue. For passive (limit-order) strategies this is fatal: whether your resting order fills at all depends on your position in the queue, and assuming it always does invents the entire edge.

The optimistic-fill family is large: filling a market order at the mid instead of paying the spread; ignoring that a large order walks the book; assuming a limit order at the touch always executes. Each ignores a real cost, and the spread you ignored is often the whole strategy: a market-making backtest that fills at mid is just collecting an imaginary spread.

Queue position is the crux for passive strategies. A limit order joins the back of the queue at its price level, and it only fills if enough volume trades through ahead of it first (price-time priority). Your fill rate (and crucially which fills you get) depends on where you sat in that queue. Assume-always-filled hands you the good fills and spares you the bad ones, exactly inverting adverse selection: in reality you tend to get filled precisely when the market is about to move against you.

Your order fills only once the cumulative volume traded at your price level exceeds the quantity queued ahead of you. A bar backtest cannot represent this quantity at all.

$$\text{fill} \iff \sum (\text{volume traded through your level}) \;>\; Q_{\text{ahead}}$$

Queue/fill modelling is therefore the hardest and most important part of a microstructure backtest: you must model arrival, your queue position, cancellations ahead of you and the probability of execution, ideally from L3 / market-by-order data that records the order-by-order book. The defence is conservative fill assumptions (you do not get filled unless the simulation shows the queue cleared past you), paying the spread and the fees, modelling partial fills, and preferring a market-replay backtest that reconstructs the actual book over a bar-based one that cannot represent the queue.
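The fill condition above can be sketched as a small state machine. This is an illustration under stated assumptions, not a full queue model: the `RestingOrder` class is hypothetical, it handles a single price level, and it deliberately ignores cancellations ahead of you (the pessimistic choice, as if every cancel came from behind).

```python
from dataclasses import dataclass


@dataclass
class RestingOrder:
    qty: float           # size of our limit order
    queue_ahead: float   # Q_ahead: quantity resting before us when we joined
    traded: float = 0.0  # cumulative volume traded at our level since we joined
    filled: float = 0.0  # how much of our order has filled so far

    def on_trade(self, volume):
        """Apply a trade at our price level; return the quantity we fill now."""
        self.traded += volume
        # Traded volume must first consume everything queued ahead of us.
        available = max(0.0, self.traded - self.queue_ahead)
        new_fill = min(self.qty, available) - self.filled
        self.filled += new_fill
        return new_fill


order = RestingOrder(qty=5.0, queue_ahead=10.0)
fills = [order.on_trade(v) for v in (4.0, 6.0, 3.0, 10.0)]
print(fills)  # [0.0, 0.0, 3.0, 2.0]
```

The first two trades only clear the queue ahead; the order then fills partially and completes on the last trade. An assume-always-filled backtest would have booked all five units at the first print.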

Why does latency belong in the backtest?

Because in reality there is a delay between seeing a signal and your order reaching the matching engine, and the world moves in that gap. A backtest that acts instantly on every signal captures opportunities that, by the time your order actually arrived, were already gone, picked off by faster players. Modelling your real latency is part of an honest fill model.

The intuition: you see a stale quote and "trade" against it in the backtest at time $T$. But your order does not reach the venue until $T + \ell$, where $\ell$ is your latency, by which point the quote you targeted is likely already updated or taken. The backtest filled you; reality would not have.

This is the backtest face of the whole latency and colocation story. An instantaneous backtest implicitly assumes you are infinitely fast, which silently grants you every latency-arbitrage opportunity. Bake in your actual tick-to-trade and most speed-dependent "edges" disappear for anyone not in the speed tier.

The defence is to delay every decision by a realistic latency before it can act, and to fill only against the book as it was when your order would truly have arrived, again something a faithful market-replay simulation does naturally.
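A minimal sketch of that delay, assuming a hypothetical time-ordered list of best-ask updates and a marketable buy order: the decision is made at `signal_ts`, but the price paid is the ask in force when the order arrives at `signal_ts + latency_us`. The quote values and helper names are illustrative only.

```python
from bisect import bisect_right

# Hypothetical best-ask updates as (timestamp in microseconds, ask price), time-ordered.
ASK_UPDATES = [(0, 100.00), (150, 100.05), (900, 100.02)]
_TIMES = [ts for ts, _ in ASK_UPDATES]


def ask_at(ts):
    """Latest ask observable at time ts (never a later update)."""
    i = bisect_right(_TIMES, ts) - 1
    if i < 0:
        raise ValueError("no quote observable yet")
    return ASK_UPDATES[i][1]


def buy_price(signal_ts, latency_us):
    """Price paid by a marketable buy decided at signal_ts, arriving after the latency."""
    return ask_at(signal_ts + latency_us)


print(buy_price(100, 0))    # 100.0  (the instantaneous backtest)
print(buy_price(100, 200))  # 100.05 (the quote had already moved on arrival)
```

The five-cent difference is the opportunity the zero-latency backtest counted and the delayed one does not.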

What is overfitting, and why is multiple-testing the 2026 killer?

Overfitting is tuning a strategy to fit the noise in your historical sample rather than a real, repeatable pattern. The mechanism is multiple testing: try enough variations and the luckiest one will look brilliant in-sample purely by chance. In 2026, automated and AI-assisted strategy search makes trying thousands of variants effortless, so this is now the dominant way backtests lie.

The intuition is a coin-flipping parable. Flip 1,000 coins ten times each and the best coin will have a great run, not because it is a good coin, but because you searched 1,000 of them. Search 1,000 parameter sets and the best in-sample Sharpe is similarly a foregone illusion: you found the luckiest configuration, not a real edge.
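The parable is easy to simulate. In this sketch every coin is fair by construction, so each coin's expected result is 5 heads out of 10; the function name and the seed are arbitrary choices for illustration.

```python
import random


def best_of_n_fair_coins(n_coins=1000, flips=10, seed=0):
    """Return the highest head count among n_coins fair coins."""
    rng = random.Random(seed)
    return max(
        sum(rng.random() < 0.5 for _ in range(flips))
        for _ in range(n_coins)
    )


best = best_of_n_fair_coins()
single = best_of_n_fair_coins(n_coins=1)
print(f"best of 1,000 coins: {best}/10 heads; a single coin: {single}/10 heads")
```

The best of a thousand coins will almost always show a run far above 5, even though no coin has any edge; the search, not the coin, produced it.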

The statistics make this precise. The expected maximum of $N$ noisy results grows with $N$, so the more strategies or parameters you try, the higher the best in-sample performance you will see even when every true edge is zero. The honest Sharpe must be discounted for how many things you tried: this is the deflated Sharpe ratio (Bailey & López de Prado 2014), which adjusts the observed Sharpe for the number of trials and the non-normality of returns.

Optional: the maximum-of-N argument and the deflation

Suppose you run $N$ independent strategies, each with a true Sharpe of zero, and each estimated Sharpe is approximately Gaussian noise with standard error $\hat{\sigma}_{SR}$. The expected best of the $N$ is not zero; it grows like the maximum of $N$ standard normals.

$$\mathbb{E}\!\left[\max_{i\le N} \widehat{SR}_i\right] \approx \hat{\sigma}_{SR}\left[(1-\gamma)\,\Phi^{-1}\!\left(1-\tfrac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1-\tfrac{1}{N e}\right)\right]$$

where $\Phi^{-1}$ is the inverse normal CDF and $\gamma$ is the Euler–Mascheroni constant. This rising threshold is the bar a real edge must clear. The deflated Sharpe ratio turns an observed Sharpe into the probability that the true Sharpe exceeds zero, after accounting for that bar, the number of trials, and the skew and kurtosis of the returns.

$$\widehat{DSR} = \Phi\!\left(\frac{\big(\widehat{SR} - SR_0\big)\sqrt{T-1}}{\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \tfrac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{\,2}}}\right)$$

Here $SR_0$ is the inflated benchmark from the maximum-of-N expression above, $T$ is the number of observations, and $\hat{\gamma}_3,\hat{\gamma}_4$ are the skewness and kurtosis of the returns. A Sharpe of 3 found after 500 attempts can deflate to a probability barely above one-half. Reference: Bailey & López de Prado, "The Deflated Sharpe Ratio" (2014).
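The two expressions translate almost line for line into code. The sketch below uses only the Python standard library; the function names are hypothetical, the Sharpe ratios are assumed to be per-observation (not annualised) so that they are on the same scale as $T$, and the inputs in the usage lines are arbitrary illustrations rather than results.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_N = NormalDist()                 # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def expected_max_sharpe(n_trials, sr_std):
    """Approximate E[max of n_trials zero-edge Sharpe estimates]; needs n_trials >= 2."""
    return sr_std * (
        (1 - EULER_GAMMA) * _N.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _N.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe(sr, sr0, n_obs, skew=0.0, kurt=3.0):
    """DSR: probability that the true Sharpe exceeds the benchmark sr0.

    sr and sr0 are per-observation Sharpe ratios; skew and kurt describe the returns
    (the defaults correspond to normally distributed returns).
    """
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return _N.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)


sr0_10 = expected_max_sharpe(10, sr_std=0.05)
sr0_500 = expected_max_sharpe(500, sr_std=0.05)
print(sr0_10 < sr0_500)  # True: more trials raise the bar
print(deflated_sharpe(sr=sr0_500, sr0=sr0_500, n_obs=1250))  # 0.5: exactly at the bar
```

An observed Sharpe that merely equals the maximum-of-N benchmark scores one-half, which is the sense in which an impressive-looking result can deflate to a coin flip.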

2026 makes this worse, not better: automated hyperparameter search and AI-assisted research collapse the cost of trying variants to nearly zero. The thing that used to limit overfitting (the effort of running each test) is gone, so the discipline must come from method, not friction (see what AI changes for HFT).

The defences are concrete. Out-of-sample / hold-out: fit and tune on one period, then evaluate once on a period you never touched. Touching the hold-out more than once destroys it, because every peek is another test. Walk-forward validation: roll the train/test window forward through time so every evaluation is genuinely out-of-sample across regimes. Trial counting: count your trials honestly and deflate the Sharpe for them. Simplicity: prefer simple, economically motivated strategies with few parameters; there is less to overfit, and there is a reason the strategy should work beyond "it fit the past".
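Walk-forward validation reduces to generating index windows in which every test block lies strictly after its training block. The generator below is a minimal sketch with hypothetical names and a fixed-length (rolling) training window; an expanding window is an equally valid choice.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_indices, test_indices); each test block follows its train block."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward so test blocks never overlap


for train, test in walk_forward_splits(n_obs=10, train_size=4, test_size=2):
    print(list(train), "->", list(test))
# [0, 1, 2, 3] -> [4, 5]
# [2, 3, 4, 5] -> [6, 7]
# [4, 5, 6, 7] -> [8, 9]
```

Parameters are fitted on each training window only and then frozen for the following test window; the concatenated test blocks form the out-of-sample record.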

What is market-replay simulation, and how does it differ from a bar backtest?

Market-replay simulation feeds the recorded message stream (every order, cancel and trade, in time order) through your system exactly as it happened, and matches your orders against the reconstructed book as it evolves. Unlike a coarse bar-by-bar backtest, it can represent the queue, the spread, partial fills and latency, the things that decide a high-frequency strategy's real P&L.

A bar backtest is the crude tool: you have OHLCV bars and assume fills at the open, close or mid. It cannot see the spread, the queue or intrabar dynamics, so it is structurally incapable of honestly testing any passive or latency-sensitive strategy. It is fine for slow, signal-driven equity strategies and useless for market making.

A market-replay / event-driven simulation is the honest tool: replay the recorded tape message-by-message, rebuild the book at every instant, and inject your orders into that book: they take a queue position, wait their turn, fill (or not) as real volume trades through, and pay real fees, after your real latency. This is the same event-driven architecture as the live system, which is the point: the simulator and production share the strategy code, so you test what you will run.

The hard part is whether your order changes the market. A faithful simulation must decide whether your presence affects others. The simplest and most conservative assumption is that your orders take liquidity and queue position but do not cause others to react; modelling market impact and others' responses is harder and matters more as your size grows (see capacity). This is why the build order puts "record and replay" and the backtest harness before the strategy: the simulator is the instrument that tells you whether anything is real.

What is conformance and exchange certification testing?

Before a venue lets your system trade live, it usually requires conformance (certification) testing: you connect to the exchange's test environment and prove that your system correctly speaks the protocol, handles every order type and message, and behaves under error and recovery conditions. It is a mandatory gate that validates correctness and safety, distinct from backtesting, which validates the strategy.

What it checks: that you encode and decode the venue's protocol correctly, handle rejects, cancels, partial fills, session recovery and sequence gaps, respect order-type semantics, and do not misbehave under load; in short, that your system is a safe, compliant participant. Many venues and regulators require it before enabling production access, and it dovetails with the MiFID II RTS 6 and Rule 15c3-5 risk-control obligations.

It is separate from backtesting because the two answer different questions. Backtesting asks "does the strategy make money?"; conformance asks "does the system behave correctly against the real venue?". You need both: a profitable strategy on a non-conformant system is a regulatory and operational disaster waiting to happen. For the solo builder, crypto and prediction-market venues are far lighter on formal certification than regulated equities and futures exchanges (another reason those venues are the accessible starting point), but you still owe yourself the equivalent self-test against the venue's sandbox before risking capital.

Worked example

Take a single toy strategy with a true edge of exactly zero and run it through the explorer above, turning on each sin in turn. Because the strategy provably cannot make money, every gain below is a measured artefact, not a result; the figures are illustrative and reproducible from the widget's seed, as of 2026.

Honest configuration (out-of-sample, real fills, costs and latency) reports a Sharpe of about zero and an annual return of roughly zero minus costs. That is the truth. Add look-ahead bias alone and the Sharpe jumps to about three on a large, smooth curve: pure fiction, the strategy seeing the future. Add optimistic fills alone (mid, always filled) and you collect an imaginary spread worth a Sharpe of one to two. Add survivorship alone (drop the dead names) for a modest lift, the failures simply deleted. Add overfitting (keep the best of 200 random parameter sets) and the best in-sample curve looks great, the maximum-of-N illusion.

With all the sins switched on, the explorer produces a beautiful in-sample curve at a Sharpe near three. Evaluate that same configuration out-of-sample and it collapses to zero: the cliff. The edge was never there.

$$SR_{\text{in-sample, all sins}} \approx 3 \qquad\longrightarrow\qquad SR_{\text{out-of-sample}} \approx 0$$

The lesson in one line: the out-of-sample result is the only one that means anything. A backtest is guilty until proven innocent on data it has never seen, with realistic fills, costs and latency, and an honest count of how many times you tried. The widget lets you reproduce every configuration yourself; the numbers are deliberately a fiction you can audit, because the strategy's true edge is zero by construction. This is educational only and not investment advice; no performance is implied, and the figures exist solely to demonstrate the biases.

Where this fits

↑ Up · building block of Systems & building
↔ Across · composes with Capacity & alpha decay
↔ Across · composes with Market impact
↔ Across · composes with Research-to-production pipeline
→ Apply · makes money in Statistical arbitrage
→ Apply · makes money in Market making
→ Apply · makes money in Crypto market making
⤓ Build / Buy · tool needed Datasets & tools