How to Compare Backtest Results for a 0DTE Strategy
Comparing backtests is about evidence, not impressions. If you change entry time, deltas, or management rules, you need a consistent method to decide whether Version B is actually better than Version A—or just looks better because of chance or settings drift. This guide gives a practical, repeatable process tailored to 0DTE SPX strategies and GreeksLab’s analytics.
1) Make It an Apples-to-Apples Test
Keep everything identical except the variable you’re testing.
Hold constant:
- Date range and trading calendar (exclude half-days or keep them in both)
- Capital, sizing model, commissions, slippage
- Entry time and its granularity (e.g., do not compare a 9:31 entry against a 9:45 entry unless entry time is the variable under test)
- Underlying/universe (e.g., SPX only)
GreeksLab tip: Duplicate a strategy, change one parameter, and re-run. Log the change (“A: 16Δ; B: 12Δ”) and the run IDs.
2) Choose Primary and Guardrail Metrics
Pick 1–2 primary metrics that reflect your objective, and 3–5 guardrails to control risk.
Common primary metrics
- Risk-adjusted return: Sharpe, Sortino
- Profit efficiency: Average daily P&L, Return / Margin used
- Drawdown efficiency: MAR (CAGR / Max DD) for multi-year tests
Guardrails
- Max drawdown (absolute and %)
- CVaR / Expected shortfall (e.g., the average P&L of the worst 5% of days)
- Win rate vs payoff ratio (don’t accept a higher win rate if the payoff ratio collapses)
- Tail risk indicators: worst day, clusters of losses
- Exposure: average and peak number of open positions
GreeksLab tip: Use the Overview tab for headline KPIs and the P&L distribution and underwater charts to sanity-check tails.
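If you export daily P&L and want to cross-check the headline numbers yourself, the metrics above can be computed in a few lines. This is a minimal sketch, not GreeksLab's implementation: it assumes a plain list of daily P&L in dollars, a zero risk-free rate, 252 trading days per year, and a downside target of 0 for Sortino. The function names and the sample data are made up for illustration.

```python
import math
import statistics

def sharpe(daily_pnl, periods_per_year=252):
    """Annualized Sharpe on daily P&L (risk-free rate assumed to be 0)."""
    return statistics.mean(daily_pnl) / statistics.stdev(daily_pnl) * math.sqrt(periods_per_year)

def sortino(daily_pnl, periods_per_year=252):
    """Annualized Sortino: mean over downside deviation, with a target of 0."""
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in daily_pnl) / len(daily_pnl))
    return statistics.mean(daily_pnl) / downside * math.sqrt(periods_per_year)

def max_drawdown(daily_pnl):
    """Largest peak-to-trough drop of cumulative P&L, returned as a positive number."""
    equity = peak = worst = 0.0
    for x in daily_pnl:
        equity += x
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def cvar(daily_pnl, tail=0.05):
    """Mean P&L of the worst `tail` fraction of days (always at least one day)."""
    k = max(1, int(len(daily_pnl) * tail))
    return statistics.mean(sorted(daily_pnl)[:k])

# Illustrative data only, not real backtest output.
pnl = [120, 80, -300, 150, 90, -50, 110, -400, 130, 100]
print(sharpe(pnl), sortino(pnl), max_drawdown(pnl), cvar(pnl, tail=0.2))
```

Run the same functions on both versions' exports so that A and B are measured with identical definitions.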
3) Compare by Market Regime (Not Just Aggregate)
A strategy can “win” overall while losing in key regimes.
Segment by:
- Volatility (e.g., VIX buckets: <15, 15–20, 20–25, >25)
- Trend/Range (up, down, chop)
- Time-of-day (entry hour)
- Event days (e.g., FOMC vs non-event)
GreeksLab tip: Use the Insights tab to slice results by volatility, weekday, entry hour, etc. Prefer a version that is robust across slices, not just top-line.
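For exported data, the volatility slicing can be reproduced as a simple group-by. The sketch below assumes each day is a tuple of (VIX level, P&L of A, P&L of B); that layout, the function names, and the handling of values that fall exactly on a bucket edge are assumptions for illustration.

```python
from collections import defaultdict
import statistics

def vix_bucket(vix):
    """Map a VIX level to the buckets <15, 15-20, 20-25, >25.
    Edge handling (15 and 20 go to the higher bucket, 25 to 20-25) is an assumption."""
    if vix < 15:
        return "<15"
    if vix < 20:
        return "15-20"
    if vix <= 25:
        return "20-25"
    return ">25"

def mean_by_bucket(days):
    """days: iterable of (vix, pnl_a, pnl_b). Returns {bucket: (n_days, mean_a, mean_b)}."""
    groups = defaultdict(list)
    for vix, pnl_a, pnl_b in days:
        groups[vix_bucket(vix)].append((pnl_a, pnl_b))
    return {
        bucket: (
            len(rows),
            statistics.mean([a for a, _ in rows]),
            statistics.mean([b for _, b in rows]),
        )
        for bucket, rows in groups.items()
    }

# Illustrative data only.
days = [
    (13.2, 100, 90), (14.1, 120, 110), (17.5, 80, 95),
    (22.0, -200, -50), (27.3, -400, -450), (18.9, 60, 70),
]
by_bucket = mean_by_bucket(days)
for bucket, (n, mean_a, mean_b) in by_bucket.items():
    print(bucket, n, mean_a, mean_b)
```

Keep the day count per bucket in view: a slice with only a handful of days says little about which version is better there.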
4) Use Paired, Day-Matched Comparisons
When A and B trade on the same days, compare their day-by-day differences. It’s more sensitive and fair than comparing unpaired aggregates.
Procedure
- Align the two backtests on calendar days.
- Compute ΔPnL = PnL_B − PnL_A per day.
- Review:
  - Median ΔPnL (less sensitive to outliers)
  - % of days B > A
  - Worst ΔPnL (downside surprise)
  - Distribution of ΔPnL (fat left tail?)
Why it matters: 0DTE returns are non-normal and serially dependent. Because both versions face the same market path on each day, pairing cancels much of the shared day-level noise and isolates the effect of the change.
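The procedure above can be sketched as follows. It assumes each backtest has been exported as a mapping from date string to daily P&L; the dictionary layout and the function name are illustrative assumptions. Only days present in both runs are compared.

```python
import statistics

def paired_summary(pnl_a, pnl_b):
    """pnl_a, pnl_b: dicts of date string -> daily P&L. Compares shared dates only."""
    days = sorted(set(pnl_a) & set(pnl_b))
    deltas = [pnl_b[d] - pnl_a[d] for d in days]  # delta = PnL_B - PnL_A per day
    return {
        "n_days": len(days),
        "median_delta": statistics.median(deltas),
        "pct_b_better": 100.0 * sum(d > 0 for d in deltas) / len(deltas),
        "worst_delta": min(deltas),
    }

# Illustrative data only; B has one extra day that is ignored by the pairing.
a = {"2024-01-02": 100, "2024-01-03": -250, "2024-01-04": 80, "2024-01-05": 60}
b = {"2024-01-02": 120, "2024-01-03": -300, "2024-01-04": 110,
     "2024-01-05": 95, "2024-01-08": 40}
summary = paired_summary(a, b)
print(summary)
```

If the two runs share far fewer days than either run has on its own, check the settings before trusting the comparison: the runs may not be day-matched at all.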
5) Look Beyond Averages: Distributions and Tails
- Histogram / KDE of daily PnL: Did the “better” version just add a few outsized wins?
- Left-tail focus: Compare the 5th and 1st percentiles (or CVaR). 0DTE strategies can fail through clustered tail losses.
- Run-lengths: Max consecutive loss days. Can you survive that sequence?
GreeksLab tip: Use the P&L distribution and drawdown charts. A small Sharpe improvement is not worth a much fatter left tail.
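As a rough cross-check of the charts, the left-tail percentiles and the longest losing streak can be computed from exported daily P&L. This sketch assumes a plain list of daily P&L; the helper names are hypothetical. Percentiles come from the standard library's `statistics.quantiles`, so on a short sample they are interpolated estimates.

```python
import statistics

def lower_percentile(daily_pnl, pct):
    """Estimate the pct-th percentile (pct from 1 to 99) of daily P&L."""
    cuts = statistics.quantiles(daily_pnl, n=100, method="inclusive")
    return cuts[pct - 1]

def max_consecutive_losses(daily_pnl):
    """Length of the longest run of consecutive losing days."""
    longest = current = 0
    for x in daily_pnl:
        current = current + 1 if x < 0 else 0
        longest = max(longest, current)
    return longest

# Illustrative data only.
pnl = [50, -20, -30, 40, -10, -60, -80, 70, 20, -5]
print(lower_percentile(pnl, 5), lower_percentile(pnl, 1), max_consecutive_losses(pnl))
```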
6) Sample Size and Stability Checks
- Minimum sample: For intraday 0DTE, aim for hundreds of trading days across multiple vol regimes.
- Stability: Break your test into yearly or quarterly chunks. Does B outperform A in most chunks?
- Walk-forward: Optimize on Period 1, validate on Period 2, then roll forward.
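The chunked stability check can be sketched like this, assuming date-keyed daily P&L in `YYYY-MM-DD` format for both versions. The function name and the choice of calendar quarters are illustrative; yearly chunks work the same way.

```python
from collections import defaultdict

def quarterly_edge(pnl_a, pnl_b):
    """Return {quarter: total P&L of B minus total P&L of A} over shared dates."""
    edge = defaultdict(float)
    for day in set(pnl_a) & set(pnl_b):
        year, month = int(day[:4]), int(day[5:7])
        quarter = f"{year}Q{(month - 1) // 3 + 1}"
        edge[quarter] += pnl_b[day] - pnl_a[day]
    return dict(edge)

# Illustrative data only.
a = {"2023-01-10": 100, "2023-02-14": -50, "2023-04-03": 80,
     "2023-05-22": 40, "2023-08-01": -120}
b = {"2023-01-10": 130, "2023-02-14": -40, "2023-04-03": 60,
     "2023-05-22": 30, "2023-08-01": -90}
edge = quarterly_edge(a, b)
b_wins = sum(v > 0 for v in edge.values())
print(edge, f"B wins {b_wins} of {len(edge)} quarters")
```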
Red flags
- Performance concentrated in a short window
- Version B wins only in one regime you over-weighted
- Highly parameter-sensitive results (tiny tweaks flip the outcome)
7) Multiple Comparisons Discipline
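If you run 20 variations, some will “win” by chance: at a 5% significance level, about one in 20 tests of a change with no real effect will still look significant.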
Mitigations
- Pre-register 2–3 hypotheses (e.g., “12Δ vs 16Δ,” “9:31 vs 10:00 entry”).
- Use out-of-sample validation or walk-forward.
- Prefer simpler rules if performance is similar.
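To get a feel for the size of the problem, the sketch below computes the chance that at least one of several no-effect variations passes a significance threshold, plus the Bonferroni-adjusted threshold that compensates for it. It assumes the tests are independent, which backtest variations run on the same data are not, so treat the result as an illustration rather than an exact figure. The function names are hypothetical.

```python
def prob_any_false_win(n_variations, alpha=0.05):
    """Chance that at least one of n independent no-effect tests passes at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_variations

def bonferroni_alpha(n_variations, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate at or below alpha."""
    return alpha / n_variations

print(round(prob_any_false_win(20), 3))  # about 0.642 for 20 variations at 5%
print(bonferroni_alpha(20))
```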
8) Execution Reality Check
Backtests can be over-optimistic if fills are too generous.
Sanity checks
- Increase slippage assumptions and re-run; does B still beat A?
GreeksLab tip: In Backtest Settings, model slippage and commissions and keep them identical across runs. Stress them higher to test fragility.
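If you cannot easily re-run with new settings, a coarse approximation is to subtract extra slippage from exported results and check whether the ranking survives. The sketch below assumes you know the number of contracts traded per day for each version and that extra slippage is a flat dollar amount per contract; both are simplifying assumptions, and the names are hypothetical. Re-running the backtest with stressed settings remains the more faithful test.

```python
def stressed_total(daily_pnl, contracts_per_day, extra_slippage):
    """Total P&L after subtracting extra_slippage dollars per contract traded."""
    return sum(p - c * extra_slippage for p, c in zip(daily_pnl, contracts_per_day))

def b_beats_a_under_stress(a, b, slippage_levels):
    """a, b: (daily_pnl, contracts_per_day) tuples. Returns {slippage: B total > A total}."""
    return {
        s: stressed_total(*b, s) > stressed_total(*a, s)
        for s in slippage_levels
    }

# Illustrative data only: B earns more but trades more contracts per day.
version_a = ([100, 80, -50], [8, 8, 8])
version_b = ([120, 90, -40], [20, 20, 20])
result = b_beats_a_under_stress(version_a, version_b, [0.0, 1.0, 2.0])
print(result)
```

In this made-up example B leads at the base case and at $1 of extra slippage per contract, but loses at $2, which is the kind of fragility the stress test is meant to expose.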
9) Decision Framework (Go / No-Go)
Use a simple scoring sheet. Example:
| Criterion | Weight | A result | B result | Notes |
|---|---|---|---|---|
| Primary metric (Sortino, higher is better) | 3x | 6.2 | 7.1 | B wins |
| Max drawdown (shallower is better) | 3x | -18% | -23% | A wins on DD |
| CVaR 5% (smaller loss is better) | 2x | -$1.9k | -$2.4k | A has the better tail |
| Median ΔPnL (B − A) | 2x | n/a | +$35 | B wins per-day median |
| Regime robustness (wins in slices) | 2x | 3/6 | 5/6 | B more consistent |
| Execution stress (slippage ↑ 2×) | 2x | Breaks | Holds | B more resilient |
Rule of thumb: Approve B only if it wins on the primary metric, does not worsen tail risk materially, and survives execution stress.
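The rule of thumb can be written down as an explicit check so it is applied the same way every time. In this sketch, "materially" is turned into a hypothetical threshold (CVaR may worsen by at most 10%); that number is an assumption you should set for your own risk tolerance, and the function name is illustrative.

```python
def approve_b(primary_a, primary_b, cvar_a, cvar_b, survives_stress,
              max_tail_worsening=0.10):
    """Go/No-Go: B must win the primary metric, keep tail risk within
    max_tail_worsening of A (hypothetical threshold), and survive execution stress."""
    wins_primary = primary_b > primary_a
    # CVaR values are negative P&L numbers, so compare loss magnitudes.
    tail_worsening = (abs(cvar_b) - abs(cvar_a)) / abs(cvar_a)
    tail_ok = tail_worsening <= max_tail_worsening
    return wins_primary and tail_ok and survives_stress

# Values from the example sheet: Sortino 6.2 vs 7.1, CVaR -$1.9k vs -$2.4k.
decision = approve_b(6.2, 7.1, -1900, -2400, survives_stress=True)
print(decision)
```

With the example sheet's numbers and a 10% tolerance, B is rejected even though it wins on Sortino, because its CVaR is roughly 26% worse than A's.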
10) Common Pitfalls (Avoid These)
- Comparing runs with different date ranges or costs
- Changing multiple parameters at once
- Cherry-picking regimes after looking at results
- Declaring victory on tiny effect sizes with short samples
GreeksLab Workflow (Step-by-Step)
- Duplicate baseline strategy. Rename clearly (e.g., “IC 16Δ → IC 12Δ”).
- Change one parameter only (delta, entry time, stop rule, etc.).
- Run both with the same backtest settings (dates, costs, slippage).
- In Overview, record: Sharpe/Sortino, Max DD, CVaR, Avg Daily PnL.
- In Insights, compare by VIX buckets, weekday, entry hour.
- In Positions/Daily, export day-matched results; compute ΔPnL distribution.
- Stress slippage/commissions; re-run. Check if ranking holds.
- Decide with the Go/No-Go framework. Document run IDs and rationale.
Summary
- Keep comparisons controlled (one change at a time, identical settings).
- Evaluate risk-adjusted performance and tail behavior, not just averages.
- Segment by regime, do paired day comparisons, and stress execution assumptions.
- Approve changes only when improvements are consistent, robust, and practically tradable.
Use this checklist every time you iterate. It will save you from overfitting and false positives—and surface changes that actually matter in live 0DTE trading.