How retirement simulations work, and how to read success rates
When Calcifer reports an “87% success rate,” it's answering a precise question: in what percentage of simulated retirements does your money last through your planned retirement period without hitting zero? There are two fundamentally different ways to run that simulation — historical backtesting and Monte Carlo — and they answer slightly different questions.
Understanding which one you're looking at, and what each one can and cannot tell you, makes a real difference in how you interpret the numbers. A 90% success rate from a Monte Carlo run is not the same as a 90% rate from a historical backtest — even if they point in the same direction.
Historical backtesting tests your plan against every actual market sequence since 1871, using Shiller's long-run US stock and bond return data. If you're modeling a 30-year retirement, the test runs your portfolio through 1871–1901, then 1872–1902, then 1873–1903, and so on through roughly 120 overlapping 30-year windows. Each window is a real-world retirement scenario, complete with the 1929 crash, the stagflation of the 1970s, the dot-com bust, and 2008.
The success rate is simply the percentage of those windows where your portfolio never hit zero before the end of the period. A 94% historical success rate at the 4% rule over 30 years means roughly 6 of those 120 windows failed — and those failures were almost exclusively scenarios where someone retired in the late 1920s or mid-1960s into terrible sequence-of-returns conditions.
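The windowed test can be sketched in a few lines. This is an illustrative Python sketch, not Calcifer's actual implementation; `real_returns` stands in for a Shiller-style series of annual real returns, and spending is held fixed in real terms.

```python
# Illustrative rolling-window historical backtest (not Calcifer's actual code).
def backtest(real_returns, withdrawal_rate=0.04, years=30, start_balance=1.0):
    """Return the fraction of overlapping `years`-long windows that survive."""
    annual_spend = withdrawal_rate * start_balance  # fixed real spending
    survived = 0
    windows = len(real_returns) - years + 1
    for start in range(windows):
        balance = start_balance
        for r in real_returns[start:start + years]:
            balance = (balance - annual_spend) * (1 + r)  # withdraw, then grow
            if balance <= 0:
                break  # this window failed
        else:
            survived += 1  # money lasted the full window
    return survived / windows
```

Because the windows overlap, consecutive windows share almost all of their years, which is exactly why the effective sample size is smaller than the window count suggests.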
Monte Carlo simulation does not replay history. Instead, it generates thousands of synthetic return sequences using statistical properties derived from historical returns — typically the mean and standard deviation of annual real returns. From those properties, it creates 10,000 or more random 30-year (or 40- or 50-year) return paths and counts how many of them allow your portfolio to survive.
The appeal is unlimited sample size. Where the historical backtest gives you ~120 windows for a 30-year period, Monte Carlo gives you 10,000 or 100,000 — each slightly different. This lets the simulation explore the full statistical distribution of outcomes, including tails that may be worse than any individual year in recorded history.
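A minimal normal-distribution Monte Carlo run looks like the sketch below. The mean and standard deviation are placeholder values for illustration, not Calcifer's calibrated parameters.

```python
# Hedged sketch of a normal-distribution Monte Carlo simulation.
import random

def monte_carlo(mean=0.05, stdev=0.17, withdrawal_rate=0.04,
                years=30, n_paths=10_000, seed=42):
    rng = random.Random(seed)
    survived = 0
    for _ in range(n_paths):
        balance = 1.0
        for _ in range(years):
            r = rng.gauss(mean, stdev)  # draw one synthetic annual real return
            balance = (balance - withdrawal_rate) * (1 + r)
            if balance <= 0:
                break  # this synthetic path failed
        else:
            survived += 1
    return survived / n_paths
```

Raising `n_paths` tightens the estimate but cannot fix a bad distributional assumption, which is why the variants below exist.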
Monte Carlo variants Calcifer supports
Normal distribution
Draws each year's return from a normal distribution parameterized by historical mean and standard deviation. Fast and transparent, but underweights tail events.
Bootstrap resampling
Instead of a mathematical distribution, this approach resamples randomly from the actual year-by-year historical returns. Each synthetic sequence is assembled from real years in random order. This preserves fat tails while still generating many more paths than pure historical backtesting.
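The resampling step itself is simple. This sketch assumes a single-year bootstrap; `historical_returns` is a placeholder for the actual year-by-year return series.

```python
# Sketch of bootstrap resampling: build each synthetic path by drawing real
# historical years at random, with replacement.
import random

def bootstrap_path(historical_returns, years=30, seed=0):
    rng = random.Random(seed)
    return [rng.choice(historical_returns) for _ in range(years)]
```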
CAPE-adjusted
Shifts the mean return assumption based on the current Shiller CAPE (cyclically adjusted P/E ratio). When valuations are high — as they have been in recent years — the expected real return is reduced, making this variant generally more pessimistic and arguably more relevant to planning.
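One common way to derive a CAPE-adjusted expected return is to blend the historical mean with the cyclically adjusted earnings yield, 1/CAPE. This formula and its parameters are an illustrative assumption, not necessarily what Calcifer uses.

```python
# Illustrative CAPE adjustment (an assumption, not Calcifer's exact formula):
# blend the historical mean real return with the earnings-yield estimate 1/CAPE.
def cape_adjusted_mean(cape, historical_mean=0.065, blend=0.5):
    """Higher CAPE (richer valuations) -> lower expected real return."""
    return blend * historical_mean + (1 - blend) * (1 / cape)
```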
Sequence-stressed
Upweights bad sequences at the start of retirement to deliberately stress-test sequence-of-returns risk. This is the most pessimistic variant and is useful for checking worst-case early-retirement scenarios.
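One simple way to build a stressed sequence (an illustrative approach, not necessarily Calcifer's method) is to move a sampled path's worst returns into the opening years:

```python
# Illustrative sequence stressing: relocate the worst returns to the start of
# retirement, leaving the remaining years in their original order.
def stress_sequence(path, stressed_years=5):
    worst = sorted(path)[:stressed_years]  # the k worst annual returns
    remaining = list(path)
    for r in worst:
        remaining.remove(r)
    return worst + remaining
```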
The chart below shows the approximate success rates from each method for the classic 4% rule over a 30-year retirement. They give similar but not identical answers — and that gap is meaningful information.
Success rate: 4% withdrawal, 30-year retirement
The ~3 percentage point gap is typical. Monte Carlo tends to run slightly lower because it can generate sequences worse than anything that actually occurred historically.
Divergence across withdrawal rates and horizons (approximate)
The two methods tend to diverge more at higher withdrawal rates and longer horizons — exactly the scenarios where stress-testing is most important.
Success rates give you a probability distribution over outcomes, but they require some interpretation. A 90% historical success rate does not mean “you have a 10% chance of going broke” in any literal sense — it means 10% of historical 30-year periods resulted in portfolio depletion before the end of the period, assuming no adjustments were made.
In real life, virtually no retiree follows a rigid rule with zero adjustment for 30 years. The real-world risk is lower than the model suggests, because real people cut spending in bad markets, pick up occasional work, or change course when they see their portfolio eroding. The simulation's job is to show you the guardrails, not to predict your exact outcome.
Historically very robust
The only failures at this success rate are typically 1929-era scenarios — someone who retired just before the Great Depression. Even then, most of those plans would have recovered if spending had been reduced modestly in the worst years. At this level, your plan is as resilient as the historical record allows.
Solid, with a flex plan
This range includes the classic 4% rule outcome for 30-year retirements. It's considered the consensus “safe withdrawal rate” zone. The sensible addition here is a guardrail rule: if your portfolio drops meaningfully in the first five years, reduce spending 10–15% for a year or two. That adjustment alone dramatically reduces real-world failure rates.
Meaningful risk — plan for flexibility
At this success rate, your withdrawal rate is high enough that adverse sequences produce real stress. This isn't necessarily disqualifying — if you have part-time income potential, a flexible budget, or plan to downsize, you can absorb what the model says is a 15–25% failure rate. But going in without any contingency plan at this level is risky.
High risk — reconsider the withdrawal rate
Below 75%, the withdrawal rate is likely too aggressive for the time horizon. Solutions include: reducing the withdrawal rate (lower spending or more savings), shortening the horizon in the model (planning for a legacy instead of full depletion), or explicitly modeling a flexible strategy like Guyton-Klinger that cuts spending in bad stretches. This is not a number to rationalize away.
The key nuance: “failure” is not going broke
In simulation terms, “failure” means the portfolio hits zero before the end of the modeled period under the rigid withdrawal rule. In reality, retirees adjust. They spend less in bad years, pick up occasional work, sell a vacation home, or adjust expectations. The simulation is deliberately conservative in assuming no behavioral response. Treat the success rate as a lower bound on your real-world probability of surviving — not a literal prediction.
So which method should you trust? The correct answer is both. They are complementary lenses, not competing answers to the same question.
Use historical backtest to understand what actually happened
If you want to know how your plan would have fared across real market history — with real crashes, real recoveries, and real economic regimes — historical backtesting is the right tool. It's grounded. Every data point is something that actually occurred. And because the failures are specific historical periods, you can investigate them: what happened in 1929, and would I have handled it differently?
Use Monte Carlo to stress-test against scenarios worse than history
The US market delivered exceptional returns over the past 150 years. That's partly good policy and institutional strength, and partly survivorship bias — we are looking at one of history's great economic success stories. Monte Carlo can generate paths worse than anything in that record: 20-year bear markets, persistent inflation, or extended low-return environments. If you want to plan conservatively beyond what history shows, Monte Carlo is the better tool.
The gap between them is useful signal
When historical backtest gives 94% and Monte Carlo gives 91%, those are close enough to suggest reasonable agreement. When they diverge significantly — say, 90% vs 75% — it means one method is picking up something the other isn't. A large gap often signals a scenario where the historical record has been relatively favorable but statistical modeling suggests the distribution has meaningful downside tails. That divergence is “model risk” and deserves investigation.
Both historical and Monte Carlo simulations reveal the same core truth about retirement planning: the sequence of returns in your first decade matters enormously. A 4% withdrawal from a $1M portfolio produces very different outcomes depending on whether markets go up or down in years 1–10.
If you retire into a strong bull market, early gains build a buffer that protects you from later downturns. If you retire into a bear market, withdrawals in down years lock in losses and deplete capital that can never recover. This is called sequence-of-returns risk, and it's the main reason the 4% rule sometimes fails even when long-run average returns look fine.
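A tiny worked example makes the point: the same five annual returns, applied in opposite orders under a fixed withdrawal, end in different places even though the average return is identical. The return numbers are invented for illustration.

```python
# Sequence-of-returns risk in miniature: identical returns, different order.
def final_balance(returns, withdrawal=0.04, start=1.0):
    balance = start
    for r in returns:
        balance = (balance - withdrawal) * (1 + r)  # withdraw, then grow
        if balance <= 0:
            return 0.0
    return balance

good_first = [0.20, 0.15, 0.10, -0.10, -0.20]  # gains early, losses late
bad_first = list(reversed(good_first))          # losses early, gains late
# Same average return, but final_balance(good_first) > final_balance(bad_first):
# early withdrawals taken during losses deplete capital that never recovers.
```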
The practical implication
Your first five years of retirement are when your plan is most vulnerable. If markets deliver poor returns in that window, consider a flexible strategy that can cut spending by 10–15% before the damage compounds. This is why Guyton-Klinger guardrails, CAPE-based rules, and variable percentage withdrawal strategies exist — they all have built-in mechanisms to reduce spending when portfolio stress is detected early.
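A minimal guardrail sketch, far simpler than full Guyton-Klinger and with invented thresholds: cut spending 15% whenever the balance falls more than 20% below its starting value, and restore it once the balance recovers.

```python
# Illustrative guardrail rule (invented thresholds, not Calcifer's defaults).
def guardrail_spend(balance, start_balance, base_spend, cut=0.15, trigger=0.80):
    """Reduce spending while the portfolio sits below the trigger level."""
    if balance < trigger * start_balance:
        return base_spend * (1 - cut)
    return base_spend
```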
How Monte Carlo models this vs. historical
Historical backtest preserves the actual autocorrelation of returns — bad years in the historical record were sometimes followed by more bad years (as in the 1966–1982 stagflation era). Simple Monte Carlo with independent draws does not capture this clustering. Single-year bootstrap resampling preserves the historical return distribution but still shuffles away the ordering; a block bootstrap, which resamples contiguous multi-year chunks, keeps some of the clustering. CAPE-adjusted Monte Carlo also helps by reducing expected returns when starting valuations are high, which historically correlates with mean-reversion and near-term drawdowns.
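One way to recover some of this return clustering in a simulation (an illustrative technique, not necessarily what Calcifer implements) is a block bootstrap: sample contiguous multi-year chunks of the historical record rather than independent single years.

```python
# Illustrative block bootstrap: consecutive historical years stay together,
# preserving some of the autocorrelation that single-year draws destroy.
import random

def block_bootstrap_path(returns, years=30, block=5, seed=0):
    rng = random.Random(seed)
    path = []
    while len(path) < years:
        start = rng.randrange(len(returns) - block + 1)
        path.extend(returns[start:start + block])  # one contiguous chunk
    return path[:years]
```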