A pre-trend too gentle to see can bias a difference-in-differences estimate by ~77%, and the standard test usually misses it
Difference-in-differences (DiD) is one of the most-used causal designs in economics, policy, and product analytics. Its credibility rests on one assumption: parallel trends — that, absent treatment, the treated and control groups would have moved in lockstep. The standard way to defend that assumption is a pre-trend test: check that the two groups were not already diverging before treatment. A non-significant pre-trend test is routinely read as "parallel trends holds, the estimate is clean."
We stress-tested that reasoning with a simulation, and the reassurance is largely false.
What we did. We simulated a DiD design (one treated unit, 20 controls, 6 pre-treatment and 4 post-treatment periods, a true treatment effect of 2.0) and injected three textbook violations of its identifying assumptions: a differential pre-trend, anticipation, and a composition shift. Each violation got 2,000 Monte Carlo draws (fixed seed, unit-level noise SD = 1). For every violation we measured two things: the bias it puts into the DiD estimate, and how often the pre-trend test detects it at the 5% level.
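To make the setup concrete, here is a minimal sketch of the drift scenario, not the code behind the numbers in this post. It assumes the drift enters as a linear term `slope * t` on the treated unit only, that controls are pure noise, and that the estimator is the plain post-minus-pre difference in the treated-minus-control gap. The names `simulate_panel`, `did_estimate`, and `mean_bias` are illustrative.

```python
import random
from statistics import fmean

N_CONTROLS, N_PRE, N_POST = 20, 6, 4
TRUE_EFFECT, NOISE_SD = 2.0, 1.0


def simulate_panel(rng, slope):
    """One draw: treated outcomes and control outcomes over periods 1..10.

    Assumption: the violation is a linear drift on the treated unit only.
    """
    periods = range(1, N_PRE + N_POST + 1)
    treated = [
        slope * t + (TRUE_EFFECT if t > N_PRE else 0.0) + rng.gauss(0.0, NOISE_SD)
        for t in periods
    ]
    controls = [
        [rng.gauss(0.0, NOISE_SD) for _ in periods] for _ in range(N_CONTROLS)
    ]
    return treated, controls


def did_estimate(treated, controls):
    """Post-minus-pre change in the treated-minus-control-average gap."""
    n_periods = N_PRE + N_POST
    control_avg = [fmean([unit[t] for unit in controls]) for t in range(n_periods)]
    gap = [y - c for y, c in zip(treated, control_avg)]
    return fmean(gap[N_PRE:]) - fmean(gap[:N_PRE])


def mean_bias(slope, n_draws=2000, seed=0):
    """Average DiD estimate across draws, minus the true effect."""
    rng = random.Random(seed)
    estimates = [did_estimate(*simulate_panel(rng, slope)) for _ in range(n_draws)]
    return fmean(estimates) - TRUE_EFFECT


if __name__ == "__main__":
    for slope in (0.0, 0.3, 0.6):
        print(f"slope {slope}: mean bias {mean_bias(slope):+.2f}")
```

Under these assumptions the bias should grow linearly with the drift slope and sit near zero when the slope is zero.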
What we found.
- A barely-visible pre-trend does most of the damage. A gentle differential drift of 0.3 units per period, the kind you would not notice by eye, biases the estimate by +1.54, or 77% of the true effect. A steeper drift (0.6/period) adds a bias of about 150% of the true effect, so the estimate comes out at more than double the truth.
- The test that is supposed to catch it usually doesn't. Against that 77%-bias drift, the standard pre-trend test fires only ~31% of the time (common normal-approximation cutoff), and only ~16% with the statistically correct small-sample t-test. Either way, the majority of studies ruined by a gentle pre-trend sail through the check.
- At short panels the test is also mis-sized. With only 6 pre-periods, the normal-approximation version rejects a perfectly clean design ~13% of the time at a nominal 5% level — so it both passes bad studies and cries wolf on good ones.
- Not all violations are equal. "Anticipation" (the outcome reacts just before treatment) and "composition" (a level shift partway through the sample) biased the estimate less here — at most ~50% of the true effect — and detection roughly tracked the size of the violation.
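The detection rates in the bullets above can be explored with a sketch like the following. It assumes one common form of the pre-trend test, which may differ from the one used for this post: an OLS regression of the pre-period treated-minus-control gap on time, rejecting when the slope's t-statistic exceeds a cutoff. It simulates the gap directly, using the fact that treated noise minus the average of 20 control noises has variance 1 + 1/20. The cutoffs are 1.96 (normal approximation) and 2.776 (Student-t with 6 - 2 = 4 degrees of freedom). The names `pretrend_t_stat` and `rejection_rates` are illustrative.

```python
import math
import random

N_PRE, N_CONTROLS, NOISE_SD = 6, 20, 1.0
Z_CRIT = 1.96        # two-sided 5% cutoff, normal approximation
T_CRIT_DF4 = 2.776   # two-sided 5% cutoff, Student-t with 4 degrees of freedom


def pretrend_t_stat(gap):
    """t-statistic for the OLS slope of the pre-period gap on time."""
    n = len(gap)
    times = range(1, n + 1)
    t_bar = sum(times) / n
    g_bar = sum(gap) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (g - g_bar) for t, g in zip(times, gap)) / sxx
    intercept = g_bar - slope * t_bar
    rss = sum((g - intercept - slope * t) ** 2 for t, g in zip(times, gap))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def rejection_rates(slope, n_draws=2000, seed=0):
    """Share of draws rejected under the normal cutoff and the t cutoff."""
    rng = random.Random(seed)
    # Assumption: gap noise = treated noise minus the mean of the control noises.
    gap_sd = NOISE_SD * math.sqrt(1 + 1 / N_CONTROLS)
    hits_z = hits_t = 0
    for _ in range(n_draws):
        gap = [slope * t + rng.gauss(0.0, gap_sd) for t in range(1, N_PRE + 1)]
        stat = abs(pretrend_t_stat(gap))
        hits_z += stat > Z_CRIT
        hits_t += stat > T_CRIT_DF4
    return hits_z / n_draws, hits_t / n_draws


if __name__ == "__main__":
    for slope in (0.0, 0.3):
        z_rate, t_rate = rejection_rates(slope)
        print(f"slope {slope}: normal cutoff {z_rate:.1%}, t cutoff {t_rate:.1%}")
```

Running it with slope 0.0 gives the test's size on a clean design, and slope 0.3 gives its power against the gentle drift. The t cutoff can never reject more often than the normal one, since it is the stricter threshold applied to the same statistic.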
The practical rule. A non-significant pre-trend test is weak evidence of parallel trends when you have few pre-periods — its power against the single most damaging violation is around one in three, or worse. Don't treat "passed the pre-trend test" as clearance. Prefer longer pre-treatment windows, explicit sensitivity bounds (e.g. honest-DiD-style robustness to trend violations), or a design that doesn't lean on parallel trends at all.
What would change our mind. A pre-trend test, or a modern alternative, that achieves high power against a slope-0.3 violation at six or fewer pre-periods would break the "weak clearance" conclusion. We'd publish that.
(Numbers above were re-measured from scratch for this post; the bias figures also match a closed-form check — a drift of slope s over this design biases DiD by s times the gap between the post- and pre-period midpoints.)
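The closed-form check in the note above is short enough to write out. This sketch assumes periods are numbered 1 to 10, though only the distance between the two midpoints matters. The helper name `drift_bias` is illustrative.

```python
def midpoint(first, last):
    """Midpoint of a run of consecutive period indices."""
    return (first + last) / 2


def drift_bias(slope, n_pre=6, n_post=4):
    """Bias a linear drift adds to DiD: slope times the post/pre midpoint gap."""
    pre_mid = midpoint(1, n_pre)                     # periods 1..6  -> 3.5
    post_mid = midpoint(n_pre + 1, n_pre + n_post)   # periods 7..10 -> 8.5
    return slope * (post_mid - pre_mid)


TRUE_EFFECT = 2.0
for slope in (0.3, 0.6):
    bias = drift_bias(slope)
    print(f"slope {slope}: bias {bias:.2f} ({bias / TRUE_EFFECT:.0%} of the true effect)")
```

With six pre-periods and four post-periods the midpoints are 3.5 and 8.5, a gap of 5 periods, so a slope of 0.3 gives a bias of 1.5 and a slope of 0.6 gives 3.0. The simulated figures carry Monte Carlo noise, so they land near these values rather than exactly on them.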