Robustness checks aren't ritual - they're a measurable filter (if the tests are independent)
In causal inference you can never prove an effect is real. You can only subject it to severe tests - placebo-in-time, placebo-in-space, leave-one-out, pre-trend checks - and trust an estimate a little more each time it survives one. This is Karl Popper's idea of corroboration: a claim earns credibility not by proof but by surviving honest attempts to kill it. The question practitioners rarely quantify: how much does surviving one more test actually buy you?
We built the smallest model that can answer it. Take a population of candidate causal claims in which 70% are spurious (a deliberately harsh base rate). Put each through five imperfect, independent falsification tests, where a genuinely real effect survives any one test about 85% of the time and a spurious one about 45%. Then ask: among the claims that survive exactly k of the five tests, what fraction are spurious?
tests survived (k of 5) | false-discovery rate
0 | ~100%
1 | ~99.6%
2 | ~97%
3 | ~82%
4 | ~40%
5 | ~9%
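The table can be checked without running a simulation. The sketch below is a closed-form calculation under the stated assumptions (70% spurious, per-test survival of 0.85 for real effects and 0.45 for spurious ones, independent tests); it is not the original simulation code, and the function and parameter names are hypothetical.

```python
from math import comb


def fdr_by_survivals(n_tests=5, base_rate_spurious=0.70,
                     p_real=0.85, p_spurious=0.45):
    """Return {k: share of spurious claims among those surviving exactly k tests}."""
    out = {}
    for k in range(n_tests + 1):
        # Binomial probability of exactly k survivals for each kind of claim
        # (valid only if the tests are independent).
        lik_spurious = comb(n_tests, k) * p_spurious**k * (1 - p_spurious)**(n_tests - k)
        lik_real = comb(n_tests, k) * p_real**k * (1 - p_real)**(n_tests - k)
        # Bayes' rule: weight each likelihood by its prior share of the population.
        num = base_rate_spurious * lik_spurious
        out[k] = num / (num + (1 - base_rate_spurious) * lik_real)
    return out


if __name__ == "__main__":
    for k, fdr in fdr_by_survivals().items():
        print(f"{k} | {fdr:.1%}")
```

Changing `base_rate_spurious` or the two survival probabilities shows how sensitive the bottom row is to the assumed test quality.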
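Surviving all five independent tests drops the false-discovery rate from the 70% base rate to under 10%, and the rate falls monotonically with each additional test survived. Under independence, every survived test multiplies the odds that a claim is spurious by the same factor, so the tests act as a multiplicative filter. The filter cuts both ways: a failed test counts against a claim, which is why claims that survive three or fewer of the five are more likely to be spurious than the base rate. So the craft habit of "running robustness checks" is not ritual: it measurably earns trust.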
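One way to see why the filter is multiplicative is to write the update in odds form. This is a worked sketch under the model's stated numbers, not part of the original simulation; S (spurious), R (real) and k (tests survived) are notation introduced here.

```latex
\[
\frac{P(S \mid k \text{ of } 5)}{P(R \mid k \text{ of } 5)}
= \frac{0.70}{0.30}
\left(\frac{0.45}{0.85}\right)^{k}
\left(\frac{0.55}{0.15}\right)^{5-k}
\]
\[
k = 5:\quad
\frac{0.70}{0.30}\left(\frac{0.45}{0.85}\right)^{5} \approx 0.097
\quad\Longrightarrow\quad
\mathrm{FDR} = \frac{0.097}{1 + 0.097} \approx 0.088
\]
```

Each survived test multiplies the odds of being spurious by about 0.53, and each failed test multiplies them by about 3.7. The binomial coefficients cancel because they are the same for real and spurious claims.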
But there is a load-bearing condition, and it is the part most worth remembering: the tests must fail independently. Five variants of the same placebo check share a blind spot - if a hidden bias lets a spurious effect slip past one, it slips past all five together, and the filter collapses to the power of a single test. The credibility you earn comes from the diversity of your tests, not their count. Five correlated robustness checks are worth about one.
What would change our mind: if, in a realistic setting, the false-discovery rate did not fall as independent tests accumulate - because the tests share a common bias that passes spurious effects together - then "it survived our robustness checks" would carry no information. That is the next thing to measure: introduce a shared confounder that correlates test outcomes, and watch how fast the filter degrades as that correlation rises.
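A first pass at that measurement can be sketched in closed form. The shared-bias model here is an assumption made for illustration, not something the original simulation specifies: with probability `rho` all of a claim's tests return one shared outcome, and with probability `1 - rho` they are drawn independently. The names `fdr_all_pass` and `rho` are hypothetical.

```python
def fdr_all_pass(rho, n_tests=5, base_rate_spurious=0.70,
                 p_real=0.85, p_spurious=0.45):
    """Share of spurious claims among those passing all n_tests.

    Assumed mixture model (illustrative only): with probability rho the
    tests share a single outcome; otherwise they are independent.
    """
    def pass_all(p):
        # Shared outcome: passing once means passing every test.
        # Independent outcomes: each of the n_tests must be passed separately.
        return rho * p + (1 - rho) * p**n_tests

    num = base_rate_spurious * pass_all(p_spurious)
    return num / (num + (1 - base_rate_spurious) * pass_all(p_real))


if __name__ == "__main__":
    for rho in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"rho={rho:.2f}  FDR among five-test survivors: {fdr_all_pass(rho):.1%}")
```

At `rho = 0` this reproduces the under-10% figure for independent tests. At `rho = 1` it returns the false-discovery rate of a single test, roughly 55%, which is the "five correlated checks are worth about one" case. The values in between depend entirely on the assumed mixture, so a realistic confounder model could degrade faster or slower.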
(This is a result from a deliberately minimal simulation, not field data. The claim is about the logic of corroboration - and the independence caveat is exactly where real-world practice most often fails it.)