The Crucible

Claims, rebuilt in code and tested.

A public ledger of scientific and technical claims rebuilt as minimal computational models and run. Three honest verdicts — reproduced when the mechanism holds in its smallest model, failed when it doesn’t (science’s rarest export, published here), and not computable when there’s no simulable core. Every verdict ships a measured number and the code that produced it.

31 Reproduced · 6 Failed · 4 Honest passes
Open data — the full ledger as JSON · now a citable dataset on Hugging Face →
import datasets
ledger = datasets.load_dataset("Danchi17/folklore-index")
Failed replication Cognitive science · Gilovich, Vallone & Tversky · 1985

The hot hand is not a fallacy

The canon

For 30 years the canonical finding was that streak shooting is a cognitive illusion: players who feel “hot” are misreading randomness.

What we rebuilt

We rebuilt their exact estimator — P(hit | 3 prior hits) − P(hit | 3 prior misses), averaged per shooting record — and ran it on i.i.d. shooters with provably no hot hand.

The verdict

On a no-hot-hand shooter the estimator does not read 0; it reads −7.9pp (t=−28 at n=100), and the bias grows to −17pp at streak length 4. This is the streak-selection bias of Miller & Sanjurjo (2018): selecting shots that follow a streak inside a finite sequence is a biased sample. A measurement of ~0 with the Gilovich, Vallone & Tversky (GVT) estimator therefore implies a real hot hand of roughly +8.5pp. The famous fallacy was the analysis, not the players.

The deep dive, with charts → lab ebce40 · 2026-06-12
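The streak-selection bias in this card can be checked with a few lines of simulation. The block below is a minimal sketch, not the ledger's own code: the function names, the hit probability p=0.5, the shooter count, and the seed are assumptions chosen for illustration. It computes the per-record difference P(hit | 3 prior hits) − P(hit | 3 prior misses) on i.i.d. shooters and averages it over the records where both conditional rates are defined.

```python
import numpy as np

def gvt_difference(shots, k=3):
    """P(hit | k prior hits) - P(hit | k prior misses) for one shooting record.

    Returns NaN when the record has no shot following k hits or no shot
    following k misses, so such records drop out of the average.
    """
    shots = np.asarray(shots, dtype=int)
    # windows[i] holds shots i..i+k-1; the shot it conditions is shots[i+k].
    windows = np.lib.stride_tricks.sliding_window_view(shots[:-1], k)
    next_shot = shots[k:]
    streak_sum = windows.sum(axis=1)
    after_hits = next_shot[streak_sum == k]
    after_misses = next_shot[streak_sum == 0]
    if after_hits.size == 0 or after_misses.size == 0:
        return np.nan
    return after_hits.mean() - after_misses.mean()

def mean_gvt_difference(n_shooters=5000, n_shots=100, p=0.5, k=3, seed=0):
    """Average the per-record estimator over i.i.d. shooters (no hot hand)."""
    rng = np.random.default_rng(seed)
    records = rng.random((n_shooters, n_shots)) < p  # assumed hit rate p
    return np.nanmean([gvt_difference(r, k) for r in records])

print(f"mean estimate on i.i.d. shooters: {mean_gvt_difference():+.3f}")
```

Under these assumptions the average should come out clearly negative, in the neighborhood of the figure quoted in the verdict, even though every simulated shooter has a true difference of exactly zero.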

The ledger

37 claims rebuilt & tested · newest first
FAILED

Real-world networks are scale-free: their degree distributions follow a power law p(k) ~ k

Under a rigorous Clauset-Shalizi-Newman fit (MLE alpha, KS-selected xmin, bootstrap goodness-of-fit, Vuong likelihood-ratio vs lognormal, n=20,000): a lognormal that 'looks scale-free' on a log-log plot is correctly REFUSED (power-law GOF p=0.01; LR favors lognormal, -17.5); a genuine Pareto power law passes (GOF p=0.92; LR +102); and a real Barabasi-Albert network is only a TIE (LR -0.1) - power law is not even clearly preferred over lognormal for true preferential attachment. So 'looks scale-free' is not 'is scale-free' and the universal claim is not safely inferable (Broido-Clauset reproduces). Power law IS reproduced for true BA graphs.

Barabasi-Albert (1999) framing, widely repeate
FAILED

Emergent abilities of large language models are genuine, sharp capability transitions - a

The canonical SHARP 'emergence' curve is reproduced by a SMOOTH, continuous per-token skill measured with a nonlinear exact-match metric: the same smooth skill gives a per-token transition width 7.0 vs an exact-match (L=100) width 1.05 - 6.7x sharper PURELY from the metric - and the apparent onset shifts -0.07 -> +5.58 as answer length grows with NO change in the underlying skill. No capability discontinuity is needed, so benchmark sharpness is not evidence for one (Schaeffer's 'mirage' reproduces). Scope: shows sharpness is metric-dependent; it does not prove that no ability is ever genuinely emergent.

Wei et al. 2022 'Emergent Abilities of La
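The metric effect in this card can be illustrated with a toy curve. This is a sketch under stated assumptions, not the ledger's model: per-token accuracy is taken to be a logistic function of a latent skill, exact match over L tokens is taken to be that accuracy raised to the power L (independent tokens), the onset is defined as the skill where the metric crosses 0.5, and the width as the 10%-to-90% range. Because these definitions are assumptions, the printed values will not match the card's numbers; the point is only that the same smooth skill yields a later, steeper curve as L grows.

```python
import numpy as np

def per_token_accuracy(skill):
    """Smooth, continuous per-token skill (logistic link is an assumption)."""
    return 1.0 / (1.0 + np.exp(-skill))

def exact_match(skill, length):
    """Probability that all `length` tokens are right, assuming independence."""
    return per_token_accuracy(skill) ** length

def crossing(skill, curve, level):
    """First grid point at which a non-decreasing curve reaches `level`."""
    return skill[np.searchsorted(curve, level)]

skill = np.linspace(-10, 15, 25001)
results = {}
for length in (1, 10, 100):
    curve = exact_match(skill, length)
    onset = crossing(skill, curve, 0.5)
    width = crossing(skill, curve, 0.9) - crossing(skill, curve, 0.1)
    results[length] = (onset, width)
    print(f"L={length:3d}  onset={onset:+.2f}  10-90% width={width:.2f}")
```

The underlying skill curve never changes across the three rows; only the answer length fed to the metric does.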
REPRODUCED

A Pareto regret that reflects Pareto optimality without relying on scalarization functions

Smallest model: 2-objective bandit, known Pareto front (4 optimal + 2 dominated arms). Pareto-UCB1 with the scalarization-free Pareto suboptimality gap achieved sublinear cumulative Pareto regret (avg/step 0.027->0.005, growth factor 1.82 o

Pareto Regret Analyses in Multi-objective Mult
REPRODUCED

The epidemic threshold disappears for scale-free / preferential-attachment networks with d

Mechanism reproduced: SIS threshold lambda_c=<k>/<k^2> falls with N on BA/scale-free networks (0.112->0.062 as N 500->32000) because <k^2> grows with the hub/cutoff (35.5->64.9), while ER stays finite (~0.20, <k^2>~20). SIS dynamics confirm

Pastor-Satorras & Vespignani (2001); cited
FAILED Network economics

Metcalfe's Law (network value ∝ n²) is an equal-value artifact

The n² scaling holds only if every pairwise connection has equal value. Under realistic rank-declining connection value the network's value scales as ~n·log n (fitted exponent 1.12, vs 2.00 for equal value; 1.33 for slow decay) — the famous quadratic over-states large-network value, matching Briscoe-Odlyzko-Tilly (2006) and Metcalfe's own 2013 data-fit. A mechanism derivation, not a fit to proprietary network data.

Metcalfe's Law · contested (Briscoe–Odlyzko–Tilly 2006)
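The difference between equal-value and rank-declining connections can be sketched numerically. The block below assumes, purely for illustration, that each member values their j-th ranked connection at j^(−decay): decay=0 is the equal-value case and decay=1 is a Zipf-style decline in the spirit of Briscoe-Odlyzko-Tilly. The function names and the range of network sizes are hypothetical, and the fitted exponents depend on that range, so they need not equal the card's values.

```python
import numpy as np

def network_value(n, decay):
    """Total value when each of n members values their j-th connection at j**-decay."""
    ranks = np.arange(1, n, dtype=float)  # each member has n-1 possible connections
    return n * np.sum(ranks ** -decay)

def fitted_exponent(decay, sizes):
    """Slope of log(value) against log(n), i.e. the apparent scaling exponent."""
    values = [network_value(int(n), decay) for n in sizes]
    slope, _ = np.polyfit(np.log(sizes), np.log(values), 1)
    return slope

sizes = np.array([1_000, 3_000, 10_000, 30_000, 100_000])
equal_value = fitted_exponent(0.0, sizes)
zipf_value = fitted_exponent(1.0, sizes)
print(f"equal-value exponent:   {equal_value:.2f}")
print(f"rank-declining (1/j):   {zipf_value:.2f}")
```

With decay=1 the total is n times a harmonic sum, which grows like n·log n, so the fitted exponent sits only slightly above 1 rather than at 2.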
FAILED Collective intelligence

…but "diversity trumps ability" is not a general law

Stress-test of the same claim in a DIFFERENT, faithful problem model (NK landscapes; paired and statistically powered): the best-ability group matches or slightly beats the random/diverse group at every difficulty level. So the Hong-Page advantage is real at the original parameters (companion card) but does NOT generalize — condition-specific, not a universal law, consistent with Thompson (2014). The effect is small; this refutes the universal claim, not the value of diversity in specific regimes.

Hong & Page · 2004 (contested, Thompson 2014)
REPRODUCED

The survival probability of a branching process obeys finite-size scaling in the control p

Smallest Galton-Watson model (Poisson offspring), vectorized over 40k realizations. Critical eps=0: n*P_n -> ~2 (Kolmogorov 2/sigma^2, sigma^2=1). Scaling collapse confirmed: n*P_n is a function of x=eps*n alone - within-x spread 0.116 vs a

Garcia-Millan, Font-Clos et al. 2015, "Fi
REPRODUCED

In the Erdos-Renyi graph G(N,p), the k-clique percolation transition occurs at p_c(k) = 1/

Smallest CPM model: enumerate k-cliques in G(N,p), union any sharing a (k-1)-clique, track largest community fraction R(p)=vertices_in_largest/N; empirical transition = p where R crosses half its max. Measured/formula ratio is near-CONSTANT

Derenyi, Palla & Vicsek 2005, 'Clique
REPRODUCED

The epidemic threshold of a network disappears (goes to 0 as N->infinity) when the degree-

Smallest model: HMF threshold lambda_c=<k>/<k^2> from sampled power-law degree sequences vs N. Measured lambda_c at N=1e3..1e6: gamma=2.3 shrinks x28.7, gamma=2.7 x3.6 (power-law vanishing); gamma=3.0 is MARGINAL - vanishes only logarithmic

Jones & Handcock 2003, 'An assessment
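The moment ratio lambda_c=<k>/<k^2> in this card is simple to estimate from sampled degree sequences. The sketch below is an illustration under assumptions that are not taken from the ledger: degrees are drawn by inverse-CDF sampling from a continuous power law and floored, the minimum degree is 2, degrees are capped at N−1, and the median over 20 replicas is reported to tame the heavy-tailed noise in <k^2>. All function names are hypothetical.

```python
import numpy as np

def sample_degrees(n, gamma, k_min=2, rng=None):
    """Draw n degrees with tail exponent gamma (floored continuous power law)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(n)  # uniform on [0, 1), so 1 - u is never zero
    k = np.floor(k_min * (1.0 - u) ** (-1.0 / (gamma - 1.0)))
    return np.minimum(k, n - 1)  # a node cannot have more than n-1 neighbors

def hmf_threshold(degrees):
    """Heterogeneous mean-field SIS threshold <k> / <k^2>."""
    return degrees.mean() / np.mean(degrees ** 2)

def median_threshold(n, gamma, replicas=20, seed=0):
    """Median threshold over independent degree sequences of size n."""
    rng = np.random.default_rng(seed)
    return np.median([hmf_threshold(sample_degrees(n, gamma, rng=rng))
                      for _ in range(replicas)])

for gamma in (2.3, 2.7, 3.0):
    row = [median_threshold(n, gamma) for n in (10**3, 10**4, 10**5)]
    print(f"gamma={gamma}: " + "  ".join(f"{t:.4f}" for t in row))
```

The thresholds should fall fastest with N for the smallest gamma, because <k^2> is then dominated by the largest hubs, which keep growing with network size.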
REPRODUCED

Whether a mouse geroprotector is recorded as extending lifespan can depend on the survival

Self-contained sim (weighted log-rank family, n=50/arm, 4000 trials). Age-localized true effect: log-rank power 32.5% vs Gehan 72.3%; the two tests give DISCORDANT verdicts on 39.9% of identical datasets. Under the null, best-of-3 tests inf

deep-research 2026-06-16 + Jiang et al., GeroS
REPRODUCED

For BA networks (N=2,000, m=2), removing the top 10% of nodes by degree raises the bond-pe

Finite-size susceptibility-peak MC on the actual BA network reproduces it: intact p_c 0.170 (claimed 0.174), hub-removed 0.740 (claimed 0.776), both within ~5% over 6 replicas. Methodology lesson: the Molloy-Reed configuration-model estimat

Cachero Sanchez (2026), Simultaneous Degradati
REPRODUCED

Recursively synthesized, self-referential systems improve performance (e.g. Promptbreeder,

Smallest model (lab a08981): a population recursively synthesizes candidates FROM ITSELF each generation; the pivotal variable is the SELECTION signal. With an EXTERNAL fitness anchor (as Promptbreeder has: real task accuracy), recursive sy

Promptbreeder (Fernando, Banarse, Michalewski,
REPRODUCED

Finite-size scaling exists for the survival probability of a branching process as a functi

Smallest model: exact survival curve of a Poisson(m) Galton-Watson process via PGF iteration q_n=exp(m(q_{n-1}-1)), S_n=1-q_n (zero Monte-Carlo noise). Three FSS signatures all confirmed: (1) exact critical amplitude n*S_n -> 2 = 2/sigma^2

Garcia-Millan, Font-Clos & Corral (2015),
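The PGF iteration quoted in this card, q_n=exp(m(q_{n-1}-1)) with S_n=1-q_n, fits in a few lines. The sketch below follows that recursion directly, starting from q_0=0 (one individual alive at generation 0); the function name and the particular values of n and m are illustrative choices, not the ledger's settings.

```python
import math

def survival_probability(m, n):
    """Exact P(alive at generation n) for a Poisson(m) Galton-Watson process."""
    q = 0.0  # extinction probability by generation 0, starting from one individual
    for _ in range(n):
        q = math.exp(m * (q - 1.0))  # PGF of Poisson(m) evaluated at q
    return 1.0 - q

# Critical case m = 1: n * S_n should approach 2 / sigma^2 = 2.
critical_amplitude = 10_000 * survival_probability(1.0, 10_000)
print(f"critical n*S_n at n=10000: {critical_amplitude:.4f}")

# Scaling check: hold x = (m - 1) * n fixed and vary n.
x = 1.0
collapse = {n: n * survival_probability(1.0 + x / n, n) for n in (1_000, 10_000)}
for n, value in collapse.items():
    print(f"x={x}, n={n}: n*S_n = {value:.4f}")
```

If finite-size scaling holds, the two values at fixed x should nearly coincide even though m and n both differ by a factor of ten in their distance from criticality and system size.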
REPRODUCED

In DiD with FEW treated units and spatially/serially-correlated errors, standard DiD infer

N=30, T=12, 1 treated unit, rho=0.7, true effect=0, 800 reps: DiD 95% CI coverage=0.305 (severe under-coverage vs nominal 0.95); SC coverage=0.891; SC RMSE=1.017 < DiD RMSE=1.267. DiD inference is invalid here; SC is materially better.

Alvarez & Ferman (2020)
REPRODUCED

p-hacking inflates Type I error in the error-statistical (Neyman-Pearson) approach but not

Under a true null (N=5000, n=30, K=5 forks): NP p-hacking (report min p) inflates Type I error 0.051->0.227 (+17.6pp, 4.45x, matches 1-.95^5=0.226). Formal/likelihood: a selection-ACCOUNTED likelihood (correct best-of-K sampling distributio

Rubin (2026)
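The best-of-K inflation in this card can be reproduced in outline. The sketch below makes two simplifying assumptions that are not confirmed by the ledger: the K forks are treated as independent samples, and each fork is a two-sided z-test with known unit variance rather than whatever test the original run used. The function names and seed are hypothetical. Under independence the min-p rejection rate should sit near 1 − 0.95^5.

```python
import math
import numpy as np

def two_sided_p(z):
    """Two-sided p-value of a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def type_i_rates(n_experiments=5000, n=30, forks=5, alpha=0.05, seed=0):
    """Return (honest, p-hacked) false-positive rates under a true null."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_experiments, forks, n))  # true effect is zero
    z = data.mean(axis=2) * math.sqrt(n)                   # known-variance z statistic
    p = np.vectorize(two_sided_p)(z)
    honest = np.mean(p[:, 0] < alpha)        # report the single pre-specified fork
    hacked = np.mean(p.min(axis=1) < alpha)  # report the smallest p of the K forks
    return honest, hacked

honest_rate, hacked_rate = type_i_rates()
print(f"honest Type I error:   {honest_rate:.3f}")
print(f"p-hacked Type I error: {hacked_rate:.3f}")
print(f"1 - 0.95**5:           {1 - 0.95**5:.3f}")
```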
REPRODUCED

Removing top-10% degree nodes from a BA network (N=2000,m=2) raises the bond-percolation t

Mechanism+direction reproduced: hub removal collapses <k^2> 50.8->3.8, p_c jumps ~8x (0.085->0.687 via Cohen mean-field). After-value within 12% of claim; before differs 2x (mean-field vs direct simulation). Robust-yet-fragile confirmed.

Simultaneous Degradation of Percolation and Ca
REPRODUCED

Derenyi-Palla-Vicsek (2005): k-clique percolation in ER graphs at p_c(k)=[(k-1)N]^(-1/(k-1

k=3 scaling exponent confirmed: empirical p_c*sqrt(2N) constant across N=400/800/1600 (1.26,1.26,1.19). N^(-1/2) scaling reproduced; prefactor ~1.2x asymptotic formula due to finite-size + 50%-coverage operational threshold.

Clique Percolation in Random Networks (Derenyi
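As a worked check of the k=3 numbers in this card, the formula p_c(k)=[(k-1)N]^(-1/(k-1)) can be evaluated at the three network sizes. This is plain arithmetic, not a percolation simulation; the function name is hypothetical, and the "implied empirical" column simply multiplies the formula by the measured p_c*sqrt(2N) values the card reports (1.26, 1.26, 1.19).

```python
import math

def clique_percolation_threshold(k, n):
    """Critical edge probability p_c(k) = [(k-1) N]^(-1/(k-1)) for G(N, p)."""
    return ((k - 1) * n) ** (-1.0 / (k - 1))

# Measured p_c * sqrt(2N) values quoted in the card for k = 3.
reported_ratio = {400: 1.26, 800: 1.26, 1600: 1.19}

for n, ratio in reported_ratio.items():
    formula = clique_percolation_threshold(3, n)
    # For k = 3 the formula gives p_c * sqrt(2N) = 1 exactly.
    print(f"N={n:5d}  formula p_c={formula:.4f}  "
          f"formula*sqrt(2N)={formula * math.sqrt(2 * n):.2f}  "
          f"implied empirical p_c={ratio * formula:.4f}")
```

For k=2 the same expression reduces to 1/N, the familiar giant-component threshold of the Erdos-Renyi graph, which is a useful sanity check on the formula.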
REPRODUCED

Berger (2003): conditional-frequentist testing reconciles Fisher/Neyman/Jeffreys - the con

Calibration exact: empirical freq(H0 true | evidence) = Bayesian P(H0|x) across all bins. Berger-Sellke: p=0.05 -> P(H0|x)=0.216, p=0.005 -> 0.041 (p-value overstates evidence ~4x at 0.05).

Berger (2003), Could Fisher, Jeffreys and Neym
REPRODUCED

Systems in the same universality class share the same critical exponents (Lubeck 2004)

Three structurally different Z2 mean-field models (tanh self-consistency, phi^4 free energy, arctan self-consistency) all give order-parameter exponent beta=0.500; a different-class absorbing-state model gives beta=1.000. Same class -> same

Universal Scaling Behavior of Non-Equilibrium
REPRODUCED

LinUCB (Chu et al 2011): linear contextual bandit achieves O(sqrt(Td log^3)) i.e. sublinea

Empirical regret growth exponent 0.03-0.11 (sub-sqrt(T), well inside the O(sqrt(T)) upper bound); cum regret 3-8 vs linear non-learning regret 1400-2700; d-scaling ~sqrt(d) to d. Computable core holds.

Contextual Bandits with Linear Payoff Function
REPRODUCED

BA(N=2000,m=2): removing top-10% nodes by degree raises bond-percolation threshold p_c 0.1

MC before/after = 0.150/0.680 (ratio 4.53x) vs claimed ratio 4.46x; absolute values within finite-size tolerance. Cascade claim at phi=0.22 out of scope.

Simultaneous Degradation of Percolation and Ca
FAILED Cognitive science

The Dunning–Kruger plot draws itself from pure noise

A null with NO metacognitive deficit reproduces the famous quartile plot and its asymmetry — bottom +45.8 (DK: +46), top −14.2 (DK: −13) — from regression to the mean plus a uniform better-than-average bias. The gaps are predictions, not fits.

Kruger & Dunning · 1999
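A null of this kind can be sketched in a few lines. The model below is an illustration, not the ledger's code: self-assessment is a noisy copy of the test score with the same noise for everyone (so no skill-dependent metacognitive deficit), and a uniform better-than-average shift is added to every perceived percentile. The correlation rho=0.3 and the shift of 16 percentile points are hypothetical parameters, so the printed gaps will not match the card's figures exactly.

```python
import numpy as np

def percentile_rank(x):
    """Map values to percentile ranks in [0, 100]."""
    ranks = np.argsort(np.argsort(x))
    return 100.0 * ranks / (len(x) - 1)

def quartile_gaps(n=100_000, rho=0.3, bias=16.0, seed=0):
    """Mean (perceived - actual) percentile in each actual-score quartile."""
    rng = np.random.default_rng(seed)
    score = rng.standard_normal(n)
    # Same noise level for every participant: no extra deficit for low scorers.
    self_view = rho * score + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    actual = percentile_rank(score)
    # Uniform better-than-average shift, clipped to the valid percentile range.
    perceived = np.clip(percentile_rank(self_view) + bias, 0.0, 100.0)
    quartile = np.minimum((actual // 25).astype(int), 3)
    return [float(np.mean(perceived[quartile == q] - actual[quartile == q]))
            for q in range(4)]

gaps = quartile_gaps()
for label, gap in zip(("bottom", "second", "third", "top"), gaps):
    print(f"{label:>6} quartile: perceived - actual = {gap:+.1f}")
```

Regression to the mean pulls both ends toward the middle, and the uniform shift then lifts every quartile, which is what makes the bottom-quartile overestimate larger than the top-quartile underestimate.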
REPRODUCED Linguistics / statistics

A monkey at a typewriter really does produce Zipf's law

Random typing yields exponent −1.24; even a natural fine-structure discriminator fails at matched corpus size (1.8× < 2× bar). The deflationary inference survives a severe test.

Miller · 1957
REPRODUCED Finance

Thirty stocks diversify you — until the tails get heavy

N=30 captures ~96% of achievable risk reduction at realistic tails — but only 85% of tail-risk reduction near infinite-variance tails, where ~100 stocks are needed.

Evans & Archer · 1968
REPRODUCED Collective intelligence

Diversity beats ability — but only at Hong & Page's exact parameters

Reproduced in the authors' own relay model: a random team beats the best-individuals team by +1.65 (t=4.1) at the paper's parameters — but the edge is fragile, reversing to −0.38 on a smoother landscape. See the companion FAILED card: the advantage does not survive a different, powered problem model.

Hong & Page · 2004
REPRODUCED Optimization

SGD finds a real minimum of a non-convex loss

On a double-well non-convex objective, SGD localizes to a true minimum at the predicted convergence rate.

Fehrman, Gess & Jentzen
REPRODUCED Statistical physics / neuroscience

Criticality leaves a power-law fingerprint

At the critical point, fluctuations show power-law scaling and long-range correlations — the signature reproduces cleanly.

Kitzbichler, Smith, Rahn · 2009
REPRODUCED Reinforcement learning

Sample-efficient policies need far fewer parameters

In the low-data regime, online policies match performance with 2–10× fewer total coefficients.

Biologically-inspired architectures
REPRODUCED Network science / epidemiology

In scale-free networks, the epidemic threshold vanishes

As N grows, λ_c = ⟨k⟩/⟨k²⟩ → 0: a hub-rich network has effectively no herd-immunity threshold.

Pastor-Satorras & Vespignani
REPRODUCED Contextual bandits

LinUCB stays well inside its regret bound

Measured regret tracks O(√Td·polylog) and sits far below linear-regret baselines.

Chu, Li, Reyzin & Schapire · 2011
REPRODUCED Optimization

SGD's slow convergence is a variance floor

Constant-step SGD stalls at a noise floor; variance-reduced methods keep converging — the slowdown is variance, not curvature.

Johnson & Zhang · 2013
REPRODUCED Stochastic processes

Branching survival collapses onto one curve

Survival probability across sizes collapses onto a single curve in the scaling variable (m−1)·n.

Finite-size scaling
REPRODUCED Network science

Pulling the hubs shatters the network

Removing the top 10% of nodes by degree multiplies the bond-percolation threshold several-fold (BA, N=2000).

Percolation & cascade robustness
REPRODUCED Economics of technology

IT pays off only with organizational change

Organizational investment, not IT spend alone, drives the return — omitting it inflates the estimated payoff up to 2×.

Brynjolfsson & Hitt · 2000
REPRODUCED Machine learning / NLP

A simple model beats the fancy one on KB completion

An observed-features model matches or beats latent-feature models on knowledge-base completion.

Toutanova & Chen · 2015
REPRODUCED Non-equilibrium physics

Heavy-tailed waiting times bend the phase transition

Non-Markovian spreading with t^(−1−μ) waiting times produces a genuine non-equilibrium phase transition.

Barato & Hinrichsen · 2009

Honest passes

no simulable core — on the record anyway

Not every claim has a computable mechanism. When a claim is descriptive rather than mechanistic, the honest verdict is not computable — recorded, not quietly dropped.

How a verdict is earned

01

Model before verdict

The smallest model of the claim’s stated mechanism is built first, scoped to that mechanism — not reverse-engineered to a desired answer.

02

A number, not a vibe

Every verdict is a measured quantity with a direction that could refute it — an effect size, a threshold, an exponent, a bias.

03

Re-runnable

The code ships with the verdict. A "reproduced" verdict means the minimal mechanism computes; it does not certify the original paper beyond doubt.

04

Failures are the point

A "failed" verdict means the stated mechanism didn’t survive its smallest honest model, with the discrepancy measured. Those verdicts stay published.

Have a claim that deserves the bench?

Send a quantitative, mechanism-bearing claim — a number, a threshold, an exponent, a rate. If it has a simulable core, it gets a model and a public verdict, whichever way it lands.

Submit a claim →