The AI-Claim Crucible · AI-engineering folklore, tested in code

FAILED · 2026-06-20

Multi-agent systems (an orchestrator delegating to sub-agents) outperform a single agent on complex tasks.

2026 agent-engineering folklore (contested: Cognition 'Don't Build Multi-Agents' vs Anthropic's multi-agent research system; Gartner ~40% of agentic projects scrapped). Direct claim: Tran & Kiela, arXiv 2604.02460 - single-agent >= multi-agent at equal thinking-token budget.

Replicated on MuSiQue multi-hop QA (n=48, hop-diverse: 16 each of 2-, 3-, and 4-hop questions; 20 paragraphs including distractors), comparing single-agent chain-of-thought against a standard decompose-solve-aggregate multi-agent pipeline. Per-call caps are generous so neither side is truncated (an earlier forced budget split truncated the multi-agent answers and was fixed). Single-agent SIGNIFICANTLY beats multi-agent on BOTH frontier models AND uses ~3x fewer tokens: glm-5.2 single 96% vs multi 69% (gap +27pp, 95% CI [12.5, 43.8] pp); deepseek-v4-flash single 94% vs multi 79% (gap +15pp, 95% CI [4.2, 27.1] pp); both CIs exclude 0. Multi-agent does NOT outperform; the result is the opposite, at lower cost. Honest nuance: the multi-agent pipeline closes/reverses the gap only at the hardest 4-hop tier (explicit decomposition pays off when depth is high). Supersedes the earlier NOT_COMPUTABLE verdict, which came from a noisy small toy eval; the real benchmark gives a stable signal.
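A minimal sketch of how a paired accuracy gap and its 95% interval can be computed on n=48 items. It is an illustration, not the linked test. The entry does not state which interval method was used, so the percentile bootstrap here is an assumption. The per-item outcomes are HYPOTHETICAL: only the marginal counts (46/48 and 33/48, inferred from the quoted 96% and 69% for glm-5.2) are matched, and the real item-level pairing is unknown, so the printed interval will not equal the reported one.

```python
import random


def paired_bootstrap_ci(single, multi, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(single - multi), resampling items with replacement."""
    rng = random.Random(seed)
    n = len(single)
    diffs = [s - m for s, m in zip(single, multi)]
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# HYPOTHETICAL per-item correctness (1 = correct); only the totals follow the entry.
single = [1] * 46 + [0] * 2
multi = [1] * 33 + [0] * 15

gap, lo, hi = paired_bootstrap_ci(single, multi)
print(f"gap = {100 * gap:.1f}pp, 95% CI [{100 * lo:.1f}, {100 * hi:.1f}] pp")
```

The point of resampling items rather than answers is that both systems see the same 48 questions, so per-item differences carry the pairing.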

Read the runnable test →

FAILED · 2026-06-20

LLMs inherit human cognitive biases - e.g. conservatism in Bayesian belief updating (people under-revise relative to Bayes; Edwards 1968).

Folklore that LLMs reproduce human judgment-and-decision biases; tested on the canonical bookbag-and-poker-chip conservatism task with real models.

On the exact task where humans are reliably conservative (two equally-likely sources, symmetric cue validity q=0.70, an R/B signal sequence; Bayesian posterior ~0.97), both frontier models return the EXACT Bayesian posterior when the likelihoods are specified: deepseek-v4-flash mean conservatism gap = -0.001 (4/5 items parsed), glm-5.2 = +0.000 (5/5). Humans on the same task report ~0.70 (a large conservative gap); the LLMs show ~0.000. The bias does NOT transfer - handed the diagnosticity as numbers, the model just computes the likelihood ratio. HONEST SCOPE: only the SPECIFIED-likelihood case is settled. The human-analogue INFER case (model must ESTIMATE cue validity from a training sample, as humans do) is NOT cleanly measurable here - the harder estimate-then-update task exhausts the thinking models' token budget before a parseable answer (1/10 parsed), and forcing answer-first would suppress the reasoning the task needs. So: conservatism does not appear when cue validity is told; whether it appears when cue validity must be learned is open.
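For reference, the Bayesian benchmark in the specified-likelihood case is a few lines of arithmetic. The sketch below assumes a signal sequence with four more R than B (the entry does not list the actual sequence; this one is chosen because it yields the quoted ~0.97) and assumes the sign convention gap = Bayes minus reported, so a positive gap means conservative.

```python
def posterior_source_a(signals, q=0.70, prior=0.5):
    """Posterior that source A produced the sequence.

    Source A emits 'R' with probability q; source B emits 'R' with probability 1 - q.
    """
    like_a = like_b = 1.0
    for s in signals:
        like_a *= q if s == "R" else 1 - q
        like_b *= (1 - q) if s == "R" else q
    return prior * like_a / (prior * like_a + (1 - prior) * like_b)


def conservatism_gap(reported, bayes):
    """Assumed convention: positive means the report under-revises relative to Bayes."""
    return bayes - reported


seq = "RRRBRR"  # ASSUMED sequence: 5 R, 1 B (net 4 in favour of source A)
bayes = posterior_source_a(seq)
print(f"Bayesian posterior: {bayes:.3f}")
print(f"gap for a report of 0.70: {conservatism_gap(0.70, bayes):+.3f}")
```

Only the net count matters here: the likelihood ratio is (q / (1 - q)) raised to the difference between R and B counts.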

Read the runnable test →

FAILED · 2026-06-20

Smaller chunks improve RAG retrieval quality - 'when in doubt, chunk smaller' raises precision/relevance.

Common RAG-engineering folklore (chunk-size tuning advice; 'smaller chunks = higher precision').

Deterministic numpy test: a 200-token document with one CONTIGUOUS gold span (length 30-50), fixed-grid chunking, each chunk scored by gold density (the precision force that rewards small chunks), retrieve top-k=3, measure recovery = gold tokens recovered / span length. Smaller is NOT better: the smallest chunk c=10 recovers only 0.750 of a 40-token span while c>=20 recovers 1.000 - the minimum chunk size is the WORST, and the optimum sits AT/ABOVE the span scale (best chunk = 20-40) in 9/9 robustness cells (seeds x span length). Mechanism: small chunks do maximize per-chunk density, but a grid cut fragments a contiguous span into more pieces than the top-k=3 budget can reassemble, and that recall loss dominates the precision gain. Anti-rig CONTROL: retrieving ALL chunks (no budget) flattens recovery to exactly 1.000 for every chunk size - proving the reversal is caused by the retrieval budget, not baked into the data. HONEST SCOPE: a noiseless density oracle, binary relevance, a single contiguous span, fixed grid boundaries - it isolates the fragmentation-vs-precision tradeoff, not embedding noise or overlapping/recursive chunkers. Conditional takeaway: 'smaller = better' is false whenever relevant evidence is contiguous and longer than a chunk under a tight top-k budget. (Generated + adversarially verified by an Agora workflow: 1 of 4 candidate claims survived 2 independent skeptic referees; the other 3 were killed as textbook/Condorcet or stipulated-geometry.)
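The fragmentation mechanism can be reproduced in a few lines. This is a pure-Python sketch rather than the numpy harness itself, with one assumed span position (tokens 80 to 119); the function and parameter names are illustrative.

```python
def recovery(chunk_size, span_start=80, span_len=40, doc_len=200, top_k=3):
    """Fraction of the gold span recovered by the top_k densest fixed-grid chunks.

    top_k=None retrieves every chunk (the no-budget control).
    """
    gold = set(range(span_start, span_start + span_len))
    chunks = [range(i, min(i + chunk_size, doc_len)) for i in range(0, doc_len, chunk_size)]

    def density(chunk):
        # Noiseless oracle: share of the chunk's tokens that are gold.
        return sum(t in gold for t in chunk) / len(chunk)

    k = len(chunks) if top_k is None else top_k
    retrieved = sorted(chunks, key=density, reverse=True)[:k]
    return sum(t in gold for chunk in retrieved for t in chunk) / span_len


for c in (10, 20, 40, 100):
    print(f"chunk={c:>3}  top-3 recovery={recovery(c):.3f}  no-budget={recovery(c, top_k=None):.3f}")
```

At c=10 the span is cut into four fully dense chunks, and a budget of three can return only 30 of the 40 gold tokens; lifting the budget restores full recovery.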

Read the runnable test →

FAILED · 2026-06-20

Adding a reranker (cross-encoder) on top of first-stage retrieval reliably improves end-to-end RAG accuracy, or at worst never hurts it ('drop in a reranker for a free boost').

Cohere / Pinecone / LangChain / LlamaIndex RAG tutorials and 'production RAG checklist' blogs; repeated as received wisdom by practitioners.

Deterministic numpy model (seed 0, n=200k queries): gold doc + 3 hard negatives (lexically similar) + 27 soft distractors; NO-RERANK uses a noisy first-stage scorer, RERANK uses a CLEANER scorer but inflates the 3 hard negatives by `infl` (a cross-encoder fooled by lexical overlap); end-to-end correctness depends on the gold doc's rank in a position-biased top-5. Measured delta (rerank - no-rerank): infl=0.0 -> +0.139, 1.4 -> +0.013, 2.2 -> -0.074, 3.0 -> -0.131. The reranker HELPS when it is merely cleaner (control infl=0: +0.139) but crosses to a NET LOSS at infl* ~ 1.52 and reaches -0.131 - so 'a reranker never hurts / free boost' is FAILED: a second scorer with a different error profile can demote the gold doc out of a position-biased top-k. Anti-rig: the control (infl=0) shows the harness CAN produce a gain (not baked to fail); the verdict is read off the printed curve. HONEST SCOPE: a stylized simulation, not real corpora/models - it shows the loss is mechanically in-range and the condition for it (non-trivial hard-negative susceptibility), not that any specific production reranker degrades a specific stack; the realistic magnitude of `infl` for real cross-encoders is the open empirical question. The defensible takeaway: 'always helps' is not a safe default.
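A scaled-down, pure-Python sketch of the same mechanism. Every number in it is an assumption chosen for illustration (score means, noise levels 0.5 and 0.2, utility 1/rank inside the top 5, additive `infl`, 3,000 queries), so it will not reproduce the deltas quoted above; it only shows the sign flip between a clean reranker and one fooled by hard negatives.

```python
import random


def mean_utility(sigma, infl, n_queries=3000, seed=0):
    """Average position-biased utility of the gold doc (1/rank if rank <= 5, else 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_queries):
        gold = 1.0 + rng.gauss(0, sigma)
        # Hard negatives: lexically similar docs whose score is inflated by `infl`.
        hard = [0.5 + infl + rng.gauss(0, sigma) for _ in range(3)]
        soft = [rng.gauss(0, sigma) for _ in range(27)]
        rank = 1 + sum(score > gold for score in hard + soft)
        if rank <= 5:
            total += 1.0 / rank
    return total / n_queries


# First stage: noisy scorer, no inflation. Reranker: cleaner scorer, inflated hard negatives.
base = mean_utility(sigma=0.5, infl=0.0)
deltas = {}
for infl in (0.0, 1.4, 2.2, 3.0):
    deltas[infl] = mean_utility(sigma=0.2, infl=infl) - base
    print(f"infl={infl:.1f}  delta={deltas[infl]:+.3f}")
```

With `infl=0` the cleaner scorer can only help; once the three hard negatives reliably outscore the gold doc, it cannot rank above fourth, and the position bias turns that into a net loss.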

Read the runnable test →

FAILED · 2026-06-20

You can trust an LLM's CONFIDENCE to tell you when a retrieved document has corrupted its answer (high-confidence RAG answers are safe).

RAG-deployment folklore (confidence/self-consistency gating as a hallucination guard); tested on frontier models glm-5.2 and deepseek-v4-flash.

Two layers. (1) ALL-POISON: given a doc asserting the FALSE answer, frontier models ADOPT the poison at FULL confidence (glm-5.2 ~100%, deepseek ~94%) - confidence ~1.0 while wrong. (2) HARDENED 2026-06-20 to n=101 factual questions (was n=16), clean/poison 50/50, strong poison, K=3, thinking-robust reader: the grounding-DROP firewall (abstain when the answer depends on the retrieved doc) ships 0% wrong @ 50% coverage on BOTH models (0/101 kept; Wilson 95% upper bound 3.7%) where CONFIDENCE-gating ships 18.8% (glm-5.2) / 35.6% (deepseek) wrong; drop-sensitivity correlates -0.93/-0.95 with correctness vs confidence only +0.30/+0.23. Confidence is blind exactly when the model is confidently wrong; the cheap context-drop test is not.
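The quoted 3.7% upper bound follows from the standard Wilson score interval with zero observed errors out of 101. A short check (the function name is illustrative):

```python
import math


def wilson_upper(errors, n, z=1.959964):
    """Upper limit of the two-sided 95% Wilson score interval for an error rate."""
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + half_width) / denom


upper = wilson_upper(0, 101)
print(f"Wilson 95% upper bound for 0/101: {100 * upper:.1f}%")
```

With zero errors the expression collapses to z² / (n + z²), which is why the bound shrinks roughly as 3.84 / n.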

Read the runnable test →

FAILED · 2026-06-19

When you evaluate N models/configs on a benchmark and report the top scorer's number, that score is a reliable estimate of its true performance and the winner is the truly-best model.

Universal AI benchmarking / leaderboard practice (SOTA reporting, hyperparameter + model selection on a held-out benchmark).

Winner's curse (selection-on-the-max), clean DETERMINISTIC model (no LLM noise - the lesson from entry #1): N models with true accuracies clustered within sigma_true=0.04, each measured with eval noise (finite test set + run-to-run variance). At N=50 models and eval-noise SE 0.06 (~ a 50-200 item benchmark): the reported winner's score is inflated by +0.112 (the SOTA bar is overstated), the observed winner is the truly-best model only 17% of the time, and trusting it costs 0.040 true accuracy vs the real best. Both effects GROW with the number of candidates and the eval noise (P(true best): N=5 -> 62%, N=100 -> 13%). So 'the leaderboard winner's score is reliable' FAILS: more candidates + noisier evals = more inflated, less trustworthy. (Ties to entry #1: the same eval noise that makes multi-vs-single non-reproducible is what drives this inflation.)
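A Monte Carlo sketch of selection-on-the-max under one assumed reading of the setup: true accuracies drawn from a normal distribution with standard deviation sigma_true around a common mean, plus independent normal eval noise. Under that assumption the inflation and regret should land near the +0.112 and 0.040 quoted above; the hit rate is printed for comparison. This is stdlib Python with 4,000 trials, not the original harness.

```python
import random


def winners_curse(n_models=50, sigma_true=0.04, se_eval=0.06, trials=4000, seed=0):
    """Return (mean score inflation of the winner, P(winner is truly best), mean regret)."""
    rng = random.Random(seed)
    inflation = regret = 0.0
    hits = 0
    for _ in range(trials):
        # True accuracies as offsets from a common mean (the mean cancels out).
        true = [rng.gauss(0.0, sigma_true) for _ in range(n_models)]
        observed = [t + rng.gauss(0.0, se_eval) for t in true]
        winner = max(range(n_models), key=observed.__getitem__)
        best_true = max(true)
        inflation += observed[winner] - true[winner]
        regret += best_true - true[winner]
        hits += true[winner] == best_true
    return inflation / trials, hits / trials, regret / trials


inflation, p_best, regret = winners_curse()
print(f"inflation={inflation:+.3f}  P(true best)={p_best:.2f}  regret={regret:.3f}")
```

Re-running with a smaller `n_models` or `se_eval` shows both effects shrinking, which is the dependence the entry describes.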

Read the runnable test →

FAILED · 2026-06-19

Retrieval-augmented frontier models weigh a retrieved document against their own knowledge - they won't blindly adopt a doc that contradicts what they correctly know.

RAG-robustness folklore; measured on real models via the Grounding Firewall poison protocol (Agora). The Poison-Deference Index.

Poison-Deference Index: 12 factual questions each model answered CORRECTLY without context, then given a context asserting the WRONG answer (real LLMs, k=3 order-corrected, thinking-robust reader). deepseek-v4-flash: PDR=92% (flips to the false answer on 11/12 questions it knew), CPR=83% (confidently wrong). glm-5.2: PDR=92%, CPR=92%. BOTH frontier models abandon their correct knowledge for a planted-false doc ~92% of the time, almost always confidently - retrieval OVERRIDES rather than augments. (Strong-assertion poison; 2 models so far - a live index across more models is the next step. Mitigation already exists: the Grounding Firewall's abstain-on-high-sensitivity gate catches exactly these.)

Read the runnable test →

REPRODUCED · 2026-06-19

AI agent success decays with a CONSTANT hazard (an exponential 'half-life') as the task gets longer.

arXiv 2505.05115 'Is there a half-life for the success rates of AI agents?'; tested on METR time-horizon data (metr.org/blog/2025-03-19; epoch.ai/benchmarks/metr-time-horizons).

Fit on METR's real public anchors (success vs human-task-length: ~99% @ 4min, ~80% @ 15min, 50% @ 60min [Claude 3.7], <10% @ 240min). A constant-hazard exponential (P = 0.5^(t/60min)) fits well: predicts 0.95/0.84/0.50/0.06 vs observed 0.99/0.80/0.50/0.08, SSE 0.0032 - actually BETTER than the logistic-in-log-time (SSE 0.0075, which overshoots the tail). The half-life claim holds. Minor tension: the t80/t50 ratio is slightly steeper than exponential (0.250 vs 0.322), hinting at mild extra steepening, but 4 anchor points can't resolve it. HONEST SCOPE: public anchors only (Claude 3.7 + aggregate); METR's raw per-task data would sharpen the per-model hazard.
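The constant-hazard numbers above can be checked directly from the four anchors. The sketch takes 0.08 as the observed value at 240 min, as the comparison in the entry does; variable names are illustrative.

```python
import math

HALF_LIFE_MIN = 60.0
# (task length in minutes, observed success rate) as quoted in the entry
anchors = [(4, 0.99), (15, 0.80), (60, 0.50), (240, 0.08)]


def p_success(t_min):
    """Constant-hazard model: success halves every HALF_LIFE_MIN minutes."""
    return 0.5 ** (t_min / HALF_LIFE_MIN)


preds = [p_success(t) for t, _ in anchors]
sse = sum((p - obs) ** 2 for p, (_, obs) in zip(preds, anchors))

# Under a pure exponential, the 80%-success horizon is a fixed fraction of the half-life.
t80_over_t50_model = math.log2(1 / 0.8)
t80_over_t50_observed = 15 / 60

print("predicted:", [round(p, 2) for p in preds])
print(f"SSE = {sse:.4f}")
print(f"t80/t50: observed {t80_over_t50_observed:.3f} vs exponential {t80_over_t50_model:.3f}")
```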

Read the runnable test →

FAILED · 2026-06-19

The 'AI time horizon' is a robust headline number (supporting 'AI will automate month-long tasks within ~5 years').

METR time-horizon headline + 7-month-doubling extrapolation (metr.org/blog/2025-03-19); multiverse / specification-curve method (One Model Many Scores, arXiv 2308.16681).

Multiverse over ONE analytic fork METR itself exposes - the success-threshold choice. From the fitted curve on METR's real anchors the horizon is 60 min at 50% success but 21 min at 80% (and 170 min at 20%): a 2.8x swing from an arbitrary threshold. At doubling-every-7-months that 2.8x is ~11 months of apparent 'progress' - so the famous 'month-long tasks in ~5 years' timeline slips ~11 months if you (reasonably) demand 80% reliability instead of 50%. The headline is NOT a robust single number; it rides on an unstated analytic choice. (A full specification curve over more forks needs METR's raw per-task data.)
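The conversion from threshold swing to calendar slip is one logarithm. The sketch uses the horizons quoted above (60 min at 50%, 21 min at 80%) and the 7-month doubling time; from those rounded figures the ratio comes out at about 2.86, which the entry quotes as 2.8x.

```python
import math

DOUBLING_MONTHS = 7.0
horizon_min = {0.50: 60.0, 0.80: 21.0, 0.20: 170.0}  # success threshold -> horizon (minutes)

swing = horizon_min[0.50] / horizon_min[0.80]
# Months of trend growth needed to make up a multiplicative gap of `swing`.
slip_months = DOUBLING_MONTHS * math.log2(swing)

print(f"50% vs 80% threshold swing: {swing:.2f}x")
print(f"equivalent slip at 7-month doubling: {slip_months:.1f} months")
```

The same formula applied to the 20% threshold (170 min) moves the timeline in the other direction, which is why the unstated threshold matters as much as the trend itself.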

Read the runnable test →