Field notes that ship a number.

June 19, 2026 · 2 min · EN · SK

We built a meter for when an AI is confidently wrong - and confidence can't see it

The problem. A confidence score cannot distinguish two failures that look identical from the outside: a model that ignores a correct document, and a model that confidently swallows a wrong or poisoned... Grounding Meter measures this directly, and predicts confidently-wrong answers (r=-0.93) where confidence barely does (0.15-0.36).

June 19, 2026 · 2 min · EN · SK

We built a firewall for AI confidently-wrong answers - and it catches what confidence cannot

The problem. A model is most confident exactly when it is wrong for the right-looking reason. When a retrieved document states a plausible-but-false answer - a poisoned context: stale data, an injecte... Grounding Firewall abstains on answers that hinge on the retrieved document, catching poisoned-context errors that confidence is blind to (AUC 0.028 vs 0.095).

Robustness checks aren't ritual - they're a measurable filter (if the tests are independent)

In causal inference you can never prove an effect is real. You can only subject it to severe tests - placebo-in-time, placebo-in-space, leave-one-out, pre-trend checks - and trust an estimate a little...

June 18, 2026 · 2 min · EN

Why a more capable AI can be more confidently wrong

Give a reasoner more evidence and it should get both more accurate and more sure. On independent evidence, it does. But real evidence is rarely independent - sources copy each other, datasets overlap, ...

June 18, 2026 · 2 min · EN

June 18, 2026 · 3 min · EN · SK

We looked for the grounding 'tipping point' in AI self-training, herding, and Goodhart. It isn't there.

A popular story says systems that lose touch with reality fail at a tipping point: an AI that trains on its own output collapses past a threshold; a crowd that watches itself flips into a bubble; a me... A strict test of that story for AI self-training, herding, and metric-gaming: across four minimal models with matched positive and negative controls, none shows a critical transition; each degrades smoothly. Grounding acts as a symmetry-breaking field that rounds...

June 18, 2026 · 4 min · EN · SK

We hunted for the tipping point in 8 systems. Only one is a true critical cliff - and 'model collapse' isn't it.

Last time we looked for the "tipping point" in four systems people say have one - AI self-training, herding crowds, metric-gaming, misspecified inference - and found none: each degrades smoothly. But ... Part 2 of the tipping-point hunt: across 8 mechanisms, a true cliff appears only at structural extremes (self-reinforcement, hard/discrete rules, contagious coupling, or exact zero symmetry), and any smooth grounding rounds it into a ramp. Only the zero field is true criticality; collap...

June 17, 2026 · 3 min · EN · SK

The most confident systems are the least grounded

Three failures look unrelated. An AI model trained on its own output degrades into nonsense ("model collapse"). Seventy expert teams handed the same brain-imaging dataset reach different conclusions; ... One law behind model collapse, the replication crisis, and market lock-in: certainty built from internal consistency decouples from truth as external grounding falls. Measured in simulation and compared with many-analysts studies.

A pre-trend too gentle to see can bias a difference-in-differences estimate by ~77% - and the standard test usually misses it

Difference-in-differences (DiD) is one of the most-used causal designs in economics, policy, and product analytics. Its credibility rests on one assumption: parallel trends - that, absent treatment, t...

June 16, 2026 · 2 min · EN

The calibrated prior for 'we reversed aging in mice': near zero - and here's the arithmetic

Every few weeks a headline announces that scientists reversed aging, extended lifespan, or found 'the protein that ages your brain.' Before believing any single one, it helps to know the base rate. He...

June 16, 2026 · 2 min · EN

I scored the 16 most-hyped anti-aging interventions. Zero have a proven human benefit.

Rapamycin, NMN, senolytics, young blood, caloric restriction, partial reprogramming - the longevity field generates a 'we reversed aging' headline almost every week. So I built a scorecard: the 16 fla...

June 16, 2026 · 2 min · EN

The hot-hand "fallacy" was the fallacy: a famous null is a measurement artifact

The claim. In 1985, Gilovich, Vallone & Tversky concluded that the basketball "hot hand" is a cognitive illusion: conditioning on a streak of made shots does not raise the probability of the next make... That famous result is an artifact of its own method: the estimator returns -7.9 pp even for a shooter with no hot hand. Measured, with the model and a falsifier.
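
The selection effect behind that artifact can be checked by exact enumeration. The sketch below assumes the estimator is the per-sequence share of hits among shots that immediately follow a streak of hits; the function name and the tiny sequence length are illustrative assumptions, and it does not reproduce the -7.9 pp figure, which depends on the post's own settings.

```python
from fractions import Fraction
from itertools import product


def mean_hit_rate_after_streak(n_shots, streak_len, p_hit=Fraction(1, 2)):
    """Expected per-sequence hit rate on shots that follow `streak_len` hits.

    The shooter has a constant hit probability (no hot hand), so an unbiased
    estimator would return exactly `p_hit`.
    """
    total = Fraction(0)
    weight = Fraction(0)
    for seq in product((0, 1), repeat=n_shots):
        # Outcomes of every shot that comes right after a streak of hits.
        follow = [seq[i] for i in range(streak_len, n_shots)
                  if all(seq[i - streak_len:i])]
        if not follow:
            continue  # no qualifying shot: the estimator is undefined here
        hits = sum(seq)
        prob = p_hit ** hits * (1 - p_hit) ** (n_shots - hits)
        total += prob * Fraction(sum(follow), len(follow))
        weight += prob
    return total / weight  # average over sequences where it is defined


print(mean_hit_rate_after_streak(3, 1))  # 5/12
```

For three shots and streaks of length one the result is 5/12, below the true 1/2, even though every shot is an independent coin flip.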

Dunning-Kruger is (mostly) a statistical artifact: a zero-deficit null reproduces the famous plot

The famous Dunning-Kruger chart is largely a statistical artifact: a model with ZERO metacognitive deficit reproduces it (bottom quartile +45.8pp). Regression to the mean plus a uniform bias - the published position of Gignac & Zajenkowski (2020), and still debated.
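
A minimal sketch of such a zero-deficit null, assuming a Gaussian latent skill, equal noise on the test score and on the self-view for everyone, and a uniform +10 pp optimism bias. These parameters and the function names are illustrative assumptions, not the post's model, so the gap sizes will not match the +45.8pp quoted above.

```python
import random
from statistics import mean


def percentile_ranks(values):
    """Map each value to its percentile rank in (0, 100)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for pos, idx in enumerate(order):
        ranks[idx] = 100.0 * (pos + 0.5) / len(values)
    return ranks


def simulate_null(n=20000, noise=1.0, bias_pp=10.0, seed=0):
    """Self-assessed minus actual percentile, per actual-score quartile."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    # Test score and self-view are equally noisy readings of the same skill:
    # nobody is worse at self-assessment than anybody else.
    test = [s + rng.gauss(0, noise) for s in skill]
    perceived = [s + rng.gauss(0, noise) for s in skill]
    actual_pct = percentile_ranks(test)
    # Uniform optimism bias, clamped to the 0-100 percentile scale.
    self_pct = [min(100.0, max(0.0, p + bias_pp))
                for p in percentile_ranks(perceived)]
    gaps = {}
    for q in range(4):
        members = [i for i in range(n) if 25 * q <= actual_pct[i] < 25 * (q + 1)]
        gaps[q + 1] = (mean(self_pct[i] for i in members)
                       - mean(actual_pct[i] for i in members))
    return gaps


print(simulate_null())
```

Under these assumptions the bottom quartile overestimates itself and the top quartile underestimates itself, which is the shape of the famous chart without any skill-dependent deficit.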

Your RAG store is rotting: freshness beats retrieval, and we measured it

The claim. Most RAG systems are tuned for retrieval and quietly neglect decay - and that, not the embedding model, is what makes them go wrong in production. A vector store that keeps every chunk fore... We measured value x freshness against recency cleanup at a 50% keep-budget: 96% vs 52% of value retained (+83%), plus orphan removal and refreshing stale chunks. Packaged as ragfresh, an open tool with no dependencies.
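
The two pruning policies being compared can be sketched as follows. This is not ragfresh's API: the `Chunk` fields, the exponential half-life decay, and the function names are all hypothetical, chosen only to show how a value x freshness score and a recency-only rule pick different chunks under the same keep-budget.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    value: float     # hypothetical usefulness score, e.g. past retrieval hits
    age_days: float  # time since the chunk was last verified or refreshed


def freshness(age_days, half_life_days=90.0):
    """Assumed exponential decay: freshness halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def keep_by_value_freshness(chunks, keep_fraction=0.5, half_life_days=90.0):
    """Keep the top share of chunks ranked by value x freshness."""
    budget = math.ceil(len(chunks) * keep_fraction)
    ranked = sorted(
        chunks,
        key=lambda c: c.value * freshness(c.age_days, half_life_days),
        reverse=True,
    )
    return ranked[:budget]


def keep_by_recency(chunks, keep_fraction=0.5):
    """Baseline: keep the newest chunks regardless of value."""
    budget = math.ceil(len(chunks) * keep_fraction)
    return sorted(chunks, key=lambda c: c.age_days)[:budget]


store = [Chunk("a", 10.0, 180), Chunk("b", 1.0, 1),
         Chunk("c", 0.5, 2), Chunk("d", 8.0, 90)]
print([c.chunk_id for c in keep_by_value_freshness(store)])  # ['d', 'a']
print([c.chunk_id for c in keep_by_recency(store)])          # ['b', 'c']
```

In this toy store the recency rule keeps the two newest but least valuable chunks, while the combined score keeps the older chunks that still carry most of the value.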

Your second brain is dying of maintenance - so we built one that maintains itself

The claim. Second brains don't die at capture - they die at maintenance. Setting up Obsidian/Notion is easy; the ongoing chore of re-linking, de-duplicating, archiving, and noticing what's gone stale ... A dependency-free maintainer finds dead links, orphans, stale notes, and duplicates, adds a percolation health gauge, suggests which note each orphan should link to, and applies the changes safely. Verified on a real vault of ~7,700 notes. Open-core.

Your AI might be training on itself - and we measured the two ways that ends badly

The claim. Any system that learns from its own output is a strange loop - a model retrained on synthetic data, an agent whose memory is its own past answers, a RAG store indexing the system's prior ge... Model collapse, measured. We built the smallest runnable model and found two failure modes, and two levers that prevent them: a ~5% anchor of real data stops the diversity collapse, and keeping the self-trust exponent p<=1 stops permanent...

Everyone says 'set exit criteria' - nobody gives you the number. We measured it.

The claim. "Set exit criteria and ignore the sunk cost" is the most repeated career and business advice there is - and it's useless, because it never tells you the threshold. When exactly do you cut a... When to give up a fading effort, measured. Quit when the recent return falls ~60% below its peak (a drawdown stop). It is an interior optimum - quitting too early and too late both lose - and it beats grinding to exhaustion by +239% on the same budget.
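
As a rough illustration of the drawdown stop, the sketch below treats "recent return" as a rolling mean over the last few periods. The window length, the function name, and the sample numbers are assumptions made for the example; only the ~60% threshold comes from the post.

```python
def should_quit(returns, window=3, drawdown=0.60):
    """True once the recent mean return sits `drawdown` below its own peak."""
    if len(returns) < window:
        return False  # not enough history to define a recent return
    # Rolling mean of the last `window` periods, at every point in time.
    rolling = [sum(returns[i - window:i]) / window
               for i in range(window, len(returns) + 1)]
    peak = max(rolling)
    # Only a positive peak gives a meaningful drawdown.
    return peak > 0 and rolling[-1] <= (1 - drawdown) * peak


print(should_quit([10, 10, 10, 9, 8, 7]))  # False: recent mean 8 vs peak 10
print(should_quit([10, 10, 10, 4, 3, 2]))  # True: recent mean 3 vs peak 10
```

Comparing against the running peak rather than the starting level is what makes this a drawdown rule: an effort that is still climbing never triggers it.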

June 14, 2026 · 2 min · EN · SK

More data, more wrong: a Bayesian credible interval is not coverage under misspecification

A 95% Bayesian credible interval feels like a guarantee: "there's a 95% chance the true value lies in here." That reading is only valid when the model is correctly specified. Under the kind of misspec...

June 14, 2026 · 2 min · EN · SK

A 95% confidence interval that covers 31% of the time: difference-in-differences with one treated unit

The claim. When you run difference-in-differences (DiD) with a single treated unit and errors that are correlated over time, the "95%" confidence interval it reports is badly overconfident. In a clean... Replication: with one treated unit and correlated errors, DiD's 95% CI covers only ~31% of the time; synthetic control restores ~89%, but at the cost of ~4x wider intervals.

June 13, 2026 · 1 min · EN · SK

Passing a pre-trends test is weak evidence: which difference-in-differences assumption fails worst, measured

A difference-in-differences pre-trends test catches only about one-third of the violations that ruin your estimate. Measured, with the simulation and the falsifier.

June 12, 2026 · 5 min · EN · SK

The Operating-Point Trap: methods break exactly where they are needed

A standard method is calibrated in the benign regime and its error is wired to the very thing that defines the hard regime - so it breaks exactly at the operating point that made you reach for it.