We built a meter for when an AI is confidently wrong - and confidence can't see it
A confidence score cannot tell whether a model is ignoring a correct document or confidently swallowing a wrong one. The Grounding Meter measures this directly - and predicts confidently wrong answers (r = -0.93) where confidence barely does (0.15-0.36).
The problem. A confidence score cannot distinguish two failures that look identical from the outside: a model that ignores a correct document, and a model that confidently swallows a wrong or poisoned one. Both just look like "a confident answer." So we built an instrument that measures the difference directly - the Grounding Meter.
What it measures. For a question with two options, we supply an external context that asserts one option with graded strength - a fixed-wording ladder from "1 source reports X" up to "6 sources report X" - and we flip which option the sources push. From the model's answer-token probabilities we read follow(d) = how much the model goes with whatever the context says, at evidence-dose d. We average over both option orders (A/B and B/A) and both push-directions, so position bias and option bias cancel by construction. The output is a per-model grounding curve: how strongly the model defers to external evidence as that evidence accumulates.
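A minimal sketch of the bias-cancelled average described above, not the tool's actual code. The names `ladder_context`, `follow`, and `toy_model` are hypothetical, and the additive-bias toy model is an assumption used only to show why averaging over both option orders and both push directions removes position and option bias.

```python
from itertools import product
from statistics import mean

ORDERS = ("xy", "yx")   # which option is listed first, then second
PUSHES = ("x", "y")     # which option the supplied context asserts


def ladder_context(k, claim):
    """Fixed-wording context at dose k; only the source count changes."""
    verb = "source reports" if k == 1 else "sources report"
    return f"{k} {verb} {claim}"


def follow(p_pushed):
    """Mean probability placed on the pushed option over all four conditions.

    p_pushed maps (order, pushed) -> P(model answers with the pushed option).
    """
    return mean(p_pushed[(o, p)] for o, p in product(ORDERS, PUSHES))


def toy_model(true_follow, position_bias, option_bias):
    """Hypothetical additive-bias model (an assumption, for illustration only)."""
    out = {}
    for order, pushed in product(ORDERS, PUSHES):
        # Bonus when the pushed option is listed first, penalty otherwise.
        pos = position_bias if order[0] == pushed else -position_bias
        # Bonus when the pushed option is "x", penalty when it is "y".
        opt = option_bias if pushed == "x" else -option_bias
        out[(order, pushed)] = true_follow + pos + opt
    return out


probs = toy_model(true_follow=0.60, position_bias=0.10, option_bias=0.05)
print(ladder_context(1, "X"))    # 1 source reports X
print(round(follow(probs), 6))   # 0.6
```

Each bias enters twice with a plus sign and twice with a minus sign, so the mean recovers the underlying follow value. Real biases need not be additive in probability space, so in practice the cancellation is approximate.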
What we found (open model qwen2.5-7B, a 14-question bank spanning fictional to near-axiom facts):
- The grounding curve is continuous and ordered by how strong the model's prior is. Fictional or weak-prior facts saturate almost immediately (half-saturation dose about 0.08); near-axiom facts the model is sure of - "water boils at 100 C", "H2O is water" - resist even six agreeing sources.
- The grounding signal predicts confident-wrongness. The correlation between the follow-curve and the model's accuracy when the supplied context is false was -0.93: the more strongly the model follows a false document, the more often it is confidently wrong. The correlation between the model's own confidence and that same accuracy was only 0.15-0.36. Confidence is nearly blind to what the meter sees - and confidence genuinely varied across items, so this is not an artifact of a flat signal.
- A frontier model (GLM-5.2), under the same framing, behaves differently: it resists plausible-but-wrong sources on facts it knows and defers only on genuinely unknown (fictional) items. Grounding is a property of (model x how the context is framed), not a fixed trait - which is exactly why you would want to measure it per deployment.
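The comparison in the second finding is a pair of per-item Pearson correlations. The sketch below shows one way to compute it, assuming one follow score, one confidence value, and one accuracy-under-false-context value per question. The function name `pearson` and all numbers are illustrative placeholders, not the measurements reported above. The explicit check for a constant input mirrors the flat-signal caveat: a predictor that does not vary cannot be correlated with anything.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; raises if either input is constant (a flat signal)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("constant input: correlation is undefined")
    return cov / sqrt(vx * vy)


# Placeholder per-item values, invented for illustration only.
follow_score = [0.95, 0.80, 0.60, 0.30, 0.10]    # how strongly the model follows context
acc_false_ctx = [0.05, 0.25, 0.35, 0.75, 0.90]   # accuracy when the context is false
confidence = [0.90, 0.70, 0.85, 0.75, 0.80]      # model's own confidence

print("follow vs accuracy:    ", round(pearson(follow_score, acc_false_ctx), 2))
print("confidence vs accuracy:", round(pearson(confidence, acc_false_ctx), 2))
```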
The method in two sentences. We read p(answer) from token logprobs on any OpenAI-compatible endpoint (local Ollama returns them), sweep a fixed-wording k-of-N source dose ladder, and define grounding as the half-saturation / area of the resulting follow-curve, bias-cancelled over option-order and push-direction. Smoothness is a population property of a prior-stratified question bank, not a per-item claim.
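As a rough sketch of those two sentences, the code below parses answer-token probabilities from a response shaped like the OpenAI Chat Completions format (requested with `logprobs` and `top_logprobs`), then summarises a follow-curve by area and half-saturation dose. Several things here are assumptions rather than the tool's documented behaviour: the answer labels "A" and "B", renormalising over just those two labels, the trapezoidal area normalised by dose span, and estimating half-saturation by linear interpolation. The dose units behind the reported figure of about 0.08 are not stated in the text, so the toy doses below are placeholders.

```python
from math import exp, log


def answer_probs(response, labels=("A", "B")):
    """Probability over the answer labels, read from the first generated token."""
    top = response["choices"][0]["logprobs"]["content"][0]["top_logprobs"]
    mass = {label: 0.0 for label in labels}
    for entry in top:
        token = entry["token"].strip()   # tolerate leading-space token variants
        if token in mass:
            mass[token] += exp(entry["logprob"])
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no answer label among the returned top logprobs")
    return {label: m / total for label, m in mass.items()}


def curve_area(doses, follows):
    """Trapezoidal area under follow(d), divided by the dose span."""
    pts = list(zip(doses, follows))
    area = sum((d1 - d0) * (f0 + f1) / 2 for (d0, f0), (d1, f1) in zip(pts, pts[1:]))
    return area / (doses[-1] - doses[0])


def half_saturation(doses, follows):
    """Dose at which follow first crosses midway between its first and last value."""
    target = (follows[0] + follows[-1]) / 2
    pts = list(zip(doses, follows))
    for (d0, f0), (d1, f1) in zip(pts, pts[1:]):
        if f0 != f1 and (f0 - target) * (f1 - target) <= 0:
            return d0 + (target - f0) * (d1 - d0) / (f1 - f0)
    return None   # flat curve: no crossing to report


# Hand-built response fragment and toy curve, for illustration only.
fake_response = {"choices": [{"logprobs": {"content": [{"top_logprobs": [
    {"token": "A", "logprob": log(0.6)},
    {"token": " B", "logprob": log(0.2)},
    {"token": "The", "logprob": log(0.1)},
]}]}}]}
probs = answer_probs(fake_response)
doses, follows = [0, 1, 2], [0.5, 0.8, 0.9]
print(probs, curve_area(doses, follows), half_saturation(doses, follows))
```

A curve that never moves returns `None` for half-saturation, which matches the stubborn-prior case where six agreeing sources do not shift the answer.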
The falsifier. If the grounding curve had tracked confidence, or could not separate "follows the context" from "ignores it," the meter would be useless. It did neither: grounding predicted real error at -0.93 while confidence did not, and the curve cleanly separated stubborn-prior items from easily-grounded ones.
What would change our mind. A model whose grounding curve fails to predict its real error rate under a poisoned context; or the cross-model differences vanishing once we control for prompt-wording sensitivity - an ablation we still owe before ranking models. We report one open model in full plus a frontier-model slice; we are not claiming a leaderboard yet.
It is a one-file open tool. Point it at your own model - local or hosted - and get its grounding curve back. The reference benchmark and the tool are open.