AI reasoning

More samples, worse answers: when scaling best-of-N backfires

June 26, 2026 - 2 min read
AI reasoning - Test-time compute - Reward overoptimization

The takeaway

Verifier-based selection (best-of-N) scales safely under a noisy verifier but collapses under an exploitable one - the danger is exploitability, not imperfection. Measured, with the falsifier.

A cheap way to make an AI answer better: sample N candidate answers and keep the one a verifier (or reward model) scores highest. This is "best-of-N". Folklore says more samples are always better. We measured when that is false, with a minimal, fully reproducible model.

Two kinds of "imperfect verifier"

Generate N candidates, each correct with probability g. The verifier scores each and we keep the argmax. "Imperfect" splits into two very different cases.
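The selection step itself is small. A minimal sketch in Python, assuming the verifier is any function that maps a candidate to a numeric score (the names `best_of_n` and `verifier` are illustrative, not taken from the original simulation):

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], verifier: Callable[[T], float]) -> T:
    """Return the candidate with the highest verifier score (the argmax)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


# Toy usage: the "verifier" here is a hypothetical proxy that prefers longer strings.
answers = ["4", "four", "2 + 2 = 4"]
print(best_of_n(answers, verifier=len))  # prints "2 + 2 = 4"
```

Whatever the verifier rewards, the argmax will find it, which is why the two cases that follow behave so differently.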

A — a noisy but unbiased verifier (its score tracks correctness with alignment α plus independent noise). Best-of-N is always safe: accuracy rises monotonically toward 1.0 for any α > 0 — the noise averages out and the consistent edge of correct candidates wins as N grows.

| alignment α | accuracy at N=1 | N=8 | N=32 | N=128 |
|---|---|---|---|---|
| 0.20 | 0.30 | 0.62 | 0.95 | 1.00 |
| 0.50 | 0.30 | 0.94 | 1.00 | 1.00 |
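Case A can be sketched as a small Monte Carlo simulation. The noise model here is an assumption: the score is `alpha * correct` plus Gaussian noise with a hypothetical standard deviation `sigma = 0.25`, and `g = 0.30` is read off the N=1 column. The original simulation may use a different noise distribution, so this sketch is meant to show the monotone trend, not to reproduce the table's exact figures.

```python
import random


def noisy_accuracy(n, alpha, g=0.30, sigma=0.25, trials=4000, seed=0):
    """Estimate best-of-n accuracy when score = alpha * correct + Gaussian noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best_score, best_correct = float("-inf"), False
        for _ in range(n):
            correct = rng.random() < g           # candidate is right with probability g
            score = alpha * correct + rng.gauss(0.0, sigma)
            if score > best_score:               # keep the running argmax
                best_score, best_correct = score, correct
        hits += best_correct
    return hits / trials


for n in (1, 8, 32):
    print(n, round(noisy_accuracy(n, alpha=0.5), 2))
```

Under these assumptions accuracy starts near g at N=1 and climbs as N grows, because correct candidates carry a consistent score advantage that the noise cannot erase.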

B — an exploitable verifier (a small fraction h of candidates are "hacks": wrong, but scored higher than any correct answer can be). Now accuracy peaks, then collapses as N grows:

| hack rate h | peak accuracy | peak at N | accuracy at N=128 |
|---|---|---|---|
| 0.00 | 1.00 | 32 | 1.00 (safe) |
| 0.01 | 0.86 | 8 | 0.28 |
| 0.03 | 0.74 | 8 | 0.02 |
| 0.08 | 0.56 | 4 | 0.00 |
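Because a hack always wins the argmax, accuracy cannot exceed the probability that no hack is drawn at all, (1 − h)^N. The sketch below computes that bound and pairs it with a simulation of case B. The honest candidates reuse an assumed Gaussian noise model (`sigma = 0.25`, `alpha = 0.5`, `g = 0.30` among non-hacks); these parameters are illustrative choices, not the original simulation's settings.

```python
import random


def hack_free_bound(n, h):
    """Probability that none of n independent candidates is a hack."""
    return (1.0 - h) ** n


def exploitable_accuracy(n, h, alpha=0.5, g=0.30, sigma=0.25, trials=2000, seed=0):
    """Best-of-n accuracy when hacks (rate h) out-score every honest candidate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best_score, best_correct = float("-inf"), False
        for _ in range(n):
            if rng.random() < h:
                score, correct = float("inf"), False   # hack: wrong, but top-scoring
            else:
                correct = rng.random() < g
                score = alpha * correct + rng.gauss(0.0, sigma)
            if score > best_score:
                best_score, best_correct = score, correct
        hits += best_correct
    return hits / trials


# Closed-form ceiling at N=128 for the three hack rates in the table.
for h in (0.01, 0.03, 0.08):
    print(h, round(hack_free_bound(128, h), 2))

# Simulated curve for h = 0.03: rises, then collapses.
for n in (1, 8, 128):
    print(n, round(exploitable_accuracy(n, h=0.03), 2))
```

The bound rounds to 0.28, 0.02, and 0.00 at N=128, matching the table's last column, which is what one would expect once the honest part of the selection is already near-perfect at that N.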

Because the chance of drawing at least one hack is 1 − (1 − h)^N → 1 as N grows, the **collapse onset scales as N\* ≈ 1/h**, and the optimal N shrinks as the exploitable tail grows. At h = 8%, **best-of-128 scores zero**.
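The 1/h scale can be made explicit with the standard small-h approximation ln(1 − h) ≈ −h. This is a worked step added for clarity, not a derivation taken from the original simulation:

```latex
(1-h)^N = e^{N \ln(1-h)} \approx e^{-hN},
\qquad
N = \frac{1}{h} \;\Rightarrow\; (1-h)^{N} \approx e^{-1} \approx 0.37
```

So at N ≈ 1/h only about 37% of draws are still hack-free, and roughly 63% contain at least one hack. Read this as the scale on which collapse sets in rather than the exact location of the peak: in the table above the measured peaks sit earlier (for example N = 8 at h = 0.01).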

The law

> Scaling test-time selection (best-of-N, reward-model reranking) is safe under an unbiased-but-noisy verifier and dangerous only under an exploitable one — a population of wrong candidates that out-score the right ones. The risk is not verifier imperfection (noise averages out); it is verifier exploitability (a high-proxy / low-truth tail). Optimal N is finite, ≈ 1/h, once such a tail exists.

If you spend test-time compute

Before scaling best-of-N or reward-model reranking, do not just look at the verifier's average accuracy. Estimate its exploitable tail — the rate h of candidates that score high but are wrong. Then cap N near 1/h, or kill the tail directly (dedup, length / style debiasing, an adversarial filter). Raising the verifier's mean accuracy does nothing if the tail remains: the argmax finds the tail.
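One way to turn this advice into a check, sketched under a loud assumption: you have a labeled sample of candidates with verifier scores, and a "hack" is operationalized as a wrong candidate that out-scores every correct candidate in that sample. The function names and the `n_max` fallback are hypothetical.

```python
def estimate_hack_rate(scores, labels):
    """Fraction of labeled candidates that are wrong yet out-score every correct one."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    correct_scores = [s for s, ok in zip(scores, labels) if ok]
    ceiling = max(correct_scores) if correct_scores else float("-inf")
    hacks = sum(1 for s, ok in zip(scores, labels) if not ok and s > ceiling)
    return hacks / len(scores)


def cap_n(h_hat, n_max=128):
    """Cap N near 1/h; fall back to n_max when no hacks were observed."""
    if h_hat <= 0:
        return n_max
    return max(1, min(n_max, int(1 / h_hat)))


# Toy labeled sample: one wrong candidate (score 0.99) beats the best correct one (0.9).
scores = [0.2, 0.9, 0.4, 0.99, 0.7, 0.1, 0.3, 0.8, 0.5, 0.6]
labels = [False, True, False, False, True, False, False, True, False, True]
h_hat = estimate_hack_rate(scores, labels)
print(h_hat, cap_n(h_hat))  # prints 0.1 10
```

Observing zero hacks in a small sample does not show that h is zero; it only bounds h loosely, so the fallback to `n_max` should be treated with caution.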

The falsifier

Two pre-committed predictions: under the noisy-but-unbiased verifier, accuracy must be non-decreasing in N for every α (else noise alone would cause overoptimization); and under exploitability, accuracy must peak then fall, with onset scaling ≈ 1/h. Both held across the sweep. Had the noisy model also collapsed, the "exploitability, not noise" claim would be wrong.
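Both predictions are mechanical enough to check in code. A minimal sketch that applies them to the figures reported in the two tables (the helper name `is_non_decreasing` is illustrative; only the peak and N=128 values are available for case B, so the second check is limited to "the final value is below the peak"):

```python
def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(b >= a for a, b in zip(values, values[1:]))


# Case A: accuracy at N = 1, 8, 32, 128 for each alignment alpha.
noisy_rows = {
    0.20: [0.30, 0.62, 0.95, 1.00],
    0.50: [0.30, 0.94, 1.00, 1.00],
}

# Case B: (peak accuracy, accuracy at N=128) for each non-zero hack rate h.
exploit_rows = {
    0.01: (0.86, 0.28),
    0.03: (0.74, 0.02),
    0.08: (0.56, 0.00),
}

prediction_1 = all(is_non_decreasing(row) for row in noisy_rows.values())
prediction_2 = all(final < peak for peak, final in exploit_rows.values())
print(prediction_1, prediction_2)  # prints True True
```

A full check of the second prediction would also compare the onset against 1/h across a finer grid of N, which needs the complete accuracy curves rather than the table summaries.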

FAQ

Does best-of-N sampling always improve answers? No. When the verifier or reward model is exploitable, more samples make answers worse: the argmax finds the high-proxy / low-truth tail. In our model, exploitation climbs to ~0.95–1.00 by N=32–128.

Why do more samples backfire? Because best-of-N takes the maximum proxy score. If even a small fraction h of candidates game the verifier (wrong, yet scored above any correct answer), a larger N is more likely to include one, and the argmax then picks it. Noise alone does not do this: for any alignment α > 0 it averages out.

Is there an optimal N? Yes, once an exploitable tail exists, and it shrinks as that tail grows. With a noisy but unbiased verifier more N keeps helping; with an exploitable one the optimal N is small, and past it exploitation dominates.

How do I use best-of-N safely? Estimate the verifier's exploitable tail h before scaling N, and shrink that tail where you can; cap N at the point where added proxy score still tracks true reward; and treat a verifier that rarely disagrees with the policy as a red flag, not a green light.

Related research

Minimal model (independent candidates; hacks modeled as a fixed top-scoring population). Prior art: reward-model overoptimization (Gao, Schulman, Hilton 2022); Goodhart's law. The contribution is isolating exploitability vs noise as the cause, and the ≈1/h collapse onset. Numbers reproducible from the open simulation.
