Replication

Why RAG serves stale facts: the supersession blind spot, reproduced

June 28, 2026 · 3 min read · agent memory · RAG · supersession · replication · severe-tested

The takeaway

Retrieval memory has no model of time: a superseded fact and its replacement are near-identical in embedding space, so similarity search silently returns the stale one (AUROC 0.61, near chance). We reproduced it and measured the deterministic-key fix: stale recall dropped from 41.7% to 0%.

The short version. Retrieval memory — RAG, vector stores, agent memory — has no model of time. When a fact is updated (a function renamed, a price changed, an API key rotated), the old value and the new one sit at nearly the same point in embedding space. A similarity search cannot tell which is current, so it silently returns the stale one. We reproduced this on our own local stack and measured the fix.

Credit where it is due: this supersession blind spot was named and measured by Neeraj Yadav (MemStrata, arXiv:2606.26511), who reports a cosine classifier separating current from superseded facts at AUROC ~0.59 — near chance. We replicated it independently on a different stack, then shipped the deterministic fix in our open-source memory core, mnemo.

The claim, replicated

On 24 (subject, relation, object) facts embedded with local nomic-embed-text (mean-centered, because nomic is anisotropic):

measurement	value
mean cosine(original, contradiction)	0.768
mean cosine(original, rephrased duplicate)	0.843
contradiction ranked at least as similar as the duplicate	10 / 24
AUROC: "low similarity implies supersession"	0.613 (chance = 0.5; Yadav ~0.59)

The AUROC is the whole story. In 10 of the 24 cases a contradiction (the new value) is at least as embedding-similar to the original as a genuine rephrase is, so no similarity threshold can reliably flag "this record supersedes that one." Verdict: REPRODUCED.
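For readers who want to check the AUROC figure on their own data, here is a minimal sketch of how such a number can be computed. It assumes the cosine similarities have already been measured on mean-centered embeddings. The function names and the six similarity values are made up for illustration and are not the 24-fact data from the table.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def supersession_auroc(contradiction_sims, duplicate_sims):
    """Score the rule "low similarity implies supersession".

    Positives are contradictions, negatives are rephrased duplicates,
    and the classifier score is the negated cosine similarity.
    """
    return auroc([-s for s in contradiction_sims],
                 [-s for s in duplicate_sims])


# Hypothetical similarities to the original fact (illustration only).
contradiction_sims = [0.70, 0.80, 0.90]
duplicate_sims = [0.75, 0.85, 0.95]

result = supersession_auroc(contradiction_sims, duplicate_sims)
print(round(result, 3))
```

A value of 0.5 means the similarity score carries no information about which record is the update, and 1.0 means a threshold would separate the two groups perfectly.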

What it costs you

Store the original, then the update, then ask for the current value:

retrieval	stale-fact rate
cosine top-1	41.7% (Yadav's RAG: 15–40%)
deterministic (subject, relation) key	0.0%

A similarity store serves the superseded value about 40% of the time. That is not a tuning problem — it is structural.
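The store-update-ask protocol can be sketched in a few lines. This is an illustration of the measurement, not the probe that produced the numbers above: the 2-D vectors are made-up stand-ins for real embeddings, and `cosine`, `top1`, and `stale_fact_rate` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top1(query_vec, records):
    """Return the record most similar to the query. Recency plays no part."""
    return max(records, key=lambda r: cosine(query_vec, r["vec"]))


def stale_fact_rate(cases):
    """Fraction of cases where cosine top-1 returns the superseded original."""
    stale = 0
    for case in cases:
        records = [
            {"role": "original", "vec": case["original"]},
            {"role": "update", "vec": case["update"]},
        ]
        if top1(case["query"], records)["role"] == "original":
            stale += 1
    return stale / len(cases)


# Two hypothetical cases: in the first the stale original happens to sit
# closer to the query, in the second the update does.
cases = [
    {"query": (1.0, 0.0), "original": (0.9, 0.1), "update": (0.8, 0.2)},
    {"query": (1.0, 0.0), "original": (0.7, 0.3), "update": (0.95, 0.05)},
]

print(stale_fact_rate(cases))
```

Nothing in `top1` knows which record was written later, so whether the current value wins depends only on where the two vectors happen to land relative to the query.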

The fix: a key, not a threshold

Similarity fails because it answers the wrong question. Supersession is not "are these two texts similar?" — it is "do these two records describe the same (subject, relation)?" That is a deterministic key, not a distance. When a new value arrives for an existing (subject, relation) key, retire the old one — no embedding, no LLM, no threshold.

We shipped exactly this in mnemo v0.2.0:

remember("Billing API auth method: API keys", key="billing-api::auth")

This call retires every active record with that key, so recall never returns the stale value. It is bi-temporal (a back-filled earlier value cannot overwrite the current one) and append-only: the old value is demoted, not deleted, and is still available with include_superseded=True.

This is the storage-side complement to Yadav's retrieval-side result. His bi-temporal ledger fixes which fact you retrieve; the same (subject, relation, object) discipline fixes which fact your memory keeps current. The two halves want to be designed together.
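To make the retire-on-write behaviour concrete, here is a minimal sketch of a keyed, append-only store. It is not mnemo's implementation or API: the `KeyedMemory` class, the `recall` method, the required `valid_from` argument, and the "OAuth 2.0" and "basic auth" values are all assumptions made for illustration. Only the `remember(..., key=...)` call shape and the `include_superseded` flag come from the text above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Record:
    key: str
    text: str
    valid_from: int      # when the fact became true (event time)
    recorded_at: int     # when it was written (insertion order)
    superseded: bool = False


class KeyedMemory:
    """Hypothetical append-only store with at most one active record per key."""

    def __init__(self):
        self._records = []
        self._clock = count(1)

    def remember(self, text, key, valid_from):
        rec = Record(key, text, valid_from, next(self._clock))
        for old in self._records:
            if old.key != key or old.superseded:
                continue
            if old.valid_from <= rec.valid_from:
                old.superseded = True   # the new value retires the old one
            else:
                rec.superseded = True   # back-fill: stored, but never current
        self._records.append(rec)       # nothing is ever deleted
        return rec

    def recall(self, key, include_superseded=False):
        return [r.text for r in self._records
                if r.key == key and (include_superseded or not r.superseded)]


mem = KeyedMemory()
key = "billing-api::auth"
mem.remember("Billing API auth method: API keys", key=key, valid_from=1)
mem.remember("Billing API auth method: OAuth 2.0", key=key, valid_from=3)
mem.remember("Billing API auth method: basic auth", key=key, valid_from=0)  # back-fill

print(mem.recall(key))
print(len(mem.recall(key, include_superseded=True)))
```

The supersession decision here is a key comparison plus a timestamp comparison, with no embedding, model call, or threshold involved.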

Honest limits

Synthetic, 24 facts. This characterizes the mechanism, not a product benchmark. The exact AUROC and stale-fact numbers will move with your data and embedder; the probe is one file — re-run it on yours.
The falsifier. If the deterministic key did not cut the stale-fact rate below similarity search, the idea would be worthless. It did: 41.7% → 0%.
It needs a key. The fix only applies to facts you can assign a (subject, relation) to — config, prices, versions, status, identities. For free-text memory with no natural key you are back to the hard retrieval-side problem, which is exactly what Yadav's paper attacks.
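
For facts that do have a natural (subject, relation), the key only works if it is built the same way every time. One possible normalization is sketched below. The `fact_key` helper and its slug rules are hypothetical, and the `::` separator is inferred from the `billing-api::auth` example rather than taken from any documented mnemo convention.

```python
import re


def _slug(part):
    """Lowercase and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")


def fact_key(subject, relation):
    """Build a deterministic key so spelling variants map to the same record."""
    return f"{_slug(subject)}::{_slug(relation)}"


print(fact_key("Billing API", "auth"))
print(fact_key("  billing  api ", "Auth"))
```

Both calls produce the same key, which is what lets a later write find and retire the earlier record.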

FAQ

Does a better embedding model fix stale-fact retrieval? No. The problem is structural: a contradicted fact is often more embedding-similar to the original than a rephrase is (we measured AUROC 0.61, near chance). A stronger embedder shifts the similarity band, not the blind spot.

What is supersession in AI memory? It is when a stored fact is replaced by a newer value — a renamed function, a changed price, a rotated key. A memory with no model of time keeps both and cannot tell which one is current.

How do you fix it without an LLM? Assign facts a deterministic (subject, relation) key and retire the old value when a new one arrives for that key — no similarity threshold, no extra model call. In our test this drove stale recall from 41.7% to 0%.

Is this your discovery? No. The supersession blind spot was named and measured by Neeraj Yadav (MemStrata, arXiv:2606.26511). We independently replicated it (AUROC 0.61, matching his ~0.59) and shipped the deterministic-key fix in our open-source mnemo.


