Benchmark Overview
At a glance
Cognitive Memory was benchmarked against three baselines on the LoCoMo long-conversation dataset (10 conversations, 1540 questions):
What these numbers mean
Overall accuracy measures the percentage of questions the system answers correctly, using LLM-evaluated scoring (GPT-4o judges whether the generated answer matches the ground truth).
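As a sketch, LLM-judged accuracy reduces to passing each (question, generated answer, ground truth) triple to a judge callable. The function and field names below are illustrative assumptions; in the real benchmark the judge wraps a GPT-4o prompt, while here a trivial stand-in shows the shape:

```python
def llm_judged_accuracy(examples, judge):
    """Percentage of examples the judge marks correct.

    `judge` is any callable returning True/False. The benchmark uses a
    GPT-4o prompt as the judge; this signature is a hypothetical sketch.
    """
    correct = sum(
        1 for ex in examples
        if judge(ex["question"], ex["answer"], ex["ground_truth"])
    )
    return 100.0 * correct / len(examples)

# Toy judge: exact string match (the benchmark uses an LLM instead).
exact_match = lambda q, a, gt: a.strip().lower() == gt.strip().lower()
```

Swapping in an LLM judge instead of exact match is what lets paraphrased but correct answers count as hits.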
Multi-hop accuracy measures questions that require combining information from multiple stored memories. These are the hardest questions — and the ones where memory architecture matters most.
Why Cognitive Memory wins
Three mechanisms drive the performance advantage:
1. Decay-weighted retrieval
Instead of treating all memories equally, retrieval scoring penalizes old memories and rewards recent, frequently-accessed ones. This naturally surfaces the most relevant context without manual curation.
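A minimal sketch of decay-weighted scoring, assuming an exponential half-life on age and a logarithmic boost for access count (the function name, half-life, and weights are illustrative, not the system's exact formula):

```python
import math
import time

def retrieval_score(similarity, created_at, access_count,
                    now=None, half_life=86_400.0, access_weight=0.1):
    """Score a memory for retrieval: embedding similarity discounted by
    age, boosted by how often the memory has been accessed.

    Illustrative sketch: exponential decay with a one-day half-life and
    a log1p access boost are assumed choices, not the real parameters.
    """
    now = time.time() if now is None else now
    age = max(now - created_at, 0.0)
    decay = 0.5 ** (age / half_life)              # halves every half_life seconds
    boost = 1.0 + access_weight * math.log1p(access_count)
    return similarity * decay * boost
```

Under this scheme two memories with identical similarity diverge sharply once one of them is days old, which is what pushes stale context out of the retrieved set without any manual curation.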
2. Associative linking
Memories that were co-encoded (synaptic tagging) or co-retrieved (strengthening) form links. When one memory is recalled, its neighbors are also activated. This is the key mechanism for multi-hop reasoning — answering “what is Sarah’s brother’s job?” requires linking the “Sarah has a brother named Tom” memory with the “Tom works as a dentist” memory.
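The link-and-activate behavior can be sketched as a weighted graph, where co-encoding or co-retrieval strengthens an edge and recall spreads activation to neighbors (class and method names here are hypothetical, and the spread factor is an assumed parameter):

```python
from collections import defaultdict

class MemoryGraph:
    """Illustrative sketch of associative linking: memories that are
    co-encoded or co-retrieved gain weighted links, and recalling one
    memory activates its neighbors."""

    def __init__(self):
        self.links = defaultdict(lambda: defaultdict(float))

    def link(self, a, b, strength=1.0):
        # Symmetric edge; repeated co-retrieval strengthens it further.
        self.links[a][b] += strength
        self.links[b][a] += strength

    def recall(self, memory_id, spread=0.5):
        """Return the recalled memory plus activation-weighted neighbors."""
        activated = {memory_id: 1.0}
        for neighbor, weight in self.links[memory_id].items():
            activated[neighbor] = max(activated.get(neighbor, 0.0),
                                      spread * weight)
        return activated

g = MemoryGraph()
g.link("sarah_brother_tom", "tom_is_dentist")  # co-encoded facts
hits = g.recall("sarah_brother_tom")           # both memories now in scope
```

Recalling the “Sarah has a brother named Tom” memory pulls in “Tom works as a dentist” through the link, which is exactly the second hop the question needs.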
3. Deep recall with re-ranking
The benchmark’s best configuration adds an LLM re-ranking step after retrieval. Broad candidate recall (k=60) catches relevant memories even if they score low, and the re-ranker filters noise. This three-stage pipeline (embedding recall, R^alpha scoring, LLM re-ranking) is the key insight from our tuning work.
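The three stages compose as a simple funnel. In this sketch the `Memory` type, `score_hint` field, and `rerank` callable are illustrative stand-ins: `score_hint` substitutes for the real R^alpha recency/usage signal, and `rerank` substitutes for the LLM re-ranker:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Memory:
    text: str
    vector: List[float]
    score_hint: float = 0.0  # stand-in for the R^alpha decay/usage signal

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, memories, rerank, k=60, top_n=10):
    """Three-stage retrieval sketch:
    1. broad embedding recall (k=60) so low-scoring but relevant
       memories survive the first cut,
    2. re-scoring with the recency/usage-aware signal,
    3. re-ranking (an LLM in the real pipeline) to filter noise."""
    # Stage 1: wide candidate net by embedding similarity alone.
    candidates = sorted(memories,
                        key=lambda m: cosine(query_vec, m.vector),
                        reverse=True)[:k]
    # Stage 2: fold in the decay/usage signal on the surviving candidates.
    scored = sorted(candidates,
                    key=lambda m: cosine(query_vec, m.vector) + m.score_hint,
                    reverse=True)
    # Stage 3: re-ranker filters distractors; any callable that reorders
    # the list works, so an LLM call can be dropped in here.
    return rerank(scored)[:top_n]
```

The design point is that the stages get progressively more expensive per item, so the cheap embedding pass can afford a wide k=60 net while the LLM only ever sees the already-scored shortlist.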
The benchmark configurations
We ran multiple configurations to isolate the contribution of each mechanism:
*Run H was evaluated on conversation 0 only.
Each jump tells a story:
- A to E: tuned prompt and deep recall add +10pp overall
- E to F: broader recall (k=60) and Mem0-style prompt add +4.2pp
- F to H: LLM re-ranking adds +5.4pp overall and +12.2pp multi-hop
See Methodology for how we run these benchmarks.
Dive deeper
- Multi-hop results — detailed breakdown of multi-hop performance
- LoCoMo results — full per-conversation breakdown
- vs. Mem0 — direct comparison
- vs. FadeMem — direct comparison
- vs. RAG — why cognitive memory beats naive retrieval
- Why multi-hop matters — the flagship pitch
- Methodology — how we benchmark