Benchmark Overview

Cognitive Memory was benchmarked against three baselines on the LoCoMo long-conversation dataset (10 conversations, 1540 questions):

Overall accuracy measures the percentage of questions the system answers correctly, using LLM-evaluated scoring (GPT-4o judges whether the generated answer matches the ground truth).

Multi-hop accuracy measures questions that require combining information from multiple stored memories. These are the hardest questions — and the ones where memory architecture matters most.

Three mechanisms drive the performance advantage:

Instead of treating all memories equally, retrieval scoring penalizes old memories and rewards recent, frequently-accessed ones. This naturally surfaces the most relevant context without manual curation.
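A minimal sketch of what such a scoring function could look like. The actual formula is not shown here, so the decay shape, half-life, and weights below are illustrative assumptions, not the system's real parameters:

```python
import math

def retrieval_score(similarity: float, age_seconds: float,
                    access_count: int, half_life: float = 7 * 86400) -> float:
    """Hypothetical scoring: embedding similarity, discounted by age
    and boosted by how often the memory has been accessed."""
    # Recency decay: halves every `half_life` seconds (assumed shape).
    recency = math.exp(-math.log(2) * age_seconds / half_life)
    # Frequency boost with diminishing returns.
    frequency = math.log1p(access_count)
    return similarity * (0.5 + 0.5 * recency) * (1.0 + 0.1 * frequency)
```

With equal similarity, a fresh, frequently-accessed memory outscores a month-old, never-accessed one, which is the behavior the paragraph describes.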

Memories that were co-encoded (synaptic tagging) or co-retrieved (strengthening) form links. When one memory is recalled, its neighbors are also activated. This is the key mechanism for multi-hop reasoning — answering “what is Sarah’s brother’s job?” requires linking the “Sarah has a brother named Tom” memory with the “Tom works as a dentist” memory.
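The linking idea can be sketched as a small memory graph. The class and method names here are hypothetical, and real spreading activation would weight links and decay over hops; this one-hop version just shows how recalling one memory pulls in its linked neighbors:

```python
from collections import defaultdict

class MemoryGraph:
    """Sketch of associative memory links (names are illustrative)."""

    def __init__(self):
        self.memories = {}                # id -> memory text
        self.links = defaultdict(set)     # id -> linked memory ids

    def store(self, mem_id, text, co_encoded_with=()):
        """Store a memory; link it to memories encoded alongside it
        (the 'synaptic tagging' case)."""
        self.memories[mem_id] = text
        for other in co_encoded_with:
            self.links[mem_id].add(other)
            self.links[other].add(mem_id)

    def recall(self, mem_id):
        """Return the memory plus its one-hop neighbors
        (simplified spreading activation)."""
        hits = [self.memories[mem_id]]
        hits += [self.memories[n] for n in sorted(self.links[mem_id])]
        return hits
```

For the example in the text: if "Tom works as a dentist" was encoded while "Sarah has a brother named Tom" was active, recalling either memory surfaces both, giving the answer chain for "what is Sarah's brother's job?".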

The benchmark’s best configuration adds an LLM re-ranking step after retrieval. Broad candidate recall (k=60) catches relevant memories even if they score low, and the re-ranker filters noise. This three-stage pipeline (embedding recall, R^alpha scoring, LLM re-ranking) is the key insight from our tuning work.
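The three stages can be sketched end to end. Everything below is a simplification under stated assumptions: the `Memory` shape, the cosine recall, the age-based stage-2 score (standing in for the R^alpha formula, which is not reproduced here), and the injected `rerank` callable (standing in for the LLM re-ranker) are all hypothetical:

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    vector: list      # embedding (assumed precomputed)
    age_days: float = 0.0

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question_vec, memories, rerank, recall_k=60, final_k=10):
    """Three-stage sketch: broad embedding recall, relevance re-scoring,
    then a re-ranking pass (an LLM call in the real pipeline)."""
    # Stage 1: broad recall (k=60) -- wide net so low-scoring but
    # relevant memories still reach the candidate pool.
    candidates = sorted(memories,
                        key=lambda m: cosine(question_vec, m.vector),
                        reverse=True)[:recall_k]
    # Stage 2: re-score; this age penalty is a placeholder for R^alpha.
    candidates.sort(key=lambda m: cosine(question_vec, m.vector) / (1 + m.age_days),
                    reverse=True)
    # Stage 3: re-ranker filters noise and keeps the final context.
    return rerank(candidates)[:final_k]
```

The design point is that stage 1 optimizes for recall and stages 2-3 optimize for precision, so neither has to do both jobs at once.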

We ran multiple configurations to isolate the contribution of each mechanism:

*Run H was evaluated on conversation 0 only

Each jump tells a story:

  • A to E: tuned prompt and deep recall add +10pp overall
  • E to F: broader recall (k=60) and Mem0-style prompt add +4.2pp
  • F to H: LLM re-ranking adds +5.4pp overall and +12.2pp multi-hop

See Methodology for how we run these benchmarks.