Benchmark Overview

Cognitive Memory was benchmarked against three baselines on the LoCoMo long-conversation dataset (10 conversations, 1540 questions):

Overall accuracy measures the percentage of questions the system answers correctly, using LLM-evaluated scoring (GPT-4o judges whether the generated answer matches the ground truth).

Multi-hop accuracy measures questions that require combining information from multiple stored memories. These are the hardest questions — and the ones where memory architecture matters most.

Three mechanisms drive the performance advantage:

Instead of treating all memories equally, retrieval scoring penalizes old memories and rewards recent, frequently-accessed ones. This naturally surfaces the most relevant context without manual curation.
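A minimal sketch of what such a scoring function could look like. The actual formula is not shown here, so the decay shape, half-life, and weights below are illustrative assumptions, not the system's real parameters:

```python
import math

def retrieval_score(similarity: float, age_seconds: float,
                    access_count: int, half_life: float = 7 * 86400) -> float:
    """Hypothetical scoring: embedding similarity, discounted by age
    and boosted by how often the memory has been accessed."""
    # Recency decay: halves every `half_life` seconds (assumed shape).
    recency = math.exp(-math.log(2) * age_seconds / half_life)
    # Frequency boost with diminishing returns.
    frequency = math.log1p(access_count)
    return similarity * (0.5 + 0.5 * recency) * (1.0 + 0.1 * frequency)
```

With equal similarity, a fresh, frequently-accessed memory outscores a month-old, never-accessed one, which is the behavior the paragraph describes.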

Memories that were co-encoded (synaptic tagging) or co-retrieved (strengthening) form links. When one memory is recalled, its neighbors are also activated. This is the key mechanism for multi-hop reasoning — answering “what is Sarah’s brother’s job?” requires linking the “Sarah has a brother named Tom” memory with the “Tom works as a dentist” memory.
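The linking idea can be sketched as a small memory graph. The class and method names here are hypothetical, and real spreading activation would weight links and decay over hops; this one-hop version just shows how recalling one memory pulls in its linked neighbors:

```python
from collections import defaultdict

class MemoryGraph:
    """Sketch of associative memory links (names are illustrative)."""

    def __init__(self):
        self.memories = {}                # id -> memory text
        self.links = defaultdict(set)     # id -> linked memory ids

    def store(self, mem_id, text, co_encoded_with=()):
        """Store a memory; link it to memories encoded alongside it
        (the 'synaptic tagging' case)."""
        self.memories[mem_id] = text
        for other in co_encoded_with:
            self.links[mem_id].add(other)
            self.links[other].add(mem_id)

    def recall(self, mem_id):
        """Return the memory plus its one-hop neighbors
        (simplified spreading activation)."""
        hits = [self.memories[mem_id]]
        hits += [self.memories[n] for n in sorted(self.links[mem_id])]
        return hits
```

For the example in the text: if "Tom works as a dentist" was encoded while "Sarah has a brother named Tom" was active, recalling either memory surfaces both, giving the answer chain for "what is Sarah's brother's job?".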

The benchmark’s best configuration adds an LLM re-ranking step after retrieval. Broad candidate recall (k=60) catches relevant memories even if they score low, and the re-ranker filters noise. This three-stage pipeline (embedding recall, R^alpha scoring, LLM re-ranking) is the key insight from our tuning work.
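The three stages can be sketched end to end. Everything below is a simplification under stated assumptions: the `Memory` shape, the cosine recall, the age-based stage-2 score (standing in for the R^alpha formula, which is not reproduced here), and the injected `rerank` callable (standing in for the LLM re-ranker) are all hypothetical:

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    vector: list      # embedding (assumed precomputed)
    age_days: float = 0.0

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question_vec, memories, rerank, recall_k=60, final_k=10):
    """Three-stage sketch: broad embedding recall, relevance re-scoring,
    then a re-ranking pass (an LLM call in the real pipeline)."""
    # Stage 1: broad recall (k=60) -- wide net so low-scoring but
    # relevant memories still reach the candidate pool.
    candidates = sorted(memories,
                        key=lambda m: cosine(question_vec, m.vector),
                        reverse=True)[:recall_k]
    # Stage 2: re-score; this age penalty is a placeholder for R^alpha.
    candidates.sort(key=lambda m: cosine(question_vec, m.vector) / (1 + m.age_days),
                    reverse=True)
    # Stage 3: re-ranker filters noise and keeps the final context.
    return rerank(candidates)[:final_k]
```

The design point is that stage 1 optimizes for recall and stages 2-3 optimize for precision, so neither has to do both jobs at once.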

We ran multiple configurations to isolate the contribution of each mechanism:

*Run H was evaluated on conversation 0 only

Each jump tells a story:

  • A to E: tuned prompt and deep recall add +10pp overall
  • E to F: broader recall (k=60) and Mem0-style prompt add +4.2pp
  • F to H: LLM re-ranking adds +5.4pp overall and +12.2pp multi-hop

See Methodology for how we run these benchmarks.