LoCoMo Results

LoCoMo (Long Conversation Memory) is a benchmark designed to test long-term memory systems. It consists of 10 multi-session conversations (each 10-30 turns), with 1,540 questions testing different aspects of memory:

  • Single-hop: Direct recall of a single fact
  • Multi-hop: Combining 2+ facts to answer
  • Temporal: Questions about when events happened
  • Open-ended: Questions with no single correct answer

Each conversation simulates a realistic user-agent interaction over weeks or months, with facts that evolve, contradict, and interleave.

Our results (Run F — full 10 conversations)

Configuration: Mem0-style prompt, k=60, standard retrieval (no deep recall or re-ranking).
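The configuration above can be captured as a small settings object. This is an illustrative sketch, not the harness's actual code; every field name here is hypothetical.

```python
# Hypothetical run configuration for the benchmark harness.
# Field names are illustrative, not taken from the actual codebase.
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    prompt_style: str = "mem0"   # answer-generation prompt flavor
    k: int = 60                  # retrieval candidates per question
    deep_recall: bool = False    # also pull in superseded memory versions
    llm_rerank: bool = False     # LLM re-ranking of the candidate pool

RUN_F = RunConfig()                                   # standard retrieval
RUN_H = RunConfig(deep_recall=True, llm_rerank=True)  # Conv 0 experiment
```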

            Overall   Single-hop   Multi-hop   Temporal
  Conv 0     40.5%      44.9%       43.2%       33.3%
  Conv 1     44.1%      44.9%       49.0%       39.3%
  Conv 2     43.3%      46.3%       46.9%       36.7%
  Conv 3     42.5%      43.8%       47.3%       36.0%
  Conv 4     41.9%      43.8%       48.2%       34.7%
  Conv 5     45.3%      46.1%       51.2%       39.3%
  Conv 6     38.9%      40.4%       43.6%       33.3%
  Conv 7     42.1%      44.9%       47.5%       34.0%
  Conv 8     41.8%      42.7%       46.9%       36.0%
  Conv 9     43.6%      44.9%       46.9%       40.0%
  Average    42.4%      44.2%       47.1%       36.7%

Run H (Conv 0 only — deep recall + LLM re-ranking)

The jump from deep recall + re-ranking is dramatic, especially for multi-hop. Full 10-conversation Run H results are in progress.

The biggest gains came from improving retrieval quality, not from changing the answer-generation prompt or the decay model. The three-stage pipeline (embedding recall → R^α scoring → LLM re-ranking) was the key breakthrough.
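The three stages can be sketched as below. This is a minimal illustration under stated assumptions: the R^α scoring rule is assumed here to weight cosine similarity by a recency factor raised to an exponent α (the real formula may differ), and `llm_rerank` is a stub standing in for an actual LLM call.

```python
# Sketch of the three-stage retrieval pipeline. The R^alpha scoring rule
# and all memory fields here are assumptions, not the actual implementation.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query_vec, memories, k=60):
    """Stage 1: embedding recall — top-k candidates by cosine similarity."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]),
                    reverse=True)
    return ranked[:k]

def score(query_vec, candidates, alpha=0.5):
    """Stage 2: assumed R^alpha scoring — similarity weighted by recency."""
    return sorted(candidates,
                  key=lambda m: cosine(query_vec, m["vec"]) * (m["recency"] ** alpha),
                  reverse=True)

def llm_rerank(question, candidates, top_n=10):
    """Stage 3: LLM re-ranking (stubbed; a real system would prompt an LLM)."""
    return candidates[:top_n]

memories = [
    {"text": "went hiking", "vec": [1.0, 0.1], "recency": 0.9},
    {"text": "bought boots", "vec": [0.8, 0.5], "recency": 0.4},
]
query = [1.0, 0.0]
pipeline = llm_rerank("q", score(query, recall(query, memories)))
```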

Broader recall helps more than better scoring

Going from k=10 (Run A) to k=60 (Run F) added +14.2pp overall. The extra candidates contain relevant memories that rank poorly by embedding similarity alone but are critical for answering the question.

Deep recall expands the candidate pool with superseded originals. Re-ranking filters the noise this creates. Neither works as well alone:

  • Deep recall without re-ranking: adds noise, minimal improvement
  • Re-ranking without deep recall: helps, but misses consolidated details
  • Both together: +7.3pp overall, +16.1pp multi-hop
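The expansion step can be sketched as follows, assuming each memory may reference the earlier version it replaced via a `supersedes` pointer; the field names are hypothetical, and the re-ranking that filters the expanded pool is omitted here.

```python
# Minimal sketch of deep recall: re-add the superseded originals behind
# each candidate so details lost in consolidation become retrievable.
# The `id`/`supersedes` fields are hypothetical, not the actual schema.
def deep_recall(candidates, store):
    """Expand candidates with the chain of memories they superseded."""
    pool = list(candidates)
    seen = {m["id"] for m in pool}
    for m in candidates:
        prev_id = m.get("supersedes")
        while prev_id is not None and prev_id not in seen:
            prev = store[prev_id]
            pool.append(prev)
            seen.add(prev_id)
            prev_id = prev.get("supersedes")
    return pool

store = {
    1: {"id": 1, "text": "lives in Boston", "supersedes": None},
    2: {"id": 2, "text": "moved to Denver", "supersedes": 1},
}
# The superseded Boston fact returns to the pool for the re-ranker to judge.
expanded = deep_recall([store[2]], store)
```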

Temporal accuracy (36.7% on average) consistently lags the other categories. Questions like “when did the user go hiking?” require precise date extraction and retrieval, which embedding similarity handles poorly. This is an area for future improvement.
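One plausible direction, sketched below under stated assumptions: attach an explicit event date to each memory at extraction time so temporal questions can be answered from metadata rather than embedding similarity. The fields and the ISO-date regex are illustrative only.

```python
# Illustrative sketch: recover an explicit event date per memory so that
# "when did X happen?" does not depend on embedding similarity alone.
import re
from datetime import date

def event_date(memory):
    """Pull an ISO date from a memory's metadata or text, if one exists."""
    if "date" in memory:
        return memory["date"]
    m = re.search(r"\d{4}-\d{2}-\d{2}", memory["text"])
    return date.fromisoformat(m.group()) if m else None

mems = [
    {"text": "user went hiking on 2023-05-14"},
    {"text": "user likes trail mix"},
]
dates = [event_date(m) for m in mems]
```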

All results use:

  • GPT-4o for answer evaluation (LLM-as-judge)
  • GPT-4o-mini for memory extraction
  • text-embedding-3-small for embeddings (1536 dimensions)
  • Temperature 0 for deterministic answers
  • Ground truth from the LoCoMo dataset
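The evaluation loop can be sketched as below, with the judge call stubbed out; a real harness would send the question, gold answer, and model answer to GPT-4o at temperature 0 and parse its verdict. The exact-match stand-in here is only for illustration.

```python
# Sketch of the LLM-as-judge scoring loop. `judge` is a stub: the actual
# setup prompts GPT-4o at temperature 0 rather than using exact match.
def judge(question, gold, predicted):
    """Stub stand-in for the GPT-4o judge: case-insensitive exact match."""
    return gold.strip().lower() == predicted.strip().lower()

def accuracy(examples):
    """Fraction of (question, gold, predicted) triples the judge accepts."""
    correct = sum(judge(q, g, p) for q, g, p in examples)
    return correct / len(examples)

acc = accuracy([
    ("Where does the user live?", "Denver", "Denver"),
    ("When did they hike?", "May 14", "June 2"),
])
```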

See Methodology for full details.