# LoCoMo Results
## About LoCoMo

LoCoMo (Long Conversation Memory) is a benchmark designed to test long-term memory systems. It consists of 10 multi-session conversations (10-30 turns each), with 1,540 questions testing different aspects of memory:
- Single-hop: Direct recall of a single fact
- Multi-hop: Combining 2+ facts to answer
- Temporal: Questions about when events happened
- Open-ended: Questions with no single correct answer
Each conversation simulates a realistic user-agent interaction over weeks or months, with facts that evolve, contradict, and interleave.
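To make the four question categories concrete, here is a sketch of what one evaluation item might look like. The field names and values are illustrative, not the dataset's actual schema:

```python
# Hypothetical shape of a LoCoMo-style evaluation item (field names and
# values are illustrative, not the dataset's actual schema).
question = {
    "conversation_id": 0,
    "category": "multi-hop",      # single-hop | multi-hop | temporal | open-ended
    "question": "Which city did the user move to after changing jobs?",
    "evidence_turns": [12, 47],   # multi-hop: the answer spans two sessions
    "gold_answer": "Denver",
}

def needs_multiple_facts(item: dict) -> bool:
    """A multi-hop item combines two or more evidence turns."""
    return item["category"] == "multi-hop" and len(item["evidence_turns"]) >= 2

print(needs_multiple_facts(question))  # → True
```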
## Our results (Run F — full 10 conversations)

Configuration: Mem0-style prompt, k=60, standard retrieval (no deep recall or re-ranking).
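The Run F settings can be captured in a small configuration sketch (the key names are ours, not the project's actual API):

```python
# Illustrative Run F configuration (key names are ours, not the project's API).
RUN_F = {
    "prompt_style": "mem0",   # Mem0-style answer prompt
    "k": 60,                  # candidates retrieved per question
    "deep_recall": False,     # standard retrieval only
    "llm_rerank": False,      # no LLM re-ranking stage
}
```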
### Per-conversation breakdown

| Conversation | Overall | Single-hop | Multi-hop | Temporal |
|---|---|---|---|---|
| Conv 0 | 40.5% | 44.9% | 43.2% | 33.3% |
| Conv 1 | 44.1% | 44.9% | 49.0% | 39.3% |
| Conv 2 | 43.3% | 46.3% | 46.9% | 36.7% |
| Conv 3 | 42.5% | 43.8% | 47.3% | 36.0% |
| Conv 4 | 41.9% | 43.8% | 48.2% | 34.7% |
| Conv 5 | 45.3% | 46.1% | 51.2% | 39.3% |
| Conv 6 | 38.9% | 40.4% | 43.6% | 33.3% |
| Conv 7 | 42.1% | 44.9% | 47.5% | 34.0% |
| Conv 8 | 41.8% | 42.7% | 46.9% | 36.0% |
| Conv 9 | 43.6% | 44.9% | 46.9% | 40.0% |
| Average | 42.4% | 44.2% | 47.1% | 36.7% |
## Run H (Conv 0 only — deep recall + LLM re-ranking)

Adding deep recall and LLM re-ranking produces a dramatic jump, especially on multi-hop questions. Full 10-conversation Run H results are in progress.
## Comparison across runs

## What we learned
Section titled “What we learned”Retrieval precision is the bottleneck
Section titled “Retrieval precision is the bottleneck”The biggest gains came from improving retrieval quality, not from changing the answer generation prompt or the decay model. The three-stage pipeline (embedding recall -> R^alpha scoring -> LLM re-ranking) was the key breakthrough.
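The three stages can be sketched as follows. This is a minimal, offline illustration: the function names are ours, and the R^alpha blend shown (similarity weighted by recency raised to alpha) is a guess at the scoring shape, not the system's actual formula.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def embedding_recall(query_emb, memories, k=60):
    """Stage 1: cast a wide net -- top-k memories by cosine similarity."""
    ranked = sorted(memories, key=lambda m: cosine(query_emb, m["emb"]), reverse=True)
    return ranked[:k]

def r_alpha_score(similarity, recency, alpha=0.7):
    """Stage 2: our guess at the R^alpha blend -- similarity weighted by a
    recency term raised to alpha. The exact formula is not given on this page."""
    return similarity * (recency ** alpha)

# Stage 3 (LLM re-ranking) would pass the survivors to a model that keeps
# only candidates relevant to the question; omitted here to stay offline.

memories = [
    {"id": "a", "emb": [1.0, 0.0]},
    {"id": "b", "emb": [0.0, 1.0]},
    {"id": "c", "emb": [0.9, 0.1]},
]
top = embedding_recall([1.0, 0.0], memories, k=2)
print([m["id"] for m in top])  # → ['a', 'c']
```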
### Broader recall helps more than better scoring

Going from k=10 (Run A) to k=60 (Run F) added +14.2pp overall. The extra candidates include relevant memories that rank poorly by embedding similarity alone yet are critical for answering the question.
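A toy illustration of the effect: a relevant memory ranked 25th by embedding similarity is lost at k=10 but recovered at k=60 (the ranks here are invented for the example).

```python
# Toy recall@k illustration with invented ranks: two truly relevant
# memories sit at similarity ranks 3 and 24 (0-indexed).
ranked_ids = list(range(100))   # memory ids in similarity order, best first
relevant = {3, 24}              # ranks of the truly relevant memories

def recall_at(k):
    """Fraction of relevant memories inside the top-k cutoff."""
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

print(recall_at(10))  # → 0.5 (only the memory at rank 3 makes the cut)
print(recall_at(60))  # → 1.0 (the rank-24 memory is recovered)
```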
### Deep recall and re-ranking synergize

Deep recall expands the candidate pool with superseded originals. Re-ranking filters the noise this creates. Neither works as well alone:
- Deep recall without re-ranking: adds noise, minimal improvement
- Re-ranking without deep recall: helps, but misses consolidated details
- Both together: +7.3pp overall, +16.1pp multi-hop
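The interaction can be sketched with toy data. The function names and memory shapes are ours, and the keyword-overlap "re-ranker" is only a cheap stand-in for the real LLM re-ranking stage:

```python
def deep_recall(candidates, store):
    """Expand the pool with the superseded originals behind each consolidated
    memory, so pre-consolidation details stay reachable."""
    pool = list(candidates)
    for m in candidates:
        for old_id in m.get("supersedes", []):
            pool.append(store[old_id])
    return pool

def rerank(question_terms, pool):
    """Stand-in for LLM re-ranking: keep memories mentioning a query term.
    The real system asks an LLM; keyword overlap is only a cheap proxy."""
    return [m for m in pool if question_terms & set(m["text"].split())]

# Toy store: m2 is a consolidation that superseded the more detailed m1.
store = {"m1": {"id": "m1", "text": "user adopted a dog named Rex"}}
candidates = [
    {"id": "m2", "text": "user has a pet", "supersedes": ["m1"]},
    {"id": "m3", "text": "user likes coffee"},
]
pool = deep_recall(candidates, store)  # adds m1 back (plus noise like m3)
kept = rerank({"dog"}, pool)           # re-ranking filters that noise
print([m["id"] for m in kept])         # → ['m1']
```

Without deep recall, the detail "a dog named Rex" is unreachable; without re-ranking, the expanded pool carries irrelevant candidates forward.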
### Temporal questions are hardest

Temporal accuracy (36.7%) consistently lags the other categories. Questions like “when did the user go hiking?” require precise date extraction and retrieval, which embedding similarity handles poorly. This is an area for future improvement.
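One direction we are considering: attach explicit dates to memories at write time, so temporal questions can filter on metadata instead of leaning on embeddings. A minimal sketch (memory contents invented for illustration):

```python
from datetime import date

# Sketch: memories carry an explicit date extracted at write time
# (contents here are invented for illustration).
memories = [
    {"text": "went hiking at Bear Lake", "date": date(2024, 3, 16)},
    {"text": "started a new job",        "date": date(2024, 5, 2)},
]

def events_between(memories, start, end):
    """Answer 'what happened between X and Y' by filtering on date metadata
    rather than embedding similarity."""
    return [m["text"] for m in memories if start <= m["date"] <= end]

print(events_between(memories, date(2024, 3, 1), date(2024, 3, 31)))
# → ['went hiking at Bear Lake']
```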
## Methodology

All results use:
- GPT-4o for answer evaluation (LLM-as-judge)
- GPT-4o-mini for memory extraction
- text-embedding-3-small for embeddings (1536 dimensions)
- Temperature 0 for deterministic answers
- Ground truth from the LoCoMo dataset
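The LLM-as-judge step works by sending the question, gold answer, and model answer to GPT-4o at temperature 0 for a binary verdict. The prompt wording below is a sketch, not the exact grader prompt used in these runs:

```python
# Sketch of the LLM-as-judge message construction. The prompt wording is
# illustrative, not the exact grader prompt used in these runs.
JUDGE_TEMPLATE = (
    "Question: {q}\n"
    "Gold answer: {gold}\n"
    "Model answer: {pred}\n"
    "Reply with exactly CORRECT or INCORRECT."
)

def build_judge_messages(q, gold, pred):
    """Build the chat messages sent to the judge model (GPT-4o, temperature 0)."""
    return [
        {"role": "system", "content": "You grade answers against a gold reference."},
        {"role": "user", "content": JUDGE_TEMPLATE.format(q=q, gold=gold, pred=pred)},
    ]

msgs = build_judge_messages("Where did the user hike?", "Bear Lake", "Bear Lake")
print(msgs[1]["content"].startswith("Question:"))  # → True
```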
See Methodology for full details.