LoCoMo Results

LoCoMo (Long Conversation Memory) is a benchmark designed to test long-term memory systems. It consists of 10 multi-session conversations (each 10-30 turns), with 1,540 questions testing different aspects of memory:

  • Single-hop: Direct recall of a single fact
  • Multi-hop: Combining 2+ facts to answer
  • Temporal: Questions about when events happened
  • Open-ended: Questions with no single correct answer

Each conversation simulates a realistic user-agent interaction over weeks or months, with facts that evolve, contradict, and interleave.

Our results (Run F — full 10 conversations)

Configuration: Mem0-style prompt, k=60, standard retrieval (no deep recall or re-ranking).
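The configuration above can be captured as a small settings object. This is an illustrative sketch, not the harness's actual code; every field name here is hypothetical.

```python
# Hypothetical run configuration for the benchmark harness.
# Field names are illustrative, not taken from the actual codebase.
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    prompt_style: str = "mem0"   # answer-generation prompt flavor
    k: int = 60                  # retrieval candidates per question
    deep_recall: bool = False    # also pull in superseded memory versions
    llm_rerank: bool = False     # LLM re-ranking of the candidate pool

RUN_F = RunConfig()                                   # standard retrieval
RUN_H = RunConfig(deep_recall=True, llm_rerank=True)  # Conv 0 experiment
```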

            Overall   Single-hop   Multi-hop   Temporal
  Conv 0     40.5%      44.9%       43.2%       33.3%
  Conv 1     44.1%      44.9%       49.0%       39.3%
  Conv 2     43.3%      46.3%       46.9%       36.7%
  Conv 3     42.5%      43.8%       47.3%       36.0%
  Conv 4     41.9%      43.8%       48.2%       34.7%
  Conv 5     45.3%      46.1%       51.2%       39.3%
  Conv 6     38.9%      40.4%       43.6%       33.3%
  Conv 7     42.1%      44.9%       47.5%       34.0%
  Conv 8     41.8%      42.7%       46.9%       36.0%
  Conv 9     43.6%      44.9%       46.9%       40.0%
  Average    42.4%      44.2%       47.1%       36.7%

Run H (Conv 0 only — deep recall + LLM re-ranking)

The jump from deep recall + re-ranking is dramatic, especially for multi-hop. Full 10-conversation Run H results are in progress.

The biggest gains came from improving retrieval quality, not from changing the answer-generation prompt or the decay model. The three-stage pipeline (embedding recall → R^α scoring → LLM re-ranking) was the key breakthrough.
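The three stages can be sketched as below. This is a minimal illustration under stated assumptions: the R^α scoring rule is assumed here to weight cosine similarity by a recency factor raised to an exponent α (the real formula may differ), and `llm_rerank` is a stub standing in for an actual LLM call.

```python
# Sketch of the three-stage retrieval pipeline. The R^alpha scoring rule
# and all memory fields here are assumptions, not the actual implementation.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query_vec, memories, k=60):
    """Stage 1: embedding recall — top-k candidates by cosine similarity."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]),
                    reverse=True)
    return ranked[:k]

def score(query_vec, candidates, alpha=0.5):
    """Stage 2: assumed R^alpha scoring — similarity weighted by recency."""
    return sorted(candidates,
                  key=lambda m: cosine(query_vec, m["vec"]) * (m["recency"] ** alpha),
                  reverse=True)

def llm_rerank(question, candidates, top_n=10):
    """Stage 3: LLM re-ranking (stubbed; a real system would prompt an LLM)."""
    return candidates[:top_n]

memories = [
    {"text": "went hiking", "vec": [1.0, 0.1], "recency": 0.9},
    {"text": "bought boots", "vec": [0.8, 0.5], "recency": 0.4},
]
query = [1.0, 0.0]
pipeline = llm_rerank("q", score(query, recall(query, memories)))
```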

Broader recall helps more than better scoring

Going from k=10 (Run A) to k=60 (Run F) added +14.2pp overall. The extra candidates contain relevant memories that rank poorly by embedding similarity alone but are critical for answering the question.

Deep recall expands the candidate pool with superseded originals. Re-ranking filters the noise this creates. Neither works as well alone:

  • Deep recall without re-ranking: adds noise, minimal improvement
  • Re-ranking without deep recall: helps, but misses consolidated details
  • Both together: +7.3pp overall, +16.1pp multi-hop
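The expansion step can be sketched as follows, assuming each memory may reference the earlier version it replaced via a `supersedes` pointer; the field names are hypothetical, and the re-ranking that filters the expanded pool is omitted here.

```python
# Minimal sketch of deep recall: re-add the superseded originals behind
# each candidate so details lost in consolidation become retrievable.
# The `id`/`supersedes` fields are hypothetical, not the actual schema.
def deep_recall(candidates, store):
    """Expand candidates with the chain of memories they superseded."""
    pool = list(candidates)
    seen = {m["id"] for m in pool}
    for m in candidates:
        prev_id = m.get("supersedes")
        while prev_id is not None and prev_id not in seen:
            prev = store[prev_id]
            pool.append(prev)
            seen.add(prev_id)
            prev_id = prev.get("supersedes")
    return pool

store = {
    1: {"id": 1, "text": "lives in Boston", "supersedes": None},
    2: {"id": 2, "text": "moved to Denver", "supersedes": 1},
}
# The superseded Boston fact returns to the pool for the re-ranker to judge.
expanded = deep_recall([store[2]], store)
```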

Temporal accuracy (36.7% on average) consistently lags the other categories. Questions like “when did the user go hiking?” require precise date extraction and retrieval, which embedding similarity handles poorly. This is an area for future improvement.
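One plausible direction, sketched below under stated assumptions: attach an explicit event date to each memory at extraction time so temporal questions can be answered from metadata rather than embedding similarity. The fields and the ISO-date regex are illustrative only.

```python
# Illustrative sketch: recover an explicit event date per memory so that
# "when did X happen?" does not depend on embedding similarity alone.
import re
from datetime import date

def event_date(memory):
    """Pull an ISO date from a memory's metadata or text, if one exists."""
    if "date" in memory:
        return memory["date"]
    m = re.search(r"\d{4}-\d{2}-\d{2}", memory["text"])
    return date.fromisoformat(m.group()) if m else None

mems = [
    {"text": "user went hiking on 2023-05-14"},
    {"text": "user likes trail mix"},
]
dates = [event_date(m) for m in mems]
```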

All results use:

  • GPT-4o for answer evaluation (LLM-as-judge)
  • GPT-4o-mini for memory extraction
  • text-embedding-3-small for embeddings (1536 dimensions)
  • Temperature 0 for deterministic answers
  • Ground truth from the LoCoMo dataset
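The evaluation loop can be sketched as below, with the judge call stubbed out; a real harness would send the question, gold answer, and model answer to GPT-4o at temperature 0 and parse its verdict. The exact-match stand-in here is only for illustration.

```python
# Sketch of the LLM-as-judge scoring loop. `judge` is a stub: the actual
# setup prompts GPT-4o at temperature 0 rather than using exact match.
def judge(question, gold, predicted):
    """Stub stand-in for the GPT-4o judge: case-insensitive exact match."""
    return gold.strip().lower() == predicted.strip().lower()

def accuracy(examples):
    """Fraction of (question, gold, predicted) triples the judge accepts."""
    correct = sum(judge(q, g, p) for q, g, p in examples)
    return correct / len(examples)

acc = accuracy([
    ("Where does the user live?", "Denver", "Denver"),
    ("When did they hike?", "May 14", "June 2"),
])
```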

See Methodology for full details.