Multi-hop Results

Multi-hop questions require combining information from two or more stored memories to produce an answer. They test whether a memory system can connect related facts across different conversations and time periods.

Single-hop example:

  • Q: “What is Alex’s job?”
  • Requires: one memory (“Alex is a software engineer”)

Multi-hop example:

  • Q: “What does Sarah’s brother do for a living?”
  • Requires: “Sarah has a brother named Tom” + “Tom works as a dentist”

Multi-hop is where most memory systems fail: they retrieve only the single memory most similar to the query, miss the connecting facts, and produce incomplete or wrong answers.

*Run H is conversation 0 only.

Cognitive Memory’s multi-hop score beats:

  • Mem0 by 66% (Run F) / 108% (Run H)
  • FadeMem by 60% (Run F) / 102% (Run H)
  • Naive RAG by 214% (Run F)

When “Sarah has a brother named Tom” and “Tom works as a dentist” are encoded in the same conversation, synaptic tagging creates a link between them. Later, when the query “What does Sarah’s brother do?” retrieves the Sarah memory, the association activates the Tom memory too.

Without associations, the query embedding is semantically close to “Sarah” and “brother” but not to “dentist”, so a narrow vector search on its own would be unlikely to surface the Tom memory.
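The linking behaviour described above can be illustrated with a minimal sketch. It is not Cognitive Memory's implementation: the `AssociativeStore` class, its method names, and the word-overlap matching (a crude stand-in for vector search) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def words(text):
    """Lowercase word set; a crude stand-in for embedding similarity."""
    return set(re.findall(r"[a-z]+", text.lower()))


class AssociativeStore:
    """Hypothetical toy store: memories from one conversation get linked."""

    def __init__(self):
        self.memories = []             # list index doubles as the memory id
        self.links = defaultdict(set)  # memory id -> ids of associated memories

    def encode_conversation(self, texts):
        """Store the memories of one conversation and link every pair of them."""
        start = len(self.memories)
        self.memories.extend(texts)
        for a, b in combinations(range(start, len(self.memories)), 2):
            self.links[a].add(b)
            self.links[b].add(a)

    def recall(self, query, follow_links=True):
        """Find memories sharing a word with the query, then follow links one step."""
        query_words = words(query)
        seeds = [i for i, text in enumerate(self.memories) if query_words & words(text)]
        result = list(seeds)
        if follow_links:
            for i in seeds:
                for j in sorted(self.links[i]):
                    if j not in result:
                        result.append(j)
        return [self.memories[i] for i in result]


store = AssociativeStore()
store.encode_conversation(["Sarah has a brother named Tom", "Tom works as a dentist"])
store.encode_conversation(["Alex is a software engineer"])
query = "What does Sarah's brother do?"

direct_only = store.recall(query, follow_links=False)
with_links = store.recall(query)
```

In this toy setup `direct_only` contains just the Sarah memory, while `with_links` also contains the Tom memory, because the two were encoded in the same conversation.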

Consolidated summaries might merge “Tom works as a dentist” into a broader summary that loses the specific profession. Deep recall brings back the original memory, ensuring the specific answer is available.
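One way to picture deep recall is a summary that keeps pointers to the memories it was built from. The page does not say how Cognitive Memory stores this relationship, so the `Summary` type, its `source_ids` field, and the `deep_recall` function below are hypothetical names in a sketch under that assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Hypothetical consolidated summary that remembers its source memories."""
    text: str
    source_ids: list = field(default_factory=list)


def deep_recall(summary, archive):
    """Return the original memories behind a summary, skipping missing ids."""
    return [archive[i] for i in summary.source_ids if i in archive]


# Assumed archive of original memories, keyed by an illustrative id.
archive = {
    1: "Sarah has a brother named Tom",
    2: "Tom works as a dentist",
}
summary = Summary(text="Sarah's brother Tom works in healthcare", source_ids=[1, 2])

originals = deep_recall(summary, archive)
```

Here the summary text has lost the word "dentist", but `originals` still includes the memory that states the profession.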

Broad candidate recall catches weak matches

With k=60, the system retrieves a large candidate pool. Multi-hop questions often have weak similarity to one of the required memories — “What does Sarah’s brother do?” is semantically distant from “Tom works as a dentist.” Broad recall catches these weak but critical matches.
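The effect of a larger k can be shown with a small top-k search over cosine similarity. The two-dimensional vectors below are hand-picked toy values, not real embeddings, and `top_k` is an illustrative helper rather than the system's retrieval API; only k=60 comes from the text above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, memories, k):
    """Return the texts of the k memories most similar to the query."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors (assumed): the dentist memory is the weakest match to the query.
memories = [
    ("Sarah has a brother named Tom", [0.9, 0.1]),
    ("Tom works as a dentist", [0.3, 0.7]),
    ("Alex is a software engineer", [0.5, 0.5]),
]
query_vec = [1.0, 0.0]  # stands in for "What does Sarah's brother do?"

narrow = top_k(query_vec, memories, k=1)
broad = top_k(query_vec, memories, k=60)
```

With k=1 only the Sarah memory is returned. With k=60 the pool also includes the weakly matching dentist memory, along with an unrelated memory that a later stage has to filter out.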

The broad candidate pool contains noise. The LLM re-ranker (Run H) evaluates each candidate’s relevance to the specific question, filtering out false positives and promoting true positives. This is particularly effective for multi-hop because the re-ranker can recognize that “Tom works as a dentist” is relevant to “What does Sarah’s brother do?” even though the embedding similarity is low.
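The filtering step can be sketched as a re-ranker that takes a pluggable scoring function. In a real system that function would ask an LLM to judge relevance; here it is replaced by a stub with hand-picked scores so the example runs offline. The names `rerank` and `stub_llm_score` and the `keep` and `min_score` parameters are assumptions, not documented settings.

```python
def rerank(question, candidates, score_fn, keep=5, min_score=0.5):
    """Score each candidate, drop low scorers, return the best `keep` texts."""
    scored = [(score_fn(question, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:keep]]


def stub_llm_score(question, candidate):
    """Placeholder for an LLM relevance judgment; scores are hand-picked."""
    hand_scores = {
        "Sarah has a brother named Tom": 0.9,
        "Tom works as a dentist": 0.8,
    }
    return hand_scores.get(candidate, 0.1)


# A noisy candidate pool, as produced by broad recall.
pool = [
    "Alex is a software engineer",
    "Tom works as a dentist",
    "Sarah has a brother named Tom",
]
reranked = rerank("What does Sarah's brother do?", pool, stub_llm_score)
```

The unrelated memory falls below `min_score` and is dropped, while both memories needed for the two-hop answer are kept and ordered by score.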

In the LoCoMo benchmark, questions are categorized by the number of hops required:

The improvement is largest for 3+ hop questions — exactly where associative linking and deep recall provide the most value.

Multi-hop is the acid test for memory systems. Any system can retrieve a single relevant document. The hard part is connecting related facts across time, conversations, and topics. Cognitive Memory’s association graph and deep recall mechanisms are purpose-built for this challenge.