MEMENTO: LLMs Learn to Manage Context for Efficiency

Source: ArXiv cs.AI · Original authors: Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos · 2 min read · Intelligence Analysis by Gemini

Signal Summary

MEMENTO teaches LLMs to compress reasoning into mementos, significantly reducing context and KV cache.

Explain Like I'm Five

"Imagine your brain trying to remember everything you've ever thought about a topic. It gets messy! MEMENTO is like teaching your brain to write down only the most important ideas from each thought session, so you only need to look at those summaries to keep thinking clearly, saving brainpower."

Original Reporting
ArXiv cs.AI

Read the original article for full context.


Deep Intelligence Analysis

The MEMENTO method introduces a transformative approach to Large Language Model (LLM) context management, directly addressing the escalating computational and memory demands associated with processing long, unstructured reasoning streams. By teaching models to segment their internal reasoning into discrete blocks and compress each block into a "memento"—a dense state summary—MEMENTO enables LLMs to reason forward by attending only to these summaries. This innovative self-management mechanism fundamentally reduces the required context length, Key-Value (KV) cache size, and overall computational load, which are critical bottlenecks limiting the scalability and efficiency of current LLM architectures.
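The block-and-memento loop described above can be sketched in a few lines of Python. This is a hypothetical illustration of the control flow only, not the paper's implementation: the `summarize` stand-in and the `block_size` parameter are assumptions made for the example, whereas in MEMENTO the memento is generated by the model itself.

```python
# Hypothetical sketch of MEMENTO-style context management (not the paper's code).
# The reasoning stream is split into fixed-size blocks; each finished block is
# replaced by a short "memento", so forward reasoning attends only to the
# accumulated mementos plus the current, still-open block.

def summarize(block: list[str]) -> str:
    # Stand-in for a model-generated memento: keep the block's final line,
    # which in chain-of-thought traces often states the block's conclusion.
    return block[-1]

def manage_context(reasoning_steps: list[str], block_size: int = 4) -> list[str]:
    mementos: list[str] = []
    current: list[str] = []
    for step in reasoning_steps:
        current.append(step)
        if len(current) == block_size:
            mementos.append(summarize(current))  # compress the finished block
            current = []                         # drop its raw steps
    # The visible context is all mementos plus the unfinished block.
    return mementos + current

steps = [f"step {i}" for i in range(10)]
print(manage_context(steps))  # ['step 3', 'step 7', 'step 8', 'step 9']
```

The key efficiency property is visible even in this toy: the context grows with the number of blocks (one memento each) rather than with the number of raw reasoning steps.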

The practical efficacy of MEMENTO is supported by the release of OpenMementos, a substantial public dataset comprising 228,000 reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. This dataset supports a two-stage Supervised Fine-Tuning (SFT) recipe shown to be effective across a range of model families, including Qwen3, Phi-4, and OLMo 3, and across scales from 8B to 32B parameters. The results are compelling: trained models maintain strong accuracy on challenging math, science, and coding benchmarks while achieving an approximately 2.5x peak KV cache reduction. Furthermore, extending vLLM to support this inference method yielded an approximate 1.75x throughput improvement, demonstrating tangible gains in operational efficiency.
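To make the ~2.5x peak KV cache figure concrete, here is back-of-the-envelope arithmetic for KV cache memory. The model dimensions (32 layers, 4096 hidden size, fp16) and sequence lengths are illustrative assumptions, not numbers from the paper; the point is that KV cache memory scales linearly with effective context length, so a 2.5x shorter attended context means 2.5x less cache.

```python
# Back-of-the-envelope KV cache arithmetic (illustrative numbers, not from
# the paper). Bytes per token = 2 (K and V) * layers * hidden_dim * dtype size.
def kv_cache_gb(seq_len: int, layers: int = 32, hidden: int = 4096,
                dtype_bytes: int = 2) -> float:
    return 2 * layers * hidden * dtype_bytes * seq_len / 1024**3

full = kv_cache_gb(32_000)        # a long raw reasoning trace
compressed = kv_cache_gb(12_800)  # ~2.5x shorter effective context
print(f"{full:.2f} GB -> {compressed:.2f} GB")  # 15.62 GB -> 6.25 GB
```

Because the saving is per request, it compounds across a serving batch, which is where the reported throughput gain comes from.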

A crucial insight from this research is the identification of a "dual information stream," where information from each reasoning block is carried not only by the memento text but also implicitly by the corresponding KV states. The removal of this implicit channel resulted in a significant 15 percentage point drop in accuracy on the AIME24 benchmark, underscoring the subtle complexities of context compression. This finding highlights that effective context management requires more than just textual summarization; it necessitates preserving the latent information encoded within the model's internal states. MEMENTO's ability to achieve substantial efficiency gains while maintaining or even improving accuracy, coupled with its novel approach to context compression, positions it as a potential standard architectural component for future high-performance LLMs, enabling more sophisticated and sustainable AI applications across diverse domains.
AI-assisted intelligence report · EU AI Act Art. 50 compliant

Visual Intelligence

flowchart LR
    A[Long Reasoning Stream] --> B[Segment into Blocks];
    B --> C[Compress Block];
    C --> D[Generate Memento];
    D --> E[Store Memento & KV State];
    E --> F[Reason Forward];
    F --> G[Attend to Mementos];
    G --> B;

Auto-generated diagram · AI-interpreted flow

Impact Assessment

This method directly addresses a critical bottleneck in large language models: the escalating computational cost and memory footprint associated with long context windows. By enabling LLMs to self-manage and compress their internal reasoning state, MEMENTO promises significant efficiency gains, making larger, more complex reasoning tasks feasible.

Key Details

  • MEMENTO teaches LLMs to segment reasoning into blocks and compress them into "mementos."
  • Mementos are dense state summaries used for forward reasoning.
  • Achieves ~2.5x peak KV cache reduction.
  • Achieves ~1.75x throughput improvement with vLLM extension.
  • OpenMementos dataset released: 228K reasoning traces from OpenThoughts-v3.
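The details above connect directly in serving terms: a smaller per-request KV cache lets more requests share a fixed GPU memory budget. The sketch below uses hypothetical numbers (a 40 GB cache budget and a 15.6 GB per-request cache) purely to show the mechanism; none of these figures come from the paper.

```python
# Illustrative serving math (hypothetical GPU budget, not from the paper):
# with a fixed KV-cache memory budget, a ~2.5x smaller per-request cache
# lets the server batch ~2.5x more concurrent requests.
def max_batch(budget_gb: float, per_request_gb: float) -> int:
    return int(budget_gb // per_request_gb)

budget = 40.0                                   # hypothetical KV-cache budget
baseline = max_batch(budget, 15.6)              # raw long-context requests
with_mementos = max_batch(budget, 15.6 / 2.5)   # memento-compressed requests
print(baseline, with_mementos)  # 2 6
```

Larger batches keep the GPU busier, which is consistent with (though not a derivation of) the reported ~1.75x throughput improvement.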

Optimistic Outlook

MEMENTO could unlock new capabilities for LLMs by allowing them to maintain coherence over much longer interactions and complex reasoning chains without prohibitive resource demands. This efficiency gain could lead to more powerful, accessible, and sustainable AI applications, accelerating advancements in areas like scientific discovery, complex problem-solving, and personalized AI assistants.

Pessimistic Outlook

The dual information stream identified, where KV states retain implicit information even after text compression, suggests that simply compressing text might lead to information loss if not handled carefully. Over-compression or flawed memento generation could degrade reasoning accuracy, potentially introducing subtle errors that are difficult to diagnose in complex, long-context tasks.
