MEMENTO: LLMs Learn to Manage Context for Efficiency
Sonic Intelligence
MEMENTO teaches LLMs to compress their own reasoning into short "mementos," substantially reducing context length and peak KV-cache usage.
Explain Like I'm Five
"Imagine your brain trying to remember everything you've ever thought about a topic. It gets messy! MEMENTO is like teaching your brain to write down only the most important ideas from each thought session, so you only need to look at those summaries to keep thinking clearly, saving brainpower."
Deep Intelligence Analysis
MEMENTO's practical efficacy is supported by the release of OpenMementos, a public dataset of 228,000 reasoning traces derived from OpenThoughts-v3, segmented into blocks and annotated with intermediate summaries. The dataset underpins a two-stage Supervised Fine-Tuning (SFT) recipe shown to be effective across model families, including Qwen3, Phi-4, and Olmo 3, and across scales from 8B to 32B parameters. Trained models maintain strong accuracy on challenging math, science, and coding benchmarks while cutting peak KV cache by roughly 2.5x. An extension to vLLM supporting this inference pattern yields an approximately 1.75x throughput improvement, demonstrating tangible operational gains.
A crucial insight from this research is the identification of a "dual information stream," where information from each reasoning block is carried not only by the memento text but also implicitly by the corresponding KV states. The removal of this implicit channel resulted in a significant 15 percentage point drop in accuracy on the AIME24 benchmark, underscoring the subtle complexities of context compression. This finding highlights that effective context management requires more than just textual summarization; it necessitates preserving the latent information encoded within the model's internal states. MEMENTO's ability to achieve substantial efficiency gains while maintaining or even improving accuracy, coupled with its novel approach to context compression, positions it as a potential standard architectural component for future high-performance LLMs, enabling more sophisticated and sustainable AI applications across diverse domains.
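The loop the paper describes can be illustrated with a minimal toy sketch. Everything below is hypothetical: the function names, the `compress` heuristic, and the token-count proxy for KV state are illustrative stand-ins, not the paper's implementation. The `keep_kv` flag gestures at the dual information stream, where block KV states are retained alongside the memento text.

```python
def reason_with_mementos(blocks, compress, keep_kv=True):
    """Toy sketch of memento-style context management (illustrative only).

    `blocks` are reasoning segments; `compress` maps a block to a short
    memento string. The live context keeps only the mementos (plus,
    optionally, a stand-in for each block's KV states), rather than
    accumulating the full block text.
    """
    context = []   # what the model would attend to going forward
    kv_cache = []  # stand-in for retained per-block KV states
    for block in blocks:
        memento = compress(block)
        context.append(memento)
        if keep_kv:
            kv_cache.append(len(block))  # proxy: KV size scales with tokens
        # the full `block` text is dropped here instead of accumulating
    return context, kv_cache

blocks = [
    "step one: expand the product ...",
    "step two: collect like terms ...",
    "step three: verify the bound ...",
]
context, kv = reason_with_mementos(blocks, compress=lambda b: b.split(":")[0])
```

After the loop, forward reasoning attends only to `context` (and, in the paper's setup, the retained KV states), which is what keeps the peak cache bounded.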
Visual Intelligence
flowchart LR
A[Long Reasoning Stream] --> B[Segment into Blocks];
B --> C[Compress Block];
C --> D[Generate Memento];
D --> E[Store Memento & KV State];
E --> F[Reason Forward];
F --> G[Attend to Mementos];
G --> B;
Impact Assessment
This method directly addresses a critical bottleneck in large language models: the escalating computational cost and memory footprint associated with long context windows. By enabling LLMs to self-manage and compress their internal reasoning state, MEMENTO promises significant efficiency gains, making larger, more complex reasoning tasks feasible.
Key Details
- MEMENTO teaches LLMs to segment reasoning into blocks and compress them into "mementos."
- Mementos are dense state summaries used for forward reasoning.
- Achieves ~2.5x peak KV cache reduction.
- Achieves ~1.75x throughput improvement with vLLM extension.
- OpenMementos dataset released: 228K reasoning traces from OpenThoughts-v3.
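The ~2.5x peak KV-cache figure can be sanity-checked with back-of-envelope arithmetic. All token counts below are hypothetical, chosen only to show how memento compression bounds the peak cache; they are not numbers from the paper.

```python
# Hypothetical token budgets (illustrative, not from the paper).
full_trace_tokens = 20_000       # entire reasoning trace kept in context
memento_tokens_per_block = 600   # compressed summary retained per block
num_blocks = 10
active_block_tokens = 2_000      # current block, still fully in the cache

# Baseline: peak cache holds the whole trace.
peak_full = full_trace_tokens

# Memento-style: peak cache holds all mementos plus one active block.
peak_memento = num_blocks * memento_tokens_per_block + active_block_tokens

reduction = peak_full / peak_memento  # 2.5 with these toy numbers
```

The point of the arithmetic is that peak cache grows with the number of mementos, not with the full trace length, so longer reasoning widens the gap.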
Optimistic Outlook
MEMENTO could unlock new capabilities for LLMs by allowing them to maintain coherence over much longer interactions and complex reasoning chains without prohibitive resource demands. This efficiency gain could lead to more powerful, accessible, and sustainable AI applications, accelerating advancements in areas like scientific discovery, complex problem-solving, and personalized AI assistants.
Pessimistic Outlook
The dual information stream, in which KV states retain implicit information even after the text is compressed, means that text-only compression can silently discard information the model still relies on. Over-compression or flawed memento generation could degrade reasoning accuracy, introducing subtle errors that are difficult to diagnose in complex, long-context tasks.