MEMENTO: LLMs Learn to Manage Context for Efficiency

Source: ArXiv cs.AI · Original authors: Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos · 2 min read · Intelligence Analysis by Gemini

Signal Summary

MEMENTO teaches LLMs to compress reasoning into mementos, significantly reducing context and KV cache.

Explain Like I'm Five

"Imagine your brain trying to remember everything you've ever thought about a topic. It gets messy! MEMENTO is like teaching your brain to write down only the most important ideas from each thought session, so you only need to look at those summaries to keep thinking clearly, saving brainpower."

Original Reporting
ArXiv cs.AI

Read the original article for full context.


Deep Intelligence Analysis

The MEMENTO method introduces a transformative approach to Large Language Model (LLM) context management, directly addressing the escalating computational and memory demands associated with processing long, unstructured reasoning streams. By teaching models to segment their internal reasoning into discrete blocks and compress each block into a "memento"—a dense state summary—MEMENTO enables LLMs to reason forward by attending only to these summaries. This innovative self-management mechanism fundamentally reduces the required context length, Key-Value (KV) cache size, and overall computational load, which are critical bottlenecks limiting the scalability and efficiency of current LLM architectures.
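The block-and-memento loop described above can be sketched in a few lines of Python. This is a hypothetical illustration of the control flow only, not the paper's implementation: the `summarize` stand-in and the `block_size` parameter are assumptions made for the example, whereas in MEMENTO the memento is generated by the model itself.

```python
# Hypothetical sketch of MEMENTO-style context management (not the paper's code).
# The reasoning stream is split into fixed-size blocks; each finished block is
# replaced by a short "memento", so forward reasoning attends only to the
# accumulated mementos plus the current, still-open block.

def summarize(block: list[str]) -> str:
    # Stand-in for a model-generated memento: keep the block's final line,
    # which in chain-of-thought traces often states the block's conclusion.
    return block[-1]

def manage_context(reasoning_steps: list[str], block_size: int = 4) -> list[str]:
    mementos: list[str] = []
    current: list[str] = []
    for step in reasoning_steps:
        current.append(step)
        if len(current) == block_size:
            mementos.append(summarize(current))  # compress the finished block
            current = []                         # drop its raw steps
    # The visible context is all mementos plus the unfinished block.
    return mementos + current

steps = [f"step {i}" for i in range(10)]
print(manage_context(steps))  # ['step 3', 'step 7', 'step 8', 'step 9']
```

The key efficiency property is visible even in this toy: the context grows with the number of blocks (one memento each) rather than with the number of raw reasoning steps.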

The practical efficacy of MEMENTO is supported by the release of OpenMementos, a substantial public dataset comprising 228,000 reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. This dataset supports a two-stage Supervised Fine-Tuning (SFT) recipe shown to be effective across a range of model families, including Qwen3, Phi-4, and OLMo 3, and across scales from 8B to 32B parameters. The results are compelling: trained models maintain strong accuracy on challenging math, science, and coding benchmarks while achieving an approximately 2.5x peak KV cache reduction. Furthermore, extending vLLM to support this inference method yielded an approximate 1.75x throughput improvement, demonstrating tangible gains in operational efficiency.
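To make the ~2.5x peak KV cache figure concrete, here is back-of-the-envelope arithmetic for KV cache memory. The model dimensions (32 layers, 4096 hidden size, fp16) and sequence lengths are illustrative assumptions, not numbers from the paper; the point is that KV cache memory scales linearly with effective context length, so a 2.5x shorter attended context means 2.5x less cache.

```python
# Back-of-the-envelope KV cache arithmetic (illustrative numbers, not from
# the paper). Bytes per token = 2 (K and V) * layers * hidden_dim * dtype size.
def kv_cache_gb(seq_len: int, layers: int = 32, hidden: int = 4096,
                dtype_bytes: int = 2) -> float:
    return 2 * layers * hidden * dtype_bytes * seq_len / 1024**3

full = kv_cache_gb(32_000)        # a long raw reasoning trace
compressed = kv_cache_gb(12_800)  # ~2.5x shorter effective context
print(f"{full:.2f} GB -> {compressed:.2f} GB")  # 15.62 GB -> 6.25 GB
```

Because the saving is per request, it compounds across a serving batch, which is where the reported throughput gain comes from.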

A crucial insight from this research is the identification of a "dual information stream," where information from each reasoning block is carried not only by the memento text but also implicitly by the corresponding KV states. The removal of this implicit channel resulted in a significant 15 percentage point drop in accuracy on the AIME24 benchmark, underscoring the subtle complexities of context compression. This finding highlights that effective context management requires more than just textual summarization; it necessitates preserving the latent information encoded within the model's internal states. MEMENTO's ability to achieve substantial efficiency gains while maintaining or even improving accuracy, coupled with its novel approach to context compression, positions it as a potential standard architectural component for future high-performance LLMs, enabling more sophisticated and sustainable AI applications across diverse domains.
AI-assisted intelligence report · EU AI Act Art. 50 compliant

Visual Intelligence

flowchart LR
    A[Long Reasoning Stream] --> B[Segment into Blocks];
    B --> C[Compress Block];
    C --> D[Generate Memento];
    D --> E[Store Memento & KV State];
    E --> F[Reason Forward];
    F --> G[Attend to Mementos];
    G --> B;

Auto-generated diagram · AI-interpreted flow

Impact Assessment

This method directly addresses a critical bottleneck in large language models: the escalating computational cost and memory footprint associated with long context windows. By enabling LLMs to self-manage and compress their internal reasoning state, MEMENTO promises significant efficiency gains, making larger, more complex reasoning tasks feasible.

Key Details

  • MEMENTO teaches LLMs to segment reasoning into blocks and compress them into "mementos."
  • Mementos are dense state summaries used for forward reasoning.
  • Achieves ~2.5x peak KV cache reduction.
  • Achieves ~1.75x throughput improvement with vLLM extension.
  • OpenMementos dataset released: 228K reasoning traces from OpenThoughts-v3.
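The details above connect directly in serving terms: a smaller per-request KV cache lets more requests share a fixed GPU memory budget. The sketch below uses hypothetical numbers (a 40 GB cache budget and a 15.6 GB per-request cache) purely to show the mechanism; none of these figures come from the paper.

```python
# Illustrative serving math (hypothetical GPU budget, not from the paper):
# with a fixed KV-cache memory budget, a ~2.5x smaller per-request cache
# lets the server batch ~2.5x more concurrent requests.
def max_batch(budget_gb: float, per_request_gb: float) -> int:
    return int(budget_gb // per_request_gb)

budget = 40.0                                   # hypothetical KV-cache budget
baseline = max_batch(budget, 15.6)              # raw long-context requests
with_mementos = max_batch(budget, 15.6 / 2.5)   # memento-compressed requests
print(baseline, with_mementos)  # 2 6
```

Larger batches keep the GPU busier, which is consistent with (though not a derivation of) the reported ~1.75x throughput improvement.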

Optimistic Outlook

MEMENTO could unlock new capabilities for LLMs by allowing them to maintain coherence over much longer interactions and complex reasoning chains without prohibitive resource demands. This efficiency gain could lead to more powerful, accessible, and sustainable AI applications, accelerating advancements in areas like scientific discovery, complex problem-solving, and personalized AI assistants.

Pessimistic Outlook

The dual information stream identified, where KV states retain implicit information even after text compression, suggests that simply compressing text might lead to information loss if not handled carefully. Over-compression or flawed memento generation could degrade reasoning accuracy, potentially introducing subtle errors that are difficult to diagnose in complex, long-context tasks.
