Odysseus Scales VLMs for 100+ Turn Decision-Making in Games
Sonic Intelligence
Odysseus framework enables VLMs to achieve 100+ turn decision-making in complex games.
Explain Like I'm Five
"Imagine teaching a robot to play a very long video game like Super Mario. Old ways of teaching only worked for short parts. This new way, called Odysseus, helps the robot learn to play for a super long time, over 100 moves, and get much better at the game than other robots."
Deep Intelligence Analysis
Central to Odysseus's success is a systematic investigation of key algorithmic components, which leads to an adapted variant of Proximal Policy Optimization (PPO) incorporating a lightweight turn-level critic. This adaptation significantly improves training stability and sample efficiency over critic-free methods. The framework also leverages pretrained VLMs to provide strong action priors, which further improves sample efficiency during RL training and reduces the need for extensive manual design choices, such as action engineering, a common burden in classical deep RL agents trained from scratch. The system demonstrated substantial gains across multiple levels of Super Mario Land, achieving at least three times the average game progress of frontier models, alongside consistent improvements in both in-game and cross-game generalization while retaining general-domain VLM capabilities.
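The paper's exact critic design is not reproduced here, but the core idea of pairing a clipped PPO objective with turn-level value estimates can be sketched in a few lines. The function names (`turn_level_gae`, `ppo_clip_loss`) and all hyperparameter values below are illustrative assumptions, not the authors' implementation:

```python
import math

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation computed at turn granularity.

    rewards[t] is the scalar reward observed after turn t; values is the
    critic's state-value estimate per turn, with one extra bootstrap entry
    (len(values) == len(rewards) + 1). A lightweight turn-level critic
    supplying `values` is what stabilizes the advantage estimates.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at turn t, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard clipped PPO surrogate loss for one turn-level action."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    # Negated because optimizers minimize; PPO maximizes the surrogate.
    return -min(unclipped, clipped)
```

In this sketch, treating each full model turn (one multimodal observation in, one action out) as the unit of credit assignment is what distinguishes a turn-level critic from a per-token one.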
The implications are far-reaching for the development of embodied AI agents. Odysseus identifies key ingredients for making RL stable and effective in multimodal, long-horizon settings. This research provides practical guidance for building VLMs that can perform complex, sequential tasks, moving beyond simple classification or short-term interaction. This could accelerate the development of AI for robotics, autonomous navigation, and intelligent assistants capable of sustained, goal-oriented behavior in dynamic and unpredictable environments.
Visual Intelligence
```mermaid
flowchart LR
    A["VLM Short-Horizon Limit"] --> B["Odysseus Framework"]
    B --> C["Adapted PPO Variant"]
    C --> D["Lightweight Turn Critic"]
    D --> E["Pretrained VLM Priors"]
    E --> F["Improved Sample Efficiency"]
    F --> G["100+ Turn Decision-Making"]
    G --> H["Enhanced Game Progress"]
```
Impact Assessment
Extending VLMs to long-horizon, interactive decision-making tasks like video games is a significant frontier. Odysseus demonstrates a robust method for achieving this, overcoming limitations of previous RL approaches and potentially paving the way for more capable embodied AI agents.
Key Details
- Odysseus scales Vision-Language Models (VLMs) to 100+ turn decision-making.
- Utilizes an adapted PPO variant with a lightweight turn-level critic.
- Achieves substantial gains across multiple game levels in Super Mario Land.
- Demonstrates at least 3x the average game progress of frontier models.
- Pretrained VLMs provide strong action priors, improving RL sample efficiency.
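On the last point, one common way a pretrained model's action prior can bootstrap RL exploration is to mix the model's action distribution with a small uniform component, so no action's probability collapses to zero early in training. This is a minimal sketch of that general technique, not the paper's method; the action set and `prior_guided_policy` helper are hypothetical:

```python
import math

# Hypothetical discrete action vocabulary for a platformer-style game.
ACTIONS = ["left", "right", "jump", "run", "noop"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def prior_guided_policy(vlm_logits, temperature=1.0, epsilon=0.05):
    """Turn the pretrained VLM's per-action logits into an exploration
    policy: its softmax prior, mixed with a uniform floor of weight
    epsilon so every action stays reachable during RL training."""
    probs = softmax([l / temperature for l in vlm_logits])
    k = len(probs)
    return [(1 - epsilon) * p + epsilon / k for p in probs]
```

Because the pretrained prior already concentrates probability on plausible actions (e.g. moving right in a side-scroller), sampling from this mixture explores far fewer hopeless trajectories than a from-scratch uniform policy, which is one plain reading of the sample-efficiency gain the article describes.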
Optimistic Outlook
Odysseus's success in long-horizon game environments suggests a powerful pathway for developing highly capable embodied AI agents. The framework's ability to leverage pretrained VLMs for strong priors could significantly accelerate RL training in complex, multimodal settings, leading to more intelligent and adaptable AI.
Pessimistic Outlook
While impressive in game environments, the generalization of these techniques to real-world, open-ended tasks remains a challenge. The specific adaptations to PPO and the critic might not translate directly to scenarios with less structured feedback or higher degrees of uncertainty, limiting broader applicability.