New Benchmark Reveals MLLM Agents Struggle with Ambiguous Website Generation
Sonic Intelligence
A new benchmark exposes 'blind execution' in MLLM agents for website generation.
Explain Like I'm Five
"Imagine you ask a super-smart robot to build you a website, but your instructions are a bit messy or unclear. This paper shows that even the smartest robots often just try to build something without asking questions, like they're 'blindly' following bad directions. So, they made a new game (benchmark) to help robots learn to ask for clarification and build better websites, but even the best robots still struggle!"
Deep Intelligence Analysis
The benchmark's design is innovative, incorporating four types of user agents and persona-driven instruction perturbations to systematically simulate the complexities of human communication, including ambiguity, redundancy, and contradiction. This approach, grounded in requirements engineering defect taxonomies, creates a more realistic and challenging environment for AI evaluation. Furthermore, the interactive execution environment features a unified action space (Clarify, Implement, Verify, Submit), enabling agents to engage in iterative intent refinement and validation against visual feedback. This interactive loop is essential for mimicking real-world development processes and moving beyond static code synthesis.
Experimental results from InteractWeb-Bench are revealing, indicating that even frontier MLLM-based agents remain trapped in blind execution. This exposes significant limitations in their intent recognition capabilities and adaptive interaction strategies. The implications are substantial for the future of AI-driven development: while MLLMs show promise in code generation, their practical utility for non-expert users will be severely constrained until they can robustly handle the inherent messiness of human language and engage in meaningful, iterative clarification. This benchmark serves as a crucial call to action for researchers to focus on developing agents with enhanced conversational intelligence and adaptive reasoning for truly collaborative human-AI development workflows.
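The interactive loop described above can be sketched as a simple dispatch over the unified action space. Only the four actions come from the benchmark description; `agent`, `user`, `env`, and every method name below are hypothetical stand-ins, not the benchmark's actual API.

```python
from enum import Enum

class Action(Enum):
    CLARIFY = "clarify"      # ask the simulated user agent a question
    IMPLEMENT = "implement"  # synthesize or revise website code
    VERIFY = "verify"        # render the page and inspect visual feedback
    SUBMIT = "submit"        # finalize the website

def run_episode(agent, user, env, max_steps=20):
    """Drive one interaction episode over the unified action space.

    `agent`, `user`, and `env` are assumed stand-ins for the MLLM
    agent, the simulated user agent, and the execution environment.
    """
    observation = user.initial_instruction()
    for _ in range(max_steps):
        action, payload = agent.act(observation)
        if action is Action.CLARIFY:
            observation = user.answer(payload)      # iterative intent refinement
        elif action is Action.IMPLEMENT:
            observation = env.apply_code(payload)   # update the generated site
        elif action is Action.VERIFY:
            observation = env.render_screenshot()   # visual feedback for validation
        elif action is Action.SUBMIT:
            return env.final_website()
    return env.final_website()  # fall back if the agent never submits
```

A "blindly executing" agent is one whose policy never emits `Action.CLARIFY`, jumping straight to `IMPLEMENT` and `SUBMIT` regardless of how ambiguous the initial instruction is.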
Visual Intelligence
```mermaid
flowchart LR
    A["User Instruction"] --> B["User Agent"]
    B --> C["Instruction Perturbation"]
    C --> D["MLLM Agent"]
    D -- "Action: Clarify" --> B
    D -- "Action: Implement" --> E["Code Synthesis"]
    E --> F["Visual Feedback"]
    F --> D
    D -- "Action: Verify" --> F
    D -- "Action: Submit" --> G["Website Output"]
```
Auto-generated diagram · AI-interpreted flow
Impact Assessment
This benchmark highlights a critical limitation in current multimodal large language models (MLLMs) when faced with real-world, ambiguous user instructions for website generation. It underscores the need for agents to move beyond static execution and develop robust interactive capabilities, directly impacting the practical utility and adoption of AI for low-code development.
Key Details
- Introduces InteractWeb-Bench, a multimodal interactive benchmark for website generation.
- Addresses 'blind execution' caused by semantic misalignment from ambiguous user instructions.
- Simulates diverse user behaviors using four user agents and persona-driven instruction perturbations.
- Interactive environment includes actions: Clarify, Implement, Verify, Submit.
- Experiments show frontier MLLM-based agents still exhibit blind execution.
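The three perturbation categories in the list above can be illustrated with toy string transforms. Only the defect categories (ambiguity, redundancy, contradiction) come from the benchmark description; the functions and the example instruction below are hypothetical.

```python
def make_ambiguous(instruction: str) -> str:
    # Drop a concrete detail so the request underspecifies intent.
    return instruction.replace("a blue navigation bar", "a nice navigation bar")

def make_redundant(instruction: str) -> str:
    # Restate an existing requirement in different words.
    return instruction + " Also, the page must include a navigation bar."

def make_contradictory(instruction: str) -> str:
    # Append a requirement that conflicts with an earlier one.
    return instruction + " Do not use any blue on the page."

base = "Build a landing page with a blue navigation bar."
perturbed = {
    "ambiguity": make_ambiguous(base),
    "redundancy": make_redundant(base),
    "contradiction": make_contradictory(base),
}
```

A well-calibrated agent should respond differently to each: ask which style is wanted (ambiguity), deduplicate silently (redundancy), and surface the conflict to the user (contradiction).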
Optimistic Outlook
By clearly identifying the 'blind execution' problem, InteractWeb-Bench provides a targeted challenge for AI researchers, fostering innovation in intent recognition and adaptive interaction for MLLM agents. This will lead to more robust and user-friendly AI development tools, democratizing website creation for non-expert users.
Pessimistic Outlook
The persistent 'blind execution' in frontier MLLMs suggests that achieving truly adaptive and context-aware AI agents for complex tasks remains a significant hurdle. Without substantial breakthroughs in handling ambiguity and interactive refinement, the promise of fully autonomous, user-friendly AI development tools may remain distant, limiting their real-world applicability.