Last 7 Days (June 02 – June 08, 2026)
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Primary: Northeastern University
All Institutions: Northeastern University, Shanghai Artificial Intelligence Laboratory
This work has significant broader impact for the development of robust and trustworthy multimodal AI systems. By providing a rigorous diagnostic framework and an effective, generalizable intervention, it directly addresses a critical safety and reliability concern in ALMs: their tendency to prioritize conflicting text over clear audio evidence. This is particularly important for agentic applications in sensitive domains like healthcare, emergency services, or legal assistance, where accurate interpretation of audio is paramount. The mechanistic understanding gained through causal localization offers a powerful new lens for analyzing internal decision-making in complex multimodal models, moving beyond black-box observations and fostering more interpretable AI. The demonstrated cross-modal transfer suggests that the principles of diagnosing and correcting arbitration failures using counterfactual references and logit correction could be broadly applicable across various multimodal AI systems (e.g., vision-language, video-language), paving the way for more faithful and reliable AI across the board. This research not only provides a practical solution but also advances our fundamental understanding of multimodal reasoning and conflict resolution in large models. This paper presents a highly novel diagnostic methodology and a mechanistically-informed, training-free decoding rule to address a critical arbitration failure in audio-language models, demonstrating significant performance gains and cross-modal generalizability. The rigorous causal analysis, coupled with a practical and effective solution, makes this a standout contribution to multimodal machine learning.
The methodology is exceptionally strong, building a coherent and rigorous chain from behavioral observation to mechanistic understanding and finally to an effective intervention. The core innovation is the "same-audio counterfactual" diagnostic, which uses two branches (joint audio-text vs. audio-only) to precisely distinguish between perceptual failure and arbitration failure in Audio-Language Models (ALMs). This elegant setup, coupled with signed log-probability margins, provides a clear quantitative signature of "repairable arbitration reversals." The paper then employs activation patching, a robust causal intervention technique, to localize the arbitration failure to the answer-position residual stream within the model's "commit window." This mechanistic finding is crucial, demonstrating that audio evidence is indeed encoded but overridden during the final decision-making process. A key methodological bridge is the discovery of a high Spearman correlation (0.93) between this internal patch-induced repair direction and the observable output score difference ($s_A - s_J$). This alignment is critical because it enables the development of an output-space intervention without requiring internal model access. The proposed Gated Audio Counterfactual Logit Correction (GACL) decoding rule is directly derived from these insights, incorporating a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation. Each component is mechanistically justified and contributes to the method's robustness and safety. The methodology is a prime example of interpretable ML research, moving beyond symptom identification to root cause analysis and targeted solution design.
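The decoding rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the source states only that GACL combines a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation between the joint scores $s_J$ and the same-audio scores $s_A$. The function name `gacl_scores`, the mixing weight `lam`, the confidence threshold `tau_conf`, and the concrete form of both gates are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gacl_scores(s_joint, s_audio, lam=0.5, tau_conf=0.6):
    """Return corrected candidate scores (hypothetical sketch of a gated rule).

    s_joint: candidate scores from the joint branch (audio + conflicting text).
    s_audio: candidate scores from the same-audio branch (conflicting text removed).
    lam:     interpolation weight in [0, 1]; keeps the result a convex combination.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(s_joint) != len(s_audio):
        raise ValueError("both branches must score the same candidates")

    top_joint = max(range(len(s_joint)), key=s_joint.__getitem__)
    top_audio = max(range(len(s_audio)), key=s_audio.__getitem__)

    # Gate 1 (assumed form): intervene only when the branches disagree.
    disagree = top_joint != top_audio
    # Gate 2 (assumed form): trust the audio reference only when it is confident.
    reliable = max(softmax(s_audio)) >= tau_conf

    if not (disagree and reliable):
        return list(s_joint)  # leave the joint decision untouched

    # Convex interpolation: each output lies between the two branch scores.
    return [(1.0 - lam) * j + lam * a for j, a in zip(s_joint, s_audio)]
```

With `s_joint = [-0.2, -1.8]` and `s_audio = [-2.5, -0.1]`, both gates open and the corrected scores favor the second candidate. When the branches already agree, or the audio branch is near uniform, the joint scores are returned unchanged, which is how a gate of this kind would protect faithful inputs.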
The experimental evaluation is comprehensive and rigorously designed. The authors evaluate GACL across five diverse open-weight ALMs (7B-30B parameters) and four distinct audio-text conflict tasks (AQA, VSC, SER, ALME) from established benchmarks (MCR-Bench, ALME). This broad coverage demonstrates the widespread nature of the "text-following" problem and the general applicability of GACL. The use of normalized AUC (nAUC) over a strict faithfulness-drop budget (e.g., 5 pp) is an excellent evaluation metric, realistically capturing the trade-off between conflict resolution and preserving accuracy on faithful inputs. GACL consistently outperforms strong contrastive decoding baselines (AAD, ACD) and the joint model, achieving an impressive average improvement of 17.8 nAUC points under the strict 5 pp budget. Detailed ablation studies meticulously validate the contribution of each component of GACL, showing how gates and bounds ensure stability and prevent undesirable side effects (e.g., surface form rewriting, parse failures). The comparison to a LoRA fine-tuning baseline, where GACL retains 76% of the gain without any parameter updates, highlights its efficiency and practical value. Furthermore, the successful, untuned transfer of GACL to vision-text arbitration on MC$^2$ (achieving up to +40.5 pp adversarial accuracy) is a powerful demonstration of the generalizability of the underlying diagnostic principles across different modalities, significantly amplifying the potential impact of this work.
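The source does not define nAUC precisely. One plausible reading, sketched here purely as an assumption, is the area under the curve of conflict-case accuracy versus faithfulness drop, truncated at the budget and divided by the budget so that the result stays on the accuracy scale. The function name `normalized_auc` and the requirement that the sweep include a zero-drop operating point are illustrative choices, not details from the paper.

```python
def normalized_auc(points, budget):
    """Area under accuracy-vs-drop, truncated at `budget` and divided by it.

    points: (faithfulness_drop_pp, conflict_accuracy) pairs from a sweep over
            intervention strength; drops are assumed to be non-negative.
    budget: maximum tolerated faithfulness drop in percentage points.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    pts = sorted(points)
    if not pts or pts[0][0] != 0:
        raise ValueError("sweep must include the zero-drop operating point")

    clipped = [pts[0]]
    for x, y in pts[1:]:
        px, py = clipped[-1]
        if x <= budget:
            clipped.append((x, y))
        else:
            # Linearly interpolate the curve at the budget boundary, then stop.
            t = (budget - px) / (x - px)
            clipped.append((budget, py + t * (y - py)))
            break
    if clipped[-1][0] < budget:
        # No operating point reaches the budget: hold the last value flat.
        clipped.append((budget, clipped[-1][1]))

    # Trapezoidal rule over the clipped curve.
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(clipped, clipped[1:]))
    return area / budget
```

Under this reading, a method that only reaches high conflict accuracy by exceeding the budget gets credit solely for the part of its curve inside the budget.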
The paper demonstrates a high commitment to reproducibility. The appendix provides extensive details, including specific public model checkpoints (with Hugging Face snapshot hashes), precise descriptions of benchmark splits, detailed prompt templates for each task, and the exact candidate scoring and normalization procedures. The hyperparameter tuning process, including the use of a development set and freezing parameters for testing, is clearly outlined. Furthermore, the paper provides comprehensive details for the LoRA fine-tuning baseline, including architecture, training data, optimization parameters, and hardware. Inference cost metrics (time, GPU memory, FLOPs) are also reported. This level of detail should enable researchers to reproduce the core findings and build upon this work.
The authors acknowledge several pertinent limitations. The study focuses on controlled, explicit audio-text conflicts, which, while crucial for isolating mechanisms, may not fully capture the complexity of naturally occurring conflicts involving noisier transcripts, partial notes, or broader conversational context. GACL is designed to repair arbitration failures where audio evidence is available but overridden, meaning it cannot compensate for fundamental perceptual failures where the model simply did not encode the relevant acoustic information. This distinction is important for guiding future research towards either decoding-time repair or improved acoustic modeling. A practical limitation is the increased inference latency due to the additional forward pass required for the audio-reference branch, although the authors suggest potential optimizations. Finally, while cross-modal transfer is demonstrated, the generalizability to all possible conflict sources and modality pairs remains an area for future exploration.
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
Primary: Ant Group
All Institutions: Ant Group, Zhejiang University
While highly effective, MemDreamer has a few limitations. The initial perception module still requires processing the full video to construct the hierarchical graph memory, which can be computationally intensive for extremely long videos, even with incremental processing. The quality of the graph memory heavily relies on the capabilities of the underlying VLM used for feature extraction and summarization; any limitations in the VLM's perceptual understanding will propagate. For videos of unprecedented length or complexity, the graph itself might become very large, potentially impacting the efficiency of graph traversal and retrieval, although the hierarchical structure aims to mitigate this. The generalizability of the specific three-tier graph structure and predefined edge types might need adaptation for highly specialized video domains. Finally, while the correlation analysis is compelling, establishing "agentic capability scaling as a new paradigm" is a strong claim that will require further research and validation across diverse tasks and models.
MemDreamer makes a significant contribution to the field of multimodal AI, particularly in long video understanding, a critical and challenging area. By effectively decoupling perception and reasoning, it offers a scalable solution to the token explosion and attention dilution problems that plague current Vision-Language Models. This framework has broad implications for applications requiring deep understanding of extended visual narratives, such as autonomous driving (understanding long-term driving scenarios), surveillance (identifying complex event chains), educational content analysis, and personal video assistants. The plug-and-play nature allows for easy integration into existing VLM pipelines, potentially accelerating research and development in this domain. The empirical finding regarding the correlation between logic reasoning and long-video understanding also opens up new research avenues, suggesting that improving an LLM's reasoning capabilities could directly translate to better long-term multimodal comprehension. This work pushes the boundaries of what's possible with current VLMs and LLMs, paving the way for more intelligent and capable AI systems. MemDreamer introduces a novel framework that decouples perception and reasoning for long video understanding via a hierarchical graph memory and an agentic retrieval mechanism. This paper presents a robust and innovative solution to the critical challenge of processing hours-long videos, achieving state-of-the-art results across multiple benchmarks with significant accuracy gains while drastically reducing the reasoning context window, and provides compelling evidence for the importance of agentic reasoning in multimodal comprehension.
MemDreamer proposes an innovative framework to tackle the challenge of long video understanding by decoupling perception and reasoning. The core of the methodology lies in two main components: a Hierarchical Graph Memory for perception and an Agentic Retrieval Mechanism for reasoning. The Hierarchical Graph Memory is a top-down, three-tier architecture designed for semantic abstraction. It incrementally streams video content to construct: 1) an Event Graph (Level 1) capturing spatiotemporal and causal relations between short video events, 2) a Summary Graph (Level 2) abstracting sequences of events into higher-level summaries, and 3) a Concept Graph (Level 3) representing overarching themes and concepts. Each level is populated and connected using a Vision-Language Model (VLM) to summarize and relate information. This hierarchical structure effectively compresses vast amounts of visual information into a manageable, semantically rich graph. The Agentic Retrieval Mechanism employs an LLM-based agent that interacts with this graph memory through an Observation-Reason-Action (O-R-A) loop. The agent is equipped with a set of tools (e.g., `search_node`, `traverse_edge`, `summarize_path`, `query_VLM`) to navigate the hierarchical graph, retrieve relevant information, and synthesize answers to complex queries. This agentic approach allows the reasoning module to operate on a highly condensed, contextually relevant subset of information, rather than processing the entire video sequence, thereby mitigating token explosion and attention dilution. The plug-and-play nature of the framework, allowing integration with various VLMs and LLMs, is a significant design strength.
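The graph memory and tool-based retrieval described above can be pictured with a small in-memory sketch. The tool names `search_node` and `traverse_edge` come from the description; their signatures, the `GraphMemory` and `Node` classes, the relation names `contains` and `caused_by`, and the scripted `explain_event` loop (which stands in for an LLM agent) are all assumptions made for illustration and do not reflect the authors' code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    level: int   # 1 = event, 2 = summary, 3 = concept
    text: str


@dataclass
class GraphMemory:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # node_id -> [(relation, target_id)]

    def add_node(self, node_id, level, text):
        self.nodes[node_id] = Node(node_id, level, text)
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def search_node(self, keyword, level=None):
        """Tool: ids of nodes whose text mentions the keyword (optionally one tier)."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values()
                if kw in n.text.lower() and (level is None or n.level == level)]

    def traverse_edge(self, node_id, relation):
        """Tool: follow edges of one relation type from a node."""
        return [dst for rel, dst in self.edges.get(node_id, []) if rel == relation]


def explain_event(memory, keyword, max_steps=5):
    """Scripted stand-in for the agent loop: observe, pick a tool, act, repeat."""
    frontier = memory.search_node(keyword, level=1)
    trace = [("search_node", keyword, list(frontier))]
    causes = []
    steps = 0
    while frontier and steps < max_steps:  # step cap also guards against cycles
        current = frontier.pop(0)
        found = memory.traverse_edge(current, "caused_by")
        trace.append(("traverse_edge", current, found))
        causes.extend(found)
        frontier.extend(found)
        steps += 1
    return [memory.nodes[c].text for c in causes], trace


memory = GraphMemory()
memory.add_node("c1", 3, "kitchen safety")
memory.add_node("s1", 2, "a cooking accident unfolds")
memory.add_node("e1", 1, "the stove is left on")
memory.add_node("e2", 1, "smoke fills the kitchen")
memory.add_node("e3", 1, "the smoke alarm rings")
memory.add_edge("c1", "contains", "s1")
for event in ("e1", "e2", "e3"):
    memory.add_edge("s1", "contains", event)
memory.add_edge("e3", "caused_by", "e2")
memory.add_edge("e2", "caused_by", "e1")

causes, trace = explain_event(memory, "alarm")
```

The point of the sketch is that the reasoning side only ever sees the handful of nodes returned by tool calls, recorded in `trace`, rather than the full video, which is the mechanism the description credits for the small reasoning context.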
The experimental evaluation is comprehensive and compelling. MemDreamer is tested across four mainstream benchmarks: EgoSchema (long-term planning), Perception-Reasoning (causal reasoning), Next-QA (temporal reasoning), and ActivityNet-QA (factual QA). The results consistently demonstrate SOTA performance, significantly outperforming various strong VLM baselines (e.g., Video-LLaVA, Video-ChatGPT, Long-Video-LLaMA). Notably, MemDreamer achieves a 12.5 point absolute accuracy gain on EgoSchema while constraining the reasoning context window to merely 2% of full-context ingestion, showcasing its efficiency and effectiveness. Ablation studies rigorously validate the design choices, confirming the importance of each hierarchical graph level, the superiority of agentic retrieval over simpler methods, and the flexibility with different LLM backbones (GPT-4 vs. LLaMA-2). A particularly insightful contribution is the statistical analysis revealing a strong positive linear correlation between a VLM's performance on logic reasoning benchmarks (Big-Bench Hard) and its performance on long-video understanding tasks. This finding provides empirical support for the agentic, reasoning-centric approach and suggests a new paradigm for multimodal comprehension. The gap with human experts is narrowed to only 3.7 points, indicating a high level of performance.
The paper provides a clear methodology, detailed architectural descriptions, and specific choices for VLM and LLM backbones (e.g., Video-LLaVA, GPT-4, LLaMA-2). The benchmarks used are standard and publicly available. The authors state that their code will be released at a specified GitHub repository, which is crucial for reproducibility. The appendix includes additional implementation details, hyper-parameters, and experimental setups, further aiding reproducibility. Given the complexity of the system, the release of code will be essential, but the current level of detail suggests that the work is designed to be reproducible.
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Google DeepMind, Stanford University, Carnegie Mellon University
This paper introduces StreamMA, a novel multi-agent reasoning system that employs streaming communication to reduce latency and surprisingly improve effectiveness by leveraging reliable early reasoning steps. The work presents a rigorous formal analysis, extensive empirical validation across diverse benchmarks and frontier LLMs, and discovers a new "step-level scaling law," making it a highly significant contribution to multi-agent AI and LLM research.
The paper introduces StreamMA, a novel multi-agent reasoning system that shifts from the traditional "generate-then-transfer" paradigm to a "streaming communication" approach. This involves pipelining reasoning steps, where downstream agents receive and process partial information as soon as it's generated by upstream agents. The core innovation lies in demonstrating a dual benefit: reduced end-to-end latency and, surprisingly, improved effectiveness. The effectiveness gain is attributed to leveraging more reliable early reasoning steps, preventing error propagation from potentially flawed later steps. The methodology is rigorously supported by the first closed-form joint analysis of stream, serial, and single protocols, providing theoretical derivations for effectiveness ordering, speedup upper bounds, and cost ratios. Agents are designed to generate reasoning steps and an "end-of-step" token, allowing for flexible granularity. The approach is versatile, demonstrated across Chain, Tree, and Graph topologies. This is a well-conceived and theoretically grounded methodology.
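The latency argument can be made concrete with an idealized timing model. This is not the paper's closed-form analysis, which the source does not reproduce. It assumes a chain topology, a fixed per-step generation time, and that an agent can emit its k-th step once it has its own previous step and the upstream agent's k-th step. The function names and this dependency rule are illustrative assumptions.

```python
def serial_latency(depth, steps, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return depth * steps * step_time


def stream_latency(depth, steps, step_time=1.0):
    """Streaming: step k of agent a starts once step k of agent a-1 and
    step k-1 of agent a are both done (assumed dependency rule)."""
    finish = [[0.0] * steps for _ in range(depth)]
    for a in range(depth):
        for k in range(steps):
            upstream_ready = finish[a - 1][k] if a > 0 else 0.0
            own_ready = finish[a][k - 1] if k > 0 else 0.0
            finish[a][k] = max(upstream_ready, own_ready) + step_time
    return finish[-1][-1]


def speedup(depth, steps):
    return serial_latency(depth, steps) / stream_latency(depth, steps)
```

In this toy model the streamed chain finishes after `steps + depth - 1` step times instead of `depth * steps`, so the speedup grows with the number of steps per agent and stays below the pipeline depth. That is at least consistent with the reported observation that more per-agent steps improve efficiency, although the paper's actual bound may differ.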
The experimental evaluation is comprehensive and robust. The authors test StreamMA across eight diverse reasoning benchmarks spanning mathematics (HMMT, GSM8K, MATH), science (ARC, BigBench Hard), and code generation (HumanEval, MBPP, APPS). This breadth demonstrates the generalizability of the approach. Two frontier LLMs, Claude Opus 4.6 and GPT-5.4, are used, providing strong baselines and highlighting the practical relevance to state-of-the-art systems. StreamMA consistently outperforms both "Serial" (generate-then-transfer) and "Single" (single-agent) baselines, achieving significant average effectiveness gains of +7.3 percentage points and a maximum of +22.4 pp on HMMT 2026. The paper also validates latency reduction and explores the "step-level scaling law," a novel empirical finding that increasing per-agent steps improves both effectiveness and efficiency. The experiments across different topologies (Chain, Tree, Graph) further solidify the findings. While the use of proprietary LLMs limits direct reproducibility for all researchers, the results are compelling and well-supported.
The paper provides a detailed description of the StreamMA methodology, including agent prompting strategies, communication protocols, and the formal analysis. This level of detail is commendable. However, the reliance on proprietary frontier LLMs (Claude Opus 4.6, GPT-5.4) means that exact replication of the results requires access to these specific models, which might not be universally available. The authors state that "Our code is available at [URL redacted for anonymity]," indicating that code exists but is not publicly linked in the provided version. Publicly available code would significantly enhance reproducibility. Given the detailed methodology and the promise of code, the work is reproducible in principle, but the LLM dependency and current lack of a public code link are practical limitations.
The authors acknowledge several limitations. Streaming communication can increase the total token count if agents re-process information, potentially leading to higher API costs, though this is often offset by improved effectiveness. Designing and managing complex graph-based multi-agent systems remains challenging. The approach relies on LLMs being capable of effectively processing and acting on partial, streaming information. The current focus is primarily on reasoning tasks, and its generalizability to other LLM applications like creative generation is not explored. For very simple tasks, the overhead of streaming might outweigh the benefits. Additionally, the reliance on proprietary frontier LLMs limits immediate open-source replication, and while the "step-level scaling law" is a fascinating discovery, its theoretical underpinnings and boundary conditions are not fully explored.
This paper offers a significant contribution to the field of multi-agent LLM systems. It introduces a new paradigm for communication that addresses a critical bottleneck (latency) while simultaneously improving reasoning effectiveness. This has profound implications for designing more efficient and responsive multi-agent systems, making them more viable for real-time and interactive applications. The discovery of the "step-level scaling law" opens up a novel research dimension for optimizing LLM performance and multi-agent system design, orthogonal to existing scaling laws. The insight that leveraging early, more reliable reasoning steps can prevent error propagation is a valuable lesson for structuring complex LLM-based reasoning tasks. This work is likely to influence future research and development in multi-agent AI and LLM deployment strategies.
Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.
Primary: unknown
All Institutions: unknown
Code2LoRA has significant broader impact for the development of more capable and efficient AI code assistants. By injecting repository knowledge parametrically with zero inference-time token overhead, it overcomes key limitations of current approaches (context window limits, per-query retrieval costs). The ability of Code2LoRA-Evo to adapt to evolving codebases commit-by-commit is particularly impactful for active development environments, enabling assistants that stay current with ongoing changes. This framework could lead to more accurate code completion, bug fixing, and project navigation tools. The RepoPeftBench benchmark itself is a valuable contribution, providing a standardized and challenging evaluation platform for future research in repository-level code understanding and PEFT. The paper also includes a responsible AI section discussing potential risks like insecure code generation, attribution risk, and the need for human review, which is commendable. Code2LoRA introduces a novel hypernetwork framework that generates repository-specific LoRA adapters for code language models, effectively injecting repository knowledge with zero inference-time token overhead and adapting to software evolution. This paper presents a significant advancement in making code LLMs more efficient and adaptable to real-world, evolving codebases, backed by a comprehensive new benchmark and strong empirical results that demonstrate its superiority over existing methods for handling repository-level context.
The methodology proposed in Code2LoRA is robust and well-conceived, addressing the critical challenge of injecting repository-level context into code language models efficiently. The core idea of using a hypernetwork to generate repository-specific LoRA adapters is a clever way to achieve zero inference-time token overhead, a significant advantage over RAG or dependency analysis methods. The framework is thoughtfully designed along two axes: how knowledge enters parameters and when it is refreshed. Code2LoRA-Static, which maps a single repository snapshot to an adapter, employs a practical two-step repository encoder using a frozen Qwen3-Embedding model, followed by a weighted mean and max pool aggregation. This effectively compresses large codebases into fixed-size vectors. The hypernetwork itself is a 2-layer MLP with dedicated output heads for generating LoRA matrices for all seven attention/MLP projection types, which is more comprehensive than prior work. Code2LoRA-Evo introduces genuine novelty by incorporating a GRU to aggregate sequential code diff embeddings. This recurrent mechanism allows the adapter to evolve with the codebase, addressing the brittleness of static adapters to software evolution. The initial GRU state is projected from the initial repository embedding, providing a strong prior. Training involves end-to-end optimization with truncated backpropagation through time, which is standard for recurrent networks. The design choices are technically sound and well-justified for the problem at hand.
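A toy version of the static hypernetwork path helps show why the adapter adds no inference-time tokens: the repository embedding is consumed once to produce low-rank matrices, which are then applied alongside the frozen weights. The sketch below uses only the Python standard library and tiny dimensions. The class `TinyHypernetwork`, the square projection shape, the ReLU activations, the random initialization, and the listed projection names are assumptions for illustration; the source says only that a 2-layer MLP with dedicated output heads generates LoRA matrices for seven attention/MLP projection types.

```python
import random

# Assumed names for the seven projection types; the source does not list them.
PROJECTIONS = ["q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj"]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def reshape(flat, rows, cols):
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


class TinyHypernetwork:
    """Maps one repository embedding to a LoRA (A, B) pair per projection."""

    def __init__(self, emb_dim, hidden_dim, model_dim, rank, seed=0):
        rng = random.Random(seed)
        self.model_dim, self.rank = model_dim, rank
        self.w1 = rand_matrix(hidden_dim, emb_dim, rng)      # MLP layer 1
        self.w2 = rand_matrix(hidden_dim, hidden_dim, rng)   # MLP layer 2
        # One dedicated pair of output heads per projection type.
        self.heads = {
            name: (rand_matrix(rank * model_dim, hidden_dim, rng),
                   rand_matrix(model_dim * rank, hidden_dim, rng))
            for name in PROJECTIONS
        }

    def generate(self, repo_embedding):
        h = [max(0.0, v) for v in matvec(self.w1, repo_embedding)]
        h = [max(0.0, v) for v in matvec(self.w2, h)]
        adapters = {}
        for name, (head_a, head_b) in self.heads.items():
            a = reshape(matvec(head_a, h), self.rank, self.model_dim)  # rank x d
            b = reshape(matvec(head_b, h), self.model_dim, self.rank)  # d x rank
            adapters[name] = (a, b)
        return adapters


def apply_lora(weight, a, b, x, alpha=1.0):
    """Standard LoRA forward pass: y = W x + alpha * B (A x)."""
    base = matvec(weight, x)
    delta = matvec(b, matvec(a, x))
    return [y + alpha * d for y, d in zip(base, delta)]


hyper = TinyHypernetwork(emb_dim=4, hidden_dim=8, model_dim=6, rank=2)
adapters = hyper.generate([0.1, -0.2, 0.3, 0.4])
```

In the evolving setting described above, the input to `generate` would be a recurrent state updated per code diff rather than a one-off snapshot embedding; that recurrent part is not sketched here.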
The experimental evaluation is exceptionally thorough and rigorous. The authors introduce RepoPeftBench, a new benchmark of 604 Python repositories, specifically designed for evaluating repository-level PEFT methods under both static and evolving conditions. This benchmark is a significant contribution in itself, providing a much-needed resource for the community. The two evaluation tracks (static and evolution) accurately reflect the proposed usage scenarios of Code2LoRA-Static and Code2LoRA-Evo. The task of assertion completion is well-chosen, requiring complex reasoning capabilities and naturally leveraging repository context. The use of in-repo (IR), cross-repo (CR), and temporal out-of-distribution (OOD) splits demonstrates a strong commitment to evaluating generalization. Baselines are comprehensive, including pretrained LLM, RAG, dependency-resolved context (DRC), full fine-tuning (FFT), shared LoRA (sLoRA), per-repository LoRA (pLoRA, serving as an upper bound), and a carefully strengthened Text2LoRA hypernetwork. The strengthening of Text2LoRA to match Code2LoRA's input modality and target coverage is crucial for isolating the hypernetwork's architectural contribution. The results are compelling:
1. On the static track, Code2LoRA-Static achieves 63.8% CR EM and 66.2% IR EM, effectively matching the per-repository LoRA upper bound (64.0% IR EM). This is a remarkable achievement, demonstrating that a hypernetwork can achieve the performance of per-repository fine-tuning without the associated training cost for new repositories.
2. On the evolution track, Code2LoRA-Evo achieves 60.3% CR EM, a substantial 5.2 percentage point improvement over a single shared LoRA and significantly outperforming Code2LoRA-Static (41.7%), which goes stale. This validates the recurrent design for evolving codebases.
3. Strong generalization is shown on the temporal OOD holdout, with Code2LoRA-Evo leading the next-best fine-tuned adapter by 1.8 pp EM.
The ablation study on RAG parameters further strengthens the conclusion that Code2LoRA's parametric approach is superior to context injection. The use of Qwen2.5-Coder-1.5B as the base LLM and Qwen3-Embedding-0.6B for the repository encoder are reasonable choices for demonstrating the method's effectiveness.
The paper demonstrates a strong commitment to reproducibility. The authors state that code, RepoPeftBench datasets, and model checkpoints will be released. The appendix provides detailed hyperparameters, training schedules, and sequence-length budgets. All experiments were run on a single H100 80GB GPU, making the compute requirements clear. The detailed description of the dataset construction pipeline, including repository selection, licensing, structured prefix construction, and quality filters, further enhances reproducibility. The fine-grained statistics for every split and token-length distributions are also provided.
The authors openly discuss several limitations:
1. **Scope of evaluation:** Limited to Python repositories, a single base LLM (Qwen2.5-Coder-1.5B), and one task (assertion completion). While the architecture is designed to be language- and task-agnostic, empirical validation across these dimensions is left for future work.
2. **OOD target-length artifact:** The OOD exact match scores are inflated due to systematically shorter targets, requiring careful interpretation of absolute numbers, though the relative performance holds.
3. **Surface-level metrics:** Exact Match, Edit Similarity, and CodeBLEU are used. A more semantic evaluation, such as executing generated assertions, was out of scope due to compute budget.
4. **Model size:** The hypernetwork itself is large (720M-745M parameters), which is a function of the backbone's projection dimensions. The necessity of recurrent aggregation for much larger backbones remains an open question.
These limitations are acknowledged transparently and do not detract significantly from the paper's core contributions, but rather point to avenues for future research.
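The surface-level metrics limitation is easy to see on a toy case. The sketch below is illustrative only: it uses `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in for Edit Similarity, which is an assumption, since the review does not say how the benchmark computes that metric. The example strings are invented.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, target: str) -> bool:
    """True only when the two strings are identical after trimming whitespace."""
    return pred.strip() == target.strip()


def edit_similarity(pred: str, target: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in, not the benchmark's metric."""
    return SequenceMatcher(None, pred.strip(), target.strip()).ratio()


target = "assert total == 3"
pred = "assert 3 == total"   # same check, operands swapped

em = exact_match(pred, target)
sim = edit_similarity(pred, target)
```

Here the prediction performs the same check as the target, yet Exact Match scores it as a miss and the similarity score falls below 1. Executing the generated assertions, which the authors left out of scope, would count it as correct.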
Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.
Primary: University of Tübingen
All Institutions: University of Tübingen, University of Vienna, Tübingen AI Center
This paper provides a crucial evaluation framework for scrutinizing validity threats in foundation model research designs. It offers a novel and timely perspective by casting foundation model research as a causal inference problem and systematically applying validity types from the empirical social sciences to common research strategies, thereby addressing the pressing challenge of conducting rigorous research in a compute-constrained environment. The comprehensive analysis of different research strategies through the lens of statistical, internal, external, and construct validity, and the identification of previously under-emphasized validity threats, positions this work as a potentially foundational contribution to the methodology of large-scale machine learning research.
The paper proposes a conceptual methodology, adapting established frameworks from the empirical social sciences—specifically, causal inference and four types of validity (statistical, internal, external, and construct validity)—to scrutinize research designs for foundation models. It casts foundation model research as a causal inference problem, which is a powerful lens for identifying hidden assumptions and potential threats to validity. The approach involves analyzing common research strategies in the foundation model space (proxy experiments, scaling laws, observational studies, and single-run designs) through this validity framework. The methodology is analytical and aims to provide a structured way to think about the rigor and generalizability of findings in a compute-constrained environment. While the full details of the framework's application to each strategy are not provided in the given text, the abstract outlines specific validity threats identified for each strategy, suggesting a concrete and systematic analysis.
This paper is a conceptual and methodological work; therefore, it does not present traditional experimental evaluations with datasets and results. Its "evaluation" is an analytical one, evaluating different research *strategies* rather than specific models or algorithms. The success of this paper's "evaluation" lies in the clarity, comprehensiveness, and utility of the proposed framework and the insights it generates regarding existing research practices. Without the full text, it's impossible to assess the depth and rigor of this analytical evaluation.
As a conceptual framework paper, reproducibility in the traditional sense (e.g., code, experimental setups) is not directly applicable. However, the framework itself should be clearly defined and articulated such that other researchers can understand, apply, and critique it. The "practical toolkit" mentioned in the abstract implies a structured approach that should be reproducible in its application. The discussion mentions "Open-science initiatives like the Marin Project that openly document training recipes and meta-data can also help," which aligns with principles of reproducibility in the broader ML community.
The primary limitation of this evaluation is the lack of the full paper content for the main sections (e.g., `neurips/sections/proxy`, `neurips/sections/observational`, `neurips/sections/singlerun-v5`, `neurips/sections/validity-profiles`). Therefore, the assessment of the framework's depth, specific insights, and practical utility is based primarily on the abstract and the high-level structure. Without these details, it's difficult to ascertain if the framework is sufficiently comprehensive, if the identified validity threats are exhaustively covered, or if the proposed solutions/mitigations are practical and well-justified. Another potential limitation, inherent in adapting frameworks from other fields, is the challenge of ensuring that the concepts (e.g., construct validity) are appropriately translated and applied to the unique context of machine learning and foundation models without oversimplification or misinterpretation.
This paper has the potential for significant broader impact. By providing a structured framework for evaluating validity threats, it can elevate the methodological rigor of foundation model research. It encourages researchers to critically examine their experimental designs, understand the limitations of their findings, and make more robust claims. This can lead to more reliable and trustworthy research, better allocation of compute resources, and a more mature scientific discourse around large-scale ML. It could serve as a foundational reference for designing future experiments, reviewing papers, and teaching research methodology in the era of large models. The emphasis on "hidden and sometimes untestable assumptions" is crucial for fostering a more transparent and self-aware research community.
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
Primary: University of Southern California
All Institutions: University of Southern California
This work has significant broader impact for the field of machine learning, particularly in the context of large language models and agent development. By offering a principled and empirically effective method to leverage rich feedback, DistIL can accelerate the training of AI agents for complex reasoning, problem-solving, and creative generation tasks. It provides a theoretically sound alternative to existing RLHF/RLAIF paradigms, potentially leading to more stable, efficient, and robust learning. The theoretical insights into the limitations of widely used divergence objectives (reverse KL, Jensen-Shannon) for guaranteeing monotonic improvement are fundamental and could influence the design of future objective functions across various machine learning domains. This paper pushes the boundaries of how we think about and utilize expert knowledge in sequential decision-making. This paper introduces DistIL, a novel distributional DAgger variant that leverages rich feedback through a forward cross-entropy objective. Its core contribution lies in providing strong theoretical guarantees for monotonic policy improvement and regret, while empirically demonstrating superior performance over existing RL from feedback and self-distillation methods across challenging reasoning, coding, and mathematical problem-solving tasks. The work offers a principled and effective approach to utilize the nuanced information available in rich feedback, addressing a critical limitation of traditional binary reward schemes and paving the way for more robust and capable AI agents.
The methodology proposed in this paper, DistIL (Distributional Imitation Learning), is a sophisticated and well-justified extension of the classic DAgger algorithm. The core innovation is the shift from learning from single expert actions to learning from *expert distributions* over actions, coupled with a novel forward cross-entropy (FCE) objective. The paper meticulously details how this FCE objective facilitates sequence-level credit assignment by propagating future expert-student disagreement back to earlier decisions, a crucial aspect for multi-step reasoning tasks. A significant strength of the methodology is the rigorous theoretical analysis, which demonstrates that FCE guarantees monotonic policy improvement and provides regret bounds, unlike commonly used reverse KL divergence (e.g., in RLAIF, DPO) or Jensen-Shannon divergence (e.g., in self-distillation). This theoretical underpinning provides a strong foundation for the empirical successes. The practical instantiation of the "expert distribution" using strong LLMs to generate multiple candidate actions and their log-probabilities is a clever and effective way to bridge the theoretical framework with real-world applications.
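The contrast between forward cross-entropy and reverse KL can be illustrated on a single decision with three candidate actions. The sketch below is a toy calculation under invented numbers, not a reproduction of the paper's analysis: it only shows that the two objectives are different quantities and can rank the same pair of student policies in opposite order. The distributions and function names are assumptions for illustration.

```python
import math


def forward_ce(expert, student):
    """Forward cross-entropy: expectation under the expert of -log student."""
    return -sum(p * math.log(q) for p, q in zip(expert, student) if p > 0)


def reverse_kl(student, expert):
    """Reverse KL(student || expert): expectation taken under the student."""
    return sum(q * math.log(q / p) for q, p in zip(student, expert) if q > 0)


# Hypothetical distributions over three actions at one state.
expert = [0.60, 0.39, 0.01]
peaked = [0.98, 0.01, 0.01]   # student that collapses onto the expert's top action
spread = [0.40, 0.30, 0.30]   # student that keeps mass on both likely actions

fce_peaked = forward_ce(expert, peaked)
fce_spread = forward_ce(expert, spread)
rkl_peaked = reverse_kl(peaked, expert)
rkl_spread = reverse_kl(spread, expert)
```

With these numbers, forward cross-entropy scores the spread student better (about 1.03 versus 1.85 nats), because it penalizes the peaked student for nearly ignoring an action the expert picks 39% of the time. Reverse KL scores the peaked student better (about 0.44 versus 0.78 nats), because it is evaluated only where the student itself places mass. The paper's monotonic-improvement and regret results concern the full sequential setting and are not demonstrated by this example.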
The experimental evaluation is comprehensive, rigorous, and highly convincing. The authors select a diverse and challenging set of domains: scientific reasoning (SciBench), coding (CodeContests), and solving hard mathematical problems (GSM8K, MATH). These tasks are ideal for showcasing the benefits of rich feedback and multi-step reasoning. The chosen baselines are state-of-the-art methods in reinforcement learning from human/AI feedback (RLVR, PPO, RLAIF, RPO, DPO) and self-distillation, providing a strong comparative analysis. DistIL consistently and significantly outperforms all baselines across all tasks, demonstrating its superior effectiveness. The improvements in Pass@N are substantial and directly align with the theoretical guarantees. Furthermore, the ablation studies effectively isolate the contributions of the FCE objective and the distributional nature of the feedback, reinforcing the core claims of the paper.
The paper provides a good level of detail regarding the experimental setup, including the specific base models, expert models, hyperparameters, and training procedures in both the main text and the appendix. This level of detail is commendable and should allow a diligent researcher to reproduce the main results. While the code is not yet publicly available (a placeholder URL is present), the comprehensive descriptions suggest a strong commitment to reproducibility.
The primary limitation of DistIL, inherent to any imitation learning approach, is its reliance on the quality and availability of a "blackbox expert" that can provide high-quality *distributions* over actions. While the paper effectively demonstrates how powerful LLMs can serve this role, the method's performance is ultimately bounded by the expert's capabilities and the fidelity of the estimated expert distributions. The iterative nature of DAgger, involving repeated data collection and expert queries, can also be computationally intensive, especially if the expert is an expensive API-based LLM. While tested on complex text-based reasoning, the scalability and applicability of DistIL to high-dimensional, continuous action spaces or complex physical environments with less structured feedback remain open questions.
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
Primary: Ant Group
All Institutions: Ant Group, Zhejiang University
While highly effective, MemDreamer has a few limitations. The initial perception module still requires processing the full video to construct the hierarchical graph memory, which can be computationally intensive for extremely long videos, even with incremental processing. The quality of the graph memory heavily relies on the capabilities of the underlying VLM used for feature extraction and summarization; any limitations in the VLM's perceptual understanding will propagate. For videos of unprecedented length or complexity, the graph itself might become very large, potentially impacting the efficiency of graph traversal and retrieval, although the hierarchical structure aims to mitigate this. The generalizability of the specific three-tier graph structure and predefined edge types might need adaptation for highly specialized video domains. Finally, while the correlation analysis is compelling, establishing "agentic capability scaling as a new paradigm" is a strong claim that will require further research and validation across diverse tasks and models.
BROADER IMPACT: MemDreamer makes a significant contribution to the field of multimodal AI, particularly in long video understanding, a critical and challenging area. By effectively decoupling perception and reasoning, it offers a scalable solution to the token explosion and attention dilution problems that plague current Vision-Language Models. This framework has broad implications for applications requiring deep understanding of extended visual narratives, such as autonomous driving (understanding long-term driving scenarios), surveillance (identifying complex event chains), educational content analysis, and personal video assistants. The plug-and-play nature allows for easy integration into existing VLM pipelines, potentially accelerating research and development in this domain. The empirical finding regarding the correlation between logic reasoning and long-video understanding also opens up new research avenues, suggesting that improving LLMs' reasoning capabilities could directly translate to better long-term multimodal comprehension. This work pushes the boundaries of what's possible with current VLMs and LLMs, paving the way for more intelligent and capable AI systems.
MemDreamer introduces a novel framework that decouples perception and reasoning for long video understanding via a hierarchical graph memory and an agentic retrieval mechanism. This paper presents a robust and innovative solution to the critical challenge of processing hours-long videos, achieving state-of-the-art results across multiple benchmarks with significant accuracy gains while drastically reducing the reasoning context window, and provides compelling evidence for the importance of agentic reasoning in multimodal comprehension.
MemDreamer proposes an innovative framework to tackle the challenge of long video understanding by decoupling perception and reasoning. The core of the methodology lies in two main components: a Hierarchical Graph Memory for perception and an Agentic Retrieval Mechanism for reasoning. The Hierarchical Graph Memory is a top-down, three-tier architecture designed for semantic abstraction. It incrementally streams video content to construct: 1) an Event Graph (Level 1) capturing spatiotemporal and causal relations between short video events, 2) a Summary Graph (Level 2) abstracting sequences of events into higher-level summaries, and 3) a Concept Graph (Level 3) representing overarching themes and concepts. Each level is populated and connected using a Vision-Language Model (VLM) to summarize and relate information. This hierarchical structure effectively compresses vast amounts of visual information into a manageable, semantically rich graph. The Agentic Retrieval Mechanism employs an LLM-based agent that interacts with this graph memory through an Observation-Reason-Action (O-R-A) loop. The agent is equipped with a set of tools (e.g., `search_node`, `traverse_edge`, `summarize_path`, `query_VLM`) to navigate the hierarchical graph, retrieve relevant information, and synthesize answers to complex queries. This agentic approach allows the reasoning module to operate on a highly condensed, contextually relevant subset of information, rather than processing the entire video sequence, thereby mitigating token explosion and attention dilution. The plug-and-play nature of the framework, allowing integration with various VLMs and LLMs, is a significant design strength.
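The interaction between the three-tier memory and the retrieval loop can be sketched in a few lines. This is a hypothetical toy, not MemDreamer's code: the tool names `search_node` and `traverse_edge` come from the description above, but their signatures, the `contains` and `causes` edge labels, the scripted policy that stands in for the LLM agent, and the example content are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    level: int          # 1 = event, 2 = summary, 3 = concept
    text: str
    edges: dict = field(default_factory=dict)   # relation -> list of node ids


class GraphMemory:
    """Toy three-tier memory exposing two of the tools named in the review."""

    def __init__(self):
        self.nodes = {}

    def add(self, node_id, level, text):
        self.nodes[node_id] = Node(node_id, level, text)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    def search_node(self, keyword, level):
        return [n.node_id for n in self.nodes.values()
                if n.level == level and keyword.lower() in n.text.lower()]

    def traverse_edge(self, node_id, relation):
        return list(self.nodes[node_id].edges.get(relation, []))


def explore(memory, keyword, max_steps=8):
    """Scripted stand-in for the Observation-Reason-Action loop."""
    frontier = memory.search_node(keyword, level=3)   # start at the concept tier
    visited, trace = set(), []
    for _ in range(max_steps):
        if not frontier:
            break
        node_id = frontier.pop(0)                     # observe the next node
        if node_id in visited:
            continue
        visited.add(node_id)
        trace.append(node_id)
        node = memory.nodes[node_id]
        # Reason/act: descend through the hierarchy, then follow causal edges.
        relation = "contains" if node.level > 1 else "causes"
        frontier.extend(memory.traverse_edge(node_id, relation))
    return trace


memory = GraphMemory()
memory.add("c1", 3, "Cooking dinner")
memory.add("s1", 2, "Vegetables are prepared and fried")
memory.add("e1", 1, "A person chops onions")
memory.add("e2", 1, "The onions are fried in a pan")
memory.link("c1", "contains", "s1")
memory.link("s1", "contains", "e1")
memory.link("s1", "contains", "e2")
memory.link("e1", "causes", "e2")

trace = explore(memory, "cooking")
```

The reasoning side only ever sees the handful of nodes in `trace`, which is the intuition behind keeping the context window small: the graph is built once during perception, and each query touches a narrow path through it.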
The experimental evaluation is comprehensive and compelling. MemDreamer is tested across four mainstream benchmarks: EgoSchema (long-term planning), Perception-Reasoning (causal reasoning), Next-QA (temporal reasoning), and ActivityNet-QA (factual QA). The results consistently demonstrate SOTA performance, significantly outperforming various strong VLM baselines (e.g., Video-LLaVA, Video-ChatGPT, Long-Video-LLaMA). Notably, MemDreamer achieves a 12.5 point absolute accuracy gain on EgoSchema while constraining the reasoning context window to merely 2% of full-context ingestion, showcasing its efficiency and effectiveness. Ablation studies rigorously validate the design choices, confirming the importance of each hierarchical graph level, the superiority of agentic retrieval over simpler methods, and the flexibility with different LLM backbones (GPT-4 vs. LLaMA-2). A particularly insightful contribution is the statistical analysis revealing a strong positive linear correlation between a VLM's performance on logic reasoning benchmarks (Big-Bench Hard) and its performance on long-video understanding tasks. This finding provides empirical support for the agentic, reasoning-centric approach and suggests a new paradigm for multimodal comprehension. The gap with human experts is narrowed to only 3.7 points, indicating a high level of performance.
The paper provides a clear methodology, detailed architectural descriptions, and specific choices for VLM and LLM backbones (e.g., Video-LLaVA, GPT-4, LLaMA-2). The benchmarks used are standard and publicly available. The authors state that their code will be released at a specified GitHub repository, which is crucial for reproducibility. The appendix includes additional implementation details, hyper-parameters, and experimental setups, further aiding reproducibility. Given the complexity of the system, the release of code will be essential, but the current level of detail suggests that the work is designed to be reproducible.
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Google DeepMind, Stanford University, Carnegie Mellon University
This paper introduces StreamMA, a novel multi-agent reasoning system that employs streaming communication to reduce latency and surprisingly improve effectiveness by leveraging reliable early reasoning steps. The work presents a rigorous formal analysis, extensive empirical validation across diverse benchmarks and frontier LLMs, and discovers a new "step-level scaling law," making it a highly significant contribution to multi-agent AI and LLM research.
The paper introduces StreamMA, a novel multi-agent reasoning system that shifts from the traditional "generate-then-transfer" paradigm to a "streaming communication" approach. This involves pipelining reasoning steps, where downstream agents receive and process partial information as soon as it's generated by upstream agents. The core innovation lies in demonstrating a dual benefit: reduced end-to-end latency and, surprisingly, improved effectiveness. The effectiveness gain is attributed to leveraging more reliable early reasoning steps, preventing error propagation from potentially flawed later steps. The methodology is rigorously supported by the first closed-form joint analysis of stream, serial, and single protocols, providing theoretical derivations for effectiveness ordering, speedup upper bounds, and cost ratios. Agents are designed to generate reasoning steps and an "end-of-step" token, allowing for flexible granularity. The approach is versatile, demonstrated across Chain, Tree, and Graph topologies. This is a well-conceived and theoretically grounded methodology.
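The latency argument can be illustrated with a toy model. The sketch below is not the paper's closed-form analysis; it assumes a simple chain, a uniform per-step generation time, zero transfer overhead, and that a downstream agent can start as soon as one upstream step arrives. All function names are hypothetical.

```python
def serial_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return num_agents * steps_per_agent * step_time


def stream_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Idealized streaming: agent k starts one step after agent k-1 (assumption)."""
    return (steps_per_agent + num_agents - 1) * step_time


def speedup(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Ratio of serial to streaming latency under the toy model above."""
    return (serial_latency(num_agents, steps_per_agent, step_time)
            / stream_latency(num_agents, steps_per_agent, step_time))


if __name__ == "__main__":
    for steps in (1, 8, 64):
        print(steps, round(speedup(num_agents=4, steps_per_agent=steps), 3))
```

Under these assumptions the serial latency grows with the product of depth and steps, while the streaming latency grows with their sum, so the speedup approaches the number of agents as the per-agent step count increases. That is at least consistent with the reported observation that more per-agent steps help efficiency.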
The experimental evaluation is comprehensive and robust. The authors test StreamMA across eight diverse reasoning benchmarks spanning mathematics (HMMT, GSM8K, MATH), science (ARC, BigBench Hard), and code generation (HumanEval, MBPP, APPS). This breadth demonstrates the generalizability of the approach. Two frontier LLMs, Claude Opus 4.6 and GPT-5.4, are used, providing strong baselines and highlighting the practical relevance to state-of-the-art systems. StreamMA consistently outperforms both "Serial" (generate-then-transfer) and "Single" (single-agent) baselines, achieving significant average effectiveness gains of +7.3 percentage points and a maximum of +22.4 pp on HMMT 2026. The paper also validates latency reduction and explores the "step-level scaling law," a novel empirical finding that increasing per-agent steps improves both effectiveness and efficiency. The experiments across different topologies (Chain, Tree, Graph) further solidify the findings. While the use of proprietary LLMs limits direct reproducibility for all researchers, the results are compelling and well-supported.
The paper provides a detailed description of the StreamMA methodology, including agent prompting strategies, communication protocols, and the formal analysis. This level of detail is commendable. However, the reliance on proprietary frontier LLMs (Claude Opus 4.6, GPT-5.4) means that exact replication of the results requires access to these specific models, which might not be universally available. The authors state that "Our code is available at [URL redacted for anonymity]," indicating that code exists but is not publicly linked in the provided version. Publicly available code would significantly enhance reproducibility. Given the detailed methodology and the promise of code, the work is reproducible in principle, but the LLM dependency and current lack of a public code link are practical limitations.
The authors acknowledge several limitations. Streaming communication can increase the total token count if agents re-process information, potentially leading to higher API costs, though this is often offset by improved effectiveness. Designing and managing complex graph-based multi-agent systems remains challenging. The approach relies on LLMs being capable of effectively processing and acting on partial, streaming information. The current focus is primarily on reasoning tasks, and its generalizability to other LLM applications like creative generation is not explored. For very simple tasks, the overhead of streaming might outweigh the benefits. Additionally, the reliance on proprietary frontier LLMs limits immediate open-source replication, and while the "step-level scaling law" is a fascinating discovery, its theoretical underpinnings and boundary conditions are not fully explored.
This paper offers a significant contribution to the field of multi-agent LLM systems. It introduces a new paradigm for communication that addresses a critical bottleneck (latency) while simultaneously improving reasoning effectiveness. This has profound implications for designing more efficient and responsive multi-agent systems, making them more viable for real-time and interactive applications. The discovery of the "step-level scaling law" opens up a novel research dimension for optimizing LLM performance and multi-agent system design, orthogonal to existing scaling laws. The insight that leveraging early, more reliable reasoning steps can prevent error propagation is a valuable lesson for structuring complex LLM-based reasoning tasks. This work is likely to influence future research and development in multi-agent AI and LLM deployment strategies.
Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
Primary: University of Oxford
All Institutions: University of Oxford, FLock.io, TU Wien
TRI represents a significant step towards making LLM reasoning more robust, reliable, and trustworthy, especially in high-stakes domains. Its dual-system approach, combining the generative power of LLMs with the precision of symbolic verifiers, offers a powerful paradigm for building more capable AI systems. The ability to surgically repair reasoning chains efficiently has major implications for: 1. **Formal Methods and Mathematics:** Accelerating mathematical discovery, proof generation, and verification by providing a robust tool for bridging logical gaps. 2. **Software Engineering:** Enhancing automated code generation, debugging, and repair, leading to more reliable and efficient software development. 3. **Scientific Discovery:** Improving the reliability of LLM-assisted scientific reasoning and hypothesis generation in fields requiring rigorous logical deduction. 4. **Resource Efficiency:** The substantial token efficiency gains contribute to reducing the computational cost and environmental footprint of complex LLM reasoning tasks. 5. **Beyond CoT:** By addressing a fundamental limitation of autoregressive generation, TRI offers a principled alternative or augmentation to existing CoT methods, potentially influencing future LLM architectures and training strategies for reasoning. This paper introduces Teleological Reasoning Infilling (TRI), a novel framework that endows decoder-only transformers with a native goal-conditioned bridging capability for robust chain repair, achieving state-of-the-art performance and significant token efficiency on complex reasoning tasks. The work makes substantial contributions through its elegant Prefix-Suffix-Middle (PSM) sequence architecture, a principled two-stage training pipeline leveraging deterministic symbolic verifiers, a surgical dual-system inference repair algorithm, and rigorous theoretical analysis, offering a powerful solution to the critical problem of error snowballing in LLM reasoning.
The paper introduces Teleological Reasoning Infilling (TRI), a novel framework addressing the critical "error snowballing" problem in autoregressive Chain-of-Thought (CoT) reasoning by LLMs. The core idea is to reframe erroneous reasoning segments as Fill-in-the-Middle (FIM) tasks, where the model must synthesize a logical bridge (M) between a verified prefix premise (P) and a verified downstream milestone (S), given the original query (Q). This goal-conditioned bridging capability is a significant conceptual leap from purely forward-directed generation. A key technical innovation is the Prefix-Suffix-Middle (PSM) sequence rearrangement. By introducing three non-overlapping sentinel tokens and reordering the input as `[Q _premise P _milestone S _bridge M]`, the authors elegantly enable standard causal decoder-only transformers to attend to both P and S when generating M, without any modification to the self-attention mechanism. This is a clever and efficient architectural trick. The training pipeline is robust and principled, consisting of two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified (Q, P, S, M) quadruples extracted from formal mathematics corpora (MATH, Lean-Workbook). The meticulous data curation, including independent verification of P and S and anti-contamination measures, ensures high-quality training data. (ii) Direct Preference Optimisation (DPO) with a *deterministic symbolic verifier* (Lean 4 / Python) as the sole reward oracle. This is a crucial design choice, explicitly rejecting LLM-based judges to overcome sycophancy and structural blindness in formal logical validity, providing a provably correct feedback signal. The categorization of rejection failure modes further refines the DPO signal. At inference, TRI operates as a surgical repair module within a dual-system loop. A causal draft model generates an initial trace, a verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact.
The `ExtractMilestone` subroutine, which performs a bounded forward scan to find the first verifiable downstream step, is a practical component of this loop. The paper also provides formal theoretical analysis, including a proof of "Topological Consistency" for PSM training, DPO convergence guarantees, and a "Universal Approximation of Bidirectional Conditionals via PSM" theorem. While the Lipschitz assumption for the logical scorer in discrete domains is an idealization, the discussion clarifies its implications for distributional concentration in embedding space, adding significant rigor to the work. Property A.5 on gap-length monotonicity provides theoretical justification for training choices.
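The PSM rearrangement described above can be sketched as plain string formatting. This is a minimal illustration, assuming the sentinel names `_premise`, `_milestone`, and `_bridge` quoted in this review and whitespace separators (the separators and both function names are assumptions, not the authors' tokenizer setup).

```python
# Sentinel names as quoted in the review; in practice these would be
# dedicated vocabulary tokens rather than plain substrings (assumption).
PREMISE, MILESTONE, BRIDGE = "_premise", "_milestone", "_bridge"


def build_psm_prompt(query: str, prefix: str, suffix: str) -> str:
    """Order the context as Q, P, S so a causal model sees both ends first."""
    return f"{query} {PREMISE} {prefix} {MILESTONE} {suffix} {BRIDGE}"


def build_psm_training_example(query: str, prefix: str, suffix: str, middle: str):
    """Return (prompt, full_sequence); the bridge M is the generation target."""
    prompt = build_psm_prompt(query, prefix, suffix)
    return prompt, f"{prompt} {middle}"


if __name__ == "__main__":
    prompt, full = build_psm_training_example(
        "Solve x + 2 = 5.", "x + 2 = 5", "x = 3", "Subtract 2 from both sides."
    )
    print(full)
```

Because the bridge comes last in the sequence, ordinary left-to-right attention already covers both P and S, which is why no change to the attention mask is needed.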
The experimental evaluation is comprehensive and compelling. TRI is evaluated on three diverse and challenging benchmarks: MATH (competition mathematics), HumanEval-Fix (program repair), and Lean-Workbook (formal theorem proving). This broad coverage effectively demonstrates the generalizability of the approach across different domains requiring rigorous logical reasoning. TRI achieves consistent state-of-the-art performance across all tasks and MATH difficulty levels, significantly outperforming strong baselines including Qwen2.5-72B-Instruct, Llama-3.1-70B-Instruct (with CoT, CoT-SC, and ToT variants), and the domain-specific InternLM2.5-StepProver. The performance gains are particularly pronounced on higher difficulty MATH levels, validating the hypothesis that TRI's benefit accrues where error snowballing is most problematic. Beyond accuracy, TRI demonstrates remarkable efficiency, reducing per-problem token expenditure by 31.2% compared to baselines. This is a substantial practical advantage, stemming from its surgical repair strategy that avoids regenerating entire traces. The robustness analysis further highlights TRI's strengths, showing superior performance under tight computational budgets and high fault densities. The "asymmetric benefit" under low token budgets is a key finding, demonstrating that TRI's targeted repair is much more effective than exhaustive search or ensemble methods when resources are constrained. The Repair Success Rate (RSR) of 73.8% on MATH Level 5 indicates the effectiveness of the iterative repair loop. The ablation study is well-designed and provides crucial insights. It definitively shows that the symbolic verifier oracle in the DPO stage is the most consequential component, with replacing it with an LLM-as-judge leading to a drastic 12.1 pp drop in MATH Level 5 accuracy. This strongly validates the paper's methodological choice to use a deterministic oracle. 
The ablation on milestone selection also confirms the optimal strategy of choosing the first verifiable milestone.
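The repair loop and the first-verifiable-milestone strategy can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes each step can be verified independently, that `verify` and `infill` are caller-supplied callables, and that the `window` and `max_iters` defaults are placeholders (the paper's actual values are not given in this review).

```python
from typing import Callable, List, Optional


def extract_milestone(steps: List[str], start: int,
                      verify: Callable[[str], bool], window: int) -> Optional[int]:
    """Bounded forward scan: index of the first verifiable step after `start`."""
    for i in range(start + 1, min(len(steps), start + 1 + window)):
        if verify(steps[i]):
            return i
    return None


def repair_trace(query: str, steps: List[str],
                 verify: Callable[[str], bool],
                 infill: Callable[[str, List[str], Optional[str]], List[str]],
                 window: int = 4, max_iters: int = 3) -> List[str]:
    """Infill only the damaged segment; fall back to suffix regeneration."""
    steps = list(steps)
    for _ in range(max_iters):
        bad = next((i for i, s in enumerate(steps) if not verify(s)), None)
        if bad is None:
            return steps  # every step verified
        prefix = steps[:bad]
        m = extract_milestone(steps, bad, verify, window)
        if m is None:
            # No milestone in the window: regenerate everything after the prefix.
            steps = prefix + infill(query, prefix, None)
        else:
            # Bridge the verified prefix to the milestone, keep the rest intact.
            steps = prefix + infill(query, prefix, steps[m]) + steps[m:]
    return steps  # may still contain unverified steps if the budget ran out
```

A toy call such as `repair_trace("q", ["a", "BAD", "c"], lambda s: "BAD" not in s, lambda q, p, s: ["fixed"])` replaces only the middle step, which is the source of the token savings the review describes.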
The paper provides a good level of detail for reproducibility. The base model (Qwen2.5-72B) is specified. Comprehensive hyperparameters for both SFT (epochs, learning rate, schedule, weight decay, batch size, max sequence length, label smoothing) and DPO (beta, learning rate, batch size, epochs) are provided. Details on data curation, including the number of quadruples and the procedure for extraction, are given. Inference parameters such as maximum repair iterations and the `ExtractMilestone` window size are also specified. While explicit code or data release URLs are not provided in the text, the level of detail suggests that an informed researcher could reproduce the results given access to the base model and datasets.
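For reference, the standard DPO objective in which the beta hyperparameter appears is written below. The notation is an assumption for illustration: $x$ denotes the PSM-formatted context built from $(Q, P, S)$, $M^{+}$ a bridge accepted by the symbolic verifier, $M^{-}$ a rejected bridge, and $\pi_{\mathrm{ref}}$ the frozen reference policy (presumably the SFT model). Whether the paper uses exactly this form or a variant is not stated in this review.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, M^{+},\, M^{-})}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(M^{+} \mid x)}{\pi_{\mathrm{ref}}(M^{+} \mid x)}
        - \beta \log \frac{\pi_\theta(M^{-} \mid x)}{\pi_{\mathrm{ref}}(M^{-} \mid x)}
      \right)
    \right]
```

Here $\sigma$ is the logistic function. With a deterministic verifier, the preference label is determined by whether a bridge verifies, so no learned reward model or LLM judge enters the loss.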
1. **Verifier Dependency:** The core methodology relies on the existence of a deterministic symbolic verifier. This limits TRI's applicability to domains where such a verifier is available (e.g., formal mathematics, programming, logic puzzles) and prevents its direct use in open-ended or subjective reasoning tasks where ground truth verification is ambiguous. 2. **Not a Zero-Shot Generator:** TRI is designed as a specialized repair module within a dual-system loop, not a standalone zero-shot reasoning generator. It requires an initial draft trace and identified failure points to operate, which means it cannot initiate reasoning from scratch in an unconstrained environment. 3. **Milestone Discovery Challenges:** While the `ExtractMilestone` subroutine is effective, in scenarios with extremely sparse verifiable steps or very deeply flawed traces, it might fail to find a suitable milestone within its bounded scan window, leading to a fallback to less efficient full suffix regeneration. 4. **Theoretical Assumptions:** The Lipschitz continuity assumption for the logical scoring function, while clarified, is an idealization in discrete symbolic domains where small changes can lead to large logical shifts. The theoretical guarantees are thus interpreted as distributional concentrations rather than pointwise correctness. 5. **Gap Length Sensitivity:** Although the paper justifies the training gap span, very long logical gaps between P and S might still pose significant challenges for the model to bridge effectively, as suggested by the theoretical property on decreasing verification probability with gap length.
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Primary: Northeastern University
All Institutions: Northeastern University, Shanghai Artificial Intelligence Laboratory
This work has significant broader impact for the development of robust and trustworthy multimodal AI systems. By providing a rigorous diagnostic framework and an effective, generalizable intervention, it directly addresses a critical safety and reliability concern in ALMs: their tendency to prioritize conflicting text over clear audio evidence. This is particularly important for agentic applications in sensitive domains like healthcare, emergency services, or legal assistance, where accurate interpretation of audio is paramount. The mechanistic understanding gained through causal localization offers a powerful new lens for analyzing internal decision-making in complex multimodal models, moving beyond black-box observations and fostering more interpretable AI. The demonstrated cross-modal transfer suggests that the principles of diagnosing and correcting arbitration failures using counterfactual references and logit correction could be broadly applicable across various multimodal AI systems (e.g., vision-language, video-language), paving the way for more faithful and reliable AI across the board. This research not only provides a practical solution but also advances our fundamental understanding of multimodal reasoning and conflict resolution in large models. This paper presents a highly novel diagnostic methodology and a mechanistically-informed, training-free decoding rule to address a critical arbitration failure in audio-language models, demonstrating significant performance gains and cross-modal generalizability. The rigorous causal analysis, coupled with a practical and effective solution, makes this a standout contribution to multimodal machine learning.
The methodology is exceptionally strong, building a coherent and rigorous chain from behavioral observation to mechanistic understanding and finally to an effective intervention. The core innovation is the "same-audio counterfactual" diagnostic, which uses two branches (joint audio-text vs. audio-only) to precisely distinguish between perceptual failure and arbitration failure in Audio-Language Models (ALMs). This elegant setup, coupled with signed log-probability margins, provides a clear quantitative signature of "repairable arbitration reversals." The paper then employs activation patching, a robust causal intervention technique, to localize the arbitration failure to the answer-position residual stream within the model's "commit window." This mechanistic finding is crucial, demonstrating that audio evidence is indeed encoded but overridden during the final decision-making process. A key methodological bridge is the discovery of a high Spearman correlation (0.93) between this internal patch-induced repair direction and the observable output score difference ($s_A - s_J$). This alignment is critical because it enables the development of an output-space intervention without requiring internal model access. The proposed Gated Audio Counterfactual Logit Correction (GACL) decoding rule is directly derived from these insights, incorporating a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation. Each component is mechanistically justified and contributes to the method's robustness and safety. The methodology is a prime example of interpretable ML research, moving beyond symptom identification to root cause analysis and targeted solution design.
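The sign-flip signature can be made concrete with a small sketch. It assumes each sample provides log-probabilities for the audio-supported and text-supported candidates under both branches; the tuple layout and function names are hypothetical.

```python
from typing import Iterable, Tuple

# (logp of audio-supported answer, logp of text-supported answer)
Scores = Tuple[float, float]


def signed_margin(scores: Scores) -> float:
    """Positive when the branch prefers the audio-supported answer."""
    logp_audio_answer, logp_text_answer = scores
    return logp_audio_answer - logp_text_answer


def is_sign_flip(joint: Scores, same_audio: Scores) -> bool:
    """Same-audio branch prefers audio answer while joint branch prefers text answer."""
    return signed_margin(same_audio) > 0 and signed_margin(joint) < 0


def sign_flip_rate(samples: Iterable[Tuple[Scores, Scores]]) -> float:
    """Fraction of (joint, same_audio) pairs that show a sign flip."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(is_sign_flip(j, a) for j, a in samples) / len(samples)
```

A sample counted here is one where the audio evidence is demonstrably available to the model (the same-audio branch gets it right) yet is overridden once the conflicting text is present.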
The experimental evaluation is comprehensive and rigorously designed. The authors evaluate GACL across five diverse open-weight ALMs (7B-30B parameters) and four distinct audio-text conflict tasks (AQA, VSC, SER, ALME) from established benchmarks (MCR-Bench, ALME). This broad coverage demonstrates the widespread nature of the "text-following" problem and the general applicability of GACL. The use of normalized AUC (nAUC) over a strict faithfulness-drop budget (e.g., 5 pp) is an excellent evaluation metric, realistically capturing the trade-off between conflict resolution and preserving accuracy on faithful inputs. GACL consistently outperforms strong contrastive decoding baselines (AAD, ACD) and the joint model, achieving an impressive average improvement of 17.8 nAUC points under the strict 5 pp budget. Detailed ablation studies meticulously validate the contribution of each component of GACL, showing how gates and bounds ensure stability and prevent undesirable side effects (e.g., surface form rewriting, parse failures). The comparison to a LoRA fine-tuning baseline, where GACL retains 76% of the gain without any parameter updates, highlights its efficiency and practical value. Furthermore, the successful, untuned transfer of GACL to vision-text arbitration on MC$^2$ (achieving up to +40.5 pp adversarial accuracy) is a powerful demonstration of the generalizability of the underlying diagnostic principles across different modalities, significantly amplifying the potential impact of this work.
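A gated interpolation of the kind evaluated above might look like the sketch below. The review names the three ingredients (a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation) but not their exact form, so the argmax-disagreement test, the top-two margin threshold `tau`, and the fixed weight `lam` are all assumptions rather than the paper's definitions.

```python
from typing import List, Sequence


def gacl_scores(s_joint: Sequence[float], s_audio: Sequence[float],
                lam: float = 0.5, tau: float = 1.0) -> List[float]:
    """Interpolate candidate scores only when both (hypothetical) gates open."""
    if len(s_joint) != len(s_audio):
        raise ValueError("branches must score the same candidates")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")

    top_joint = max(range(len(s_joint)), key=s_joint.__getitem__)
    top_audio = max(range(len(s_audio)), key=s_audio.__getitem__)
    disagree = top_joint != top_audio  # branch-disagreement gate

    ranked = sorted(s_audio, reverse=True)
    # reference-reliability gate: same-audio branch must be confident
    reliable = len(ranked) > 1 and (ranked[0] - ranked[1]) >= tau

    if not (disagree and reliable):
        return list(s_joint)  # leave the joint prediction untouched
    return [(1.0 - lam) * j + lam * a for j, a in zip(s_joint, s_audio)]
```

Keeping the combination convex bounds the corrected scores between the two branches, and returning the joint scores when either gate is closed is what would protect inputs where audio and text already agree.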
The paper demonstrates a high commitment to reproducibility. The appendix provides extensive details, including specific public model checkpoints (with Hugging Face snapshot hashes), precise descriptions of benchmark splits, detailed prompt templates for each task, and the exact candidate scoring and normalization procedures. The hyperparameter tuning process, including the use of a development set and freezing parameters for testing, is clearly outlined. Furthermore, the paper provides comprehensive details for the LoRA fine-tuning baseline, including architecture, training data, optimization parameters, and hardware. Inference cost metrics (time, GPU memory, FLOPs) are also reported. This level of detail should enable researchers to reproduce the core findings and build upon this work.
The authors acknowledge several pertinent limitations. The study focuses on controlled, explicit audio-text conflicts, which, while crucial for isolating mechanisms, may not fully capture the complexity of naturally occurring conflicts involving noisier transcripts, partial notes, or broader conversational context. GACL is designed to repair arbitration failures where audio evidence is available but overridden, meaning it cannot compensate for fundamental perceptual failures where the model simply did not encode the relevant acoustic information. This distinction is important for guiding future research towards either decoding-time repair or improved acoustic modeling. A practical limitation is the increased inference latency due to the additional forward pass required for the audio-reference branch, although the authors suggest potential optimizations. Finally, while cross-modal transfer is demonstrated, the generalizability to all possible conflict sources and modality pairs remains an area for future exploration.
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
Primary: unknown
All Institutions: unknown
This paper has significant broader impact across several dimensions: 1. **Paradigm Shift for Audio AI**: It proposes a fundamental shift from offline, clip-based LALMs and single-task streaming models to a unified, interactive, real-time "Audio Interaction Model." This vision is crucial for developing truly intelligent and helpful audio assistants. 2. **Enabling New Capabilities**: The work unlocks capabilities previously inaccessible to offline LALMs, such as comprehension-grounded response triggering, long-stream interaction, and proactive assistance. This has direct implications for applications like smart homes, automotive assistants, accessibility tools, and advanced conversational AI, where real-time, context-aware intervention is vital. 3. **Resource Contribution**: The release of the large-scale StreamAudio-2M dataset and the Proactive-Sound-Bench benchmark provides invaluable resources for the research community, accelerating future work on streaming audio intelligence and interactive AI. 4. **Reduced Model Proliferation**: By unifying multiple audio tasks into a single model, it offers a path towards more efficient and general-purpose audio AI systems, potentially reducing the need for numerous specialized models. 5. **Ethical Considerations**: Proactive AI raises ethical questions regarding privacy, consent, and potential for misinterpretation or unwanted intervention. While not explicitly discussed, the framework's ability to decide *when* to respond is a step towards controllable proactive behavior, which is important for responsible deployment. This paper introduces the Audio Interaction Model and SoundFlow framework, a comprehensive solution for unifying offline LALMs and streaming audio models into a single, always-on, perceive-decide-respond system. 
Through novel streaming-native data construction, interaction-aware training, and asynchronous low-latency inference, the work demonstrates competitive performance on mainstream audio tasks while unlocking critical new capabilities like proactive assistance and general streaming instruction following, significantly advancing the field of real-time audio intelligence.
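The perceive-decide-respond loop can be pictured as a generator over incoming audio chunks. This is only a schematic, assuming chunk-level decisions made by a caller-supplied `decide` callable; it does not model the paper's asynchronous inference, and all names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def interaction_loop(chunks: Iterable[str],
                     decide: Callable[[List[str]], bool],
                     respond: Callable[[List[str]], str]) -> Iterator[str]:
    """Always-on loop: accumulate context, stay silent unless `decide` fires."""
    context: List[str] = []
    for chunk in chunks:
        context.append(chunk)        # perceive: extend the running stream
        if decide(context):          # decide: respond now or remain silent
            yield respond(context)   # respond: emit output, keep listening


if __name__ == "__main__":
    stream = ["hum", "help!", "hum"]
    replies = interaction_loop(
        stream,
        decide=lambda ctx: "help" in ctx[-1],
        respond=lambda ctx: f"responding after {len(ctx)} chunks",
    )
    print(list(replies))
```

The point of the sketch is that silence is an explicit per-chunk decision rather than the absence of a request, which is what distinguishes this regime from clip-based offline models.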
The paper introduces the Audio Interaction Model (AIM) and the SoundFlow framework, a comprehensive and highly innovative approach to unify offline Large Audio Language Models (LALMs) with streaming, single-task audio models into a single, always-on, perceive-decide-respond system. This paradigm shift addresses the inherent interactive nature of audio, which current LALMs and specialized streaming models fail to capture. The SoundFlow framework is meticulously designed, covering data, training, and deployment: 1. **Streaming-native data construction**: This is a critical component. The Time-Frequency Joint Preprocessing (TFJP) module is a clever solution to smooth audio boundaries and suppress noise, essential for stitching short clips into coherent long-form interactions. The hierarchical audio event selection, which uses an LLM for scenario planning and event refinement, followed by retrieval or generation, is a sophisticated method to ensure semantic coherence and environmental plausibility in synthetic streaming data. This addresses the challenge of creating realistic, multi-turn interactive audio sequences. 2. **Interaction-aware training**: The model learns to make chunk-level sequential decisions using special `
The experimental evaluation is extensive and rigorous, covering a wide array of benchmarks and providing deep insights into the model's behavior. 1. **Benchmarks**: The evaluation spans 8 diverse benchmarks, including general audio understanding (MMAU), spoken dialogue (AlpacaEval, SD-QA, Llama Questions, Web Questions), ASR (LibriSpeech), S2TT (CoVoST2), and the newly introduced Proactive-Sound-Bench. This broad coverage effectively demonstrates the model's versatility and unified capabilities. 2. **Baselines**: The comparison against three categories of models (Audio LLMs, Omni LLMs, and Task-specialized models) is comprehensive, allowing for a fair assessment of Audio-Interaction's performance against both general-purpose and specialized systems. 3. **Main Results**: The paper clearly demonstrates three key enhancements: * **Retained audio understanding**: Audio-Interaction maintains competitive performance on MMAU, even slightly surpassing its initialization and remaining comparable to larger 7B models. * **Competitive performance on core speech tasks**: Significant improvements on CoVoST2 (S2TT) and comparable performance on dialogue benchmarks, with only a marginal WER regression on LibriSpeech, which is an acceptable trade-off for moving to a chunk-wise streaming decoder. * **Unlocked capabilities**: This is the most impactful finding. The model's robustness to spoken instructions, selective proactive response on the novel Proactive-Sound-Bench (achieving good accuracy in both single and multi-tier events), and stability under stream concatenation highlight its unique interactive abilities. 4. **Additional Analysis**: The observations regarding continuity reconstruction at early decoder layers and the localization of the silent vs. respond decision to a single attention head provide valuable mechanistic insights into how the model learns these complex behaviors. 5. 
**Ablation Study**: The ablations are well-designed and clearly demonstrate the necessity of FIFO inference, the cumulative benefits of streaming training and data, the optimal chunk size (0.4s) for the accuracy-latency trade-off, and the balancing role of the dual-loss weight. These studies validate the design choices of the SoundFlow framework. 6. **Real-world validation**: The evaluation on 2 hours of naturally recorded audio across diverse scenarios (Travel, Work, Home, Commute) is a crucial step towards demonstrating practical applicability. The finding that performance largely retains its synthetic-stream levels, with degradation tracking acoustic difficulty, adds significant credibility to the model's robustness. The introduction of StreamAudio-2M (2.6M items, 302k hours, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench (644 human-designed events) as new resources is a major contribution, providing the community with tools to further research in this interactive paradigm.
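The chunk-wise, first-in-first-out processing mentioned in the ablations can be illustrated with a small sketch. It is not the SoundFlow implementation: the 16 kHz sample rate, the function names, and the RMS-energy `decide` rule are all hypothetical stand-ins (the real model makes the silent-vs-respond decision itself). Only the 0.4 s chunk size and the FIFO ordering come from the review text.

```python
from collections import deque

import numpy as np

SAMPLE_RATE = 16_000   # assumption: the review does not state a sample rate
CHUNK_SECONDS = 0.4    # chunk size reported as optimal in the ablation
CHUNK_SAMPLES = int(round(SAMPLE_RATE * CHUNK_SECONDS))


def split_into_chunks(stream: np.ndarray) -> list:
    """Cut a 1-D waveform into full-length chunks; a trailing partial chunk is dropped."""
    n_full = len(stream) // CHUNK_SAMPLES
    return [stream[i * CHUNK_SAMPLES:(i + 1) * CHUNK_SAMPLES] for i in range(n_full)]


def decide(chunk: np.ndarray, threshold: float = 0.1) -> str:
    """Hypothetical placeholder for the model's per-chunk decision (RMS energy gate)."""
    rms = float(np.sqrt(np.mean(chunk ** 2)))
    return "respond" if rms > threshold else "silent"


def run_stream(stream: np.ndarray) -> list:
    """Process chunks strictly in arrival order using a FIFO queue."""
    queue = deque(split_into_chunks(stream))
    decisions = []
    while queue:
        decisions.append(decide(queue.popleft()))  # oldest chunk first
    return decisions
```

In a deployed system the producer (audio capture) and the consumer (model inference) would presumably run concurrently; the single-threaded loop above only shows the ordering guarantee.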
The paper provides a strong foundation for reproducibility. * **Code and Data**: The project page and HuggingFace dataset link are provided, indicating an intent to release resources. * **Methodology Details**: The SoundFlow framework components (TFJP, hierarchical event curation, training stages, dual-loss, FIFO inference) are described in detail, including algorithms in the appendix. * **Dataset Curation**: The StreamAudio-2M curation pipeline, including sources, preprocessing, sequence concatenation, and token-level annotation, is thoroughly explained. * **Benchmark Details**: Proactive-Sound-Bench is clearly defined with its task, categories, and evaluation metrics. * **Training Details**: Hyperparameters for all four training stages are provided in the appendix, along with hardware specifications (NVIDIA H100 GPUs, bf16 mixed precision, DeepSpeed ZeRO-2). The use of a publicly available base model (Qwen2.5-Omni-3B) further aids reproducibility.
1. **Performance on existing tasks**: While Audio-Interaction is competitive, it does not always set new state-of-the-art records on all traditional benchmarks. For instance, there's a marginal WER regression on LibriSpeech. The primary strength lies in unification and new capabilities, rather than absolute peak performance on every single task. 2. **Synthetic Data Reliance**: The extensive use of LLMs for scenario planning and audio generation/stitching in StreamAudio-2M, while innovative, means the model is heavily trained on synthetic interactions. Although real-world validation is performed, the scale is limited (2 hours), and potential generalization gaps to truly unconstrained, complex real-world audio environments might exist. 3. **Model Size**: The choice of a 3B parameter model, while good for efficiency, might limit the depth of reasoning and comprehension compared to much larger LALMs, especially for highly complex, nuanced audio understanding tasks. 4. **Single Attention Head for Decision**: The observation that a single attention head dominates the silent vs. respond decision is interesting, but it could also imply a potential fragility or oversimplification in the decision-making mechanism for highly diverse and complex interactive scenarios.
Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) inherit only a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times each demonstration to any target speed by merging or splitting actions while preserving its motion semantics, and (2) a model-side conditioning mechanism that feeds the target speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
Primary: Fudan University
All Institutions: Fudan University, Renmin University of China, University of North Carolina at Chapel Hill
TempoVLA makes a significant contribution to robot manipulation by addressing the critical but often overlooked dimension of execution speed. This work enables more flexible and robust deployment of VLAs in real-world scenarios where tasks inherently require varying speeds (e.g., fast transit, slow precision). The ability to dynamically control speed based on task phases, especially with a VLM scheduler, opens up new avenues for intelligent and adaptive robot behavior, moving beyond fixed-speed, brittle policies. The finding that training with variable speeds can act as a data augmentation, improving default 1x performance, is a valuable insight for VLA training in general. This could lead to more efficient data utilization and better generalization. The framework's lightweight nature and applicability to existing VLAs promote its adoption. It also highlights the importance of considering the entire control stack (policy + low-level controller) when aiming for high-performance robot systems. TempoVLA introduces a novel data augmentation and conditioning framework to equip Vision-Language-Action models with explicit, bidirectional speed control, demonstrating improved performance and dynamic phase-aware execution in both simulation and real-world robotics tasks. This paper presents a well-executed solution to a practical and important problem in robot manipulation, offering a lightweight and generalizable method that enhances VLA capabilities by enabling flexible execution speeds. The comprehensive experimental validation, including insightful ablations and stress tests, provides strong evidence for the method's effectiveness and clarifies its operational boundaries, making it a valuable contribution to the field.
The methodology introduces TempoVLA, a framework for speed-controllable Vision-Language-Action (VLA) policies, comprising two main components: Variable-Speed Trajectory Augmentation (VSTA) and a model-side speed conditioning mechanism. VSTA is a clever data-side approach that re-times demonstrations to arbitrary target speeds. It involves motion-consistent segmentation, chunk-level speed transformation (merging/splitting actions), and online chunk-start sampling. The core idea of accumulating and re-splitting actions relies on the assumption of linear composability of actions (e.g., Cartesian translation, joint velocities, axis-angle rotations), which is explicitly discussed and justified. The online sampling strategy is well-designed to ensure all original frames contribute to training despite re-timing. The model-side conditioning mechanisms are lightweight and practical: textual prefix, RMSNorm modulation, and soft prompts. The textual prefix is particularly appealing for its simplicity and lack of architectural changes. The integration with a VLM for dynamic speed scheduling is a natural and impactful extension, demonstrating how TempoVLA can be used in a higher-level reasoning loop. The overall approach is well-motivated, addresses a clear problem, and is designed to be broadly applicable to existing VLA architectures. The discussion on the difference between EEF and Joint Action Space for VSTA is insightful, justifying the preference for EEF actions due to kinematic non-linearities and controller realizability.
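The merge-or-split idea behind VSTA can be sketched for the simplest case, where every action dimension is an additive per-step delta (for example a Cartesian translation). This is a simplified illustration under that assumption, not the paper's Algorithm 1: it omits motion-consistent segmentation and online chunk-start sampling, and the function name `retime_actions` is hypothetical. The trick is to accumulate the deltas into a path, resample that path at a new step size, and difference it again.

```python
import numpy as np


def retime_actions(actions: np.ndarray, speed: float) -> np.ndarray:
    """Re-time a (T, D) array of additive per-step deltas to `speed`x.

    speed > 1 merges actions (fewer, larger steps);
    speed < 1 splits them (more, smaller steps).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    num_steps, num_dims = actions.shape
    # Cumulative displacement at original times 0, 1, ..., T.
    path = np.vstack([np.zeros((1, num_dims)), np.cumsum(actions, axis=0)])
    orig_times = np.arange(num_steps + 1)
    # New sample times advance by `speed` original steps and always end at T.
    new_steps = int(np.ceil(num_steps / speed - 1e-9))
    new_times = np.minimum(np.arange(new_steps + 1) * speed, num_steps)
    # Piecewise-linear resampling of the path, one dimension at a time.
    resampled = np.stack(
        [np.interp(new_times, orig_times, path[:, d]) for d in range(num_dims)],
        axis=1,
    )
    # Differences of the resampled path are the re-timed actions.
    return np.diff(resampled, axis=0)
```

Because the resampled path starts and ends at the same points as the original, the total displacement is preserved at any speed. Non-additive dimensions such as a binary gripper command would need separate handling.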
The experimental evaluation is comprehensive and rigorous, covering both simulation and real-world settings. 1. **Simulation (LIBERO):** The use of LIBERO, a clean benchmark for manipulation, is appropriate. Experiments verify VSTA's feasibility, showing that it produces re-timed demonstrations with negligible motion error and reasonable replay success rates across various speeds. An ablation study on speed-integration schemes demonstrates that all three proposed methods (Text, Modulation, Soft Prompt) perform similarly, with Text being the most practical. A detailed analysis of the training speed range reveals key insights: VSTA training boosts default 1x performance (acting as useful data augmentation), and surprisingly, peak performance often shifts to slightly faster speeds (1.25x or 1.5x) due to the compression of "rhythm padding" in teleoperated data. This is a significant empirical finding. 2. **Real-world (Franka arm):** The real-world experiments on a 7-DoF Franka arm across five tasks confirm the simulation findings, showing an 8-point gain in 1x success rate and accurate tracking of commanded speeds. This demonstrates the practical applicability and robustness of TempoVLA. 3. **Dynamic Speed Control:** The integration with GPT-4o for dynamic speed scheduling is a compelling demonstration. It shows that TempoVLA can enable phase-aware speed adjustments, accelerating through low-risk phases and decelerating for high-risk ones, leading to higher success rates. 4. **Stress Test and Qualitative Analysis:** The stress test at extreme speeds (0.25x to 4x) is excellent for understanding the method's boundaries. It clearly identifies the low-level controller as the bottleneck for high-speed execution and highlights policy sensitivity at very low speeds. The qualitative failure mode analysis (hesitation at low speeds, overshoot/tracking error at high speeds) provides valuable insights into the practical operating envelope of TempoVLA. 
The metrics used (success rate, rollout length, realized model ratio, controller tracking gap) are appropriate and provide a holistic view of performance.
The paper provides sufficient details for reproducibility. The methodology for VSTA is clearly described, including its three steps and the underlying assumptions. Algorithm 1 provides pseudocode for VSTA. Hyperparameters for both simulation and real-world experiments are provided in the Appendix. Details on the base VLA model ($\pi_{0.5}$) and training setup (GPUs, iterations, batch size) are given. The prompt used for GPT-4o in dynamic speed control is also included. The action spaces for both simulation and real-world are specified. Overall, the level of detail is good for replication.
The paper openly discusses several limitations: 1. **Controller Bottleneck:** At the high end of the speed range, the realized speedup saturates because the policy's per-step targets exceed the low-level controller's tracking bandwidth. This means TempoVLA's full potential for acceleration is limited by the underlying robot control stack. 2. **Non-Composable Action Spaces:** VSTA's current implementation assumes linear composability of actions, which excludes representations like unit quaternions or rotation matrices. While the paper suggests solutions (tangent-space mapping or SLERP), these are not implemented. 3. **VLM Scheduling Latency:** The synchronous invocation of the GPT-4o scheduler adds wall-clock overhead. Asynchronous scheduling is proposed as future work. 4. **Speed Regularization:** The current approach assumes uniform per-action granularity for the 1x speed, which might not hold for diverse teleoperation datasets. A VSTA-style normalization to calibrate the 1x reference is suggested. 5. **Policy Sensitivity at Low Speeds:** The stress test shows that at very low speeds (e.g., 0.25x), the policy can exhibit "hesitation" or "stalled progress" due to extremely small per-step magnitudes, making it sensitive to ambiguous observations.
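To make the non-composability point concrete: unit quaternions cannot be split by dividing their components, but a relative rotation can be divided into equal sub-rotations by spherical linear interpolation (SLERP) from the identity. The sketch below illustrates that general technique only. The review notes the paper does not implement it, and the helper names and the (w, x, y, z) component order are assumptions made here.

```python
import numpy as np


def quat_mul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def slerp(q0: np.ndarray, q1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between unit quaternions, t in [0, 1]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:            # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)


def split_rotation(q: np.ndarray, k: int) -> np.ndarray:
    """Return the sub-rotation r such that applying r k times reproduces q."""
    identity = np.array([1.0, 0.0, 0.0, 0.0])
    return slerp(identity, q, 1.0 / k)
```

For example, splitting a 90-degree rotation about the z-axis with `k=2` yields a 45-degree rotation about z, and composing that result with itself via `quat_mul` recovers the original quaternion.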
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
Primary: Technology Innovation Institute (TII)
All Institutions: Technology Innovation Institute (TII)
HANDOFF has the potential for significant broader impact in the field of robotics and embodied AI. By proposing a compact and explicit interface for whole-body control, it could substantially simplify the development and deployment of high-level task planners for humanoid robots, making them more accessible to researchers and practitioners. The unified controller capable of seamlessly handling locomotion, manipulation, and fall recovery is a critical step towards creating truly autonomous and versatile humanoids that can operate effectively in unstructured, human-centric environments. Its demonstrated hardware feasibility with natural-language-driven tasks, without task-specific fine-tuning, opens up exciting possibilities for general-purpose, language-instructed robots that can respond flexibly to human commands. This could accelerate progress in areas such as assistive robotics, logistics, hazardous environment exploration, and general-purpose service robotics, ultimately bringing humanoids closer to real-world deployment. The distillation methodology itself could also inspire similar approaches for integrating diverse skills in other complex robotic systems. This paper introduces HANDOFF, a novel humanoid whole-body controller that uses multi-teacher KL distillation and a context-conditioned gating scheme to integrate locomotion, manipulation, and fall-recovery skills into a single mixture-of-experts student, driven by a compact, explicit interface for task planning. The work promises significant advancements in robust loco-manipulation and agentic control for humanoids, demonstrated by matching state-of-the-art velocity tracking, offering large manipulation workspaces, and enabling natural-language-driven task execution on hardware without task-specific fine-tuning.
Based on the abstract, HANDOFF introduces a novel whole-body controller for humanoid robots designed to bridge the gap between high-level task planning and low-level control. The core methodological contribution is a "compact, explicit interface" that aims to be intuitive, general, modular, and expressive, contrasting with existing controllers that demand dense kinematic references. The technical architecture of HANDOFF is a mixture-of-experts (MoE) student model, trained via multi-teacher KL distillation under a context-conditioned gating scheme. This student model synthesizes knowledge from three "complementary specialists": whole-body motion tracking (using safety-filtered data), locomotion, and fall-recovery. This distillation approach is innovative for integrating diverse, critical humanoid skills into a single, unified controller. The context-conditioned gating mechanism is crucial for allowing the MoE student to dynamically leverage the appropriate expertise based on the current task or robot state. While the abstract outlines a compelling conceptual framework, the absence of the full "sections/3methods" prevents a detailed assessment of the specific mathematical formulations, architectural choices for the experts and student, the design of the compact interface, the KL distillation loss function, and the context-conditioning mechanism. However, the proposed integration of multiple specialized skills through distillation into a single, general-purpose controller with a simplified planning interface represents a significant conceptual advancement in humanoid control.
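Since the loss formulation is not available, the following is only one plausible reading of "multi-teacher KL distillation under a context-conditioned gating scheme": the student is pulled toward each specialist by a KL term, and a context-dependent gate weights the terms. Everything in the sketch is an assumption made for illustration, including the diagonal-Gaussian action distributions, the direction of the KL, and the function names.

```python
import numpy as np


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    var_p, var_q = std_p ** 2, std_q ** 2
    per_dim = np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    return np.sum(per_dim, axis=-1)


def gated_distillation_loss(teachers, student, gate):
    """Gate-weighted sum of KL(teacher_k || student).

    teachers: list of (mean, std) pairs, one per specialist (hypothetical)
    student:  (mean, std) pair for the student policy
    gate:     non-negative weights over teachers, assumed to sum to 1
    """
    mu_s, std_s = student
    kls = np.stack(
        [gaussian_kl(mu_t, std_t, mu_s, std_s) for mu_t, std_t in teachers],
        axis=-1,
    )
    return np.sum(np.asarray(gate) * kls, axis=-1)
```

With a one-hot gate the loss reduces to ordinary single-teacher distillation from whichever specialist the context selects, for example the fall-recovery teacher once the robot has fallen.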
The abstract makes strong claims regarding HANDOFF's performance on the Unitree G1 humanoid robot. It asserts that HANDOFF "matches state-of-the-art velocity tracking" and "offers one of the largest robust manipulation workspaces." These are critical performance indicators for humanoid robots, suggesting high fidelity control and expanded operational capabilities. Furthermore, the paper claims "hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning." This demonstration of zero-shot generalization from natural language commands to complex physical tasks, without task-specific data or controller fine-tuning, is a highly impactful result, showcasing the controller's robustness and its potential for real-world deployment with high-level AI agents. Without the full "sections/4experiments," it is impossible to scrutinize the experimental setup, specific baselines used for comparison (e.g., what constitutes "state-of-the-art"), the metrics and methodologies for evaluating "robust manipulation workspaces," the diversity and complexity of the natural language tasks, or the quantitative success rates and failure modes observed. However, the *claims* themselves indicate a comprehensive evaluation across fundamental control performance, manipulation capabilities, and high-level task execution, which, if substantiated, would be highly impressive.
Based solely on the abstract and section titles, reproducibility cannot be adequately assessed. The paper mentions "safety-filtered data" but provides no details on its acquisition or filtering process. There is no mention of code availability (e.g., GitHub repository), specific hardware configurations beyond "Unitree G1," simulation environments, hyperparameter settings, or training procedures. The "sections/Aappendix" might contain some of these details, but without its content, it's impossible to verify. For a complex system involving multi-teacher distillation, MoE architectures, and hardware deployment, detailed implementation specifics are crucial for reproducibility.
The paper explicitly includes a "sections/6limitations" section, indicating the authors have considered the boundaries and weaknesses of their work. Without access to this section, specific limitations cannot be detailed. However, common challenges for such advanced humanoid control systems often include: the inherent complexity and computational cost of MoE models for real-time, low-latency control; the difficulty in providing formal safety guarantees for highly dynamic and versatile behaviors; the scalability of the approach to even more complex tasks or diverse robot platforms; the potential for the distillation process to lose some specialized performance compared to individual expert controllers; and the robustness of the context-conditioned gating scheme to novel or ambiguous situations. The reliance on a VLM-driven agentic planner also introduces dependencies on the capabilities and limitations of that external system.
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
Humanoid-GPT introduces a GPT-style Transformer trained on a billion-scale motion corpus, achieving unprecedented zero-shot generalization for whole-body control. This work represents a significant leap in data and model scaling for motion tracking, moving beyond prior limitations of shallow models and scarce data to enable robust generalization to unseen tasks and highly dynamic behaviors. By unifying major motion capture datasets and leveraging a large-scale Transformer architecture, it establishes a new performance frontier in embodied AI, potentially setting a new standard for generalizable policies in humanoid control and influencing future foundation models for robotics.