Last 7 Days (May 29 – June 04, 2026)
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Primary: Northeastern University
All Institutions: Northeastern University, Shanghai Artificial Intelligence Laboratory
This paper presents a highly novel diagnostic methodology and a mechanistically-informed, training-free decoding rule to address a critical arbitration failure in audio-language models, demonstrating significant performance gains and cross-modal generalizability. The rigorous causal analysis, coupled with a practical and effective solution, makes this a standout contribution to multimodal machine learning.
The methodology is exceptionally strong, building a coherent and rigorous chain from behavioral observation to mechanistic understanding and finally to an effective intervention. The core innovation is the "same-audio counterfactual" diagnostic, which uses two branches (joint audio-text vs. audio-only) to precisely distinguish between perceptual failure and arbitration failure in Audio-Language Models (ALMs). This elegant setup, coupled with signed log-probability margins, provides a clear quantitative signature of "repairable arbitration reversals." The paper then employs activation patching, a robust causal intervention technique, to localize the arbitration failure to the answer-position residual stream within the model's "commit window." This mechanistic finding is crucial, demonstrating that audio evidence is indeed encoded but overridden during the final decision-making process. A key methodological bridge is the discovery of a high Spearman correlation (0.93) between this internal patch-induced repair direction and the observable output score difference ($s_A - s_J$). This alignment is critical because it enables the development of an output-space intervention without requiring internal model access. The proposed Gated Audio Counterfactual Logit Correction (GACL) decoding rule is directly derived from these insights, incorporating a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation. Each component is mechanistically justified and contributes to the method's robustness and safety. The methodology is a prime example of interpretable ML research, moving beyond symptom identification to root cause analysis and targeted solution design.
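The two-branch diagnostic and the gated interpolation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the hard (0/1) gates, the top-two margin used as the reliability signal, and the values `min_ref_margin=1.0` and `lam=0.8` are all assumptions made for the example. The review only states that GACL combines a branch-disagreement gate, a reference-reliability gate, and a bounded convex interpolation between joint and same-audio scores.

```python
import math


def log_softmax(scores):
    """Normalize raw candidate scores into log-probabilities."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def is_sign_flip(s_joint, s_audio, audio_idx, text_idx):
    """True when the same-audio branch prefers the audio-supported answer
    while the joint (audio + conflicting text) branch prefers the text-supported one."""
    margin_joint = s_joint[audio_idx] - s_joint[text_idx]
    margin_audio = s_audio[audio_idx] - s_audio[text_idx]
    return margin_audio > 0 and margin_joint < 0


def gated_interpolation(joint, audio, min_ref_margin=1.0, lam=0.8):
    """Hypothetical gated correction over at least two candidate scores.

    joint: candidate scores with audio and conflicting text in the prompt.
    audio: candidate scores for the same audio with the conflicting text removed.
    """
    s_j, s_a = log_softmax(joint), log_softmax(audio)
    top_j = max(range(len(s_j)), key=lambda i: s_j[i])
    top_a = max(range(len(s_a)), key=lambda i: s_a[i])
    # Branch-disagreement gate (assumed form): act only if the branches disagree.
    disagree = top_j != top_a
    # Reference-reliability gate (assumed form): the audio branch must be confident.
    ranked = sorted(s_a, reverse=True)
    reliable = (ranked[0] - ranked[1]) >= min_ref_margin
    weight = lam if (disagree and reliable) else 0.0
    # Convex, bounded interpolation: weight stays in [0, lam] with lam <= 1.
    mixed = [(1.0 - weight) * j + weight * a for j, a in zip(s_j, s_a)]
    return mixed, weight
```

When both gates are closed the rule returns the joint scores unchanged, which is the property that protects accuracy on faithful (non-conflicting) inputs in this toy version.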
The experimental evaluation is comprehensive and rigorously designed. The authors evaluate GACL across five diverse open-weight ALMs (7B-30B parameters) and four distinct audio-text conflict tasks (AQA, VSC, SER, ALME) from established benchmarks (MCR-Bench, ALME). This broad coverage demonstrates the widespread nature of the "text-following" problem and the general applicability of GACL. The use of normalized AUC (nAUC) over a strict faithfulness-drop budget (e.g., 5 pp) is an excellent evaluation metric, realistically capturing the trade-off between conflict resolution and preserving accuracy on faithful inputs. GACL consistently outperforms strong contrastive decoding baselines (AAD, ACD) and the joint model, achieving an impressive average improvement of 17.8 nAUC points under the strict 5 pp budget. Detailed ablation studies meticulously validate the contribution of each component of GACL, showing how gates and bounds ensure stability and prevent undesirable side effects (e.g., surface form rewriting, parse failures). The comparison to a LoRA fine-tuning baseline, where GACL retains 76% of the gain without any parameter updates, highlights its efficiency and practical value. Furthermore, the successful, untuned transfer of GACL to vision-text arbitration on MC$^2$ (achieving up to +40.5 pp adversarial accuracy) is a powerful demonstration of the generalizability of the underlying diagnostic principles across different modalities, significantly amplifying the potential impact of this work.
The paper demonstrates a high commitment to reproducibility. The appendix provides extensive details, including specific public model checkpoints (with Hugging Face snapshot hashes), precise descriptions of benchmark splits, detailed prompt templates for each task, and the exact candidate scoring and normalization procedures. The hyperparameter tuning process, including the use of a development set and freezing parameters for testing, is clearly outlined. Furthermore, the paper provides comprehensive details for the LoRA fine-tuning baseline, including architecture, training data, optimization parameters, and hardware. Inference cost metrics (time, GPU memory, FLOPs) are also reported. This level of detail should enable researchers to reproduce the core findings and build upon this work.
The authors acknowledge several pertinent limitations. The study focuses on controlled, explicit audio-text conflicts, which, while crucial for isolating mechanisms, may not fully capture the complexity of naturally occurring conflicts involving noisier transcripts, partial notes, or broader conversational context. GACL is designed to repair arbitration failures where audio evidence is available but overridden, meaning it cannot compensate for fundamental perceptual failures where the model simply did not encode the relevant acoustic information. This distinction is important for guiding future research towards either decoding-time repair or improved acoustic modeling. A practical limitation is the increased inference latency due to the additional forward pass required for the audio-reference branch, although the authors suggest potential optimizations. Finally, while cross-modal transfer is demonstrated, the generalizability to all possible conflict sources and modality pairs remains an area for future exploration.
This work has significant broader impact for the development of robust and trustworthy multimodal AI systems. By providing a rigorous diagnostic framework and an effective, generalizable intervention, it directly addresses a critical safety and reliability concern in ALMs: their tendency to prioritize conflicting text over clear audio evidence. This is particularly important for agentic applications in sensitive domains like healthcare, emergency services, or legal assistance, where accurate interpretation of audio is paramount. The mechanistic understanding gained through causal localization offers a powerful new lens for analyzing internal decision-making in complex multimodal models, moving beyond black-box observations and fostering more interpretable AI. The demonstrated cross-modal transfer suggests that the principles of diagnosing and correcting arbitration failures using counterfactual references and logit correction could be broadly applicable across various multimodal AI systems (e.g., vision-language, video-language), paving the way for more faithful and reliable AI across the board. This research not only provides a practical solution but also advances our fundamental understanding of multimodal reasoning and conflict resolution in large models.
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B-30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL.
Primary: Tsinghua University
All Institutions: Tsinghua University, Zhipu
This paper introduces LongTraceRL, a novel approach that significantly enhances long-context reasoning in large language models by proposing an innovative data construction method using tiered distractors from search agent trajectories and a fine-grained rubric reward for process supervision. The work makes a strong technical contribution by addressing critical limitations in existing RLVR methods, demonstrating consistent performance improvements across multiple LLMs and benchmarks, and openly providing resources, thereby offering a promising direction for developing more robust and evidence-grounded reasoning capabilities in LLMs.
The paper introduces LongTraceRL, a novel approach to improve long-context reasoning in LLMs using Reinforcement Learning with Verifiable Rewards (RLVR). The methodology is characterized by two key innovations: data construction with "tiered distractors" and a "rubric reward" design. For data construction, the authors generate multi-hop questions using knowledge graph random walks, which ensures a structured and verifiable ground truth. Crucially, they leverage search agent trajectories to create "tiered distractors." This involves two levels of confusability: high-confusability distractors are documents the agent read but did not cite, implying they contain relevant but ultimately non-essential or misleading information; low-confusability distractors are documents that appeared in search results but were never opened, representing less relevant noise. This method for generating training contexts is highly innovative, moving beyond simple random sampling or one-shot search to create significantly more challenging and realistic long-context scenarios. This directly addresses the limitation of existing RLVR methods using low-confusability distractors. For reward design, the paper proposes a "rubric reward" that provides fine-grained, entity-level process supervision. This reward uses the gold entities along each reasoning chain, offering a more granular signal than typical outcome-only rewards. A critical aspect is the "positive-only strategy," where this rubric reward is applied exclusively to responses with correct final answers. This design aims to distinguish the quality of reasoning among correct responses and to prevent reward hacking: because the rubric reward is withheld whenever the final answer is wrong, a response presumably cannot collect credit merely by mentioning gold entities. Note that the strategy adds reward for well-grounded correct answers rather than penalizing flawed reasoning. This is a thoughtful approach to reward shaping in complex reasoning tasks.
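The tiering rule described above reduces to set differences over a search trajectory. The sketch below is only an illustration under assumed names: the `Trajectory` record and its three fields are hypothetical, and the abstract does not say how the authors store trajectories or sample distractors from each tier.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (all fields are document ids)."""
    retrieved: set  # appeared in any search result list
    opened: set     # actually read by the agent
    cited: set      # cited as evidence in the final answer


def tier_distractors(traj):
    """Split non-evidence documents into high- and low-confusability tiers."""
    # Read but not cited: plausible enough to open, yet not used as evidence.
    high = traj.opened - traj.cited
    # Surfaced by search but never opened (and not cited): weaker noise.
    low = traj.retrieved - traj.opened - traj.cited
    return sorted(high), sorted(low)
```

For example, with `retrieved={"a", "b", "c", "d"}`, `opened={"a", "b"}`, and `cited={"a"}`, document `b` lands in the high tier and `c`, `d` in the low tier.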
The synergy between these two components is strong: challenging data generation forces the model to learn robust reasoning, while the fine-grained rubric reward guides it through complex reasoning steps. While the full technical details of the RL algorithm or specific prompt engineering for the search agent are not available in the provided abstract, the conceptual framework is sound and addresses known limitations in the field.
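A positive-only rubric reward of the kind described here could look like the sketch below. Everything beyond the gating idea is an assumption for illustration: the case-insensitive substring match, the `outcome_reward` and `rubric_weight` values, and the additive combination are not taken from the paper, whose exact scoring is not given in the abstract.

```python
def rubric_reward(response, gold_entities, answer_correct,
                  outcome_reward=1.0, rubric_weight=0.5):
    """Outcome reward plus an entity-coverage bonus, granted only to correct answers."""
    # Positive-only gate: a wrong final answer earns nothing, so mentioning
    # gold entities alone cannot be exploited for reward.
    if not answer_correct:
        return 0.0
    text = response.lower()
    # Hypothetical matcher: count gold entities that appear in the response.
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    coverage = hits / len(gold_entities) if gold_entities else 0.0
    return outcome_reward + rubric_weight * coverage
```

Under these assumed weights, two correct responses are separated by how much of the gold reasoning chain they surface, while every incorrect response receives the same zero reward.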
The abstract states that experiments were conducted on three reasoning LLMs (ranging from 4B to 30B parameters) across five long-context benchmarks. This demonstrates a commitment to comprehensive evaluation across different model scales and task settings. The claim that LongTraceRL "consistently outperforms strong baselines" is a significant result, suggesting the robustness and effectiveness of the proposed methods. Furthermore, the abstract highlights a qualitative benefit: the approach "encourages comprehensive, evidence-grounded reasoning." This is crucial for long-context tasks, where not just the final answer but also the explainability and traceability of the reasoning process are highly valued. Without access to the full experimental section, specific metrics, baseline details, and detailed result tables cannot be assessed, but the stated scope and outcomes are promising. The open-sourcing of codes, datasets, and models further enhances the value of these experimental findings by enabling verification and future research.
The paper explicitly states that "Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL." This commitment to open-sourcing is excellent and significantly boosts the reproducibility of the work. Providing the datasets (especially the uniquely constructed tiered distractors) and the trained models will allow other researchers to replicate the results, build upon the methodology, and further investigate the approach. This is a strong point for the paper.
1. **Data Generation Complexity**: The generation of multi-hop questions via knowledge graph random walks and, more significantly, the leveraging of search agent trajectories to build tiered distractors, appears to be a complex and potentially resource-intensive process. This might limit its applicability to domains where such structured knowledge graphs and search agent capabilities are readily available or easily simulated.
2. **Domain Specificity**: The reliance on knowledge graphs for question generation might implicitly limit the types of reasoning tasks or domains where LongTraceRL is most effective. Its generalizability to other long-context tasks (e.g., summarization of unstructured documents, code analysis, creative writing) beyond multi-hop QA is not explicitly discussed.
3. **"Positive-Only" Reward Strategy**: While designed to prevent reward hacking, the "positive-only" strategy for the rubric reward might miss valuable learning signals from responses that are incorrect but demonstrate partial understanding or nearly correct reasoning steps. A more nuanced reward function that can provide negative feedback for specific incorrect steps might accelerate learning.
4. **Computational Cost**: The abstract does not discuss the computational cost associated with training LLMs with RLVR, especially with the complex data generation and fine-grained reward signals. This could be a practical limitation for wider adoption, particularly for larger models.
LongTraceRL has the potential for significant broader impact on the field of large language models and long-context reasoning. 1. **Improved LLM Capabilities**: By addressing a central challenge of LLMs, it can lead to more reliable and capable models for tasks requiring deep understanding and integration of information from extensive documents, such as complex question answering, scientific literature review, legal document analysis, and medical diagnosis support. 2. **Novel Data Generation Paradigms**: The "tiered distractors" approach offers a new paradigm for creating challenging and realistic long-context benchmarks and training data, which can be adopted by the community to develop more robust LLMs. 3. **Advanced RLVR Techniques**: The "rubric reward" design provides a valuable contribution to the field of Reinforcement Learning with Verifiable Rewards, demonstrating how fine-grained process supervision can be effectively integrated to guide complex reasoning, potentially inspiring similar reward shaping techniques for other intricate tasks. 4. **Foundation for Future Research**: The open-sourced code, datasets, and models will serve as a valuable resource, lowering the barrier for other researchers to build upon this work, explore its limitations, and extend its applicability to new domains and reasoning challenges.
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Google DeepMind, Stanford University, Carnegie Mellon University
This paper introduces StreamMA, a novel multi-agent reasoning system that employs streaming communication to reduce latency and surprisingly improve effectiveness by leveraging reliable early reasoning steps. The work presents a rigorous formal analysis, extensive empirical validation across diverse benchmarks and frontier LLMs, and discovers a new "step-level scaling law," making it a highly significant contribution to multi-agent AI and LLM research.
The paper introduces StreamMA, a novel multi-agent reasoning system that shifts from the traditional "generate-then-transfer" paradigm to a "streaming communication" approach. This involves pipelining reasoning steps, where downstream agents receive and process partial information as soon as it's generated by upstream agents. The core innovation lies in demonstrating a dual benefit: reduced end-to-end latency and, surprisingly, improved effectiveness. The effectiveness gain is attributed to leveraging more reliable early reasoning steps, preventing error propagation from potentially flawed later steps. The methodology is rigorously supported by the first closed-form joint analysis of stream, serial, and single protocols, providing theoretical derivations for effectiveness ordering, speedup upper bounds, and cost ratios. Agents are designed to generate reasoning steps and an "end-of-step" token, allowing for flexible granularity. The approach is versatile, demonstrated across Chain, Tree, and Graph topologies. This is a well-conceived and theoretically grounded methodology.
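To make the latency argument concrete, the sketch below uses the textbook idealization of a pipeline, not the paper's closed-form analysis (which is not reproduced in this review). It assumes a chain of `num_agents` agents, each producing `steps_per_agent` steps of equal duration `step_time`, with every downstream agent able to start as soon as its upstream neighbor has emitted one step. The function names and the uniform-step assumption are hypothetical.

```python
def serial_latency(num_agents, steps_per_agent, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return num_agents * steps_per_agent * step_time


def stream_latency(num_agents, steps_per_agent, step_time=1.0):
    """Idealized streaming: agent i starts one step after agent i-1 starts."""
    return (steps_per_agent + num_agents - 1) * step_time


def speedup(num_agents, steps_per_agent):
    """Ratio of serial to streamed latency under the uniform-step assumption."""
    return (serial_latency(num_agents, steps_per_agent)
            / stream_latency(num_agents, steps_per_agent))
```

In this toy model the speedup is `N * K / (N + K - 1)` for `N` agents and `K` steps each, which stays below `N` and approaches it as `K` grows. That is at least consistent with the reported observation that more per-agent steps improve efficiency, although the paper's actual bound may differ.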
The experimental evaluation is comprehensive and robust. The authors test StreamMA across eight diverse reasoning benchmarks spanning mathematics (HMMT, GSM8K, MATH), science (ARC, BigBench Hard), and code generation (HumanEval, MBPP, APPS). This breadth demonstrates the generalizability of the approach. Two frontier LLMs, Claude Opus 4.6 and GPT-5.4, are used, providing strong baselines and highlighting the practical relevance to state-of-the-art systems. StreamMA consistently outperforms both "Serial" (generate-then-transfer) and "Single" (single-agent) baselines, achieving significant average effectiveness gains of +7.3 percentage points and a maximum of +22.4 pp on HMMT 2026. The paper also validates latency reduction and explores the "step-level scaling law," a novel empirical finding that increasing per-agent steps improves both effectiveness and efficiency. The experiments across different topologies (Chain, Tree, Graph) further solidify the findings. While the use of proprietary LLMs limits direct reproducibility for all researchers, the results are compelling and well-supported.
The paper provides a detailed description of the StreamMA methodology, including agent prompting strategies, communication protocols, and the formal analysis. This level of detail is commendable. However, the reliance on proprietary frontier LLMs (Claude Opus 4.6, GPT-5.4) means that exact replication of the results requires access to these specific models, which might not be universally available. The authors state that "Our code is available at [URL redacted for anonymity]," indicating that code exists but is not publicly linked in the provided version. Publicly available code would significantly enhance reproducibility. Given the detailed methodology and the promise of code, the work is reproducible in principle, but the LLM dependency and current lack of a public code link are practical limitations.
The authors acknowledge several limitations. Streaming communication can increase the total token count if agents re-process information, potentially leading to higher API costs, though this is often offset by improved effectiveness. Designing and managing complex graph-based multi-agent systems remains challenging. The approach relies on LLMs being capable of effectively processing and acting on partial, streaming information. The current focus is primarily on reasoning tasks, and its generalizability to other LLM applications like creative generation is not explored. For very simple tasks, the overhead of streaming might outweigh the benefits. Additionally, the reliance on proprietary frontier LLMs limits immediate open-source replication, and while the "step-level scaling law" is a fascinating discovery, its theoretical underpinnings and boundary conditions are not fully explored.
This paper offers a significant contribution to the field of multi-agent LLM systems. It introduces a new paradigm for communication that addresses a critical bottleneck (latency) while simultaneously improving reasoning effectiveness. This has profound implications for designing more efficient and responsive multi-agent systems, making them more viable for real-time and interactive applications. The discovery of the "step-level scaling law" opens up a novel research dimension for optimizing LLM performance and multi-agent system design, orthogonal to existing scaling laws. The insight that leveraging early, more reliable reasoning steps can prevent error propagation is a valuable lesson for structuring complex LLM-based reasoning tasks. This work is likely to influence future research and development in multi-agent AI and LLM deployment strategies.
Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats: hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.
Primary: University of Tübingen
All Institutions: University of Tübingen, University of Vienna, Tübingen AI Center
This paper provides a crucial evaluation framework for scrutinizing validity threats in foundation model research designs. It offers a novel and timely perspective by casting foundation model research as a causal inference problem and systematically applying validity types from the empirical social sciences to common research strategies, thereby addressing the pressing challenge of conducting rigorous research in a compute-constrained environment. The comprehensive analysis of different research strategies through the lens of statistical, internal, external, and construct validity, and the identification of previously under-emphasized validity threats, positions this work as a potentially foundational contribution to the methodology of large-scale machine learning research.
The paper proposes a conceptual methodology, adapting established frameworks from the empirical social sciences—specifically, causal inference and four types of validity (statistical, internal, external, and construct validity)—to scrutinize research designs for foundation models. It casts foundation model research as a causal inference problem, which is a powerful lens for identifying hidden assumptions and potential threats to validity. The approach involves analyzing common research strategies in the foundation model space (proxy experiments, scaling laws, observational studies, and single-run designs) through this validity framework. The methodology is analytical and aims to provide a structured way to think about the rigor and generalizability of findings in a compute-constrained environment. While the full details of the framework's application to each strategy are not provided in the given text, the abstract outlines specific validity threats identified for each strategy, suggesting a concrete and systematic analysis.
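One of the threats named in the abstract, confounding in observational studies of public models, can be illustrated with a small synthetic example. The data below are invented for this sketch and are not from the paper: eight hypothetical models, where the technique under study adds exactly 1 point, a large compute budget adds 10 points, and models using the technique mostly have large budgets.

```python
def mean(values):
    return sum(values) / len(values)


# Synthetic rows: (uses_technique, large_compute, benchmark_score).
MODELS = [
    (True, True, 61.0), (True, True, 61.0), (True, True, 61.0), (True, False, 51.0),
    (False, True, 60.0), (False, False, 50.0), (False, False, 50.0), (False, False, 50.0),
]


def naive_effect(rows):
    """Raw difference in mean score between models with and without the technique."""
    treated = [score for used, _, score in rows if used]
    control = [score for used, _, score in rows if not used]
    return mean(treated) - mean(control)


def stratified_effect(rows):
    """Difference in means within each compute stratum, weighted by stratum size."""
    total = 0.0
    for stratum in (True, False):
        subset = [row for row in rows if row[1] == stratum]
        treated = [score for used, _, score in subset if used]
        control = [score for used, _, score in subset if not used]
        total += len(subset) * (mean(treated) - mean(control))
    return total / len(rows)
```

On these synthetic rows, `naive_effect(MODELS)` gives 6.0 while `stratified_effect(MODELS)` recovers the constructed effect of 1.0. The adjustment works here only because compute is the sole confounder and is observed, which is exactly the kind of assumption the framework asks researchers to state and scrutinize.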
This paper is a conceptual and methodological work; therefore, it does not present traditional experimental evaluations with datasets and results. Its "evaluation" is an analytical one, evaluating different research *strategies* rather than specific models or algorithms. The success of this paper's "evaluation" lies in the clarity, comprehensiveness, and utility of the proposed framework and the insights it generates regarding existing research practices. Without the full text, it's impossible to assess the depth and rigor of this analytical evaluation.
As a conceptual framework paper, reproducibility in the traditional sense (e.g., code, experimental setups) is not directly applicable. However, the framework itself should be clearly defined and articulated such that other researchers can understand, apply, and critique it. The "practical toolkit" mentioned in the abstract implies a structured approach that should be reproducible in its application. The discussion mentions "Open-science initiatives like the Marin Project that openly document training recipes and meta-data can also help," which aligns with principles of reproducibility in the broader ML community.
The primary limitation of this evaluation is the lack of the full paper content for the main sections (e.g., `neurips/sections/proxy`, `neurips/sections/observational`, `neurips/sections/singlerun-v5`, `neurips/sections/validity-profiles`). Therefore, the assessment of the framework's depth, specific insights, and practical utility is based primarily on the abstract and the high-level structure. Without these details, it's difficult to ascertain if the framework is sufficiently comprehensive, if the identified validity threats are exhaustively covered, or if the proposed solutions/mitigations are practical and well-justified. Another potential limitation, inherent in adapting frameworks from other fields, is the challenge of ensuring that the concepts (e.g., construct validity) are appropriately translated and applied to the unique context of machine learning and foundation models without oversimplification or misinterpretation.
This paper has the potential for significant broader impact. By providing a structured framework for evaluating validity threats, it can elevate the methodological rigor of foundation model research. It encourages researchers to critically examine their experimental designs, understand the limitations of their findings, and make more robust claims. This can lead to more reliable and trustworthy research, better allocation of compute resources, and a more mature scientific discourse around large-scale ML. It could serve as a foundational reference for designing future experiments, reviewing papers, and teaching research methodology in the era of large models. The emphasis on "hidden and sometimes untestable assumptions" is crucial for fostering a more transparent and self-aware research community.
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
Primary: University of Southern California
All Institutions: University of Southern California
This work has significant broader impact for the field of machine learning, particularly in the context of large language models and agent development. By offering a principled and empirically effective method to leverage rich feedback, DistIL can accelerate the training of AI agents for complex reasoning, problem-solving, and creative generation tasks. It provides a theoretically sound alternative to existing RLHF/RLAIF paradigms, potentially leading to more stable, efficient, and robust learning. The theoretical insights into the limitations of widely used divergence objectives (reverse KL, Jensen-Shannon) for guaranteeing monotonic improvement are fundamental and could influence the design of future objective functions across various machine learning domains. This paper pushes the boundaries of how we think about and utilize expert knowledge in sequential decision-making. This paper introduces DistIL, a novel distributional DAgger variant that leverages rich feedback through a forward cross-entropy objective. Its core contribution lies in providing strong theoretical guarantees for monotonic policy improvement and regret, while empirically demonstrating superior performance over existing RL from feedback and self-distillation methods across challenging reasoning, coding, and mathematical problem-solving tasks. The work offers a principled and effective approach to utilize the nuanced information available in rich feedback, addressing a critical limitation of traditional binary reward schemes and paving the way for more robust and capable AI agents.
The methodology proposed in this paper, DistIL (Distributional Imitation Learning), is a sophisticated and well-justified extension of the classic DAgger algorithm. The core innovation is the shift from learning from single expert actions to learning from *expert distributions* over actions, coupled with a novel forward cross-entropy (FCE) objective. The paper meticulously details how this FCE objective facilitates sequence-level credit assignment by propagating future expert-student disagreement back to earlier decisions, a crucial aspect for multi-step reasoning tasks. A significant strength of the methodology is the rigorous theoretical analysis, which demonstrates that FCE guarantees monotonic policy improvement and provides regret bounds, unlike commonly used reverse KL divergence (e.g., in RLAIF, DPO) or Jensen-Shannon divergence (e.g., in self-distillation). This theoretical underpinning provides a strong foundation for the empirical successes. The practical instantiation of the "expert distribution" using strong LLMs to generate multiple candidate actions and their log-probabilities is a clever and effective way to bridge the theoretical framework with real-world applications.
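To make the contrast between the two objectives concrete, the following is a minimal single-state sketch, not the paper's implementation. It assumes categorical action distributions given as plain lists and an expert with full support; the function names (`forward_cross_entropy`, `reverse_kl`, `fce_logit_gradient`) are hypothetical. It relies only on the standard identity that the gradient of forward cross-entropy with respect to softmax logits is `softmax(logits) - expert`, and it does not attempt to reproduce the sequence-level credit assignment described above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_cross_entropy(expert, student):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    return -sum(p * math.log(q) for p, q in zip(expert, student) if p > 0)

def reverse_kl(expert, student):
    """KL(student || expert); assumes expert(a) > 0 wherever student(a) > 0."""
    return sum(q * math.log(q / p) for p, q in zip(expert, student) if q > 0)

def fce_logit_gradient(expert, logits):
    """Gradient of forward cross-entropy w.r.t. student logits."""
    return [q - p for p, q in zip(expert, softmax(logits))]

# Toy state with three actions: the expert prefers action 0.
expert = [0.7, 0.2, 0.1]
logits = [0.0, 0.0, 0.0]  # uniform student

before = forward_cross_entropy(expert, softmax(logits))
grad = fce_logit_gradient(expert, logits)
logits = [z - 0.5 * g for z, g in zip(logits, grad)]  # one gradient step
after = forward_cross_entropy(expert, softmax(logits))
student = softmax(logits)
```

In this toy setting a single step moves student mass toward the expert's preferred action and lowers the forward cross-entropy. The step size of 0.5 is arbitrary.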
The experimental evaluation is comprehensive, rigorous, and highly convincing. The authors select a diverse and challenging set of domains: scientific reasoning (SciBench), coding (CodeContests), and solving hard mathematical problems (GSM8K, MATH). These tasks are ideal for showcasing the benefits of rich feedback and multi-step reasoning. The chosen baselines are state-of-the-art methods in reinforcement learning and preference optimization (RLVR, PPO, RLAIF, RPO, DPO) and self-distillation, providing a strong comparative analysis. DistIL consistently and significantly outperforms all baselines across all tasks, demonstrating its superior effectiveness. The improvements in Pass@N are substantial and directly align with the theoretical guarantees. Furthermore, the ablation studies effectively isolate the contributions of the FCE objective and the distributional nature of the feedback, reinforcing the core claims of the paper.
The paper provides a good level of detail regarding the experimental setup, including the specific base models, expert models, hyperparameters, and training procedures in both the main text and the appendix. This level of detail is commendable and should allow a diligent researcher to reproduce the main results. While the code is not yet publicly available (a placeholder URL is present), the comprehensive descriptions suggest a strong commitment to reproducibility.
The primary limitation of DistIL, inherent to any imitation learning approach, is its reliance on the quality and availability of a "blackbox expert" that can provide high-quality *distributions* over actions. While the paper effectively demonstrates how powerful LLMs can serve this role, the method's performance is ultimately bounded by the expert's capabilities and the fidelity of the estimated expert distributions. The iterative nature of DAgger, involving repeated data collection and expert queries, can also be computationally intensive, especially if the expert is an expensive API-based LLM. While tested on complex text-based reasoning, the scalability and applicability of DistIL to high-dimensional, continuous action spaces or complex physical environments with less structured feedback remain open questions.
Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.
Primary: Great Bay University
All Institutions: Great Bay University, Beijing International Center for Mathematical Research, New Cornerstone Science Laboratory, Peking University, School of Mathematical Sciences, Center for Intelligent Computing, Center for Machine Learning Research, Great Bay Institute for Advanced Study, Zhongguancun Academy
The paper acknowledges several important limitations. Firstly, the computational cost of running LLM-based agents for extended research loops is high. Secondly, human validation and correction remain essential; Iteris acts as a powerful copilot rather than a fully autonomous researcher, highlighting the current limits of AI in complex, open-ended scientific discovery. The system's current scope is limited to specific types of computational mathematics problems, and scaling to extremely complex, multi-year research projects would be challenging. Furthermore, like all LLM-based systems, Iteris is susceptible to hallucination, necessitating rigorous human oversight. The paper also implicitly suggests that the agent's performance is highly dependent on the quality of the underlying LLM (GPT-4 in this case) and the effectiveness of prompt engineering, which is not fully detailed.
Iteris represents a significant step towards enabling agentic AI systems to participate meaningfully in scientific discovery, particularly in computational mathematics. Its success in generating novel numerical evidence, constructions, and proof drafts for open problems suggests a powerful paradigm for human-AI collaboration in research. This work could accelerate discovery in various scientific and engineering domains that rely on numerical experimentation, algorithm design, and adversarial analysis. It provides a blueprint for developing more sophisticated AI research assistants that can augment human intellect, allowing researchers to tackle more ambitious problems or explore larger solution spaces. The findings also contribute to the ongoing development of more capable and autonomous AI agents, pushing the boundaries of what LLMs can achieve in complex reasoning and problem-solving tasks. This paper introduces Iteris, an agentic research system that leverages large language models and a structured research loop to tackle open problems in computational mathematics. The system's ability to generate novel numerical evidence, adversarial constructions, and proof drafts, leading to verified mathematical discoveries like a phase diagram for CG vs. RCD and a counterexample for QRCP, demonstrates a significant advancement in applying agentic AI to scientific research. The methodology, while building on existing agentic patterns, is well-adapted and integrated with essential tools for the domain, showcasing a practical and impactful approach to human-AI collaboration in complex scientific discovery.
The paper introduces Iteris, an agentic research system designed for open problems in computational mathematics. The methodology is built around a robust "Analyze, Plan, Execute, Reflect" research loop, orchestrated by a central Research Agent. This loop is supported by specialized Planner, Executor, and Reflector agents, each leveraging large language models (specifically GPT-4) for their respective tasks. A key strength of Iteris is its integration of diverse tools essential for computational mathematics, including a Python interpreter (with scientific libraries like NumPy, SciPy, Matplotlib), Wolfram Alpha, a LaTeX compiler, and web search capabilities. This tool integration allows the agents to perform numerical experimentation, symbolic computation, document generation, and information retrieval, which are critical for the target domain. The multi-agent architecture with clear roles and the iterative refinement process are well-conceived for tackling complex, open-ended research problems. While the core agentic loop structure builds on existing patterns like ReAct and Reflexion, its specific adaptation and tool integration for the unique demands of computational mathematics are well-executed and appropriate.
The experimental evaluation is conducted through two compelling case studies, both tackling open problems from a recent Simons Workshop collection.
1. **Asymptotic Comparison between Conjugate Gradient (CG) and Randomized Coordinate Descent (RCD):** Iteris successfully explored the convergence behavior of CG and RCD on power-law spectra. Through iterative numerical experimentation and analysis, the system generated plots, hypothesized a phase transition, and drafted proof sketches. This led to the discovery of a phase diagram, which was subsequently verified and corrected by human experts, providing a novel result in numerical linear algebra.
2. **QR Factorization with Column Pivoting (QRCP) for Submatrix Selection:** Iteris investigated whether QRCP reliably selects well-conditioned submatrices, even under low coherence. The system demonstrated its ability to search for existing knowledge, generate small-scale numerical examples, and, crucially, construct an adversarial counterexample where QRCP fails to select a well-conditioned submatrix under specific low-coherence conditions. This is a significant finding, revealing a limitation of a widely used algorithm.
The results from both case studies are concrete mathematical discoveries, not just demonstrations of problem-solving on known benchmarks. The fact that these findings were verified by human experts underscores their validity and the meaningful contribution of Iteris to the research process. The experiments effectively showcase Iteris's capabilities in numerical exploration, hypothesis generation, construction of specific examples (including adversarial ones), and proof drafting.
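For readers unfamiliar with how QRCP selects columns, the following is a minimal sketch of the standard greedy pivoting rule: repeatedly pick the column with the largest residual norm, then remove its direction from the remaining columns. It uses modified Gram-Schmidt in pure Python for readability rather than the Householder-based routines found in numerical libraries. The function name `qrcp_select` and the toy matrix are illustrative assumptions, and the sketch does not reproduce the paper's counterexample.

```python
import math

def qrcp_select(columns, k):
    """Return the indices of k columns chosen by greedy column pivoting.

    columns: list of column vectors, each a list of floats of equal length.
    """
    residual = [list(c) for c in columns]  # work on copies
    chosen = []
    for _ in range(k):
        norms = [math.sqrt(sum(x * x for x in c)) for c in residual]
        for j in chosen:
            norms[j] = -1.0  # never re-select a pivot column
        pivot = max(range(len(residual)), key=lambda j: norms[j])
        if norms[pivot] <= 0.0:
            break  # remaining columns are numerically zero (rank deficient)
        chosen.append(pivot)
        q = [x / norms[pivot] for x in residual[pivot]]
        # Remove the pivot direction from every residual column.
        for j, c in enumerate(residual):
            proj = sum(a * b for a, b in zip(q, c))
            residual[j] = [a - proj * b for a, b in zip(c, q)]
    return chosen

# Columns 0 and 1 are nearly parallel; column 2 is short but orthogonal to both.
cols = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 0.5]]
selected = qrcp_select(cols, 2)
```

Here the rule picks column 0 first (largest norm) and then column 2, because column 1 has almost no component left after column 0 is projected out. A counterexample of the kind described above would be a matrix on which this greedy rule returns a poorly conditioned selection.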
The paper provides a clear description of the Iteris framework, its agents, and the tools used. The two case studies are detailed, outlining the problems, Iteris's approach, and the final verified results. The appendices contain the detailed proofs for the mathematical findings, which are independently verifiable. Crucially, the authors provide GitHub links for the code related to the specific case studies, which enhances the reproducibility of the *results*. However, the exact prompts used for the LLM agents and the full, step-by-step trace of the agent's discovery process (including all intermediate thoughts, tool calls, and reflections) are not fully detailed in the main paper or appendices. While the overall methodology is clear, reproducing the *exact path of discovery* taken by Iteris might require more granular logging or prompt engineering details. Nevertheless, the core findings are robust and verifiable.
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Google DeepMind, Stanford University, Carnegie Mellon University
This paper introduces StreamMA, a novel multi-agent reasoning system that employs streaming communication to reduce latency and surprisingly improve effectiveness by leveraging reliable early reasoning steps. The work presents a rigorous formal analysis, extensive empirical validation across diverse benchmarks and frontier LLMs, and discovers a new "step-level scaling law," making it a highly significant contribution to multi-agent AI and LLM research.
The paper introduces StreamMA, a novel multi-agent reasoning system that shifts from the traditional "generate-then-transfer" paradigm to a "streaming communication" approach. This involves pipelining reasoning steps, where downstream agents receive and process partial information as soon as it's generated by upstream agents. The core innovation lies in demonstrating a dual benefit: reduced end-to-end latency and, surprisingly, improved effectiveness. The effectiveness gain is attributed to leveraging more reliable early reasoning steps, preventing error propagation from potentially flawed later steps. The methodology is rigorously supported by the first closed-form joint analysis of stream, serial, and single protocols, providing theoretical derivations for effectiveness ordering, speedup upper bounds, and cost ratios. Agents are designed to generate reasoning steps and an "end-of-step" token, allowing for flexible granularity. The approach is versatile, demonstrated across Chain, Tree, and Graph topologies. This is a well-conceived and theoretically grounded methodology.
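The latency argument can be illustrated with a toy timing model. This is a sketch under strong assumptions, not the paper's closed-form analysis. It assumes a Chain topology, a uniform per-step time, and that a downstream agent can produce its i-th step as soon as the upstream agent's i-th step has arrived. The function names and parameters are hypothetical.

```python
def serial_latency(depth, steps, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return depth * steps * step_time

def stream_latency(depth, steps, step_time=1.0):
    """Streaming: agent a starts step i once its own step i-1 and
    upstream step i are both finished (toy dependency assumption)."""
    upstream_done = [0.0] * steps  # inputs to the first agent are ready at t=0
    for _ in range(depth):
        finished = []
        clock = 0.0
        for i in range(steps):
            clock = max(clock, upstream_done[i]) + step_time
            finished.append(clock)
        upstream_done = finished
    return upstream_done[-1]

depth, steps = 4, 8
serial = serial_latency(depth, steps)
stream = stream_latency(depth, steps)
speedup = serial / stream
```

Under these assumptions the streaming latency works out to `(steps + depth - 1) * step_time`, so the speedup approaches `depth` as the number of steps per agent grows. The bound derived in the paper may differ, since it accounts for effects this toy model ignores.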
The experimental evaluation is comprehensive and robust. The authors test StreamMA across eight diverse reasoning benchmarks spanning mathematics (HMMT, GSM8K, MATH), science (ARC, BigBench Hard), and code generation (HumanEval, MBPP, APPS). This breadth demonstrates the generalizability of the approach. Two frontier LLMs, Claude Opus 4.6 and GPT-5.4, are used, providing strong baselines and highlighting the practical relevance to state-of-the-art systems. StreamMA consistently outperforms both "Serial" (generate-then-transfer) and "Single" (single-agent) baselines, achieving significant average effectiveness gains of +7.3 percentage points and a maximum of +22.4 pp on HMMT 2026. The paper also validates latency reduction and explores the "step-level scaling law," a novel empirical finding that increasing per-agent steps improves both effectiveness and efficiency. The experiments across different topologies (Chain, Tree, Graph) further solidify the findings. While the use of proprietary LLMs limits direct reproducibility for all researchers, the results are compelling and well-supported.
The paper provides a detailed description of the StreamMA methodology, including agent prompting strategies, communication protocols, and the formal analysis. This level of detail is commendable. However, the reliance on proprietary frontier LLMs (Claude Opus 4.6, GPT-5.4) means that exact replication of the results requires access to these specific models, which might not be universally available. The authors state that "Our code is available at [URL redacted for anonymity]," indicating that code exists but is not publicly linked in the provided version. Publicly available code would significantly enhance reproducibility. Given the detailed methodology and the promise of code, the work is reproducible in principle, but the LLM dependency and current lack of a public code link are practical limitations.
The authors acknowledge several limitations. Streaming communication can increase the total token count if agents re-process information, potentially leading to higher API costs, though this is often offset by improved effectiveness. Designing and managing complex graph-based multi-agent systems remains challenging. The approach relies on LLMs being capable of effectively processing and acting on partial, streaming information. The current focus is primarily on reasoning tasks, and its generalizability to other LLM applications like creative generation is not explored. For very simple tasks, the overhead of streaming might outweigh the benefits. Additionally, the reliance on proprietary frontier LLMs limits immediate open-source replication, and while the "step-level scaling law" is a fascinating discovery, its theoretical underpinnings and boundary conditions are not fully explored.
This paper offers a significant contribution to the field of multi-agent LLM systems. It introduces a new paradigm for communication that addresses a critical bottleneck (latency) while simultaneously improving reasoning effectiveness. This has profound implications for designing more efficient and responsive multi-agent systems, making them more viable for real-time and interactive applications. The discovery of the "step-level scaling law" opens up a novel research dimension for optimizing LLM performance and multi-agent system design, orthogonal to existing scaling laws. The insight that leveraging early, more reliable reasoning steps can prevent error propagation is a valuable lesson for structuring complex LLM-based reasoning tasks. This work is likely to influence future research and development in multi-agent AI and LLM deployment strategies.
Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native *goal-conditioned bridging* capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
Primary: University of Oxford
All Institutions: University of Oxford, FLock.io, TU Wien
TRI represents a significant step towards making LLM reasoning more robust, reliable, and trustworthy, especially in high-stakes domains. Its dual-system approach, combining the generative power of LLMs with the precision of symbolic verifiers, offers a powerful paradigm for building more capable AI systems. The ability to surgically repair reasoning chains efficiently has major implications for:
1. **Formal Methods and Mathematics:** Accelerating mathematical discovery, proof generation, and verification by providing a robust tool for bridging logical gaps.
2. **Software Engineering:** Enhancing automated code generation, debugging, and repair, leading to more reliable and efficient software development.
3. **Scientific Discovery:** Improving the reliability of LLM-assisted scientific reasoning and hypothesis generation in fields requiring rigorous logical deduction.
4. **Resource Efficiency:** The substantial token efficiency gains contribute to reducing the computational cost and environmental footprint of complex LLM reasoning tasks.
5. **Beyond CoT:** By addressing a fundamental limitation of autoregressive generation, TRI offers a principled alternative or augmentation to existing CoT methods, potentially influencing future LLM architectures and training strategies for reasoning.
This paper introduces Teleological Reasoning Infilling (TRI), a novel framework that endows decoder-only transformers with a native goal-conditioned bridging capability for robust chain repair, achieving state-of-the-art performance and significant token efficiency on complex reasoning tasks. The work makes substantial contributions through its elegant Prefix-Suffix-Middle (PSM) sequence architecture, a principled two-stage training pipeline leveraging deterministic symbolic verifiers, a surgical dual-system inference repair algorithm, and rigorous theoretical analysis, offering a powerful solution to the critical problem of error snowballing in LLM reasoning.
The paper introduces Teleological Reasoning Infilling (TRI), a novel framework addressing the critical "error snowballing" problem in autoregressive Chain-of-Thought (CoT) reasoning by LLMs. The core idea is to reframe erroneous reasoning segments as Fill-in-the-Middle (FIM) tasks, where the model must synthesize a logical bridge (M) between a verified prefix premise (P) and a verified downstream milestone (S), given the original query (Q). This goal-conditioned bridging capability is a significant conceptual leap from purely forward-directed generation. A key technical innovation is the Prefix-Suffix-Middle (PSM) sequence rearrangement. By introducing three non-overlapping sentinel tokens and reordering the input as `[Q _premise P _milestone S _bridge M]`, the authors enable standard causal decoder-only transformers to attend to both P and S when generating M, without any modification to the self-attention mechanism: because M is placed last, ordinary left-to-right attention already covers Q, P, and S. This is a clever and efficient architectural trick. The training pipeline is robust and principled, consisting of two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified (Q, P, S, M) quadruples extracted from formal mathematics corpora (MATH, Lean-Workbook). The meticulous data curation, including independent verification of P and S and anti-contamination measures, ensures high-quality training data. (ii) Direct Preference Optimisation (DPO) with a *deterministic symbolic verifier* (Lean 4 / Python) as the sole reward oracle. This is a crucial design choice, explicitly rejecting LLM-based judges to overcome sycophancy and structural blindness in formal logical validity, providing a provably correct feedback signal. The categorization of rejection failure modes further refines the DPO signal. At inference, TRI operates as a surgical repair module within a dual-system loop. A causal draft model generates an initial trace, a verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact.
The `ExtractMilestone` subroutine, which performs a bounded forward scan to find the first verifiable downstream step, is a practical component of this loop. The paper also provides formal theoretical analysis, including a proof of "Topological Consistency" for PSM training, DPO convergence guarantees, and a "Universal Approximation of Bidirectional Conditionals via PSM" theorem. While the Lipschitz assumption for the logical scorer in discrete domains is an idealization, the discussion clarifies its implications for distributional concentration in embedding space, adding significant rigor to the work. Property A.5 on gap-length monotonicity provides theoretical justification for training choices.
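As a rough illustration of the PSM rearrangement described above, the following minimal sketch builds a training pair in the `[Q _premise P _milestone S _bridge M]` order. The sentinel strings, the `BridgeExample` type, and the choice to apply the loss only to the bridge tokens are assumptions made for illustration; the review does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sentinel strings: the review only says three sentinel tokens exist.
PREMISE, MILESTONE, BRIDGE = "<|premise|>", "<|milestone|>", "<|bridge|>"


@dataclass
class BridgeExample:
    query: str   # Q: original problem
    prefix: str  # P: verified premise
    suffix: str  # S: verified downstream milestone
    middle: str  # M: bridge to be generated


def to_psm(ex: BridgeExample) -> Tuple[str, str]:
    """Return (prompt, target) so that M comes last and can attend to Q, P and S."""
    prompt = f"{ex.query} {PREMISE} {ex.prefix} {MILESTONE} {ex.suffix} {BRIDGE}"
    return prompt, ex.middle


def loss_mask(prompt_tokens: List[str], target_tokens: List[str]) -> List[int]:
    """Assumed masking: 0 on the conditioning context, 1 on the bridge tokens."""
    return [0] * len(prompt_tokens) + [1] * len(target_tokens)


example = BridgeExample(
    query="Solve x + 2 = 5.",
    prefix="Subtract 2 from both sides.",
    suffix="Therefore x = 3.",
    middle="x = 5 - 2.",
)
prompt, target = to_psm(example)
mask = loss_mask(prompt.split(), target.split())
```

Because the suffix appears before the bridge in the token stream, a plain causal mask is enough; no bidirectional attention is needed in this sketch.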
The experimental evaluation is comprehensive and compelling. TRI is evaluated on three diverse and challenging benchmarks: MATH (competition mathematics), HumanEval-Fix (program repair), and Lean-Workbook (formal theorem proving). This broad coverage effectively demonstrates the generalizability of the approach across different domains requiring rigorous logical reasoning. TRI achieves consistent state-of-the-art performance across all tasks and MATH difficulty levels, significantly outperforming strong baselines including Qwen2.5-72B-Instruct, Llama-3.1-70B-Instruct (with CoT, CoT-SC, and ToT variants), and the domain-specific InternLM2.5-StepProver. The performance gains are particularly pronounced on higher difficulty MATH levels, validating the hypothesis that TRI's benefit accrues where error snowballing is most problematic. Beyond accuracy, TRI demonstrates remarkable efficiency, reducing per-problem token expenditure by 31.2% compared to baselines. This is a substantial practical advantage, stemming from its surgical repair strategy that avoids regenerating entire traces. The robustness analysis further highlights TRI's strengths, showing superior performance under tight computational budgets and high fault densities. The "asymmetric benefit" under low token budgets is a key finding, demonstrating that TRI's targeted repair is much more effective than exhaustive search or ensemble methods when resources are constrained. The Repair Success Rate (RSR) of 73.8% on MATH Level 5 indicates the effectiveness of the iterative repair loop. The ablation study is well-designed and provides crucial insights. It definitively shows that the symbolic verifier oracle in the DPO stage is the most consequential component, with replacing it with an LLM-as-judge leading to a drastic 12.1 pp drop in MATH Level 5 accuracy. This strongly validates the paper's methodological choice to use a deterministic oracle. 
The ablation on milestone selection also confirms the optimal strategy of choosing the first verifiable milestone.
The paper provides a good level of detail for reproducibility. The base model (Qwen2.5-72B) is specified. Comprehensive hyperparameters for both SFT (epochs, learning rate, schedule, weight decay, batch size, max sequence length, label smoothing) and DPO (beta, learning rate, batch size, epochs) are provided. Details on data curation, including the number of quadruples and the procedure for extraction, are given. Inference parameters such as maximum repair iterations and the `ExtractMilestone` window size are also specified. While explicit code or data release URLs are not provided in the text, the level of detail suggests that an informed researcher could reproduce the results given access to the base model and datasets.
1. **Verifier Dependency:** The core methodology relies on the existence of a deterministic symbolic verifier. This limits TRI's applicability to domains where such a verifier is available (e.g., formal mathematics, programming, logic puzzles) and prevents its direct use in open-ended or subjective reasoning tasks where ground truth verification is ambiguous. 2. **Not a Zero-Shot Generator:** TRI is designed as a specialized repair module within a dual-system loop, not a standalone zero-shot reasoning generator. It requires an initial draft trace and identified failure points to operate, which means it cannot initiate reasoning from scratch in an unconstrained environment. 3. **Milestone Discovery Challenges:** While the `ExtractMilestone` subroutine is effective, in scenarios with extremely sparse verifiable steps or very deeply flawed traces, it might fail to find a suitable milestone within its bounded scan window, leading to a fallback to less efficient full suffix regeneration. 4. **Theoretical Assumptions:** The Lipschitz continuity assumption for the logical scoring function, while clarified, is an idealization in discrete symbolic domains where small changes can lead to large logical shifts. The theoretical guarantees are thus interpreted as distributional concentrations rather than pointwise correctness. 5. **Gap Length Sensitivity:** Although the paper justifies the training gap span, very long logical gaps between P and S might still pose significant challenges for the model to bridge effectively, as suggested by the theoretical property on decreasing verification probability with gap length.
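The dual-system repair loop and the milestone fallback mentioned above can be sketched roughly as follows. This is a toy sketch, not the authors' algorithm: the per-step verifier, the `infill` and `regenerate` callables, and the default window and iteration limits are all hypothetical stand-ins, and a real verifier would check steps in context rather than in isolation.

```python
from typing import Callable, List, Optional

Step = str


def first_failure(steps: List[Step], verify: Callable[[Step], bool]) -> Optional[int]:
    """Index of the first step the verifier rejects, or None if all pass."""
    for i, step in enumerate(steps):
        if not verify(step):
            return i
    return None


def extract_milestone(steps: List[Step], start: int, window: int,
                      verify: Callable[[Step], bool]) -> Optional[int]:
    """Bounded forward scan for the first verifiable downstream step."""
    for j in range(start, min(start + window, len(steps))):
        if verify(steps[j]):
            return j
    return None


def repair(steps: List[Step], verify, infill, regenerate,
           window: int = 4, max_iters: int = 3) -> List[Step]:
    """Infill only the damaged segment; fall back to regenerating the suffix."""
    steps = list(steps)
    for _ in range(max_iters):
        i = first_failure(steps, verify)
        if i is None:
            return steps
        j = extract_milestone(steps, i + 1, window, verify)
        if j is None:
            # No milestone inside the window: regenerate everything after the prefix.
            steps = steps[:i] + regenerate(steps[:i])
        else:
            # Bridge from the verified prefix to the verified milestone.
            steps = steps[:i] + infill(steps[:i], steps[j]) + steps[j:]
    return steps


def check(step: Step) -> bool:
    """Toy verifier: 'a+b=c' is valid when the sum matches."""
    lhs, rhs = step.split("=")
    return sum(int(t) for t in lhs.split("+")) == int(rhs)


fixed = repair(["1+1=2", "2+2=5", "4+4=8"], check,
               infill=lambda prefix, milestone: ["2+2=4"],
               regenerate=lambda prefix: ["2+2=4"])
fallback = repair(["1+1=2", "2+2=5"], check,
                  infill=lambda prefix, milestone: ["2+2=4"],
                  regenerate=lambda prefix: ["2+2=4"])
```

In the first call the verified steps on both sides of the fault are left untouched; in the second no downstream milestone exists, so the sketch takes the less efficient regeneration path.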
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B-30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL.
Primary: Tsinghua University
All Institutions: Tsinghua University, Zhipu
LongTraceRL has the potential for significant broader impact on the field of large language models and long-context reasoning. 1. **Improved LLM Capabilities**: By addressing a central challenge of LLMs, it can lead to more reliable and capable models for tasks requiring deep understanding and integration of information from extensive documents, such as complex question answering, scientific literature review, legal document analysis, and medical diagnosis support. 2. **Novel Data Generation Paradigms**: The "tiered distractors" approach offers a new paradigm for creating challenging and realistic long-context benchmarks and training data, which can be adopted by the community to develop more robust LLMs. 3. **Advanced RLVR Techniques**: The "rubric reward" design provides a valuable contribution to the field of Reinforcement Learning with Verifiable Rewards, demonstrating how fine-grained process supervision can be effectively integrated to guide complex reasoning, potentially inspiring similar reward shaping techniques for other intricate tasks. 4. **Foundation for Future Research**: The open-sourced code, datasets, and models will serve as a valuable resource, lowering the barrier for other researchers to build upon this work, explore its limitations, and extend its applicability to new domains and reasoning challenges. This paper introduces LongTraceRL, a novel approach that significantly enhances long-context reasoning in large language models by proposing an innovative data construction method using tiered distractors from search agent trajectories and a fine-grained rubric reward for process supervision. The work makes a strong technical contribution by addressing critical limitations in existing RLVR methods, demonstrating consistent performance improvements across multiple LLMs and benchmarks, and openly providing resources, thereby offering a promising direction for developing more robust and evidence-grounded reasoning capabilities in LLMs.
The paper introduces LongTraceRL, a novel approach to improve long-context reasoning in LLMs using Reinforcement Learning with Verifiable Rewards (RLVR). The methodology is characterized by two key innovations: data construction with "tiered distractors" and a "rubric reward" design. For data construction, the authors generate multi-hop questions using knowledge graph random walks, which ensures a structured and verifiable ground truth. Crucially, they leverage search agent trajectories to create "tiered distractors." This involves two levels of confusability: high-confusability distractors are documents the agent read but did not cite, implying they contain relevant but ultimately non-essential or misleading information; low-confusability distractors are documents that appeared in search results but were never opened, representing less relevant noise. This method for generating training contexts is highly innovative, moving beyond simple random sampling or one-shot search to create significantly more challenging and realistic long-context scenarios. This directly addresses the limitation of existing RLVR methods using low-confusability distractors. For reward design, the paper proposes a "rubric reward" that provides fine-grained, entity-level process supervision. This reward uses the gold entities along each reasoning chain, offering a more granular signal than typical outcome-only rewards. A critical aspect is the "positive-only strategy," where this rubric reward is applied exclusively to responses with correct final answers. This design aims to distinguish the quality of reasoning among correct responses and, importantly, to prevent reward hacking: a response with a wrong final answer earns no rubric credit, so the model presumably cannot collect reward merely by mentioning gold entities, while a correct answer reached through a poorly grounded chain earns less rubric credit than a well-grounded one. This is a thoughtful approach to reward shaping in complex reasoning tasks.
The synergy between these two components is strong: challenging data generation forces the model to learn robust reasoning, while the fine-grained rubric reward guides it through complex reasoning steps. While the full technical details of the RL algorithm or specific prompt engineering for the search agent are not available in the provided abstract, the conceptual framework is sound and addresses known limitations in the field.
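To make the positive-only rubric reward concrete, here is a minimal sketch under stated assumptions. The additive form, the `weight` parameter, the base outcome reward of 1.0, and the substring-based entity matching are all hypothetical choices for illustration; the abstract only states that gold entities along the reasoning chain are used and that the rubric applies to correct responses only.

```python
from typing import List


def rubric_reward(response: str, final_answer: str, gold_answer: str,
                  gold_entities: List[str], weight: float = 0.5) -> float:
    """Outcome reward plus an entity-coverage bonus, granted only when correct."""
    correct = final_answer.strip().lower() == gold_answer.strip().lower()
    if not correct:
        # Positive-only: wrong answers get no rubric credit, whatever they mention.
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    coverage = hits / len(gold_entities) if gold_entities else 0.0
    return 1.0 + weight * coverage


entities = ["Marie Curie", "Sorbonne"]
r_wrong = rubric_reward("Marie Curie taught at the Sorbonne.", "Berlin", "Paris", entities)
r_partial = rubric_reward("Marie Curie worked there.", "Paris", "Paris", entities)
r_full = rubric_reward("Marie Curie taught at the Sorbonne.", "Paris", "Paris", entities)
```

Here the wrong-answer response mentions every gold entity yet scores zero, which is the property the positive-only strategy relies on, and the two correct responses are ranked by how much of the chain they ground.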
The abstract states that experiments were conducted on three reasoning LLMs (ranging from 4B to 30B parameters) across five long-context benchmarks. This demonstrates a commitment to comprehensive evaluation across different model scales and task settings. The claim that LongTraceRL "consistently outperforms strong baselines" is a significant result, suggesting the robustness and effectiveness of the proposed methods. Furthermore, the abstract highlights a qualitative benefit: the approach "encourages comprehensive, evidence-grounded reasoning." This is crucial for long-context tasks, where not just the final answer but also the explainability and traceability of the reasoning process are highly valued. Without access to the full experimental section, specific metrics, baseline details, and detailed result tables cannot be assessed, but the stated scope and outcomes are promising. The open-sourcing of codes, datasets, and models further enhances the value of these experimental findings by enabling verification and future research.
The paper explicitly states that "Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL." This commitment to open-sourcing is excellent and significantly boosts the reproducibility of the work. Providing the datasets (especially the uniquely constructed tiered distractors) and the trained models will allow other researchers to replicate the results, build upon the methodology, and further investigate the approach. This is a strong point for the paper.
1. **Data Generation Complexity**: The generation of multi-hop questions via knowledge graph random walks and, more significantly, the leveraging of search agent trajectories to build tiered distractors, appears to be a complex and potentially resource-intensive process. This might limit its applicability to domains where such structured knowledge graphs and search agent capabilities are readily available or easily simulated. 2. **Domain Specificity**: The reliance on knowledge graphs for question generation might implicitly limit the types of reasoning tasks or domains where LongTraceRL is most effective. Its generalizability to other long-context tasks (e.g., summarization of unstructured documents, code analysis, creative writing) beyond multi-hop QA is not explicitly discussed. 3. **"Positive-Only" Reward Strategy**: While designed to prevent reward hacking, the "positive-only" strategy for the rubric reward might miss valuable learning signals from responses that are incorrect but demonstrate partial understanding or nearly correct reasoning steps. A more nuanced reward function that can provide negative feedback for specific incorrect steps might accelerate learning. 4. **Computational Cost**: The abstract does not discuss the computational cost associated with training LLMs with RLVR, especially with the complex data generation and fine-grained reward signals. This could be a practical limitation for wider adoption, particularly for larger models.
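The tiered-distractor construction referred to above can be sketched as a simple partition of a search agent's trajectory. The `Trajectory` fields, the rule of filling the context with high-confusability documents first, and treating cited documents as the gold evidence are assumptions made for this illustration, not details confirmed by the abstract.

```python
import random
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Trajectory:
    retrieved: List[str]  # document ids that appeared in search results
    opened: Set[str]      # ids the agent actually read
    cited: Set[str]       # ids the agent cited in its answer


def tier_distractors(traj: Trajectory) -> Tuple[List[str], List[str]]:
    """Split non-cited documents into high- and low-confusability tiers."""
    high = [d for d in traj.retrieved if d in traj.opened and d not in traj.cited]
    low = [d for d in traj.retrieved if d not in traj.opened]
    return high, low


def build_context(traj: Trajectory, n_distractors: int, seed: int = 0) -> List[str]:
    """Assumed policy: keep cited docs, add distractors (high tier first), shuffle."""
    high, low = tier_distractors(traj)
    chosen = (high + low)[:n_distractors]
    docs = sorted(traj.cited) + chosen
    random.Random(seed).shuffle(docs)
    return docs


traj = Trajectory(
    retrieved=["d1", "d2", "d3", "d4", "d5"],
    opened={"d1", "d2", "d3"},
    cited={"d1"},
)
high, low = tier_distractors(traj)
context = build_context(traj, n_distractors=3)
```

The partition itself is cheap; the resource cost noted in the first limitation comes from running the search agent that produces the trajectories in the first place.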
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Primary: Northeastern University
All Institutions: Northeastern University, Shanghai Artificial Intelligence Laboratory
This work has significant broader impact for the development of robust and trustworthy multimodal AI systems. By providing a rigorous diagnostic framework and an effective, generalizable intervention, it directly addresses a critical safety and reliability concern in ALMs: their tendency to prioritize conflicting text over clear audio evidence. This is particularly important for agentic applications in sensitive domains like healthcare, emergency services, or legal assistance, where accurate interpretation of audio is paramount. The mechanistic understanding gained through causal localization offers a powerful new lens for analyzing internal decision-making in complex multimodal models, moving beyond black-box observations and fostering more interpretable AI. The demonstrated cross-modal transfer suggests that the principles of diagnosing and correcting arbitration failures using counterfactual references and logit correction could be broadly applicable across various multimodal AI systems (e.g., vision-language, video-language), paving the way for more faithful and reliable AI across the board. This research not only provides a practical solution but also advances our fundamental understanding of multimodal reasoning and conflict resolution in large models. This paper presents a highly novel diagnostic methodology and a mechanistically-informed, training-free decoding rule to address a critical arbitration failure in audio-language models, demonstrating significant performance gains and cross-modal generalizability. The rigorous causal analysis, coupled with a practical and effective solution, makes this a standout contribution to multimodal machine learning.
The methodology is exceptionally strong, building a coherent and rigorous chain from behavioral observation to mechanistic understanding and finally to an effective intervention. The core innovation is the "same-audio counterfactual" diagnostic, which uses two branches (joint audio-text vs. audio-only) to precisely distinguish between perceptual failure and arbitration failure in Audio-Language Models (ALMs). This elegant setup, coupled with signed log-probability margins, provides a clear quantitative signature of "repairable arbitration reversals." The paper then employs activation patching, a robust causal intervention technique, to localize the arbitration failure to the answer-position residual stream within the model's "commit window." This mechanistic finding is crucial, demonstrating that audio evidence is indeed encoded but overridden during the final decision-making process. A key methodological bridge is the discovery of a high Spearman correlation (0.93) between this internal patch-induced repair direction and the observable output score difference ($s_A - s_J$). This alignment is critical because it enables the development of an output-space intervention without requiring internal model access. The proposed Gated Audio Counterfactual Logit Correction (GACL) decoding rule is directly derived from these insights, incorporating a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation. Each component is mechanistically justified and contributes to the method's robustness and safety. The methodology is a prime example of interpretable ML research, moving beyond symptom identification to root cause analysis and targeted solution design.
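The gated interpolation described above might look roughly like the sketch below. The review names a branch-disagreement gate, a reference-reliability gate, and a convex bounded interpolation but gives no formulas, so the concrete gate definitions (top-candidate mismatch, a top-2 margin threshold in the same-audio branch) and the `lam` and `min_margin` values are assumptions, not the authors' rule.

```python
from typing import Dict


def gacl_scores(joint: Dict[str, float], audio: Dict[str, float],
                lam: float = 0.6, min_margin: float = 0.5) -> Dict[str, float]:
    """Return corrected candidate scores; fall back to joint scores when gated off."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex interpolation")

    # Gate 1 (assumed form): only intervene when the two branches disagree.
    if max(joint, key=joint.get) == max(audio, key=audio.get):
        return dict(joint)

    # Gate 2 (assumed form): trust the reference only if its top-2 margin is clear.
    ranked = sorted(audio.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] - ranked[1] < min_margin:
        return dict(joint)

    # Convex, bounded interpolation between joint and same-audio scores.
    return {c: (1.0 - lam) * joint[c] + lam * audio[c] for c in joint}


joint = {"audio_ans": -2.0, "text_ans": -0.5}      # joint branch follows the text
same_audio = {"audio_ans": -0.3, "text_ans": -2.5}  # reference prefers the audio answer
corrected = gacl_scores(joint, same_audio)
choice = max(corrected, key=corrected.get)

weak_reference = {"audio_ans": -1.0, "text_ans": -1.05}
untouched = gacl_scores(joint, weak_reference)
```

Because the output is a convex combination, each corrected score stays between the two branch scores, which is one way to read the "bounded" property; when either gate is closed the joint scores pass through unchanged.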
The experimental evaluation is comprehensive and rigorously designed. The authors evaluate GACL across five diverse open-weight ALMs (7B-30B parameters) and four distinct audio-text conflict tasks (AQA, VSC, SER, ALME) from established benchmarks (MCR-Bench, ALME). This broad coverage demonstrates the widespread nature of the "text-following" problem and the general applicability of GACL. The use of normalized AUC (nAUC) over a strict faithfulness-drop budget (e.g., 5 pp) is an excellent evaluation metric, realistically capturing the trade-off between conflict resolution and preserving accuracy on faithful inputs. GACL consistently outperforms strong contrastive decoding baselines (AAD, ACD) and the joint model, achieving an impressive average improvement of 17.8 nAUC points under the strict 5 pp budget. Detailed ablation studies meticulously validate the contribution of each component of GACL, showing how gates and bounds ensure stability and prevent undesirable side effects (e.g., surface form rewriting, parse failures). The comparison to a LoRA fine-tuning baseline, where GACL retains 76% of the gain without any parameter updates, highlights its efficiency and practical value. Furthermore, the successful, untuned transfer of GACL to vision-text arbitration on MC$^2$ (achieving up to +40.5 pp adversarial accuracy) is a powerful demonstration of the generalizability of the underlying diagnostic principles across different modalities, significantly amplifying the potential impact of this work.
The paper demonstrates a high commitment to reproducibility. The appendix provides extensive details, including specific public model checkpoints (with Hugging Face snapshot hashes), precise descriptions of benchmark splits, detailed prompt templates for each task, and the exact candidate scoring and normalization procedures. The hyperparameter tuning process, including the use of a development set and freezing parameters for testing, is clearly outlined. Furthermore, the paper provides comprehensive details for the LoRA fine-tuning baseline, including architecture, training data, optimization parameters, and hardware. Inference cost metrics (time, GPU memory, FLOPs) are also reported. This level of detail should enable researchers to reproduce the core findings and build upon this work.
The authors acknowledge several pertinent limitations. The study focuses on controlled, explicit audio-text conflicts, which, while crucial for isolating mechanisms, may not fully capture the complexity of naturally occurring conflicts involving noisier transcripts, partial notes, or broader conversational context. GACL is designed to repair arbitration failures where audio evidence is available but overridden, meaning it cannot compensate for fundamental perceptual failures where the model simply did not encode the relevant acoustic information. This distinction is important for guiding future research towards either decoding-time repair or improved acoustic modeling. A practical limitation is the increased inference latency due to the additional forward pass required for the audio-reference branch, although the authors suggest potential optimizations. Finally, while cross-modal transfer is demonstrated, the generalizability to all possible conflict sources and modality pairs remains an area for future exploration.
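The distinction drawn above between repairable arbitration reversals and cases the reference branch cannot fix can be expressed with signed margins. In this sketch the margin is the audio-supported candidate score minus the text-supported one in each branch; the category labels and the exact tie handling are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]


def margin(scores: Scores, audio_ans: str, text_ans: str) -> float:
    """Signed margin: positive when the branch prefers the audio-supported answer."""
    return scores[audio_ans] - scores[text_ans]


def classify(joint: Scores, same_audio: Scores, audio_ans: str, text_ans: str) -> str:
    m_joint = margin(joint, audio_ans, text_ans)
    m_ref = margin(same_audio, audio_ans, text_ans)
    if m_joint >= 0:
        return "audio_followed"            # no conflict failure to repair
    if m_ref > 0:
        return "arbitration_reversal"      # sign flip: encoded but overridden
    return "not_repairable_by_reference"   # reference branch also misses the answer


def sign_flip_rate(samples: List[Tuple[Scores, Scores]],
                   audio_ans: str, text_ans: str) -> float:
    """Fraction of samples classified as arbitration reversals."""
    flips = sum(classify(j, a, audio_ans, text_ans) == "arbitration_reversal"
                for j, a in samples)
    return flips / len(samples)


samples = [
    ({"A": -2.0, "T": -0.5}, {"A": -0.3, "T": -2.5}),  # sign flip
    ({"A": -2.0, "T": -0.5}, {"A": -2.2, "T": -0.9}),  # reference also prefers text
    ({"A": -0.4, "T": -1.8}, {"A": -0.3, "T": -2.5}),  # joint already follows audio
    ({"A": -1.5, "T": -0.7}, {"A": -0.6, "T": -1.9}),  # sign flip
]
labels = [classify(j, a, "A", "T") for j, a in samples]
rate = sign_flip_rate(samples, "A", "T")
```

Only the first category is a target for decoding-time repair; the last one points toward improved acoustic modeling instead. Computing the reference margin is also where the extra forward pass noted in the latency limitation comes from.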
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
Primary: unknown
All Institutions: unknown
This paper has significant broader impact across several dimensions:

1. **Paradigm Shift for Audio AI**: It proposes a fundamental shift from offline, clip-based LALMs and single-task streaming models to a unified, interactive, real-time "Audio Interaction Model." This vision is crucial for developing truly intelligent and helpful audio assistants.
2. **Enabling New Capabilities**: The work unlocks capabilities previously inaccessible to offline LALMs, such as comprehension-grounded response triggering, long-stream interaction, and proactive assistance. This has direct implications for applications like smart homes, automotive assistants, accessibility tools, and advanced conversational AI, where real-time, context-aware intervention is vital.
3. **Resource Contribution**: The release of the large-scale StreamAudio-2M dataset and the Proactive-Sound-Bench benchmark provides invaluable resources for the research community, accelerating future work on streaming audio intelligence and interactive AI.
4. **Reduced Model Proliferation**: By unifying multiple audio tasks into a single model, it offers a path towards more efficient and general-purpose audio AI systems, potentially reducing the need for numerous specialized models.
5. **Ethical Considerations**: Proactive AI raises ethical questions regarding privacy, consent, and the potential for misinterpretation or unwanted intervention. Although the paper does not discuss these questions explicitly, the framework's ability to decide *when* to respond is a step towards controllable proactive behavior, which is important for responsible deployment.

This paper introduces the Audio Interaction Model and SoundFlow framework, a comprehensive solution for unifying offline LALMs and streaming audio models into a single, always-on, perceive-decide-respond system. Through novel streaming-native data construction, interaction-aware training, and asynchronous low-latency inference, the work demonstrates competitive performance on mainstream audio tasks while unlocking critical new capabilities like proactive assistance and general streaming instruction following, significantly advancing the field of real-time audio intelligence.
The paper introduces the Audio Interaction Model (AIM) and the SoundFlow framework, a comprehensive and highly innovative approach to unify offline Large Audio Language Models (LALMs) with streaming, single-task audio models into a single, always-on, perceive-decide-respond system. This paradigm shift addresses the inherent interactive nature of audio, which current LALMs and specialized streaming models fail to capture. The SoundFlow framework is meticulously designed, covering data, training, and deployment:

1. **Streaming-native data construction**: This is a critical component. The Time-Frequency Joint Preprocessing (TFJP) module is a clever solution to smooth audio boundaries and suppress noise, essential for stitching short clips into coherent long-form interactions. The hierarchical audio event selection, which uses an LLM for scenario planning and event refinement, followed by retrieval or generation, is a sophisticated method to ensure semantic coherence and environmental plausibility in synthetic streaming data. This addresses the challenge of creating realistic, multi-turn interactive audio sequences.
2. **Interaction-aware training**: The model learns to make chunk-level sequential decisions using special `
The experimental evaluation is extensive and rigorous, covering a wide array of benchmarks and providing deep insights into the model's behavior.

1. **Benchmarks**: The evaluation spans 8 diverse benchmarks, including general audio understanding (MMAU), spoken dialogue (AlpacaEval, SD-QA, Llama Questions, Web Questions), ASR (LibriSpeech), S2TT (CoVoST2), and the newly introduced Proactive-Sound-Bench. This broad coverage effectively demonstrates the model's versatility and unified capabilities.
2. **Baselines**: The comparison against three categories of models (Audio LLMs, Omni LLMs, and Task-specialized models) is comprehensive, allowing for a fair assessment of Audio-Interaction's performance against both general-purpose and specialized systems.
3. **Main Results**: The paper clearly demonstrates three key enhancements:
   * **Retained audio understanding**: Audio-Interaction maintains competitive performance on MMAU, even slightly surpassing its initialization and remaining comparable to larger 7B models.
   * **Competitive performance on core speech tasks**: Significant improvements on CoVoST2 (S2TT) and comparable performance on dialogue benchmarks, with only a marginal WER regression on LibriSpeech, which is an acceptable trade-off for moving to a chunk-wise streaming decoder.
   * **Unlocked capabilities**: This is the most impactful finding. The model's robustness to spoken instructions, selective proactive response on the novel Proactive-Sound-Bench (achieving good accuracy in both single and multi-tier events), and stability under stream concatenation highlight its unique interactive abilities.
4. **Additional Analysis**: The observations regarding continuity reconstruction at early decoder layers and the localization of the silent vs. respond decision to a single attention head provide valuable mechanistic insights into how the model learns these complex behaviors.
5. **Ablation Study**: The ablations are well-designed and clearly demonstrate the necessity of FIFO inference, the cumulative benefits of streaming training and data, the optimal chunk size (0.4s) for the accuracy-latency trade-off, and the balancing role of the dual-loss weight. These studies validate the design choices of the SoundFlow framework.
6. **Real-world validation**: The evaluation on 2 hours of naturally recorded audio across diverse scenarios (Travel, Work, Home, Commute) is a crucial step towards demonstrating practical applicability. The finding that performance largely retains its synthetic-stream levels, with degradation tracking acoustic difficulty, adds significant credibility to the model's robustness.

The introduction of StreamAudio-2M (2.6M items, 302k hours, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench (644 human-designed events) as new resources is a major contribution, providing the community with tools to further research in this interactive paradigm.
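The chunk-wise streaming decoder, FIFO inference, and 0.4 s chunk size mentioned in this evaluation can be pictured with a toy perceive-decide-respond loop. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the 16 kHz sample rate, the bounded-context reading of "FIFO inference", and the names `split_into_chunks`, `run_stream`, and `energy_policy` are all hypothetical, and the energy threshold stands in for the model's learned silent-vs-respond decision.

```python
from collections import deque
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

SAMPLE_RATE = 16_000   # assumption: the source does not state a sample rate
CHUNK_SECONDS = 0.4    # chunk size the ablation reports as the best trade-off
CHUNK_SAMPLES = round(SAMPLE_RATE * CHUNK_SECONDS)

Chunk = List[float]
Policy = Callable[[List[Chunk]], Optional[str]]


def split_into_chunks(samples: Sequence[float],
                      chunk_samples: int = CHUNK_SAMPLES) -> Iterator[Chunk]:
    """Yield fixed-size chunks; a final partial chunk is zero-padded."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk


def run_stream(samples: Sequence[float], decide: Policy,
               max_context_chunks: int = 8) -> List[Tuple[int, str]]:
    """Perceive each chunk, decide whether to respond, and record responses.

    Assumption: "FIFO" is modelled as a bounded first-in-first-out context,
    so the oldest chunk is dropped once the window is full.
    """
    context: deque = deque(maxlen=max_context_chunks)
    responses: List[Tuple[int, str]] = []
    for index, chunk in enumerate(split_into_chunks(samples)):
        context.append(chunk)             # perceive
        reply = decide(list(context))     # decide: None means stay silent
        if reply is not None:
            responses.append((index, reply))  # respond
    return responses


def energy_policy(context: List[Chunk], threshold: float = 0.1) -> Optional[str]:
    """Toy stand-in for the model: respond when the newest chunk is loud."""
    newest = context[-1]
    energy = sum(x * x for x in newest) / len(newest)
    return "respond" if energy > threshold else None


if __name__ == "__main__":
    # Three silent chunks, one loud chunk, then half a chunk of silence.
    stream = ([0.0] * (3 * CHUNK_SAMPLES) + [1.0] * CHUNK_SAMPLES
              + [0.0] * (CHUNK_SAMPLES // 2))
    print(run_stream(stream, energy_policy))
```

In a real system the `decide` callable would be the model's chunk-level decision, and the bounded window is what keeps per-chunk cost roughly constant on long streams; how the paper actually bounds or evicts context is not specified in this summary.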
The paper provides a strong foundation for reproducibility.

* **Code and Data**: The project page and HuggingFace dataset link are provided, indicating an intent to release resources.
* **Methodology Details**: The SoundFlow framework components (TFJP, hierarchical event curation, training stages, dual-loss, FIFO inference) are described in detail, including algorithms in the appendix.
* **Dataset Curation**: The StreamAudio-2M curation pipeline, including sources, preprocessing, sequence concatenation, and token-level annotation, is thoroughly explained.
* **Benchmark Details**: Proactive-Sound-Bench is clearly defined with its task, categories, and evaluation metrics.
* **Training Details**: Hyperparameters for all four training stages are provided in the appendix, along with hardware specifications (NVIDIA H100 GPUs, bf16 mixed precision, DeepSpeed ZeRO-2). The use of a publicly available base model (Qwen2.5-Omni-3B) further aids reproducibility.
1. **Performance on existing tasks**: While Audio-Interaction is competitive, it does not always set new state-of-the-art records on all traditional benchmarks. For instance, there's a marginal WER regression on LibriSpeech. The primary strength lies in unification and new capabilities, rather than absolute peak performance on every single task.
2. **Synthetic Data Reliance**: The extensive use of LLMs for scenario planning and audio generation/stitching in StreamAudio-2M, while innovative, means the model is heavily trained on synthetic interactions. Although real-world validation is performed, the scale is limited (2 hours), and potential generalization gaps to truly unconstrained, complex real-world audio environments might exist.
3. **Model Size**: The choice of a 3B parameter model, while good for efficiency, might limit the depth of reasoning and comprehension compared to much larger LALMs, especially for highly complex, nuanced audio understanding tasks.
4. **Single Attention Head for Decision**: The observation that a single attention head dominates the silent vs. respond decision is interesting, but it could also imply a potential fragility or oversimplification in the decision-making mechanism for highly diverse and complex interactive scenarios.
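For readers unfamiliar with the WER metric behind the LibriSpeech regression noted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is illustrative, whitespace tokenization is an assumption, and the paper's own scoring setup (for example, its text normalization) is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.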
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
Humanoid-GPT introduces a GPT-style Transformer trained on a billion-scale motion corpus, achieving unprecedented zero-shot generalization for whole-body control. This work represents a significant leap in data and model scaling for motion tracking, moving beyond prior limitations of shallow models and scarce data to enable robust generalization to unseen tasks and highly dynamic behaviors. By unifying major motion capture datasets and leveraging a large-scale Transformer architecture, it establishes a new performance frontier in embodied AI, potentially setting a new standard for generalizable policies in humanoid control and influencing future foundation models for robotics.