Last 7 Days (May 26 – June 01, 2026)
Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
Primary: NVIDIA
All Institutions: NVIDIA
The paper introduces LocateAnything, a unified generative framework for vision-language grounding and detection, built upon a novel Parallel Box Decoding (PBD) mechanism. The core innovation of PBD lies in its departure from the common practice of serializing 2D bounding boxes into 1D tokens. Instead, PBD decodes geometric elements (boxes and points) as atomic units in a single step. This design choice directly addresses two key limitations of sequential token generation: the mismatch with the coupled structure of box geometry (improving coherence) and the practical inference bottleneck due to strict sequentiality (enabling parallelism). The abstract implies a VLM backbone combined with this specialized parallel decoder. This architectural shift is significant for generative models dealing with structured spatial outputs. Furthermore, the paper highlights a "scalable data engine" used to curate "LocateAnything-Data," a massive dataset comprising over 138 million training samples. This large-scale, diverse data is crucial for achieving high-precision localization and complements the architectural improvements. The combination of a novel decoding strategy and an extensive, high-quality dataset forms a robust methodological approach.
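To make the contrast between serialized and atomic decoding concrete, the following minimal sketch counts decoding steps under the two schemes. It assumes a common serialization in which each box becomes four quantized coordinate tokens; the bin count, the function names, and the one-step-per-box reading of PBD are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: bins, names, and step accounting are assumptions.

def box_to_tokens(box, width, height, bins=1000):
    """Quantize an (x1, y1, x2, y2) pixel box into four integer coordinate tokens."""
    x1, y1, x2, y2 = box
    scale = bins - 1
    return [
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    ]


def sequential_steps(num_boxes, tokens_per_box=4):
    """Token-by-token decoding: one autoregressive step per coordinate token."""
    return num_boxes * tokens_per_box


def atomic_steps(num_boxes):
    """Atomic decoding (assumed reading of PBD): one step per box."""
    return num_boxes


tokens = box_to_tokens((0, 0, 640, 480), width=640, height=480)
steps_serialized = sequential_steps(num_boxes=5)
steps_atomic = atomic_steps(num_boxes=5)
```

Under these assumptions, five boxes cost 20 sequential steps when serialized and 5 when each box is emitted as a unit; separator or class tokens, which real serializations often add, would widen the gap.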
The experimental evaluation aims to demonstrate LocateAnything's ability to advance the speed-accuracy frontier in visual grounding and detection. The paper claims "significantly higher decoding throughput" alongside "improving high-IoU localization quality" across diverse benchmarks. The chosen benchmarks (RefCOCO, RefCOCO+, and RefCLEF for referring expression grounding; LVIS and COCO for object detection) cover different tasks and levels of granularity. High-IoU localization quality is a critical metric because it indicates precise object boundary prediction, which is often challenging for generative models. The abstract emphasizes the "complementary benefits of Parallel Box Decoding and large-scale training data," suggesting that both components contribute substantially to the reported performance gains. If the claims of improved throughput and accuracy on these established benchmarks are substantiated by detailed results in the full paper, they indicate a strong empirical contribution. The provision of a Hugging Face demo and model further supports the practical utility and verifiability of the results.
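For readers less familiar with the metric, the sketch below shows the standard intersection-over-union computation for axis-aligned boxes and why a stricter threshold rewards tighter boxes. The box coordinates and the two thresholds are made-up illustrative values, not numbers from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union for two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)   # hypothetical annotation
prediction = (1, 1, 10, 10)     # hypothetical prediction, slightly too small

score = iou(ground_truth, prediction)
counts_at_loose = score >= 0.5  # a lenient threshold accepts this box
counts_at_strict = score >= 0.9  # a high-IoU threshold rejects it
```

Here the prediction overlaps 81 of the 100 ground-truth units, so it passes at a 0.5 threshold but fails at 0.9, which is the regime the review refers to as high-IoU localization.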
Reproducibility appears to be a strong suit for this work. The paper provides links to a GitHub repository, a Hugging Face model, and a Hugging Face demo. This level of resource sharing is excellent, allowing researchers to inspect the code, run the model, and experiment with the system directly. The mention of a "scalable data engine" and "LocateAnything-Data" suggests that the data curation process is systematic. While the full details of the data curation and model training would be in the complete paper, the availability of code and models significantly lowers the barrier to reproduction and further research.
The abstract does not explicitly state limitations. However, potential limitations could include:
1. **Computational Cost of Training**: Training on 138 million samples would require substantial computational resources, potentially limiting access for smaller research groups.
2. **Complexity of PBD**: While PBD offers benefits, its architectural complexity might be higher than simpler token-based decoders, potentially requiring more specialized hardware or optimization.
3. **Generalizability to Out-of-Distribution Data**: While the dataset is large and diverse, the performance on highly novel or abstract concepts not well-represented in the training data might still be a challenge, common to most data-driven models.
4. **Interpretability**: Generative models, especially those with complex decoding mechanisms, can sometimes be less interpretable regarding *why* a specific box was predicted.
BROADER IMPACT: LocateAnything has significant broader impact potential. Its ability to perform fast and high-quality vision-language grounding and detection can enable new capabilities in various domains:
1. **Robotics**: More precise object manipulation and interaction based on natural language commands.
2. **Augmented Reality/Virtual Reality**: Enhanced real-time object recognition and interaction for immersive experiences.
3. **Human-Computer Interaction**: More intuitive and natural ways for users to interact with visual content using language.
4. **Accessibility**: Improved tools for visually impaired individuals to understand and navigate their environment.
5. **Content Understanding**: Better indexing and search capabilities for large visual datasets based on textual queries.
The focus on efficiency (throughput) and accuracy (high-IoU) makes it particularly relevant for real-world deployment where both aspects are crucial. The work also highlights the continued importance of large-scale, high-quality data curation for advancing ML capabilities.
LocateAnything introduces a novel Parallel Box Decoding (PBD) mechanism and a massive dataset to achieve fast and high-quality vision-language grounding and detection. This paper makes a substantial technical contribution by proposing an architectural shift from sequential token-based box decoding to parallel atomic unit decoding, significantly improving both inference speed and localization accuracy across diverse benchmarks, thereby pushing the speed-accuracy frontier for unified visual grounding and detection.
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
Primary: Google DeepMind
All Institutions: Google DeepMind
The paper introduces VideoMLA, a novel approach to address the memory and latency bottlenecks of KV caches in minute-scale autoregressive video diffusion. VideoMLA replaces the standard per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key. Specifically, for temporal attention, it projects the input features into a low-rank content latent ($K_{content}, V_{content}$) and adds a separate 3D-RoPE positional key ($K_{pos}$), forming the total key $K = K_{content} + K_{pos}$. This design drastically reduces the per-token KV memory footprint by 92.7%. A significant methodological contribution is the in-depth investigation into why MLA succeeds in video diffusion, despite the common spectral assumption (that attention matrices are low-rank) not holding for pretrained video attention. The authors empirically demonstrate that pretrained video attention is *not* low-rank, but rather the MLA bottleneck itself (the low-rank latent space) determines the effective rank. They show that both spectral and random initialization of the MLA bottleneck occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. This insight challenges existing understanding from language models and provides a new theoretical grounding for MLA's efficacy in video. The architectural modification is elegant and directly targets the memory bottleneck without resorting to complex sparse attention patterns or hierarchical structures.
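The rank argument above can be illustrated numerically. The sketch below uses one plausible definition of 99%-energy effective rank (the smallest number of singular values whose squared sum reaches 99% of the total) and shows that a product of two randomly initialized projections through a narrow latent has rank bounded by the latent width. The definition, the matrix sizes, and the variable names are assumptions for illustration; the paper's exact procedure may differ.

```python
import numpy as np


def effective_rank(matrix, energy=0.99):
    """Smallest k such that the top-k squared singular values hold `energy` of the total."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))


rng = np.random.default_rng(0)
d_model, latent_dim = 256, 16  # hypothetical sizes, not the paper's configuration

# A dense random matrix stands in for a projection that is not low-rank.
dense = rng.standard_normal((d_model, d_model))

# Random down- and up-projections through a narrow latent (the bottleneck).
down = rng.standard_normal((d_model, latent_dim))
up = rng.standard_normal((latent_dim, d_model))

dense_rank = effective_rank(dense)
bottleneck_rank = effective_rank(down @ up)
```

In this toy setting the dense matrix needs far more than 16 directions to reach 99% energy, while the bottlenecked product can never exceed 16 regardless of how it is initialized, which mirrors the claim that the bottleneck rather than the pretrained spectrum sets the rank budget.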
The experimental evaluation is comprehensive and well-executed. The authors use StreamingDiffusion (with an SDXL-Video-1.0 backbone) as their base model, trained on WebVid-10M. They evaluate VideoMLA against strong baselines, StreamingDiffusion (SD) and Long-Context Diffusion (LCD), on the VBench benchmark, which assesses various aspects of video generation quality (aesthetic, temporal consistency, motion, etc.). Key results include:
1. **KV Memory Reduction:** VideoMLA achieves a substantial 92.7% reduction in KV cache memory, validating its primary design goal.
2. **Throughput Improvement:** On a single B200 GPU, VideoMLA demonstrates a 1.23x improvement in throughput, indicating practical efficiency gains.
3. **VBench Performance:**
   * At short horizons (up to 128 frames), VideoMLA matches the performance of StreamingDiffusion.
   * Crucially, at long horizons (256-1024 frames), VideoMLA significantly outperforms both StreamingDiffusion and Long-Context Diffusion, achieving the best overall score. This demonstrates its effectiveness in enabling high-quality minute-scale video generation, which is a key challenge.
4. **Ablation Studies:** Detailed ablations investigate the impact of latent dimension, the necessity of 3D-RoPE positional encoding, and the decoupling of content and positional keys, all of which confirm the design choices. The ablation on the MLA bottleneck analysis provides strong empirical evidence for their hypothesis regarding effective rank.
The experiments are rigorous, clearly demonstrating both the efficiency and quality benefits of VideoMLA, particularly for long-context generation.
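The reported memory saving can be related to cache shapes with simple arithmetic. The sketch below assumes the usual MLA accounting, in which a standard cache stores a key and a value per head while an MLA-style cache stores one shared latent plus one shared positional key per token. The head count and dimensions are hypothetical values chosen only to show how a saving of this magnitude can arise; they are not the paper's configuration.

```python
def standard_kv_elements(num_heads, head_dim):
    """Per-token, per-layer cache entries with separate keys and values for every head."""
    return 2 * num_heads * head_dim


def latent_kv_elements(latent_dim, rope_dim):
    """Per-token, per-layer cache entries with a shared latent and a shared positional key."""
    return latent_dim + rope_dim


def memory_reduction(num_heads, head_dim, latent_dim, rope_dim):
    """Fraction of per-token cache memory saved by the latent layout."""
    baseline = standard_kv_elements(num_heads, head_dim)
    compressed = latent_kv_elements(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


# Hypothetical dimensions for illustration only.
saving = memory_reduction(num_heads=12, head_dim=128, latent_dim=192, rope_dim=32)
print(f"per-token KV reduction: {saving:.1%}")
```

With these made-up sizes the cache shrinks from 3072 to 224 entries per token and layer. Because the saving is a ratio of per-token sizes, it applies uniformly to every cached layer and does not depend on the window length.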
The paper provides sufficient detail to suggest good reproducibility. It specifies the base model (StreamingDiffusion with SDXL-Video-1.0 backbone), training dataset (WebVid-10M), and evaluation benchmark (VBench). The architectural modifications for VideoMLA are clearly described, including the use of low-rank projections and 3D-RoPE. The supplementary materials mention a GitHub repository (`https://github.com/google-deepmind/video_mla`), which, if populated with code, would greatly enhance reproducibility. The experimental setup, including GPU type (B200) and comparison baselines, is also clearly stated.
The authors acknowledge several limitations:
1. VideoMLA was only tested on temporal attention, not spatial attention. Extending it to spatial attention could offer further memory savings but might introduce new challenges.
2. The method was exclusively evaluated in the context of autoregressive video diffusion. Its applicability to other video generation paradigms or non-autoregressive settings is not explored.
3. Retraining of the attention layers is required, which can be computationally expensive, especially for large models. This is a common limitation for architectural changes.
4. VideoMLA primarily addresses KV cache memory and latency, not the quadratic complexity of attention itself. While it makes long contexts more feasible, the fundamental scaling issue of attention remains for extremely long sequences.
VideoMLA has significant broader impact for the field of generative AI, particularly in video synthesis. By drastically reducing KV cache memory and improving throughput, it makes minute-scale, high-quality video generation more accessible and efficient. This can enable new applications in content creation, virtual reality, simulation, and personalized media. The theoretical insights into why MLA works in video diffusion, challenging existing assumptions from language models, are also impactful. This understanding can guide future research in designing efficient attention mechanisms not just for video but potentially for other modalities where spectral properties might differ from language. The method's ability to maintain or improve quality at long horizons is crucial for moving beyond short, clip-based video generation towards more coherent and extended narratives.
This paper introduces VideoMLA, a novel low-rank latent KV cache mechanism for autoregressive video diffusion, achieving a 92.7% memory reduction and 1.23x throughput improvement while outperforming baselines on minute-scale video generation. The work provides significant practical advancements in long-context video synthesis and offers a crucial theoretical insight into the efficacy of Multi-Head Latent Attention, demonstrating that the bottleneck itself, rather than the pretrained spectrum, determines the effective rank in video models.
Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human
Primary: Stanford University
All Institutions: Stanford University, Chapman University, Northeastern University
This paper introduces the concept of "algorithmic monoculture" in hiring, where multiple employers rely on algorithms from the same few vendors, and investigates its consequences. The methodology is robust and innovative for analyzing real-world deployed systems. The authors acquire a novel, large-scale dataset of 3 million applicants and 4 million applications, all screened by algorithms from a single major vendor. This dataset allows for an unprecedented look into the systemic effects of a widely used algorithmic hiring tool. To measure racial disparities, they apply U.S. employment discrimination standards (e.g., adverse impact ratio) to the algorithmic outcomes, which is a standard and legally relevant approach. A key methodological innovation is the use of "deterministic replicability" to understand outcome homogenization. Leveraging the deterministic nature of the vendor's algorithms, they simulate the outcomes applicants would receive if they applied to *all* positions screened by that vendor. This "what if" analysis provides deep insights into whether rejections are due to specific job requirements or a systemic algorithmic bias against certain applicant profiles, effectively creating a counterfactual scenario to isolate algorithmic effects. This approach is highly clever and powerful for auditing black-box systems.
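As a point of reference, the adverse impact ratio is commonly operationalized through the four-fifths rule from U.S. federal selection guidelines: a group is flagged when its selection rate falls below 80% of the rate of the most-selected group. The sketch below applies that rule to a single position. The group labels and counts are invented for illustration, and the paper's exact operationalization may differ.

```python
def adverse_impact_ratios(counts):
    """Map each group to its selection rate divided by the highest group's rate.

    `counts` maps a group label to (number selected, number of applicants).
    """
    rates = {group: selected / total for group, (selected, total) in counts.items()}
    best_rate = max(rates.values())
    return {group: rate / best_rate for group, rate in rates.items()}


def adversely_impacted(counts, threshold=0.8):
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = adverse_impact_ratios(counts)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical outcomes for one position (not data from the paper).
position_counts = {
    "group_a": (50, 100),
    "group_b": (30, 100),
    "group_c": (45, 100),
}

flagged = adversely_impacted(position_counts)
```

In this invented example only `group_b` is flagged, since its selection rate is 60% of the top group's. Repeating the check per position and then counting how many of a group's applications went to flagged positions yields a statistic of the kind quoted in the abstract.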
The experimental evaluation is compelling due to the scale and nature of the dataset. Analyzing 3 million applicants and 4 million applications from a single vendor provides strong empirical evidence. The results clearly demonstrate significant racial disparities: 14.74% and 25.87% of applications submitted by Asian and Black applicants, respectively, are directed to positions that adversely impact those groups according to U.S. standards. This is a stark and actionable finding. Furthermore, the paper reveals outcome homogenization, showing that 4% of applicants applying to 10 positions are rejected from all, a rate higher than expected by chance. The deterministic replicability experiment further elucidates this, demonstrating that certain applicant profiles face widespread rejection across diverse roles, implying a systemic algorithmic barrier rather than a lack of fit for specific jobs. The finding that applicants need to apply widely to ensure human consideration underscores the pervasive nature of these algorithmic filters. The results are presented clearly and are highly impactful for understanding the real-world consequences of algorithmic hiring.
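To see why a 4% all-rejection rate among ten-position applicants can exceed chance, consider the simplest null model in which rejections at different positions are independent with a common rate. This independence baseline and the 50% rate below are assumptions for illustration; the paper's own chance model is not described here.

```python
def prob_rejected_everywhere(rejection_rate, num_positions):
    """Chance of rejection at every position if decisions were independent."""
    return rejection_rate ** num_positions


def rate_needed_for(observed_share, num_positions):
    """Per-position rejection rate that independence would need to reach `observed_share`."""
    return observed_share ** (1.0 / num_positions)


# Hypothetical per-position rejection rate of 50%.
chance_share = prob_rejected_everywhere(0.5, 10)

# Rate that independent decisions would require to produce a 4% all-rejection share.
required_rate = rate_needed_for(0.04, 10)
```

Under independence, a 50% rejection rate would leave about 0.1% of ten-position applicants rejected everywhere, and reaching 4% would require rejecting roughly 72% of applications at each position. An observed share above the independent baseline therefore points to correlated decisions across positions.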
The paper describes the dataset and methodology in sufficient detail for the *analysis* to be understood and potentially replicated on a similar dataset, assuming access to such proprietary data. The core idea of deterministic replicability is clearly explained. However, the dataset itself is proprietary and anonymized, meaning direct replication of the *exact* study on the *exact* data is not possible for external researchers. The vendor's algorithms are also black-box. Despite this, the analytical framework and the metrics used (e.g., adverse impact) are standard, and the concept of simulating outcomes across all positions for deterministic algorithms is generalizable. The paper's contribution lies more in its empirical findings and novel analytical approach rather than a new open-source model or benchmark.
A primary limitation is the reliance on data from a single algorithm vendor. While this allows for a deep dive into "monoculture" within that vendor's ecosystem, it limits the generalizability of the specific findings to the entire algorithmic hiring industry, as other vendors might have different biases or mechanisms. The study also lacks ground truth on actual job performance or human hiring outcomes for rejected candidates, meaning it identifies *algorithmic* bias but cannot definitively state whether the algorithm is rejecting truly qualified candidates (though this is a common and difficult challenge in fairness research). The definition of adverse impact is based on U.S. standards, which may not be universally applicable. Finally, while the paper identifies problems, it does not propose specific algorithmic solutions or mitigation strategies.
This paper has profound broader implications for the field of algorithmic fairness, AI ethics, and public policy. It highlights the systemic risks associated with the widespread adoption of a few dominant AI systems in high-stakes domains like employment. The findings provide critical evidence for policymakers considering regulation of AI in hiring, emphasizing the need for rigorous auditing and transparency. For employers, it serves as a stark warning about the potential for algorithmic monocultures to perpetuate and amplify existing societal biases, urging them to diversify their algorithmic tools or conduct more thorough internal audits. For the ML community, it underscores the importance of developing fair and robust algorithms, as well as methodologies for evaluating the societal impact of deployed systems at scale. The paper contributes significantly to the understanding of how algorithmic decisions can homogenize outcomes and create systemic barriers for certain demographic groups.
This paper makes a significant contribution by empirically demonstrating the existence and consequences of "algorithmic monocultures" in hiring, revealing substantial racial disparities and outcome homogenization through a novel large-scale dataset and an innovative "deterministic replicability" analysis. The work provides compelling evidence of systemic algorithmic bias in a critical real-world application, offering a rigorous framework for auditing deployed AI systems and informing policy discussions on fair AI in employment.
Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Westlake University, MPI for Intelligent Systems
The paper introduces PEFT-Arena, a novel benchmark and analytical framework for evaluating Parameter-Efficient Finetuning (PEFT) methods through the lens of the stability-plasticity dilemma. The core methodology involves:
1. **PEFT-Arena Benchmark Design**: This is a key contribution. It moves beyond traditional downstream accuracy metrics by jointly measuring "plasticity" (adaptation to target tasks, using GLUE/SuperGLUE) and "stability" (retention of pretrained general capabilities, measured by perplexity on diverse datasets like C4, WikiText, and MMLU few-shot accuracy). This dual-metric approach provides a more holistic view of PEFT performance.
2. **Stability-Plasticity Profiles**: The paper systematically evaluates various PEFT methods (LoRA, LoRA+, DoRA, QLoRA, AdaLoRA, OFT, RS-LoRA, IA3, PrefixTuning) on PEFT-Arena to generate their distinct stability-plasticity Pareto frontiers.
3. **Geometric Analysis of PEFT Updates**:
   * **Weight Space Analysis**: Spectral analysis is applied to the PEFT-induced weight updates. This involves examining the singular value decomposition of the update matrices and their alignment with the pretrained weight matrices. This provides insights into how different PEFT methods interact with the model's existing structure.
   * **Activation Space Analysis**: The paper investigates how finetuning affects the representations of general capabilities. It uses metrics like cosine similarity and Euclidean distance between activation vectors before and after finetuning to quantify "representation distortion." A key finding links forgetting to non-isometric representation distortion.
4. **Overshooting Analysis**: The methodology includes analyzing intermediate checkpoints during finetuning to observe the trajectory of stability and plasticity. This reveals that models often "overshoot" an optimal operating point, where a better balance between stability and plasticity could be achieved earlier.
5. **Path-wise Rewinding**: Inspired by the overshooting observation, the paper proposes a post-hoc method, path-wise rewinding, which involves interpolating back along the finetuning path to find a checkpoint with a more favorable stability-plasticity trade-off.
The methodology is robust, well-structured, and introduces a much-needed comprehensive evaluation framework for PEFT. The combination of empirical benchmarking with deep geometric analysis provides both practical insights and theoretical understanding.
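The rewinding idea can be sketched as a selection over saved checkpoints. The version below is a simplified stand-in, not the paper's procedure: it assumes each checkpoint already has a target-task score and a retention score, and it picks the checkpoint with the best retention among those whose target score stays within a tolerance of the final one. The `Checkpoint` type, the tolerance, and all scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    target_score: float     # higher is better on the finetuning task
    retention_score: float  # higher means more pretrained capability retained


def rewind(checkpoints, max_target_drop=0.01):
    """Pick the best-retaining checkpoint that stays close to the final target score."""
    final = max(checkpoints, key=lambda c: c.step)
    floor = final.target_score - max_target_drop
    eligible = [c for c in checkpoints if c.target_score >= floor]
    # Prefer higher retention; break ties in favour of the earlier step.
    return max(eligible, key=lambda c: (c.retention_score, -c.step))


# Hypothetical trajectory in which retention keeps falling after the target score saturates.
trajectory = [
    Checkpoint(step=100, target_score=0.700, retention_score=0.95),
    Checkpoint(step=200, target_score=0.805, retention_score=0.90),
    Checkpoint(step=300, target_score=0.808, retention_score=0.80),
    Checkpoint(step=400, target_score=0.810, retention_score=0.70),
]

chosen = rewind(trajectory)
```

On this invented trajectory the selection returns the step-200 checkpoint, which gives up half a point of target score for a large gain in retention. Interpolating between checkpoints, as the description above suggests, would refine the same search to points between saved steps.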
The experimental evaluation is comprehensive and rigorous.
* **Models**: Experiments are conducted on Llama-2 7B and 13B, which are widely used and relevant LLMs.
* **PEFT Methods**: A broad range of popular and representative PEFT methods are included, allowing for a fair comparison and identification of distinct profiles.
* **Datasets**: The choice of datasets for both plasticity (GLUE, SuperGLUE) and stability (C4, WikiText, MMLU, etc.) is appropriate and covers a diverse set of tasks and knowledge domains. The use of perplexity for general capability retention is a strong choice.
* **Key Findings**:
   * The finding that OFT (Orthogonal Finetuning) consistently achieves the most favorable Pareto frontier across comparable parameter budgets is a significant empirical result. This suggests OFT's inherent design promotes a better balance.
   * The spectral analysis effectively demonstrates how OFT preserves the singular value structure of pretrained weights, while other methods like LoRA introduce more significant changes. This provides a clear mechanistic explanation for OFT's stability.
   * The activation space analysis clearly links forgetting (loss of general capabilities) to non-isometric representation distortion, providing a novel and intuitive explanation for why certain PEFT methods lead to more forgetting.
   * The observation of overshooting in finetuning trajectories is well-supported by evidence from intermediate checkpoints and highlights a common inefficiency in current practices.
   * The case studies with path-wise rewinding demonstrate a practical way to leverage the overshooting phenomenon for post-hoc improvement, validating the analytical insight.
* **Presentation**: Results are clearly presented with informative figures (e.g., Pareto frontiers, spectral plots, activation space visualizations). The analysis is detailed and well-supported by quantitative and qualitative evidence.
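The spectral and isometry claims rest on a basic linear-algebra fact that is easy to check numerically: multiplying a weight matrix by an orthogonal matrix leaves its singular values and all pairwise distances unchanged, whereas an additive low-rank update generally changes both. The sketch below uses random matrices as stand-ins. The sizes, the rank, and the random "trained" factors are assumptions, and practical OFT and LoRA implementations differ in parameterization and initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank = 32, 4  # hypothetical sizes


def singular_values(matrix):
    return np.linalg.svd(matrix, compute_uv=False)


pretrained = rng.standard_normal((dim, dim))

# Orthogonal-style update: left-multiply by an orthogonal matrix obtained via QR.
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
rotated = rotation @ pretrained

# Additive low-rank update: W + B A with random stand-ins for trained factors.
factor_b = rng.standard_normal((dim, rank))
factor_a = rng.standard_normal((rank, dim))
low_rank_updated = pretrained + factor_b @ factor_a

spectrum_kept_by_rotation = np.allclose(singular_values(rotated), singular_values(pretrained))
spectrum_kept_by_low_rank = np.allclose(singular_values(low_rank_updated), singular_values(pretrained))

# Isometry check: an orthogonal map preserves the distance between two activations.
x, y = rng.standard_normal(dim), rng.standard_normal(dim)
distance_before = np.linalg.norm(x - y)
distance_after = np.linalg.norm(rotation @ x - rotation @ y)
distance_preserved = np.isclose(distance_before, distance_after)
```

The rotated weights keep the pretrained spectrum and the distance between the two activations, while the low-rank update shifts the spectrum. This is the textbook property behind the observation that orthogonal updates act isometrically, though it does not by itself establish the forgetting results reported in the paper.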
The use of different parameter budgets for PEFT methods ensures a fair comparison.
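The spectral claim above can be checked on a toy matrix. The sketch below compares an orthogonal (OFT-style) multiplicative update with an additive low-rank (LoRA-style) update. The matrix size, the rank, and the random update magnitudes are arbitrary choices for illustration, not the paper's setup; a trained LoRA update is usually much smaller than the random one used here, so the size of the shift is exaggerated.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, d))          # stand-in for a pretrained weight matrix

# Orthogonal (OFT-style) update: W' = Q @ W with Q orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_orth = Q @ W

# Additive low-rank (LoRA-style) update: W' = W + B @ A.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
W_lora = W + B @ A

def singular_values(M):
    """Singular values in descending order."""
    return np.linalg.svd(M, compute_uv=False)

# Largest change in any singular value under each update.
shift_orth = np.max(np.abs(singular_values(W_orth) - singular_values(W)))
shift_lora = np.max(np.abs(singular_values(W_lora) - singular_values(W)))
```

Left-multiplying by an orthogonal matrix leaves singular values unchanged up to floating-point error, whereas an additive update has no such constraint.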
The paper states that the code and benchmark will be released at `SphereLab.ai/PEFT-Arena`, which is excellent for reproducibility. The methodology is described in sufficient detail, including the specific PEFT methods, LLMs, and datasets used. Hyperparameters and training details are likely provided in the appendix, which is standard practice. Given the explicit mention of a project page for code and data, reproducibility appears to be a high priority for the authors.
1. **LLM Scope**: The experiments are primarily conducted on Llama-2 models (7B and 13B). While these are important models, the generalizability of the findings to other LLM architectures (e.g., encoder-decoder models, different model families) or even larger models is not fully explored.
2. **General Capability Tasks**: The choice of general capability tasks, while diverse, might not cover all aspects of "pretrained capabilities." The definition of stability could be further refined or expanded.
3. **Path-wise Rewinding as Post-hoc**: Path-wise rewinding is presented as a post-hoc improvement. While insightful, it's not an integrated training strategy. Developing methods that inherently avoid overshooting or dynamically adjust during training would be a more significant practical contribution.
4. **Computational Cost**: Running the full PEFT-Arena benchmark with multiple PEFT methods, LLMs, and detailed analysis (especially activation space analysis) can be computationally intensive, which might limit its widespread adoption for rapid experimentation.
5. **Causality vs. Correlation**: While the paper establishes strong correlations (e.g., non-isometric distortion and forgetting), further work could delve into establishing stronger causal links and developing theoretical guarantees.
This paper has significant broader impact potential:
* **New Evaluation Paradigm**: It introduces a much-needed, more comprehensive framework for evaluating PEFT methods, moving beyond single-metric downstream accuracy. This could become a standard for future PEFT research.
* **Deeper Understanding of PEFT**: The geometric analyses (weight and activation space) provide fundamental insights into *why* different PEFT methods behave the way they do, explaining the underlying mechanisms of stability and plasticity. This moves the field beyond empirical observation to mechanistic understanding.
* **Guidance for PEFT Design**: The findings, particularly regarding OFT's performance and the link between non-isometric distortion and forgetting, offer concrete guidance for designing more effective and stable PEFT methods in the future.
* **Improved Finetuning Strategies**: The observation of overshooting and the demonstration of path-wise rewinding suggest practical ways to improve finetuning outcomes, potentially leading to more robust and generalizable adapted models.
* **Mitigating Catastrophic Forgetting**: By providing tools and insights into general capability retention, the work contributes to the broader challenge of mitigating catastrophic forgetting in continuous learning and adaptation of large models.

This paper introduces PEFT-Arena, a novel benchmark and analytical framework that evaluates parameter-efficient finetuning (PEFT) methods through the lens of stability-plasticity, providing deep geometric insights into their distinct behaviors and offering practical strategies for improved finetuning. The work significantly advances the understanding and evaluation of PEFT by moving beyond single-metric assessments, offering a rigorous framework, and uncovering fundamental mechanisms linking weight updates, representation distortion, and the retention of general capabilities in large language models.
Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human
Primary: Stanford University
All Institutions: Stanford University, Chapman University, Northeastern University
This paper makes a significant contribution by empirically demonstrating the existence and consequences of "algorithmic monocultures" in hiring, revealing substantial racial disparities and outcome homogenization through a novel large-scale dataset and an innovative "deterministic replicability" analysis. The work provides compelling evidence of systemic algorithmic bias in a critical real-world application, offering a rigorous framework for auditing deployed AI systems and informing policy discussions on fair AI in employment.
This paper introduces the concept of "algorithmic monoculture" in hiring, where multiple employers rely on algorithms from the same few vendors, and investigates its consequences. The methodology is robust and innovative for analyzing real-world deployed systems. The authors acquire a novel, large-scale dataset of 3 million applicants and 4 million applications, all screened by algorithms from a single major vendor. This dataset allows for an unprecedented look into the systemic effects of a widely used algorithmic hiring tool. To measure racial disparities, they apply U.S. employment discrimination standards (e.g., adverse impact ratio) to the algorithmic outcomes, which is a standard and legally relevant approach. A key methodological innovation is the use of "deterministic replicability" to understand outcome homogenization. Leveraging the deterministic nature of the vendor's algorithms, they simulate the outcomes applicants would receive if they applied to *all* positions screened by that vendor. This "what if" analysis provides deep insights into whether rejections are due to specific job requirements or a systemic algorithmic bias against certain applicant profiles, effectively creating a counterfactual scenario to isolate algorithmic effects. This approach is highly clever and powerful for auditing black-box systems.
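The adverse impact ratio mentioned above is commonly operationalized through the four-fifths rule from the U.S. Uniform Guidelines on Employee Selection Procedures: a group is flagged when its selection rate falls below 80% of the highest group's rate. The sketch below shows that computation; the group labels and counts are made up, and whether the paper applies exactly this threshold per position is an assumption.

```python
def selection_rate(selected, total):
    """Fraction of a group's applications that were recommended to advance."""
    return selected / total

def adverse_impact_ratios(counts):
    """counts: {group: (selected, total)} -> {group: rate / highest group rate}."""
    rates = {g: selection_rate(s, t) for g, (s, t) in counts.items()}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

def flags_adverse_impact(counts, threshold=0.8):
    """True for groups whose ratio falls below the four-fifths threshold."""
    return {g: ratio < threshold
            for g, ratio in adverse_impact_ratios(counts).items()}

# Hypothetical counts for a single position: (selected, total applications).
example = {"group_a": (50, 100), "group_b": (30, 100), "group_c": (45, 100)}
ratios = adverse_impact_ratios(example)
flags = flags_adverse_impact(example)
```

Here `group_b` is selected at 60% of the top group's rate and is flagged, while `group_c` at 90% is not.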
The experimental evaluation is compelling due to the scale and nature of the dataset. Analyzing 3 million applicants and 4 million applications from a single vendor provides strong empirical evidence. The results clearly demonstrate significant racial disparities: 14.74% and 25.87% of applications submitted by Asian and Black applicants, respectively, are directed to positions that adversely impact those groups according to U.S. standards. This is a stark and actionable finding. Furthermore, the paper reveals outcome homogenization, showing that 4% of applicants applying to 10 positions are rejected from all, a rate higher than expected by chance. The deterministic replicability experiment further elucidates this, demonstrating that certain applicant profiles face widespread rejection across diverse roles, implying a systemic algorithmic barrier rather than a lack of fit for specific jobs. The finding that applicants need to apply widely to ensure human consideration underscores the pervasive nature of these algorithmic filters. The results are presented clearly and are highly impactful for understanding the real-world consequences of algorithmic hiring.
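To make "higher than expected by chance" concrete, one simple baseline treats rejections as independent across positions, so the probability of being rejected everywhere is the product of the per-position rejection rates. The sketch below computes that baseline analytically and by simulation. The independence null and the rejection rates are assumptions for illustration; the paper's actual null model is not described here.

```python
import numpy as np

def prob_all_rejected(reject_rates):
    """P(rejected from every position) if rejections were independent."""
    return float(np.prod(reject_rates))

def simulate_all_rejected(reject_rates, n_applicants=100_000, seed=0):
    """Monte Carlo estimate of the same quantity under the independence null."""
    rng = np.random.default_rng(seed)
    rejected = rng.random((n_applicants, len(reject_rates))) < np.asarray(reject_rates)
    return float(rejected.all(axis=1).mean())

rates = [0.6] * 10                      # hypothetical per-position rejection rates
analytic = prob_all_rejected(rates)     # equals 0.6 ** 10
simulated = simulate_all_rejected(rates)
```

An observed all-rejection rate well above such a baseline would indicate that outcomes are correlated across positions, which is the homogenization the review describes.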
The paper describes the dataset and methodology in sufficient detail for the *analysis* to be understood and potentially replicated on a similar dataset, assuming access to such proprietary data. The core idea of deterministic replicability is clearly explained. However, the dataset itself is proprietary and anonymized, meaning direct replication of the *exact* study on the *exact* data is not possible for external researchers. The vendor's algorithms are also black-box. Despite this, the analytical framework and the metrics used (e.g., adverse impact) are standard, and the concept of simulating outcomes across all positions for deterministic algorithms is generalizable. The paper's contribution lies more in its empirical findings and novel analytical approach rather than a new open-source model or benchmark.
A primary limitation is the reliance on data from a single algorithm vendor. While this allows for a deep dive into "monoculture" within that vendor's ecosystem, it limits the generalizability of the specific findings to the entire algorithmic hiring industry, as other vendors might have different biases or mechanisms. The study also lacks ground truth on actual job performance or human hiring outcomes for rejected candidates, meaning it identifies *algorithmic* bias but cannot definitively state whether the algorithm is rejecting truly qualified candidates (though this is a common and difficult challenge in fairness research). The definition of adverse impact is based on U.S. standards, which may not be universally applicable. Finally, while the paper identifies problems, it does not propose specific algorithmic solutions or mitigation strategies.
This paper has profound broader implications for the field of algorithmic fairness, AI ethics, and public policy. It highlights the systemic risks associated with the widespread adoption of a few dominant AI systems in high-stakes domains like employment. The findings provide critical evidence for policymakers considering regulation of AI in hiring, emphasizing the need for rigorous auditing and transparency. For employers, it serves as a stark warning about the potential for algorithmic monocultures to perpetuate and amplify existing societal biases, urging them to diversify their algorithmic tools or conduct more thorough internal audits. For the ML community, it underscores the importance of developing fair and robust algorithms, as well as methodologies for evaluating the societal impact of deployed systems at scale. The paper contributes significantly to the understanding of how algorithmic decisions can homogenize outcomes and create systemic barriers for certain demographic groups.
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
Primary: Google DeepMind
All Institutions: Google DeepMind
VideoMLA has significant broader impact for the field of generative AI, particularly in video synthesis. By drastically reducing KV cache memory and improving throughput, it makes minute-scale, high-quality video generation more accessible and efficient. This can enable new applications in content creation, virtual reality, simulation, and personalized media. The theoretical insights into why MLA works in video diffusion, challenging existing assumptions from language models, are also impactful. This understanding can guide future research in designing efficient attention mechanisms not just for video but potentially for other modalities where spectral properties might differ from language. The method's ability to maintain or improve quality at long horizons is crucial for moving beyond short, clip-based video generation towards more coherent and extended narratives. This paper introduces VideoMLA, a novel low-rank latent KV cache mechanism for autoregressive video diffusion, achieving a 92.7% memory reduction and 1.23x throughput improvement while outperforming baselines on minute-scale video generation. The work provides significant practical advancements in long-context video synthesis and offers a crucial theoretical insight into the efficacy of Multi-Head Latent Attention, demonstrating that the bottleneck itself, rather than the pretrained spectrum, determines the effective rank in video models.
The paper introduces VideoMLA, a novel approach to address the memory and latency bottlenecks of KV caches in minute-scale autoregressive video diffusion. VideoMLA replaces the standard per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key. Specifically, for temporal attention, it projects the input features into a low-rank content latent ($K_{content}, V_{content}$) and adds a separate 3D-RoPE positional key ($K_{pos}$), forming the total key $K = K_{content} + K_{pos}$. This design drastically reduces the per-token KV memory footprint by 92.7%. A significant methodological contribution is the in-depth investigation into why MLA succeeds in video diffusion, despite the common spectral assumption (that attention matrices are low-rank) not holding for pretrained video attention. The authors empirically demonstrate that pretrained video attention is *not* low-rank, but rather the MLA bottleneck itself (the low-rank latent space) determines the effective rank. They show that both spectral and random initialization of the MLA bottleneck occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. This insight challenges existing understanding from language models and provides a new theoretical grounding for MLA's efficacy in video. The architectural modification is elegant and directly targets the memory bottleneck without resorting to complex sparse attention patterns or hierarchical structures.
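The "99%-energy effective rank" used in this argument can be computed directly from singular values. The sketch below assumes the usual definition, the smallest number of singular directions whose squared singular values account for 99% of the total; the paper's exact definition may differ, and the matrices are synthetic stand-ins rather than real attention weights.

```python
import numpy as np

def effective_rank(M, energy=0.99):
    """Smallest k such that the top-k squared singular values hold
    at least `energy` of the total squared spectrum."""
    s = np.linalg.svd(M, compute_uv=False)          # descending order
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
n, r = 256, 8

low_rank = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
full_rank = rng.standard_normal((n, n))

rank_low = effective_rank(low_rank)     # bounded by the bottleneck width r
rank_full = effective_rank(full_rank)   # a large fraction of n
```

The low-rank product can never exceed its bottleneck width, which mirrors the review's point that the bottleneck, not the original spectrum, sets the rank budget.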
The experimental evaluation is comprehensive and well-executed. The authors use StreamingDiffusion (with an SDXL-Video-1.0 backbone) as their base model, trained on WebVid-10M. They evaluate VideoMLA against strong baselines, StreamingDiffusion (SD) and Long-Context Diffusion (LCD), on the VBench benchmark, which assesses various aspects of video generation quality (aesthetic, temporal consistency, motion, etc.). Key results include:
1. **KV Memory Reduction:** VideoMLA achieves a substantial 92.7% reduction in KV cache memory, validating its primary design goal.
2. **Throughput Improvement:** On a single B200 GPU, VideoMLA demonstrates a 1.23x improvement in throughput, indicating practical efficiency gains.
3. **VBench Performance:**
   * At short horizons (up to 128 frames), VideoMLA matches the performance of StreamingDiffusion.
   * Crucially, at long horizons (256-1024 frames), VideoMLA significantly outperforms both StreamingDiffusion and Long-Context Diffusion, achieving the best overall score. This demonstrates its effectiveness in enabling high-quality minute-scale video generation, which is a key challenge.
4. **Ablation Studies:** Detailed ablations investigate the impact of latent dimension, the necessity of 3D-RoPE positional encoding, and the decoupling of content and positional keys, all of which confirm the design choices. The ablation on the MLA bottleneck analysis provides strong empirical evidence for their hypothesis regarding effective rank.

The experiments are rigorous, clearly demonstrating both the efficiency and quality benefits of VideoMLA, particularly for long-context generation.
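The per-token memory saving can be sanity-checked with simple accounting. The sketch below assumes the cache layout used by MLA in language models: standard attention stores a key and a value per head, while MLA stores one shared latent plus one shared positional key per token. The dimensions are hypothetical, picked only so the ratio lands near the reported figure; they are not taken from the paper.

```python
def kv_floats_per_token_mha(n_heads, head_dim):
    """Standard attention caches one key and one value vector per head."""
    return 2 * n_heads * head_dim

def kv_floats_per_token_mla(latent_dim, rope_dim):
    """MLA-style cache: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim

def kv_reduction(n_heads, head_dim, latent_dim, rope_dim):
    """Fractional reduction in cached values per token, per layer."""
    mha = kv_floats_per_token_mha(n_heads, head_dim)
    mla = kv_floats_per_token_mla(latent_dim, rope_dim)
    return 1 - mla / mha

# Hypothetical configuration: 12 heads of width 128, latent 160, RoPE key 64.
saving = kv_reduction(n_heads=12, head_dim=128, latent_dim=160, rope_dim=64)
```

With these made-up numbers the cache shrinks from 3072 to 224 values per token, a saving of about 92.7%, which shows the reported figure is plausible for a modest latent width.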
The paper provides sufficient detail to suggest good reproducibility. It specifies the base model (StreamingDiffusion with SDXL-Video-1.0 backbone), training dataset (WebVid-10M), and evaluation benchmark (VBench). The architectural modifications for VideoMLA are clearly described, including the use of low-rank projections and 3D-RoPE. The supplementary materials mention a GitHub repository (`https://github.com/google-deepmind/video_mla`), which, if populated with code, would greatly enhance reproducibility. The experimental setup, including GPU type (B200) and comparison baselines, is also clearly stated.
The authors acknowledge several limitations:
1. VideoMLA was only tested on temporal attention, not spatial attention. Extending it to spatial attention could offer further memory savings but might introduce new challenges.
2. The method was exclusively evaluated in the context of autoregressive video diffusion. Its applicability to other video generation paradigms or non-autoregressive settings is not explored.
3. Retraining of the attention layers is required, which can be computationally expensive, especially for large models. This is a common limitation for architectural changes.
4. VideoMLA primarily addresses KV cache memory and latency, not the quadratic complexity of attention itself. While it makes long contexts more feasible, the fundamental scaling issue of attention remains for extremely long sequences.
Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
Primary: NVIDIA
All Institutions: NVIDIA
The abstract does not explicitly state limitations. However, potential limitations could include:
1. **Computational Cost of Training**: Training on 138 million samples would require substantial computational resources, potentially limiting access for smaller research groups.
2. **Complexity of PBD**: While PBD offers benefits, its architectural complexity might be higher than simpler token-based decoders, potentially requiring more specialized hardware or optimization.
3. **Generalizability to Out-of-Distribution Data**: While the dataset is large and diverse, the performance on highly novel or abstract concepts not well-represented in the training data might still be a challenge, common to most data-driven models.
4. **Interpretability**: Generative models, especially those with complex decoding mechanisms, can sometimes be less interpretable regarding *why* a specific box was predicted.

BROADER IMPACT: LocateAnything has significant broader impact potential. Its ability to perform fast and high-quality vision-language grounding and detection can enable new capabilities in various domains:
1. **Robotics**: More precise object manipulation and interaction based on natural language commands.
2. **Augmented Reality/Virtual Reality**: Enhanced real-time object recognition and interaction for immersive experiences.
3. **Human-Computer Interaction**: More intuitive and natural ways for users to interact with visual content using language.
4. **Accessibility**: Improved tools for visually impaired individuals to understand and navigate their environment.
5. **Content Understanding**: Better indexing and search capabilities for large visual datasets based on textual queries.

The focus on efficiency (throughput) and accuracy (high-IoU) makes it particularly relevant for real-world deployment where both aspects are crucial. The work also highlights the continued importance of large-scale, high-quality data curation for advancing ML capabilities.
LocateAnything introduces a novel Parallel Box Decoding (PBD) mechanism and a massive dataset to achieve fast and high-quality vision-language grounding and detection. This paper makes a substantial technical contribution by proposing an architectural shift from sequential token-based box decoding to parallel atomic unit decoding, significantly improving both inference speed and localization accuracy across diverse benchmarks, thereby pushing the speed-accuracy frontier for unified visual grounding and detection.
The paper introduces LocateAnything, a unified generative framework for vision-language grounding and detection, built upon a novel Parallel Box Decoding (PBD) mechanism. The core innovation of PBD lies in its departure from the common practice of serializing 2D bounding boxes into 1D tokens. Instead, PBD decodes geometric elements (boxes and points) as atomic units in a single step. This design choice directly addresses two key limitations of sequential token generation: the mismatch with the coupled structure of box geometry (improving coherence) and the practical inference bottleneck due to strict sequentiality (enabling parallelism). The abstract implies a VLM backbone combined with this specialized parallel decoder. This architectural shift is significant for generative models dealing with structured spatial outputs. Furthermore, the paper highlights a "scalable data engine" used to curate "LocateAnything-Data," a massive dataset comprising over 138 million training samples. This large-scale, diverse data is crucial for achieving high-precision localization and complements the architectural improvements. The combination of a novel decoding strategy and an extensive, high-quality dataset forms a robust methodological approach.
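The sequential bottleneck described above can be illustrated with the coordinate-token scheme that generative detectors commonly use: each box is quantized into four integer tokens that are emitted one after another. The sketch below shows that serialization and compares decoding step counts. The bin count, the function names, and the "one step per box" reading of atomic decoding are assumptions for illustration and are not taken from the paper.

```python
def box_to_tokens(box, n_bins=1000):
    """Quantize a normalized (x1, y1, x2, y2) box into four integer tokens."""
    return [min(n_bins - 1, int(c * n_bins)) for c in box]

def tokens_to_box(tokens, n_bins=1000):
    """Map tokens back to bin centers in [0, 1]."""
    return tuple((t + 0.5) / n_bins for t in tokens)

def decode_steps(n_boxes, tokens_per_box=4, atomic=False):
    """Number of sequential decoding steps needed to emit n_boxes boxes."""
    return n_boxes if atomic else n_boxes * tokens_per_box

tokens = box_to_tokens((0.1, 0.2, 0.5, 1.0))
sequential = decode_steps(10)                 # one step per coordinate token
parallel = decode_steps(10, atomic=True)      # one step per box
```

Under these assumptions, emitting each box as a unit cuts the number of sequential steps by the number of tokens per box, and the four coordinates are predicted jointly rather than conditioned one on another.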
The experimental evaluation aims to demonstrate LocateAnything's ability to advance the speed-accuracy frontier in visual grounding and detection. The paper claims "significantly higher decoding throughput" alongside "improving high-IoU localization quality" across diverse benchmarks. The chosen benchmarks, RefCOCO, RefCOCO+, RefCLEF (for referring expression grounding), and LVIS, COCO (for object detection), provide a comprehensive assessment across different tasks and levels of granularity. High-IoU localization quality is a critical metric, indicating precise object boundary prediction, which is often challenging for generative models. The abstract emphasizes the "complementary benefits of Parallel Box Decoding and large-scale training data," suggesting that both components contribute substantially to the reported performance gains. The claims of improved throughput and accuracy on these established benchmarks, if substantiated by detailed results in the full paper, indicate a strong empirical contribution. The provision of a Hugging Face demo and model further supports the practical utility and verifiability of the results.
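For readers less familiar with the metric, "high-IoU localization quality" refers to accuracy measured at strict intersection-over-union thresholds. The sketch below gives the standard IoU computation for axis-aligned boxes and a thresholded accuracy. The helper name `accuracy_at`, the one-to-one pairing of predictions with ground truth, and the example boxes are illustrative assumptions, not the paper's evaluation protocol.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at(preds, gts, thr):
    """Fraction of paired predictions whose IoU with ground truth is >= thr."""
    hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (1, 1, 10, 10)]
loose = accuracy_at(preds, gts, 0.5)
strict = accuracy_at(preds, gts, 0.9)
```

The second prediction has an IoU of 0.81, so it counts as correct at a threshold of 0.5 but not at 0.9, which is why strict thresholds separate models that look identical under looser ones.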
Reproducibility appears to be a strong suit for this work. The paper provides links to a GitHub repository, a Hugging Face model, and a Hugging Face demo. This level of resource sharing is excellent, allowing researchers to inspect the code, run the model, and experiment with the system directly. The mention of a "scalable data engine" and "LocateAnything-Data" suggests that the data curation process is systematic. While the full details of the data curation and model training would be in the complete paper, the availability of code and models significantly lowers the barrier to reproduction and further research.
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build *tiered distractors*: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a *rubric reward* that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B–30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL.
Primary: Tsinghua University
All Institutions: Tsinghua University, Zhipu
LongTraceRL has the potential for significant broader impact on the field of large language models and long-context reasoning.
1. **Improved LLM Capabilities**: By addressing a central challenge of LLMs, it can lead to more reliable and capable models for tasks requiring deep understanding and integration of information from extensive documents, such as complex question answering, scientific literature review, legal document analysis, and medical diagnosis support.
2. **Novel Data Generation Paradigms**: The "tiered distractors" approach offers a new paradigm for creating challenging and realistic long-context benchmarks and training data, which can be adopted by the community to develop more robust LLMs.
3. **Advanced RLVR Techniques**: The "rubric reward" design provides a valuable contribution to the field of Reinforcement Learning with Verifiable Rewards, demonstrating how fine-grained process supervision can be effectively integrated to guide complex reasoning, potentially inspiring similar reward shaping techniques for other intricate tasks.
4. **Foundation for Future Research**: The open-sourced code, datasets, and models will serve as a valuable resource, lowering the barrier for other researchers to build upon this work, explore its limitations, and extend its applicability to new domains and reasoning challenges.

This paper introduces LongTraceRL, a novel approach that significantly enhances long-context reasoning in large language models by proposing an innovative data construction method using tiered distractors from search agent trajectories and a fine-grained rubric reward for process supervision. The work makes a strong technical contribution by addressing critical limitations in existing RLVR methods, demonstrating consistent performance improvements across multiple LLMs and benchmarks, and openly providing resources, thereby offering a promising direction for developing more robust and evidence-grounded reasoning capabilities in LLMs.
The paper introduces LongTraceRL, an approach to improving long-context reasoning in LLMs using Reinforcement Learning with Verifiable Rewards (RLVR). The methodology rests on two components: data construction with "tiered distractors" and a "rubric reward" design. For data construction, the authors generate multi-hop questions using knowledge graph random walks, which yields a structured and verifiable ground truth. They then use search agent trajectories to create "tiered distractors" at two levels of confusability: high-confusability distractors are documents the agent read but did not cite, implying they contain relevant but ultimately non-essential or misleading information; low-confusability distractors are documents that appeared in search results but were never opened, representing less relevant noise. This moves beyond simple random sampling or one-shot search to create more challenging and realistic long-context scenarios, and it directly addresses the limitation of existing RLVR methods that rely on low-confusability distractors. For reward design, the paper proposes a "rubric reward" that provides fine-grained, entity-level process supervision. The reward uses the gold entities along each reasoning chain, offering a more granular signal than typical outcome-only rewards. A critical aspect is the "positive-only strategy," where the rubric reward is applied exclusively to responses with correct final answers. This design aims to distinguish the quality of reasoning among correct responses and to prevent reward hacking: a response cannot collect rubric credit unless its final answer is correct, and among correct responses, one that reaches the answer without covering the gold entities earns less. This is a thoughtful approach to reward shaping in complex reasoning tasks.
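The two distractor tiers described above can be illustrated with a minimal sketch. The `Trajectory` record and its field names are hypothetical and are not taken from the LongTraceRL code; the sketch only assumes that a trajectory logs which documents were retrieved, opened, and cited.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical log of one search-agent run for a single question."""
    retrieved: set[str] = field(default_factory=set)  # doc ids shown in search results
    opened: set[str] = field(default_factory=set)     # doc ids the agent actually read
    cited: set[str] = field(default_factory=set)      # doc ids cited in the final answer


def tier_distractors(traj: Trajectory) -> dict[str, set[str]]:
    """Split a trajectory's documents into evidence and two distractor tiers."""
    # High confusability: the agent read the document but did not cite it.
    high = traj.opened - traj.cited
    # Low confusability: the document was surfaced by search but never opened.
    low = traj.retrieved - traj.opened - traj.cited
    return {"evidence": set(traj.cited), "high": high, "low": low}


traj = Trajectory(
    retrieved={"d1", "d2", "d3", "d4", "d5"},
    opened={"d1", "d2", "d3"},
    cited={"d1"},
)
tiers = tier_distractors(traj)
print(tiers)
```

A long training context would then presumably mix the evidence documents with distractors drawn from both tiers; the mixing ratio is not stated in the abstract.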
The synergy between these two components is strong: challenging data generation forces the model to learn robust reasoning, while the fine-grained rubric reward guides it through complex reasoning steps. While the full technical details of the RL algorithm or specific prompt engineering for the search agent are not available in the provided abstract, the conceptual framework is sound and addresses known limitations in the field.
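The positive-only rubric reward can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the outcome reward of 1.0, the `rubric_weight` of 0.5, and the case-insensitive substring matching of entities are all assumed here.

```python
from __future__ import annotations


def rubric_reward(
    response: str,
    final_answer_correct: bool,
    gold_entities: list[str],
    rubric_weight: float = 0.5,  # assumed weight, not from the paper
) -> float:
    """Outcome reward plus an entity-coverage bonus, gated on correctness."""
    # Positive-only gate: an incorrect final answer earns nothing,
    # no matter how many gold entities the response mentions.
    if not final_answer_correct:
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    coverage = hits / len(gold_entities) if gold_entities else 0.0
    # Assumed outcome reward of 1.0 for a correct answer.
    return 1.0 + rubric_weight * coverage


gold = ["Marie Curie", "Warsaw"]
grounded = rubric_reward("Marie Curie was born in Warsaw, so the answer is Poland.", True, gold)
shortcut = rubric_reward("The answer is Poland.", True, gold)
hacked = rubric_reward("Marie Curie, Warsaw... the answer is France.", False, gold)
print(grounded, shortcut, hacked)
```

Under these assumptions, the grounded response outranks the shortcut response even though both are correct, and the entity-stuffed wrong answer receives no credit.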
The abstract states that experiments were conducted on three reasoning LLMs (ranging from 4B to 30B parameters) across five long-context benchmarks, which covers several model scales and task settings. The claim that LongTraceRL "consistently outperforms strong baselines" is a significant result, suggesting that the proposed methods are robust and effective. The abstract also highlights a qualitative benefit: the approach "encourages comprehensive, evidence-grounded reasoning." This matters for long-context tasks, where the explainability and traceability of the reasoning process are valued alongside the final answer. Without access to the full experimental section, specific metrics, baseline details, and detailed result tables cannot be assessed, but the stated scope and outcomes are promising. The open-sourcing of code, datasets, and models further enhances the value of these experimental findings by enabling verification and future research.
The paper explicitly states that "Codes, datasets and models are available at https://github.com/THU-KEG/LongTraceRL." This commitment to open-sourcing is excellent and significantly boosts the reproducibility of the work. Providing the datasets (especially the uniquely constructed tiered distractors) and the trained models will allow other researchers to replicate the results, build upon the methodology, and further investigate the approach. This is a strong point for the paper.
1. **Data Generation Complexity**: Generating multi-hop questions via knowledge graph random walks and, more significantly, using search agent trajectories to build tiered distractors appears complex and potentially resource-intensive. This may restrict the method to domains where structured knowledge graphs and search agent capabilities are readily available or easily simulated. 2. **Domain Specificity**: The reliance on knowledge graphs for question generation might implicitly limit the types of reasoning tasks or domains where LongTraceRL is most effective. Its generalizability to long-context tasks beyond multi-hop QA (e.g., summarization of unstructured documents, code analysis, creative writing) is not explicitly discussed. 3. **"Positive-Only" Reward Strategy**: While designed to prevent reward hacking, the "positive-only" strategy for the rubric reward might miss valuable learning signals from responses that are incorrect but demonstrate partial understanding or nearly correct reasoning steps. A more nuanced reward function that can provide negative feedback for specific incorrect steps might accelerate learning. 4. **Computational Cost**: The abstract does not discuss the computational cost of training LLMs with RLVR, especially with the complex data generation and fine-grained reward signals. This could be a practical limitation for wider adoption, particularly for larger models.
Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.
Primary: unknown (author affiliations not present in provided text)
All Institutions: unknown (author affiliations not present in provided text)
The paper includes a thoughtful broader impact statement. It highlights the positive implications of low-latency spoken reasoning for accessibility tools, real-time captioning, on-device translation, and conversational tutoring, where reduced response delay is crucial. However, it also responsibly addresses potential misuses, such as systems monitoring or pre-empting conversations without user awareness, lowering the cost of real-time social engineering, and creating asymmetric advantages. The authors state that their design constrains misuse by restricting the action space and explicitly penalizing spurious thinking, aligning with non-interruptive conversational norms. This balanced discussion is commendable. This paper formulates streaming speech reasoning as a learnable wait-think-answer control problem for Large Audio-Language Models, demonstrating that a multi-objective policy optimization approach can effectively shift the accuracy-latency trade-off by enabling models to externalize intermediate reasoning during the audio stream. The work presents a principled and empirically validated methodology for building more responsive and intelligent interactive LALMs, making a significant contribution to the field by addressing a core challenge in real-time human-computer interaction with spoken language.
The paper introduces a novel "wait-think-answer" control formulation for Large Audio-Language Models (LALMs) operating in streaming, real-time environments. This addresses the critical trade-off between reasoning quality and responsiveness. The core idea is to allow the LALM to explicitly decide, under partial audio evidence, when to wait, when to externalize a compact reasoning update, and when to answer.
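A minimal sketch of such a control loop is given below. It is not the authors' implementation: text strings stand in for audio chunks, and `Action`, `run_controller`, and `toy_policy` are hypothetical names introduced only for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Action(Enum):
    WAIT = "wait"      # keep listening, emit nothing
    THINK = "think"    # externalize a compact reasoning update
    ANSWER = "answer"  # commit to a final answer


# A policy sees what was heard so far, prior notes, and whether the stream ended.
Policy = Callable[[List[str], List[str], bool], Tuple[Action, str]]


def run_controller(chunks: List[str], policy: Policy) -> Dict[str, object]:
    """Feed chunks one at a time and let the policy choose an action after each."""
    heard: List[str] = []
    notes: List[str] = []
    for i, chunk in enumerate(chunks):
        heard.append(chunk)
        at_endpoint = i == len(chunks) - 1
        action, text = policy(heard, notes, at_endpoint)
        if action is Action.THINK:
            notes.append(text)
        elif action is Action.ANSWER:
            return {"answer": text, "notes": notes, "chunks_heard": len(heard)}
    # The policy never answered, even at the speech endpoint.
    return {"answer": None, "notes": notes, "chunks_heard": len(heard)}


def toy_policy(heard: List[str], notes: List[str], at_endpoint: bool) -> Tuple[Action, str]:
    """Hand-written stand-in for a learned controller."""
    if at_endpoint:
        return Action.ANSWER, "7"
    if "plus" in heard[-1]:
        return Action.THINK, "operation: addition"
    return Action.WAIT, ""


result = run_controller(["what is", "three plus", "four"], toy_policy)
print(result)
```

In the paper's setting the policy is the fine-tuned LALM itself; the toy policy only shows how intermediate notes accumulate before the endpoint so that less reasoning remains once the user stops speaking.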
The evaluation is conducted on two benchmarks: a six-task synthetic Spoken Reasoning Question Answering (SRQA) benchmark (8,959 items) and a 186-item human-recorded Real Audio Bench. On the synthetic SRQA benchmark, the six-reward DAPO controller improves on both axes: 1. **Accuracy**: Row-weighted accuracy increases from 67.6% (base deployment controller) to 70.3%, a gain of 2.7 percentage points. 2. **Latency**: Mean post-endpoint final-think length (a proxy for user-visible response delay) is reduced by 14%, from 10.44 to 8.99 tokens. This simultaneous improvement in accuracy and responsiveness is a strong result, supporting the core hypothesis that learned control can manage this trade-off. The Real Audio Bench serves as an important transfer check, showing that the controller family remains functional beyond TTS-rendered speech. The best accuracy and the shortest final-think length come from different learned variants on this smaller dataset (SFT and six-reward DAPO, respectively), so the result supports transfer to real human speech without establishing a single best variant. The paper includes appropriate baselines, comparing the base Qwen2.5-Omni-7B (in both offline and deployment modes) with SFT and various DAPO reward stack configurations. External baselines (Audio Flamingo, GLM-4-Voice, Moshi) are also provided in offline mode for context, though a direct controller-level comparison is acknowledged as future work due to implementation constraints. The statistical significance of results on the smaller Real Audio Bench is appropriately acknowledged as limited, and the benchmark is treated as a transfer check, not a fine-grained ranking.
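The headline figures can be checked with a few lines of arithmetic. The sketch also contrasts row-weighted accuracy with a per-task macro average, assuming "row-weighted" means every item (row) counts equally; the per-task counts below are invented for illustration and are not the benchmark's.

```python
from __future__ import annotations


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Figures reported in the review.
acc_gain_pp = round(70.3 - 67.6, 1)                         # percentage points
len_cut_pct = round(100 * relative_reduction(10.44, 8.99), 1)  # percent
print(acc_gain_pp, len_cut_pct)


def row_weighted_accuracy(per_task: list[tuple[int, int]]) -> float:
    """Pool all items: total correct over total items."""
    total = sum(n for n, _ in per_task)
    correct = sum(c for _, c in per_task)
    return correct / total


def macro_accuracy(per_task: list[tuple[int, int]]) -> float:
    """Average the per-task accuracies, so every task counts equally."""
    return sum(c / n for n, c in per_task) / len(per_task)


# Hypothetical (n_items, n_correct) pairs for two tasks of unequal size.
toy_tasks = [(900, 720), (100, 50)]
print(row_weighted_accuracy(toy_tasks), macro_accuracy(toy_tasks))
```

The reduction works out to about 13.9%, consistent with the rounded 14% in the text. The toy example shows why the weighting matters: a large, easy task dominates the row-weighted number.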
The paper provides a good level of detail regarding the methodology, including architecture, training data construction, SFT, DAPO policy optimization, and reward design. Hyperparameters, training curves, and compute details are relegated to an appendix (though the appendix itself is not provided in the text, it is referenced). The mention of a "public repository available on GitHub" (though without a direct URL in the provided text) suggests an intention for reproducibility. The detailed description of the reward terms and their weights, along with the protocol gate, further aids reproducibility. The full-prefix replay implementation is also clearly described, allowing others to replicate the experimental setup.
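To make the reward structure concrete, the sketch below combines the six terms named in the abstract as a gated weighted average. The term keys, the uniform weights, and the behavior of the protocol gate (zeroing the reward for a trace that violates the action protocol) are assumptions for illustration; the paper's actual weights and gate definition are not reproduced in this review.

```python
from __future__ import annotations

# The six reward terms listed in the abstract; key names are hypothetical.
REWARD_TERMS = (
    "answer_correctness",
    "action_validity",
    "update_timing",
    "latency_sync",
    "reasoning_quality",
    "chain_consistency",
)


def combined_reward(
    scores: dict[str, float],
    weights: dict[str, float],
    protocol_ok: bool,
) -> float:
    """Weighted average of per-term scores, gated on protocol validity."""
    # Assumed gate: a malformed wait-think-answer trace earns no reward.
    if not protocol_ok:
        return 0.0
    total_weight = sum(weights[t] for t in REWARD_TERMS)
    return sum(weights[t] * scores[t] for t in REWARD_TERMS) / total_weight


uniform = {t: 1.0 for t in REWARD_TERMS}  # placeholder weights
scores = {t: 0.5 for t in REWARD_TERMS}
scores["answer_correctness"] = 1.0

print(combined_reward(scores, uniform, protocol_ok=True))
print(combined_reward(scores, uniform, protocol_ok=False))
```

Because the score is computed over the whole trace, a controller can gain reward by timing its updates well even when two candidate traces end in the same answer.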
The authors openly discuss several limitations: 1. **Real Audio Bench Scale**: The human-recorded benchmark is small (186 items, five speakers), limiting its generalizability across accents, environments, and interaction styles. 2. **Latency Measurement**: Latency measurements are within-harness only, separating residual reasoning length from implementation runtime. The full-prefix replay incurs repeated prefill costs, meaning the reported RTF (in an unprovided appendix) measures the paper's implementation rather than an ideal cache-native server. 3. **Deployment Optimization**: The current implementation uses full-prefix replay rather than an optimized cache-native serving, which would require lower-level runtime and kernel work. 4. **Baseline Comparison**: A head-to-head implementation of other question-completeness controllers within the same Qwen harness is identified as future work, meaning direct comparisons of controller paradigms are not fully realized. These acknowledged limitations are fair and demonstrate a realistic understanding of the current scope of the work.
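The repeated prefill cost of full-prefix replay can be illustrated with a back-of-the-envelope count. This sketch assumes one controller decision per audio chunk, equal-sized chunks, and no generated tokens; the chunk count and size are invented, and real costs depend on the serving stack.

```python
def replay_prefill_tokens(n_chunks: int, tokens_per_chunk: int) -> int:
    """Full-prefix replay: decision k re-encodes all k chunks heard so far."""
    return sum(k * tokens_per_chunk for k in range(1, n_chunks + 1))


def cached_prefill_tokens(n_chunks: int, tokens_per_chunk: int) -> int:
    """Idealized cache-native serving: each chunk is encoded exactly once."""
    return n_chunks * tokens_per_chunk


n, c = 10, 25  # hypothetical stream: 10 chunks of 25 tokens each
replay = replay_prefill_tokens(n, c)
cached = cached_prefill_tokens(n, c)
print(replay, cached, replay / cached)
```

Under these assumptions replay processes c·n(n+1)/2 tokens against c·n with caching, a ratio of (n+1)/2 that grows linearly with stream length. This is why the reported RTF reflects the paper's implementation and not an ideal cache-native server.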