Week of May 17 – May 24, 2026
Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77$\times$ speedup on language-only workloads and up to 2.77$\times$ on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84$\times$ while preserving training correctness.
Primary: Tsinghua University
All Institutions: Tsinghua University, Scitix AI
This paper introduces Runtime-Readiness-First Pipeline (RRFP), a novel runtime system for pipeline-parallel training that treats schedules as non-binding hints to dynamically dispatch ready work, significantly improving efficiency under runtime variability. The technical contribution is substantial, presenting a well-designed system with message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, all rigorously evaluated across diverse workloads and scales, demonstrating up to 2.77x speedup over fixed-order baselines and outperforming state-of-the-art external systems.
The paper presents a highly relevant and well-conceived methodology to address a critical limitation in pipeline-parallel training: the fragility of pre-committed execution orders under runtime variability. The core innovation of RRFP lies in its "readiness-driven" approach, where schedules are treated as non-binding hints for ranking currently ready work, rather than strict sequences to be followed. This conceptual shift is supported by three robust mechanisms: (1) message-driven asynchronous communication, which decouples data transfer from computation and correctly handles out-of-order tensor arrivals; (2) lightweight tensor-parallel coordination, which ensures collective consistency across TP ranks without enforcing a global pipeline order; and (3) ready-set arbitration, a low-overhead dispatch layer that efficiently selects tasks from the ready set based on the hint order. The design is modular, allowing existing scheduling strategies to be integrated as hints. The analytical characterization, while simplified, provides a useful theoretical foundation for understanding the behavior of the BF hint and its proximity to optimal performance under certain conditions. This holistic approach demonstrates a deep understanding of the practical challenges in distributed ML systems.
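To make the "schedule as a non-binding hint" idea concrete, here is a minimal single-stage sketch contrasting fixed-order consumption with ready-set arbitration. It is an illustration under stated assumptions, not RRFP's implementation: the names `Task`, `hint_ranks`, `pick_next`, and `fixed_order_next` are hypothetical, and communication and tensor-parallel coordination are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str        # "F" (forward) or "B" (backward); hypothetical encoding
    microbatch: int


def hint_ranks(hint_order):
    """Map each task to its position in the (non-binding) hint order."""
    return {task: rank for rank, task in enumerate(hint_order)}


def pick_next(ready, ranks):
    """Ready-set arbitration: among tasks whose inputs have arrived,
    return the one the hint ranks earliest; None if nothing is ready."""
    if not ready:
        return None
    return min(ready, key=lambda t: ranks[t])


def fixed_order_next(hint_order, done, ready):
    """Fixed-order baseline: only the next unfinished task in the schedule
    may run, so the stage blocks (None) when that task is not ready."""
    for task in hint_order:
        if task not in done:
            return task if task in ready else None
    return None


# Illustrative scenario: F1's input is delayed, but B0's gradient has arrived.
hint = [Task("F", 0), Task("F", 1), Task("B", 0), Task("B", 1)]
ranks = hint_ranks(hint)
done = {Task("F", 0)}
ready = {Task("B", 0)}

print(fixed_order_next(hint, done, ready))  # None: the stage idles waiting for F1
print(pick_next(ready, ranks))              # Task(kind='B', microbatch=0)
```

In this toy scenario the fixed-order stage idles while executable work exists, which is the blocking behavior the review attributes to pre-committed orders; the readiness-driven stage runs the best-ranked ready task instead.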
The experimental evaluation is exceptionally comprehensive and rigorous, providing strong evidence for RRFP's effectiveness. The authors evaluate RRFP across a diverse range of workloads, including both language-only (GPT3-Large) and heterogeneous multimodal models (Qwen3, LLaMA3 with various ViT sizes), which are particularly prone to runtime variability. The experiments are conducted at significant scale, up to 128 GPUs, demonstrating practical applicability. Crucially, RRFP is compared against strong baselines, including both same-codebase fixed-order methods (1F1B, ZeroBubble) and leading external distributed training frameworks (DeepSpeed, Cornstarch). The results consistently show substantial speedups, up to 1.77x on language-only and 2.77x on multimodal workloads over 1F1B, and up to 1.84x over the faster external system. The detailed runtime breakdown analysis (RQ2) is particularly impactful, clearly demonstrating that RRFP's gains primarily stem from a significant reduction in blocking time, directly validating the paper's central hypothesis. Further experiments on robustness to injected jitter (RQ4), sensitivity to different hint orders (RQ5), and scaling across pipeline depth, modality imbalance, and global batch size (RQ6) provide compelling evidence for RRFP's reliability, flexibility, and broad applicability.
The paper provides a solid basis for reproducibility. It clearly states that RRFP is implemented as an extension to a Megatron-based training framework and utilizes a C++ communication backend. The experimental setup is meticulously detailed, including specific model architectures, parallel configurations (TP/PP/DP), global batch sizes, hardware specifications, and the number of runs and measured iterations. Key configurable parameters, such as the buffer-size limit, are discussed with sensitivity analyses. The hint algorithms (BF, BFW) are described, with further details for BF provided in the appendix. The authors also validate training correctness by comparing loss trends with baselines under matched seeds. While the full source code is not included in the paper text, the level of detail provided should enable experienced systems researchers to reproduce the core findings.
While the paper is outstanding, a few minor limitations can be noted. The analytical characterization, though helpful, is simplified by ignoring factors like communication time and tensor-parallel coordination, which are present in the full RRFP runtime. While RRFP is shown to be robust to various hint orders, it still relies on an *external* hint; the paper does not propose novel hint generation strategies, focusing solely on runtime consumption. The overhead of the lightweight tensor-parallel coordination, while shown to be small in the evaluated settings, might become more significant in scenarios with extremely high tensor parallelism or very small microbatches. Finally, while the buffer-size limit is analyzed, the memory footprint of these buffers could be a consideration in extremely memory-constrained environments, though the chosen default seems reasonable for the evaluated models.
This work has a profound broader impact on the field of large-scale distributed machine learning. By providing a robust and efficient solution to the pervasive problem of runtime variability in pipeline parallelism, RRFP enables more effective training of increasingly large and complex models, especially heterogeneous multimodal architectures. The readiness-driven execution paradigm represents a significant conceptual advancement in runtime system design for distributed ML, potentially influencing future frameworks to adopt more adaptive, runtime-aware execution strategies over rigid, pre-determined schedules. This could lead to substantial improvements in GPU utilization, reduced training times, and unlock the ability to train even larger models that are currently bottlenecked by pipeline inefficiencies. The insights into handling out-of-order communication and maintaining collective consistency in dynamic environments are also valuable contributions to general distributed systems research in ML.
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
Primary: University of Maryland, College Park
All Institutions: Department of Physics, University of Maryland, College Park; Department of Computer Science, University of Maryland, College Park; Joint Quantum Institute, University of Maryland, College Park; Meta Superintelligence Labs, Fundamental AI Research
This paper introduces a rigorous quantitative framework for evaluating hyperparameter transfer and empirically demonstrates that the primary advantage of Maximal Update Parameterization ($\mu$P) over Standard Parameterization (SP) in AdamW-trained Transformers lies in its higher embedding layer learning rate, offering a practical and simplified path to robust hyperparameter transfer. The comprehensive analysis provides both a valuable diagnostic tool for the field and a surprising, actionable insight that can significantly impact the efficiency and stability of large language model training. The meticulous experimental design, extensive compute usage, and clear presentation of results make this a highly impactful contribution to the understanding and practice of scaling laws in deep learning.
The paper develops a robust and well-articulated quantitative framework for evaluating hyperparameter transfer, which is a significant methodological contribution. The framework introduces three complementary metrics: Loss Predictability Error ($E$), Transfer Robustness Exponent ($\alpha$), and Asymptotic Loss Degradation ($R(\cdot)$). These metrics are derived from a sound theoretical model that describes how the loss landscape, optimal learning rate, and Hessian scale with width, using power laws. The choice to operate in log-learning-rate space to mitigate asymmetry around the optimum is a practical and justified decision. The fitting procedures are meticulously detailed, including filtering of unstable runs, cubic spline interpolation for smoothing, the use of Huber loss for robustness to outliers, and a sophisticated method for resolving degeneracies in scaling law fits. The systematic ablation study, which exhaustively explores all 16 combinations of differences between SP and $\mu$P, is exceptionally thorough and provides a strong basis for isolating the key factors. The extension of the findings to CNNs on CIFAR-100 further supports the generalizability of the insights regarding first-layer learning rates.
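As a rough illustration of the kind of scaling-law fit described above, the sketch below fits a power law for the optimal learning rate as a function of width and extrapolates it to a larger width. It rests on loud assumptions: it uses ordinary least squares in log-log space rather than the Huber loss the paper reports, the data are synthetic and generated to follow an exact $1/\text{width}$ law, and the helper names `fit_power_law` and `predict` are hypothetical.

```python
import math


def fit_power_law(widths, values):
    """Least-squares fit of values ~ c * width**p in log-log space.

    Plain least squares is a simplification; the reviewed paper
    reports a Huber loss for robustness to outliers. Returns (c, p).
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    p = sxy / sxx                 # slope in log-log space = power-law exponent
    c = math.exp(my - p * mx)     # intercept in log space -> prefactor
    return c, p


def predict(c, p, width):
    """Evaluate the fitted power law at a new width."""
    return c * width ** p


# Synthetic, illustrative data only: optimal LR constructed to decay as 1/width.
widths = [128, 256, 512, 1024]
opt_lr = [0.5 / w for w in widths]

c, p = fit_power_law(widths, opt_lr)
lr_2048 = predict(c, p, 2048)
print(f"exponent p = {p:.3f}, extrapolated LR at width 2048 = {lr_2048:.2e}")
```

Working in log space mirrors the review's note that the authors operate in log-learning-rate space; the residual of such a fit and the sensitivity of the loss to an extrapolation error are the quantities the paper's metrics are meant to capture, though their exact definitions are not reproduced here.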
The experimental evaluation is comprehensive and demonstrates a high degree of rigor. The authors pre-train GPT-style Transformers on the FineWeb-Edu dataset, scaling model width from 128 to 2048, using the AdamW optimizer with a Warmup-Stable-Decay schedule. The scale of the hyperparameter sweeps is impressive, covering 20 learning rates, 8 weight decay values, 8 widths, 16 parameterizations, and 2 training regimes (fixed-step and compute-optimal), amounting to an estimated 160,000 H100 GPU hours. This extensive compute budget underscores the thoroughness of the investigation. The core finding—that SP with an appropriately scaled embedding layer learning rate (SP+Embd) achieves transfer quality comparable to $\mu$P, while reducing the embedding LR in $\mu$P ($\mu$P-Embd) degrades its performance—is strongly supported by clear empirical evidence and visualizations. The nuanced analysis of weight decay's impact on predictability and robustness across different scaling regimes provides valuable insights into its complex role. Additional experiments, such as switching embedding learning rates during training and freezing the embedding layer, further reinforce the critical importance of the embedding layer.
The paper provides excellent detail for reproducibility. The experimental setup is thoroughly described, covering model architecture specifics (GPT-style, 12 blocks, 1024 context, GPT-2 tokenizer, head dimension), optimizer parameters, learning rate schedule, and training configurations for both fixed-step and compute-optimal settings. The methodology for filtering, interpolation, and fitting the transfer metrics is also well-documented in the appendix. The explicit mention of the substantial compute resources used (160,000 H100 GPU hours) provides context for the scale of the experiments. The primary limitation for reproducibility, acknowledged by the authors, is the use of a single random seed per configuration. While understandable given the vastness of the hyperparameter sweeps, it means that some stochastic variability might not be fully captured.
The authors acknowledge several limitations. The experiments are confined to decoder-only Transformers with fixed depth, using the AdamW optimizer, and a single dataset (FineWeb-Edu). The findings regarding the embedding layer learning rate are thus specific to this context, although the framework itself is general. The use of a single random seed per configuration is a practical necessity given the scale but could limit the generalizability of specific quantitative results. The paper also identifies that the appropriate weight decay scaling in the compute-optimal regime is not fully resolved and requires further investigation. Minor methodological choices, such as the specific threshold `f=1.35` for filtering loss curves and capping scaling exponents at `2.0`, while practical, could potentially influence the interpretation of extreme behaviors.
This paper has substantial broader impact for the machine learning community, particularly for researchers and practitioners involved in training large neural networks and LLMs. The proposed quantitative framework for hyperparameter transfer offers a much-needed standardized tool for rigorously evaluating and comparing different scaling strategies, which can be applied across various architectures, hyperparameters, and scaling dimensions. The central empirical finding, that the overwhelming benefit of $\mu$P over SP (when using AdamW) stems from simply maximizing the embedding layer learning rate, is highly actionable. It simplifies the understanding of $\mu$P's practical advantages and provides a direct, low-cost modification for SP users to achieve comparable transfer quality, potentially leading to more efficient and stable LLM training. This insight also highlights the embedding layer as a critical, and often overlooked, source of training instabilities, prompting a re-evaluation of hyperparameter choices for boundary layers in neural networks. The work significantly advances the understanding of scaling laws and practical challenges in large-scale model training.
Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting ($k=0$), CPD reaches AUROC $0.88$ and F1 $0.82$. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality.
Primary: University College London
All Institutions: University College London
This paper introduces CPD Online, a novel, training-free, and online detector for fluent optimization-based adversarial suffixes in LLMs, leveraging sequential entropy changes and a CUSUM statistic. The work significantly advances LLM safety by providing a highly practical and effective method that outperforms perplexity-based baselines in detection F1 and offers superior token-level localization, demonstrating its utility in reducing computational overhead when gating expensive safety classifiers.
The paper proposes a novel approach to detect fluent optimization-based adversarial suffixes in LLMs by casting it as an online change-point detection problem over the token-level next-token entropy stream. The core methodology involves:
1. **Entropy Stream**: Leveraging the sequence of next-token entropies emitted by the LLM as it processes a prompt. The insight is that adversarial suffixes, even when fluent, can induce a persistent shift in this entropy stream.
2. **Robust Baseline Estimation**: Using the fixed LLM system prompt to estimate a robust baseline for entropy. This is a clever and practical choice, as the system prompt is deployment-specific and stable. Median and Median Absolute Deviation (MAD) are used for robust location and scale estimation.
3. **Standardization**: User-token entropies are standardized using the estimated baseline statistics, aiming for a near-zero mean under benign conditions.
4. **One-Sided Page CUSUM Statistic**: A classical online change-point detection algorithm, CUSUM, is applied to the standardized entropy stream. A one-sided CUSUM is used to detect sustained *upward* shifts, accumulating deviations only when standardized entropies exceed a reference value `k`. This allows for online detection and localization.
5. **Localization**: The standard CUSUM backtracking rule is employed to estimate the onset of the adversarial suffix, providing token-level granularity.
6. **Hybrid Gating**: The proposed detector (CPD Online) is designed to be lightweight and training-free, enabling its use as a gate for more expensive safety classifiers like LLaMA Guard, reducing computational overhead in high-volume deployments.

The methodology is sound, well-motivated by the limitations of existing perplexity-based methods against fluent attacks, and leverages established statistical tools (CUSUM) in a novel application context. The model-agnostic, training-free, and online nature are significant practical advantages.
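The baseline estimation, standardization, CUSUM, and localization steps above can be sketched in a few lines. This is a minimal sketch under assumptions, not the authors' released code: the function names `robust_baseline` and `cusum_detect`, the threshold value `h=5.0`, the small `eps` guard, the unscaled MAD, and the entropy values are all illustrative choices, and onset is estimated as the first token after the statistic last sat at zero.

```python
import statistics


def robust_baseline(system_entropies):
    """Median and MAD of system-prompt entropies (robust location and scale).

    The MAD is left unscaled here; whether the paper applies a
    consistency constant is not stated in the review.
    """
    med = statistics.median(system_entropies)
    mad = statistics.median([abs(h - med) for h in system_entropies])
    return med, mad


def cusum_detect(user_entropies, med, mad, k=0.0, h=5.0, eps=1e-8):
    """One-sided (upward) CUSUM over standardized user-token entropies.

    Returns (alarm_index, onset_estimate), or (None, None) if the
    statistic never exceeds the threshold h (an assumed value).
    """
    s = 0.0
    onset = 0  # first token index since the statistic was last zero
    for t, ent in enumerate(user_entropies):
        z = (ent - med) / (mad + eps)   # standardize against the baseline
        s = max(0.0, s + z - k)         # accumulate only upward deviations
        if s == 0.0:
            onset = t + 1               # reset the candidate change point
        elif s > h:
            return t, onset             # alarm time and backtracked onset
    return None, None


# Illustrative numbers only: a stable baseline, then a jump at index 4.
system = [2.0, 2.2, 1.8, 2.1, 1.9]
med, mad = robust_baseline(system)

attack = [2.0, 1.9, 2.0, 1.9, 3.0, 3.2]
benign = [2.0, 1.9, 2.0, 1.9]

print(cusum_detect(attack, med, mad))   # (4, 4)
print(cusum_detect(benign, med, mad))   # (None, None)
```

Because the statistic is updated one token at a time and carries only a running sum and a candidate onset, a detector of this shape can run online alongside decoding, which is the property the review highlights.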
The experimental evaluation is comprehensive and rigorous:
1. **Benchmark**: A robust benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts per model is constructed. The perplexity-matching of benign prompts to fluency-optimized attacks is crucial for stress-testing detectors.
2. **Models**: Evaluation is performed across six diverse open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B), demonstrating broad applicability.
3. **Baselines**: CPD is compared against strong perplexity-based baselines: global perplexity (PP) and windowed perplexity (WPP) with various window sizes, including a per-token max-NLL ($w=1$) variant. LLaMA Guard is used for hybrid deployment analysis.
4. **Metrics**: Prompt-level F1 and AUROC are used for detection performance. A detailed locality analysis (before-suffix, before+in, in-suffix alarms) is provided for localization accuracy. Guard call reduction is measured for hybrid deployment.
5. **Key Findings**:
   * **Detection**: CPD consistently improves F1 over the strongest WPP baseline across all six models, with significant margins on LLaMA-2 and Qwen2.5-14B. Global PP is shown to be ineffective due to perplexity-matched benigns.
   * **Localization**: CPD demonstrates superior localization, concentrating 79.6% of its triggers inside the adversarial suffix, significantly outperforming WPP (17-46%). This is a major practical advantage.
   * **Hybrid Gating**: When used as a gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving detection quality.
6. **Ablations and Sensitivity**: The paper includes thorough sensitivity analyses for the CUSUM slack parameter `k` and the PP-gap multiplier `alpha`, demonstrating the robustness of CPD's performance. A leave-one-attack-out (LOAO) analysis is also provided to assess generalization to unseen attack families.

The experimental setup is well-designed to validate the claims, and the results clearly demonstrate the superiority of CPD over perplexity-based methods, especially for fluent attacks and localization.
The paper provides a clear description of the methodology, including equations for entropy, standardization, and the CUSUM statistic. Details on baseline estimation (median, MAD), CUSUM parameters (`k`, `h`), and localization are explicit. The experimental setup, including data sources, attack families, benign prompt sampling (perplexity matching), model details, and evaluation protocols (5-fold CV, metrics), is well-documented. The authors state they will release the code at `https://github.com/cpdonline/cpdonline`, which is a strong commitment to reproducibility.
The authors acknowledge several limitations:
1. **Heuristic Nature**: CPD uses token-level entropy as a heuristic proxy for distributional change, not a true log-likelihood ratio, and violates classical change-point detection assumptions (independent observations, known likelihood ratios). Thus, it lacks minimax optimality guarantees.
2. **Access to Probabilities**: The method requires access to token-level probabilities (or logits), which might not be available for closed LLM APIs.
3. **Calibration**: Performance can degrade with strong distribution shifts in benign prompts, requiring recalibration. Online calibration is suggested as future work.
4. **Deployment Engineering**: Full practical deployment of a hybrid system raises further questions about intervention, alarm explanation, and disagreement handling.
5. **Threat Model**: The current threat model is restricted to optimization-based suffixes; extending to prefix attacks or indirect prompt injection is future work.

These are reasonable limitations for a practical method, and the authors are transparent about them.
This work has significant positive broader impact:
1. **LLM Safety**: It directly addresses a critical and growing challenge in LLM safety: detecting sophisticated, fluent adversarial attacks that can jailbreak models.
2. **Practical Deployment**: The method's lightweight, training-free, online, and model-agnostic nature makes it highly practical for real-world LLM deployments, especially in resource-constrained environments.
3. **Cost Reduction**: By acting as a gate for expensive safety classifiers like LLaMA Guard, CPD can substantially reduce operational costs associated with safety monitoring.
4. **Enhanced Mitigation**: Token-level localization of adversarial suffixes enables more targeted and effective downstream mitigations (e.g., selective filtering, sanitization).
5. **Research Direction**: It encourages further exploration of sequential and statistical tools, particularly internal uncertainty dynamics, for LLM safety and adversarial detection.

Potential negative impacts, as acknowledged by the authors, include adaptive adversaries attempting to evade entropy shifts and false positives on out-of-distribution benign prompts. The authors recommend safeguards like periodic recalibration and pairing with semantic classifiers.
While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.
This paper introduces a foundation model for wearable health, pretrained on an unprecedented scale of unlabeled sensor data, demonstrating its ability to enable label-efficient learning and generative capabilities across diverse health tasks, further enhanced by LLM agents for autonomous model head discovery. The work presents a highly ambitious and technically sophisticated approach to a critical problem in digital health: extracting personalized insights from ubiquitous wearable data. The sheer scale of pretraining data (one trillion minutes from five million participants) is unprecedented for this domain, establishing a new benchmark for data-driven health AI. The demonstration of systematic performance improvements through scaling, coupled with the ability to perform few-shot learning and generative tasks, positions this as a potentially transformative step for wearable AI. Furthermore, the novel integration of LLM agents to autonomously optimize downstream predictive heads represents a significant methodological contribution that could generalize to other foundation model applications, pushing the boundaries of automated model development. The validation by clinicians for a Personal Health Agent underscores its practical relevance and potential for real-world impact, making it a foundational work for future research in this rapidly evolving field.
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
Primary: University of Maryland, College Park
All Institutions: Department of Physics, University of Maryland, College Park; Department of Computer Science, University of Maryland, College Park; Joint Quantum Institute, University of Maryland, College Park; Meta Superintelligence Labs, Fundamental AI Research
This paper introduces a rigorous quantitative framework for evaluating hyperparameter transfer and empirically demonstrates that the primary advantage of Maximal Update Parameterization ($\mu$P) over Standard Parameterization (SP) in AdamW-trained Transformers lies in its higher embedding layer learning rate, offering a practical and simplified path to robust hyperparameter transfer. The comprehensive analysis provides both a valuable diagnostic tool for the field and a surprising, actionable insight that can significantly impact the efficiency and stability of large language model training. The meticulous experimental design, extensive compute usage, and clear presentation of results make this a highly impactful contribution to the understanding and practice of scaling laws in deep learning.
The paper develops a robust and well-articulated quantitative framework for evaluating hyperparameter transfer, which is a significant methodological contribution. The framework introduces three complementary metrics: Loss Predictability Error ($E$), Transfer Robustness Exponent ($\alpha$), and Asymptotic Loss Degradation ($R(\cdot)$). These metrics are derived from a sound theoretical model that describes how the loss landscape, optimal learning rate, and Hessian scale with width, using power laws. The choice to operate in log-learning-rate space to mitigate asymmetry around the optimum is a practical and justified decision. The fitting procedures are meticulously detailed, including filtering of unstable runs, cubic spline interpolation for smoothing, the use of Huber loss for robustness to outliers, and a sophisticated method for resolving degeneracies in scaling law fits. The systematic ablation study, which exhaustively explores all 16 combinations of differences between SP and $\mu$P, is exceptionally thorough and provides a strong basis for isolating the key factors. The extension of the findings to CNNs on CIFAR-100 further supports the generalizability of the insights regarding first-layer learning rates.
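To make the fitting step concrete, the following is a minimal sketch of a Huber-robust power-law fit of the optimal learning rate against width, done as a straight-line fit in log-log space. It is not the authors' procedure: the function names, the `delta` value, and the iteratively reweighted least-squares solver are assumptions chosen for illustration, and the paper's filtering, spline interpolation, and degeneracy handling are omitted.

```python
import math

def huber_line_fit(xs, ys, delta=0.1, iters=50):
    """Fit y ~ a + b*x with Huber weights via iteratively reweighted least squares."""
    weights = [1.0] * len(xs)
    a = b = 0.0
    for _ in range(iters):
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        b = sxy / sxx
        a = my - b * mx
        residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Huber weighting: full weight for small residuals, down-weighted beyond delta.
        weights = [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]
    return a, b

def fit_power_law(widths, optimal_lrs, delta=0.1):
    """Return (prefactor, exponent) such that lr_opt ~ prefactor * width**exponent."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    a, b = huber_line_fit(xs, ys, delta)
    return math.exp(a), b

# Synthetic (hypothetical) data: lr_opt = 0.5 / width, with one corrupted point.
widths = [128, 256, 512, 1024, 2048]
lrs = [0.5 / w for w in widths]
lrs[3] *= 3.0  # simulated outlier from an unstable run
prefactor, exponent = fit_power_law(widths, lrs)
```

With the outlier down-weighted, the recovered exponent stays close to the true value of -1, whereas an ordinary least-squares fit on the same data would be pulled noticeably toward the corrupted point.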
The experimental evaluation is comprehensive and demonstrates a high degree of rigor. The authors pre-train GPT-style Transformers on the FineWeb-Edu dataset, scaling model width from 128 to 2048, using the AdamW optimizer with a Warmup-Stable-Decay schedule. The scale of the hyperparameter sweeps is impressive, covering 20 learning rates, 8 weight decay values, 8 widths, 16 parameterizations, and 2 training regimes (fixed-step and compute-optimal), amounting to an estimated 160,000 H100 GPU hours. This extensive compute budget underscores the thoroughness of the investigation. The core finding—that SP with an appropriately scaled embedding layer learning rate (SP+Embd) achieves transfer quality comparable to $\mu$P, while reducing the embedding LR in $\mu$P ($\mu$P-Embd) degrades its performance—is strongly supported by clear empirical evidence and visualizations. The nuanced analysis of weight decay's impact on predictability and robustness across different scaling regimes provides valuable insights into its complex role. Additional experiments, such as switching embedding learning rates during training and freezing the embedding layer, further reinforce the critical importance of the embedding layer.
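As a rough sanity check on the scale of the sweep, the reported dimensions can be multiplied out. The sketch below assumes the sweep is a full Cartesian grid over all five dimensions, which the text does not confirm, so the per-run figure is only an order-of-magnitude estimate; the variable and function names are illustrative.

```python
from math import prod

# Sweep dimensions as reported in the review text.
sweep = {
    "learning_rates": 20,
    "weight_decays": 8,
    "widths": 8,
    "parameterizations": 16,
    "regimes": 2,
}
REPORTED_GPU_HOURS = 160_000  # estimated H100 GPU hours, as reported

def grid_size(dims):
    """Number of runs if every combination is trained (an assumption)."""
    return prod(dims.values())

def mean_hours_per_run(total_hours, dims):
    """Average GPU hours per run under the full-grid assumption."""
    return total_hours / grid_size(dims)

runs = grid_size(sweep)                                     # 40960
hours = mean_hours_per_run(REPORTED_GPU_HOURS, sweep)       # 3.90625
```

Under that assumption the sweep would contain 40,960 runs at roughly 3.9 GPU hours each on average; wider models would take more and narrower ones less.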
The paper provides excellent detail for reproducibility. The experimental setup is thoroughly described, covering model architecture specifics (GPT-style, 12 blocks, 1024 context, GPT-2 tokenizer, head dimension), optimizer parameters, learning rate schedule, and training configurations for both fixed-step and compute-optimal settings. The methodology for filtering, interpolation, and fitting the transfer metrics is also well-documented in the appendix. The explicit mention of the substantial compute resources used (160,000 H100 GPU hours) provides context for the scale of the experiments. The primary limitation for reproducibility, acknowledged by the authors, is the use of a single random seed per configuration. While understandable given the vastness of the hyperparameter sweeps, it means that some stochastic variability might not be fully captured.
The authors acknowledge several limitations. The experiments are confined to decoder-only Transformers with fixed depth, using the AdamW optimizer, and a single dataset (FineWeb-Edu). The findings regarding the embedding layer learning rate are thus specific to this context, although the framework itself is general. The use of a single random seed per configuration is a practical necessity given the scale but could limit the generalizability of specific quantitative results. The paper also identifies that the appropriate weight decay scaling in the compute-optimal regime is not fully resolved and requires further investigation. Minor methodological choices, such as the specific threshold `f=1.35` for filtering loss curves and capping scaling exponents at `2.0`, while practical, could potentially influence the interpretation of extreme behaviors.
This paper has substantial broader impact for the machine learning community, particularly for researchers and practitioners involved in training large neural networks and LLMs. The proposed quantitative framework for hyperparameter transfer offers a much-needed standardized tool for rigorously evaluating and comparing different scaling strategies, which can be applied across various architectures, hyperparameters, and scaling dimensions. The central empirical finding, that the overwhelming benefit of $\mu$P over SP (when using AdamW) stems from simply maximizing the embedding layer learning rate, is highly actionable. It simplifies the understanding of $\mu$P's practical advantages and provides a direct, low-cost modification for SP users to achieve comparable transfer quality, potentially leading to more efficient and stable LLM training. This insight also highlights the embedding layer as a critical, and often overlooked, source of training instabilities, prompting a re-evaluation of hyperparameter choices for boundary layers in neural networks. The work significantly advances the understanding of scaling laws and practical challenges in large-scale model training.
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method, RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
This paper reveals that RLVR training trajectories for LLMs are surprisingly low-rank and linearly predictable, enabling a simple extrapolation method (RELEX) that drastically reduces training costs while maintaining or improving performance. The work presents a profound insight into the geometry of RLVR training, demonstrating that the majority of performance gains are captured by a rank-1 approximation of parameter deltas that evolve linearly. RELEX, a remarkably simple method requiring no learned model, leverages this insight to extrapolate training checkpoints up to 10-20 times beyond the observed window, achieving full RLVR performance with as little as 15% of the original training steps. This represents a major leap in efficiency for a dominant LLM alignment paradigm, with the potential to significantly reduce the computational cost and time associated with RL-based fine-tuning. The "denoising" explanation for RELEX's success further strengthens the paper's intellectual contribution, providing a valuable theoretical understanding alongside its substantial practical impact. This work is highly significant for its combination of a surprising fundamental observation and a highly effective, practical solution to a critical problem in LLM development.
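The rank-1 extrapolation idea can be illustrated with a short sketch. This is not the RELEX implementation: it treats each checkpoint as a flat list of floats, uses power iteration in place of an SVD to find the dominant direction of the parameter deltas, and fits the projection magnitude against the step index with ordinary least squares. All function names are hypothetical, and details such as per-layer handling or the choice of observation window are not covered.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_direction(deltas, iters=200):
    """Dominant right singular vector of the delta matrix via power iteration.
    Assumes the uniform start vector is not orthogonal to that direction."""
    dim = len(deltas[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [dot(d, v) for d in deltas]                      # A v
        w = [sum(p * d[j] for p, d in zip(proj, deltas)) for j in range(dim)]  # A^T (A v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def extrapolate_checkpoint(checkpoints, steps, target_step):
    """Predict parameters at target_step from an observed prefix.
    checkpoints[0] is the base model; later entries are observed RL checkpoints."""
    base = checkpoints[0]
    deltas = [[c - b for c, b in zip(ckpt, base)] for ckpt in checkpoints[1:]]
    v = top_direction(deltas)
    coeffs = [dot(d, v) for d in deltas]        # projection magnitude per step
    slope, intercept = fit_line(steps[1:], coeffs)
    magnitude = slope * target_step + intercept
    return [b + magnitude * x for b, x in zip(base, v)]
```

Because every predicted checkpoint lies on the line through the base model along the single direction `v`, any component of the observed updates orthogonal to `v` is discarded, which is the "denoising" effect the abstract describes.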
Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $\tau_{\text{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $\theta$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.
Primary: Unknown
All Institutions: Unknown
This paper provides a formal infrastructure for cumulative mechanistic science by characterizing neural network circuits with Causal Functional Signatures and architectural signatures learned via Inductive Logic Programming. The work presents a highly novel and technically rigorous framework that addresses a critical gap in mechanistic interpretability, demonstrating strong empirical results in distinguishing circuit types, achieving superior structural separation compared to baselines, and enabling principled transfer of knowledge across model scales and architectures, thereby laying a crucial foundation for building generalizable theories of AI.
The methodology is exceptionally well-conceived and addresses a critical gap in mechanistic interpretability research: the lack of formalization and cumulative theory building. The proposed two-level characterization of circuits is a strong point. The Causal Functional Signature (CFS) effectively grounds circuit behavior in empirical evidence by combining causal attribution (e.g., path patching) with token role profiles. This ensures that interpretations are tied to observable effects. The architectural signature ($\tau_{\text{arch}}$), learned via Inductive Logic Programming (ILP) from scale-invariant structural predicates, is the most novel and impactful component. ILP's ability to learn symbolic, human-readable Horn clauses that describe relational structures is an ideal fit for formalizing circuit architecture, offering a significant advantage over black-box structural comparison methods. The choice of `Popper` for ILP, which focuses on learning minimal and general theories, is appropriate. The concept of a "formal coherence layer" that enables explicit claims, comparability via $\theta$-subsumption, and portability across scales is a powerful theoretical contribution that directly addresses the paper's core motivation. While the reliance on hand-engineered token role profiles for CFS is a current limitation, the overall framework is robust and clearly articulated.
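For readers unfamiliar with $\theta$-subsumption, the comparison it enables can be sketched in a few lines. A clause C $\theta$-subsumes a clause D if some substitution of C's variables maps every literal of C onto a literal of D. The brute-force checker below is an illustration only: it is not Popper's API, and the predicates `copying`, `attn_head`, and `attends_prev` are hypothetical stand-ins, not the paper's structural predicates.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()

def match_literal(lit, target, theta):
    """Try to extend substitution theta so that lit maps onto target.
    Literals are (is_positive, predicate, args); terms of target are kept fixed."""
    pos, pred, args = lit
    tpos, tpred, targs = target
    if pos != tpos or pred != tpred or len(args) != len(targs):
        return None
    theta = dict(theta)
    for a, t in zip(args, targs):
        if is_var(a):
            if theta.setdefault(a, t) != t:
                return None  # variable already bound to a different term
        elif a != t:
            return None      # constants must match exactly
    return theta

def theta_subsumes(c, d):
    """True if some substitution maps every literal of clause c into clause d."""
    def search(remaining, theta):
        if not remaining:
            return True
        first, rest = remaining[0], remaining[1:]
        for target in d:
            extended = match_literal(first, target, theta)
            if extended is not None and search(rest, extended):
                return True
        return False
    return search(list(c), {})

# Hypothetical signatures: head literal is positive, body literals are negative.
general = [(True, "copying", ("C",)), (False, "attn_head", ("C", "H"))]
specific = [
    (True, "copying", ("C",)),
    (False, "attn_head", ("C", "H1")),
    (False, "attends_prev", ("H1",)),
]
```

Here `theta_subsumes(general, specific)` holds while the reverse does not, so the first signature is the more general theory; this is the sense in which two learned signatures become formally comparable.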
The experimental evaluation is thorough and effectively demonstrates the utility and advantages of the proposed framework. The use of small Transformer models on controlled, well-understood tasks (copying, binding, induction heads) is appropriate for a foundational paper, allowing for clear demonstration of the core concepts without overwhelming complexity. The CFS evaluation, using t-SNE plots, convincingly shows its ability to qualitatively distinguish between different computational strategies employed by circuits. For $\tau_{\text{arch}}$, the comparison against strong baselines, including Weisfeiler-Lehman graph kernels and hand-engineered feature vectors, is crucial. The results showing significantly higher classification accuracy (90-100% for ILP vs. 50-80% for baselines) in structural separation are very compelling and highlight the power of the ILP approach. Furthermore, the experiments on transferability across model scales and architecture families are critical for validating the claim of "cumulative science" and portability, with high accuracy achieved, indicating the robustness of the learned theories. The inclusion of examples of learned ILP theories in the appendix further illustrates the interpretability advantage. The experiments are well-designed to support the paper's claims, even if conducted on smaller-scale problems.
The paper provides a detailed methodology section and comprehensive appendices (0B_appendix, 0C_appendix) that describe the experimental setup, model architectures, training details, and circuit extraction process. The specific ILP system (`Popper`) is named, and the structural predicates used are outlined. The abstract explicitly states that "Code and supplementary materials are available at [anonymised for review]," indicating that the authors have prepared code for release, even if not publicly linked in the submitted version. Given the level of detail provided in the paper, a researcher with expertise in mechanistic interpretability and inductive logic programming should be able to reproduce the core findings.
The paper acknowledges several limitations and areas for future work. A primary limitation is the current reliance on hand-engineered token role profiles for the Causal Functional Signature (CFS), which can be labor-intensive and may not scale well to more complex tasks or larger models. Automating this process is a clear next step. The experiments are conducted on relatively small Transformer models and synthetic tasks; scaling the approach to large, production-level models and real-world tasks will present significant challenges, both in terms of circuit extraction complexity and the computational demands of ILP. While the structural predicates are designed to be scale-invariant, the sheer number of potential predicates and the complexity of circuits in very large models could make ILP learning more challenging. The paper focuses on *characterizing* circuits rather than *discovering* them, assuming circuit extraction is a pre-processing step.
This work has significant broader impact potential for the field of mechanistic interpretability and, by extension, for AI safety and trustworthy AI. By providing a formal infrastructure for cumulative mechanistic science, it moves the field beyond isolated findings towards a more systematic, theory-driven approach. This could significantly accelerate the discovery and understanding of fundamental computational mechanisms in neural networks. The ability to compare and transfer circuit knowledge across models and scales is crucial for building generalizable theories of AI. This could lead to: 1) **Accelerated MI Research:** Researchers can build upon existing formal theories rather than starting from scratch. 2) **Improved AI Safety:** A deeper, formal understanding of how models achieve their behaviors is essential for identifying and mitigating undesirable behaviors, biases, and vulnerabilities. 3) **Better Model Design:** Formal theories of computation could inform the design of more robust, interpretable, and efficient neural architectures. 4) **Educational Tools:** The symbolic theories learned by ILP could serve as pedagogical tools to teach how neural networks function. The framework's emphasis on grounding interpretations in causal evidence also promotes a more rigorous and scientific approach to interpretability.
Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting ($k=0$), CPD reaches AUROC $0.88$ and F1 $0.82$. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality.
Primary: University College London
All Institutions: University College London
This paper introduces CPD Online, a novel, training-free, and online detector for fluent optimization-based adversarial suffixes in LLMs, leveraging sequential entropy changes and a CUSUM statistic. The work significantly advances LLM safety by providing a highly practical and effective method that outperforms perplexity-based baselines in detection F1 and offers superior token-level localization, demonstrating its utility in reducing computational overhead when gating expensive safety classifiers.
The paper proposes a novel approach to detect fluent optimization-based adversarial suffixes in LLMs by casting it as an online change-point detection problem over the token-level next-token entropy stream. The core methodology involves: 1. **Entropy Stream**: Leveraging the sequence of next-token entropies emitted by the LLM as it processes a prompt. The insight is that adversarial suffixes, even when fluent, can induce a persistent shift in this entropy stream. 2. **Robust Baseline Estimation**: Using the fixed LLM system prompt to estimate a robust baseline for entropy. This is a clever and practical choice, as the system prompt is deployment-specific and stable. Median and Median Absolute Deviation (MAD) are used for robust location and scale estimation. 3. **Standardization**: User-token entropies are standardized using the estimated baseline statistics, aiming for a near-zero mean under benign conditions. 4. **One-Sided Page CUSUM Statistic**: A classical online change-point detection algorithm, CUSUM, is applied to the standardized entropy stream. A one-sided CUSUM is used to detect sustained *upward* shifts, accumulating deviations only when standardized entropies exceed a reference value `k`. This allows for online detection and localization. 5. **Localization**: The standard CUSUM backtracking rule is employed to estimate the onset of the adversarial suffix, providing token-level granularity. 6. **Hybrid Gating**: The proposed detector (CPD Online) is designed to be lightweight and training-free, enabling its use as a gate for more expensive safety classifiers like LLaMA Guard, reducing computational overhead in high-volume deployments. The methodology is sound, well-motivated by the limitations of existing perplexity-based methods against fluent attacks, and leverages established statistical tools (CUSUM) in a novel application context. The model-agnostic, training-free, and online nature are significant practical advantages.
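The pipeline described above (entropy stream, median/MAD baseline from the system prompt, standardization, one-sided CUSUM, backtracking to the last reset) can be sketched as follows. This is a minimal illustration under assumptions, not the released CPD Online code: the function names, the default threshold `h=10.0`, and the 1.4826 MAD scaling constant are choices made here, and the entropy values in the example are synthetic.

```python
import math
from statistics import median

def entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def robust_baseline(values):
    """Median and scaled MAD as robust location and scale estimates."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # 1.4826 makes MAD comparable to a standard deviation under normality
    # (an assumption here; the paper's exact scaling is not stated above).
    return med, max(1.4826 * mad, 1e-8)

def cusum_detect(system_entropies, user_entropies, k=0.0, h=10.0):
    """One-sided CUSUM over standardized user-token entropies.
    Returns (alarm_index, estimated_onset_index) or None if no alarm fires."""
    med, scale = robust_baseline(system_entropies)
    s, onset = 0.0, 0
    for t, e in enumerate(user_entropies):
        z = (e - med) / scale
        s = max(0.0, s + z - k)      # accumulate only upward deviations beyond k
        if s == 0.0:
            onset = t + 1            # statistic reset: a change can start no earlier
        if s > h:
            return t, onset          # backtrack to the token after the last reset
    return None

# Synthetic entropies: a stable system prompt, benign user tokens, then a shifted suffix.
system = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
benign = [2.0, 1.9, 2.0, 1.8]
attacked = benign + [3.0, 3.2, 3.1, 3.0]
```

On this toy stream the alarm fires at user token 5 while the estimated onset is token 4, the first token of the shifted segment, which illustrates how the detector separates "when it noticed" from "where the change began".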
The experimental evaluation is comprehensive and rigorous: 1. **Benchmark**: A robust benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts per model is constructed. The perplexity-matching of benign prompts to fluency-optimized attacks is crucial for stress-testing detectors. 2. **Models**: Evaluation is performed across six diverse open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B), demonstrating broad applicability. 3. **Baselines**: CPD is compared against strong perplexity-based baselines: global perplexity (PP) and windowed perplexity (WPP) with various window sizes, including a per-token max-NLL ($w=1$) variant. LLaMA Guard is used for hybrid deployment analysis. 4. **Metrics**: Prompt-level F1 and AUROC are used for detection performance. A detailed locality analysis (before-suffix, before+in, in-suffix alarms) is provided for localization accuracy. Guard call reduction is measured for hybrid deployment. 5. **Key Findings**: (a) **Detection**: CPD consistently improves F1 over the strongest WPP baseline across all six models, with significant margins on LLaMA-2 and Qwen2.5-14B. Global PP is shown to be ineffective due to perplexity-matched benigns. (b) **Localization**: CPD demonstrates superior localization, concentrating 79.6% of its triggers inside the adversarial suffix, significantly outperforming WPP (17-46%). This is a major practical advantage. (c) **Hybrid Gating**: When used as a gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving detection quality. 6. **Ablations and Sensitivity**: The paper includes thorough sensitivity analyses for the CUSUM slack parameter `k` and the PP-gap multiplier `alpha`, demonstrating the robustness of CPD's performance. A leave-one-attack-out (LOAO) analysis is also provided to assess generalization to unseen attack families.
The experimental setup is well-designed to validate the claims, and the results clearly demonstrate the superiority of CPD over perplexity-based methods, especially for fluent attacks and localization.
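For context on the baseline being compared against, a windowed-perplexity score can be sketched as the largest perplexity over any contiguous window of `w` tokens. The exact definition used in the paper is not given in the text above, so this formulation and the function name are assumptions; with `w=1` it reduces to a per-token max-NLL score, and with `w` equal to the prompt length it reduces to global perplexity.

```python
import math

def windowed_perplexity(nlls, w):
    """Maximum perplexity over all contiguous windows of w tokens.
    nlls: per-token negative log-likelihoods (nats)."""
    if not 1 <= w <= len(nlls):
        raise ValueError("window size must be between 1 and len(nlls)")
    window_sum = sum(nlls[:w])
    best = window_sum
    for i in range(w, len(nlls)):
        window_sum += nlls[i] - nlls[i - w]   # slide the window by one token
        best = max(best, window_sum)
    return math.exp(best / w)

# Synthetic per-token NLLs with a short high-surprise span in the middle.
example_nlls = [1.0, 1.0, 4.0, 4.0, 1.0]
```

A score of this kind flags a prompt but does not by itself say where the anomalous span begins, which is one reason the localization comparison in the evaluation matters.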
The paper provides a clear description of the methodology, including equations for entropy, standardization, and the CUSUM statistic. Details on baseline estimation (median, MAD), CUSUM parameters (`k`, `h`), and localization are explicit. The experimental setup, including data sources, attack families, benign prompt sampling (perplexity matching), model details, and evaluation protocols (5-fold CV, metrics), is well-documented. The authors state they will release the code at `https://github.com/cpdonline/cpdonline`, which is a strong commitment to reproducibility.
The authors acknowledge several limitations: 1. **Heuristic Nature**: CPD uses token-level entropy as a heuristic proxy for distributional change, not a true log-likelihood ratio, and violates classical change-point detection assumptions (independent observations, known likelihood ratios). Thus, it lacks minimax optimality guarantees. 2. **Access to Probabilities**: The method requires access to token-level probabilities (or logits), which might not be available for closed LLM APIs. 3. **Calibration**: Performance can degrade with strong distribution shifts in benign prompts, requiring recalibration. Online calibration is suggested as future work. 4. **Deployment Engineering**: Full practical deployment of a hybrid system raises further questions about intervention, alarm explanation, and disagreement handling. 5. **Threat Model**: The current threat model is restricted to optimization-based suffixes; extending to prefix attacks or indirect prompt injection is future work. These are reasonable limitations for a practical method, and the authors are transparent about them.
This work has significant positive broader impact: 1. **LLM Safety**: It directly addresses a critical and growing challenge in LLM safety: detecting sophisticated, fluent adversarial attacks that can jailbreak models. 2. **Practical Deployment**: The method's lightweight, training-free, online, and model-agnostic nature makes it highly practical for real-world LLM deployments, especially in resource-constrained environments. 3. **Cost Reduction**: By acting as a gate for expensive safety classifiers like LLaMA Guard, CPD can substantially reduce operational costs associated with safety monitoring. 4. **Enhanced Mitigation**: Token-level localization of adversarial suffixes enables more targeted and effective downstream mitigations (e.g., selective filtering, sanitization). 5. **Research Direction**: It encourages further exploration of sequential and statistical tools, particularly internal uncertainty dynamics, for LLM safety and adversarial detection. Potential negative impacts, as acknowledged by the authors, include adaptive adversaries attempting to evade entropy shifts and false positives on out-of-distribution benign prompts. The authors recommend safeguards like periodic recalibration and pairing with semantic classifiers.
Machine learning-based malware detectors are widely deployed in antivirus and endpoint detection systems, yet their reliance on static features makes them vulnerable to adversarial manipulation. This paper investigates whether a malware sample can be intentionally misclassified as a specific benign software category, not merely as "not malware", by adding a small number of Win32 API imports characteristic of that selected category, without removing any existing imports or retraining the detector. We propose a framework centered on a Conditional Variational Autoencoder (CVAE) whose decoder is strictly additive. It can introduce new API calls but never remove existing ones, preserving malware functionality by design. For each malware sample, the framework automatically identifies which benign category it most closely resembles and uses that as the evasion target. A knowledge-distilled differentiable proxy enables gradient-based training against the non-differentiable ensemble detector. Experiments on a six-class dataset of binary Win32 API import vectors extracted from 3,799 Windows executables (five benign categories, one malware class) show that, against a detector achieving 87.5% malware recall, adding just 20 API imports reduces recall to 30%. At k=20, among samples that evaded detection, 99% are classified as the intended target category. The CVAE outperforms both a frequency-based baseline and random selection at every tested injection size (k = 5 to 50). Validation on real PE files submitted to VirusTotal confirms that the attack transfers to commercial static detection engines, with an average 54.5% reduction in flagging engines. These findings expose a concrete vulnerability in API-based malware classifiers and demonstrate that targeted evasion into a chosen benign category is achievable with minimal, functionality-preserving modifications.
Primary: Vilnius University
All Institutions: Vilnius University
This paper has significant broader impact for several reasons:

1. **Enhanced Understanding of Malware Detector Vulnerabilities**: It exposes a concrete and sophisticated vulnerability in ML-based static malware detectors. The demonstration of *targeted* evasion into specific benign categories is a more advanced threat than untargeted evasion, as it could allow malware to impersonate trusted software (e.g., security tools, office applications), potentially bypassing security policies or gaining user trust.
2. **Guidance for Robust Defense Development**: The findings provide crucial insights for developers of antivirus and EDR systems. They highlight the need for more robust static analysis models that are less susceptible to feature injection, potentially by incorporating more complex feature interactions, using more resilient models, or integrating dynamic analysis more effectively.
3. **Advancement in Adversarial Machine Learning**: The methodology, particularly the CVAE with an additive decoder and the target selection strategy, offers a novel approach to generating functionality-preserving adversarial examples in a multi-class, semantic context. This can inspire similar work in other domains where additive perturbations are required.
4. **New Benchmark and Dataset**: The released multi-class dataset and the framework itself can serve as a valuable benchmark for future research in targeted adversarial malware evasion, fostering further development in this critical area.
5. **Ethical Implications**: The research contributes to the "arms race" between attackers and defenders in cybersecurity. By demonstrating advanced attack capabilities, it implicitly pushes for stronger defensive measures, ultimately aiming to improve overall cybersecurity posture.

The paper's demonstration of real-world transferability via VirusTotal is particularly impactful, moving the findings from theoretical possibility to practical concern.
This paper introduces a novel framework for targeted adversarial evasion of API-based malware classifiers, demonstrating that malware can be misclassified as a specific benign software category by injecting a small number of category-specific Win32 API imports. The work presents a technically robust methodology, centered on a Conditional Variational Autoencoder with a strictly additive decoder and knowledge distillation, and provides compelling experimental validation, including real-world transferability to commercial antivirus engines via VirusTotal, significantly advancing the understanding of vulnerabilities in ML-based malware detection.
The proposed framework for targeted evasion of malware detectors via API import injection is well-conceived and technically sound. It addresses a significant gap in adversarial machine learning for security by moving beyond untargeted "not malware" misclassification to specific benign category impersonation. The core of the methodology is a Conditional Variational Autoencoder (CVAE) with a strictly additive decoder, a crucial design choice that ensures malware functionality is preserved by only adding new API imports and never removing existing ones. This additive constraint is highly realistic for real-world attacks. The framework integrates several robust ML techniques:

1. **Ensemble-Based Detector (Ensemble A)**: A strong ensemble combining Random Forest and Logistic Regression on both raw binary features and learned embeddings (from an MLP encoder trained with ArcFace and Supervised Contrastive loss). This creates a challenging target for evasion, as ensembles are generally more robust. The use of ArcFace and SupCon loss for embedding generation is a sophisticated approach to create well-separated and compact class clusters, enhancing the detector's performance and making evasion harder.
2. **Target Selection (Ensemble B)**: A separate ensemble, trained only on benign classes, intelligently selects the most plausible benign category for each malware sample to target. This is a clever strategy, as it leverages existing similarities between malware and certain benign software to minimize the required perturbation, making the attack more efficient.
3. **Differentiable Proxy via Knowledge Distillation**: To enable gradient-based training of the CVAE against the non-differentiable ensemble detector, a differentiable MLP proxy is trained using knowledge distillation. This is a standard but effective technique, and the paper details its implementation, including the use of soft labels and temperature scaling, which is appropriate for capturing inter-class relationships.
4. **CVAE with Additive Decoder**: The CVAE architecture is specifically tailored for the problem. The encoder takes the malware sample and target class embedding, producing a latent distribution. The decoder, conditioned on the original sample, latent code, and target embedding, outputs scores for *absent* API calls. The additive constraint `x' = x + (1 - x) * s` is elegant and perfectly enforces functionality preservation.
5. **Comprehensive Loss Function**: The CVAE's training objective combines reconstruction loss (BCE on absent features against a benign reference), KL divergence (for latent space regularization), a sparsity penalty (to encourage selective additions), and a classification loss (using the differentiable proxy to guide towards the target class). This multi-objective loss function effectively balances the various requirements for generating effective, sparse, and targeted adversarial samples.
6. **Top-k Injection**: The final step of selecting only the top-k highest-scoring absent API calls simulates a realistic attacker's constraint on the number of modifications, adding to the practical relevance.

The overall methodology is well-justified, technically sound, and demonstrates a strong understanding of both adversarial ML and practical malware analysis constraints.
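The additive constraint and the top-k injection step can be illustrated with a small sketch. This is a minimal example on plain Python lists, assuming the decoder has already produced one score per feature; the helper name `inject_top_k` and the toy vectors are hypothetical and are not taken from the paper's code.

```python
def inject_top_k(x, scores, k):
    """Add the k highest-scoring absent features to the binary vector x.

    x      : list of 0/1 API import indicators
    scores : one decoder score per feature (hypothetical values here)
    Existing imports (x[i] == 1) are never removed.
    """
    # Only features that are currently absent are candidates for injection.
    absent = [i for i, xi in enumerate(x) if xi == 0]
    chosen = set(sorted(absent, key=lambda i: scores[i], reverse=True)[:k])
    s = [1 if i in chosen else 0 for i in range(len(x))]
    # Additive constraint quoted in the review: x' = x + (1 - x) * s
    return [xi + (1 - xi) * si for xi, si in zip(x, s)]


x = [1, 0, 0, 1, 0]
scores = [0.1, 0.9, 0.2, 0.8, 0.7]
x_adv = inject_top_k(x, scores, k=2)
print(x_adv)
```

Since `(1 - x)` is zero wherever an import already exists, the update can only flip zeros to ones, which is why the modified vector always dominates the original element-wise.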
The experimental evaluation is thorough and provides compelling evidence for the effectiveness of the proposed CVAE-based targeted evasion.

1. **Dataset**: The creation of a custom six-class dataset (five benign categories, one malware class) of binary Win32 API import vectors is a significant contribution, as it directly supports the multi-class, targeted evasion objective. The collection methodology (installing benign software, MalwareBazaar for malware) is reasonable, and the release of the feature vectors on Zenodo enhances reproducibility. The dataset size (3,799 samples, 2,713 features) is adequate for this type of research.
2. **Baseline Performance**: The ensemble detector achieves a strong baseline malware recall of 87.5%, making it a challenging target. The differentiable proxy's accuracy (0.846) is sufficiently close to the ensemble's (0.853) to provide reliable gradient signals.
3. **Evasion Metrics**: The use of Untargeted Evasion Rate (UER), Targeted Success Rate (TSR), and Conditional Target Success (CTS) provides a comprehensive view of attack performance, distinguishing between simply evading detection and successfully impersonating a specific benign category. The defender's perspective, `Recall_6 = 1 - UER`, is also reported, which is practical.
4. **Comparison with Baselines**: The CVAE significantly outperforms both the "MostPopular" (frequency-based) and "Random" baselines across all tested injection sizes (k=5 to 50) and all evasion metrics. For instance, at k=20, CVAE reduces malware recall to 30%, while MostPopular only reduces it to 69%. This clearly demonstrates the value of the learned, targeted approach over simpler strategies.
5. **Targeted Evasion Success**: The high CTS values (e.g., 99.33% at k=20) are particularly impressive, confirming that the CVAE not only evades detection but consistently misclassifies malware into the *intended* benign category. This is a key differentiator from prior untargeted work.
6. **Real-world Validation (VirusTotal)**: This is the strongest aspect of the experimental evaluation. Submitting modified PE files to VirusTotal and observing an average 54.5% reduction in flagging engines provides crucial real-world evidence that the attack transfers to commercial static detection engines. This bridges the gap between theoretical adversarial ML and practical security implications, which is often a weakness in the field.
7. **Visualizations**: The t-SNE projections effectively illustrate the class separation achieved by the learned embeddings and the shift of malware samples towards benign clusters after adversarial modification.

The experiments are well-designed, the results are clearly presented, and the real-world validation adds significant weight to the findings.
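The relationship between the evasion metrics can be sketched as follows. The definitions used here are inferred from the descriptions in this review (UER as the share of modified malware no longer labeled malware, TSR as the share labeled as the intended target, CTS as the targeted share among evaded samples); the paper's exact definitions may differ, and the function name, label strings, and toy predictions are hypothetical.

```python
def evasion_metrics(preds, targets, malware_label="malware"):
    """preds   : detector labels for the modified malware samples
    targets : intended benign category for each sample"""
    n = len(preds)
    evaded = [p != malware_label for p in preds]
    # A hit means the detector output equals the intended benign target.
    hit = [p == t for p, t in zip(preds, targets)]
    uer = sum(evaded) / n
    tsr = sum(hit) / n
    cts = sum(hit) / sum(evaded) if any(evaded) else 0.0
    return {"UER": uer, "TSR": tsr, "CTS": cts, "Recall_6": 1 - uer}


# Toy example with four modified samples, all targeting "office".
preds = ["office", "malware", "office", "games"]
targets = ["office", "office", "office", "office"]
metrics = evasion_metrics(preds, targets)
print(metrics)
```

Under these assumed definitions, CTS equals TSR divided by UER, so a high CTS with a moderate UER means that most samples that escape the malware label land in the chosen category rather than an arbitrary benign one.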
The paper provides a good level of detail for reproducibility:

* **Dataset**: The dataset of API import vectors is publicly released on Zenodo, which is excellent.
* **Algorithms**: Detailed algorithms for ensemble training, proxy training, and the overall CVAE framework are provided, outlining the sequential steps.
* **Architectures**: MLP encoder and proxy architectures are described in tables.
* **Loss Functions and Hyperparameters**: The composite loss function for the CVAE is clearly defined, and the use of Optuna for hyperparameter tuning is mentioned, indicating a systematic approach. Loss weights are specified.
* **Training Details**: Adam optimizer, learning rates, gradient clipping, and early stopping are mentioned.
* **Evaluation Metrics**: All metrics are precisely defined.
* **Computational Resources**: General information (NVIDIA GeForce RTX 4070 GPU, 8GB memory, 64GB RAM, PyTorch) is provided.

While the exact code is not provided, the level of detail for the dataset, algorithms, architectures, and training procedures makes the work highly reproducible for researchers in the field.
The paper acknowledges some limitations, and others can be inferred:

1. **Grey-box Threat Model**: The attacker is assumed to have knowledge of the feature representation (API imports) and access to representative training data. This is not a full black-box attack where the attacker only observes outputs. However, in many real-world scenarios, attackers can reverse-engineer features or collect data to train surrogate models.
2. **Feature Space Limitation**: The attack is limited to Win32 API import injection. Malware detectors often use a much richer set of static features (e.g., PE header information, section entropy, string analysis) and dynamic features (runtime behavior). An attacker might need to manipulate multiple feature types to evade more sophisticated detectors.
3. **Functionality Preservation Assumption**: While the additive constraint is designed to preserve functionality, the paper does not empirically verify that the modified malware samples *still execute correctly and retain their original malicious behavior*. This is a common challenge in adversarial malware research. The VirusTotal results provide some confidence that the files are still valid executables, but not necessarily that their malicious payload is intact.
4. **Dataset Size and Scope**: While the custom dataset is valuable, 3,799 samples is not massive, and the five benign categories are specific. Broader generalization might require larger and more diverse datasets.
5. **Focus on Attack, Not Defense**: The paper primarily focuses on demonstrating the vulnerability. While it highlights a weakness, it does not propose specific defense mechanisms against this targeted API injection attack.
6. **Real-world PE Modification**: The paper states that the framework operates on binary API import vectors. While the VirusTotal validation implies actual PE file modification, the details of how the CVAE's suggested API additions are translated into a valid, executable PE file are not explicitly described in the methodology section. This is a critical step for practical applicability.
As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.
AnyMo introduces a geometry-aware, setup-agnostic framework for human motion understanding from wearable IMUs, leveraging physics-grounded simulation and large language models to enable robust zero-shot activity recognition, cross-modal retrieval, and motion captioning in the wild. The paper tackles a critical and long-standing challenge in wearable sensing: the high dependence of inertial signals on sensing setup, which severely limits the generalizability and broader utility of IMUs. AnyMo's core innovation lies in its multi-faceted approach, combining physics-grounded simulation to generate diverse synthetic data that covers the "setup space," a graph encoder for robust motion tokenization, and a novel integration with large language models for open-ended motion-language understanding. This allows the model to achieve impressive zero-shot performance across numerous unseen datasets and tasks, including activity recognition, cross-modal retrieval, and motion captioning, significantly expanding the capabilities of IMU-based systems beyond closed-set recognition. The methodology is technically sound and the results demonstrate a substantial leap forward in making wearable IMUs practical for generalist human motion understanding in unconstrained environments, positioning it as a potentially influential work for future research in pervasive sensing and multi-modal AI.
Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite, which comprises three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint benchmark, CrossViewBench, to comprehensively assess the cross-view spatial understanding capability of MLLMs. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method uses an adaptive spatial region tokenizer to capture fine-grained object representations, explicitly aligns objects across views, and fuses the aligned features to strengthen the cross-view inference capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.
Primary: Zhejiang University
All Institutions: Zhejiang University
This work has a profound broader impact on the field of multimodal AI, particularly for applications requiring robust spatial understanding. By providing CrossView Suite, the authors establish a new, much-needed foundation for research into cross-view spatial intelligence for MLLMs. The dataset and benchmark will serve as critical resources for developing and evaluating future models. The CrossViewer model's paradigm of explicit object-level alignment offers a crucial conceptual and architectural advancement, moving MLLMs closer to real-world capabilities. This is particularly impactful for embodied AI, where agents must navigate and interact with environments from dynamic viewpoints, as well as for robotics, multi-agent systems, and advanced surveillance. The insights gleaned from this research, emphasizing the necessity of explicit alignment and fine-grained mask-grounded supervision, will undoubtedly guide the development of more capable, reliable, and spatially intelligent MLLMs that can operate effectively in complex, multi-perspective environments. Main contribution: This paper introduces CrossView Suite, a comprehensive framework comprising a large-scale mask-grounded instruction dataset (CrossViewSet), a systematic scene-disjoint benchmark (CrossViewBench), and a novel MLLM (CrossViewer) designed for explicit cross-view object alignment and spatial reasoning. The technical contribution is substantial, addressing critical gaps in data, evaluation, and model architecture for advancing MLLMs beyond single-view perception towards real-world multi-view spatial intelligence, with the CrossViewer model demonstrating significant empirical gains across diverse spatial reasoning tasks, underscoring the importance of explicit object-level consistency for robust multi-view understanding.
The methodology presented in "CrossView Suite" is exceptionally well-conceived and comprehensive, directly addressing the identified limitations in cross-view spatial intelligence for MLLMs. The integrated approach of developing a dataset (CrossViewSet), a benchmark (CrossViewBench), and a model (CrossViewer) is a significant strength. CrossViewSet, a 1.6M-sample mask-grounded instruction dataset with 17 fine-grained task types, is meticulously curated using a multi-agent data engine from diverse multi-view sources. This systematic generation of object-level QA supervision with precise masks is a crucial step forward. CrossViewBench provides a robust, scene-disjoint evaluation set of 17K questions, ensuring fair and comprehensive assessment across the same 17 task types. The CrossViewer model's progressive three-stage framework (Perception -> Alignment -> Reasoning) is logically sound and technically innovative. The Adaptive Region Tokenizer (ART) effectively handles the challenge of varying object scales across views. The core innovation lies in the Object-Centric Cross-View Aligner (OCVA), which explicitly establishes object-level consistency through a dual mechanism of cross-attention fusion and supervised contrastive learning. This explicit alignment mechanism is a critical departure from prior MLLMs that rely on implicit fusion. Finally, the region-guided reasoning stage effectively injects these aligned object tokens into the LLM, enabling grounded and consistent spatial reasoning. The combined objective function, incorporating VQA loss with contrastive and triplet losses, is well-balanced for the multi-faceted learning task.
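The contrastive half of the object-level alignment described above can be illustrated with a standard supervised contrastive loss, in which region tokens that depict the same object in different views are treated as positives. This is a minimal pure-Python sketch under stated assumptions: the exact loss form, temperature, and the way it is combined with cross-attention fusion and the triplet term in OCVA are not given in this review, so the function `supcon_loss`, the temperature value, and the toy embeddings are illustrative stand-ins only.

```python
import math


def supcon_loss(embeddings, object_ids, temperature=0.1):
    """Supervised contrastive loss over L2-normalised embeddings.

    Tokens sharing an object id (the same object seen from different
    views) are positives; all other tokens act as negatives.
    """
    def normalise(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and object_ids[j] == object_ids[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        log_denom = math.log(sum(math.exp(sims[j]) for j in range(n) if j != i))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy tokens: two objects, each observed in two views.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(tokens, [0, 0, 1, 1])     # ids match the geometry
misaligned = supcon_loss(tokens, [0, 1, 0, 1])  # ids contradict the geometry
print(aligned, misaligned)
```

The loss is lower when co-referent tokens are already close in embedding space, which is the property an explicit alignment objective of this kind is meant to encourage.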
The experimental evaluation is thorough, rigorous, and highly convincing. CrossViewer is benchmarked against an extensive array of 15 state-of-the-art MLLMs, including powerful proprietary models like GPT-5.2 and Qwen3.5-397B, as well as leading open-source models. CrossViewer achieves a remarkable 62.7% overall accuracy on CrossViewBench, demonstrating a substantial improvement of 20.0 points over its Qwen3-VL-8B backbone and 11.0 points over the strongest reference model, Qwen3.5-397B. This significant performance gain strongly validates the efficacy of the proposed explicit alignment and reasoning paradigm. The detailed per-task analysis reveals broad improvements, particularly in Correspondence and Visibility/Occlusion tasks, which are areas where general-purpose MLLMs typically struggle. Furthermore, the model demonstrates strong out-of-domain generalization, achieving a 19.4-point gain over the baseline on the external MMVMBench. Comprehensive ablation studies systematically dissect the contributions of ART, cross-view attention, and contrastive alignment, confirming their individual importance. Sensitivity analyses on ART token count, loss weights, and OCVA depth provide valuable insights into optimal configurations. The qualitative analysis, including t-SNE visualizations, visually confirms the improved clustering of co-referent objects in the learned embedding space, reinforcing the quantitative results.
The paper demonstrates a high commitment to reproducibility. A project page (https://github.com/Thinkirin/Crossview-Suite) is provided, indicating that code and potentially data will be made available. The implementation details are sufficiently described, including the base MLLM, specific parameters for ART (e.g., K=10 tokens), OCVA architecture (8 attention heads, d_contrast=256), training optimizer (AdamW, learning rate, schedule), and the weights for the various loss components. The data generation pipeline is outlined in detail, explaining the multi-agent approach and the conversion of raw multi-view sources into mask-grounded QA. The evaluation protocol for baselines, including the method for handling MLLMs without native region-token inputs, is also clearly articulated. These comprehensive details, coupled with the public project page, suggest that the work is highly reproducible.
While the contributions are significant, some limitations can be noted. The dataset curation, while extensive, relies on existing multi-view sources, meaning the inherent biases or specific characteristics of these original datasets are carried over. Although a multi-agent data engine is used, the generation process is primarily rule-based and tool-augmented, which might still introduce subtle biases compared to purely human-annotated data, despite human verification on a subset. The model's performance on "Geometric reasoning" tasks, while strong, is sometimes slightly surpassed by larger frontier models, suggesting that holistic scene understanding and global spatial abstraction might require further research beyond object-level alignment. The computational resources required for training such a large-scale MLLM with a comprehensive suite are likely substantial, which could be a barrier for some researchers, though this is not explicitly discussed. Finally, the current framework primarily addresses static multi-view images; extending it to dynamic video streams with temporal consistency and object tracking would be a natural and challenging next step.
Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoiding. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.
Primary: Beihang University
All Institutions: Beihang University, National University of Singapore, Hangzhou Innovation Institute, Beihang University
This paper introduces ManiSoft, a novel and comprehensive benchmark for vision-language manipulation in soft continuum robotics, featuring a tailored simulator, diverse tasks, and a large-scale expert trajectory dataset to accelerate research in this challenging domain. ManiSoft addresses a critical gap in robotics research by focusing on soft continuum arms, which offer superior adaptability but present unique control challenges compared to rigid robots. The paper's strength lies in its multi-faceted contribution: a custom simulator accurately modeling soft-body dynamics and contact, a set of four tasks designed to test deformable control, and an automated pipeline generating 6,300 diverse scenes with expert trajectories using a hierarchical planner-RL approach. The benchmarking of existing models reveals key limitations, particularly in visual proprioception and leveraging deformability, providing clear directions for future work. The commitment to open-sourcing code and datasets further enhances its potential impact, positioning ManiSoft as a foundational testbed for advancing vision-language control and embodied intelligence in soft robotics.
The paper introduces ManiSoft, a benchmark for vision-language manipulation with soft continuum robots, addressing a less explored but highly promising area. The core methodology involves developing a tailored simulator that accurately models soft-body dynamics and contact interactions using an elastic force constraint, which is a significant technical undertaking. Four distinct tasks are defined to probe various aspects of deformable control, from basic end-effector coordination to obstacle avoidance, demonstrating a thoughtful design for a comprehensive benchmark. A key methodological contribution is the automated pipeline for generating a large-scale dataset (6,300 diverse scenes and corresponding expert trajectories). This pipeline employs a two-stage approach: a high-level planner for decomposing tasks into waypoints and a low-level reinforcement learning policy for generating precise torque commands to track these waypoints. This hierarchical approach is a robust and scalable strategy for creating high-quality, diverse expert data, which is crucial for training and evaluating complex robotic policies.
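The two-stage structure of the trajectory pipeline (waypoint decomposition followed by waypoint tracking) can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: a point in the plane replaces the soft arm and its simulator, straight-line interpolation replaces the high-level planner, and a simple proportional controller replaces the learned low-level RL policy that outputs torque commands. All function names and parameters are hypothetical.

```python
def plan_waypoints(start, goal, n):
    """High-level planner stand-in: n evenly spaced waypoints ending at goal."""
    return [tuple(s + (g - s) * (i / n) for s, g in zip(start, goal))
            for i in range(1, n + 1)]


def track(position, waypoint, gain=0.5, tol=1e-2, max_steps=100):
    """Low-level stand-in: proportional steps toward one waypoint.

    Returns the final position and the (state, command) pairs recorded
    along the way, mimicking an expert trajectory segment.
    """
    segment = []
    for _ in range(max_steps):
        error = [w - p for w, p in zip(waypoint, position)]
        if max(abs(e) for e in error) < tol:
            break
        command = tuple(gain * e for e in error)
        position = tuple(p + c for p, c in zip(position, command))
        segment.append((position, command))
    return position, segment


def generate_expert_trajectory(start, goal, n_waypoints=4):
    """Chain the tracker over the planner's waypoints."""
    position, trajectory = start, []
    for waypoint in plan_waypoints(start, goal, n_waypoints):
        position, segment = track(position, waypoint)
        trajectory.extend(segment)
    return position, trajectory


goal = (1.0, 2.0)
final, trajectory = generate_expert_trajectory((0.0, 0.0), goal)
print(final, len(trajectory))
```

Splitting the problem this way keeps each tracking sub-problem short, which is one plausible reason a hierarchical design scales to thousands of scenes; the actual planner and policy in ManiSoft are considerably more involved than this toy.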
The abstract indicates that three representative policy models were benchmarked on ManiSoft. The results show "relatively promising results in clean scenes but substantial performance drop under randomization," which effectively highlights the current challenges and the value of the benchmark in exposing these difficulties. Visualization analysis is mentioned as leading to insights that failures primarily stem from "inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoiding." While specific quantitative results are not available in the abstract, this qualitative summary points to a rigorous evaluation framework that successfully identifies key limitations and provides concrete directions for future research in soft robotics control. The scale of 6,300 generated scenes and trajectories suggests a robust and diverse test environment.
The paper explicitly states, "Our codes and datasets are released at https://buaa-colalab.github.io/ManiSoft." This commitment to open-sourcing the benchmark, simulator, and data is a strong indicator of high reproducibility. It allows other researchers to validate findings, build upon this work, and compare new methods against a standardized baseline, significantly contributing to the research community.
The abstract clearly identifies two primary limitations based on their benchmarking: (1) "inaccurate visual estimation of proprioceptive state" and (2) "limited exploitation of deformability for adaptive obstacle avoiding." Additionally, the "substantial performance drop under randomization" suggests that current methods struggle with generalization and robustness in more realistic, varied environments. These identified failure modes provide clear avenues for future research. The inherent sim-to-real gap, while not explicitly stated, is a common challenge for simulator-based robotics research that would need to be addressed in future work.
ManiSoft has significant broader impact potential. By providing a standardized, realistic, and large-scale benchmark for soft continuum robotics, it can accelerate research in a field currently lagging behind rigid robotics due to unique challenges in modeling and control. It can foster the development of novel vision-language models, control policies, and simulation techniques specifically tailored for deformable bodies. This could lead to advancements in applications requiring highly adaptive manipulation in cluttered or confined spaces, such as medical robotics, inspection, and delicate material handling. The benchmark bridges a critical gap, enabling systematic progress in an important and emerging area of embodied AI.
Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77$\times$ speedup on language-only workloads and up to 2.77$\times$ on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84$\times$ while preserving training correctness.
Primary: Tsinghua University
All Institutions: Tsinghua University, Scitix AI
This paper introduces Runtime-Readiness-First Pipeline (RRFP), a novel runtime system for pipeline-parallel training that treats schedules as non-binding hints to dynamically dispatch ready work, significantly improving efficiency under runtime variability. The technical contribution is substantial, presenting a well-designed system with message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, all rigorously evaluated across diverse workloads and scales, demonstrating up to 2.77x speedup over fixed-order baselines and outperforming state-of-the-art external systems.
The paper presents a highly relevant and well-conceived methodology to address a critical limitation in pipeline-parallel training: the fragility of pre-committed execution orders under runtime variability. The core innovation of RRFP lies in its "readiness-driven" approach, where schedules are treated as non-binding hints for ranking currently ready work, rather than strict sequences to be followed. This conceptual shift is supported by three robust mechanisms: (1) message-driven asynchronous communication, which decouples data transfer from computation and correctly handles out-of-order tensor arrivals; (2) lightweight tensor-parallel coordination, which ensures collective consistency across TP ranks without enforcing a global pipeline order; and (3) ready-set arbitration, a low-overhead dispatch layer that efficiently selects tasks from the ready set based on the hint order. The design is modular, allowing existing scheduling strategies to be integrated as hints. The analytical characterization, while simplified, provides a useful theoretical foundation for understanding the behavior of the BF hint and its proximity to optimal performance under certain conditions. This holistic approach demonstrates a deep understanding of the practical challenges in distributed ML systems.
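To make the contrast between a pre-committed order and a non-binding hint concrete, here is a minimal single-stage sketch. It is an illustration under stated assumptions, not RRFP's implementation: the function names, the task labels, and the `ready_at` and `duration` tables are hypothetical, and the toy model ignores communication cost, tensor-parallel coordination, and buffer limits. Both dispatchers receive the same hint; the first must follow it strictly, while the second uses it only to rank whatever is currently ready.

```python
def run_fixed_order(hint, ready_at, duration):
    """Execute tasks strictly in hint order, idling until each one is ready."""
    clock = 0.0
    order = []
    for task in hint:
        # The stage blocks here if the next scheduled task has not arrived yet.
        clock = max(clock, ready_at[task]) + duration[task]
        order.append(task)
    return order, clock


def run_readiness_first(hint, ready_at, duration):
    """Pick, among currently ready tasks, the one ranked earliest by the hint."""
    rank = {task: i for i, task in enumerate(hint)}
    pending = set(hint)
    clock = 0.0
    order = []
    while pending:
        ready = [t for t in pending if ready_at[t] <= clock]
        if not ready:
            # Nothing is executable: idle only until the next arrival.
            clock = min(ready_at[t] for t in pending)
            continue
        task = min(ready, key=rank.__getitem__)
        pending.remove(task)
        clock += duration[task]
        order.append(task)
    return order, clock


# Hypothetical stage: the input for F1 arrives late (t=5), the rest arrive early.
hint = ["F0", "F1", "B0", "F2"]
ready_at = {"F0": 0, "F1": 5, "B0": 1, "F2": 1}
duration = {task: 1 for task in hint}

fixed_order, fixed_time = run_fixed_order(hint, ready_at, duration)
rrf_order, rrf_time = run_readiness_first(hint, ready_at, duration)
print(fixed_order, fixed_time)  # strict order waits for F1 before doing anything else
print(rrf_order, rrf_time)      # ready work (B0, F2) fills the gap while F1 is in flight
```

In this toy case the strict dispatcher finishes at time 8 and the readiness-first dispatcher at time 6, because the latter runs B0 and F2 while F1 is still in flight. Data dependencies are folded into `ready_at` here; a real runtime would derive readiness from message arrivals.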
The experimental evaluation is exceptionally comprehensive and rigorous, providing strong evidence for RRFP's effectiveness. The authors evaluate RRFP across a diverse range of workloads, including both language-only (GPT3-Large) and heterogeneous multimodal models (Qwen3, LLaMA3 with various ViT sizes), which are particularly prone to runtime variability. The experiments are conducted at significant scale, up to 128 GPUs, demonstrating practical applicability. Crucially, RRFP is compared against strong baselines, including both same-codebase fixed-order methods (1F1B, ZeroBubble) and leading external distributed training frameworks (DeepSpeed, Cornstarch). The results consistently show substantial speedups, up to 1.77x on language-only and 2.77x on multimodal workloads over 1F1B, and up to 1.84x over the faster external system. The detailed runtime breakdown analysis (RQ2) is particularly impactful, clearly demonstrating that RRFP's gains primarily stem from a significant reduction in blocking time, directly validating the paper's central hypothesis. Further experiments on robustness to injected jitter (RQ4), sensitivity to different hint orders (RQ5), and scaling across pipeline depth, modality imbalance, and global batch size (RQ6) provide compelling evidence for RRFP's reliability, flexibility, and broad applicability.
The paper provides a solid basis for reproducibility. It clearly states that RRFP is implemented as an extension to a Megatron-based training framework and utilizes a C++ communication backend. The experimental setup is meticulously detailed, including specific model architectures, parallel configurations (TP/PP/DP), global batch sizes, hardware specifications, and the number of runs and measured iterations. Key configurable parameters, such as the buffer-size limit, are discussed with sensitivity analyses. The hint algorithms (BF, BFW) are described, with further details for BF provided in the appendix. The authors also validate training correctness by comparing loss trends with baselines under matched seeds. While the full source code is not included in the paper text, the level of detail provided should enable experienced systems researchers to reproduce the core findings.
While the paper is outstanding, a few minor limitations can be noted. The analytical characterization, though helpful, is simplified by ignoring factors like communication time and tensor-parallel coordination, which are present in the full RRFP runtime. While RRFP is shown to be robust to various hint orders, it still relies on an *external* hint; the paper does not propose novel hint generation strategies, focusing solely on runtime consumption. The overhead of the lightweight tensor-parallel coordination, while shown to be small in the evaluated settings, might become more significant in scenarios with extremely high tensor parallelism or very small microbatches. Finally, while the buffer-size limit is analyzed, the memory footprint of these buffers could be a consideration in extremely memory-constrained environments, though the chosen default seems reasonable for the evaluated models.
This work has a profound broader impact on the field of large-scale distributed machine learning. By providing a robust and efficient solution to the pervasive problem of runtime variability in pipeline parallelism, RRFP enables more effective training of increasingly large and complex models, especially heterogeneous multimodal architectures. The readiness-driven execution paradigm represents a significant conceptual advancement in runtime system design for distributed ML, potentially influencing future frameworks to adopt more adaptive, runtime-aware execution strategies over rigid, pre-determined schedules. This could lead to substantial improvements in GPU utilization, reduced training times, and unlock the ability to train even larger models that are currently bottlenecked by pipeline inefficiencies. The insights into handling out-of-order communication and maintaining collective consistency in dynamic environments are also valuable contributions to general distributed systems research in ML.