Last 7 Days (March 31 – April 06, 2026)
All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K\%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1\% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural-language text. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.
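The reframing of membership inference as sequence classification over per-token distributional statistics can be illustrated with the kind of feature extraction such a classifier might consume. This is a minimal sketch under assumptions: the function name and the specific feature set (target log-probability, predictive entropy, target rank) are illustrative, not the paper's implementation.

```python
import numpy as np

def per_token_stats(probs: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Per-token distributional statistics of the kind a learned MIA
    classifier could take as a feature sequence.
    `probs` is a (seq_len, vocab) matrix of next-token probabilities."""
    target_p = probs[np.arange(len(token_ids)), token_ids]
    log_p = np.log(target_p)                                  # log-likelihood of each observed token
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # predictive entropy per position
    # Rank of the observed token among all vocabulary entries (0 = most likely).
    rank = np.sum(probs > target_p[:, None], axis=1)
    return np.stack([log_p, entropy, rank.astype(float)], axis=1)

# Toy example: 3 positions, vocabulary of 4.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.05, 0.05, 0.80, 0.10]])
tokens = np.array([0, 3, 2])
feats = per_token_stats(probs, tokens)  # shape (3, 3) feature sequence
```

A sequence classifier (e.g., a small transformer or GRU) trained on such feature sequences, with membership labels known by construction, is the general shape of the learned attack the abstract describes.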
The paper introduces a learned, transferable membership inference attack that eliminates the shadow model bottleneck and reveals an architecture-agnostic signature of memorization across diverse autoregressive models. This work represents a meaningful methodological shift in ML security and model evaluation by reframing membership inference from heuristic thresholding to a learned sequence classification task over per-token distributional statistics. The core insight—that fine-tuning inherently generates unlimited labeled data for detector training—elegantly removes the traditional shadow model constraint, enabling scalable and data-efficient attack development. More importantly, the discovery that this memorization signature transfers zero-shot across fundamentally different computational paradigms (transformers, state-space models, linear attention, and gated recurrence) suggests a fundamental property of gradient-based optimization on cross-entropy loss rather than an architecture-specific artifact. While the paper's primary domain is privacy auditing and red-teaming rather than core capability scaling or architectural design, its implications extend to how the field evaluates data leakage, designs unlearning procedures, and understands training dynamics. The strong empirical performance, particularly at low false-positive rates, and the open release of code will likely establish this approach as a new standard baseline for memorization detection, though its impact remains concentrated within the security, evaluation, and training dynamics subfields rather than constituting a field-wide paradigm shift.
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations. The thesis does not fully resolve any of these problems but makes each tractable and measurable.
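The claim that attack success follows power-law scaling lends itself to a simple forecasting sketch: fit $-\log(\text{ASR})$ against the number of sampled augmentations $N$ in log-log space, then extrapolate. The fitting rule and synthetic numbers below are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def fit_power_law(n_samples, asr):
    """Fit -log(ASR) ~ a * N^(-b) in log-log space; returns (a, b).
    Illustrative assumption: a plain least-squares fit via polyfit."""
    x = np.log(np.asarray(n_samples, dtype=float))
    y = np.log(-np.log(np.asarray(asr, dtype=float)))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope

def forecast_asr(n, a, b):
    """Extrapolate attack success rate at a larger sampling budget."""
    return np.exp(-a * n ** (-b))

# Synthetic points lying exactly on -log(ASR) = 2 * N^(-0.5).
ns = np.array([10, 100, 1000])
asr = np.exp(-2.0 * ns ** (-0.5))
a, b = fit_power_law(ns, asr)
```

This kind of extrapolation is what makes "quantitative forecasting of adversarial robustness" possible: measured success rates at small budgets predict success at budgets too expensive to run.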
This thesis introduces a cohesive suite of scalable methodologies for mechanistic interpretability, latent-space adversarial training, and agentic safety evaluation, substantially advancing the empirical toolkit for AI alignment research. The work is significant for systematically addressing four critical bottlenecks in safety engineering: automating transformer circuit discovery to democratize interpretability workflows, proposing residual-stream perturbation training that efficiently neutralizes embedded risks without prohibitive compute, establishing power-law scaling relationships for jailbreak success to enable quantitative robustness forecasting, and providing rigorous empirical baselines for autonomous misalignment in frontier models. While the contributions do not constitute a single paradigm-shifting breakthrough, their collective impact lies in transforming previously intractable safety problems into measurable, reproducible engineering tasks. Compared to foundational alignment work that often relies on preference optimization or constitutional prompting, this thesis grounds safety in mechanistic transparency and adversarial stress-testing, offering practical protocols that will likely be integrated into standard red-teaming and model evaluation pipelines across both academic and industry labs.
When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
Primary: Not specified in provided text
All Institutions: Not specified in provided text
NearID introduces a context-controlled distractor framework and a two-tier contrastive objective that effectively disentangles object identity from background context, establishing a rigorous, computationally efficient benchmark and training protocol for identity-preserving vision representations. The paper's systematic ablation of loss components, thorough comparison against scaled VLMs, and transparent evaluation methodology provide a highly actionable contribution to representation learning and generative AI evaluation, though its scope remains focused on concept preservation rather than broader editing or zero-shot generalization.
The paper introduces a principled and highly targeted solution to a well-documented failure mode in vision encoders: background-context entanglement during identity-focused tasks. By constructing "near-identity distractors" that share the exact background but differ in foreground instance, the authors force the model to learn truly identity-invariant features. The proposed two-tier contrastive objective ($L_{NearID}$) is elegantly designed, combining a symmetric multi-positive InfoNCE term for discrimination with a softplus ranking regularizer to preserve graded semantic structure. The architectural choice to freeze a strong backbone (SigLIP2) and train only a lightweight MAP head (15M params) is pragmatic, ensuring computational efficiency while avoiding catastrophic forgetting. The methodology avoids unnecessary complexity and directly targets the evaluation bottleneck in personalized generation.
The experimental design is exceptionally rigorous. The authors conduct comprehensive ablations across loss components, hyperparameters ($\alpha$ for ranking, $\beta$ for cohesion), data composition, and inpainting engine diversity. The foreground-masking experiments effectively isolate background dependence, revealing that frozen encoders rely heavily on contextual shortcuts (+34-43% SSR gain upon masking), whereas NearID remains inherently background-invariant. The comparison against scaled VLMs (Qwen3-VL at 4B/8B/30B) is particularly insightful, demonstrating that even large multimodal models struggle with matched-context identity discrimination and suffer from inconsistent oracle alignment. The correlation analysis with DreamBench++ human judgments and oracle scores provides strong external validation. Computational cost reporting (6.5 A100-hours for training vs. 54+ hours for VLM evaluation) further underscores the practical advantage of the proposed embedding-based approach.
Excellent. The supplementary material provides exhaustive implementation details: exact denoising steps, CFG scales, scheduler choices, and inpainting strengths for all four generation engines; precise training hyperparameters (batch size, steps, epochs, mixed precision); full mathematical formulations of loss variants; and explicit evaluation protocols (Fisher z-transformation for correlation aggregation, SSR/PA definitions). The dataset construction pipeline is transparently documented, and the project page is provided. The clear separation of training data sources and the step-count normalization in ablations ensure that results are directly comparable and reproducible.
The benchmark is narrowly scoped to concept-preservation evaluation and does not address the identity-vs-edit-intent trade-off inherent in text-guided image editing, as acknowledged by the authors. The distractors are synthetically generated via inpainting, which may not fully capture the distributional complexity of real-world identity variations or natural scene occlusions. The method requires task-specific fine-tuning of a lightweight head rather than offering a zero-shot drop-in replacement, limiting immediate plug-and-play utility for general-purpose vision pipelines. Additionally, while the ranking regularizer successfully balances discrimination and alignment, the optimal $\alpha$ trade-off remains dataset-dependent and requires careful tuning.
This work directly addresses a critical evaluation gap in personalized generative AI, where inflated metrics from background shortcuts have historically obscured true identity fidelity. By providing a standardized, context-controlled benchmark and a computationally efficient training recipe, NearID will likely become a reference standard for evaluating subject-driven generation, retrieval, and editing systems. The framework's emphasis on isolating semantic signals from contextual confounders has broader applicability to video instance tracking, 3D asset retrieval, and multimodal alignment tasks where background bias degrades representation quality.
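The two-tier hierarchy the review describes (same identity > NearID distractor > random negative) can be sketched as a pair of softplus margin terms over similarity scores. This is a sketch of the ranking-regularizer idea only: the paper's $L_{NearID}$ additionally includes a symmetric multi-positive InfoNCE term, and the function names and margin value here are assumptions.

```python
import numpy as np

def softplus(z):
    """Smooth hinge: log(1 + exp(z))."""
    return np.log1p(np.exp(z))

def two_tier_ranking_loss(sim_pos, sim_near, sim_rand, margin=0.1):
    """Penalize violations of the similarity hierarchy
    same identity > NearID distractor > random negative,
    with one softplus margin term per adjacent pair in the hierarchy."""
    return (softplus(margin - (sim_pos - sim_near))
            + softplus(margin - (sim_near - sim_rand)))

# A correctly ordered triple incurs less loss than a violated one.
good = two_tier_ranking_loss(sim_pos=0.9, sim_near=0.5, sim_rand=0.1)
bad = two_tier_ranking_loss(sim_pos=0.5, sim_near=0.9, sim_rand=0.1)
```

The softplus (rather than a hard hinge) is what preserves graded semantic structure: well-ordered triples still contribute a small, smoothly decaying gradient instead of being clipped to zero.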
Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google's Gemini~3 Deep Think, attained 8th place, and even that was not achieved under live competition conditions. In this work, we introduce GrandCode, a multi-agent RL system designed for competitive programming. The capability of GrandCode is attributed to two key factors: (1) it orchestrates a variety of agentic modules (hypothesis proposal, solver, test generator, summarization, etc.) and jointly improves them through post-training and online test-time RL; (2) we introduce Agentic GRPO, designed specifically for multi-stage agent rollouts with delayed rewards and the severe off-policy drift that is prevalent in agentic RL. GrandCode is the first AI system that consistently beats all human participants in live competitive programming contests: in the three most recent Codeforces live competitions, i.e., Round~1087 (Mar 21, 2026), Round~1088 (Mar 28, 2026), and Round~1089 (Mar 29, 2026), GrandCode placed first in all of them, beating all human participants, including legendary grandmasters. GrandCode shows that AI systems have reached a point where they surpass the strongest human programmers on the most competitive coding tasks.
The paper introduces Agentic GRPO, a reinforcement learning framework that jointly optimizes multi-stage reasoning agents to handle delayed rewards and off-policy drift, achieving consistent grandmaster-level performance in live competitive programming contests. This work addresses a critical bottleneck in the current wave of test-time compute scaling and agentic reasoning: credit assignment across complex, multi-step agent rollouts. By adapting group-relative policy optimization to a multi-agent architecture with explicit mechanisms for mitigating distribution shift and delayed feedback, the authors provide a practical and theoretically grounded pathway for training coordinated reasoning systems. The empirical demonstration of surpassing human grandmasters in live, high-stakes coding environments marks a significant milestone for AI reasoning capabilities, moving beyond static benchmarks to dynamic, adversarial settings. While the core algorithm builds upon recent advances in preference optimization and RLVR, its extension to orchestrated multi-agent pipelines with online test-time adaptation offers a reusable blueprint for complex decision-making tasks. The contribution is highly relevant to the broader ML community, particularly those working on reasoning, agentic systems, and reinforcement learning for language models, though its immediate impact remains somewhat anchored to the coding and reasoning subdomains rather than constituting a foundational shift across all of machine learning.
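The group-relative baseline at the heart of GRPO can be sketched in a few lines: each rollout's advantage is its reward standardized within the group sampled for the same prompt. This is only the vanilla formulation; per the abstract, Agentic GRPO extends it to multi-stage rollouts with delayed rewards and off-policy drift, which this sketch does not capture.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Vanilla GRPO credit assignment: standardize each rollout's reward
    against the mean and std of its own sampled group, so no separate
    value network is needed to form a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one problem: two solved (reward 1), two failed (reward 0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts get positive advantages and failed ones negative, summing to roughly zero within the group; the policy gradient then upweights tokens from the successful trajectories.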
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8\%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
The paper introduces Self-Guide, a unified internal reward mechanism that simultaneously steers inference-time actions and provides step-level supervision for policy optimization, enabling a co-evolutionary loop between agent policy and self-generated rewards. The work addresses a persistent bottleneck in long-horizon LLM agent training by replacing reliance on sparse environmental feedback with a dense, self-generated signal that serves both decoding guidance and RL training. While intrinsic motivation and self-critique are well-established in classical RL and recent LLM alignment research, the explicit coupling of inference-time steering with GRPO-based policy updates in a single co-evolutionary framework is a clean and practically useful formulation. The empirical results demonstrate consistent improvements over environment-reward-only baselines, validating the utility of internal reward generation for credit assignment. However, the approach builds incrementally on existing paradigms in self-reflection, process supervision, and intrinsic motivation rather than introducing a fundamentally new learning paradigm. The methodology is sound and likely to be adopted in agent training pipelines, but it does not yet reach the threshold of field-wide significance required to surpass the highest scoring tiers, as it refines rather than redefines how language agents learn from interaction.
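The core mechanism of converting a self-generated guidance signal into denser step-level rewards can be sketched as reward shaping over a sparse terminal return. Everything here is an illustrative assumption: the blending rule, the weight, and the function name are not Self-Guide's exact formulation.

```python
def shaped_step_rewards(env_return, internal_scores, weight=0.5):
    """Blend a sparse terminal environment return with self-generated
    step-level scores to produce a denser training signal.
    `internal_scores` holds one self-assessed score per step; the
    linear blend and `weight` are illustrative assumptions."""
    rewards = [weight * s for s in internal_scores]
    rewards[-1] += env_return  # environment reward arrives only at episode end
    return rewards

# A 3-step episode that succeeded (terminal return 1.0).
r = shaped_step_rewards(1.0, [0.2, 0.8, 0.6])
```

The co-evolving loop the abstract describes then closes over this signal: the same internal scores steer action selection at inference time and serve as the step-level reward when the policy is optimized with GRPO.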
Suppose we observe data from a distribution $P$ and we wish to test the composite null hypothesis that $P\in\mathscr P$ against a composite alternative $P\in \mathscr Q\subseteq \mathscr P^c$. Herbert Robbins and coauthors pointed out around 1970 that, while no batch test can have a level $\alpha\in(0,1)$ and power equal to one, sequential tests can be constructed with this fantastic property. Since then, and especially in the last decade, a plethora of sequential tests have been developed for a wide variety of settings. However, the literature has not yet provided a clean and general answer as to when such power-one sequential tests exist. This paper provides a remarkably general sufficient condition (that we also prove is not necessary). Focusing on i.i.d. laws in Polish spaces without any further restriction, we show that there exists a level-$\alpha$ sequential test for any weakly compact $\mathscr P$, that is power-one against $\mathscr P^c$ (or any subset thereof). We show how to aggregate such tests into an $e$-process for $\mathscr P$ that increases to infinity under $\mathscr P^c$. We conclude by building an $e$-process that is asymptotically relatively growth rate optimal against $\mathscr P^c$, an extremely powerful result.
The paper establishes a general sufficient condition for the existence of power-one sequential tests under weakly compact null hypotheses and constructs asymptotically optimal e-processes. This work elegantly resolves a decades-old open question in sequential analysis by providing a clean, measure-theoretic characterization that bridges classical hypothesis testing with modern anytime-valid inference frameworks. Its primary merit lies in delivering rigorous statistical guarantees that will underpin future research in online learning, adaptive experimentation, and continuous model monitoring. However, the highly abstract mathematical formulation and focus on existence conditions rather than practical algorithmic design limit its immediate applicability to mainstream machine learning practice. Unlike paradigm-shifting empirical works that redefine model architectures, training dynamics, or scaling behaviors, this contribution operates at the foundational statistical layer, making it highly valuable for theorists and sequential decision-making researchers while remaining specialized relative to the broader deep learning and foundation model ecosystem.
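The flavor of a power-one sequential test built from an $e$-process can be conveyed by a textbook scalar example, far simpler than the paper's Polish-space construction. Testing $H_0$: a coin is fair, the betting wealth process below has expectation 1 under the null (a valid $e$-process), so rejecting once wealth exceeds $1/\alpha$ controls the type-I error at $\alpha$ by Ville's inequality, while under any biased coin the wealth grows geometrically, giving power one. The betting fraction `lam` is an arbitrary illustrative choice.

```python
import numpy as np

def betting_eprocess(xs, lam=0.5):
    """Test martingale for H0: P(x = 1) = 1/2 on binary outcomes.
    Each round, wealth multiplies by 1 + lam*(2x - 1), i.e. 1.5 on
    heads and 0.5 on tails; under H0 the expected multiplier is 1."""
    xs = np.asarray(xs, dtype=float)
    return np.cumprod(1.0 + lam * (2.0 * xs - 1.0))

# A coin that lands heads 90% of the time: wealth climbs past any
# rejection threshold 1/alpha, so the sequential test has power one.
biased = np.tile([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], 20)  # 180 heads, 20 tails
wealth = betting_eprocess(biased)
```

The paper's contribution is establishing when such constructions exist in full generality: for any weakly compact null over i.i.d. laws on Polish spaces, not just parametric toys like this one.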
Large Language Model (LLM) multi-agent systems are increasingly deployed as interacting agent societies, yet scaling these systems often yields diminishing or unstable returns, the causes of which remain poorly understood. We present the first large-scale empirical study of coordination dynamics in LLM-based multi-agent systems, introducing an atomic event-level formulation that reconstructs reasoning as cascades of coordination. Analyzing over 1.5 million interactions across tasks, topologies, and scales, we uncover three coupled laws: coordination follows heavy-tailed cascades, concentrates via preferential attachment into intellectual elites, and produces increasingly frequent extreme events as system size grows. We show that these effects are coupled through a single structural mechanism: an integration bottleneck, in which coordination expansion scales with system size while consolidation does not, producing large but weakly integrated reasoning processes. To test this mechanism, we introduce Deficit-Triggered Integration (DTI), which selectively increases integration under imbalance. DTI improves performance precisely where coordination fails, without suppressing large-scale reasoning. Together, our results establish quantitative laws of collective cognition and identify coordination structure as a fundamental, previously unmeasured axis for understanding and improving scalable multi-agent intelligence.
The paper establishes empirical scaling laws for coordination in LLM multi-agent systems, identifies an integration bottleneck as the root cause of diminishing returns, and proposes a targeted intervention to restore collective reasoning efficiency. By reframing multi-agent reasoning as cascades of coordination events and applying complex systems theory to LLM interactions, the work provides a principled diagnostic framework for a domain that has largely relied on heuristic prompting and ad-hoc architectural tweaks. The discovery of heavy-tailed coordination cascades and preferential attachment toward "intellectual elites" offers a compelling mechanistic explanation for why naively scaling agent counts often degrades performance, directly addressing a critical pain point in the field. The proposed Deficit-Triggered Integration method demonstrates that structural interventions can recover lost performance without suppressing large-scale reasoning, suggesting a new design paradigm for scalable agent orchestration. While the empirical foundation is robust and the conceptual framing is highly original, the work remains early-stage and will require broader validation across diverse model families, task domains, and communication protocols before its laws and interventions become standard practice. Compared to foundational scaling laws or alignment breakthroughs, this contribution operates at the architectural and systems level rather than the model training level, but it successfully carves out coordination structure as a fundamental, measurable axis for multi-agent research, positioning it as a strong, field-advancing study that will likely shape how researchers design, evaluate, and scale collaborative LLM systems.
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7\% for ALFWorld and +6.6\% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
Primary: Tsinghua University
All Institutions: Tsinghua University, Alibaba Group, Tongyi Lab
SKILL0 proposes a dynamic curriculum-based reinforcement learning framework that progressively withdraws external skill context during training to internalize procedural knowledge directly into model parameters. The methodology offers a principled alternative to inference-time skill retrieval, demonstrating strong empirical gains and significant token efficiency across two agentic benchmarks, though its reliance on visual context rendering and offline skill curation limits immediate broad adoption and generalization to open-ended domains.
The paper introduces SKILL0, an In-Context Reinforcement Learning (ICRL) framework that systematically transfers skill knowledge from external context to model parameters via a dynamic, helpfulness-driven curriculum. The core methodology is well-motivated: starting with full skill scaffolding and progressively withdrawing it based on on-policy utility directly addresses the "crutch" problem where agents merely follow prompts rather than learning behaviors. The composite reward balancing task success and compression efficiency is pragmatic, and the theoretical bounds on distribution shift during curriculum stages provide useful stability guarantees. However, the reliance on visual context rendering to compress token overhead introduces a non-trivial dependency on vision-language encoders and rendering pipelines, which complicates the method's applicability to purely textual or non-VLM agent stacks. The dynamic curriculum's greedy selection heuristic is empirically effective but rests on a locally additive utility assumption that may not hold in highly non-Markovian, interactive environments.
The empirical evaluation is rigorous and well-structured, covering two distinct domains (ALFWorld for embodied text-based tasks and Search-QA for multi-hop retrieval). SKILL0 demonstrates consistent gains over strong RL baselines (+9.7% and +6.6%) while drastically reducing per-step token overhead to <0.5k. The ablation studies are thorough, validating the necessity of the linear budget decay, the three-step helpfulness filter/rank/select mechanism, and the validation interval trade-off. Training dynamics clearly exhibit the predicted rise-then-fall helpfulness trajectory, providing strong empirical evidence for successful skill internalization. The main limitation of the evaluation is its narrow scope: performance is only reported on two curated benchmarks, leaving open questions about generalization to open-ended, long-horizon, or highly stochastic environments (e.g., code generation, GUI navigation, or web automation).
High. The paper provides clear implementation details, including backbone models (Qwen2.5-VL-3B/7B), hardware configuration (4× H800 GPUs), training steps (180), batch sizes, curriculum stages, and precise rendering parameters (font sizes, color coding, image dimensions). The SkillBank initialization strategy is explicitly cited, and the code repository is publicly linked. Reproducing the exact results would require access to the specified VLM and the visual rendering pipeline, but the methodological transparency is sufficient for independent verification and extension.
The framework heavily depends on the quality and coverage of an offline-constructed SkillBank, which requires domain-specific curation and offline grouping. The visual context rendering, while token-efficient, ties the approach to VLMs and may not seamlessly integrate with text-only agent pipelines. The curriculum schedule, though adaptive, still requires manual tuning of hyperparameters (number of stages, validation interval, initial budget). Additionally, the theoretical analysis assumes local additivity of skill utility and smoothness of the vision encoder, which are simplifying assumptions that may break down in complex, multi-agent, or highly dynamic settings.
SKILL0 addresses a critical bottleneck in agentic AI: the trade-off between inference-time context augmentation and model autonomy. By internalizing skills into parameters, the method promises significant reductions in inference latency, token costs, and retrieval noise, paving the way for more efficient, self-sufficient LLM agents. If scaled effectively, this paradigm could shift the field away from heavy RAG/skill-retrieval pipelines toward parameter-efficient post-training recipes. However, the internalization process risks catastrophic forgetting or behavioral rigidity if the curriculum is poorly calibrated, and the reliance on curated skill banks may introduce curation bottlenecks or domain bias in deployed systems.
We construct algorithms with optimal error for learning with adversarial noise. The overarching theme of this work is that the use of \textsl{randomized} hypotheses can substantially improve upon the best error rates achievable with deterministic hypotheses.
- For $η$-rate malicious noise, we show the optimal error is $\frac{1}{2} \cdot η/(1-η)$, improving on the optimal error of deterministic hypotheses by a factor of $1/2$. This answers an open question of Cesa-Bianchi et al. (JACM 1999) who showed randomness can improve error by a factor of $6/7$.
- For $η$-rate nasty noise, we show the optimal error is $\frac{3}{2} \cdot η$ for distribution-independent learners and $η$ for fixed-distribution learners, both improving upon the optimal $2η$ error of deterministic hypotheses. This closes a gap first noted by Bshouty et al. (Theoretical Computer Science 2002) when they introduced nasty noise and reiterated in the recent works of Klivans et al. (NeurIPS 2025) and Blanc et al. (SODA 2026).
- For $η$-rate agnostic noise and the closely related nasty classification noise model, we show the optimal error is $η$, improving upon the optimal $2η$ error of deterministic hypotheses.
All of our learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error. All except for the fixed-distribution nasty noise learner are time efficient given access to an oracle for empirical risk minimization.
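The stated optimal rates are easy to compare numerically. The sketch below simply evaluates the randomized versus deterministic error formulas quoted in the abstract; the helper names are hypothetical and carry no content beyond those formulas.

```python
def malicious(eta, randomized=True):
    # Optimal error under eta-rate malicious noise: (1/2) * eta/(1-eta)
    # for randomized hypotheses, eta/(1-eta) for deterministic ones.
    return (0.5 if randomized else 1.0) * eta / (1 - eta)

def nasty(eta, randomized=True):
    # Optimal error under eta-rate nasty noise (distribution-independent):
    # (3/2) * eta randomized vs. 2 * eta deterministic.
    return (1.5 if randomized else 2.0) * eta

def agnostic(eta, randomized=True):
    # Optimal error under eta-rate agnostic noise: eta vs. 2 * eta.
    return (1.0 if randomized else 2.0) * eta

eta = 0.1
print(malicious(eta), malicious(eta, randomized=False))  # randomized halves the error
print(nasty(eta) / nasty(eta, randomized=False))         # ratio 3/4
```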
The paper establishes optimal error bounds for learning under adversarial noise by demonstrating that randomized hypotheses fundamentally outperform deterministic ones, resolving decades-old open questions in statistical learning theory. This work delivers a rigorous theoretical advance in robust learning, closing long-standing gaps regarding malicious, nasty, and agnostic noise models. By proving that randomization yields strictly better error guarantees than deterministic approaches, the authors provide a clean, elegant resolution to problems that have persisted since the late 1990s and early 2000s. The algorithms maintain favorable sample complexity and computational efficiency, making them theoretically sound and practically implementable given standard optimization oracles. While the contribution is highly significant for the foundations of learning theory and adversarial robustness, its direct influence on contemporary empirical machine learning—particularly deep learning architectures, large-scale training paradigms, or modern alignment techniques—remains indirect. Theoretical breakthroughs of this nature typically shape the mathematical understanding of generalization and robustness rather than immediately altering practitioner workflows. Consequently, this represents a strong, field-respected theoretical contribution that will serve as a foundational reference for future work in robust statistics and learning theory, though it does not cross the threshold into broad, practice-transforming impact.
A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
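A toy numerical sketch of the claimed mechanism — a fixed perturbation of one input embedding, amplified by iterated latent updates into a diverging trajectory — might look as follows. The recurrent map, dimensions, and perturbation scale are illustrative assumptions, not ThoughtSteer's actual attack or the Coconut/SimCoT architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=1.1 / np.sqrt(d), size=(d, d))  # toy recurrent "reasoning" map

def latent_trajectory(x0, steps=12):
    """Iterate a toy multi-pass latent update h <- tanh(W h)."""
    h = x0.copy()
    for _ in range(steps):
        h = np.tanh(W @ h)
    return h

clean = rng.normal(size=d)
delta = 0.5 * rng.normal(size=d)          # perturbation of a single input embedding
h_clean = latent_trajectory(clean)
h_trig = latent_trajectory(clean + delta)
drift = np.linalg.norm(h_trig - h_clean)
print(f"trajectory drift after 12 passes: {drift:.3f}")
```

Because the perturbation lives entirely in the continuous state, no token-level monitor ever sees it — the point the abstract makes about the missing audit trail.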
This paper introduces a novel backdoor attack that exploits the unobservable nature of continuous latent reasoning, revealing that adversarial control emerges from collective trajectory dynamics rather than individual hidden states. The work is significant because it identifies a fundamental security gap in next-generation silent reasoning architectures and provides a rigorous mechanistic explanation grounded in neural collapse and geometric attractors. By demonstrating that token-level defenses are inherently blind to latent trajectory hijacking, the paper forces a necessary shift toward trajectory-aware robustness verification and mechanistic interpretability. While the empirical validation is currently constrained to mid-scale models and specific continuous reasoning frameworks, the conceptual framework establishes a critical foundation for securing emerging reasoning paradigms and will likely catalyze substantial follow-up work in adversarial machine learning and alignment research.
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis addresses four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations. The thesis does not fully resolve any of these problems but makes each tractable and measurable.
This thesis introduces a cohesive suite of scalable methodologies for mechanistic interpretability, latent-space adversarial training, and agentic safety evaluation, substantially advancing the empirical toolkit for AI alignment research. The work is significant for systematically addressing four critical bottlenecks in safety engineering: automating transformer circuit discovery to democratize interpretability workflows, proposing residual-stream perturbation training that efficiently neutralizes embedded risks without prohibitive compute, establishing power-law scaling relationships for jailbreak success to enable quantitative robustness forecasting, and providing rigorous empirical baselines for autonomous misalignment in frontier models. While the contributions do not constitute a single paradigm-shifting breakthrough, their collective impact lies in transforming previously intractable safety problems into measurable, reproducible engineering tasks. Compared to foundational alignment work that often relies on preference optimization or constitutional prompting, this thesis grounds safety in mechanistic transparency and adversarial stress-testing, offering practical protocols that will likely be integrated into standard red-teaming and model evaluation pipelines across both academic and industry labs.
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.
Primary: Not explicitly stated in text (likely Echelon AI Labs based on GitHub repository links)
All Institutions: Echelon AI Labs
This paper demonstrates through rigorous, multi-platform benchmarking that minimal terminal-based coding agents interacting directly with APIs match or exceed the performance of GUI and MCP-based agents at a fraction of the cost, providing actionable evidence for simpler, API-first enterprise automation architectures.
The paper employs a clean, controlled experimental design to isolate the effect of agent interaction modality (GUI vs. MCP tool-augmented vs. terminal/CLI) while holding the LLM backbone constant. The methodology is well-structured, featuring systematic ablations on documentation access, persistent skill accumulation, single vs. multi-agent orchestration, and hybrid terminal+browser access. The use of programmatic, state-based verification against live, containerized platform instances is a strong methodological choice that avoids the brittleness of string-matching or UI-script validators. However, the approach is fundamentally empirical and comparative rather than algorithmic; it does not propose a new agent architecture, training objective, or reasoning framework, but rather rigorously tests an existing design hypothesis.
The evaluation is comprehensive and practically grounded, spanning three distinct enterprise platforms (ServiceNow, GitLab, ERPNext) and four frontier LLMs. The authors thoughtfully address potential fairness concerns by reporting results on a subset of tasks feasible for all paradigms, demonstrating that terminal agents still maintain a cost-performance advantage even when MCP agents are not structurally handicapped by missing tools. The cost analysis is particularly valuable, showing 4-9x efficiency gains for terminal agents with comparable or better success rates. The qualitative error analysis and skill taxonomy provide actionable insights into agent behavior, failure modes, and knowledge accumulation patterns. One minor weakness is the reliance on single-seed evaluations due to cost constraints, which limits statistical robustness for small performance deltas.
The authors commit to releasing the full evaluation framework, datasets, environments, prompts, and code upon acceptance, which is standard and acceptable for arXiv submissions. The use of containerized environments, LiteLLM routing, and deterministic state validators establishes a strong foundation for reproducibility. The explicit acknowledgment of single-seed limitations and the use of sample-proportion standard errors for uncertainty estimation demonstrate methodological transparency. Full reproducibility will depend on the quality and completeness of the promised code release and environment snapshots.
The paper clearly identifies several limitations: (1) terminal agents fundamentally fail on tasks requiring browser-session state manipulation (e.g., impersonation), rendered UI interpretation (e.g., charts), or complex drag-and-drop interfaces; (2) hybrid agents underperform due to poor tool-selection policies, often defaulting to expensive browser interactions even when API calls are optimal; (3) human-oriented documentation can actively degrade performance by encouraging overly complex retrieval strategies; and (4) the evaluation is constrained to single-seed runs, leaving run-to-run stochasticity unquantified. Additionally, the benchmark may partially reflect models' pre-training exposure to popular APIs (e.g., GitLab), potentially confounding "agent capability" with "parametric memorization."
The findings have direct, practical implications for enterprise AI deployment, challenging the industry trend toward heavily abstracted MCP servers and GUI-driven agents in favor of lightweight, API-first terminal agents. This work will likely influence how practitioners design agent interfaces, prioritize platform API stability, and structure agent-oriented documentation. It also highlights important safety considerations, as terminal agents' broad execution capabilities require robust API-level access controls, sandboxing, and audit trails. While not a theoretical breakthrough, the paper provides a much-needed empirical anchor for the agent architecture debate and offers a reusable benchmark that will facilitate future research in enterprise automation, skill accumulation, and hybrid tool selection.
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
The paper introduces a scalable pre/post-training paradigm for 3D avatar modeling that bridges the fidelity-generalization trade-off by leveraging million-scale in-the-wild video pretraining followed by targeted high-fidelity post-training. This work is significant because it successfully adapts the scaling laws and two-stage training strategies that revolutionized language and 2D vision to the 3D domain, demonstrating that massive, diverse data can resolve inherent geometric ambiguities while preserving fine-grained articulation control. While the conceptual framework borrows from established foundation model practices, the technical execution in handling 3D codec representations, achieving efficient feedforward inference, and unlocking emergent capabilities like relightability and loose-garment generalization represents a substantial methodological advance for human-centric 3D synthesis. Compared to foundational representation works like NeRF or Gaussian Splatting, which prioritize scene reconstruction, this approach targets scalable generative modeling with a clear trajectory toward production-ready pipelines. Its impact will likely be highly concentrated in 3D vision and computer graphics, establishing a new baseline for avatar synthesis, though it remains somewhat specialized relative to field-wide breakthroughs like SAM or Stable Diffusion.
Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD.
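PBF's two-gate selection — prototype similarity plus depth-uncertainty filtering — can be sketched as below. The thresholds, toy 2-D features, and the `filter_pseudo_labels` helper are hypothetical assumptions for illustration, not the released implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def filter_pseudo_labels(preds, prototypes, sim_thresh=0.8, depth_unc_thresh=0.5):
    """Keep predictions that match their class prototype AND have low
    depth uncertainty (both thresholds are hypothetical)."""
    kept = []
    for p in preds:
        proto = prototypes[p["cls"]]
        if cosine(p["feat"], proto) >= sim_thresh and p["depth_unc"] <= depth_unc_thresh:
            kept.append(p)
    return kept

protos = {"car": np.array([1.0, 0.0]), "ped": np.array([0.0, 1.0])}
preds = [
    {"cls": "car", "feat": np.array([0.9, 0.1]), "depth_unc": 0.2},  # kept
    {"cls": "car", "feat": np.array([0.1, 0.9]), "depth_unc": 0.2},  # rejected: feature mismatch
    {"cls": "ped", "feat": np.array([0.0, 1.0]), "depth_unc": 0.9},  # rejected: unreliable depth
]
print(len(filter_pseudo_labels(preds, protos)))  # 1
```

Requiring both gates to pass is what keeps pseudo-label quality high: a confident-looking detection with an implausible depth, or a plausible depth with an off-prototype feature, is discarded rather than propagated into training.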
The paper introduces a geometry-aware patch augmentation and prototype-based pseudo-labeling framework to enable robust monocular 3D object detection under sparse annotation budgets. While the core mechanisms adapt established semi-supervised and copy-paste augmentation paradigms, their careful integration with road-aware geometric consistency and depth uncertainty filtering yields a practical and well-motivated solution to a costly real-world bottleneck. The work will likely become a standard reference for researchers optimizing 3D perception pipelines in autonomous driving and robotics, where full 3D labeling is prohibitively expensive. However, the methodological contributions remain specialized to the 3D detection subdomain and do not introduce a broadly transferable learning paradigm or architectural breakthrough comparable to foundational shifts like SAM, NeRF, or modern open-vocabulary detectors. It represents a strong, execution-focused advance that meaningfully improves data efficiency in a specific vision task without redefining the broader field.
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
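A minimal sketch of the early-fusion idea — visual tokens querying text tokens through lightweight cross-attention inside an encoder layer, with a residual connection back into the frozen stream — assuming random weights and single-head attention. The shapes and helper names are illustrative, not the paper's module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(visual, text, Wq, Wk, Wv):
    """One lightweight cross-attention step: visual tokens attend to text tokens,
    and the result is injected residually into the (frozen) visual stream."""
    q, k, v = visual @ Wq, text @ Wk, text @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return visual + attn @ v

rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(196, d))   # 14x14 patch tokens from a frozen ViT layer
text = rng.normal(size=(4, d))       # embedded prompt tokens
Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
steered = cross_attend(visual, text, Wq, Wk, Wv)
print(steered.shape)  # (196, 16)
```

Because only the small cross-attention weights are trained while the backbone stays frozen, the prompt can bias which patches dominate downstream pooling without overwriting the underlying representation — the trade-off the paper's steerability benchmark is designed to measure.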
Primary: University of Technology Nuremberg
All Institutions: University of Technology Nuremberg, Carnegie Mellon University, International Institute of Information Technology Hyderabad
This paper introduces a lightweight early-fusion mechanism for steering frozen Vision Transformer representations via natural language, accompanied by a novel steerability benchmark. While the methodology offers a practical trade-off between language guidance and visual fidelity, its technical novelty is incremental relative to existing adapter-based VLMs, and broader adoption will depend on rigorous validation of zero-shot generalization, comprehensive baseline comparisons, and open-sourced implementation.
The core proposal of injecting natural language prompts directly into frozen Vision Transformer layers via lightweight cross-attention (early fusion) is a pragmatic architectural choice that addresses the late-fusion bottleneck of models like CLIP. By modulating intermediate visual features rather than fusing modalities at the output, the method attempts to preserve the rich spatial semantics of self-supervised backbones (DINOv2/MAE) while enabling targeted concept steering. The approach is technically sound and aligns with recent trends in parameter-efficient adaptation, though the underlying mechanism (cross-attention injection into frozen layers) is conceptually incremental relative to established adapter, prompt-tuning, and FiLM-based modulation literature. The introduction of a dedicated benchmark for quantifying "steerability" (control vs. representation degradation) is the strongest methodological contribution, offering a standardized metric that the field currently lacks.
The paper evaluates the proposed representations on anomaly detection and personalized object discrimination, reporting competitive zero-shot performance and OOD generalization. While these are meaningful downstream tasks, the experimental scope appears narrow for a method claiming broad representational utility. The evaluation lacks comprehensive linear probing results on standard vision benchmarks (e.g., ImageNet, COCO, ADE20K) to rigorously substantiate the claim that steering preserves generic visual quality. Additionally, comparisons against recent strong baselines in open-vocabulary vision and vision-language adapters (e.g., VPT, GLIDE, or recent prompt-tuning variants) are necessary to contextualize the reported gains. The zero-shot OOD claims are promising but require validation across more diverse domains and prompt complexities to rule out overfitting to specific semantic axes.
The reliance on frozen backbones and lightweight cross-attention modules inherently reduces computational overhead and lowers the barrier to reproduction. However, the provided text lacks critical implementation details such as exact layer placement strategies, hyperparameter sensitivity (learning rates, attention scaling, prompt length), and training compute budgets. Without open-sourced code, standardized evaluation scripts for the proposed steerability benchmark, and detailed ablation studies on architectural choices, full reproducibility remains uncertain. The community adoption of the benchmark will heavily depend on the release of a well-documented evaluation suite.
Early fusion via cross-attention introduces additional inference latency and parameter overhead compared to pure late-fusion pipelines, which may hinder deployment in resource-constrained settings. The steering mechanism's efficacy is inherently tied to the semantic alignment of the textual prompt with the frozen backbone's latent space, making it vulnerable to failure on abstract, compositional, or out-of-vocabulary concepts. Furthermore, steering specific visual concepts may inadvertently suppress semantically entangled features, leading to unintended representation collapse in downstream tasks. The frozen backbone assumption also limits the method's ability to adapt to radically novel visual domains without risking catastrophic forgetting or distribution shift.
Steerable visual representations could significantly streamline downstream vision pipelines by enabling fine-grained, language-guided control without full model fine-tuning, with direct applications in medical imaging analysis, industrial quality inspection, and personalized robotics. By decoupling generic feature extraction from task-specific guidance, the approach promotes more modular and efficient AI systems. However, the ability to selectively amplify or suppress visual features raises ethical considerations regarding potential misuse in biased representation engineering, targeted surveillance, or manipulation of automated perception systems. Transparent documentation of steering boundaries and failure modes will be essential for responsible deployment.
This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.
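The prototype-matching step — labeling each 3D point with the class of its most similar prototype image feature under cosine similarity — can be sketched as follows. The toy 2-D features and the `label_points` helper are illustrative assumptions, not the IGLOSS codebase.

```python
import numpy as np

def label_points(point_feats, proto_feats, classes):
    """Assign each 3D point the class of its nearest prototype feature
    by cosine similarity (argmax over per-class prototypes)."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = proto_feats / np.linalg.norm(proto_feats, axis=1, keepdims=True)
    sims = p @ q.T                     # (num_points, num_classes)
    return [classes[i] for i in sims.argmax(axis=1)]

classes = ["road", "car"]
protos = np.array([[1.0, 0.0], [0.0, 1.0]])             # one prototype feature per class
points = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.7]])  # distilled 3D point features
print(label_points(points, protos, classes))
```

Because both sides of the match are image-space features (3D features distilled from a 2D VFM, prototypes encoded from generated images), the comparison never crosses the image-text modality gap that CLIP-style approaches must bridge.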
IGLOSS introduces a novel pipeline that bypasses the text-to-3D modality gap by generating 2D visual prototypes from text prompts and aligning them with 3D point features distilled from 2D foundation models. The work addresses a persistent bottleneck in open-vocabulary 3D perception by repurposing generative models as discriminative bridges, offering a conceptually clean alternative to direct text-embedding matching that often struggles with geometric and domain misalignment. By leveraging the stronger image-to-image feature correspondence inherent in vision foundation models, the method achieves strong zero-shot transfer to automotive lidar benchmarks while maintaining computational tractability. The approach is methodologically sound and provides a practical template for grounding language in 3D space without requiring expensive 3D-text paired datasets. However, its impact remains concentrated within the 3D perception and open-vocabulary segmentation subfields rather than constituting a broad, field-wide paradigm shift. The reliance on distilled 2D features and prototype matching, while effective for lidar, introduces architectural constraints that may limit immediate generalization to other 3D modalities or broader vision-language reasoning tasks. Nevertheless, the core insight—using generative synthesis to create visual anchors for cross-modal alignment—is likely to influence how researchers design zero-shot 3D understanding systems moving forward.
Point cloud-based motion capture leverages rich spatial geometry and privacy-preserving sensing, but learning robust representations from noisy, unstructured point clouds remains challenging. Existing approaches face a trade-off between point-based methods (geometrically detailed but noise-sensitive) and skeleton-based ones (robust but oversimplified). We address the fundamental challenge of constructing a representation for human motion capture that balances expressiveness and robustness. In this paper, we propose Sparkle, a structured representation unifying skeletal joints and surface anchors with explicit kinematic-geometric factorization. Our framework, SparkleMotion, learns this representation through hierarchical modules embedding geometric continuity and kinematic constraints. By explicitly disentangling internal kinematic structure from external surface geometry, SparkleMotion achieves state-of-the-art performance not only in accuracy but, crucially, in robustness and generalization under severe domain shifts, noise, and occlusion. Extensive experiments demonstrate the superiority of our approach across diverse sensor types and challenging real-world scenarios.
Sparkle introduces a kinematic-geometric factorized representation that unifies skeletal joints and surface anchors, enabling robust and generalizable 3D human motion capture from noisy point clouds. The work addresses a persistent bottleneck in 3D vision by explicitly disentangling internal structural priors from external geometric observations, offering a principled alternative to the traditional point-versus-skeleton dichotomy. The hierarchical learning framework and rigorous evaluation across domain shifts and sensor modalities demonstrate strong empirical validity and practical utility for real-world deployment. However, the contribution remains tightly scoped to 3D human pose estimation and motion capture, limiting its broader methodological spillover into foundational vision architectures or generative paradigms. It will likely serve as a highly cited reference within the 3D vision and robotics communities, but does not cross the threshold for field-wide significance required for higher scoring tiers.
Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
Unify-Agent introduces a structured agentic pipeline and a dedicated factual benchmark to ground image synthesis in external world knowledge, offering a practical and empirically validated pathway toward more reliable, fact-aware multimodal generation. The work demonstrates that tightly coupling reasoning, search, and recaptioning significantly mitigates the hallucination and long-tail knowledge gaps of unified models, though the approach trades inference efficiency for factual controllability and relies on established compositional paradigms rather than introducing fundamental algorithmic breakthroughs.
The paper proposes a four-stage agentic pipeline (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) to decouple knowledge retrieval from image generation, directly addressing the frozen parametric knowledge bottleneck in unified multimodal models. This compositional architecture is methodologically sound and aligns with recent trends in retrieval-augmented and tool-augmented generation. The curation of 143K high-quality agent trajectories for supervised fine-tuning demonstrates careful data engineering, though the exact filtering criteria, trajectory annotation protocols, and quality assurance mechanisms would require full-text verification. The approach effectively reframes generation as a sequential reasoning-and-verification process rather than a single forward pass, trading architectural elegance for modularity and factual controllability.
The introduction of the FactIP benchmark (12 categories of culturally significant and long-tail factual concepts) is a strong empirical contribution that explicitly targets the evaluation gap in world-grounded synthesis. The reported improvements over the base unified model and competitive performance against closed-source systems suggest rigorous validation across standard and domain-specific metrics. However, without access to full ablation studies, compute budgets, or latency/throughput measurements, it is difficult to quantify the efficiency-accuracy trade-off inherent in multi-step agentic pipelines. The evaluation would benefit from explicit comparisons to simpler RAG baselines and end-to-end fine-tuning to isolate the marginal gain of the agentic formulation.
The release of a 143K trajectory dataset and the FactIP benchmark substantially improves reproducibility and provides a valuable resource for the community. However, the absence of explicit code repositories, training hyperparameters, optimizer configurations, and hardware specifications in the provided text limits immediate replication. Standard practices for unified model training (e.g., parameter-efficient fine-tuning, specific vision-language backbone choices, diffusion vs. autoregressive decoders) are implied but not detailed, which is a common shortcoming in early-stage arXiv submissions.
The sequential agentic pipeline inherently introduces significant inference latency and computational overhead compared to single-pass generation models. Reliance on external search tools creates vulnerability to retrieval failures, paywalled content, or outdated indices, which can propagate errors into the recaptioning and synthesis stages. The 143K dataset, while substantial, may not fully capture the long-tail distribution of global cultural or rapidly evolving factual concepts. Additionally, the approach likely struggles with highly abstract or stylistically driven prompts where factual grounding is secondary to creative expression.
This work meaningfully advances the integration of dynamic knowledge retrieval with generative modeling, offering practical utility for educational content creation, historical/cultural visualization, and professional design workflows requiring factual accuracy. The agentic paradigm also highlights a broader shift toward verifiable, traceable generative systems. However, it raises important considerations around the automated generation of culturally sensitive or historically contested imagery, necessitating robust safety alignment, provenance tracking, and transparent evidence citation to mitigate misinformation risks.
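The four-stage agentic pipeline described in the abstract (prompt understanding, evidence search, grounded recaptioning, synthesis) can be summarized as a simple composition. This is a hedged structural sketch, not the paper's implementation: the stage functions, `Trajectory` record, and toy stubs below are all hypothetical stand-ins for model and tool calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    prompt: str
    entities: list
    evidence: list
    recaption: str

def run_pipeline(prompt: str,
                 understand: Callable, search: Callable,
                 recaption: Callable, synthesize: Callable):
    """Compose the four stages named in the abstract.
    Each stage is injected so the skeleton stays model-agnostic."""
    entities = understand(prompt)                  # prompt understanding
    evidence = [search(e) for e in entities]       # multimodal evidence search
    grounded = recaption(prompt, evidence)         # grounded recaptioning
    image = synthesize(grounded)                   # final synthesis
    return Trajectory(prompt, entities, evidence, grounded), image

# Toy stubs standing in for model calls.
traj, img = run_pipeline(
    "a portrait of the torii gate at Itsukushima",
    understand=lambda p: ["Itsukushima torii"],
    search=lambda e: f"retrieved facts about {e}",
    recaption=lambda p, ev: p + " | " + "; ".join(ev),
    synthesize=lambda c: f"<image conditioned on: {c}>",
)
print(traj.recaption)
```

Recording the full trajectory, rather than only the final image, is what makes the 143K-example supervision over the whole agentic process possible.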
Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing a character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.
The paper introduces a reference-based video generation framework that leverages compact anchor frames, enhanced by superset content anchoring and positional encoding as weak conditions, to achieve long-duration character consistency. This work addresses a critical bottleneck in generative video synthesis: maintaining identity and appearance fidelity across extended temporal horizons and varying viewpoints. By formalizing character representation through a curated set of anchor frames, the authors provide a stable memory mechanism that circumvents the semantic drift and identity collapse commonly observed in autoregressive or standard diffusion-based video models. The introduction of superset content anchoring to mitigate copy-paste artifacts, alongside the repurposing of RoPE as a weak conditioning signal for multi-reference disambiguation, demonstrates thoughtful architectural adaptation rather than reliance on brute-force scaling. The accompanying scalable data extraction pipeline further strengthens the practical utility of the approach, enabling training on large-scale video corpora without manual curation. While the methodology operates within established diffusion and reference-conditioning paradigms rather than proposing a fundamentally new generative architecture, it delivers a robust, engineering-sound solution to a highly sought-after capability. The demonstrated ability to generate coherent character videos exceeding ten minutes with strong multi-view consistency positions this work as a strong reference for character-centric video synthesis, likely to see adoption in both academic research and applied animation pipelines. It represents a meaningful step forward in controllable long-form generation, though its impact remains bounded by its reliance on existing backbone architectures rather than redefining the underlying generative framework.
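The "RoPE as Weak Condition" idea can be illustrated with a standard rotary embedding: giving each anchor frame a distinct positional offset makes otherwise-identical anchor tokens distinguishable to attention. This is a generic RoPE sketch under assumed offsets, not the paper's conditioning code; the offset spacing (`1000 * k`) is illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding for a single token vector x (D even)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# Give each anchor frame a distinct offset so otherwise-identical
# anchors are encoded differently.
anchor = np.ones(8)
enc = [rope(anchor, pos=1000 * k) for k in range(3)]
```

Because the rotation only re-phases the features rather than adding content, the offset acts as a weak condition: it disambiguates anchors without overriding their appearance information.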
Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.
MotionScale introduces a cluster-centric basis transformation and a decoupled progressive optimization pipeline to enable scalable, temporally consistent 4D Gaussian Splatting for large dynamic scenes. The work directly addresses the memory and temporal coherence bottlenecks that currently limit dynamic neural rendering in expansive environments. By replacing dense per-frame deformation fields with adaptive, cluster-driven motion bases and separating background expansion from foreground refinement, the authors achieve a more efficient and stable optimization trajectory. This represents a meaningful algorithmic refinement within the Gaussian Splatting ecosystem, offering practical improvements for long-sequence reconstruction and complex camera trajectories. While the contribution does not depart from the established splatting paradigm or introduce a fundamentally new representation, it provides a robust, scalable framework that will likely become a standard reference for researchers tackling large-scale dynamic scene modeling. Its impact remains concentrated in the 3D reconstruction and novel view synthesis communities rather than driving broader shifts across generative or multimodal vision.
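A cluster-centric motion field of the kind described above can be sketched as follows: each Gaussian blends a small set of per-cluster rigid transforms using soft distance-based weights, instead of carrying its own dense deformation. All names, the Gaussian-kernel weighting, and the temperature `tau` are illustrative assumptions, not MotionScale's parameterization.

```python
import numpy as np

def deform(points, centers, rotations, translations, tau=0.1):
    """Cluster-centric motion field (sketch).

    points:       (N, 3) canonical Gaussian centers
    centers:      (K, 3) motion-cluster centers
    rotations:    (K, 3, 3) per-cluster rotations at time t
    translations: (K, 3)   per-cluster translations at time t
    Each point blends its clusters' transforms with soft weights
    based on distance to the cluster centers.
    """
    d2 = ((points[:, None, :] - centers[None]) ** 2).sum(-1)  # (N, K)
    w = np.exp(-d2 / tau)
    w /= w.sum(1, keepdims=True)                              # soft assignment
    # Apply each cluster's rigid transform, then blend per point.
    moved = np.einsum('kij,nj->nki', rotations, points) + translations[None]
    return (w[..., None] * moved).sum(1)                      # (N, 3)

# One cluster, identity rotation, unit x-translation.
out = deform(np.zeros((1, 3)), np.zeros((1, 3)),
             np.eye(3)[None], np.array([[1.0, 0.0, 0.0]]))
print(out)   # → [[1. 0. 0.]]
```

Scaling then amounts to adaptively adding clusters where motion is diverse, rather than densifying a per-point field.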
Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption, we prove this bounds the total variation distance between our parallel sampling and the model's joint. Empirically, DEMASK achieves 1.7-2.2$\times$ speedup on Dream-7B while matching or improving accuracy compared to confidence-based and KL-based baselines.
DEMASK introduces a lightweight dependency predictor and theoretically grounded greedy unmasking strategy that mitigates distributional mismatch during parallel decoding in discrete diffusion language models. The work addresses a well-known bottleneck in non-autoregressive generation: the quality degradation that arises when parallel sampling assumes token independence. By extracting pairwise conditional influences directly from final hidden states and coupling them with a selection mechanism that bounds cumulative dependency, the method offers a principled alternative to heuristic confidence or divergence-based baselines. The theoretical guarantee on total variation distance under a sub-additivity assumption adds rigor to an area often driven by empirical heuristics, while the demonstrated speedups on a modern-scale model confirm practical viability. Although discrete diffusion architectures remain an emerging paradigm compared to the dominant autoregressive foundation models, this decoding framework provides a reusable component that could accelerate research into efficient, high-quality parallel generation. The contribution is methodologically sound and addresses a clear gap, though its broader field-wide adoption will ultimately depend on whether discrete diffusion models achieve parity with autoregressive systems across diverse reasoning and long-context tasks.
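The dependency-bounded selection at the heart of the method can be sketched as a greedy pass: visit positions in confidence order and admit a position only if the dependency mass it adds against the already-selected set stays under a budget. The confidence ordering, symmetric dependency matrix, and scalar budget below are simplifying assumptions, not DEMASK's exact criterion.

```python
import numpy as np

def select_parallel(confidence, dependency, budget):
    """Greedy dependency-bounded selection (sketch of the DEMASK idea).

    confidence: (M,) per-position unmasking confidence
    dependency: (M, M) symmetric pairwise dependency estimates
    budget:     cap on the dependency mass a new pick may add
    Returns indices judged safe to unmask in the same step.
    """
    order = np.argsort(-confidence)        # most confident first
    chosen = []
    for i in order:
        # Dependency mass between candidate i and the chosen set.
        added = sum(dependency[i, j] for j in chosen)
        if added <= budget:
            chosen.append(int(i))
    return chosen

conf = np.array([0.9, 0.8, 0.7])
dep = np.array([[0.0, 0.5, 0.0],
                [0.5, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
print(select_parallel(conf, dep, budget=0.1))   # → [0, 2]: position 1 deferred
```

Bounding the admitted dependency is what makes the total-variation guarantee plausible: strongly coupled tokens (here positions 0 and 1) are never factorized in the same step.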
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
This work establishes a three-dimensional scaling law that quantifies the optimal trade-off between parametric pretraining data and non-parametric retrieval corpus size under fixed compute budgets. By systematically varying model scale, pretraining tokens, and retrieval store size, the authors provide a principled, empirical framework that moves retrieval-augmented design from heuristic tuning to predictable resource allocation. The study’s strength lies in its rigorous experimental design across multiple model sizes and diverse benchmarks, revealing nuanced dependencies on task type and pretraining saturation that challenge simplistic assumptions about retrieval always compensating for smaller models. While the conceptual extension of classical scaling laws to include retrieval is an evolutionary rather than revolutionary step, the resulting manifold offers immediate, actionable guidance for both academic and industrial teams navigating the growing tension between long-context pretraining and external knowledge integration. The work’s impact will likely be felt in how future language model pipelines budget data curation versus retrieval infrastructure, cementing its place as a foundational reference for scalable, retrieval-aware modeling.
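The use of such a scaling manifold for budget allocation can be illustrated with a toy three-term power law: predict loss from model size, pretraining tokens, and retrieval tokens, then search the split of a fixed token budget that minimizes it. The functional form and every constant below are illustrative assumptions, not the paper's fitted coefficients.

```python
import numpy as np

def predicted_loss(N, D, R, E=1.8, A=400.0, B=600.0, C=50.0,
                   alpha=0.34, beta=0.28, gamma=0.10):
    """Hypothetical three-term scaling form: loss falls as a power law
    in parameters N, pretraining tokens D, and retrieval tokens R."""
    return E + A / N**alpha + B / D**beta + C / R**gamma

def best_split(N, total_tokens, grid=99):
    """Grid-search the pretraining fraction of a fixed token budget
    that minimizes the predicted loss (remainder goes to retrieval)."""
    fracs = np.linspace(0.01, 0.99, grid)
    losses = [predicted_loss(N, f * total_tokens,
                             (1 - f) * total_tokens) for f in fracs]
    return float(fracs[int(np.argmin(losses))])

# With these toy constants the optimum lands at an interior split;
# the real exponents determine which side the balance tips to.
print(best_split(N=1e9, total_tokens=1e11))
```

The paper's finding that the marginal utility of retrieval depends on model scale and pretraining saturation corresponds, in this toy form, to the optimum shifting with `N` and the relative exponents.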
Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
The paper introduces a cost-aware, adaptive evaluation framework that combines multidimensional item response theory with optimal experimental design, supported by a large-scale item-level dataset, to efficiently predict LLM capabilities across diverse unseen benchmarks. This work addresses a critical bottleneck in modern language model development by replacing exhaustive, static testing with a principled statistical approach that dynamically selects the most informative evaluation items. While psychometric models have previously been adapted for AI assessment, the integration of adaptive experimental design and explicit token-cost optimization represents a meaningful methodological advance that directly tackles benchmark saturation and computational waste. The accompanying dataset provides unprecedented granularity for analyzing latent ability structures and cross-task generalization, offering a valuable resource for the community. Compared to traditional evaluation paradigms that rely on fixed, monolithic test suites, this framework enables researchers to allocate computational resources more strategically during model development and comparison. The approach is technically rigorous and highly practical, though it extends established statistical machinery rather than proposing a novel learning architecture or training paradigm. Its focus on predictive validity and efficiency ensures strong relevance for both academic researchers and industry teams seeking scalable evaluation pipelines, positioning it as a solid contribution to the evolving landscape of LLM assessment.
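The adaptive item selection described above can be sketched with a one-dimensional 2PL item response model: at each step, ask the unseen item with the highest Fisher information at the current ability estimate. This is a textbook-IRT simplification of the paper's multidimensional, cost-aware criterion; all names and the toy item pool are illustrative.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response: ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def next_item(theta_hat, a, b, asked):
    """Pick the unasked item with maximal Fisher information at the
    current ability estimate: I(theta) = a^2 * p * (1 - p)."""
    p = p_correct(theta_hat, a, b)
    info = a**2 * p * (1 - p)
    info[list(asked)] = -np.inf            # never repeat an item
    return int(np.argmax(info))

# Toy pool: information peaks where item difficulty matches ability.
a = np.ones(5)
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(next_item(0.0, a, b, asked=set()))   # → 2 (difficulty nearest theta)
```

Because information is maximized where difficulty matches the running ability estimate, a handful of well-chosen items can pin down performance, which is the mechanism behind the 16-item result; the paper's cost-aware variant additionally discounts information by item token cost.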
Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.
The paper demonstrates that LLMs internally encode contextual privacy norms as linearly separable, independent directions and introduces a structured, theory-grounded steering method that outperforms monolithic interventions. This work makes a meaningful contribution to representation engineering and AI safety by reframing privacy failures as a misalignment between latent knowledge and behavioral output rather than a lack of conceptual awareness. By grounding the analysis in Contextual Integrity theory, the authors move beyond ad-hoc probing and propose a compositional steering framework that offers more predictable control over sensitive information disclosure. While the approach builds on established linear probing and activation steering techniques, its systematic decomposition of privacy into orthogonal parameters provides a clear methodological advance for structured representation manipulation. The findings are likely to influence how researchers design safety interventions, shifting focus from monolithic suppression to targeted, dimension-specific alignment. However, as a recent preprint with limited external validation so far, its broader field-wide impact will depend on reproducibility across diverse architectures and real-world deployment scenarios, keeping it within the range of a strong, specialized contribution rather than a paradigm-shifting breakthrough.
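The structured intervention can be sketched as additive steering along three independent directions, one per CI parameter, each with its own strength. This is a generic activation-steering sketch under the paper's linear-separability finding; the direction vectors and strengths below are hypothetical.

```python
import numpy as np

def ci_steer(hidden, directions, alphas):
    """Steer a hidden state independently along each CI dimension.

    hidden:     (D,) residual-stream activation
    directions: (3, D) vectors for information type, recipient, and
                transmission principle (assumed linearly independent)
    alphas:     (3,) per-dimension intervention strengths
    """
    steered = hidden.copy()
    for v, a in zip(directions, alphas):
        v = v / np.linalg.norm(v)
        steered = steered + a * v       # additive intervention per dimension
    return steered

h = np.zeros(4)
dirs = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
out = ci_steer(h, dirs, np.array([2.0, -1.0, 0.5]))
print(out)   # → [ 2. -1.  0.5  0.]
```

The contrast with monolithic steering is that each `alpha` can be tuned separately, so tightening the transmission-principle dimension need not distort the recipient or information-type representation.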
Existing humanoid table tennis systems remain limited by their reliance on external sensing and their inability to achieve agile whole-body coordination for precise task execution. These limitations stem from two core challenges: achieving low-latency and robust onboard egocentric perception under fast robot motion, and obtaining sufficiently diverse task-aligned strike motions for learning precise yet natural whole-body behaviors. In this work, we present SMASH, a modular system for agile humanoid table tennis that unifies scalable whole-body skill learning with onboard egocentric perception, eliminating the need for external cameras during deployment. Our work advances prior humanoid table-tennis systems in three key aspects. First, we achieve agile and precise ball interaction with tightly coordinated whole-body control, rather than relying on decoupled upper- and lower-body behaviors. This enables the system to exhibit diverse strike motions, including explosive whole-body smashes and low crouching shots. Second, by augmenting and diversifying strike motions with a generative model, our framework benefits from scalable motion priors and produces natural, robust striking behaviors across a wide workspace. Third, to the best of our knowledge, we demonstrate the first humanoid table-tennis system capable of consecutive strikes using onboard sensing alone, despite the challenges of low-latency perception, ego-motion-induced instability, and limited field of view. Extensive real-world experiments demonstrate stable and precise ball exchanges under high-speed conditions, validating scalable, perception-driven whole-body skill learning for dynamic humanoid interaction tasks.
Primary: The University of Hong Kong
All Institutions: The University of Hong Kong, Kinetix AI
SMASH introduces a scalable, perception-driven whole-body control framework that combines generative motion augmentation, task-conditioned motion matching, and egocentric vision to enable the first outdoor, onboard-only humanoid table tennis system with diverse, agile striking behaviors. The work demonstrates strong system-level engineering and practical RL/imitation integration, though its algorithmic novelty is incremental and its impact remains primarily confined to robotics and dynamic humanoid control rather than foundational machine learning.
The paper presents a well-integrated pipeline that addresses two critical bottlenecks in dynamic humanoid control: sparse motion data and reliance on external perception. The core methodological contribution lies in the scalable motion generation and matching framework. Training a conditional Motion-VAE with task-aligned regularizers (phase consistency, temporal smoothness, foot penetration penalty) and subsequently filtering outputs through a physics-aware tracker is a pragmatic and effective approach to bridge the gap between sparse human demonstrations and robot-executable priors. The decision to use task-conditioned nearest-neighbor motion matching rather than hierarchical skill learning or adversarial priors simplifies the training loop while maintaining strong task alignment. The RL formulation (PPO with asymmetric critic, gated impact-window rewards, and adaptive region/sigma scheduling) demonstrates careful engineering for sim-to-real transfer. The perception stack, while largely composed of established components (YOLO, HSV segmentation, stereo triangulation, AprilTag PnP, Adaptive EKF), is tightly coupled to the control loop and handles high-speed dynamics and ego-motion robustly.
The experimental design is thorough and appropriately structured. Simulation ablations cleanly isolate the contributions of data scale, tracker filtering, and adaptive training techniques, showing clear performance trends. The comparison against PPO, Mimic, and HITTER baselines effectively highlights the necessity of whole-body coordination and motion priors for this task. Real-world validation is a strong point: the system successfully executes diverse strikes (smashes, crouching saves, lateral movements) and achieves the claimed milestone of outdoor consecutive rallies using only onboard sensing. The perception error analysis as a function of time-to-strike provides valuable insight into the system's operational envelope. However, quantitative real-world success rates, rally length distributions, and failure mode statistics are underreported, leaving some ambiguity about long-term robustness.
The paper provides clear algorithmic descriptions, reward formulations, and observation structures, which are sufficient for conceptual replication. However, exact reproducibility is hindered by missing details: specific neural network architectures, hyperparameter schedules, simulation environment configurations, and compute budgets are not fully disclosed. Because the system depends on specific hardware (Unitree G1, ZED cameras, custom MoCap setup) and proprietary simulation pipelines, replication will require significant engineering effort. The perception pipeline's distance-adaptive noise modeling and EKF reset logic are well-documented, which aids implementation.
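A minimal sketch of what distance-adaptive noise modeling and an EKF reset gate can look like, under assumptions not taken from the paper: stereo depth error is modeled as growing quadratically with range, and the reset test is a standard chi-square gate on the normalized innovation. All constants below are illustrative.

```python
import numpy as np

def measurement_noise(distance_m, sigma0=0.01, k=0.004):
    """Diagonal 3-D measurement covariance that grows with ball distance.

    Stereo depth error grows roughly quadratically with range, so a
    simple model is sigma = sigma0 + k * d^2 (constants are illustrative,
    not values from the paper).
    """
    sigma = sigma0 + k * distance_m ** 2
    return np.eye(3) * sigma ** 2

def should_reset(innovation, S, gate=9.0):
    """Chi-square gate on the filter innovation.

    If the normalized innovation squared exceeds the gate, the track is
    treated as lost and the filter is re-initialized (gate value is
    illustrative; S is the innovation covariance).
    """
    nis = float(innovation @ np.linalg.solve(S, innovation))
    return nis > gate
```

The gate catches the cases the reviewers mention, such as occlusion or a mis-detection yielding a measurement far from the predicted ball state, at which point re-initializing is safer than letting a diverged filter drive the controller.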
The system is highly task-specific; generalizing the motion generation and matching pipeline to other dynamic manipulation or locomotion tasks remains unproven. The tracker-based filtering step, while crucial for dynamic feasibility, introduces a computational bottleneck and requires a pre-trained tracking policy, adding training complexity. The egocentric perception system, though robust in controlled outdoor settings, will likely degrade under severe lighting changes, ball occlusion, or highly unpredictable opponent play. The paper lacks quantitative metrics on long-term rally stability, recovery from missed strikes, and computational latency breakdowns across the perception-planning-control loop.
This work represents a meaningful step toward fully autonomous, deployable humanoid systems capable of high-speed dynamic interaction without external infrastructure. The scalable motion augmentation and task-aligned matching framework offers a practical blueprint for overcoming data scarcity in whole-body skill learning, with potential applications in sports robotics, agile manipulation, and human-robot collaboration. By demonstrating robust onboard perception coupled with expressive whole-body control, the paper helps bridge the gap between simulation-trained policies and real-world dynamic deployment.