Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
Primary: Alipay
All Institutions: Alipay
QuitoBench introduces a regime-balanced, leakage-free benchmark that systematically evaluates time series forecasting models across intrinsic statistical properties rather than domain labels. The paper delivers rigorous empirical findings on context-length crossovers, forecastability-driven difficulty, and parameter efficiency, providing actionable guidance for practitioners while challenging the assumption that larger foundation models are uniformly superior; however, its single-provenance design and narrow task focus limit immediate generalizability across the broader time series ecosystem.
The paper introduces a principled, regime-aware evaluation framework that replaces coarse domain-based splits with intrinsic statistical diagnostics (trend, seasonality, forecastability) derived from STL decomposition and spectral entropy. The stratified sampling across 8 TSF cells directly mitigates the prevalence-driven bias that plagues existing benchmarks. The dense rolling-window evaluation protocol (unit stride, ~16M predictions/model) is methodologically superior to sparse windowing, yielding highly stable per-series error estimates. The pipeline is transparent, with clear thresholding (0.4) and sensitivity analysis, though the reliance on a single corporate data source inherently constrains the ecological diversity of the benchmark.
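The regime-labeling recipe described above (trend and seasonal strength from a decomposition, plus entropy-based forecastability, each binarized at 0.4) can be sketched in a few lines. This is a simplified stand-in, not the paper's pipeline: a classical moving-average decomposition and a raw FFT periodogram replace STL and Welch's method, and the function name, period, and defaults are illustrative.

```python
import numpy as np

def tsf_regime(y, period=24, threshold=0.4):
    """Assign a series to one of the 8 TSF cells by binarizing trend strength,
    seasonal strength, and forecastability at a shared threshold.

    Sketch only: the paper uses STL and Welch's method; here a classical
    moving-average decomposition and a raw FFT periodogram stand in for them.
    """
    y = np.asarray(y, dtype=float)
    # Trend estimate: centered moving average over one seasonal period.
    trend = np.convolve(y, np.ones(period) / period, mode="same")
    detrended = y - trend
    # Seasonal estimate: mean of the detrended series at each phase, tiled out.
    phase_means = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(phase_means, len(y) // period + 1)[: len(y)]
    remainder = detrended - seasonal
    # Hyndman-style strength measures:
    # F = max(0, 1 - Var(remainder) / Var(component + remainder)).
    f_trend = max(0.0, 1.0 - remainder.var() / (trend + remainder).var())
    f_seasonal = max(0.0, 1.0 - remainder.var() / (seasonal + remainder).var())
    # Forecastability: 1 minus normalized spectral entropy of the periodogram
    # (a flat spectrum -> entropy near 1 -> forecastability near 0).
    power = np.abs(np.fft.rfft(y - y.mean())) ** 2
    power = power / power.sum()
    nz = power[power > 0]
    entropy = -(nz * np.log(nz)).sum() / np.log(len(power))
    forecastability = 1.0 - entropy
    # Binarize the three axes -> one of 2**3 = 8 regime cells.
    return tuple(int(v >= threshold) for v in (f_trend, f_seasonal, forecastability))
```

A pure sinusoid lands in a high-forecastability cell (its spectrum is concentrated in one bin), while white noise falls in a low-forecastability cell, which is the stratification behavior the benchmark relies on.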
The experimental design is rigorous and comprehensive, evaluating 10 models across 18 configurations (context lengths, horizons, multivariate/univariate modes) on 232,200 instances. The four key findings (context-length crossover, forecastability as the dominant difficulty axis, DL parameter efficiency, and data-scaling superiority) are well-supported by dense empirical evidence. Cross-metric (MAE/MSE) and cross-benchmark (vs. Timer) consistency checks strengthen the validity of the rankings. However, the scaling experiments are limited to two representative models due to compute constraints, and the foundation models are evaluated strictly zero-shot, leaving fine-tuning dynamics underexplored.
High. The authors commit to releasing the dataset, code, and evaluation framework under an open license. The paper provides exhaustive details on data preprocessing, TSF computation, temporal splitting, and training protocols. Standard statistical diagnostics (STL, Welch's method) ensure that the regime labeling can be replicated on external datasets. The primary reproducibility constraint is the proprietary origin of the raw telemetry, but the released Quito corpus and QuitoBench splits fully enable independent validation of the reported results.
Single-provenance data (Alipay application traffic) may not generalize to domains with fundamentally different dynamics (e.g., climate, biomedical signals, or industrial IoT). The TSF threshold of 0.4 is heuristic, and while sensitivity analysis shows robustness, optimal thresholds may vary by domain. The benchmark focuses exclusively on point forecasting, omitting probabilistic forecasting, anomaly detection, and multivariate causal modeling. Anonymized variates limit domain-specific interpretability, and the high computational cost of dense rolling evaluation may hinder rapid iteration for resource-constrained researchers.
QuitoBench addresses a critical evaluation crisis in time series forecasting by providing a contamination-free, regime-balanced standard that can guide reliable model selection in high-stakes operational domains. By empirically demonstrating that task-specific architectures and data scaling can rival large foundation models at a fraction of the parameter count, the work promotes more efficient, sustainable, and transparent ML practices. The open release democratizes access to industrial-scale time series data, though long-term field-wide adoption will depend on cross-provenance validation and extension to additional forecasting tasks.
Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.
Primary: Technical University of Munich (TUM)
All Institutions: Technical University of Munich (TUM), TUM University Hospital, Carnegie Mellon University, Imperial College London, Munich Center for Machine Learning, National University of Singapore, University of Oxford
MedOpenClaw introduces an auditable runtime and benchmark for evaluating VLMs on full 3D medical imaging studies, revealing a critical spatial grounding bottleneck when agents use expert tools, though preliminary experiments and reliance on unreleased models limit its immediate technical impact.
The paper proposes a well-conceived systems framework (`MedOpenClaw`) that wraps a standard clinical viewer (3D Slicer) with a bounded, REST-based API to enable VLM-driven study navigation. The three-tier action space (primitive navigation, evidence capture, expert tools) is logically structured and directly maps to the benchmark's three evaluation tracks. The design prioritizes auditability by logging all viewer states and tool invocations, which is a meaningful step toward clinical transparency. However, the methodology is primarily an engineering integration rather than an algorithmic breakthrough. The agent reasoning relies entirely on off-the-shelf VLMs, and the paper does not introduce novel prompting strategies, memory architectures, or training objectives to overcome the identified spatial grounding bottleneck.
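The bounded, audit-logged action space described above can be sketched as a small dispatcher. The tier structure mirrors the paper's three tiers, but the individual action names and the class interface are hypothetical, not MedOpenClaw's actual REST API:

```python
import time

# Hypothetical three-tier action space mirroring the paper's structure;
# the action names themselves are illustrative, not MedOpenClaw's real API.
ALLOWED_ACTIONS = {
    "navigation": {"set_slice", "set_window_level", "switch_sequence"},
    "evidence":   {"capture_view", "annotate_point"},
    "expert":     {"run_segmentation", "measure_lesion"},
}

class AuditedRuntime:
    """Minimal sketch of a bounded, audit-logged dispatcher over a viewer API."""

    def __init__(self):
        self.audit_log = []

    def dispatch(self, tier, action, **params):
        # Bounded action space: reject anything outside the declared tiers.
        if action not in ALLOWED_ACTIONS.get(tier, set()):
            raise ValueError(f"{action!r} is not an allowed {tier!r} action")
        # Auditability: every accepted call is recorded with its arguments.
        entry = {"time": time.time(), "tier": tier,
                 "action": action, "params": params}
        self.audit_log.append(entry)
        return entry
```

The design point this illustrates is that auditability falls out of routing every viewer interaction through one chokepoint: rejected calls never reach the viewer, and accepted calls leave a replayable trace.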
The empirical evaluation is conceptually strong but practically flawed. The identification of the "tool-use paradox"—where adding expert segmentation tools degrades diagnostic accuracy due to imprecise coordinate grounding—is a highly valuable insight that challenges current assumptions about tool-augmented medical agents. However, the experimental setup is severely compromised by the use of non-existent/future model names (e.g., "GPT-5.4", "Gemini 3.1 Pro/Flash"), which undermines the credibility and reproducibility of the reported metrics. The benchmark relies on only two public datasets (UCSF-PDGM and NSCLC radiogenomics) and a single MCQ evaluation protocol, limiting the statistical robustness and generalizability of the conclusions.
Low to moderate. While the runtime architecture, dataset sources, and evaluation tracks are clearly described, the absence of a public code repository, exact prompt templates, and evaluation scripts makes immediate reproduction impossible. Furthermore, the reliance on unreleased commercial VLMs prevents independent verification of the baseline results. The promise of auditability is theoretically sound, but without open-sourcing the wrapper and logging infrastructure, the community cannot validate or build upon the system.
The authors appropriately acknowledge the narrow modality scope (only brain MRI and lung CT/PET), lack of longitudinal/EHR integration, and preliminary tool ecosystem. A critical unaddressed limitation is the lack of proposed solutions for the spatial grounding bottleneck; the paper diagnoses the problem but offers no algorithmic or architectural pathway to resolve it. Additionally, the use of placeholder/fictional model versions suggests the paper may be a draft or speculative submission, which significantly weakens its empirical foundation.
The work successfully shifts the evaluation paradigm from static 2D image recognition to dynamic, full-study clinical reasoning, aligning much more closely with real-world radiology workflows. The emphasis on bounded, auditable interactions addresses critical regulatory and safety concerns for deploying AI in hospitals. The identified spatial grounding bottleneck provides a clear, actionable research direction for the medical agent community. If the infrastructure is open-sourced and expanded with diverse modalities and rigorous baselines, it could become a standard evaluation suite; currently, it serves as a strong conceptual blueprint with preliminary validation.
Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent work attempts to address this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
Primary: National University of Singapore
All Institutions: National University of Singapore
GeoSR identifies and mitigates the underutilization of geometry tokens in VLMs through targeted training-time masking and adaptive gated fusion. The paper delivers a clear empirical finding, a well-validated engineering solution, and consistent benchmark improvements, making it a solid contribution to multimodal spatial reasoning, though its methodological novelty remains incremental relative to established masking and routing paradigms.
The paper identifies a clear and practically significant failure mode in current geometry-aware VLMs: naive token fusion leads to geometry underutilization because models default to 2D appearance shortcuts. The proposed solution, GeoSR, combines two complementary mechanisms: Geometry-Unleashing Masking (GUM) to suppress appearance shortcuts during training, and Geometry-Guided Fusion (GGF) to adaptively route geometric evidence via a learned gate. The methodology is logically sound and well-motivated. The distinction between static (random masking) and dynamic (attention-guided masking) regimes shows thoughtful adaptation to task characteristics. However, the core techniques (MAE-style masking, cross-attention relevance scoring, and sigmoid gating) are established components repurposed for multimodal fusion rather than fundamentally new algorithmic primitives. The approach is an effective engineering recipe that addresses a real bottleneck, but lacks theoretical grounding or architectural innovation beyond standard training-time regularization and feature routing.
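The two mechanisms can be sketched in numpy. This is an illustrative approximation rather than GeoSR's implementation: only the random-masking regime is shown (the attention-guided dynamic variant is omitted), and `gate_weights` stands in for a learned projection:

```python
import numpy as np

def geometry_unleashing_mask(vis_tokens, mask_ratio=0.5, rng=None):
    """Zero out a random fraction of 2D vision tokens (training only).

    Sketch of GUM's random-masking regime; removing appearance cues is what
    pushes the model to consult geometry tokens instead.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = vis_tokens.shape[0]
    drop = rng.choice(n, size=int(n * mask_ratio), replace=False)
    masked = vis_tokens.copy()
    masked[drop] = 0.0  # masked tokens carry no appearance information
    return masked

def geometry_guided_fusion(vis_tokens, geo_tokens, gate_weights):
    """Per-token sigmoid gate that routes between vision and geometry features.

    `gate_weights` stands in for a learned projection; here it is just given.
    """
    z = np.concatenate([vis_tokens, geo_tokens], axis=-1) @ gate_weights
    gate = 1.0 / (1.0 + np.exp(-z))  # shape: (n_tokens,)
    # High gate values amplify geometric evidence where it matters.
    return gate[:, None] * geo_tokens + (1.0 - gate[:, None]) * vis_tokens
```

At test time the mask is simply not applied, which is exactly the train-test asymmetry the limitations section below discusses; the gate is what bridges the two regimes.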
The experimental design is rigorous and comprehensive. The authors evaluate on two distinct benchmarks (VSI-Bench for static, DSR-Bench for dynamic spatial reasoning) and compare against a strong set of baselines spanning proprietary APIs, general video VLMs, and prior geometry-injection methods. Results consistently show improvements, with particularly notable gains on dynamic tasks where appearance cues are less reliable. The ablation studies cleanly isolate the contributions of GUM and GGF, confirming that both components are necessary and that naive fusion can indeed be detrimental. Hyperparameter sensitivity and computational overhead are reported, demonstrating practical efficiency. The evaluation is thorough, though it relies heavily on existing benchmarks whose annotation quality the authors themselves note as a potential ceiling.
High. The paper provides explicit implementation details: backbone (Qwen2.5-VL-7B), geometry extractors (VGGT for static, π³ for dynamic), dataset splits, optimizer settings, learning rate schedules, batch sizes, masking ratios, and hardware configuration (4x H200 GPUs). Training and inference protocols are clearly distinguished, with masking explicitly disabled at test time. The inclusion of a public project page further supports reproducibility. Given the reliance on open-source components and standard training loops, independent replication should be straightforward.
The primary limitation is the training-only nature of the masking strategy, which introduces a train-test distribution shift (masked vs. full vision tokens) that the model must implicitly bridge via the gating mechanism. While empirically effective, this could theoretically destabilize optimization if not carefully tuned. Additionally, GeoSR's performance is inherently bounded by the quality of the external pretrained geometry models (VGGT/π³); it does not learn geometry end-to-end. The authors also acknowledge dataset limitations, noting that automatic/semi-automatic QA generation can introduce ambiguous or misaligned annotations, which caps the measurable gains of any model-side improvement. Finally, the gating and masking mechanisms add minor architectural complexity and inference latency compared to a purely frozen VLM, though the overhead is reported as negligible.
GeoSR provides a practical, low-overhead recipe for making geometric priors actionable in VLMs, which is highly relevant for downstream applications requiring precise spatial understanding, such as embodied AI, autonomous navigation, robotic manipulation, and augmented reality. By demonstrating that geometry tokens are not automatically useful and must be explicitly encouraged, the work shifts the community's focus from mere feature injection to controlled, task-aware fusion. This insight will likely influence how future multimodal architectures integrate 3D/structural priors, promoting more robust and reliable spatial reasoning in real-world deployments.
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
Primary: Tsinghua University
All Institutions: Tsinghua University, Zhipu AI
Vision2Web introduces a hierarchical benchmark and dual-verification evaluation pipeline for visual website development, offering a structured difficulty gradient that exposes current agent limitations. While the benchmark design and verification paradigm address a genuine gap in agent evaluation, the reliance on uncalibrated VLM judges, limited dataset scale, and incomplete coverage of true full-stack complexity constrain its immediate field-wide impact, positioning it as a solid but incremental contribution to the AI agent benchmarking landscape.
The paper introduces a logically structured, hierarchical benchmark that progressively scales from static UI-to-code translation to interactive multi-page reproduction and long-horizon full-stack development. The proposed workflow-based agent verification paradigm, combining a GUI agent verifier for functional/interactive checks with a VLM-based judge for visual/structural alignment, is a pragmatic step toward mitigating single-metric evaluation bias. However, the methodology relies heavily on VLM judges, which are known to suffer from prompt sensitivity, positional bias, and inconsistent scoring across diverse UI paradigms. The paper lacks a rigorous ablation isolating the contribution of each verification component, and does not sufficiently address how the verification pipeline handles edge cases (e.g., dynamic content, responsive breakpoints, or framework-specific abstractions).
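To make the dual-verification idea concrete, here is a minimal sketch of how a GUI verifier's functional checks and a VLM judge's visual scores might be combined into a task verdict. The class names, fields, and aggregation rule are hypothetical illustrations, not the paper's actual protocol:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    functional_pass: bool   # from the GUI agent verifier (interaction checks)
    visual_score: float     # from the VLM-based judge, normalized to [0, 1]

def aggregate(verdicts, visual_threshold=0.7):
    """Toy aggregation: a task passes only if every interactive check
    succeeds AND the averaged VLM judge score clears a threshold."""
    if not verdicts:
        return False
    functional_ok = all(v.functional_pass for v in verdicts)
    mean_visual = sum(v.visual_score for v in verdicts) / len(verdicts)
    return functional_ok and mean_visual >= visual_threshold
```

A scheme like this makes the single-metric bias concern tangible: tuning `visual_threshold`, or weighting the two components differently, changes pass rates, which is exactly why the missing ablation over verification components matters.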
The dataset comprises 193 tasks, 16 categories, 918 prototype images, and 1,255 test cases, offering a reasonable but not exhaustive coverage of modern web development workflows. The evaluation of multiple VLMs under different coding-agent frameworks correctly identifies substantial performance degradation at higher complexity tiers, validating the benchmark's difficulty gradient. However, the experimental section lacks statistical significance testing across runs, inter-rater reliability metrics for the VLM judge, and a detailed error taxonomy (e.g., layout drift vs. broken interactivity vs. missing backend logic). The scale is adequate for initial benchmarking but falls short of web-scale corpora needed for robust generalization claims.
While the paper structure implies standard open-science practices, the provided text lacks explicit dataset licensing, environment specifications, and verification pipeline hyperparameters. Reproducibility hinges on clear documentation of the GUI agent's action space, VLM judge prompts, temperature settings, and evaluation timeouts. Without standardized containerization or detailed configuration files, cross-lab reproducibility will likely suffer due to the inherent stochasticity of LLM/VLM outputs and GUI automation frameworks.
(1) VLM judge reliability remains unquantified and susceptible to known multimodal evaluation biases. (2) The "full-stack" claim is likely overstated, as true full-stack development involves database schema design, API routing, authentication, and deployment pipelines, which are rarely captured in visual-to-code benchmarks. (3) Dataset scale and domain diversity are limited compared to real-world web ecosystems. (4) High computational and latency overhead from multi-agent verification pipelines is not analyzed, limiting practical deployment.
The benchmark provides a much-needed standardized evaluation suite for autonomous web development agents, potentially accelerating progress in UI automation, low-code/no-code platforms, and developer productivity tools. However, widespread adoption of automated web generation raises concerns regarding code quality, accessibility compliance, security vulnerabilities, and the potential for automated generation of deceptive or malicious interfaces. The verification framework could also be adapted for broader software engineering evaluation tasks beyond frontend development.
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Kuaishou Technology (Kling Team)
The paper introduces a well-motivated hybrid memory paradigm and a dedicated synthetic dataset to address dynamic subject consistency during out-of-view intervals in video world models. While the architectural contributions are incremental and the evaluation is constrained to synthetic data, the rigorous methodology, clear metric design, and open-source release provide a solid, practically useful foundation that will likely be adopted by researchers tackling long-horizon video generation and memory-augmented diffusion models.
The paper correctly identifies a critical blind spot in contemporary video world models: the conflation of static environmental memory with dynamic subject tracking during out-of-view intervals. The proposed "Hybrid Memory" paradigm is conceptually well-motivated, demanding simultaneous background anchoring and latent trajectory extrapolation. HyDRA's architecture addresses this via a 3D-convolutional memory tokenizer that compresses historical latents into spatiotemporally aware tokens, followed by a dynamic affinity-based retrieval mechanism that replaces standard self-attention. The Top-K selection fused with a local temporal window is a pragmatic design that balances global recall with local denoising stability. However, the methodology remains an incremental module insertion into a standard DiT/Flow-Matching backbone rather than a fundamental architectural or learning paradigm shift. The reliance on heuristic hyperparameters (kernel sizes, K values) and the lack of theoretical grounding for the affinity metric limit its methodological depth.
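As an illustration of the retrieval design described above, the following sketch assumes a plain dot-product affinity and softmax weighting; the paper's actual affinity metric, tokenizer, and fusion rule are not specified here, so treat every detail as an assumption:

```python
import numpy as np

def topk_memory_attention(query, memory, top_k=4, local_window=2):
    """Sketch: attend over the union of (a) the top_k memory tokens by
    dot-product affinity and (b) the most recent local_window tokens.
    query: (d,) current latent; memory: (T, d) historical memory tokens."""
    affinity = memory @ query                        # (T,) relevance scores
    top_idx = np.argsort(affinity)[-top_k:]          # global: most relevant
    local_idx = np.arange(max(0, len(memory) - local_window), len(memory))
    idx = np.unique(np.concatenate([top_idx, local_idx]))
    selected = memory[idx]
    w = np.exp(affinity[idx] - affinity[idx].max())
    w /= w.sum()                                     # softmax over selection
    return w @ selected                              # (d,) retrieved context
```

The fusion of a global Top-K set with a fixed local window is the pragmatic balance the review points to: the Top-K term recalls a subject hidden many frames ago, while the local window anchors denoising stability.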
The experimental design is rigorous and well-structured. Training and evaluation are conducted on the newly introduced HM-World dataset, with clear comparisons against relevant baselines (DFoT, Context-as-Memory) and a strong commercial model (WorldPlay). The introduction of the DSC (Dynamic Subject Consistency) metric, leveraging YOLOv11 and CLIP to isolate subject-region fidelity, is a valuable contribution for evaluating temporal coherence in dynamic scenes. Quantitative results consistently show improvements across reconstruction fidelity (PSNR/SSIM) and consistency metrics. Qualitative results effectively demonstrate reduced subject distortion and vanishing during re-entry events. However, the exclusive reliance on a synthetic UE5 dataset for both training and evaluation raises valid concerns about domain shift and real-world generalization, which is only briefly touched upon in the supplementary material.
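The core of a subject-consistency metric like DSC can be sketched as follows, assuming subject crops have already been detected and embedded (e.g., CLIP features of detector boxes); the exact DSC formulation in the paper may differ:

```python
import numpy as np

def subject_consistency(subject_embeddings):
    """Sketch: mean cosine similarity between each frame's subject-crop
    embedding and the first (reference) frame's embedding.
    subject_embeddings: (F, d) array of per-frame subject features."""
    e = subject_embeddings / np.linalg.norm(
        subject_embeddings, axis=1, keepdims=True)
    return float(np.mean(e[1:] @ e[0]))
```

Isolating the metric to the subject region is what lets it detect frozen or vanishing subjects that frame-level PSNR/SSIM would average away.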
High. The authors provide comprehensive implementation details, including the base model (Wan2.1-T2V-1.3B), architectural hyperparameters, training schedule, and exact metric formulations. The codebase is publicly available, and the dataset generation pipeline is thoroughly documented. While the raw UE5 assets and procedural scripts are not fully open-sourced, the synthetic nature of the data and the clear metric definitions enable straightforward replication of the core methodology and evaluation protocol.
The authors explicitly acknowledge key constraints: performance degrades significantly in scenes with three or more interacting subjects or under severe occlusions, indicating limited scalability to complex multi-agent dynamics. The synthetic origin of HM-World restricts claims about real-world robustness. Additionally, the computational overhead of the dynamic retrieval mechanism compared to standard attention is not rigorously profiled, and the method is tightly coupled to DiT-based diffusion architectures, limiting immediate transfer to autoregressive or non-diffusion video models.
This work advances the practical capability of video world models to simulate physically plausible, long-horizon environments by decoupling static and dynamic memory retention. The paradigm directly benefits downstream applications requiring robust spatiotemporal consistency, including autonomous driving simulation, embodied AI training, and interactive virtual environments. By open-sourcing both the dataset and code, the paper establishes a focused benchmark that will likely accelerate community research into memory-augmented video generation and occlusion-aware temporal modeling.
Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains such as spreadsheet manipulation, VisionQA, and math reasoning show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills -- requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.
Primary: Unknown
All Institutions: Unknown
Trace2Skill introduces a structured, parallelized pipeline for distilling LLM execution traces into transferable declarative skills, offering practical improvements in agent specialization and cross-model generalization. While the methodology addresses a real bottleneck in automated skill generation and demonstrates compelling empirical gains, it remains an incremental engineering advancement within the automated prompt optimization space rather than a fundamental algorithmic breakthrough, with its broader impact contingent on rigorous validation, open-sourcing of orchestration details, and mitigation of consolidation-induced hallucination risks.
The core methodology proposes a parallelized, multi-agent pipeline to distill execution traces into declarative skills, explicitly designed to circumvent the sequential overfitting and fragmentation common in iterative prompt/skill generation. The hierarchical consolidation via inductive reasoning is a pragmatic engineering solution that mirrors expert knowledge curation. However, the approach remains fundamentally reliant on the base LLM's capacity for self-reflection, abstraction, and conflict resolution. The "parallel fleet" introduces significant compute overhead and coordination complexity, while the inductive consolidation step lacks formal guarantees against hallucinated or contradictory skill rules. The method is well-motivated and addresses a genuine bottleneck in agent development, but it operates within the established paradigm of automated prompt optimization and self-improvement rather than introducing a new learning paradigm or architectural innovation.
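The map-then-reduce structure described above can be sketched as follows, with `extract_lesson` and `merge` as trivial stand-ins for the LLM sub-agents and the inductive consolidation step (both are placeholders; the paper's actual prompts and conflict-resolution logic are undisclosed):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_lesson(trajectory):
    """Stand-in for an LLM sub-agent that turns one execution trace
    into a trajectory-specific lesson."""
    return f"lesson({trajectory})"

def merge(lessons):
    """Stand-in for the inductive consolidation step that merges a group
    of lessons into one conflict-free rule."""
    return " + ".join(lessons)

def consolidate(trajectories, fan_in=2):
    # Map phase: analyze all trajectories in parallel.
    with ThreadPoolExecutor() as pool:
        lessons = list(pool.map(extract_lesson, trajectories))
    # Reduce phase: hierarchically merge until one skill guide remains.
    while len(lessons) > 1:
        lessons = [merge(lessons[i:i + fan_in])
                   for i in range(0, len(lessons), fan_in)]
    return lessons[0]
```

The sketch makes the cost structure visible: the map phase parallelizes cleanly, but every reduce level is a sequential LLM call whose errors propagate upward, which is precisely the unverified-consolidation risk the review flags.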
The empirical evaluation spans spreadsheet manipulation, VisionQA, and mathematical reasoning, with claims of substantial gains over strong baselines (including Anthropic's official xlsx skills) and remarkable cross-scale transfer (e.g., 57.65 pp improvement on WikiTableQuestions when transferring Qwen3.5-35B-evolved skills to Qwen3.5-122B). While the reported gains are striking, the evaluation lacks sufficient ablation on the consolidation mechanism, fleet size, and trajectory diversity thresholds. The OOD generalization claims are promising but would benefit from stress-testing on structurally divergent tasks and explicit measurement of skill retrieval latency/accuracy. The cross-scale transfer result is particularly notable and suggests the distilled skills capture model-agnostic reasoning patterns, yet the evaluation does not fully disentangle whether improvements stem from better instruction formatting, reduced ambiguity, or genuine algorithmic insight extraction.
The framework's parameter-free nature and reliance on open-source 35B+ models strongly favor reproducibility. However, the absence of released prompt templates, sub-agent orchestration logic, hierarchical consolidation heuristics, and exact trajectory sampling protocols limits immediate replication. The paper claims no external retrieval modules or fine-tuning, which simplifies deployment, but the practical reproducibility hinges on detailed disclosure of the inductive reasoning prompts and conflict-resolution rules. Without accompanying code or configuration files, independent verification of the 57.65 pp jump and cross-model transfer will be challenging.
(1) High computational cost due to parallel trajectory analysis and multi-agent coordination, which may not scale efficiently for real-time or resource-constrained deployments. (2) Heavy dependence on the base model's inductive reasoning quality; weaker models may produce noisy, contradictory, or overly verbose skill directories. (3) Potential for skill bloat as the directory grows, requiring explicit pruning or versioning mechanisms not discussed. (4) Evaluation is concentrated on structured reasoning and QA domains; applicability to open-ended, creative, or highly interactive environments remains unproven. (5) The consolidation process lacks formal verification, risking the propagation of subtle logical errors across tasks.
Trace2Skill offers a practical pathway toward scalable, automated agent specialization, potentially reducing the manual burden of prompt engineering and skill authoring. By packaging complex execution experience into transferable, declarative guides, it democratizes high-performance agent deployment for organizations lacking large-scale fine-tuning infrastructure. However, automated skill generation introduces risks of embedding dataset biases, unsafe reasoning patterns, or opaque decision rules into production systems. The framework's reliance on self-generated trajectories also raises concerns about feedback loops where flawed skills reinforce suboptimal behaviors. Responsible deployment will require human-in-the-loop validation, skill auditing, and robust conflict-resolution protocols.
Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MedOpenClaw, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MedFlowBench, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MedOpenClaw and MedFlowBench establish a reproducible foundation for developing auditable, full-study medical imaging agents.
Primary: Technical University of Munich (TUM)
All Institutions: Technical University of Munich (TUM), TUM University Hospital, Carnegie Mellon University, Imperial College London, Munich Center for Machine Learning, National University of Singapore, University of Oxford
MedOpenClaw introduces an auditable runtime and benchmark for evaluating VLMs on full 3D medical imaging studies, revealing a critical spatial grounding bottleneck when agents use expert tools, though preliminary experiments and reliance on unreleased models limit its immediate technical impact.
The paper proposes a well-conceived systems framework (`MedOpenClaw`) that wraps a standard clinical viewer (3D Slicer) with a bounded, REST-based API to enable VLM-driven study navigation. The three-tier action space (primitive navigation, evidence capture, expert tools) is logically structured and directly maps to the benchmark's three evaluation tracks. The design prioritizes auditability by logging all viewer states and tool invocations, which is a meaningful step toward clinical transparency. However, the methodology is primarily an engineering integration rather than an algorithmic breakthrough. The agent reasoning relies entirely on off-the-shelf VLMs, and the paper does not introduce novel prompting strategies, memory architectures, or training objectives to overcome the identified spatial grounding bottleneck.
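A minimal sketch of such a bounded, auditable action layer follows. The action names, whitelist, and logging format are hypothetical illustrations, not MedOpenClaw's actual API:

```python
import json

class BoundedViewer:
    """Sketch of an auditable, bounded action space over a viewer backend.
    Every call is validated against a whitelist and appended to a log,
    mirroring the tiered navigate / capture / tool structure described."""
    ALLOWED = {"set_slice", "set_window", "capture_view", "run_tool"}

    def __init__(self, backend):
        self.backend = backend      # e.g. a REST client for the viewer
        self.audit_log = []

    def act(self, action, **kwargs):
        if action not in self.ALLOWED:
            raise ValueError(f"action {action!r} outside bounded action space")
        self.audit_log.append(json.dumps({"action": action, "args": kwargs}))
        return self.backend(action, **kwargs)
```

The design choice worth noting: because every tool invocation passes through one validated, logged chokepoint, the transcript of `audit_log` entries is itself the audit artifact, which is what makes the "auditable runtime" claim checkable in principle.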
The empirical evaluation is conceptually strong but practically flawed. The identification of the "tool-use paradox"—where adding expert segmentation tools degrades diagnostic accuracy due to imprecise coordinate grounding—is a highly valuable insight that challenges current assumptions about tool-augmented medical agents. However, the experimental setup is severely compromised by the use of non-existent/future model names (e.g., "GPT-5.4", "Gemini 3.1 Pro/Flash"), which undermines the credibility and reproducibility of the reported metrics. The benchmark relies on only two public datasets (UCSF-PDGM and NSCLC radiogenomics) and a single MCQ evaluation protocol, limiting the statistical robustness and generalizability of the conclusions.
Low to moderate. While the runtime architecture, dataset sources, and evaluation tracks are clearly described, the absence of a public code repository, exact prompt templates, and evaluation scripts makes immediate reproduction impossible. Furthermore, the reliance on unreleased commercial VLMs prevents independent verification of the baseline results. The promise of auditability is theoretically sound, but without open-sourcing the wrapper and logging infrastructure, the community cannot validate or build upon the system.
The authors appropriately acknowledge the narrow modality scope (only brain MRI and lung CT/PET), lack of longitudinal/EHR integration, and preliminary tool ecosystem. A critical unaddressed limitation is the lack of proposed solutions for the spatial grounding bottleneck; the paper diagnoses the problem but offers no algorithmic or architectural pathway to resolve it. Additionally, the use of placeholder/fictional model versions suggests the paper may be a draft or speculative submission, which significantly weakens its empirical foundation.
The work successfully shifts the evaluation paradigm from static 2D image recognition to dynamic, full-study clinical reasoning, aligning much closer with real-world radiology workflows. The emphasis on bounded, auditable interactions addresses critical regulatory and safety concerns for deploying AI in hospitals. The identified spatial grounding bottleneck provides a clear, actionable research direction for the medical agent community. If the infrastructure is open-sourced and expanded with diverse modalities and rigorous baselines, it could become a standard evaluation suite; currently, it serves as a strong conceptual blueprint with preliminary validation.
Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model's knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.
Primary: Cursor (Anysphere)
All Institutions: Cursor (Anysphere), Fireworks AI, Stanford University (Hazy Research), Colfax Research
Composer 2 demonstrates that combining continued code-focused pretraining with large-scale asynchronous reinforcement learning, supported by robust infrastructure and environment-matching, yields frontier-level agentic software engineering capabilities. The technical report provides a highly detailed, production-validated blueprint for training long-horizon coding agents, offering substantial practical value for the agentic AI community despite limited open reproducibility and incremental algorithmic novelty.
The paper presents a rigorous two-phase training pipeline for agentic coding: continued pretraining on code-heavy corpora followed by large-scale asynchronous reinforcement learning. Algorithmically, the approach is largely an integration of established techniques (policy gradients, KL regularization, self-distillation for MTP, context parallelism) rather than introducing fundamentally new learning paradigms. However, the methodological strength lies in careful engineering choices tailored to long-horizon agent stability: the deliberate selection of the $k_1$ KL estimator to avoid variance blow-up under distribution shift, the nonlinear length penalty to balance efficiency vs. reasoning depth, and the self-summarization mechanism to effectively extend context windows without KV cache bloat. The infrastructure methodology is exceptionally detailed, covering decoupled context/expert parallelism, custom NVFP4/MXFP8 quantization kernels with per-token scaling fixes, mid-rollout weight synchronization, and environment forking for RL stability. This represents a mature, production-grade RLVR pipeline optimized for coding agents.
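The variance argument for the $k_1$ estimator can be demonstrated numerically. Assuming the standard $k_1$/$k_3$ estimator family for approximating KL from samples (the report does not spell out its exact formulation), with samples $x \sim q$ and ratio $r = p(x)/q(x)$, $k_1 = -\log r$ is unbiased for $\mathrm{KL}(q\|p)$ with bounded variance, while $k_3 = (r-1) - \log r$ inherits the heavy tail of $r$ under large distribution shift:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Current policy q = N(0, 1); shifted reference p = N(3, 1).
x = rng.normal(0.0, 1.0, n)
log_r = -0.5 * (x - 3.0) ** 2 + 0.5 * x ** 2   # log p(x) - log q(x)

k1 = -log_r                    # unbiased for KL(q || p); variance stays tame
k3 = np.expm1(log_r) - log_r   # (r - 1) - log r; r explodes when p >> q

# Both target KL(q || p) = (3 - 0)^2 / 2 = 4.5, but under this shift the
# empirical variance of k3 is orders of magnitude larger than that of k1.
```

Here `log_r` is linear in `x`, so `k1` has variance $9$ regardless of the shift, whereas `k3` contains `exp(log_r)`, whose variance grows exponentially with the mean gap; that asymmetry is exactly the "variance blow-up under distribution shift" the pipeline is designed to avoid.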
The experimental suite is comprehensive and well-aligned with real-world usage. The model achieves strong scores on public benchmarks (73.7 on SWE-bench Multilingual, 61.7 on Terminal-Bench) and significantly outperforms prior iterations on the proprietary CursorBench (61.3). The ablation linking continued pretraining loss to downstream RL reward provides valuable empirical grounding for the two-stage pipeline. Evaluations are conducted in a harness that mirrors production, minimizing train-test mismatch. However, the heavy reliance on internal benchmarks (CursorBench, FreshBench, LoCoDiff variants) and closed evaluation environments limits external validation. The results are compelling but should be interpreted with the caveat that they are optimized for a specific product ecosystem.
Low to moderate. While the paper provides extensive architectural, training, and systems details (optimizer settings, parallelism configurations, kernel designs, RL pipeline structure), the core components required for exact reproduction are proprietary: the training data mix, the CursorBench dataset, the Anyrun environment infrastructure, and the base model weights. The training stack depends on specific hardware (NVIDIA Blackwell GPUs) and internal tooling (Ray-based async trainer, Fireworks AI inference partnership). Practitioners can adopt the algorithmic insights and infrastructure patterns, but replicating the exact results or training run is infeasible without access to Cursor's closed ecosystem.
The primary limitation is the closed nature of the dataset, benchmarks, and environment infrastructure, which restricts independent verification and community benchmarking. The paper focuses heavily on scaling and engineering rather than algorithmic innovation, leaving open questions about the generalizability of the RL techniques to non-coding or non-agentic domains. The compute requirements (1.04T/32B active MoE, multi-region GPU/CPU clusters) are substantial, creating high barriers to entry. Additionally, the single-epoch asynchronous RL regime, while efficient, introduces policy staleness that is mitigated via fast weight sync but not fully eliminated, and the long-horizon capabilities still degrade on tasks requiring hours of human-equivalent effort.
This work establishes a scalable blueprint for training domain-specialized agentic models, demonstrating that close alignment between training environments and deployment harnesses is critical for real-world performance. The infrastructure insights (async RL stability, environment forking, quantization-aware training, self-summarization) will likely influence how other teams build long-horizon AI agents. By pushing the frontier of automated software engineering, the model has the potential to significantly accelerate development workflows. However, it also highlights the growing compute and resource centralization in frontier AI, where only well-funded organizations can sustain the infrastructure required for large-scale RL training on realistic environments.