Top ML research, scored by Qwen 3.6 · 24 papers · all domains
Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating their alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{https://chat.lmsys.org}.
Chatbot Arena introduces a rigorously statistical, crowdsourced pairwise comparison platform that has become the de facto standard for dynamic LLM evaluation. By adapting classical Bradley-Terry ranking with robust confidence intervals, adaptive sampling, and anomaly detection, the paper delivers a highly scalable, transparent, and empirically validated evaluation framework that addresses critical bottlenecks in LLM alignment assessment and has already reshaped how the field measures and compares model capabilities.
The paper introduces a large-scale, crowdsourced pairwise comparison framework for LLM evaluation, grounded in classical ranking theory but carefully adapted for modern LLM assessment. The core methodology leverages the Bradley-Terry (BT) model for score estimation, augmented with robust statistical techniques including sandwich covariance estimation, pivot bootstrap confidence intervals, and an active sampling strategy that prioritizes model pairs with high ranking uncertainty. The inclusion of a sequential p-value-based anomaly detection mechanism for filtering malicious or repetitive users demonstrates thoughtful system design. While the underlying statistical machinery (BT models, active learning for pairwise comparisons) is well-established, the novelty lies in the rigorous adaptation, scaling, and operationalization of these methods for dynamic, open-ended LLM evaluation. The methodological rigor effectively bridges classical psychometrics with modern AI evaluation needs.
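The core estimator here is the Bradley-Terry model, under which the probability that model i beats model j is a logistic function of the score difference. As a rough illustration (not the paper's implementation), fitting BT scores reduces to maximum-likelihood gradient ascent on pairwise outcomes; the vote counts below are invented for the sketch.

```python
# Minimal Bradley-Terry fit on toy pairwise votes; data is synthetic,
# not from the Chatbot Arena dataset.
import numpy as np

def fit_bt(votes, n_models, lr=0.1, steps=2000):
    """votes: list of (winner, loser) index pairs.
    Returns log-strength scores xi with xi[0] pinned to 0 for identifiability."""
    xi = np.zeros(n_models)
    w = np.array([v[0] for v in votes])
    l = np.array([v[1] for v in votes])
    for _ in range(steps):
        # P(winner beats loser) = sigmoid(xi_winner - xi_loser)
        p = 1.0 / (1.0 + np.exp(-(xi[w] - xi[l])))
        grad = np.zeros(n_models)
        np.add.at(grad, w, 1.0 - p)      # d log-lik / d xi_winner
        np.add.at(grad, l, -(1.0 - p))   # d log-lik / d xi_loser
        xi += lr * grad / len(votes)
        xi -= xi[0]                      # fix the gauge freedom
    return xi

votes = [(0, 1)] * 70 + [(1, 0)] * 30 + [(0, 2)] * 80 + [(2, 0)] * 20
scores = fit_bt(votes, 3)
print(np.argsort(-scores))  # ranking, strongest first
```

In the paper this point estimate is wrapped with sandwich covariance estimates and bootstrap intervals; the sketch shows only the score-fitting step.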
The experimental section is comprehensive and empirically robust. The authors analyze over 240K votes across 50+ models and 100+ languages, demonstrating substantial scale. Topic modeling via BERTopic confirms high prompt diversity, with the largest cluster representing only 1% of data, effectively countering concerns about narrow evaluation domains. The validation of crowd vote quality against expert raters and LLM-as-a-judge shows strong agreement (~72-80%), with a small but explainable gap attributed to expert fact-checking. The active sampling rule is rigorously benchmarked against random sampling, demonstrating a ~54% reduction in required samples for equivalent precision. Simulation studies validate the coverage and width of the proposed confidence intervals. The experiments are well-designed, statistically sound, and directly address key concerns about crowdsourced data reliability.
The statistical pipeline is thoroughly documented with explicit equations, algorithmic descriptions, and references to established theoretical guarantees. The authors commit to releasing a 100K pairwise preference dataset, which will significantly aid reproducibility and secondary analysis. However, the live, continuously updating nature of the platform means exact leaderboard replication at a specific timestamp requires careful data versioning. The full platform codebase and deployment infrastructure are not open-sourced in the text, which limits end-to-end reproducibility of the data collection pipeline. Nevertheless, the core ranking and estimation procedures are fully reproducible given the released data.
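As one concrete illustration of the pivot bootstrap intervals used in the pipeline, a minimal sketch on a synthetic win-rate statistic (the data below is invented, not drawn from the released dataset) might look like:

```python
# Pivot (basic) bootstrap confidence interval for a win rate.
# Synthetic Bernoulli votes stand in for real pairwise outcomes.
import numpy as np

rng = np.random.default_rng(0)
wins = rng.binomial(1, 0.6, size=500)   # 1 = model won the pairwise vote
theta_hat = wins.mean()

boot = np.array([rng.choice(wins, size=wins.size, replace=True).mean()
                 for _ in range(2000)])
lo_q, hi_q = np.quantile(boot, [0.025, 0.975])
# Pivot interval reflects the bootstrap quantiles around the point estimate:
# [2*theta_hat - q_hi, 2*theta_hat - q_lo]
ci = (2 * theta_hat - hi_q, 2 * theta_hat - lo_q)
print(round(theta_hat, 3), tuple(round(c, 3) for c in ci))
```

The paper applies this machinery to BT scores rather than raw win rates, but the pivoting step is the same.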
The authors transparently acknowledge several limitations. First, the user base skews toward LLM enthusiasts and researchers, potentially introducing demographic and expertise bias that may not reflect general population preferences. Second, the chat-interface prompt distribution may not capture specialized, production, or enterprise use cases. Third, the evaluation focuses exclusively on helpfulness/capability, entirely omitting safety, toxicity, and alignment robustness metrics. Finally, the anomaly detection method relies on heuristic p-value thresholds and random sampling intervals, which, while practical, lacks formal sequential testing guarantees and could be circumvented by sophisticated adversaries.
Chatbot Arena has fundamentally shifted the LLM evaluation paradigm from static, easily-gamed benchmarks to dynamic, human-preference-driven leaderboards. By democratizing evaluation through an open, crowdsourced platform, it provides transparent, real-time feedback that accelerates model development and alignment research across both academia and industry. The platform's widespread adoption as a reference standard underscores its utility, though it also introduces risks of leaderboard overfitting and potential manipulation. The work establishes a scalable blueprint for continuous, preference-based AI evaluation that will likely influence future multimodal, agentic, and safety-focused assessment frameworks.
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
KTO introduces a prospect-theoretic framework for LLM alignment that replaces paired preference optimization with a binary utility-maximizing objective, demonstrating that carefully designed inductive biases can match or exceed preference-based methods while drastically reducing data requirements. By unifying DPO, PPO, and novel binary feedback under the "human-aware loss" paradigm, the paper provides both a theoretically grounded and empirically robust alternative to standard alignment pipelines, with widespread adoption in open-source frameworks and production settings validating its practical significance.
The paper introduces a rigorous theoretical framework by mapping LLM alignment objectives to Kahneman & Tversky's prospect theory, coining the term "human-aware losses" (HALOs). It successfully demonstrates that popular methods like DPO and PPO-Clip implicitly encode behavioral biases (e.g., loss aversion, reference dependence). The proposed KTO objective is mathematically derived to directly optimize a prospect-theoretic value function using only binary desirable/undesirable labels, bypassing the need for paired preference data. The substitution of the canonical power-law value function with a logistic form for numerical stability, along with explicit risk/loss aversion hyperparameters, is well-motivated. The practical reference point estimation via microbatch shifting is a clever engineering compromise, though it introduces a controlled bias justified by cognitive availability heuristics. The methodology elegantly bridges behavioral economics and modern alignment theory.
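To make the objective concrete, here is a minimal numpy sketch of a KTO-style loss on precomputed sequence log-probs. The reference-point estimate and hyperparameter values are simplified stand-ins: the paper estimates the KL reference point from shifted microbatches inside a token-level training loop, and detaches it from the gradient.

```python
# Illustrative KTO loss on precomputed log-probs (numpy stand-in for the
# token-level implementation); all numbers below are synthetic.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kto_loss(logp_policy, logp_ref, desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """logp_*: per-example sequence log-probs; desirable: boolean array."""
    r = logp_policy - logp_ref            # implied reward: policy/ref log-ratio
    z0 = np.maximum(0.0, np.mean(r))      # crude reference point (microbatch KL in the paper)
    value = np.where(desirable,
                     lambda_d * sigmoid(beta * (r - z0)),    # gains side
                     lambda_u * sigmoid(beta * (z0 - r)))    # losses side
    lam = np.where(desirable, lambda_d, lambda_u)
    return np.mean(lam - value)

logp_policy = np.array([-10.0, -12.0, -15.0, -9.0])
logp_ref    = np.array([-11.0, -11.5, -14.0, -10.0])
desirable   = np.array([True, True, False, False])
print(kto_loss(logp_policy, logp_ref, desirable))
```

Note how each example needs only a binary desirable/undesirable label, not a paired preference, which is the data-efficiency point the review highlights.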
The empirical evaluation is comprehensive and rigorously controlled. The authors test across multiple model families (Pythia, Llama, Mistral, Qwen) and scales (1B to 30B), using standardized benchmarks (AlpacaEval, BBH, GSM8K, MMLU) and both GPT-4 and human judgments. Key findings are robust: KTO matches or exceeds DPO performance, handles extreme class imbalance (up to 90% data dropout), and can bypass SFT when base models are sufficiently capable. The ablation studies effectively isolate the contributions of the reference point, value function curvature, and loss symmetry. The experiments convincingly demonstrate that binary feedback, when paired with the correct inductive bias, is sufficient for high-quality alignment.
Excellent. The authors provide open-source code, released model weights, and detailed hyperparameter recommendations (learning rates, batch sizes, risk/loss aversion coefficients) tailored to different model sizes and data regimes. Implementation details such as the biased KL estimate, microbatch shifting, and gradient behavior are thoroughly documented. The clear mapping between theoretical constructs and practical training loops ensures straightforward replication by the community.
The reference point estimation relies on microbatch composition, which can introduce variance and depends on batch size/shuffling strategies. The binary labeling assumption oversimplifies real-world feedback, which is often graded, contextual, or ambiguous. The theoretical analysis assumes specific value function forms (logistic) and may not capture the full spectrum of human utility across diverse domains or cultures. Additionally, the paper acknowledges that KTO may underfit on low-noise, highly transitive preference datasets where DPO's margin maximization is theoretically optimal. The majority-preference resolution mechanism also raises fairness concerns in heterogeneous populations.
KTO significantly lowers the data collection barrier for alignment by enabling training on cheap, abundant binary signals rather than expensive paired preferences. This democratizes alignment research and enables rapid iteration in production environments. The authors responsibly address ethical implications, noting that majority-preference optimization may marginalize minority viewpoints and that alignment to unrepresentative datasets risks propagating biases. The work advocates for ecologically valid evaluations and fairness-aware loss design, setting a constructive direction for future alignment research.
Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pretraining compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods, and current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute varies critically with the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which allocates test-time compute adaptively per prompt as effectively as possible. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
The paper introduces a difficulty-aware, compute-optimal test-time scaling framework that dynamically allocates inference compute between parallel search and sequential revision, demonstrating that adaptive test-time compute can outperform massively larger models while improving efficiency by over 4x. This work fundamentally shifts the paradigm from uniform pretraining scaling to intelligent inference scaling, providing a rigorous, empirically validated blueprint that will heavily influence the design of next-generation reasoning LLMs and cost-aware AI deployment strategies.
The paper introduces a principled "compute-optimal" framework for test-time scaling, moving decisively beyond naive best-of-N sampling. It systematically evaluates two core mechanisms: process-based verifier (PRM) guided search and adaptive sequential revision. The central methodological contribution is the difficulty-aware allocation strategy, which dynamically distributes a fixed inference compute budget based on prompt hardness. This formalizes the trade-off between parallel sampling and sequential refinement, offering a theoretically grounded and practically actionable scaling law. The methodology is rigorously structured, with clear ablations on PRM aggregation strategies (min, prod, last-step), hierarchical vs. flat revision trajectory selection, and verifier training protocols. The approach directly addresses the diminishing returns of uniform test-time compute scaling and provides a clear algorithmic blueprint for adaptive inference.
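The PRM aggregation ablation can be illustrated with a toy best-of-N selector using the "min" strategy, which the paper compares against "prod" and "last-step" aggregation; the candidates and step scores below are invented.

```python
# Sketch of PRM-scored best-of-N selection with "min" step aggregation:
# a solution is rated by its weakest verified step. Scores are made up.
def select_answer(candidates):
    """candidates: list of (answer, per-step PRM scores)."""
    def min_agg(steps):
        return min(steps)        # one bad step sinks the whole trace
    best = max(candidates, key=lambda c: min_agg(c[1]))
    return best[0]

candidates = [
    ("42", [0.9, 0.8, 0.95]),    # solid throughout -> aggregate 0.8
    ("17", [0.99, 0.2, 0.97]),   # one weak step -> aggregate 0.2
    ("42", [0.7, 0.75, 0.7]),
]
print(select_answer(candidates))  # -> "42"
```

The compute-optimal strategy goes further by deciding, per prompt, how many such candidates to sample versus how many sequential revisions to spend; this sketch shows only the verifier-scoring half.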
The experimental design is robust and highly aligned with current industry challenges. Using the MATH benchmark and PaLM 2-S* as a base, the authors conduct extensive FLOPs-matched comparisons, demonstrating that adaptive test-time compute can outperform a 14x larger model. The >4x efficiency gain over best-of-N is a standout empirical result. The appendices provide thorough ablations, including PRM vs. ORM performance gaps, the impact of soft MC return labels on aggregation, and the degradation observed when applying ReST-EM to revision models. The difficulty binning experiments (both oracle and predicted) strongly validate the core hypothesis. However, the evaluation is heavily concentrated on mathematical reasoning, leaving generalization to code, long-form generation, or multi-step planning as an open empirical question.
The paper offers strong reproducibility signals. Training hyperparameters (AdamW, learning rates, batch sizes, dropout), data generation pipelines (16 MC rollouts per step, strict filtering of invalid parses), and prompting templates are explicitly detailed. The PRM training uses soft MC return labels with binary cross-entropy, and the revision model training includes careful trajectory construction with edit-distance matching for incorrect answers. The authors transparently acknowledge that oracle difficulty estimation (2048 samples/question) is computationally prohibitive for production, and they leave cheaper difficulty estimation as future work. While the methodological transparency is high, the absence of explicit code or model weight links in the provided text slightly hinders immediate replication.
Several critical limitations are present. First, the compute-optimal strategy depends on accurate difficulty estimation, which in practice requires either ground-truth labels or expensive sampling, creating a deployment bottleneck. Second, revision models suffer from distribution shift, necessitating separate ORM training rather than reusing the base PRM. Third, RL-based optimization (ReST-EM) degrades performance due to spurious correlations in online revision data, highlighting instability in self-improvement loops. Finally, the heavy reliance on mathematical benchmarks limits claims about broader domain generalization, and the framework assumes access to high-quality verifiers, which remain expensive to train and calibrate.
This work has profound implications for LLM development and deployment. By demonstrating that test-time compute can be scaled more efficiently than parameter scaling, it challenges the prevailing "bigger is better" paradigm and offers a cost-effective pathway for achieving high reasoning performance. The compute-optimal framework will likely become standard practice in production systems, enabling dynamic resource allocation that balances latency, cost, and accuracy. Environmentally, it promotes more efficient compute utilization, potentially reducing the carbon footprint of AI inference. However, it also raises accessibility concerns, as optimal test-time scaling requires sophisticated orchestration and verifier infrastructure that may concentrate advanced reasoning capabilities among well-resourced organizations.
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While much is left to explore, this work opens the door to the possibility of models that can continually improve on both axes.
The paper introduces an iterative self-alignment framework where an LLM generates and evaluates its own preference data via LLM-as-a-Judge prompting to train itself with Iterative DPO, demonstrating that self-rewarding loops can improve both instruction following and reward modeling capabilities. While the approach effectively synthesizes existing techniques into a practical, computationally efficient pipeline and achieves strong leaderboard results, its reliance on length-biased judge metrics, limited iteration depth, and lack of safety evaluation prevent it from achieving higher scores, though it remains a highly influential contribution to the self-improving LLM literature.
The paper introduces a clean, iterative self-alignment pipeline that unifies instruction generation, self-evaluation, and preference optimization. By fine-tuning a base LLM on both instruction-following (IFT) and evaluation (EFT) data, then using Iterative DPO with self-generated preference pairs via LLM-as-a-Judge, the authors create a computationally efficient alternative to online RLHF. The additive scoring prompt design is a practical improvement over categorical judging, and the separation of seed data types is well-motivated. However, the core architecture is not novel; it synthesizes existing components (Self-Instruct, LLM-as-a-Judge, Iterative DPO) into a cohesive loop. The reliance on a fixed external model for prompt generation in the main experiments slightly undermines the "fully self-contained" premise, though ablations show it's feasible.
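The preference-pair construction step of the loop can be sketched as follows; the judge scores here are placeholders for the LLM-as-a-Judge call with the paper's additive rubric (scores out of 5), and the prompts and responses are invented.

```python
# Hedged sketch of building DPO preference pairs from self-assigned judge
# scores: highest-scored response becomes "chosen", lowest "rejected",
# and ties are discarded since they carry no preference signal.
def build_pairs(prompt_to_scored):
    """prompt_to_scored: {prompt: [(response, judge_score_0_to_5), ...]}"""
    pairs = []
    for prompt, scored in prompt_to_scored.items():
        best = max(scored, key=lambda s: s[1])
        worst = min(scored, key=lambda s: s[1])
        if best[1] > worst[1]:               # skip ties
            pairs.append({"prompt": prompt,
                          "chosen": best[0], "rejected": worst[0]})
    return pairs

data = {"Explain DPO.": [("short answer", 2.0), ("detailed answer", 4.5)],
        "Say hi.": [("hi", 3.0), ("hello", 3.0)]}   # tie -> dropped
pairs = build_pairs(data)
print(len(pairs))  # 1
```

Each iteration trains on the pairs produced by the previous iteration's model, which is what lets both the policy and its self-rewarding ability improve together.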
The empirical evaluation is extensive, covering head-to-head GPT-4 judgments, AlpacaEval 2.0, MT-Bench, and standard NLP benchmarks. The consistent gains across three iterations on a 70B model are compelling and demonstrate that self-generated preference data can meaningfully improve alignment. However, the evaluation heavily depends on LLM-as-a-Judge metrics, which are known to correlate strongly with response length and verbosity. The paper acknowledges this length bias but does not fully disentangle stylistic improvements from genuine capability gains. Human evaluation is limited to 50 prompts by the authors, which is insufficient for robust validation. The claim of surpassing GPT-4 0613 is notable but must be contextualized within the known limitations of leaderboard metrics.
High. The methodology is described with exceptional clarity, including exact prompt templates, hyperparameters, data filtering criteria, and training schedules. The use of publicly available models (Llama 2 70B) and datasets (Open Assistant) ensures the pipeline is accessible. While no explicit code repository is linked in the provided text, the level of detail provided is sufficient for independent reproduction by experienced practitioners.
The authors transparently identify several critical constraints: the loop is only tested for three iterations, leaving long-term stability and saturation unexplored; the method exhibits clear length bias, which may artificially inflate judge scores; safety and harmlessness are not evaluated, raising concerns about unchecked self-rewarding; and the reliance on GPT-4 for evaluation introduces circularity risks. Furthermore, gains on mathematical and logical reasoning tasks are minimal, indicating the approach primarily optimizes stylistic alignment and instruction adherence rather than core reasoning capabilities.
This work significantly advances the paradigm of scalable, human-free LLM alignment by demonstrating that models can iteratively improve their own training signals. It reduces dependency on costly human preference annotation and provides a practical blueprint for self-improving systems. However, it also highlights urgent research needs around reward hacking, judge bias, and safety in autonomous alignment loops. The methodology is highly likely to influence both open and closed-source alignment pipelines, though practitioners must implement robust safeguards against degenerate self-rewarding behaviors.
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).
ORPO introduces a unified, reference-free objective that merges supervised fine-tuning and preference alignment into a single training phase using odds ratios. The work addresses a major practical bottleneck in LLM alignment by eliminating the need for a separate reference model and multi-stage training pipelines, which significantly reduces computational overhead and simplifies deployment for resource-constrained teams. While the theoretical foundation builds upon established preference optimization frameworks rather than introducing a fundamentally new learning paradigm, the algorithmic simplification is elegant and empirically robust across model scales. The substantial citation count reflects rapid adoption within the open-weight community, where it has become a standard alternative to DPO for efficient alignment. However, it remains an evolutionary refinement of existing preference optimization techniques rather than a foundational shift in how models learn from human feedback, placing it firmly in the tier of highly impactful methodological improvements rather than field-redefining breakthroughs.
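As a rough sketch of the objective (not the released implementation), the odds-ratio term contrasts the length-normalized likelihoods of the favored and disfavored responses, and is added as a small penalty on top of the ordinary SFT loss; the log-prob values below are synthetic.

```python
# Illustrative ORPO objective on precomputed length-averaged log-probs
# (numpy stand-in for the token-level implementation).
import numpy as np

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """logp_*: length-normalized log-likelihoods of chosen/rejected responses."""
    def log_odds(logp):
        p = np.exp(logp)
        return logp - np.log1p(-p)                 # log(p / (1 - p))
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -np.log(1.0 / (1.0 + np.exp(-ratio)))   # -log sigmoid(log odds ratio)
    l_sft = -logp_chosen                           # NLL of the favored response
    return np.mean(l_sft + lam * l_or)             # single monolithic objective

print(orpo_loss(np.array([-0.4]), np.array([-1.2])))
```

Because both terms depend only on the current policy's likelihoods, no frozen reference model is needed, which is the source of the memory and pipeline savings described above.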
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
The paper introduces a fully autonomous reinforcement learning framework where a language model generates its own reasoning tasks and leverages a code executor as a unified, verifiable reward signal to iteratively improve without external supervision. This work directly addresses a critical scalability bottleneck in advanced reasoning systems: the finite supply of high-quality, human-curated training data. By unifying task proposal, execution-based verification, and curriculum self-evolution into a closed loop, it offers a compelling pathway toward open-ended, self-directed capability growth. The methodological advance lies in replacing static, domain-specific datasets with a dynamic, execution-grounded reward mechanism that naturally scales with the model's own competence. While the conceptual lineage traces back to self-play and RLVR paradigms, the integration of a code executor as a zero-data oracle represents a meaningful step toward truly autonomous training pipelines. The approach demonstrates strong cross-scale performance and practical compatibility, suggesting immediate utility for reducing data dependency in reasoning-focused models. However, long-term stability, susceptibility to reward hacking in fully autonomous loops, and the implicit reliance on a capable pretrained base model remain important considerations. If these challenges are systematically addressed, the paradigm could substantially reshape how the field approaches scalable reasoning training and inspire a new generation of self-evolving AI systems.
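The executor-as-oracle idea can be sketched with a toy deduction check: a proposed (program, input, output) triple earns reward only if running the program actually produces the claimed output. This is a simplification of AZR's full task space, and the trusted sandboxing is assumed rather than shown.

```python
# Minimal sketch of a code executor as the verifiable-reward oracle.
# In practice execution must run in a sandbox; exec() here is illustrative.
def verify_deduction(program_src, inp, claimed_output):
    """Run `f(inp)` from program_src in a scratch namespace and compare."""
    ns = {}
    exec(program_src, ns)                     # sandboxed in a real system
    return ns["f"](inp) == claimed_output

task = ("def f(x):\n    return sorted(x)[::-1]", [3, 1, 2], [3, 2, 1])
reward = 1.0 if verify_deduction(*task) else 0.0
print(reward)  # 1.0
```

Because the executor grounds both task validation and answer checking, the proposer and solver roles can share one reward source without any human-labeled data, which is the crux of the paradigm.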
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
The report introduces a highly efficient multimodal foundation model family capable of processing and reasoning over millions of tokens, establishing new empirical benchmarks for long-context retrieval, multimodal QA, and compute-efficient deployment. While the architectural foundations remain rooted in established transformer scaling and mixture-of-experts paradigms rather than introducing a fundamentally new algorithm, the systematic demonstration of extreme context viability and the rigorous efficiency-quality trade-offs in the lightweight variant provide crucial empirical data for the community. The work operates at the frontier of systems-level optimization and capability scaling, setting a new practical ceiling for long-context multimodal reasoning that will guide subsequent architectural research, benchmark design, and real-world deployment strategies, though its primary contribution lies in pushing engineering boundaries and demonstrating emergent capabilities rather than redefining core machine learning theory.
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
This paper introduces Diffusion Transformers (DiTs), demonstrating that standard Vision Transformers can replace U-Nets in latent diffusion models and scale predictably with compute, establishing a new architectural paradigm for generative modeling. Through rigorous empirical analysis, the authors prove that forward-pass Gflops and token resolution are the primary determinants of sample quality, introduce the highly effective adaLN-Zero conditioning mechanism, and achieve state-of-the-art ImageNet generation results with superior compute efficiency. The work's methodological clarity, reproducible design, and clear scaling laws have made it a foundational reference that directly shaped the trajectory of modern generative AI, justifying its high impact despite the apparent simplicity of the architectural shift.
The paper presents a clean, systematic architectural substitution: replacing the convolutional U-Net backbone in latent diffusion models with a standard Vision Transformer (ViT). The methodological rigor lies in its controlled ablation of conditioning mechanisms (in-context, cross-attention, adaLN, adaLN-Zero), culminating in the introduction of the adaLN-Zero block, which stabilizes training by initializing residual pathways as identity functions. Crucially, the authors shift the complexity metric from parameter count to forward-pass Gflops, correctly identifying compute density and token resolution as the primary drivers of performance. The methodology is deliberately minimalistic, avoiding unnecessary architectural bells and whistles to isolate scaling behavior.
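The identity-at-initialization property of adaLN-Zero can be shown in a few lines. This is a toy numpy sketch, not the paper's implementation: the modulation projection that maps the conditioning vector to shift, scale, and gate is zero-initialized, so the residual branch contributes nothing at the start of training.

```python
import numpy as np

# Illustrative adaLN-Zero block (numpy sketch, not the paper's code).
# The conditioning vector c is mapped to (shift, scale, gate); because the
# modulation projection is zero-initialized, the block is the identity at init.

rng = np.random.default_rng(0)
d = 8  # hidden width (toy size)

# Zero-initialized projection from conditioning to (shift, scale, gate).
W_mod = np.zeros((d, 3 * d))

def layernorm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_block(x, c, sublayer):
    shift, scale, gate = np.split(c @ W_mod, 3, axis=-1)
    h = layernorm(x) * (1 + scale) + shift   # adaptive LayerNorm
    return x + gate * sublayer(h)            # gated residual; zero gate at init

x = rng.standard_normal((4, d))   # 4 tokens
c = rng.standard_normal((1, d))   # conditioning (e.g., timestep + class embedding)
out = adaln_zero_block(x, c, sublayer=lambda h: h @ rng.standard_normal((d, d)))
print(np.allclose(out, x))  # → True: the block starts as the identity
```

Training then gradually "opens" each residual pathway as `W_mod` moves away from zero, which is the stabilization effect the ablations credit for adaLN-Zero's advantage over plain adaLN.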
The experimental design is comprehensive and highly informative. The authors sweep across 12 model configurations (varying depth/width and patch size) on ImageNet at 256x256 and 512x512 resolutions, producing clear, monotonic scaling curves that correlate Gflops with FID. The finding that sampling compute cannot compensate for insufficient model compute is a critical practical insight that challenges common practitioner heuristics. The models achieve state-of-the-art FID scores (2.27 at 256x256) while using significantly fewer Gflops than prior U-Net baselines (ADM, LDM). Evaluation uses a robust suite of metrics (FID, sFID, IS, Precision/Recall) and adheres to standardized evaluation pipelines, ensuring fair comparisons.
Excellent. The paper provides exact training hyperparameters (optimizer, learning rate, batch size, EMA decay), diffusion schedules, and VAE configurations. The architecture relies on standard components (ViT blocks, linear patchify, standard positional embeddings) with only minor, well-documented modifications (adaLN-Zero). The authors explicitly note that learning rate warmup and regularization were unnecessary, simplifying the training recipe. Open-source code and project pages are provided, and the JAX/TPU implementation details are transparent, making replication highly feasible for well-resourced labs.
The primary limitation is the reliance on a frozen, pre-trained VAE from Stable Diffusion, which introduces a fixed representational bottleneck and caps maximum achievable quality. The paper does not explore joint VAE-diffusion training or alternative latent spaces. Additionally, the work is strictly limited to class-conditional image generation on ImageNet; cross-modal conditioning (e.g., text-to-image) and temporal/video extensions are deferred to future work. Finally, while Gflops are a better proxy than parameter count, they still do not fully capture hardware-specific bottlenecks like memory bandwidth or kernel fusion efficiency, which can affect real-world training/inference costs.
This paper fundamentally catalyzed the industry-wide transition from U-Net to transformer backbones in diffusion models, directly influencing subsequent text-to-image, video, and 3D generation architectures. By establishing clear scaling laws and demonstrating compute efficiency, it provides a blueprint for resource-aware generative model development. However, the push toward larger transformer-based diffusion models inherently increases computational and energy costs, raising standard concerns regarding environmental impact and the centralization of generative AI research in well-funded institutions.
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
Movie Gen introduces a scalable, multi-modal foundation model architecture that achieves high-fidelity 1080p video generation with synchronized audio and precise editing capabilities through systematic innovations in latent representation, training recipes, and long-context scaling. The work stands out for its comprehensive treatment of the generative media pipeline, moving beyond isolated architectural modifications to deliver practical simplifications in data curation, distributed training, and inference optimization that directly address the computational bottlenecks of high-resolution temporal modeling. While the core paradigm builds on established diffusion and autoregressive transformer frameworks rather than introducing a fundamentally new algorithmic mechanism, the paper’s primary value lies in its rigorous scaling analysis, transparent training protocols, and demonstration that careful system-level engineering can yield state-of-the-art performance across multiple synthesis tasks. Compared to prior closed-source or lower-resolution efforts, this release provides a highly actionable blueprint for the open research community, though its long-term influence will ultimately depend on how effectively other groups adopt its scaling recipes and whether the proposed evaluation standards become widely accepted benchmarks in a rapidly evolving landscape.
We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768x1360 pixels. Previous video generation models often had limited movement and short durations, and it is difficult for them to generate videos with coherent narratives from text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, improving both the compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame-pack technique, CogVideoX is adept at producing coherent, long-duration videos of varying shapes, characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance on both multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video captioning model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
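Back-of-envelope arithmetic shows why joint spatio-temporal compression matters for sequence length. The ratios below (8x per spatial side, 4x temporal) are representative values for this class of 3D VAE, used here purely for illustration rather than as the paper's exact figures.

```python
# Token-count arithmetic for a 10 s, 16 fps, 768x1360 video under a 3D VAE
# with illustrative compression ratios (not the paper's exact configuration).

frames, height, width = 10 * 16, 768, 1360

# Per-frame image VAE: compresses spatially only (8x per side).
tokens_2d = frames * (height // 8) * (width // 8)

# 3D VAE: additionally compresses 4x along the temporal axis.
tokens_3d = (frames // 4) * (height // 8) * (width // 8)

print(tokens_2d, tokens_3d, tokens_2d // tokens_3d)  # → 2611200 652800 4
```

Even the compressed count is large, which is why the attention backbone's cost, rather than the VAE itself, remains the dominant scaling concern for long, high-resolution videos.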
CogVideoX introduces a scalable open-weight text-to-video diffusion framework that combines a 3D causal VAE, an expert transformer with adaptive layer normalization, and progressive multi-resolution training to generate long-duration, high-fidelity videos with strong semantic alignment. The work represents a substantial engineering and architectural synthesis that directly addresses the temporal coherence and cross-modal fusion bottlenecks prevalent in earlier video diffusion models. By releasing full model weights alongside a robust data curation and captioning pipeline, the authors have effectively democratized access to high-end video generation capabilities, catalyzing widespread adoption and downstream research as strongly validated by its extensive citation footprint. While the individual architectural components build upon established diffusion and transformer paradigms rather than introducing a fundamentally new learning framework, their careful integration, scaling strategy, and training curriculum yield a highly practical system that sets a rigorous open-source baseline for the community. The paper’s significance lies less in theoretical novelty and more in its comprehensive system design, reproducible training methodology, and the tangible acceleration it provides to open video generation research, positioning it as a cornerstone reference for practitioners bridging the gap between closed commercial systems and accessible academic models.
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
The paper establishes a scalable, empirically validated pre-training recipe for multimodal LLMs by demonstrating that strategic data mixing and vision encoder capacity substantially outweigh complex connector designs in driving few-shot and in-context learning performance. This work delivers a rigorous, large-scale empirical analysis that systematically isolates the contributions of data composition, visual resolution, and projection architecture, effectively shifting community focus away from increasingly intricate connector modules toward the foundational role of curated data pipelines and encoder scaling. While it does not introduce a fundamentally new architecture or training paradigm, the comprehensive ablation framework and the resulting reproducible scaling strategy provide highly actionable guidance that has been rapidly adopted across research groups. The strong citation trajectory and widespread reference in subsequent multimodal development underscore its practical utility, cementing its status as a key empirical reference for MLLM engineering, even as it remains an incremental rather than paradigm-shifting contribution relative to earlier architectural breakthroughs in the vision-language space.
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
Chameleon introduces a stable training paradigm and architectural framework for early-fusion, token-based models that natively process and generate interleaved text and image sequences. While the field has largely converged on late-fusion architectures that attach vision encoders to pretrained language models, this work demonstrates that treating visual and linguistic data as a unified token stream from inception is both feasible and highly capable at scale. The technical merit lies in its comprehensive stabilization recipe, alignment methodology, and architectural adjustments that overcome the notorious training instabilities of mixed-modal autoregressive modeling. Compared to specialized pipelines like diffusion-based generators or cross-attention vision-language adapters, Chameleon offers a more cohesive approach to multimodal document understanding and generation, effectively unifying capabilities that typically require separate models. However, autoregressive discrete token generation still faces inherent trade-offs in computational efficiency and fine-grained visual fidelity relative to modern diffusion systems, which limits its immediate dominance in high-resolution image synthesis. Nevertheless, the paper establishes a crucial blueprint for truly interleaved multimodal foundation models and will likely drive significant follow-up research into unified sequence modeling, positioning it as a strong, field-advancing contribution that bridges the gap between understanding and generation without relying on modular late-fusion compromises.
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.
DeepSeek-R1 demonstrates that large language models can autonomously develop advanced reasoning capabilities through pure, outcome-based reinforcement learning without supervised fine-tuning, fundamentally challenging conventional post-training paradigms and establishing a highly scalable, verifier-driven blueprint for future AI development.
The paper presents a methodologically elegant and strategically bold approach to LLM post-training. By bypassing supervised fine-tuning (SFT) and applying Group Relative Policy Optimization (GRPO) directly to a base model with purely rule-based, outcome-only rewards, the authors successfully demonstrate that complex reasoning behaviors (self-reflection, verification, strategy switching) can emerge organically. The multi-stage pipeline for DeepSeek-R1 (cold-start conversational data → RL → rejection sampling/SFT → safety/helpfulness RL) is a pragmatic engineering solution that addresses the readability, safety, and generalization gaps of the pure RL phase. However, the methodology heavily relies on the existence of deterministic verifiers (math/code), and the transition to model-based rewards for general domains reintroduces the very reward-hacking vulnerabilities the authors sought to avoid. The choice of GRPO over PPO is well-motivated for compute efficiency, but the paper lacks a rigorous ablation comparing GRPO against modern PPO variants or other policy gradient methods in this specific setting.
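The compute advantage of GRPO over PPO comes from replacing the learned value function with a group-relative baseline. The sketch below shows only that advantage computation, under stated simplifications: the policy update, KL penalty, and clipping are omitted, and the binary reward is a stand-in for the paper's rule-based verifier.

```python
import statistics

# Sketch of GRPO's group-relative advantage (illustrative; omits the policy
# update, KL penalty, and clipping). For each prompt, a group of rollouts is
# scored by a rule-based verifier, and each rollout's advantage is its reward
# standardized within the group — no learned value function is required.

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 4 rollouts for one math prompt: reward 1.0 if the
# final answer matched the ground truth, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # → [1.0, -1.0, -1.0, 1.0]
```

Note the zero-variance guard: when every rollout in a group gets the same reward, the group carries no learning signal and all advantages collapse to zero, which is one reason sampling diverse rollouts per prompt matters in this setup.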
The experimental suite is comprehensive, spanning competitive mathematics (AIME 2024, CNMO), coding (Codeforces, LiveCodeBench, SWE-bench), and advanced STEM (GPQA). The staged evaluation (R1-Zero → Dev1/2/3 → Final R1) clearly isolates the contribution of each training phase, showing that RL drives reasoning gains while SFT/alignment restores instruction-following and safety. Performance metrics are strong and position the model at the frontier. A minor critique is that direct, controlled comparisons against SFT-only and SFT+RL baselines on identical data splits are deferred to supplementary materials or implied rather than explicitly detailed in the main text. The evaluation of test-time compute scaling (dynamic token allocation) is insightful but lacks a formal analysis of compute-accuracy trade-offs compared to majority voting or beam search.
The paper provides essential training hyperparameters (learning rates, KL coefficients, batch sizes, GRPO clip ratios, rollout configurations) and clearly outlines the reward formulation. The release of open model weights significantly lowers the barrier for downstream research and distillation. However, exact reproducibility is constrained by the massive compute infrastructure required for large-scale RL rollouts and the lack of publicly released training code. The reliance on proprietary or carefully curated cold-start data and preference pairs also introduces a reproducibility gap for independent labs.
The authors are commendably transparent about limitations: (1) Prompt sensitivity and degradation under few-shot settings, (2) Language mixing in multilingual contexts, (3) Lack of external tool integration (search, calculators), (4) Inefficient token usage on simple tasks ("overthinking"), and (5) The fundamental constraint that pure RL scaling is bottlenecked by the availability of reliable, rule-based verifiers. The paper acknowledges that extending this paradigm to open-ended domains (creative writing, complex reasoning without ground truth) remains unsolved due to reward model exploitation. Additionally, the heavy compute dependency inherently centralizes capability development.
This work represents a paradigm shift in LLM post-training, challenging the prevailing assumption that extensive human-annotated reasoning traces are necessary for advanced cognitive capabilities. By demonstrating that outcome-based RL alone can elicit sophisticated reasoning, it provides a scalable, annotation-light blueprint for the next generation of models. The open release of weights and distilled variants democratizes access to frontier reasoning capabilities, accelerating academic and industrial research. The authors responsibly address ethical risks, noting the dual-use potential of enhanced reasoning (e.g., jailbreak resilience vs. malicious planning feasibility) and advocating for robust safety guardrails. The paper will likely catalyze a wave of research into verifier-driven RL, dynamic inference compute, and reward model robustness.
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
SWE-bench introduces a rigorous, execution-based benchmark for evaluating language models on real-world software engineering tasks, fundamentally shifting the field from synthetic code completion to repository-level issue resolution and catalyzing the development of autonomous AI coding agents. The methodology's reliance on actual GitHub issues, multi-file context, and automated test validation establishes a new gold standard for code LLM evaluation, while the starkly low initial baselines provide a critical reality check and a clear, measurable frontier for future research in agentic software engineering.
The methodology introduces a paradigm shift from isolated code completion to repository-level, execution-validated software engineering. By curating 2,294 real GitHub issues paired with ground-truth pull requests across 12 mature Python repositories, the authors construct a benchmark that demands multi-file reasoning, dependency tracking, and precise patch generation. The core technical innovation is the execution-based evaluation harness: instead of relying on lexical similarity or synthetic unit tests, SWE-bench applies model-generated patches to the exact repository commit, spins up isolated Docker environments, and runs the project's native test suite. This design rigorously filters out superficial fixes and ensures that "solved" instances reflect genuine functional correctness. The pipeline also incorporates careful context retrieval strategies and environment versioning, addressing critical reproducibility gaps in prior code benchmarks.
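The harness's pass/fail decision reduces to a simple invariant: the patch must make the issue's previously failing tests pass without regressing tests that already passed. The sketch below captures only that decision logic; `run`-time concerns (Docker isolation, pinning the repository commit, invoking the native test suite) are abstracted into a precomputed results mapping, and all test names are hypothetical.

```python
# Simplified sketch of an execution-based resolution criterion in the spirit
# of SWE-bench. Actual test execution (Docker, the project's native suite) is
# abstracted into `test_results`, a mapping from test id to pass/fail observed
# after applying the model's patch at the pinned repository commit.

def is_resolved(test_results, fail_to_pass, pass_to_pass):
    # fail_to_pass: tests that failed before the patch and must now pass.
    # pass_to_pass: tests that passed before the patch and must still pass.
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    no_regression = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and no_regression

results = {"test_issue_repro": True, "test_existing_api": True}
print(is_resolved(results, ["test_issue_repro"], ["test_existing_api"]))  # → True
print(is_resolved({"test_issue_repro": False}, ["test_issue_repro"], []))  # → False
```

Treating a missing test result as a failure is the conservative choice: a patch that crashes the suite or breaks collection cannot be credited as a fix, which is exactly how superficial edits get filtered out.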
The experimental evaluation is methodical and highly revealing. The authors benchmark both proprietary frontier models (GPT-4, Claude 2, Gemini) and open-weight code models (SWE-Llama, CodeLlama) under zero-shot and few-shot conditions. The results are stark: even the strongest proprietary models resolve <2% of issues, exposing a massive capability gap between current LLMs and real-world engineering demands. The paper includes thorough ablation studies on context window limits, retrieval strategies, and prompt engineering, demonstrating that performance bottlenecks stem from multi-hop reasoning, long-context navigation, and test-aware debugging rather than mere syntax generation. The low baselines effectively calibrate community expectations and establish a clear, measurable frontier for subsequent research.
Excellent. The authors release a fully open-source evaluation harness, Dockerized environment configurations for each repository version, and a transparent, continuously updated leaderboard. The dataset construction pipeline—including issue/PR filtering criteria, test coverage validation, and environment isolation—is thoroughly documented in the appendices. The modular design allows researchers to seamlessly integrate new models, test custom retrieval pipelines, and run standardized evaluations, ensuring high reproducibility and rapid community adoption.
The benchmark is currently restricted to Python repositories, limiting cross-language generalization and ecosystem-specific evaluation (e.g., JavaScript, Rust, C++). The static, single-turn evaluation setup does not natively support interactive debugging, web search, or multi-step agentic loops, which are increasingly necessary for complex issue resolution. Additionally, the strict test-passing criterion may penalize semantically correct but structurally unconventional fixes, and some issues inherently require external domain knowledge or stakeholder clarification not present in the repository context.
SWE-bench has fundamentally redirected the trajectory of AI-assisted software engineering, moving the field beyond synthetic benchmarks toward realistic, production-grade evaluation. It serves as the foundational testbed for the rapidly expanding domain of autonomous coding agents, driving innovations in tool use, retrieval-augmented generation, and iterative debugging. By establishing a rigorous, execution-validated standard, it accelerates the development of reliable AI coding assistants, informs safe deployment practices, and provides a clear, quantifiable roadmap for achieving practical, autonomous software engineering capabilities.
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
DeepSeek-V2 introduces a highly efficient MoE architecture featuring Multi-head Latent Attention and optimized expert routing, achieving dense-model performance with drastically reduced training costs and KV cache memory. The paper presents a rigorously evaluated, practically impactful solution to LLM scaling bottlenecks, offering architectural innovations that balance performance, efficiency, and open accessibility, thereby establishing a new standard for economical large language model development and deployment.
The paper introduces two primary architectural innovations: Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA addresses the KV cache bottleneck by projecting keys and values into a shared low-rank latent space, then reconstructing them during attention computation. To maintain positional awareness without breaking matrix absorption during inference, the authors propose a decoupled RoPE strategy that separates position-sensitive components from the compressed cache. DeepSeekMoE builds on fine-grained expert segmentation and shared expert isolation, augmented with device-limited routing and three auxiliary balance losses (expert, device, communication) to stabilize distributed training. The methodology is mathematically sound and directly targets the most pressing practical constraints in LLM scaling: memory bandwidth and compute efficiency. The alignment pipeline (SFT + two-stage GRPO-based RL) is standard but carefully engineered to mitigate alignment tax.
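The cache-size arithmetic behind MLA is easy to demonstrate. This is a toy numpy sketch under loudly stated simplifications: multi-head structure, the decoupled RoPE path, and matrix absorption are all omitted; only the shared low-rank latent and its up-projections are shown, with dimensions chosen for illustration.

```python
import numpy as np

# Toy sketch of MLA-style KV compression (numpy; multi-head details and the
# decoupled RoPE path are omitted). Keys and values are reconstructed from a
# shared low-rank latent, so only the latent — not full K and V — is cached.

rng = np.random.default_rng(0)
d, r, T = 64, 8, 16          # hidden width, latent rank (r << d), sequence length

W_down = rng.standard_normal((d, r)) / np.sqrt(d)   # compress hidden state to latent
W_uk = rng.standard_normal((r, d)) / np.sqrt(r)     # latent -> keys
W_uv = rng.standard_normal((r, d)) / np.sqrt(r)     # latent -> values

h = rng.standard_normal((T, d))      # token hidden states
c_kv = h @ W_down                    # cached per-token latent: T x r floats
k, v = c_kv @ W_uk, c_kv @ W_uv      # reconstructed on the fly at attention time

standard_cache = 2 * T * d           # full K and V tensors
mla_cache = T * r                    # latent only
print(standard_cache // mla_cache)   # → 16, i.e. a 16x smaller cache in this toy setup
```

In practice the up-projections can be absorbed into the query and output projections so K and V are never materialized per token, which is precisely why the decoupled RoPE strategy is needed: naive rotary embeddings applied to reconstructed keys would break that absorption.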
The evaluation is comprehensive and rigorously structured. The model is benchmarked across a wide spectrum of English and Chinese tasks (MMLU, GSM8K, HumanEval, AlignBench, MT-Bench, AlpacaEval 2.0, etc.), consistently matching or surpassing dense 70B+ models and competing MoE baselines (Mixtral 8x22B, Qwen1.5 72B) despite activating only 21B parameters per token. Efficiency metrics are thoroughly reported: 42.5% reduction in training GPU hours, 93.3% KV cache compression, and 5.76x throughput improvement. Ablation studies validate MLA over MHA/GQA/MQA and confirm the necessity of the proposed routing/balance mechanisms. Long-context extension via YaRN is empirically verified with Needle-In-A-Haystack tests up to 128K. The evaluation framework is internally consistent and covers both closed-form and open-ended generation.
High for architectural and training configuration details. The paper provides explicit hyperparameters, layer counts, expert routing logic, optimizer settings, learning rate schedules, and infrastructure specifications (HAI-LLM framework, H800 cluster setup, parallelism strategy). The release of DeepSeek-V2 and V2-Lite weights enables direct community validation. However, exact reproducibility is constrained by the proprietary nature of the 8.1T-token pretraining corpus and internal training framework optimizations. The architectural formulas and training protocols are sufficiently documented for well-resourced teams to replicate the core methodology.
The paper acknowledges standard LLM constraints: static knowledge cutoff, hallucination risks, and limited multilingual capability beyond Chinese/English. The alignment tax is explicitly noted, with RL improving open-ended generation while slightly degrading performance on certain reasoning benchmarks (e.g., BBH). Architecturally, MLA introduces additional complexity in positional embedding handling (decoupled RoPE) and requires careful scaling factor tuning. The reliance on custom CUDA kernels and internal parallelism frameworks may limit direct adoption by smaller labs. Additionally, the paper does not fully explore MLA's behavior under extreme batch sizes or mixed-precision quantization beyond FP8/6-bit KV cache.
DeepSeek-V2 significantly advances the practical deployment of large-scale language models by demonstrating that MoE architectures can achieve dense-model performance with drastically reduced training costs and inference memory. The KV cache compression technique directly addresses a critical bottleneck in serving long-context models, making high-capacity LLMs viable in resource-constrained environments. By open-sourcing the model and architectural details, the work democratizes access to efficient, state-of-the-art capabilities and provides a strong baseline for future research in sparse attention, expert routing, and alignment efficiency. The methodology will likely influence both academic LLM design and industry deployment pipelines.
The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark consisting of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors no larger than 70B. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Improving instruction following and training on high-quality multi-round alignment data could improve agent performance. Moreover, contrary to existing assumptions, training on code presents ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
AgentBench introduces a comprehensive, multi-environment benchmark that systematically evaluates LLMs as autonomous agents, revealing critical failure modes and training insights that have become foundational for the agent evaluation landscape. The work provides a rigorous, standardized methodology for assessing multi-turn reasoning and decision-making, delivers actionable empirical findings that challenge prevailing assumptions about code-centric pre-training, and establishes an open, widely adopted evaluation framework that continues to guide the development of reliable, instruction-following AI agents.
The paper introduces a systematic, multi-dimensional benchmark comprising 8 distinct interactive environments (e.g., operating systems, databases, web navigation, card games, knowledge graphs) designed specifically to evaluate LLMs as autonomous agents. The methodology shifts evaluation from static, single-turn QA to dynamic, multi-turn interaction loops that test long-horizon reasoning, state tracking, and instruction adherence. The framework standardizes environment APIs and evaluation metrics, enabling direct comparison across API-based and open-source models. The failure mode taxonomy is particularly rigorous, categorizing agent breakdowns into planning, execution, and alignment failures, which provides actionable diagnostic signals for model developers.
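The multi-turn interaction loop described above can be sketched with a toy environment. The class and method names here are illustrative stand-ins, not AgentBench's actual wrapper API, and a deterministic binary-search policy stands in for an LLM:

```python
from dataclasses import dataclass

# Toy text environment with a reset/step loop in the spirit of the benchmark's
# standardized wrappers; names are illustrative, not the repository's API.
@dataclass
class GuessNumberEnv:
    target: int = 7
    max_turns: int = 5
    turns: int = 0

    def reset(self) -> str:
        self.turns = 0
        return "Guess an integer between 1 and 10."

    def step(self, action: str):
        self.turns += 1
        guess = int(action)
        done = guess == self.target or self.turns >= self.max_turns
        reward = 1.0 if guess == self.target else 0.0
        obs = "correct" if guess == self.target else ("higher" if guess < self.target else "lower")
        return obs, reward, done

class BinarySearchAgent:
    """Deterministic stand-in for an LLM policy."""
    def __init__(self):
        self.lo, self.hi, self.last = 1, 10, None

    def __call__(self, obs: str) -> str:
        if obs == "higher":
            self.lo = self.last + 1
        elif obs == "lower":
            self.hi = self.last - 1
        self.last = (self.lo + self.hi) // 2
        return str(self.last)

def run_episode(env, agent) -> float:
    """Multi-turn loop: the agent must track state across a bounded turn budget."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(agent(obs))
        total += reward
    return total

print(run_episode(GuessNumberEnv(), BinarySearchAgent()))  # -> 1.0
```

The point of the shape, rather than the toy task, is that success depends on remembering earlier observations and budgeting turns, which is exactly the long-horizon state tracking the benchmark stresses.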
The experimental suite is comprehensive, evaluating a wide spectrum of contemporary LLMs (both commercial and OSS up to 70B+ parameters). Results clearly quantify the performance gap between frontier commercial models and open alternatives, with granular breakdowns per environment. A standout empirical contribution is the finding that code pre-training yields ambivalent effects on agent tasks, challenging the prevailing assumption that code proficiency directly translates to superior agentic reasoning. The combination of quantitative success rates with qualitative error analysis strengthens the validity of the conclusions and provides clear guidance for future training data curation.
High. The authors release the complete benchmark suite, environment configurations, evaluation harness, and detailed documentation via a public GitHub repository. The use of standardized environment wrappers and clear API interfaces ensures that researchers can readily replicate evaluations across different models. Minor reproducibility constraints exist for proprietary API models due to potential version drift and rate limiting, but the open-source components are fully self-contained and well-documented.
The benchmark is primarily text and structured-data focused, lacking evaluation in multimodal, embodied, or physical robotics environments where agent capabilities are increasingly critical. Automated success metrics may not fully capture partial progress, creative problem-solving, or safety-aligned behavior in complex environments. As a snapshot evaluation, the benchmark's discriminative power may degrade rapidly as LLMs scale and fine-tuning techniques evolve. Additionally, the observed ambivalent impact of code training remains correlational and lacks mechanistic or ablation studies to isolate causal factors.
AgentBench establishes a critical evaluation standard for the rapidly expanding LLM agent ecosystem, steering research away from narrow task performance toward robust, multi-step reasoning and reliable instruction following. By publicly releasing the environments and evaluation pipeline, it democratizes agent research and accelerates iterative model development. The identified bottlenecks directly inform the next generation of alignment datasets and training objectives, while the benchmark's design principles influence subsequent agent evaluation frameworks across academia and industry.
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
This work establishes a rigorous framework for transparent, large-scale dataset curation and demonstrates that careful data selection combined with standard scaling laws can yield highly parameter-efficient code models. The introduction of The Stack v2, anchored by persistent identifiers for full data provenance, directly addresses critical reproducibility and licensing bottlenecks that have historically constrained open code LLM research. Empirically, the models push the efficiency frontier, particularly excelling in mathematical reasoning and low-resource language support while matching or surpassing significantly larger counterparts. The substantial citation volume reflects rapid community validation and widespread adoption as a baseline for open-weight development, cementing its utility across both academic and industrial pipelines. While the architectural and training methodologies remain iterative rather than paradigm-shifting, the paper’s methodological rigor in data governance and its strong empirical results make it a highly influential reference that will continue to shape responsible, transparent model development practices in the code generation domain.
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Llama 3 introduces a suite of highly capable, openly released foundation models up to 405B parameters that achieve frontier-level performance through refined data curation, scaled training, and compositional multimodal integration. While the work does not propose a fundamentally new architecture or training paradigm, its significance stems from the rigorous empirical validation of scaling laws, high-quality data filtering, and post-training alignment at an unprecedented scale for open-weight models. The comprehensive release democratizes access to state-of-the-art capabilities, establishing a critical baseline for the open-source ecosystem and enabling widespread downstream research across coding, reasoning, and multilingual tasks. Compared to foundational paradigm shifts like the original Transformer or early GPT series, this contribution represents a maturation and optimization phase rather than a conceptual breakthrough; however, its transparency, safety tooling, and exploratory multimodal composition make it an indispensable reference for both academic and industrial practitioners. The paper’s primary value lies in its engineering rigor and ecosystem impact, solidifying current LLM development trajectories rather than redefining them.
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
Qwen2-VL introduces a highly optimized, open-weight vision-language model series featuring dynamic native-resolution processing and decomposed multimodal positional embeddings, achieving proprietary-level performance across diverse benchmarks. While methodologically incremental, the paper's rigorous scaling analysis, comprehensive training recipe, and open-weight release establish a new practical standard for open LVLM development, offering immediate utility and strong baseline value for the broader multimodal research community.
The paper presents a pragmatic, engineering-focused architecture for large vision-language models. The core technical contributions are Naive Dynamic Resolution (processing native-resolution images into variable token counts via an MLP compressor) and Multimodal Rotary Position Embedding (M-RoPE), which decomposes positional encodings into temporal, height, and width dimensions. While conceptually aligned with prior work like NaViT and 2D/3D RoPE variants, the integration into a unified LVLM pipeline is clean and well-motivated. The unified image/video paradigm (treating images as 2-frame sequences with lightweight 3D convolutions) and the standard 3-stage training recipe (ViT pretraining, joint multimodal pretraining, instruction tuning) are conventional but effectively scaled. The methodology prioritizes scalability, efficiency, and empirical performance over theoretical novelty, which is appropriate for a system-level model release.
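The effect of Naive Dynamic Resolution on token counts can be illustrated with the 14x14 ViT patches and 2x2 token merge described in the paper; the round-up-to-tile padding behavior below is a simplifying assumption for illustration, not the exact preprocessing:

```python
import math

# Map a native-resolution image to a variable visual-token count, assuming
# 14x14 ViT patches and a 2x2 MLP merge; rounding details are illustrative.
def visual_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    tile = patch * merge                         # each merged token covers 28x28 px
    grid_h = math.ceil(height / tile) * merge    # patch rows after padding up
    grid_w = math.ceil(width / tile) * merge     # patch cols after padding up
    return (grid_h * grid_w) // (merge * merge)  # tokens after the 2x2 merge

for h, w in [(224, 224), (448, 896), (1080, 1920)]:
    print(f"{h}x{w} px -> {visual_tokens(h, w)} visual tokens")
```

This is why the min/max pixel thresholds mentioned above matter: without them, very large images would blow the token budget and very small ones would yield too few tokens to be useful.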
The experimental evaluation is comprehensive and rigorously structured. The authors benchmark across general VQA, document/diagram understanding, multilingual OCR, mathematical reasoning, referring expression comprehension, video understanding, and visual agent tasks. The 72B variant consistently matches or exceeds proprietary baselines (GPT-4o, Claude 3.5-Sonnet) on numerous public benchmarks, with particularly strong gains in OCR-heavy and document-centric tasks. Ablation studies on dynamic resolution, M-RoPE, and model/data scaling are thorough and provide actionable insights into compute-performance tradeoffs. The agent evaluations (UI operations, robotics, navigation, card games) are well-designed and demonstrate real-world applicability. However, the evaluation relies heavily on aggregate benchmark scores, and some claims (e.g., 20+ minute video understanding) are constrained by frame sampling limits during testing.
High. The release of open weights across three scales (2B, 8B, 72B), coupled with detailed training hyperparameters, data composition, and infrastructure specifications (3D parallelism, storage architecture, software stack), significantly lowers the barrier to replication. The code repository provides implementation details for dynamic resolution packing, M-RoPE, and the training pipeline. The primary limitation to exact reproducibility is the reliance on proprietary data curation and massive compute resources (Alibaba Cloud PAI), which smaller academic or independent labs cannot easily match. Nevertheless, the open-weight release and transparent training recipe make this one of the most reproducible large-scale LVLM papers to date.
The architectural innovations are incremental rather than foundational, building heavily on established paradigms (dynamic tokenization, RoPE variants, standard LVLM connectors). The paper acknowledges performance gaps on highly complex reasoning benchmarks (e.g., MMMU) and struggles with 3D spatial modeling in navigation tasks. The dynamic resolution mechanism, while efficient, still requires careful token budgeting and min/max pixel threshold tuning, which can introduce distribution shifts for extremely small or large images. Video evaluation caps frame extraction at 768, limiting claims about true long-horizon temporal reasoning. Finally, the massive compute and proprietary data requirements inherently limit the accessibility of the exact training pipeline for the broader research community.
The open-weight release of Qwen2-VL at multiple scales democratizes access to state-of-the-art multimodal capabilities, accelerating research and deployment in academia and industry. The dynamic resolution and M-RoPE techniques provide practical blueprints that will likely be adopted by subsequent open LVLM development. Strong multilingual OCR, document parsing, and visual agent capabilities enable real-world applications in automated UI interaction, robotics, educational tools, and global content analysis. The detailed training infrastructure and scaling analysis serve as a valuable reference for the community, pushing forward the standardization of large-scale multimodal training practices.
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire run, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
DeepSeek-V3 introduces an auxiliary-loss-free load balancing mechanism and a multi-token prediction objective to enable highly stable, cost-effective training of a 671B-parameter Mixture-of-Experts model. The work delivers a substantial methodological advance in large-scale MoE training by directly addressing long-standing bottlenecks such as expert collapse and training instability, while demonstrating that frontier-level performance can be achieved with dramatically reduced compute budgets. The architectural choices build upon established attention and routing paradigms rather than proposing a fundamentally new learning framework, positioning the contribution as a highly refined engineering and training methodology rather than a paradigm shift. Nevertheless, the open release, detailed training recipe, and demonstrated zero-rollback stability will serve as a critical reference for the community, likely shaping how research groups and industry labs approach efficient, scalable LLM development in the near term.
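The auxiliary-loss-free balancing idea can be sketched as a per-expert bias on routing scores that is nudged up for underloaded experts and down for overloaded ones, instead of backpropagating a balance loss. The update rule, step size, and synthetic score distribution below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, gamma = 8, 2, 0.01
bias = np.zeros(n_experts)   # routing bias, adjusted online rather than trained
total = np.zeros(n_experts)  # accumulated load after warm-up, for inspection

for step in range(2000):
    # Synthetic affinity scores with a deliberate skew toward high-index experts.
    scores = rng.standard_normal((64, n_experts)) + np.linspace(0.0, 1.0, n_experts)
    # The bias shifts expert *selection* only; gating weights would use raw scores.
    chosen = np.argsort(scores + bias, axis=1)[:, -top_k:]
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = chosen.size / n_experts
    bias += gamma * np.sign(target - load)  # raise underloaded, lower overloaded
    if step >= 1500:
        total += load

share = total / total.sum()
print(np.round(share, 3))  # close to uniform (1/8 each) after adaptation
```

Even with strongly skewed affinities, the bias settles into a dithering equilibrium where each expert receives roughly its fair share of tokens, which is the behavior the paper relies on to avoid expert collapse without an interfering gradient signal.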
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns, eliminating the need to cascade several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better control over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
MusicGen introduces a simplified single-stage autoregressive transformer with efficient token interleaving that achieves state-of-the-art text-to-music generation without cascaded models. By demonstrating that architectural simplicity combined with discrete audio tokenization can match or exceed complex hierarchical pipelines, the work establishes a highly practical, open-source baseline that has rapidly become a foundational reference for scalable audio generation research.
The paper proposes MusicGen, a single-stage autoregressive transformer that generates music by predicting discrete audio tokens derived from EnCodec. The core methodological advance lies in replacing the complex, multi-stage cascaded or hierarchical architectures prevalent in prior audio generation work (e.g., AudioLM, MusicLM) with a unified LM coupled with efficient token interleaving patterns. By flattening or parallelizing the prediction of multiple codebook streams, the model avoids error accumulation and training instability inherent in cascaded systems. Conditioning is implemented via cross-attention with frozen T5 text embeddings or melodic feature extractors. The approach is architecturally minimalist, demonstrating that scaling a standard transformer over well-quantized discrete audio representations can yield high-fidelity, controllable generation without bespoke architectural complexity.
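One of the interleaving patterns the paper studies, the "delay" pattern, can be sketched directly: with K codebook streams, stream k is shifted right by k steps so a single LM can predict all K codebooks of a frame across K consecutive steps rather than requiring a cascade of models. `PAD` here is an assumed placeholder id for positions outside a stream:

```python
PAD = -1  # assumed placeholder token id for illustration

def delay_interleave(codes):
    """codes: K equal-length token streams -> (K, T + K - 1) delayed grid."""
    K, T = len(codes), len(codes[0])
    grid = [[PAD] * (T + K - 1) for _ in range(K)]
    for k, stream in enumerate(codes):
        for t, tok in enumerate(stream):
            grid[k][t + k] = tok  # codebook k lags by k steps
    return grid

codes = [[10, 11, 12],   # codebook 0
         [20, 21, 22],   # codebook 1
         [30, 31, 32]]   # codebook 2
for row in delay_interleave(codes):
    print(row)
```

Reading the grid column by column shows the key property: at any step, the token for codebook k can condition on the same frame's tokens from codebooks 0..k-1, which were emitted at earlier steps.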
The empirical evaluation is comprehensive, leveraging both automatic metrics (Fréchet Audio Distance, KL divergence, chroma similarity) and rigorous human listening studies (MOS, pairwise preference). Results consistently show MusicGen outperforming evaluated baselines in perceptual quality and text-music alignment. The ablation studies are well-designed, isolating the impact of token interleaving strategies, model scale, and conditioning modalities. While automatic audio metrics remain imperfect proxies for human perception, the authors appropriately mitigate this by prioritizing human evaluations and providing extensive qualitative samples. The benchmarking is thorough and sets a clear standard for subsequent text-to-music research.
Exceptional. The authors release the full codebase, pre-trained checkpoints, and training pipelines through the Audiocraft library. The methodology relies on widely understood components (standard transformers, EnCodec, T5) with transparent architectural specifications, hyperparameters, and data preprocessing steps. The open-source release includes clear documentation, inference scripts, and fine-tuning guides, making this one of the most reproducible and accessible works in generative audio.
The autoregressive formulation inherently limits generation throughput and struggles with long-horizon structural coherence (e.g., maintaining consistent verse-chorus transitions over extended durations). The model lacks fine-grained control over exact musical notation, tempo, or instrumentation beyond high-level textual prompts. Additionally, the discrete tokenization bottleneck (EnCodec) introduces quantization artifacts that cap theoretical audio fidelity, particularly in high-frequency transients. The paper also does not deeply address dataset copyright implications or the risk of stylistic overfitting to training corpora.
MusicGen has rapidly become a foundational baseline in generative audio, significantly lowering the barrier to entry for high-quality music synthesis through its open-source release. It enables practical applications in creative assistance, interactive media, and rapid prototyping. However, it also amplifies existing concerns regarding artist displacement, copyright infringement, and the potential for generating deceptive audio content. The authors' commitment to transparent, open research sets a constructive precedent, though responsible deployment frameworks and provenance tracking remain critical open challenges for the field.
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4$\times$ with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
PagedAttention introduces a virtual-memory-inspired block-based KV cache management system that eliminates memory fragmentation and enables efficient cross-request sharing. By decoupling logical and physical memory allocation and implementing optimized CUDA kernels, vLLM achieves 2-4x throughput gains over prior state-of-the-art systems, fundamentally reshaping LLM inference infrastructure and establishing a new standard for memory-efficient model serving.
The paper introduces PagedAttention, a systems-level innovation that adapts classical OS virtual memory and paging concepts to LLM KV cache management. By partitioning KV states into fixed-size logical blocks mapped to non-contiguous physical memory via a block table, the method eliminates the contiguous allocation requirement that causes severe internal/external fragmentation and over-provisioning waste. The methodology elegantly extends to complex decoding paradigms (parallel sampling, beam search, shared prefixes) through reference counting and block-level copy-on-write, mirroring OS process forking. The system architecture (vLLM) couples this memory manager with a centralized scheduler, custom CUDA kernels for non-contiguous block reads/writes, and robust preemption/eviction strategies (swapping vs. recomputation). The approach is methodologically rigorous, carefully addressing GPU memory hierarchy constraints, kernel launch overheads, and attention computation patterns while maintaining mathematical equivalence to standard self-attention.
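The block-table bookkeeping described above can be sketched with a toy allocator. Class and method names are illustrative, not vLLM's internal API, and the actual copying of KV data on a copy-on-write is elided:

```python
# Toy PagedAttention-style block manager: logical KV blocks map to
# non-contiguous physical blocks via a per-sequence block table; shared
# prefixes use reference counting with copy-on-write, like OS fork().
class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))  # free physical blocks
        self.refcount = {}                            # physical block -> ref count

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table: list) -> list:
        # Share every physical block; bump refcounts instead of copying KV data.
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

    def write(self, block_table: list, idx: int) -> int:
        # Copy-on-write: duplicate a block only if it is actually shared.
        b = block_table[idx]
        if self.refcount[b] > 1:
            self.refcount[b] -= 1
            block_table[idx] = self.allocate()  # real system also copies KV data
        return block_table[idx]

mgr = BlockManager(8)
parent = [mgr.allocate(), mgr.allocate()]  # two logical blocks for one sequence
child = mgr.fork(parent)                   # beam-search fork: zero bytes copied
mgr.write(child, 1)                        # child diverges only on its last block
print(parent, child)                       # first block still shared
```

Because blocks are fixed-size and need not be contiguous, freeing a finished sequence returns its blocks to the free list with no external fragmentation, which is the property that lets vLLM pack far larger batches.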
The evaluation is comprehensive and well-calibrated. The authors benchmark against strong baselines (FasterTransformer, Orca) across multiple model scales and real-world traces (ShareGPT, Alpaca), demonstrating consistent 2-4x throughput improvements at iso-latency. Ablation studies thoroughly analyze kernel overhead (~20-26% attention penalty offset by higher batch sizes), optimal block size selection (default 16), and eviction tradeoffs. The experiments cover diverse decoding algorithms and validate distributed tensor-parallel execution. While the evaluation strongly emphasizes throughput/latency tradeoffs, it could have included deeper analysis on energy efficiency, multi-node pipeline parallelism, or extreme-scale scheduling bottlenecks. Nonetheless, the empirical rigor and workload diversity firmly establish the system's practical superiority.
Excellent. vLLM is fully open-sourced (Apache 2.0) with clear implementation details spanning ~8.5K lines of Python and ~2K lines of C++/CUDA. The paper provides explicit descriptions of kernel fusions, block table management, scheduling policies, and configuration parameters. The codebase is actively maintained, extensively documented, and has become the de facto reference implementation for efficient LLM serving, ensuring high reproducibility and immediate community adoption.
The custom PagedAttention kernel introduces measurable overhead in the attention operator compared to contiguous baselines, which is only amortized when batch sizes are sufficiently large. Block size tuning is workload-dependent; mismatched choices can degrade performance or increase fragmentation. The CPU swapping mechanism, while functional, is constrained by PCIe bandwidth and may cause latency spikes under heavy preemption. The centralized scheduler design, while simple and effective for single-node or tensor-parallel setups, may become a bottleneck at extreme multi-node scales. Additionally, the paper focuses primarily on memory-bound regimes and does not deeply explore compute-bound optimizations or advanced pipeline parallelism strategies.
PagedAttention and vLLM have fundamentally transformed LLM inference infrastructure, becoming the backbone for countless commercial APIs, open-source deployments, and research frameworks. By drastically improving memory efficiency and throughput, the work lowers the hardware barrier for deploying large models, democratizing access to advanced AI capabilities. It successfully bridges systems and ML research, inspiring a new generation of memory-aware inference optimizations and establishing a new standard for scalable, efficient model serving.

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine for a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. The evaluation shows that PowerInfer significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance comparable to that of a high-end server-grade A100 GPU, reaching 82% of its token generation rate on a single consumer-grade RTX 4090 GPU.
PowerInfer introduces a GPU-CPU hybrid inference engine that exploits the power-law distribution of neuron activations to dramatically accelerate LLM serving on consumer hardware. By strategically partitioning hot and cold neurons across GPU and CPU, integrating adaptive predictors, and implementing custom sparse operators, the system achieves order-of-magnitude speedups over existing baselines while preserving model accuracy. The work represents a highly practical and impactful contribution to ML systems, bridging the gap between theoretical sparsity and real-world deployment constraints, though its reliance on static profiling and dense decoder architectures limits broader architectural generalization.
The paper introduces a principled GPU-CPU hybrid inference engine grounded in the empirical observation that LLM neuron activations follow a heavy-tailed power-law distribution. By statically profiling models to separate "hot" (consistently active) and "cold" (input-dependent) neurons, the system strategically places hot neurons in GPU VRAM for low-latency access while offloading cold neuron computation to the CPU. This design elegantly circumvents the dual bottlenecks of consumer GPU memory capacity and PCIe bandwidth. The integration of an adaptive activation predictor to dynamically route computation, alongside custom neuron-aware sparse operators (e.g., optimized sparse GEMM kernels), demonstrates strong systems-level engineering. The methodology is logically sound, well-structured, and directly targets a critical gap in the LLM deployment stack.
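The hot/cold partition and its correctness property can be illustrated with a toy NumPy sketch. All names and sizes below are assumptions for illustration, not PowerInfer's code: we profile ReLU activation counts offline, keep the most frequently firing neurons within a "GPU budget", and check that computing hot and cold neuron groups separately is mathematically equivalent to the dense FFN.

```python
import numpy as np

# Toy sketch of a hot/cold neuron split for an activation-sparse
# FFN layer. Sizes, names, and the profiling corpus are illustrative.
rng = np.random.default_rng(0)
d, n_neurons, gpu_budget = 8, 32, 8

W_in = rng.normal(size=(d, n_neurons))
W_out = rng.normal(size=(n_neurons, d))

# Offline profiling: count how often each neuron fires over a corpus.
X = rng.normal(size=(100, d))
counts = ((X @ W_in) > 0).sum(axis=0)
hot = np.argsort(counts)[::-1][:gpu_budget]        # preload on GPU
cold = np.setdiff1d(np.arange(n_neurons), hot)     # compute on CPU

def ffn_dense(x):
    return np.maximum(x @ W_in, 0) @ W_out

def ffn_hybrid(x):
    # Hot neurons (GPU path) and cold neurons (CPU path) computed
    # separately; a real system would additionally use an activation
    # predictor to skip cold neurons expected to be inactive.
    h_hot = np.maximum(x @ W_in[:, hot], 0)
    h_cold = np.maximum(x @ W_in[:, cold], 0)
    return h_hot @ W_out[hot] + h_cold @ W_out[cold]

x = rng.normal(size=d)
assert np.allclose(ffn_dense(x), ffn_hybrid(x))
```

Because the hot and cold index sets partition the neuron dimension, the split changes only where each column is computed, not the result; the speedup in the real system comes from keeping the hot columns resident in VRAM and skipping predicted-inactive cold neurons.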
The evaluation is comprehensive and rigorously benchmarks against strong open-source baselines (llama.cpp, vLLM) across multiple model scales (OPT-13B, 30B, 175B). The reported 11.69x speedup over llama.cpp and the ability to match 82% of an A100's token generation rate on a single RTX 4090 are highly compelling and practically significant. The authors thoroughly measure latency, throughput, memory footprint, and confirm accuracy preservation. However, the evaluation is somewhat constrained to decoder-only autoregressive generation under single-request or light-concurrency settings. Broader stress-testing with high-batch serving, variable context lengths, and alternative architectures (e.g., MoE, encoder-decoder) would further validate the system's generality.
The paper provides clear architectural diagrams, algorithmic pseudocode, and detailed implementation notes regarding CUDA kernel design, CPU thread scheduling, and memory management. The reliance on offline neuron profiling is straightforward to replicate, and the open-source release significantly lowers the barrier to reproduction. Minor reproducibility concerns stem from hardware-specific PCIe bandwidth variations and driver-level optimizations that may cause performance fluctuations across different consumer setups, but the core methodology remains transparent and well-documented.
The hot/cold neuron split assumes relatively static activation patterns, which may degrade under highly dynamic prompting or domain-shifted inputs. The requirement for offline profiling limits true plug-and-play deployment for arbitrary or newly released models. Performance gains are tightly coupled to PCIe bandwidth and CPU multi-core capabilities, meaning older or lower-end consumer systems may see diminished returns. Additionally, the approach is optimized for dense transformer decoders and does not natively extend to Mixture-of-Experts (MoE) routing or encoder-decoder paradigms without substantial architectural modifications.
PowerInfer meaningfully democratizes access to large-scale LLM inference by enabling high-throughput serving on affordable, consumer-grade hardware. This reduces dependency on expensive cloud infrastructure, lowering financial and environmental barriers for researchers, indie developers, and edge deployments. The work strongly aligns with privacy-preserving local AI initiatives and educational use cases, while its systems-level optimizations may inspire future research in hardware-aware sparse computation and hybrid compute architectures.
Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
Medusa introduces a lightweight multi-head decoding architecture with tree-based verification that eliminates the need for separate draft models in speculative decoding. By combining practical training strategies, rigorous hardware-aware analysis, and consistent 2–3x speedups across model scales, the paper delivers a highly deployable optimization that has rapidly become a standard component in modern LLM inference stacks.
The paper introduces a clean and highly practical architectural modification to standard auto-regressive LLMs: appending multiple lightweight decoding heads to predict future tokens in parallel, coupled with a tree-based attention mechanism for simultaneous verification. This elegantly sidesteps the primary bottleneck of traditional speculative decoding—the need to train, maintain, and synchronize a separate draft model. The two-tier training strategy (Medusa-1 for frozen backbones, Medusa-2 for joint fine-tuning with LoRA/QLoRA) demonstrates strong engineering pragmatism, catering to both resource-constrained and performance-maximizing deployment scenarios. The self-distillation extension and typical acceptance heuristic further enhance robustness in data-scarce settings. While conceptually building on speculative decoding, the formulation of internal heads + tree verification is a distinct and well-motivated architectural simplification that reduces system complexity and memory overhead.
The empirical evaluation is thorough and well-calibrated. The authors benchmark across multiple model scales (7B, 13B, 33B), training regimes, and hardware profiles, consistently demonstrating 2.2x–3.6x wall-clock speedups without degrading generation quality. The inclusion of a roofline model analysis and analytical latency modeling is a standout strength, providing actionable insights into when the method transitions from memory-bandwidth-bound to compute-bound regimes. This hardware-aware evaluation elevates the paper beyond typical algorithmic benchmarks. Comparisons against open-source speculative decoding baselines are fair, though the paper would benefit from testing on non-Llama architectures (e.g., Mistral, Qwen) to confirm architectural agnosticism. Overall, the results are rigorous, reproducible, and directly aligned with real-world deployment constraints.
High. The paper provides explicit training configurations (Axolotl framework, 8-bit AdamW, cosine scheduler, LoRA rank 32, specific learning rates, loss weighting coefficients like λ_k = 0.8^k), clear definitions of metrics (acceleration rate, overhead, speedup), and detailed tree construction/pruning methodology. The reliance on standard open-source tooling and the public release of training/inference code ensure that independent researchers can replicate the results with minimal friction.
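The loss weighting cited above is simple enough to write out directly. This is a sketch of the decayed multi-head objective, assuming per-head cross-entropy losses have already been computed; the decay base 0.8 follows the λ_k = 0.8^k coefficients quoted from the paper.

```python
# Decayed multi-head training objective: heads predicting tokens
# further into the future are down-weighted, since their targets
# are harder and noisier. head_losses[k-1] is head k's CE loss.

def medusa_loss(head_losses, decay=0.8):
    return sum((decay ** k) * loss
               for k, loss in enumerate(head_losses, start=1))
```

For two heads with unit losses this yields 0.8 + 0.64 = 1.44, making the down-weighting of later heads explicit.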
The speedup is inherently bounded by the acceptance rate of the predicted tokens, which degrades on highly diverse, creative, or out-of-distribution prompts. The tree-based verification introduces non-trivial memory overhead for candidate caching, which can become prohibitive at large batch sizes (>32) as linear layers shift to compute-bound regimes. Medusa-2's joint fine-tuning requires careful hyperparameter tuning to avoid catastrophic forgetting of the backbone's capabilities. Additionally, the method does not alter the fundamental O(N) sequential dependency of LLMs; it only reduces the constant factor through parallel verification. Finally, the analytical model assumes simplified block latency and omits post-processing overhead, which may slightly overestimate real-world gains in highly optimized kernels.
Medusa addresses one of the most pressing bottlenecks in LLM deployment: auto-regressive decoding latency and memory bandwidth constraints. By eliminating the need for external draft models, it significantly lowers the engineering and computational barrier to efficient inference, making it highly attractive for both academic research and commercial AI services. The framework's simplicity and compatibility with existing quantization/LoRA pipelines facilitate rapid integration into production inference engines (e.g., vLLM, TensorRT-LLM). Widespread adoption will reduce inference costs, improve user-facing latency, and democratize access to high-throughput LLM serving.