Top ML research, scored by Qwen 3.6 · 98 papers · all domains
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
The paper introduces the Transformer architecture, replacing recurrent and convolutional sequence models with a purely attention-based design that enables unprecedented parallelization and scalability. By demonstrating that self-attention alone can capture long-range dependencies more effectively than sequential architectures while drastically reducing training time, the work fundamentally redefined sequence modeling and established the architectural blueprint for modern foundation models across language, vision, and multimodal domains. Its elegant formulation of multi-head attention, sinusoidal positional encodings, and residual connections created a highly modular and hardware-friendly framework that directly enabled the empirical scaling laws and emergent capabilities observed in subsequent large-scale models. The architectural shift not only solved longstanding bottlenecks in gradient propagation and computational efficiency but also catalyzed an entire ecosystem of research spanning efficient attention variants, alignment techniques, and reasoning frameworks, cementing its status as a paradigm-defining contribution that permanently altered the trajectory of machine learning research and deployment.
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
LoRA introduces a parameter-efficient fine-tuning paradigm that freezes pre-trained model weights and injects trainable low-rank decomposition matrices into Transformer layers, matching full fine-tuning performance while drastically reducing trainable parameters and eliminating inference latency. The work’s significance stems from its rigorous empirical demonstration that downstream adaptation occupies a low-rank subspace, a finding that fundamentally redefined how practitioners approach LLM customization. By removing the prohibitive compute and memory barriers of full fine-tuning, it democratized access to large-scale model training, catalyzed an entire research ecosystem (including quantization hybrids and dynamic rank allocation), and rapidly became the industry and academic standard for efficient model adaptation, placing it alongside foundational training methodologies in its field-wide impact.
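The mechanism the summary describes is compact enough to sketch directly. Below is a minimal numpy illustration (names like `lora_forward` are our own; real implementations typically adapt the attention projection matrices inside a Transformer rather than a standalone layer):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 4           # layer dims and adapter rank (r << min(d, k))
W0 = rng.normal(size=(d, k))  # frozen pre-trained weight, never updated
A = rng.normal(scale=0.01, size=(r, k))  # trainable down-projection
B = np.zeros((d, r))          # trainable up-projection, zero init -> adapter starts as a no-op

def lora_forward(x, scale=1.0):
    """y = W0 x + scale * B A x; only A and B would receive gradients."""
    return W0 @ x + scale * (B @ (A @ x))

x = rng.normal(size=k)
# With B = 0, the adapted layer exactly reproduces the frozen layer at init.
assert np.allclose(lora_forward(x), W0 @ x)
```

Per adapted matrix, the trainable parameter count drops from d·k to r·(d + k), which for small r is the source of the large reductions cited above.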
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a. score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.
This work unifies discrete diffusion processes and continuous score-based generative modeling into a single stochastic differential equation framework, deriving the reverse-time SDE and establishing an equivalent probability flow ODE for exact likelihood computation. The paper provides a rigorous mathematical foundation that bridges previously disjoint generative paradigms, introducing a predictor-corrector sampling scheme and demonstrating how learned score functions can be systematically leveraged for high-fidelity synthesis and principled inverse problem solving. By formalizing the continuous-time limit of noise injection and removal, it fundamentally reshaped the theoretical understanding of likelihood-free generative modeling and directly enabled the subsequent proliferation of diffusion-based architectures that now dominate visual, auditory, and multimodal synthesis. Its conceptual clarity, mathematical elegance, and immediate empirical breakthroughs establish it as a cornerstone reference that redefined how the field approaches generative modeling, warranting placement alongside the most transformative methodological advances in modern machine learning.
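For reference, the three central equations the summary alludes to can be written as follows (in the paper's notation, with $\bar{\mathbf{w}}$ a reverse-time Wiener process):

```latex
% Forward (noising) SDE:
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time SDE, which depends on the data only through the score:
\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}

% Probability flow ODE sharing the same marginals p_t (exact likelihoods):
\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,\mathrm{d}t
```

Training a network to approximate $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is all that is needed to simulate either reverse process.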
DeepMind; launched modern deep RL
Introduced the Deep Q-Network architecture, which successfully combined convolutional neural networks with Q-learning stabilized by experience replay and target networks to learn control policies directly from raw pixel inputs. This work serves as the foundational catalyst for modern deep reinforcement learning, effectively acting as the field's "AlexNet" moment by proving that end-to-end representation learning could replace handcrafted features and tabular methods in complex control tasks. The algorithmic primitives it established—particularly off-policy stabilization techniques and the decoupling of data collection from policy updates—became standard components in virtually all subsequent deep RL research, spawning entire lineages of follow-up work and shifting the community's focus toward scalable, general-purpose agents. While the metadata lists zero citations, its historical trajectory demonstrates transformational, field-wide significance that comfortably places it alongside foundational breakthroughs like ResNet and the Transformer in terms of paradigm-shifting impact.
Kaplan et al.; power-law compute/data/parameter tradeoffs
Establishes empirical power-law relationships between compute, dataset size, and model parameters, providing a principled framework for compute-optimal scaling of neural language models. This work fundamentally shifted large-scale model development from heuristic, trial-and-error experimentation to a rigorous, predictable engineering discipline. By demonstrating that performance follows smooth, predictable scaling trajectories rather than exhibiting abrupt phase transitions, it provided the mathematical and empirical foundation for strategically allocating compute budgets across architectural size and training data. The methodology directly informed the training strategies of subsequent foundational models and established scaling analysis as a core research paradigm across the field. While later studies refined the precise coefficients and data-compute tradeoffs, this paper introduced the conceptual framework and systematic evaluation protocol that redefined how researchers approach resource allocation, capability forecasting, and training efficiency, cementing its status as a foundational pillar of modern AI research.
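The power-law form is simple enough to evaluate directly. A minimal sketch of the parameter-scaling law, with coefficients approximately those reported in the paper (treat the exact constants as indicative rather than authoritative):

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Kaplan-style power law for test loss vs. non-embedding parameters:
    L(N) = (N_c / N) ** alpha_N, with data and compute assumed non-limiting.
    Coefficients are approximate values from the paper's fits."""
    return (n_c / n_params) ** alpha_n

# Each doubling of model size multiplies the loss by 2 ** -alpha_N,
# i.e. a smooth ~5% improvement per doubling rather than a phase transition.
improvement = 1.0 - 2 ** -0.076
```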
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
The paper introduces the lottery ticket hypothesis, demonstrating that dense neural networks contain sparse subnetworks that, when isolated and trained from their original random initialization, can match the performance of the full network. This work fundamentally reshaped our understanding of neural network trainability, initialization, and model compression by revealing that successful optimization depends critically on fortuitous initial weight configurations rather than merely architectural capacity. The proposed iterative magnitude pruning procedure to uncover these winning tickets overturned the prevailing assumption that sparse networks are inherently difficult to train from scratch, sparking a sustained wave of research into sparse training dynamics, loss landscape geometry, and the role of overparameterization. Its insights directly inform modern approaches to efficient model deployment, early-bird training strategies, and theoretical analyses of why gradient descent succeeds in high-dimensional spaces. While subsequent studies have refined the exact conditions under which tickets emerge and extended the framework to modern architectures and large-scale regimes, the core hypothesis remains a cornerstone of machine learning efficiency research, effectively bridging empirical compression practices with deeper questions about optimization and generalization.
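The iterative magnitude pruning loop the summary refers to fits in a few lines. In this sketch `train_fn` is a stand-in for actual SGD training, and the 20% per-round pruning rate mirrors a typical setting from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def find_ticket(init_weights, train_fn, prune_frac=0.2, rounds=3):
    """Sketch of iterative magnitude pruning: train, prune the
    smallest-magnitude surviving weights, rewind to init, repeat."""
    mask = np.ones_like(init_weights)
    for _ in range(rounds):
        trained = train_fn(init_weights * mask) * mask       # train the masked subnetwork
        threshold = np.quantile(np.abs(trained[mask == 1]), prune_frac)
        mask = np.where((np.abs(trained) >= threshold) & (mask == 1), 1.0, 0.0)
    # The "winning ticket" is the surviving structure rewound to its ORIGINAL init.
    return init_weights * mask, mask

w0 = rng.normal(size=1000)
ticket, mask = find_ticket(w0, train_fn=lambda w: w * 1.5)  # toy stand-in for training
```

Three rounds at 20% leave roughly 0.8³ ≈ 51% of weights; the paper iterates further to reach the sub-20% tickets quoted above.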
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
The paper establishes a highly capable, multimodal foundation model whose performance scales predictably across orders of magnitude in compute, setting new benchmarks for factual reasoning and alignment. This work represents a watershed moment in scaling large language models, demonstrating that careful infrastructure optimization and alignment pipelines can yield reliable, human-level performance across professional and academic domains. While the architectural details remain deliberately opaque, the empirical validation of predictable scaling laws and the successful integration of vision-language pretraining fundamentally shifted industry standards and research trajectories. The alignment methodology and benchmark evaluations provided a crucial reference point for subsequent open and closed models, cementing its role as a cornerstone reference for modern foundation model development.
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
The paper introduces a gradient-based meta-learning framework that optimizes model initializations for rapid task adaptation, establishing a rigorous and scalable paradigm for few-shot learning. By framing meta-learning as a differentiable optimization problem over parameter space rather than relying on hand-crafted architectures or distance metrics, it provides a theoretically grounded and highly flexible approach that integrates seamlessly with standard deep learning pipelines. Its influence extends well beyond few-shot classification, fundamentally shaping research in meta-reinforcement learning, continual learning, and efficient fine-tuning strategies. While subsequent methods have addressed its computational overhead and stability challenges, the core insight of learning an initialization that is maximally sensitive to task-specific gradients remains a cornerstone of modern adaptive learning research.
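The "initialization that is maximally sensitive to task-specific gradients" idea can be made concrete on a toy 1-D problem where the gradient through the inner adaptation step is written by hand (a sketch; real MAML differentiates through the inner loop automatically, over neural-network parameters):

```python
def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One MAML meta-update on toy tasks of the form loss(theta) = (a*theta - b)**2."""
    meta_grad = 0.0
    for a, b in tasks:
        grad_inner = 2 * a * (a * theta - b)                  # inner-loop gradient
        theta_adapted = theta - inner_lr * grad_inner         # one adaptation step
        # Outer gradient of the POST-adaptation loss w.r.t. theta,
        # chaining through the inner step: d(theta')/d(theta) = 1 - inner_lr * 2 * a**2.
        meta_grad += 2 * a * (a * theta_adapted - b) * (1 - inner_lr * 2 * a * a)
    return theta - outer_lr * meta_grad / len(tasks)

theta = 0.0
tasks = [(1.0, 2.0), (1.0, -2.0), (2.0, 1.0)]
for _ in range(200):
    theta = maml_step(theta, tasks)
# theta drifts toward the point whose one-step-adapted loss, averaged over tasks, is lowest.
```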
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
The paper establishes a compute-optimal scaling law demonstrating that model parameters and training tokens should scale proportionally, fundamentally redirecting LLM training strategies toward data-rich, parameter-efficient regimes. This work systematically dismantles the prevailing industry heuristic of prioritizing parameter count over training data volume. By conducting an extensive empirical sweep across hundreds of models, the authors derive a precise mathematical relationship for compute allocation that maximizes performance per FLOP. The resulting paradigm shift has been rapidly adopted across both academic and industrial labs, directly informing the training recipes of subsequent state-of-the-art models. Rather than introducing a new architecture or optimization algorithm, the contribution lies in its rigorous empirical characterization of scaling dynamics, which has proven indispensable for resource-constrained training and efficient inference. The findings effectively recalibrate the entire field’s approach to foundation model development, making it a cornerstone reference for modern LLM engineering.
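The allocation rule is easy to state in code. A sketch assuming the common C ≈ 6·N·D FLOP estimate and the roughly 20-tokens-per-parameter ratio implied by the paper's fits (the exact ratio varies with the fitting approach, so treat it as a rule of thumb):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal allocation: scale params N and tokens D together,
    with C ~= 6 * N * D and D ~= 20 * N (a widely quoted reading of the fits)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: 70B params on 1.4T tokens -> C ~= 6 * 70e9 * 1.4e12 FLOPs.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
```

Note that both N and D scale as C^0.5, which is the "scale them equally" prescription: doubling compute should grow the model and the dataset by √2 each.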
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
Introduces a clipped surrogate objective that enables stable, multi-epoch policy gradient updates without complex second-order optimization. The work solves a persistent instability problem in policy gradient methods by replacing computationally heavy trust-region constraints with a simple probability ratio clipping mechanism, allowing practitioners to safely reuse sampled trajectories across multiple optimization epochs using standard first-order optimizers. This algorithmic simplification dramatically lowers the engineering barrier to stable reinforcement learning while delivering robust empirical performance across continuous control, robotics, and discrete action domains. Its field-wide significance extends well beyond traditional RL benchmarks; the method’s reliability, ease of tuning, and computational efficiency made it the de facto standard for on-policy optimization, ultimately serving as the foundational training engine for reinforcement learning from human feedback. By directly enabling the practical alignment of modern large language models, the approach has fundamentally reshaped how the broader machine learning community approaches policy optimization and preference-based training.
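The clipping mechanism described above fits in a few lines of numpy. This is a sketch of the surrogate loss alone; a full PPO implementation adds value-function and entropy terms and minibatch epochs:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (negated so it can be minimized):
    L = -mean( min(r * A, clip(r, 1 - eps, 1 + eps) * A) ), r = pi_new / pi_old."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# For a positive advantage, pushing the ratio past 1 + eps yields no further
# improvement in the objective -- the clip removes the incentive to over-update
# on reused trajectories, which is what makes multi-epoch updates safe.
adv = np.array([1.0])
same = ppo_clip_loss(np.log([0.75]), np.log([0.5]), adv)  # ratio 1.5, clipped at 1.2
```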
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
Direct Preference Optimization reformulates the reinforcement learning from human feedback objective to derive a closed-form optimal policy, enabling preference alignment through a simple supervised classification loss without explicit reward modeling or complex reinforcement learning algorithms. The work provides a mathematically elegant bridge between supervised fine-tuning and reinforcement learning, demonstrating that the language model itself can implicitly parameterize the reward function during training. By eliminating the need for separate reward model training, policy sampling, and unstable PPO optimization, the method dramatically reduces computational overhead and engineering complexity while maintaining or surpassing the alignment quality of traditional pipelines. This conceptual simplification has rapidly reshaped the alignment landscape, establishing a new standard for preference optimization that is highly reproducible, computationally efficient, and has already inspired a broad family of derivative methods across both academic and open-source communities. The paper’s theoretical clarity and practical utility represent a substantial advance in making large-scale model alignment accessible and robust.
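On a single preference pair, the resulting loss is just a logistic loss on a reward margin. A minimal numpy sketch (function names are ours; in practice the log-probabilities are sequence log-likelihoods under the trainable policy and a frozen reference model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective for one pair (y_w preferred over y_l):
    -log sigma( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] ).
    The implicit reward is beta * log(pi_theta / pi_ref); no explicit reward
    model or RL rollout is needed."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin))

# Raising the chosen response's likelihood relative to the reference (and
# lowering the rejected one's) increases the margin and drives the loss to zero.
```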
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
QLoRA introduces a theoretically grounded 4-bit quantization scheme combined with targeted memory management techniques that enable full-parameter-equivalent fine-tuning of massive language models on single consumer-grade GPUs. This work fundamentally democratizes access to large-scale model adaptation by solving the critical memory bottleneck that previously restricted fine-tuning to well-resourced compute clusters. The introduction of the NormalFloat data type provides a rigorous information-theoretic foundation for quantizing normally distributed neural weights, while double quantization and paged optimizers address practical training instabilities with elegant systems-level engineering. By preserving the representational capacity of standard precision training while drastically compressing the frozen backbone, the method bridges the gap between theoretical efficiency and practical deployment. Its impact extends far beyond isolated benchmark improvements, catalyzing the open-weight ecosystem and establishing a new standard for accessible LLM development. Compared to foundational training methods, it operates as a critical enabler that multiplies the utility of existing architectures and adaptation techniques, functioning much like how memory-efficient attention mechanisms unlocked previously infeasible context lengths. The comprehensive empirical validation across model scales and instruction datasets, alongside its critical analysis of evaluation methodologies, demonstrates both methodological rigor and a clear understanding of the field's evolving needs. This paper represents a substantial shift in how practitioners approach model adaptation, transforming fine-tuning from a resource-intensive bottleneck into a widely accessible workflow that will likely serve as a foundational component in future open-model pipelines.
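Block-wise quantization with per-block constants, the structure that double quantization then compresses further, can be sketched as follows. Note this uses a plain uniform absmax grid for clarity; NF4 instead places its 16 levels at quantiles of a standard normal, which is what the information-theoretic optimality claim refers to:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_blockwise(w, block=64, bits=4):
    """Sketch of block-wise absmax quantization: each block stores one float
    scale (the "quantization constant") plus a low-bit code per weight."""
    levels = 2 ** (bits - 1) - 1                     # symmetric grid: -7..7 for 4 bits
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True)    # per-block absmax constant
    codes = np.round(w / scales * levels).astype(np.int8)
    return codes, scales

def dequantize_blockwise(codes, scales, bits=4):
    levels = 2 ** (bits - 1) - 1
    return codes.astype(np.float64) / levels * scales

w = rng.normal(size=256)
codes, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(codes, scales).reshape(-1)
# Per-element error is bounded by half a grid step, scale / levels / 2, per block.
```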
We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.
Introduces a scalable, first-order approximation of spectral graph convolutions that enables efficient semi-supervised learning on graph-structured data. This work bridges the gap between computationally prohibitive spectral graph theory and practical deep learning by deriving a localized, linear-complexity convolution operation that naturally propagates node features across graph neighborhoods. By simplifying spectral filtering to a highly efficient message-passing scheme, it establishes a theoretically grounded architecture that rapidly became the de facto standard for graph representation learning. The approach fundamentally shifted the field away from handcrafted graph kernels and expensive eigen-decompositions toward end-to-end trainable neural architectures, catalyzing the modern era of graph neural networks. Much like how convolutional networks revolutionized image processing by exploiting spatial locality, this method unlocked scalable representation learning for non-Euclidean data, enabling breakthroughs across computational chemistry, social network analysis, and knowledge graph reasoning. Its elegant mathematical derivation, combined with empirical robustness and computational efficiency, ensures it will remain a cornerstone reference for any work operating on relational data structures.
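The propagation rule reduces to a symmetrically normalized neighborhood average followed by a shared linear map. A numpy sketch on a 3-node path graph (the function name and toy graph are ours):

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN propagation step: H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W ).
    Adding self-loops (A + I) lets each node keep its own features; the
    symmetric normalization averages over neighborhoods by degree."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, norm @ features @ weight)

# Tiny path graph 0 - 1 - 2, one-hot node features, all-ones weight matrix.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
h = gcn_layer(adj, np.eye(3), np.ones((3, 2)))
```

Each layer costs O(|E|·d) via sparse multiplication with Â, which is the linear-in-edges scaling the abstract claims.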
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
Introduces selective state space models that dynamically adjust parameters based on input content, combined with a hardware-aware parallel algorithm, enabling linear-time sequence modeling that rivals Transformer performance across multiple modalities. The work successfully addresses a fundamental bottleneck in sequence modeling by identifying the lack of content-based reasoning as the primary limitation of prior subquadratic architectures. By making state space parameters input-dependent, the authors bridge the gap between the computational efficiency of recurrent models and the representational power of attention mechanisms. The accompanying hardware-aware parallel scan algorithm is a crucial systems contribution that makes training these selective models practical at scale. Empirically, the architecture demonstrates strong scaling properties, competitive performance against larger Transformers, and versatility across language, audio, and genomics. While the Transformer remains the dominant paradigm, this work provides a compelling, theoretically grounded alternative for long-context and efficiency-critical applications, sparking substantial follow-up research and hybrid architectural designs. Its methodological clarity, empirical rigor, and immediate practical utility position it as a highly influential contribution to modern foundation model design.
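The "selective" idea, making the discretized transition a function of the current input, can be illustrated with a scalar-state recurrence. This is a heavily simplified sketch: the real model uses multi-channel states, learned projections for Δ, B, and C, and a hardware-aware parallel scan instead of a Python loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_scan(x, a=-1.0):
    """Scalar selective-SSM recurrence. Unlike a time-invariant SSM, the step
    size delta depends on the input, so the model chooses per token how much
    history to keep (small delta) or overwrite (large delta)."""
    h, ys = 0.0, []
    for xt in x:
        delta = np.log1p(np.exp(xt))     # softplus: input-dependent step size
        a_bar = np.exp(delta * a)        # discretized decay, in (0, 1)
        b_bar = (a_bar - 1.0) / a        # zero-order-hold discretization of B = 1
        h = a_bar * h + b_bar * xt       # state update: selective forgetting
        ys.append(h)                     # readout with C = 1
    return np.array(ys)

y = selective_scan(rng.normal(size=16))
```

A large positive input drives a_bar toward 0, so the state is almost fully replaced by the new token, which is the content-based gating prior subquadratic models lacked.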
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
This work introduces Deep Deterministic Policy Gradient (DDPG), successfully adapting experience replay and target networks from discrete-action Deep Q-Networks to deterministic policy gradients for stable, model-free continuous control. The paper bridges a critical gap in deep reinforcement learning by demonstrating that a single, unified architecture can robustly solve diverse continuous control tasks directly from high-dimensional sensory inputs, matching planning-based baselines without requiring domain-specific derivatives. By establishing a reliable off-policy framework for continuous action spaces, it became the foundational baseline that directly catalyzed subsequent breakthroughs in actor-critic methods, fundamentally shifting how the community approaches sample-efficient control in robotics and simulation.
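The two stabilizers carried over from DQN can be stated compactly: a bootstrapped critic target computed with slowly moving target networks, and Polyak ("soft") averaging to update those targets. The helper names below are illustrative, not from the paper's code:

```python
import numpy as np

def ddpg_critic_target(r, gamma, q_next, done):
    """Bootstrapped target y = r + gamma * Q'(s', mu'(s')), zeroed at terminal steps.
    q_next is the target critic evaluated at the target actor's action."""
    return r + gamma * q_next * (1.0 - done)

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online.
    The slow drift of the targets is a key DDPG stability mechanism."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With `tau` near zero the targets change slowly between updates, which decorrelates the regression target from the rapidly changing online critic.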
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
This work introduces a principled framework that unifies maximum entropy reinforcement learning with off-policy actor-critic methods to achieve stable, sample-efficient continuous control. By explicitly optimizing for both reward maximization and policy entropy, the method naturally encourages robust exploration and yields policies that are inherently more resilient to environmental perturbations and distributional shifts. The integration of a stochastic actor with twin Q-networks and an adaptive temperature parameter resolves the brittle convergence and hyperparameter sensitivity that historically limited deep RL deployment. Although reinforcement learning remains a specialized subdomain of machine learning, this algorithm rapidly became the de facto standard for continuous control benchmarks, directly enabling practical advances in robotic manipulation, sim-to-real transfer, and offline reinforcement learning. Its methodological elegance, empirical reliability, and widespread adoption across both academic and industrial robotics pipelines justify its placement among the most influential algorithmic contributions in modern machine learning, bridging the gap between theoretical exploration principles and real-world control applicability.
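The entropy-regularized Bellman target at the heart of the method fits in a few lines. This sketch follows the widely used variant with a twin-Q minimum; `sac_target` is a hypothetical helper name:

```python
import numpy as np

def sac_target(r, gamma, q1_next, q2_next, log_pi_next, alpha, done):
    """Soft Bellman target: reward plus discounted soft value of the next state.

    The soft value subtracts the (temperature-weighted) log-probability of the
    sampled next action, so the critic credits policies for staying stochastic.
    The min over twin critics counteracts Q-value overestimation.
    """
    soft_v = np.minimum(q1_next, q2_next) - alpha * log_pi_next
    return r + gamma * (1.0 - done) * soft_v
```

The temperature `alpha` trades off reward against entropy; in the adaptive variant it is itself tuned to hold the policy near a target entropy.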
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
Introduces a scalable alignment framework that replaces human preference labels with AI-generated critiques and preferences guided by a fixed set of principles, enabling effective harmlessness training with minimal human oversight. The work addresses a critical bottleneck in preference optimization by formalizing a self-improvement loop where models critique and revise their own outputs against a predefined constitution before generating training data for reward modeling. This paradigm shift away from human-in-the-loop labeling toward AI-driven oversight has rapidly become a cornerstone of modern alignment research, offering a more transparent, scalable, and less evasive alternative to traditional reinforcement learning from human feedback. While it synthesizes existing concepts like chain-of-thought reasoning and preference learning, its systematic integration into a cohesive, two-phase training pipeline represents a substantial methodological advance that has fundamentally reshaped how practitioners approach scalable oversight. The framework’s practical efficacy and widespread adoption across subsequent alignment pipelines justify its high standing, even as it remains primarily an applied systems contribution rather than a theoretical breakthrough.
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
The paper introduces masked self-attention to graph-structured data, enabling nodes to dynamically weight neighborhood contributions without relying on fixed spectral filters or costly matrix operations. By decoupling attention computation from graph topology, it elegantly bridges the gap between sequence-based attention mechanisms and relational learning, solving critical limitations of earlier graph convolutional approaches regarding inductive generalization and computational scalability. While spectral methods require fixed graph structures and struggle with inductive settings, this approach leverages localized, learnable weighting that scales efficiently and generalizes to unseen graphs. Its conceptual clarity and practical utility quickly established it as a foundational baseline, fundamentally shifting how the community approaches neighborhood aggregation and cementing attention as a standard primitive for modeling complex relational structures alongside sequence and vision domains.
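The masked neighborhood attention can be sketched for a single head in plain NumPy. This is a minimal sketch: `a_src` and `a_dst` are the two halves of the paper's attention vector a = [a_src || a_dst], the LeakyReLU slope 0.2 follows the paper, and multi-head concatenation is omitted:

```python
import numpy as np

def gat_layer(H, adj, W, a_src, a_dst):
    """One masked graph-attention head.

    H:   (N, F_in) node features    adj: (N, N) binary adjacency (with self-loops)
    W:   (F_in, F_out) shared linear map
    a_src, a_dst: (F_out,) attention vector halves
    """
    Z = H @ W                                     # transformed node features
    e = np.add.outer(Z @ a_src, Z @ a_dst)        # e_ij = a_src.z_i + a_dst.z_j
    e = np.where(e > 0, e, 0.2 * e)               # LeakyReLU(0.2)
    e = np.where(adj > 0, e, -1e9)                # mask: attend only to neighbors
    e = e - e.max(axis=1, keepdims=True)          # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Z                              # weighted neighborhood aggregation
```

The masking step is why the layer needs no global graph operation: each node's softmax runs only over its own neighborhood, so the same weights apply to unseen (inductive) graphs.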
Kitaev et al. (Reformer); LSH attention; reduced quadratic complexity
Reformer introduces locality-sensitive hashing (LSH) attention and reversible residual layers to reduce the computational and memory complexity of self-attention from quadratic to near-linear, enabling practical training on long sequences. The work directly tackles the primary scaling bottleneck of the original Transformer architecture by replacing exact dot-product attention with an approximate, hash-based routing scheme while using reversible layers to eliminate intermediate activation storage. This combination catalyzed a major research direction in efficient long-context modeling, directly inspiring subsequent sparse, linear, and kernel-based attention mechanisms. While hardware-aware exact attention kernels like FlashAttention have since become the dominant paradigm for standard sequence lengths, Reformer established the algorithmic blueprint for memory-constrained training and approximate attention, securing its place as a foundational reference in the efficiency literature.
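The hash-based routing can be illustrated with the angular LSH scheme Reformer builds on: project vectors through random rotations and bucket by the nearest signed axis. This is a simplified sketch; the full method uses shared query-key vectors, multiple hash rounds, and chunked attention within buckets, and `lsh_buckets` is an illustrative name:

```python
import numpy as np

def lsh_buckets(q, n_buckets, rng):
    """Angular LSH: bucket = argmax over the concatenation [xR ; -xR].

    q: (L, d) vectors. Nearby (high cosine similarity) vectors land in the same
    bucket with high probability, so each token only attends within its bucket,
    replacing the full L x L attention matrix.
    """
    d = q.shape[1]
    R = rng.standard_normal((d, n_buckets // 2))     # random projection matrix
    proj = q @ R
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)
```

Sorting tokens by bucket id and attending within fixed-size chunks is what brings the cost from O(L^2) toward O(L log L).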
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Introduces a natively multimodal transformer family trained from scratch on mixed modalities, establishing new state-of-the-art performance across diverse reasoning and perception benchmarks while detailing scalable post-training and deployment pipelines. The work represents a substantial architectural and engineering milestone by shifting away from late-fusion or adapter-based paradigms toward unified, natively trained representations across text, vision, audio, and video. Its rigorous benchmarking resets performance ceilings across both unimodal and multimodal tasks, particularly in complex reasoning domains, and provides valuable empirical insights into scaling laws and post-training alignment for heterogeneous data. While the architectural specifics remain somewhat high-level compared to traditional academic publications, the methodological choices around mixture-of-experts routing, cross-modal attention mechanisms, and instruction-tuning pipelines offer a practical blueprint for next-generation foundation models. The proprietary nature of the weights and training infrastructure somewhat limits immediate reproducibility, yet the reported capabilities and evaluation frameworks will undoubtedly steer research priorities and industrial development toward truly unified multimodal systems.
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
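The update rule the abstract summarizes is short enough to state exactly. `adam_step` is an illustrative stateless helper (the paper's Algorithm 1 keeps m, v, and t as optimizer state); default hyper-parameters follow the paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of grads
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: running mean of grad^2
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-coordinate division by `sqrt(v_hat)` is what makes the step size invariant to diagonal rescaling of the gradients, and the bias correction keeps early steps well-scaled despite the zero-initialized moments.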
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
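The per-mini-batch normalization is, in training mode, only a few lines. This sketch omits the running statistics that a deployed layer tracks for inference:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization of (batch, features) activations.

    Each feature is standardized across the mini-batch, then rescaled by the
    learned parameters gamma (scale) and beta (shift), so the network can
    still represent the identity transform if that is optimal.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta
```

Because `mu` and `var` are recomputed from each mini-batch, every layer sees inputs with a stable distribution regardless of how the preceding parameters drift, which is what permits the much higher learning rates the abstract reports.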
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
The paper demonstrates that a pure Transformer architecture, applied directly to sequences of image patches without convolutional inductive biases, achieves state-of-the-art image classification when pre-trained at scale. This work fundamentally challenged the long-standing dominance of convolutional neural networks by proving that architectural simplicity combined with massive data can effectively replace hand-crafted spatial priors. Its straightforward design and rigorous empirical validation catalyzed a rapid paradigm shift across the entire vision community, spawning a vast ecosystem of architectural variants and establishing the standard backbone for modern multimodal, generative, and dense prediction systems. By decoupling vision performance from complex architectural engineering and tying it directly to compute and data scaling, the paper redefined how researchers approach representation learning and cemented its place as a foundational milestone in machine learning.
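The "sequences of image patches" step is pure reshaping; a minimal sketch (the learned linear projection, class token, and positional embeddings that follow in ViT are omitted):

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into non-overlapping patches, one token each.

    Returns (num_patches, patch*patch*C): the flattened patch sequence a ViT
    linearly projects before adding positional embeddings.
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    g = img.reshape(H // patch, patch, W // patch, patch, C)
    g = g.transpose(0, 2, 1, 3, 4)                  # (grid_h, grid_w, p, p, C)
    return g.reshape(-1, patch * patch * C)
```

A 224x224 image with 16x16 patches becomes a sequence of 196 tokens, after which the standard Transformer encoder is applied unchanged.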
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion
The paper establishes a mathematically principled and practically stable framework for training diffusion-based generative models by unifying variational inference with denoising score matching, achieving image quality competitive with adversarial methods. This work fundamentally shifts the generative modeling landscape by demonstrating that non-adversarial, likelihood-inspired training can rival the sample quality of leading GAN architectures while entirely sidestepping their notorious training instability and mode collapse. The core intellectual contribution lies in deriving a simplified, weighted variational objective that directly optimizes denoising steps, effectively bridging nonequilibrium thermodynamics with scalable deep learning practice. By providing a robust training paradigm that naturally supports progressive refinement and interpretable lossy decompression, the methodology establishes a new architectural standard that rapidly displaces adversarial approaches as the dominant paradigm for high-fidelity synthesis. Its theoretical clarity, empirical rigor, and open implementation catalyze a broad transition across the field, laying the essential mathematical and engineering groundwork for the subsequent generation of text-conditioned, video, and 3D generative systems.
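Two pieces of the framework are compact enough to state directly: the closed-form forward noising and the simplified objective the weighted variational bound reduces to, namely plain MSE on the injected noise. Function names here are illustrative:

```python
import numpy as np

def q_sample(x0, t, alphas_cumprod, noise):
    """Closed-form forward process: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.

    alphas_cumprod holds the cumulative products abar_t of the noise schedule,
    so any timestep can be sampled directly without iterating the chain.
    """
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise

def simple_loss(eps_pred, noise):
    """The simplified training objective: predict the noise that was added."""
    return np.mean((eps_pred - noise) ** 2)
```

Training draws a random `t`, noises `x0` with `q_sample`, and regresses the network's noise prediction against `noise`; sampling then runs the learned denoising steps in reverse.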
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
Introduces a scalable contrastive pre-training framework that learns robust visual representations directly from natural language supervision, enabling unprecedented zero-shot transfer across diverse vision tasks. This work fundamentally shifts computer vision from closed-set, manually annotated training to open-vocabulary, web-scale multimodal learning. By demonstrating that a straightforward dual-encoder architecture trained on hundreds of millions of noisy image-text pairs can match fully supervised baselines without seeing a single labeled example, it proves the power of scale and natural language as a universal supervisory signal. The technical simplicity belies its profound impact: it established the architectural and training blueprint for modern vision-language models, catalyzed the open-vocabulary detection and segmentation subfields, and became the foundational alignment mechanism for subsequent generative systems. While contrastive learning existed in prior literature, the rigorous empirical demonstration of scaling laws, the systematic zero-shot evaluation across dozens of datasets, and the public release of highly capable weights collectively redefine how the field approaches representation learning and multimodal alignment. Its influence extends far beyond classification, serving as the critical bridge between vision and language that underpins the current generation of multimodal AI.
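The "predict which caption goes with which image" objective is a symmetric InfoNCE loss over a batch, with matching pairs on the diagonal of the similarity matrix. A minimal NumPy sketch, with the learned temperature fixed for simplicity:

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    Row i of each input is a matching pair; all other pairings in the batch
    serve as negatives. The loss averages the image-to-text and text-to-image
    cross-entropies on the diagonal.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

At inference, the same similarity matrix against a set of caption templates ("a photo of a {label}") yields the zero-shot classifier the abstract benchmarks.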
Ren et al. (Faster R-CNN); end-to-end detector; standard baseline for years
Introduces the Region Proposal Network (RPN) to unify feature extraction, proposal generation, and bounding box regression into a single, end-to-end trainable architecture. This work fundamentally shifted object detection from multi-stage pipelines reliant on hand-crafted external proposals to learned, shared-convolution paradigms, establishing the two-stage detector blueprint that dominated the field for nearly a decade and directly enabled subsequent breakthroughs like Mask R-CNN and Cascade R-CNN. Despite the rise of single-stage and transformer-based detectors, its architectural principles remain foundational to modern vision systems, justifying its placement well above the threshold for field-wide significance.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
Introduces a symmetric encoder-decoder convolutional architecture with skip connections and a data-efficient training strategy that became the foundational backbone for dense prediction and biomedical image segmentation. U-Net fundamentally shifted the paradigm for pixel-wise prediction by demonstrating that precise localization requires symmetric upsampling paths with direct feature concatenation from the contracting path, effectively solving the spatial resolution loss inherent in early fully convolutional networks. Its strategic use of heavy elastic data augmentation and overlap-tile inference established a robust blueprint for training on small, high-precision datasets, directly addressing the data scarcity that plagues specialized imaging domains. Beyond its original medical focus, the architecture’s design principles became the de facto standard for volumetric analysis, influenced modern segmentation frameworks, and were later adopted as the core backbone for contemporary diffusion-based generative models. While it synthesizes concepts from prior convolutional approaches, the specific architectural symmetry, training methodology, and consistent empirical dominance across diverse benchmarks cement its status as a field-defining contribution that continues to serve as the primary starting point for dense prediction research and deployment worldwide.
Carion et al. (DETR); detection as set prediction; replaced anchors
DETR reformulates object detection as a direct set prediction problem using a Transformer encoder-decoder architecture, eliminating hand-crafted components like anchor boxes and non-maximum suppression. This work represents a fundamental paradigm shift in dense prediction, successfully adapting sequence modeling architectures to spatial reasoning tasks and proving that end-to-end differentiable bipartite matching can replace heuristic post-processing. By removing the inductive biases and engineering complexity of anchor-based CNN pipelines, it established a cleaner, more scalable design philosophy that directly enabled the subsequent wave of transformer-based detectors and segmentation models, effectively redefining the standard architecture for modern vision systems.
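The bipartite matching that lets DETR drop non-maximum suppression can be illustrated with an exhaustive solver. DETR itself uses the Hungarian algorithm (e.g. scipy's `linear_sum_assignment`); the brute-force version below is only viable for tiny examples and assumes at least as many predictions as ground-truth objects:

```python
import numpy as np
from itertools import permutations

def match_predictions(cost):
    """Exhaustive optimal one-to-one assignment of predictions to objects.

    cost[i, j]: matching cost (class + box terms in DETR) between prediction i
    and ground-truth object j. Returns the assignment of one prediction per
    object minimizing total cost; unmatched predictions are trained toward
    the "no object" class, so no duplicate-removal heuristic is needed.
    """
    n_gt = cost.shape[1]
    best, best_cost = None, np.inf
    for perm in permutations(range(cost.shape[0]), n_gt):  # brute force: tiny n only
        c = sum(cost[p, j] for j, p in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return best, best_cost
```

Because the loss is computed on this unique matching, two predictions claiming the same object are directly penalized, which is what makes the set-prediction formulation end-to-end differentiable in effect.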
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
The paper introduces a two-stage generative framework that first maps text prompts to a CLIP image embedding via a learned prior, then synthesizes high-fidelity images from that embedding using a diffusion decoder. This architecture fundamentally restructured the text-to-image generation pipeline by decoupling semantic alignment from pixel-level synthesis, addressing the diversity and fidelity bottlenecks of earlier autoregressive and GAN-based approaches. By explicitly conditioning a diffusion decoder on contrastive latent representations, the work demonstrated that high-level semantic guidance can be cleanly separated from low-level texture generation, yielding superior prompt adherence, sample diversity, and zero-shot language-guided manipulation. The methodological shift toward hierarchical latent conditioning directly established the blueprint for the modern wave of scalable text-to-image systems, catalyzing the broader transition from pixel-space modeling to latent diffusion paradigms and enabling widespread downstream applications in creative and industrial workflows.
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
Introduces a unified autoregressive transformer that jointly models text and discrete image tokens, demonstrating that scaling data and compute enables competitive zero-shot text-to-image synthesis without task-specific architectural priors. This work fundamentally shifts the generative modeling paradigm by treating multimodal generation as a single sequence prediction problem, effectively bypassing the need for complex adversarial training dynamics or hand-crafted architectural inductive biases. By leveraging discrete visual representations and aligning them with textual embeddings in a unified transformer, the authors demonstrate that scale alone can unlock robust zero-shot capabilities across diverse visual concepts. The approach catalyzed a major transition in the field toward large-scale autoregressive and diffusion-based multimodal models, establishing the blueprint for subsequent foundation models in vision-language generation. While later diffusion architectures would eventually surpass its sample quality, the conceptual framework of unified token streams and scaling-driven zero-shot generalization remains a cornerstone of modern generative AI research.
Caron et al. (DINO); Meta; self-distillation; strong visual features without labels
Introduces a self-distillation framework that trains Vision Transformers without labels, revealing emergent semantic properties like object segmentation and patch correspondence in attention maps. The work fundamentally shifted self-supervised representation learning by demonstrating that carefully designed teacher-student dynamics, combined with multi-crop augmentation and output centering, can produce features rivaling supervised pretraining. Its architectural simplicity and the discovery that unsupervised ViT attention naturally aligns with object boundaries catalyzed a wave of subsequent research, establishing a new standard for foundation model pretraining and directly influencing the trajectory of modern vision systems.
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
The paper establishes that aggressive random patch masking combined with an asymmetric encoder-decoder architecture enables highly efficient and scalable self-supervised pre-training for vision transformers. By demonstrating that high masking ratios force models to learn robust semantic features rather than trivial local patterns, the work successfully adapts language-modeling paradigms to computer vision and fundamentally shifts the field away from computationally expensive contrastive learning. Its architectural simplicity, dramatic training speedups, and state-of-the-art transfer performance have made it a foundational pre-training recipe, directly catalyzing subsequent breakthroughs in multimodal representation learning, video understanding, and scalable visual foundation models. The approach redefined how practitioners approach visual representation learning, offering a practical and theoretically sound pathway to train massive vision backbones on standard datasets.
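The high-ratio random masking that defines the pretext task is a few lines: shuffle patch positions, keep a small visible subset for the encoder, and record the binary mask the decoder uses to place mask tokens. A sketch with names assumed for illustration:

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens; the encoder sees only these.

    tokens: (L, D). Returns (visible_tokens, keep_idx, mask) where mask[i] = 1
    marks masked positions. At mask_ratio=0.75 (MAE's recipe), only a quarter
    of the patches enter the encoder, which is the source of the speedup.
    """
    L = tokens.shape[0]
    n_keep = int(L * (1 - mask_ratio))
    order = rng.permutation(L)              # uniform random shuffle of positions
    keep_idx = np.sort(order[:n_keep])      # indices of visible patches
    mask = np.ones(L)
    mask[keep_idx] = 0.0                    # 0 = visible, 1 = masked
    return tokens[keep_idx], keep_idx, mask
```

The lightweight decoder later receives the encoded visible tokens plus shared mask tokens at the masked positions and reconstructs the missing pixels, with the loss computed only on masked patches.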
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
Flamingo introduces a novel architecture that bridges frozen vision and language models via cross-attention and a Perceiver resampler, enabling robust in-context few-shot learning across diverse multimodal tasks. The work fundamentally shifts the paradigm for vision-language modeling by demonstrating that large-scale, arbitrarily interleaved web data can be leveraged to endow frozen foundation models with strong few-shot adaptation capabilities without task-specific fine-tuning. Its architectural design establishes the blueprint for nearly all subsequent multimodal large language models, effectively bridging the gap between discriminative vision encoders and generative language decoders. By proving that in-context learning generalizes to vision, it sets a new standard for open-ended multimodal reasoning and catalyzes widespread adoption across both academic and industrial research, positioning itself as a cornerstone alongside foundational works like CLIP and later derivatives such as LLaVA.
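The gated cross-attention bridge can be sketched as follows. This is a toy NumPy version under stated assumptions (single head, illustrative dimensions, a tanh gate initialized at zero), not Flamingo's actual layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, visual_tokens, Wq, Wk, Wv, alpha):
    """Text hidden states attend to visual tokens; a tanh gate initialized
    at zero leaves the frozen language model's stream untouched at first."""
    q = text_h @ Wq                        # (T, d)
    k = visual_tokens @ Wk                 # (V, d)
    v = visual_tokens @ Wv                 # (V, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return text_h + np.tanh(alpha) * (attn @ v)

rng = np.random.default_rng(0)
d = 8
text_h = rng.normal(size=(5, d))           # 5 text tokens
visual = rng.normal(size=(3, d))           # 3 resampled visual tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = gated_cross_attention(text_h, visual, Wq, Wk, Wv, alpha=0.0)
print(np.allclose(out, text_h))            # True: zero gate = identity at init
```

The zero-initialized gate is the detail that lets new cross-attention layers be inserted into a frozen, pretrained LM without disrupting it.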
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows us, for the first time, to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.
The paper introduces latent diffusion models, which shift the computationally intensive denoising process from high-dimensional pixel space to a compressed, semantically rich latent representation learned by a pretrained autoencoder, while incorporating cross-attention layers for flexible multimodal conditioning. This architectural pivot fundamentally resolves the scalability bottleneck that previously restricted diffusion models to low resolutions or prohibitive compute budgets, effectively bridging the gap between theoretical generative quality and practical deployment. By decoupling perceptual compression from iterative denoising, the approach preserves fine-grained visual details while drastically reducing training and inference costs, establishing a new standard for conditional image synthesis. The framework’s modularity and efficiency directly catalyzed the open-source generative AI ecosystem, serving as the foundational blueprint for subsequent high-resolution image, video, and 3D generation pipelines, and permanently shifting the field away from adversarial training paradigms toward likelihood-based latent diffusion architectures.
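The core move, diffusing in a compressed latent rather than pixel space, can be sketched with a stand-in average-pooling "encoder" (the real model uses a learned autoencoder) and the standard forward-noising formula:

```python
import numpy as np

def encode(image, factor=8):
    """Stand-in for the pretrained autoencoder: 8x spatial downsampling."""
    h, w, c = image.shape
    return image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def q_sample(z0, t, alpha_bar, rng):
    """Forward diffusion in latent space:
    z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

rng = np.random.default_rng(0)
image = rng.normal(size=(256, 256, 3))
z0 = encode(image)                          # (32, 32, 3): 64x fewer spatial positions
betas = np.linspace(1e-4, 0.02, 1000)       # a common linear noise schedule
alpha_bar = np.cumprod(1 - betas)
zt, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)
print(z0.shape, zt.shape)
```

Every denoising step now runs on a grid 64x smaller than the pixel grid, which is the entire source of the compute savings.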
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
The paper introduces a hierarchical vision transformer architecture that leverages shifted local windows to achieve linear computational complexity while enabling cross-window information flow, establishing a highly efficient and scalable backbone for diverse vision tasks. This work successfully bridges the gap between the global receptive field of early vision transformers and the multi-scale, localized inductive biases of convolutional networks. By restricting self-attention to non-overlapping windows and systematically shifting them across layers, the architecture dramatically reduces quadratic complexity without sacrificing representational power, making it practical for high-resolution dense prediction tasks like object detection and semantic segmentation. Its design principles have profoundly influenced subsequent vision model development, serving as a foundational reference for efficient transformer design and inspiring a wave of hybrid and pure transformer backbones that dominate modern computer vision pipelines. The combination of theoretical elegance, empirical dominance across multiple benchmarks, and widespread adoption in both academic and industrial settings firmly places it among the most impactful architectural contributions of its era.
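The window partition and cyclic shift are easy to illustrate in NumPy; window size and feature-map shape here are arbitrary, and the sketch omits attention itself:

```python
import numpy as np

def window_partition(x, window=4):
    """Split an (H, W, C) feature map into non-overlapping window x window blocks."""
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, C)

def shift(x, window=4):
    """Cyclic shift by window//2 so the next layer's windows straddle
    the previous layer's window borders, mixing information across them."""
    return np.roll(x, shift=(-(window // 2), -(window // 2)), axis=(0, 1))

x = np.arange(8 * 8 * 1, dtype=float).reshape(8, 8, 1)
wins = window_partition(x, window=4)              # 2x2 grid -> 4 windows of 4x4
shifted_wins = window_partition(shift(x), window=4)
print(wins.shape)                                  # (4, 4, 4, 1)
```

Self-attention is computed only within each 4x4 window, so the cost grows linearly with the number of windows (and hence with image size) rather than quadratically with the number of pixels.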
Established depth as a key factor in CNNs
The paper demonstrates that stacking small 3x3 convolutional filters to increase network depth significantly improves feature representation and recognition accuracy on large-scale datasets. This work fundamentally shifted architectural design philosophy in computer vision by proving that depth, rather than complex kernel sizes or branching structures, is the primary driver of representational capacity in convolutional networks. Its elegant simplicity established a standardized backbone that became the de facto foundation for downstream tasks including object detection, semantic segmentation, and transfer learning for nearly a decade. While conceptually straightforward, the systematic empirical validation of depth scaling provided practitioners with a reliable, reproducible blueprint that accelerated progress across the entire vision community. The architecture’s enduring legacy as a feature extractor and its direct influence on subsequent innovations like ResNet cement its status as a cornerstone of modern deep learning.
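The depth-over-kernel-size argument can be made concrete: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 layer, with fewer weights and an extra nonlinearity in between. A small arithmetic check (the channel count is illustrative):

```python
def conv_params(kernel, channels):
    """Weights in a conv layer mapping `channels` -> `channels` feature maps."""
    return kernel * kernel * channels * channels

def receptive_field(num_layers, kernel=3):
    """Effective receptive field of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

C = 256
two_3x3 = 2 * conv_params(3, C)   # 1,179,648 weights
one_5x5 = conv_params(5, C)       # 1,638,400 weights
print(receptive_field(2), two_3x3 < one_5x5)   # 5 True
```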
Kirillov et al.; Meta; promptable segmentation; billion-mask dataset
Introduces a promptable foundation model for image segmentation trained on a billion-mask dataset via an automated data engine, shifting the field from task-specific models to zero-shot, interactive segmentation. This work successfully translates the foundation model paradigm to dense prediction tasks, demonstrating that systematic data scaling combined with a flexible conditioning interface can yield robust generalization across highly diverse visual domains. While the underlying architecture relies on established transformer backbones and lightweight mask decoders, the core intellectual contribution lies in the iterative data engine that bootstraps high-quality mask generation and the unified prompt framework that seamlessly integrates points, bounding boxes, and text as inference signals. The release of the massive dataset and baseline has fundamentally altered the segmentation landscape, spawning a vast ecosystem of specialized derivatives, accelerating research in medical and robotic vision, and establishing a new standard for interactive and open-vocabulary dense prediction. Though it does not propose a radically new network architecture, its methodological blueprint for data curation and prompt-driven inference has redefined practitioner workflows and benchmarking standards, cementing its status as a highly influential milestone in modern computer vision.
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4 generated visual instruction tuning data, our model, and code base publicly available.
Introduces a streamlined vision-language instruction tuning pipeline that leverages GPT-4-synthesized multimodal data to align frozen vision encoders with large language models using only a simple projection layer. The work’s significance stems from its elegant reduction of a highly complex alignment problem into a highly reproducible, data-centric recipe. By demonstrating that architectural complexity can be effectively traded for high-quality synthetic instruction data, the paper democratized multimodal model development and shifted community focus toward data curation and instruction design. While the underlying components rely on established vision encoders and language models, the synthesis provides a clear, scalable blueprint that rapidly became the de facto starting point for open-source vision-language research. Compared to earlier proprietary or heavily engineered multimodal systems, this approach proves that competitive reasoning and conversational capabilities can be achieved with minimal architectural overhead, establishing a practical standard that continues to shape how researchers approach open multimodal alignment.
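The "simple projection layer" amounts to one learned matrix mapping frozen vision-encoder features into the LLM's token-embedding space. A toy NumPy sketch; the dimensions are illustrative (roughly CLIP ViT-L features into a 7B-scale LM), not LLaVA's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
vision_dim, lm_dim = 1024, 4096                      # illustrative sizes
W = rng.normal(size=(vision_dim, lm_dim)) * 0.02     # the only trainable bridge

image_patches = rng.normal(size=(256, vision_dim))   # features from a frozen encoder
visual_tokens = image_patches @ W                    # now live in the LM embedding space
text_tokens = rng.normal(size=(32, lm_dim))          # ordinary text embeddings
lm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(lm_input.shape)   # (288, 4096): images become a prefix of "soft tokens"
```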
Redmon & Farhadi; real-time detection; widely deployed
Introduces multi-scale feature prediction and a residual-based backbone to significantly improve real-time object detection accuracy while maintaining high inference speeds. The work deliberately embraces an iterative, engineering-driven design philosophy rather than proposing a fundamentally new paradigm, yet its architectural refinements—particularly the hierarchical feature aggregation and independent logistic classifiers for multi-label prediction—established a highly practical standard for deployed vision systems. While later anchor-free approaches and transformer-based detectors have since pushed the accuracy frontier, this architecture successfully bridged the critical gap between research-grade precision and industrial deployment constraints, offering a robust, easily reproducible baseline that shaped how practitioners balance latency and performance in production environments.
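The independent logistic classifiers mentioned above replace a softmax over classes with per-class sigmoids, so overlapping labels (e.g. "woman" and "person") can both score high for one box. A small NumPy comparison with made-up logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 3.5, -2.0])   # e.g. "woman", "person", "car"
p_soft = softmax(logits)              # probabilities compete and sum to 1
p_sig = sigmoid(logits)               # independent scores: both labels can fire
print(p_soft.sum())                   # 1.0
print(p_sig[0] > 0.9 and p_sig[1] > 0.9)   # True
```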
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models.
SDXL advances latent diffusion models through strategic architectural scaling, dual-text-encoder conditioning, multi-aspect-ratio training, and a dedicated refinement pipeline, establishing a highly capable open-source baseline for high-resolution image synthesis. While the work represents a significant engineering milestone that democratized high-fidelity generation and heavily influenced subsequent open-model development, its contributions are primarily incremental optimizations and scaling of existing diffusion frameworks rather than novel algorithmic or theoretical breakthroughs, placing it firmly in the strong, widely-adopted tier rather than the paradigm-shifting category.
Liu et al.; CLIP + LLM with simple MLP projection; strong VQA baseline
The paper establishes a highly effective, open-source vision-language baseline by refining data curation, training recipes, and architectural choices for connecting frozen vision encoders to large language models. While conceptually incremental over prior instruction-tuning approaches like Flamingo and BLIP-2, its rigorous empirical validation and accessible design have made it the de facto standard for open multimodal research, enabling widespread reproducibility and rapid iteration across the community. The work demonstrates that careful dataset construction and straightforward projection mechanisms can rival more complex proprietary systems, effectively lowering the barrier to entry for VLM development. However, it does not introduce fundamentally new architectural paradigms or theoretical breakthroughs, positioning it as a crucial engineering milestone rather than a conceptual leap. Its lasting significance lies in democratizing high-performance multimodal capabilities and providing a robust, transparent reference point that subsequent methodological advances consistently benchmark against.
Kicked off the deep learning era; ImageNet competition winner
Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark. However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
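The abstract's claim that the optimal discriminator equals 1/2 everywhere follows from the closed form D*(x) = p_data(x) / (p_data(x) + p_g(x)); a quick numeric check for the case where G has matched the data distribution:

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)): the discriminator that
    maximizes the minimax value function for a fixed generator density."""
    return p_data / (p_data + p_g)

x = np.linspace(-4, 4, 201)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)         # data density: N(0, 1)
print(np.allclose(optimal_discriminator(p, p), 0.5))  # True: G matched data -> D* = 1/2
```

At that fixed point the discriminator can do no better than chance, which is exactly the equilibrium the minimax game drives toward.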
Brown et al.; 175B params; in-context learning; paradigm shift
Demonstrates that scaling autoregressive language models to unprecedented parameter counts unlocks robust in-context learning capabilities, fundamentally shifting the paradigm from task-specific fine-tuning to prompt-based adaptation. This work represents a watershed moment in artificial intelligence, establishing scaling laws as a primary driver of capability emergence and proving that massive compute paired with simple next-token prediction yields broad generalization across diverse tasks. By introducing few-shot and zero-shot evaluation as standard practice, it effectively retired the era of bespoke fine-tuning for most natural language benchmarks and laid the architectural and methodological groundwork for the entire modern large language model ecosystem. While the underlying transformer architecture was not novel, the systematic exploration of scale as a catalyst for emergent reasoning and instruction-following abilities constitutes a profound conceptual leap. Its influence extends far beyond natural language processing, reshaping research trajectories in multimodal learning, code synthesis, and AI alignment, firmly placing it among the most consequential publications in the history of machine learning.
Bahdanau attention — the precursor to Transformer
Introduces a soft, differentiable alignment mechanism that allows sequence-to-sequence models to dynamically focus on relevant parts of the input, fundamentally overcoming the fixed-context bottleneck of early encoder-decoder architectures. This work represents a foundational paradigm shift in how neural networks process sequential data, replacing rigid information compression with a learnable, context-aware retrieval process. By enabling models to maintain long-range dependencies and align source-target representations without explicit supervision, it directly catalyzed the transition from statistical to neural machine translation and established a new empirical ceiling for cross-lingual generation. The architectural insight proved so generalizable that it became the conceptual blueprint for self-attention, ultimately underpinning the Transformer family and the entire modern era of large language models. Its methodological elegance and empirical robustness established a new standard for sequence modeling, making it a cornerstone reference that continues to inform research across multilingual understanding, code synthesis, and complex reasoning pipelines.
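The soft alignment step reduces to a softmax over per-position scores followed by a weighted sum. This sketch uses a bilinear score for brevity where Bahdanau et al. use a small additive MLP; dimensions are illustrative:

```python
import numpy as np

def attend(decoder_state, encoder_states, W):
    """weights = softmax(score per source position);
    context = convex combination of encoder states."""
    scores = encoder_states @ W @ decoder_state   # one scalar per source position
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    context = weights @ encoder_states            # dynamically focused summary
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))    # 6 source positions, hidden size 4
dec = rng.normal(size=4)         # current decoder state
W = rng.normal(size=(4, 4))
context, weights = attend(dec, enc, W)
print(weights.sum(), context.shape)   # weights sum to 1; context has hidden size 4
```

Because the context vector is recomputed at every decoding step, the model no longer has to squeeze the whole source sentence into one fixed vector.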
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Introduces a deeply bidirectional transformer pre-training framework using masked language modeling that establishes a unified, fine-tuning-based paradigm for natural language understanding. The work fundamentally shifts the research trajectory away from designing highly specialized, task-specific architectures toward a single, heavily pre-trained representation model that adapts to diverse downstream applications with minimal architectural overhead. By successfully training deep bidirectional context through a masked token prediction objective, it elegantly circumvents the left-to-right constraints of prior autoregressive approaches while capturing rich syntactic and semantic dependencies across all network layers. This methodological clarity, paired with sweeping empirical gains across question answering, natural language inference, and general language understanding benchmarks, catalyzed an entire ecosystem of derivative architectures, optimization strategies, and transfer learning protocols that now form the backbone of modern NLP. Its conceptual simplicity and empirical dominance redefined how the field approaches representation learning, establishing the pre-train/fine-tune paradigm that directly paved the way for contemporary large language models and instruction-tuned systems.
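The masked-token objective, including BERT's 80/10/10 corruption split, can be sketched as follows (the [MASK] id, vocabulary size, and sentence length are illustrative):

```python
import numpy as np

MASK, VOCAB = 103, 30522   # illustrative [MASK] token id and vocabulary size

def mlm_corrupt(tokens, rng, mask_prob=0.15):
    """Select ~15% of positions; replace 80% of those with [MASK], 10% with
    a random token, and leave 10% unchanged. Targets hold the original ids."""
    tokens = tokens.copy()
    targets = np.full_like(tokens, -100)          # -100 = position not predicted
    selected = rng.random(len(tokens)) < mask_prob
    targets[selected] = tokens[selected]
    roll = rng.random(len(tokens))
    tokens[selected & (roll < 0.8)] = MASK
    rand = selected & (roll >= 0.8) & (roll < 0.9)
    tokens[rand] = rng.integers(0, VOCAB, size=int(rand.sum()))
    return tokens, targets

rng = np.random.default_rng(0)
tokens = rng.integers(1000, 2000, size=512)
corrupted, targets = mlm_corrupt(tokens, rng)
print(int((targets != -100).sum()))   # roughly 0.15 * 512, i.e. around 77 positions
```

Predicting the original ids at the selected positions is what forces the model to condition on both left and right context simultaneously.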
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
This work establishes a scalable post-training pipeline that aligns large language models with human intent by combining supervised fine-tuning on demonstrations with reinforcement learning from human feedback. The paper demonstrates that targeted alignment can dramatically improve instruction-following, truthfulness, and safety, enabling a significantly smaller model to outperform a much larger base model in human evaluations. While building on earlier reinforcement learning from human feedback concepts, the methodological synthesis, rigorous empirical validation at scale, and clear demonstration of the alignment-scaling trade-off fundamentally shifted the paradigm for language model development. Its influence is now ubiquitous across the modern LLM ecosystem, where alignment techniques have become an indispensable standard for model deployment, effectively redefining how the field approaches capability optimization, safety constraints, and user-centric behavior.
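The reward-modeling stage of the pipeline trains on ranked output pairs with a Bradley-Terry style loss; a minimal scalar sketch of that objective, not OpenAI's implementation:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it drives the reward model to score preferred outputs higher."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Correct ordering is cheap; reversed ordering is expensive
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))   # True
print(round(float(preference_loss(0.0, 0.0)), 4))              # 0.6931 = log 2 (a tie)
```

The trained reward model then supplies the scalar signal that the RL stage optimizes against.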
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
This work introduces the encoder-decoder LSTM architecture for sequence-to-sequence mapping, establishing the foundational paradigm that replaced statistical machine translation and ultimately enabled modern generative language models. The paper's significance lies in its elegant formulation of mapping variable-length sequences to a fixed-dimensional latent representation and decoding them autoregressively, a conceptual leap that decoupled input and output sequence lengths. The authors' empirical discovery that reversing source sentences dramatically improves optimization by shortening gradient paths provided a crucial insight into training dynamics for recurrent architectures. While subsequent innovations like attention mechanisms and Transformer architectures have superseded the specific LSTM implementation, the core sequence-to-sequence framework remains the architectural blueprint for virtually all contemporary neural translation, summarization, dialogue, and code generation systems. Its methodological clarity, empirical rigor, and immediate demonstration of competitive performance against established statistical baselines catalyzed a rapid, field-wide transition toward end-to-end neural approaches, cementing its status as a defining milestone in computational linguistics and deep learning.
Mikolov et al.; standard word embeddings for years
Introduced computationally efficient neural architectures (Skip-gram and CBOW) with negative sampling and hierarchical softmax to learn high-quality distributed word and phrase representations at scale. While modern contextualized models have superseded static embeddings, this work fundamentally shifted NLP from sparse, handcrafted features to dense, learned representations, establishing the empirical and algorithmic blueprint for representation learning. Its efficiency innovations solved critical scaling bottlenecks, enabling widespread adoption across academia and industry, and directly paved the way for the transfer learning paradigm that underpins contemporary LLMs. The paper’s enduring value lies in its elegant simplicity, rigorous empirical validation, and role as the foundational catalyst that redefined how the field approaches linguistic representation.
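The negative-sampling objective for a single (center, context) pair can be written directly: pull the true context word toward the center word and push k sampled negatives away. Vector dimension and negative count below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_loss(v_center, u_context, u_negatives):
    """Skip-gram with negative sampling:
    -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)."""
    pos = -np.log(sigmoid(u_context @ v_center))
    neg = -np.log(sigmoid(-u_negatives @ v_center)).sum()
    return pos + neg

rng = np.random.default_rng(0)
dim, k = 100, 5
v_c = rng.normal(scale=0.1, size=dim)          # center-word vector
u_o = rng.normal(scale=0.1, size=dim)          # true context-word vector
u_neg = rng.normal(scale=0.1, size=(k, dim))   # k sampled negative vectors
print(float(sgns_loss(v_c, u_o, u_neg)) > 0)   # True: each term is non-negative
```

Replacing a full-vocabulary softmax with k sigmoid terms is the efficiency trick that made training on billions of tokens feasible.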
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
The paper demonstrates that explicitly prompting large language models to generate intermediate reasoning steps dramatically enhances their performance on complex multi-step tasks. This work fundamentally shifted the paradigm of how practitioners interact with generative models, revealing that sophisticated reasoning capabilities are emergent properties of scale that can be reliably unlocked through simple demonstration-based prompting rather than architectural modifications or extensive parameter updates. By establishing a straightforward yet highly effective methodology, it catalyzed an entire subfield of reasoning-focused prompting techniques, including self-consistency, tree-based search, and tool-augmented generation. The simplicity of the approach, combined with its profound empirical gains across arithmetic, commonsense, and symbolic domains, makes it a cornerstone reference for modern model evaluation and deployment. While subsequent research has refined and extended the core idea, the original insight remains foundational to how the field conceptualizes and elicits complex cognitive behaviors from neural language models, effectively bridging the gap between raw pattern matching and structured problem solving.
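Chain-of-thought prompting is purely a prompt-construction technique, so it can be shown as string assembly. The first exemplar below is the paper's well-known tennis-ball problem; the final question is made up for illustration:

```python
def cot_prompt(exemplars, question):
    """Few-shot chain-of-thought prompt: each exemplar shows worked reasoning
    before the answer, so the model imitates the step-by-step format."""
    parts = [f"Q: {q}\nA: {reasoning} The answer is {answer}."
             for q, reasoning, answer in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
     "5 + 6 = 11.",
     "11"),
]
prompt = cot_prompt(exemplars, "A jug holds 4 liters. How many liters do 3 jugs hold?")
print(prompt.endswith("A:"))   # True: the model continues with its own reasoning
```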
Stiennon et al.; OpenAI; early RLHF demonstration on summarization
This work establishes the modern reinforcement learning from human feedback pipeline, demonstrating that training a reward model on human preferences and optimizing a language model via proximal policy optimization yields substantially more aligned and higher-quality summaries than supervised baselines. The paper successfully operationalizes preference learning for open-ended text generation, moving beyond simple ranking or imitation learning to a scalable alignment framework. By decoupling reward modeling from policy optimization, it provides a practical blueprint that directly catalyzed the development of instruction-tuned and conversational models. While the underlying components draw from prior work in reinforcement learning and preference aggregation, the synthesis and empirical validation on a complex, subjective task represent a pivotal methodological advance. Its influence extends far beyond the immediate application, fundamentally reshaping how the field approaches model alignment, safety, and human-centric evaluation, effectively setting the standard for subsequent large-scale language model training paradigms.
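The policy-optimization stage relies on PPO's clipped surrogate objective, which caps how far a single update can move the policy away from the one that generated the samples. A scalar sketch of the clipped term:

```python
import numpy as np

def ppo_clip(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A),
    with r = pi_new(a|s) / pi_old(a|s)."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

# Doubling an action's probability gains nothing beyond the clip
# when the advantage is positive: the objective saturates at 1.2, not 2.0.
print(float(ppo_clip(np.log(2.0), 0.0, advantage=1.0)))   # 1.2
```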
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
The paper introduces a unified framework that seamlessly integrates dense retrieval with autoregressive generation, enabling models to dynamically access external knowledge while maintaining end-to-end differentiability. This work addresses a fundamental limitation of purely parametric language models by formalizing a principled approach to combining static weights with dynamic, non-parametric memory. The key technical insight lies in marginalizing over retrieved documents during training and inference, which allows the generator to learn from multiple evidence sources without committing to a single retrieved passage. By demonstrating strong performance across open-domain question answering and factual generation, the paper establishes a robust blueprint for mitigating hallucination, enabling continuous knowledge updates, and providing decision provenance. The methodology has fundamentally reshaped how the community approaches knowledge-intensive generation, serving as the architectural foundation for modern retrieval-augmented systems and enterprise AI pipelines. Its elegant formulation bridges information retrieval and neural generation in a way that is both theoretically sound and highly practical, warranting recognition as a major methodological advance with enduring field-wide influence.
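The marginalization at the heart of the method can be shown with toy numbers: for the RAG-Sequence formulation, the output probability is a retrieval-weighted mixture over the top-k retrieved documents, p(y|x) = Σ_z p(z|x) · p(y|x, z). The probabilities below are illustrative stand-ins, not model outputs.

```python
# Toy sketch of RAG-Sequence marginalization over retrieved documents.

def marginalize(doc_posteriors, gen_likelihoods):
    """doc_posteriors: retriever scores p(z|x), summing to 1 over the top-k docs.
    gen_likelihoods: generator probabilities p(y|x, z), one per document."""
    assert abs(sum(doc_posteriors) - 1.0) < 1e-9, "retrieval distribution must sum to 1"
    return sum(p_z * p_y for p_z, p_y in zip(doc_posteriors, gen_likelihoods))

# Three retrieved passages: the generator is confident only when conditioned
# on the first (most relevant) one, yet all three contribute to p(y|x).
p_answer = marginalize([0.7, 0.2, 0.1], [0.9, 0.4, 0.05])
```

Because the sum is differentiable, gradients flow to both the generator and (through the document scores) the retriever, which is what makes the pipeline trainable end to end.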
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
The paper introduces a prompting paradigm that interleaves verbal reasoning traces with executable actions, enabling LLMs to dynamically plan, ground their thoughts in external environments, and self-correct. This work represents a pivotal shift in how we conceptualize LLM capabilities, moving beyond static text generation toward interactive, agentic problem-solving. By demonstrating that reasoning and acting are mutually reinforcing rather than competing objectives, it established a foundational blueprint for modern LLM agents and tool-use frameworks. The approach significantly mitigates hallucination and error propagation while offering unprecedented interpretability, directly influencing a wave of subsequent research and production systems in autonomous reasoning and interactive decision-making. Compared to prior isolated approaches like chain-of-thought or pure reinforcement learning, this synergistic formulation provides a more robust, sample-efficient, and transparent pathway for deploying language models in complex, real-world environments. Its methodological clarity and empirical success across diverse benchmarks have cemented it as a cornerstone reference for the emerging field of agentic AI.
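The interleaved control flow is easy to see in skeleton form: the model alternates Thought → Action → Observation until it emits a Finish action. The "model" and the tool below are stand-in stubs, not the paper's actual prompts or Wikipedia API.

```python
# Skeleton of a ReAct-style agent loop with hypothetical stubs.

def run_react(model, tools, max_steps=5):
    trajectory = []
    for _ in range(max_steps):
        thought, action, arg = model(trajectory)      # model proposes next step
        trajectory.append(("Thought", thought))
        if action == "Finish":
            trajectory.append(("Action", f"Finish[{arg}]"))
            return arg, trajectory
        trajectory.append(("Action", f"{action}[{arg}]"))
        observation = tools[action](arg)              # ground the step externally
        trajectory.append(("Observation", observation))
    return None, trajectory

# Stub model: look something up once, then finish with the observation it saw.
def stub_model(trajectory):
    if not trajectory:
        return "I should look this up.", "Search", "Transformer"
    return "I have the answer.", "Finish", trajectory[-1][1]

answer, trace = run_react(stub_model, {"Search": lambda q: f"{q} is an architecture."})
```

The key property the paper exploits is visible even in the stub: the observation produced by an action feeds back into the next thought, so reasoning is grounded in retrieved evidence rather than generated in isolation.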
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
Introduces a comprehensive, multi-domain benchmark that evaluates language models across fifty-seven academic and professional subjects, establishing a standardized metric for assessing broad knowledge and reasoning capabilities. This work fundamentally shifted how the community evaluates large language models by moving beyond narrow, task-specific metrics toward a unified measure of general academic and professional competence. By curating a diverse suite of subjects ranging from STEM to humanities and law, the benchmark exposed critical limitations in early models, particularly regarding calibration, domain-specific reasoning, and consistent performance across disciplines. Its design directly influenced subsequent evaluation frameworks and became the de facto standard for tracking progress in model scaling, instruction tuning, and reasoning enhancements. The paper’s emphasis on identifying knowledge gaps and miscalibration spurred widespread research into alignment techniques and robust evaluation protocols, cementing its role as a cornerstone reference for both academic research and industry development cycles. Compared to earlier aggregation efforts, this benchmark’s breadth and focus on real-world academic rigor provided a much-needed stress test that continues to drive architectural and training innovations across the field.
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
Introduces a standardized, multi-task evaluation framework and diagnostic suite that fundamentally reshaped how natural language understanding systems are measured, compared, and developed. The work addresses a critical fragmentation problem in NLP by unifying disparate datasets into a single, rigorous benchmark that explicitly measures cross-task generalization rather than isolated task performance. By coupling this with a carefully designed diagnostic evaluation suite, it provides researchers with a principled methodology to dissect model capabilities and failure modes beyond surface-level accuracy. This framework directly catalyzed the shift toward large-scale pre-trained representations, as subsequent breakthroughs in contextualized embeddings were validated and iterated upon using this exact evaluation protocol. The benchmark’s design principles established a new standard for empirical rigor in NLP, influencing the creation of successor benchmarks and fundamentally altering how the community conceptualizes and pursues robust language understanding. While it does not introduce a novel model architecture, its methodological contribution to evaluation infrastructure and its role in accelerating the pre-training paradigm justify its exceptional standing in the field.
Clark et al.; compute-efficient pretraining
ELECTRA introduces a highly compute-efficient pre-training objective that replaces masked language modeling with a replaced-token detection task, training a discriminator to distinguish original tokens from those swapped by a small generator. By providing a training signal for every token in the sequence rather than only masked positions, this approach dramatically improves sample efficiency and enables smaller models to match or exceed the performance of heavily compute-intensive baselines like BERT and RoBERTa. The generator-discriminator formulation elegantly bridges generative and discriminative self-supervision, establishing a widely adopted paradigm that fundamentally shifted how researchers approach encoder pre-training and continues to influence efficient representation learning pipelines across the field.
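The per-token training signal can be illustrated with the discriminator's targets: after a small generator re-samples some positions, every token is labeled "original" (0) or "replaced" (1). Tokens here are plain strings for illustration.

```python
# Toy sketch of ELECTRA's replaced-token-detection targets.

def detection_labels(original, corrupted):
    """One binary label per token -- a learning signal at every position,
    unlike MLM, which supervises only the ~15% of positions that were masked."""
    return [0 if o == c else 1 for o, c in zip(original, corrupted)]

orig = ["the", "chef", "cooked", "the", "meal"]
corr = ["the", "chef", "ate", "the", "meal"]   # generator swapped "cooked" -> "ate"
labels = detection_labels(orig, corr)
```

Note that the replacement is a plausible real token rather than a [MASK] symbol, which also removes the pretrain/finetune input mismatch that masked language modeling introduces.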
Liu et al.; showed LLMs ignore middle of context; important limitation study
This work empirically demonstrates that language models exhibit a U-shaped performance curve across extended contexts, systematically degrading when critical information is positioned in the middle. By challenging the prevailing assumption of uniform context utilization, the study has fundamentally redirected long-context system design, prompting widespread architectural shifts in retrieval-augmented generation pipelines and establishing new standards for context-window evaluation that rival foundational architectural papers in practical influence.
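The evaluation design behind the U-shaped finding can be sketched as a position sweep: build the same multi-document prompt with the gold document placed at every index, then compare accuracy per position. The prompt format below is a simplified stand-in, and scoring against a real model is left out.

```python
# Sketch of a gold-position sweep for probing context utilization.

def build_prompts(gold_doc, distractor_docs, question):
    """Yield (gold_position, prompt) for every slot in the context window."""
    n_slots = len(distractor_docs) + 1
    for pos in range(n_slots):
        docs = distractor_docs[:pos] + [gold_doc] + distractor_docs[pos:]
        context = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
        yield pos, f"{context}\n\nQuestion: {question}\nAnswer:"

prompts = list(build_prompts("Paris is the capital of France.",
                             ["Doc A.", "Doc B.", "Doc C."],
                             "What is the capital of France?"))
```

Plotting accuracy against `gold_position` is what exposes the U-shape: the beginning and end of the context outperform the middle.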
Raffel et al.; text-to-text framing for NLP
The paper introduces a unified text-to-text framework that reformulates all natural language processing tasks as sequence generation, accompanied by a comprehensive scaling study and the C4 pretraining corpus. This work fundamentally shifted the NLP landscape by demonstrating that a single architecture and training objective could achieve state-of-the-art performance across diverse benchmarks, effectively bridging the gap between specialized models and general-purpose language models. Its systematic exploration of transfer learning limits, combined with the open release of model variants and datasets, established a rigorous empirical standard for subsequent research. The text-to-text paradigm directly paved the way for modern instruction-tuning and chat-based architectures, cementing its status as a cornerstone in the evolution of large language models.
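The reformulation itself is almost disarmingly simple: every task becomes "prefix: input" mapped to a target string, so one seq2seq model and one maximum-likelihood objective cover translation, summarization, and classification alike. The prefixes below follow the style of the paper's examples.

```python
# Sketch of the text-to-text task framing.

def to_text_to_text(task_prefix, source_text):
    """Serialize any task instance as a single input string for a seq2seq model."""
    return f"{task_prefix}: {source_text}"

examples = [
    to_text_to_text("translate English to German", "That is good."),
    to_text_to_text("summarize", "state authorities dispatched emergency crews ..."),
    to_text_to_text("cola sentence", "The course is jumping well."),
]
```

Even classification fits this mold: the model is trained to emit the label as literal text (e.g. "acceptable"), which is the property that later made instruction-style interfaces feel like a natural extension.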
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
This work demonstrates that systematic optimization of pretraining dynamics, rather than architectural modifications, is the primary driver of performance gains in masked language models. By rigorously ablating the original training recipe, the authors reveal that the baseline model was substantially undertrained and that the auxiliary next-sentence prediction objective actively degrades representation quality. The introduction of dynamic masking, larger batch sizes, extended training schedules, and carefully scaled data established a new methodological standard that fundamentally shifted the field’s focus from architecture hunting to training recipe engineering. These empirical insights directly informed the development of subsequent foundation models and remain essential best practices for modern pretraining pipelines, cementing the paper’s status as a cornerstone of contemporary natural language processing methodology.
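One of those recipe changes, dynamic masking, is simple to sketch: rather than fixing mask positions once at preprocessing time, fresh positions are sampled every time a sequence is served, so repeated epochs never see the same pattern. This is a simplified illustration (real MLM also mixes in random-token and keep-original replacements, omitted here).

```python
import random

# Sketch of dynamic masking: a new mask pattern per serving of the sequence.

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # position the model must reconstruct
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = ["the", "model", "was", "significantly", "undertrained"] * 20
masked, targets = dynamic_mask(tokens, rng=random.Random(0))
```

Calling `dynamic_mask` again with a different random state yields a different pattern over the same sequence, which is the whole point: the model never memorizes a single fixed corruption of the training data.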
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
Introduces Codex, a code-specialized large language model, alongside the HumanEval benchmark and pass@k evaluation methodology, establishing the foundational framework for modern code generation research. This work fundamentally shifted the trajectory of language model research by demonstrating that domain-specific fine-tuning on massive code corpora unlocks robust program synthesis capabilities. The introduction of HumanEval provided the field with a rigorous, standardized benchmark that remains the primary evaluation standard for code generation models. Furthermore, the systematic analysis of repeated sampling strategies revealed a critical insight into how stochastic decoding can be leveraged to overcome the brittleness of single-pass generation, a methodology now universally adopted across the subfield. While the underlying architecture builds directly on prior autoregressive scaling work, the empirical rigor, comprehensive limitation analysis, and clear demonstration of how code data distribution shapes model reasoning represent a substantial methodological advance. The paper’s influence extends well beyond academic circles, directly catalyzing the development of open and proprietary code models that now form a core pillar of modern AI-assisted software engineering. Its enduring impact on benchmarking practices, evaluation protocols, and the broader code-generation ecosystem firmly places it among the most consequential empirical studies in recent NLP history.
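The pass@k methodology the paper introduces deserves a concrete sketch: naively reporting the fraction of problems solved in k samples is a biased, high-variance estimate, so the paper instead draws n ≥ k samples, counts the c that pass the unit tests, and computes 1 − C(n−c, k)/C(n, k) in a numerically stable product form.

```python
# The unbiased pass@k estimator for a single problem, following the paper's
# formulation: n samples generated, c of which pass the unit tests.

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn (without replacement)
    from the n generated samples passes the tests."""
    if n - c < k:
        return 1.0   # too few failing samples to fill all k slots
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i   # stable form of C(n-c, k) / C(n, k)
    return 1.0 - prob_all_fail
```

Averaging `pass_at_k` over all benchmark problems gives the headline metric; the product form avoids the overflow that direct binomial coefficients would hit at realistic n.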
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
The paper introduces instruction tuning: fine-tuning large language models on a diverse mixture of tasks described via natural language instructions dramatically enhances their zero-shot generalization to unseen tasks. This work fundamentally shifted the paradigm for adapting large language models, showing that multi-task fine-tuning on natural language instructions serves as an effective meta-learning mechanism for zero-shot generalization. Prior to this, prevailing scaling laws suggested that larger models inherently improved few-shot capabilities through in-context learning, but this study revealed that targeted instruction fine-tuning could unlock robust zero-shot performance that rivals or exceeds much larger few-shot baselines. The methodology established a new standard for model alignment and capability enhancement, directly paving the way for the instruction-following architectures that dominate contemporary natural language processing. By systematically analyzing dataset diversity, model scale, and instruction formatting, the authors provided a clear, reproducible blueprint that has been universally adopted across academia and industry. Its influence extends far beyond the original experiments, forming the conceptual foundation for modern alignment pipelines, open-weight instruction-tuned models, and subsequent reinforcement learning from human feedback frameworks. Given its role in redefining how practitioners bridge pre-training and downstream application, it represents a major methodological advance with enduring field-wide impact.
Wang et al.; bootstrapped instruction data; enabled Alpaca
The paper introduces a self-bootstrapping framework where a pretrained language model generates, filters, and refines its own instruction-tuning data, effectively bypassing the need for large-scale human annotation. This approach fundamentally shifted the paradigm for aligning open-weight models, demonstrating that synthetic data curation could rival human-written datasets for instruction following. While the underlying mechanism is conceptually straightforward, its practical efficacy catalyzed a major transition in the field toward automated data generation pipelines, establishing a new standard for how researchers approach model alignment without proprietary resources. The work directly enabled the rapid proliferation of community-driven instruction-tuned systems and reshaped the broader ecosystem of open-source language model development, proving that scalable, high-quality alignment data could be synthesized algorithmically rather than manually curated.
Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
The paper introduces a self-supervised paradigm where language models autonomously learn to invoke external APIs by evaluating whether tool outputs improve next-token prediction, requiring only minimal demonstrations. This elegantly bridges the gap between parametric knowledge and external computation, circumventing the need for costly reinforcement learning or heavily curated supervised datasets. By demonstrating that models can self-annotate their own tool-use trajectories and seamlessly integrate API calls into the generation stream, the work catalyzed a major shift toward tool-augmented and agentic language models. Its methodological simplicity and empirical effectiveness established a new baseline for external tool integration, directly influencing subsequent research in LLM agents, self-correction, and hybrid reasoning systems. While not a foundational architecture shift on the scale of the Transformer, it fundamentally altered the practical deployment and capability expansion of large language models, securing its place as a highly influential contribution to modern NLP.
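The self-annotation step reduces to a loss comparison, sketched below in simplified form: a candidate API call is kept only if conditioning on the call and its result lowers the model's loss on the following tokens by at least a threshold. The losses and the threshold value here are placeholder numbers standing in for per-token negative log-likelihoods.

```python
# Simplified sketch of Toolformer's filtering criterion for candidate API calls.

def keep_api_call(loss_without_call, loss_with_call, threshold=0.1):
    """Keep the annotation only when the tool result measurably helps the model
    predict the continuation (loss drops by at least `threshold`)."""
    return (loss_without_call - loss_with_call) >= threshold

# e.g. annotating "The capital of France is [QA(capital of France) -> Paris] Paris."
kept = keep_api_call(loss_without_call=2.3, loss_with_call=1.1)       # big drop
dropped = keep_api_call(loss_without_call=2.3, loss_with_call=2.28)   # negligible
```

Surviving calls are spliced back into the text and the model is fine-tuned on the result, so tool use is learned from the model's own filtered annotations rather than human labels.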
Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
The paper introduces self-consistency decoding, an inference-time strategy that samples multiple chain-of-thought reasoning paths and selects the most frequent answer to substantially improve LLM performance on complex reasoning tasks. This work stands out for its elegant simplicity and profound practical impact, effectively demonstrating that scaling inference compute through diverse path sampling and majority voting can unlock significant reasoning capabilities without any additional model training or architectural changes. By formalizing the intuition that correct solutions naturally converge across varied reasoning trajectories while incorrect ones diverge, it established a new standard for LLM evaluation and deployment, directly inspiring a wave of research into inference-time search, verification, and compute-optimal scaling. Compared to prior prompting methods that relied on single-path greedy decoding, this approach fundamentally shifted how practitioners approach reasoning benchmarks and real-world LLM applications, cementing itself as a foundational technique in modern prompt engineering and reasoning-focused pipelines.
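The aggregation step is a majority vote: sample several chain-of-thought completions at nonzero temperature, parse each final answer, and return the most frequent one. The sampled answers below are illustrative stand-ins for parsed model outputs.

```python
from collections import Counter

# Sketch of self-consistency aggregation over sampled reasoning paths.

def self_consistent_answer(sampled_answers):
    """Marginalize over reasoning paths by majority vote on final answers;
    differing rationales that reach the same answer reinforce each other."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

votes = ["18", "18", "26", "18", "9"]   # final answers parsed from 5 sampled paths
final = self_consistent_answer(votes)
```

Only the final answers are compared, never the rationales themselves, which is what lets many different-but-valid reasoning paths converge on one vote.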
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
This work establishes that scaling dense Transformers to 540 billion parameters, facilitated by the Pathways distributed training infrastructure, unlocks substantial few-shot learning capabilities, emergent multi-step reasoning, and robust multilingual and code generation performance. While the core architecture remains a conventional Transformer, the paper’s rigorous empirical documentation of scaling laws and discontinuous performance jumps fundamentally reshaped how the community approaches model capacity, compute allocation, and capability evaluation. The integration of a novel cross-pod training system addresses critical bottlenecks in distributed ML, and the comprehensive analysis of bias, toxicity, and memorization provides a vital framework for responsible development. Compared to prior scaling studies, this work sets a definitive benchmark for few-shot generalization and emergent behavior, directly influencing subsequent architectural and training paradigms across both academia and industry. Its methodological rigor and systems-level contributions cement its status as a foundational reference in the modern large language model era.
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.
The paper demonstrates that pure reinforcement learning, without supervised fine-tuning on human reasoning traces, can reliably elicit advanced self-reflective reasoning capabilities in LLMs and effectively distill them into smaller architectures. This work marks a pivotal shift in how reasoning capabilities are cultivated in large language models, moving away from data-hungry supervised pipelines toward direct optimization via reinforcement learning. By showing that complex cognitive behaviors like self-verification and strategy adaptation emerge organically from reward signals alone, it challenges prevailing assumptions about the necessity of curated demonstration data. The methodological transparency and open release will likely catalyze widespread adoption, enabling both academic and industry groups to replicate and extend RL-driven reasoning training. Furthermore, the demonstrated distillation pathway addresses a critical bottleneck in deploying capable reasoning models at scale. While building on established RL frameworks, the systematic characterization of emergent reasoning patterns and their transferability represents a substantial methodological advance that redefines the training paradigm for next-generation language models.
Yang et al.; autoregressive BERT alternative
XLNet introduces permutation language modeling with two-stream self-attention to unify autoregressive and autoencoding objectives, enabling bidirectional context capture without the independence assumption inherent in masked language models. The work delivers a rigorous theoretical and empirical bridge between causal and denoising pretraining, demonstrating that factorizing text generation over all possible permutations allows the model to predict each token conditioned on its full context while maintaining a valid autoregressive likelihood. This architectural and objective-level innovation directly addressed fundamental limitations in both BERT and early GPT variants, yielding state-of-the-art results across diverse benchmarks and establishing a new analytical framework for evaluating pretraining objectives. While the broader ecosystem eventually converged on decoder-only architectures due to their superior scaling properties and implementation simplicity, XLNet's permutation-based training, together with its adoption of Transformer-XL's relative positional encodings and segment recurrence, deeply influenced subsequent representation learning research and remains a cornerstone reference for understanding the trade-offs between bidirectional context and generative modeling.
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Introduces a large-scale, community-driven benchmark to systematically evaluate and extrapolate language model capabilities beyond imitation, revealing critical insights into scaling laws, emergent abilities, and calibration. The work establishes a rigorous evaluation framework that fundamentally shifts how the field measures progress, moving beyond narrow task-specific metrics to a comprehensive mapping of model behavior across diverse cognitive and reasoning domains. By pairing extensive model evaluations with expert human baselines, it provides actionable empirical evidence on how capabilities scale, distinguishing between gradual knowledge accumulation and emergent multi-step reasoning. This systematic approach to capability assessment addresses a critical gap in the literature, offering a standardized reference for tracking model limitations, safety concerns, and calibration trends. While it does not introduce a novel architecture or training algorithm, its methodological rigor and broad community adoption make it a cornerstone for future research in model evaluation, scaling analysis, and alignment.
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
This work demonstrates that carefully curated public datasets combined with compute-optimal training can yield foundation models competitive with proprietary counterparts, while openly releasing the weights to catalyze community research. The paper’s primary contribution lies in a rigorous empirical validation of scaling laws and data efficiency rather than architectural invention. By systematically integrating established components like SwiGLU activations, rotary positional embeddings, and RMSNorm, the authors prove that parameter count is secondary to data quality and training compute when optimized correctly. The strategic decision to release weights trained exclusively on publicly available corpora fundamentally shifted the research landscape, enabling widespread fine-tuning, alignment studies, and downstream applications that were previously gated behind proprietary APIs. While the methodological novelty remains modest compared to paradigm-shifting architectures like the original Transformer or instruction-tuning frameworks, the work’s impact on democratizing access to state-of-the-art language modeling and establishing open-weight development as a viable research paradigm is profound. It effectively bridges the gap between closed industrial labs and the broader academic community, setting a new standard for transparency, reproducibility, and data-centric optimization in large-scale NLP.
DeepSeek; 671B MoE; $6M training cost; matched proprietary frontier
Introduces a highly efficient 671B-parameter Mixture-of-Experts architecture with novel attention and prediction mechanisms that achieve frontier performance at a fraction of traditional training costs. The report details architectural and systems-level innovations—most notably Multi-head Latent Attention for KV-cache compression and Multi-Token Prediction for training efficiency—that collectively dismantle the prevailing assumption that frontier model capabilities require prohibitive compute budgets. By demonstrating that state-of-the-art performance can be reached with a transparent, sub-$10M training pipeline, the work provides a practical blueprint for democratizing large-scale language model development. While the core components build upon established MoE and quantization paradigms, their cohesive integration and rigorous empirical validation represent a meaningful methodological advance. The technical report’s open release of training recipes, hyperparameter schedules, and scaling insights will likely serve as a foundational reference for both academic and industry groups pursuing cost-efficient model scaling, cementing its status as a highly influential contribution to the field.
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
This work introduces a structured inference framework that transforms autoregressive language model generation into a deliberate search process over intermediate reasoning steps. By treating coherent text units as nodes in a tree and integrating self-evaluation, lookahead, and backtracking, the framework directly addresses the fundamental limitation of left-to-right token generation in complex reasoning scenarios. The approach elegantly bridges classical planning algorithms with modern prompting paradigms, demonstrating that frozen models can achieve substantial performance gains on tasks requiring multi-step planning, error correction, and creative synthesis without architectural modifications. Its model-agnostic design has rapidly established a new baseline for LLM reasoning research, inspiring a wave of follow-up work that extends the paradigm to graph structures, reinforcement learning, and tool-augmented agents. While it synthesizes existing concepts rather than introducing a completely new learning paradigm, the formalization of deliberate problem-solving as a tractable search problem represents a meaningful conceptual shift that has reshaped how the community approaches reasoning, planning, and inference-time compute scaling in large language models.
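The deliberate search loop described above can be sketched as breadth-first search over partial solutions, with pluggable propose and evaluate steps. In the paper both steps are LLM prompts; here toy functions (a hypothetical digit-sum puzzle, not one of the paper's tasks) stand in for the model calls:

```python
# Minimal breadth-first Tree-of-Thoughts sketch. The propose/score functions
# are hypothetical stand-ins for the LLM prompting used in the actual paper.
from typing import Callable, List, Optional

def tot_bfs(root: str,
            propose: Callable[[str], List[str]],   # expand a state into candidate "thoughts"
            score: Callable[[str], float],          # self-evaluation heuristic
            is_goal: Callable[[str], bool],
            beam_width: int = 3,
            max_depth: int = 4) -> Optional[str]:
    """Deliberate decision making: expand, self-evaluate, keep only the best branches."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [t for s in frontier for t in propose(s)]
        goals = [c for c in candidates if is_goal(c)]
        if goals:
            return max(goals, key=score)
        # Backtracking is implicit: weak branches simply fall out of the beam.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score) if frontier else None

# Toy stand-in for an LLM: build a digit string whose digits sum to 10.
target = 10
propose = lambda s: [s + d for d in "123456789"]
digit_sum = lambda s: sum(int(c) for c in s)
score = lambda s: -abs(target - digit_sum(s))      # closer to the target is better
is_goal = lambda s: digit_sum(s) == target

print(tot_bfs("", propose, score, is_goal))
```

Swapping in value prompts for `score` and a proposal prompt for `propose` recovers the BFS variant of the framework; the paper also explores a depth-first variant with explicit backtracking.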
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Llama 2 establishes a comprehensive, transparent methodology for training and aligning large-scale open-weight language models, demonstrating that carefully curated data, iterative supervised fine-tuning, and reinforcement learning from human feedback can yield chat models competitive with closed-source alternatives. While the underlying architecture relies on established decoder-only transformer designs rather than introducing novel mechanisms, the paper’s true contribution lies in its rigorous documentation of scaling practices, safety red-teaming pipelines, and alignment workflows. By releasing models up to 70B parameters alongside detailed training recipes, it effectively democratized access to state-of-the-art capabilities and catalyzed a massive wave of downstream research, fine-tuning, and ecosystem development. The work shifts the open-source community’s baseline by proving that transparent, safety-conscious training at scale can rival proprietary systems, making it a foundational reference for subsequent open-weight LLM development despite its moderate algorithmic novelty.
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B-Instruct, that surpasses the Llama 2 13B-Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
The paper demonstrates that strategic architectural optimizations, specifically grouped-query and sliding window attention, enable a 7B-parameter model to surpass significantly larger predecessors while maintaining high inference efficiency and open accessibility. While the individual attention mechanisms were previously proposed, the work’s primary contribution lies in their rigorous integration and the empirical demonstration that careful architectural design and training methodology can dramatically compress the performance gap between small and large language models. By releasing the weights under a permissive license, the authors catalyzed a wave of downstream research, fine-tuning, and deployment that fundamentally shifted the open-weight ecosystem away from pure parameter scaling toward efficiency-first design. The model’s strong empirical results across reasoning, code, and instruction-following benchmarks established a new baseline for the 7B tier, making it a highly influential reference point for both academic research and industry deployment. Its impact is amplified by how it redefined practical constraints for long-context handling and inference throughput, though it remains an engineering-focused synthesis rather than a theoretical breakthrough.
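The sliding-window mechanism mentioned above amounts to a banded causal attention mask. A small sketch (toy sequence length and window, not the model's actual configuration):

```python
# Illustrative sliding-window causal mask; shapes are toy values, not
# Mistral's real sequence length or window size.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Position i may attend to positions j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Each query attends to at most `window` keys, so attention cost is
# O(n * window) rather than O(n^2); stacking layers still moves information
# across the whole sequence, since layer L's receptive field spans L * window.
print(mask.astype(int))
```

This is why the abstract can claim sequences "of arbitrary length with a reduced inference cost": the per-layer attention span is bounded even though the effective receptive field grows with depth.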
Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.
The paper demonstrates that purported emergent abilities in scaled models are artifacts of discontinuous evaluation metrics rather than intrinsic phase transitions in model capabilities. This work provides a crucial corrective to the prevailing narrative around scaling laws and emergent behavior, offering both a mathematical framework and extensive empirical validation across language and vision domains. By showing that metric choice dictates the appearance of sharp capability jumps, it fundamentally shifts how researchers should design benchmarks and interpret scaling curves. While it does not introduce a new architecture or training paradigm, its conceptual clarity and methodological rigor make it highly influential for evaluation practices and theoretical discussions on model scaling. The analysis is robust, reproducible, and directly addresses a high-profile debate, ensuring it will serve as a foundational reference for future work on capability assessment and scaling dynamics.
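The core argument reduces to a few lines of arithmetic: hold per-token accuracy to a smooth curve in scale, then compare a linear metric against a nonlinear one. A toy illustration (the power-law constants are made up for the sketch, not fitted to any real model family):

```python
# Toy version of the paper's argument: a smooth per-token accuracy curve
# looks "emergent" under a nonlinear exact-match metric. Constants are
# illustrative, not fitted to real models.
scales = [10 ** e for e in range(2, 9)]                    # hypothetical parameter counts
per_token_acc = [max(1 - 2.0 * s ** -0.15, 0.0) for s in scales]   # smooth improvement
exact_match_k10 = [p ** 10 for p in per_token_acc]          # nonlinear: all 10 tokens right

for s, p, em in zip(scales, per_token_acc, exact_match_k10):
    print(f"{s:>10} params  token-acc={p:.3f}  exact-match={em:.3f}")
# The linear metric rises gradually; exact match sits near zero and then
# shoots up -- an apparent "emergent ability" created by the metric alone.
```

Plotting the two columns makes the contrast stark: the same underlying model behavior yields either a smooth scaling curve or a sharp "breakthrough" depending only on how success is scored.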
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B-Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B-Chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
Demonstrates that a carefully optimized sparse mixture-of-experts architecture can match or exceed 70B dense models while activating only ~13B parameters per token, establishing a new efficiency-performance frontier for open-weight LLMs. While the MoE paradigm predates this work, the paper provides critical methodological refinements in routing stability, load balancing, and large-scale training that make SMoE practically viable at this scale. Its immediate industry adoption and permissive licensing have fundamentally shifted deployment strategies, proving that parameter efficiency can rival dense scaling without sacrificing reasoning, code, or multilingual capabilities, thereby cementing MoE as a standard architectural choice for next-generation language models.
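The top-2 routing described in the abstract can be sketched in a few lines. The dimensions and the linear "experts" are toys (real Mixtral experts are SwiGLU feed-forward blocks over much larger hidden states):

```python
# Minimal top-2 MoE gating sketch; toy dimensions, and plain linear maps
# stand in for the real feed-forward expert blocks.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 8
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]  # toy "FFNs"
W_router = rng.standard_normal((d, n_experts)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ W_router
    top2 = np.argsort(logits)[-2:]             # router picks two experts per token
    w = np.exp(logits[top2])
    w /= w.sum()                               # softmax over the selected pair only
    # Only the two chosen expert FFNs run; the other six stay idle.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top2))

y = moe_layer(rng.standard_normal(d))
print(y.shape)
```

This is the sense in which the token "has access to" all expert parameters while only a fraction are active: which two experts run is recomputed per token and per layer.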
DeepSeek; MLA attention; efficient MoE; competitive open weights
The paper introduces Multi-Head Latent Attention and an optimized Mixture-of-Experts routing strategy that drastically reduce KV cache memory and computational overhead while maintaining competitive language modeling performance. By compressing attention states into low-dimensional latent vectors and implementing a refined expert dispatch mechanism, the architecture directly addresses the memory bandwidth and routing inefficiencies that typically limit MoE scaling. This design provides a highly practical template for training and serving large models under strict compute constraints, distinguishing it from prior dense architectures and earlier MoE implementations that struggled with load balancing and cache bloat. Although the work focuses on architectural engineering rather than foundational theoretical breakthroughs, its immediate adoption across the open-weight ecosystem and clear demonstration of cost-effective scaling cement its status as a highly influential reference for efficient language model development.
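The KV-cache saving comes from caching a shared low-dimensional latent instead of per-head keys and values. A heavily simplified sketch (illustrative shapes and projections only; the actual MLA design also handles rotary position embeddings and multi-head splitting differently):

```python
# Hedged sketch of latent KV compression: cache a small latent, reconstruct
# K and V from it at attention time. Dimensions are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, d_head = 64, 8, 16

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_head)) / np.sqrt(d_latent)

h = rng.standard_normal((10, d_model))   # hidden states for 10 cached tokens
latent_cache = h @ W_down                # cache only the low-dim latents

k = latent_cache @ W_up_k                # reconstruct K and V on the fly
v = latent_cache @ W_up_v

full_cache_floats = 10 * 2 * d_head      # caching K and V directly, per head
mla_cache_floats = latent_cache.size     # one shared latent instead
print(full_cache_floats, mla_cache_floats)   # 320 vs 80 floats in this toy setup
```

The up-projections add a little compute per decoding step, but memory bandwidth, not FLOPs, is usually the serving bottleneck, which is why trading one for the other pays off.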
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B, 13B and 70B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
A comprehensive family of open-weight code generation models derived from Llama 2, featuring specialized training for Python, infilling, and extended context windows that establish a strong open baseline for code synthesis. The work represents a rigorous scaling and specialization effort rather than a fundamental algorithmic breakthrough, relying on established continued pretraining paradigms and targeted architectural adaptations like special token masking for infilling and positional interpolation for long-context extension. Its primary significance lies in democratizing access to high-performance code generation capabilities, which has catalyzed widespread adoption, downstream fine-tuning, and integration across the open-source ecosystem. While the methodological contributions are incremental relative to the broader language modeling literature, the systematic empirical evaluation, transparent training details, and permissive licensing make it a cornerstone reference for open code models. Compared to proprietary predecessors, it closes the performance gap for open alternatives, yet it does not redefine the underlying training paradigm or introduce novel reasoning mechanisms, keeping it firmly in the tier of highly impactful engineering milestones rather than conceptual breakthroughs.
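The infilling capability rests on a fill-in-the-middle training transformation: the masked middle span is moved to the end so a causal model can predict it while conditioning on both sides. A hedged sketch (the sentinel strings here are placeholders, not the model's actual special tokens):

```python
# Fill-in-the-middle example construction; <PRE>/<SUF>/<MID> are placeholder
# sentinels for illustration, not Code Llama's real vocabulary entries.
def make_infill_example(code: str, start: int, end: int,
                        pre_tok: str = "<PRE>", suf_tok: str = "<SUF>",
                        mid_tok: str = "<MID>") -> str:
    """Move the masked middle span to the end so a left-to-right model can
    predict it while conditioning on both the prefix and the suffix."""
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}{middle}"

src = "def add(a, b):\n    return a + b\n"
mask_start = len("def add(a, b):\n    return ")
example = make_infill_example(src, start=mask_start, end=len(src) - 1)
print(example)
```

At inference time the model is prompted with everything up to the middle sentinel and generates the missing span, which is what editor "complete in the middle of a file" features consume.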
Zhao et al.; survey of methods for extending context window
This survey synthesizes and categorizes the rapidly evolving landscape of techniques for extending context windows in large language models. While it does not propose a new architecture or training paradigm, its systematic taxonomy of positional encoding adaptations, attention optimizations, and memory-efficient training strategies provides a highly valuable reference for researchers navigating a fragmented literature. The work’s primary merit lies in its comprehensive benchmarking comparisons and clear identification of computational trade-offs, which help contextualize recent architectural advances relative to foundational sequence modeling breakthroughs like the original Transformer and GPT series. However, as a synthesis paper, it inherently builds upon rather than redirects the field, serving as an essential onboarding resource rather than a catalyst for new methodological directions.
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
WaveNet introduces a fully convolutional, autoregressive architecture with dilated causal convolutions to directly model raw audio waveforms at scale. This work fundamentally shifted audio generation away from handcrafted acoustic features and external vocoders toward end-to-end neural synthesis, establishing a new paradigm that prioritizes sample-level fidelity and long-range temporal coherence. By efficiently expanding the receptive field without the sequential bottlenecks of recurrent networks, it solved the longstanding computational challenge of modeling high-frequency audio dependencies, directly enabling the modern era of neural TTS and inspiring architectural adaptations across video, image, and time-series domains. Its influence is evident in the rapid obsolescence of parametric and concatenative baselines, and it laid the essential groundwork for subsequent breakthroughs in neural audio codecs, diffusion-based audio models, and multimodal speech systems. Despite the emergence of faster non-autoregressive and latent-space approaches, the core architectural insight remains a cornerstone of generative audio research, warranting its status as a field-defining contribution that permanently altered how the community approaches sequential signal modeling.
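The exponential receptive-field growth from dilated causal convolutions can be verified with an impulse response. A toy 1-D sketch (WaveNet additionally uses gated activations plus residual and skip connections, omitted here):

```python
# Stacked dilated causal convolutions: receptive field grows exponentially
# with depth. Toy width-2 filters; gated/residual machinery omitted.
import numpy as np

def causal_dilated_conv(x: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """y[t] = w[0]*x[t - dilation] + w[1]*x[t], zero-padded on the left (causal)."""
    pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]

x = np.zeros(16)
x[0] = 1.0                                   # unit impulse at t = 0
w = np.array([1.0, 1.0])

y = x
for dilation in (1, 2, 4, 8):                # doubling dilations, as in WaveNet
    y = causal_dilated_conv(y, w, dilation)

receptive_field = int(np.count_nonzero(y))
print(receptive_field)   # 16 = 1 + 1 + 2 + 4 + 8: four layers already span 16 samples
```

With recurrent networks, covering the same span requires 16 sequential steps; here four parallelizable convolution layers suffice, which is the computational point the summary makes.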
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
AudioLM introduces a hierarchical tokenization framework that unifies semantic structure and acoustic fidelity for autoregressive audio generation. By combining discretized activations from a masked audio model with neural codec codes, the work elegantly resolves the longstanding trade-off between long-term coherence and high-fidelity synthesis. This approach fundamentally shifted the audio generation landscape away from purely waveform-based or single-stage token models, establishing the semantic-plus-acoustic token paradigm that subsequently underpinned MusicLM, VALL-E, and the broader wave of discrete audio language models. Its methodological clarity, cross-domain applicability to both speech and music, and demonstration of zero-shot speaker and prosody preservation mark it as a cornerstone contribution that redefined how the field approaches generative audio modeling.
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
Demonstrates that scaling a standard sequence-to-sequence architecture on hundreds of thousands of hours of weakly supervised, web-scraped multilingual audio yields robust, zero-shot speech recognition that rivals human performance. This work fundamentally shifted the audio research paradigm away from heavily curated, task-specific datasets and complex architectural innovations toward data-centric scaling and foundation model pretraining. By proving that massive weak supervision naturally induces emergent multilingual capabilities and acoustic robustness, it established the blueprint for modern audio foundation models and dramatically lowered the barrier to deploying high-quality speech systems across diverse languages and real-world conditions. While prior self-supervised approaches relied on carefully curated corpora and intricate pretext objectives, this study revealed that simple supervised next-token prediction at internet scale yields superior zero-shot transfer and generalization. The open release of weights and inference code further accelerated community adoption, making it a cornerstone reference for subsequent work in speech understanding, audio-language integration, and scalable representation learning.
Hyper-parameter optimization is crucial for pushing the accuracy of a deep learning model to its limits. A hyper-parameter optimization job, referred to as a study, involves numerous trials of training a model using different training knobs, and therefore is very computation-heavy, typically taking hours and days to finish. We observe that trials issued from hyper-parameter optimization algorithms often share common hyper-parameter sequence prefixes. Based on this observation, we propose Hippo, a hyper-parameter optimization system that removes redundancy in the training process to reduce the overall amount of computation significantly. Instead of executing each trial independently as in existing hyper-parameter optimization systems, Hippo breaks down the hyper-parameter sequences into stages and merges common stages to form a tree of stages (called a stage-tree), then executes a stage once per tree on a distributed GPU server environment. Hippo is applicable to not only single studies, but multi-study scenarios as well, where multiple studies of the same model and search space can be formulated as trees of stages. Evaluations show that Hippo's stage-based execution strategy outperforms trial-based methods such as Ray Tune for several models and hyper-parameter optimization algorithms, reducing GPU-hours and end-to-end training time significantly.
Hippo introduces a stage-tree execution model that deduplicates shared hyper-parameter prefixes across HPO trials to reduce redundant GPU computation. While the work addresses a practical bottleneck in large-scale model tuning, the core concept of checkpoint sharing and DAG-based trial execution builds upon established ideas in workflow orchestration and multi-fidelity optimization. The contribution is primarily systems-oriented, offering incremental efficiency gains rather than algorithmic breakthroughs, and its impact remains confined to ML infrastructure rather than advancing core audio or representation learning methodologies. Consequently, it represents a competent engineering advance for practitioners managing compute-heavy search spaces, but lacks the conceptual novelty or broad methodological shift required for field-wide significance.
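The deduplication arithmetic is easy to make concrete: under stage-based execution, the number of training stages equals the number of distinct hyper-parameter sequence prefixes. A sketch with hypothetical trial data (the real system additionally schedules these stages across distributed GPU servers):

```python
# Stage-tree counting sketch: trials sharing a hyper-parameter prefix train
# that prefix's stages only once. Trial data below is hypothetical.
def count_stages(trials):
    """Distinct prefixes = stages actually executed under stage-based merging."""
    stages = set()
    for trial in trials:
        for depth in range(1, len(trial) + 1):
            stages.add(tuple(trial[:depth]))
    return len(stages)

# Three trials; each element is the knob setting for one training stage.
trials = [
    ("lr=0.1", "wd=1e-4", "drop=0.1"),
    ("lr=0.1", "wd=1e-4", "drop=0.3"),   # shares its first two stages with trial 1
    ("lr=0.1", "wd=5e-4", "drop=0.1"),   # shares only its first stage
]

trial_based = sum(len(t) for t in trials)   # 9 stages if each trial runs alone
stage_based = count_stages(trials)          # 6 stages after prefix merging
print(trial_based, stage_based)
```

The savings grow with how aggressively the search algorithm reuses early-stage settings, which is why the gains are largest in multi-study scenarios over the same search space.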
Hoogeboom et al.; 3D molecular generation with equivariant diffusion
Introduces an E(3)-equivariant diffusion framework that jointly generates 3D atomic coordinates and discrete atom types, establishing a new architectural standard for geometric molecular generation. The work meaningfully bridges continuous diffusion processes with geometric deep learning, solving the critical challenge of maintaining rotational and translational invariance while modeling discrete-continuous joint distributions. Compared to discriminative or docking-focused baselines like SchNet and DiffDock, EDM shifts the paradigm toward unconditional and conditional 3D generation, effectively complementing structure prediction breakthroughs like AlphaFold by addressing the inverse design problem for small molecules. While its immediate application space is narrower than cross-disciplinary protein or genomic foundation models, the underlying equivariant diffusion blueprint has rapidly become a standard reference in geometric deep learning, enabling a wave of subsequent generative chemistry and 3D structure modeling work. The methodological rigor and clear empirical validation place it at the upper bound of solid conference-tier contributions, though it stops short of the field-redefining, cross-domain impact required for higher calibration.
Temporal reasoning is the task of predicting temporal relations of event pairs. While temporal reasoning models can perform reasonably well on in-domain benchmarks, we have little idea of these systems' generalizability due to existing datasets' limitations. In this work, we introduce a novel task named TODAY that bridges this gap with temporal differential analysis, which, as the name suggests, evaluates whether systems can correctly understand the effect of incremental changes. Specifically, TODAY introduces slight contextual changes for given event pairs, and systems are asked to tell how this subtle contextual change would affect relevant temporal relation distributions. To facilitate learning, TODAY also annotates human explanations. We show that existing models, including GPT-3.5, drop to random guessing on TODAY, suggesting that they heavily rely on spurious information rather than proper reasoning for temporal predictions. On the other hand, we show that TODAY's supervision style and explanation annotations can be used in joint learning, encouraging models to use more appropriate signals during training and thus outperform across several benchmarks. TODAY can also be used to train models to solicit incidental supervision from noisy sources such as GPT-3.5, thus moving us more toward the goal of generic temporal reasoning systems.
Introduces the TODAY benchmark and a joint learning framework that leverages differential analysis and explanation supervision to improve temporal reasoning robustness and generalization. The work addresses a critical evaluation gap by shifting from static prediction to counterfactual-style differential analysis, effectively exposing how current language models rely on spurious correlations rather than genuine temporal logic. While the proposed training paradigm and noisy supervision pipeline offer a practical pathway toward more robust reasoning systems, the methodological advances build incrementally on established paradigms in explanation-augmented learning and robustness evaluation. The benchmark will likely serve as a valuable diagnostic tool for the natural language processing community, but the approach remains specialized to temporal relation extraction rather than offering broad algorithmic breakthroughs that would reshape the wider machine learning landscape.
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts, and utilizes a gated routing network to activate experts conditionally. However, as the number of experts grows, MoE models with very large parameter counts suffer from overfitting and sparse data allocation. These problems are especially severe on tasks with limited data, hindering MoE models from improving performance by scaling up. In this work, we propose Mixture of Expert Clusters (MoEC), a general approach that enables expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. We further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments show that MoEC improves performance on machine translation and natural language understanding tasks, and raises the performance upper bound for scaling up experts under limited data. We also verify that MoEC plays a positive role in mitigating overfitting and sparse data allocation.
The paper introduces cluster-level regularization and variance-constrained routing to mitigate overfitting and expert collapse in sparse Mixture-of-Experts models under data-limited regimes. While the work addresses a well-documented bottleneck in MoE architectures—namely, the degradation of routing stability and generalization when parameter counts outpace available training data—the methodological contributions build incrementally on established load-balancing and dropout literature rather than introducing a fundamentally new routing paradigm. Structuring experts into clusters and applying targeted dropout alongside routing variance penalties offers a pragmatic stabilization scheme that improves sample efficiency and mitigates sparse activation patterns. The empirical validation on standard translation and language understanding tasks demonstrates practical utility for practitioners training MoEs on constrained or domain-specific corpora. However, the approach is unlikely to displace mainstream routing strategies in large-scale foundation model development, where massive datasets, specialized auxiliary losses, and hardware-aware capacity planning already dominate the scaling landscape. Consequently, the paper serves as a useful engineering refinement for resource-constrained MoE deployments rather than a broadly transformative advance, positioning it as a solid but niche contribution to the broader MoE optimization literature.
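The two mechanisms named above — cluster-level expert dropout and a variance-based routing constraint — can be sketched in a few lines. This is a toy numpy illustration, not the paper's implementation: the cluster layout, the top-1 routing, and in particular the variance penalty (here a simple load-variance term over expert routing mass) are illustrative stand-ins for MoEC's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def route_with_clusters(logits, n_clusters, cluster_drop_p=0.0):
    """Top-1 routing over experts grouped into equal-size clusters (sketch).

    logits: (tokens, experts). Cluster-level dropout zeroes out an entire
    cluster's logits at training time, forcing tokens onto other clusters.
    """
    tokens, experts = logits.shape
    per_cluster = experts // n_clusters
    logits = logits.copy()
    # cluster-level expert dropout (simplified): drop whole clusters at once
    for c in range(n_clusters):
        if rng.random() < cluster_drop_p:
            logits[:, c * per_cluster:(c + 1) * per_cluster] = -np.inf
    choice = logits.argmax(axis=1)
    # illustrative variance term: low variance of per-expert routing mass
    # means balanced allocation; a penalty on it constrains the router
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    load = probs.mean(axis=0)          # average routing mass per expert
    variance_penalty = load.var()
    return choice, variance_penalty
```

In a real MoE layer this penalty would be added to the task loss with a small coefficient, analogous to the auxiliary load-balancing losses in the broader MoE literature.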
Recently, the stability of graph filters has been studied as one of the key theoretical properties driving the highly successful graph convolutional neural networks (GCNs). The stability of a graph filter characterizes the effect of topology perturbations on the output of the filter, a fundamental building block for GCNs. Many existing results have focused on the regime of small perturbations with a small number of edge rewires. However, the number of edge rewires can be large in many applications. To study the latter case, this work departs from the previous analysis and proves a bound on the stability of a graph filter that relies on the filter's frequency response. Assuming the graph filter is low pass, we show that the stability of the filter depends on perturbations to the community structure. As an application, we show that for stochastic block model graphs, the graph filter distance converges to zero as the number of nodes approaches infinity. Numerical simulations validate our findings.
This work establishes theoretical stability bounds for low-pass graph filters under large-scale topological perturbations, demonstrating that robustness depends on community structure preservation rather than merely the count of edge rewires. While the paper provides a rigorous mathematical extension to existing graph neural network stability theory, its scope remains highly specialized within theoretical graph signal processing. The analysis offers valuable insights for understanding how spectral methods behave under significant structural noise, shifting the focus from local edge changes to global community preservation. However, the contribution is narrowly theoretical, lacking empirical validation on real-world biological networks or large-scale benchmarks, and does not introduce new architectures, training paradigms, or broadly applicable tools. Compared to foundational computational biology works that drive empirical breakthroughs in protein design or genomic modeling, this paper operates at an abstract mathematical level with limited immediate translational impact. It represents a solid, incremental advance in geometric deep learning theory that will primarily interest researchers studying graph robustness and spectral methods, rather than reshaping broader machine learning or bio-ML practice.
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action (VLA) models and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
RT-2 introduces a unified vision-language-action architecture that tokenizes robotic control signals alongside natural language, enabling direct co-fine-tuning of internet-scale vision-language models for physical robot control. The work elegantly bridges the gap between large-scale web pretraining and low-level robotic control by treating continuous or discrete actions as vocabulary tokens, a deceptively simple formulation that unlocks substantial emergent semantic reasoning and zero-shot generalization. By demonstrating that a single co-fine-tuned model can inherit the rich world knowledge of foundation models while maintaining precise motor control, it effectively establishes the vision-language-action paradigm that has since become a dominant baseline in embodied AI research. The extensive empirical validation rigorously substantiates claims of improved generalization to novel objects, spatial reasoning, and multi-stage chain-of-thought planning. While the architectural recipe conceptually extends prior multi-modal tokenization efforts, its practical execution and clear demonstration of direct knowledge transfer to physical systems represent a significant methodological advance that has fundamentally shifted how the community approaches generalizable robot learning.
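The "actions as text tokens" recipe reduces to a simple discretization: each continuous action dimension is binned into one of 256 values and emitted as a token id in the model's vocabulary. The sketch below shows that round trip; the bin count follows the 256-level discretization described for RT-2, while the action bounds and the exact token mapping are assumptions for illustration.

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Discretize a continuous action vector into integer token ids (sketch).

    Each dimension is clipped to [low, high] and mapped to one of `bins`
    levels, so actions can share the model's text vocabulary.
    """
    ids = []
    for a in action:
        a = min(max(a, low), high)
        ids.append(int((a - low) / (high - low) * (bins - 1) + 0.5))
    return ids

def tokens_to_action(ids, low=-1.0, high=1.0, bins=256):
    """Invert the discretization back to approximate continuous values."""
    return [low + i / (bins - 1) * (high - low) for i in ids]
```

At inference time the VLA model simply decodes these ids like any other text tokens, and the detokenizer turns them back into motor commands.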
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io
RT-1 demonstrates that transformer-based architectures, when trained on massive, diverse real-world robotic datasets, exhibit strong scaling properties and zero-shot generalization across hundreds of manipulation tasks. The work successfully bridges the scaling paradigm from vision and language to physical control, establishing empirical evidence that model capacity and dataset diversity directly translate to robust, generalizable policies in real-world settings. By introducing a tokenized action representation and validating performance across a broad spectrum of tasks, it provides a practical blueprint for building generalist robot policies. While the architectural choices build upon established transformer designs rather than introducing fundamentally new mechanisms, the rigorous large-scale empirical study and the demonstration of real-world zero-shot transfer represent a meaningful methodological advance. The paper has already catalyzed a wave of research into robotic foundation models, shifting the community’s focus from narrow, task-specific controllers toward scalable, data-driven generalist agents. Its influence on subsequent architectures and data collection pipelines justifies its placement among the most impactful recent contributions to embodied AI.
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at https://voyager.minedojo.org/.
Voyager introduces a framework for open-ended embodied agents that leverages LLMs to autonomously generate curricula, store executable code as a persistent skill library, and iteratively refine behaviors through environmental feedback. The work represents a meaningful architectural shift in how LLM-driven agents handle lifelong learning and compositional skill acquisition. By treating executable code as a persistent, queryable memory rather than relying on static prompts or weight updates, the system effectively mitigates catastrophic forgetting while enabling rapid capability compounding. Compared to prior reactive planning frameworks like SayCan or monolithic policy models like RT-1/2, this approach enables sustained, open-ended skill accumulation rather than single-episode task execution. The integration of an automatic curriculum and self-verification loop demonstrates that black-box LLMs can drive complex task decomposition and exploration without fine-tuning. While constrained to a simulated environment, the methodological contributions—particularly the code-as-skill paradigm and feedback-driven iterative prompting—establish a strong reference point for subsequent research in agentic memory, autonomous curriculum generation, and general-purpose embodied reasoning. The open-source release further accelerates adoption, positioning this work as a foundational step toward scalable, lifelong learning agents.
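The code-as-skill idea above is easy to make concrete: skills are stored as executable code keyed by a natural-language description and retrieved by similarity to the current task. This sketch substitutes naive keyword overlap for Voyager's embedding-based retrieval, and the skill names and code strings are hypothetical.

```python
class SkillLibrary:
    """Minimal sketch of a Voyager-style skill library.

    Stores executable code keyed by a description; retrieval here uses
    keyword overlap as a stand-in for the embedding similarity search the
    real system performs over GPT-generated skill descriptions.
    """

    def __init__(self):
        self.skills = {}  # description -> code string

    def add(self, description, code):
        self.skills[description] = code

    def retrieve(self, query, k=1):
        words = set(query.lower().split())
        ranked = sorted(
            self.skills.items(),
            key=lambda kv: -len(words & set(kv[0].lower().split())),
        )
        return [code for _, code in ranked[:k]]
```

Because skills are plain code, retrieved entries can be composed into new programs, which is what lets capabilities compound without weight updates.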
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations in fitting these models into limited device memory while retaining computation, communication, and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportionally to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state of the art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create the world's largest language model (Turing-NLG, 17B parameters) with record-breaking accuracy.
ZeRO introduces a systematic memory partitioning strategy that eliminates optimizer, gradient, and parameter redundancies across data-parallel workers, enabling efficient training of models with hundreds of billions of parameters without complex model parallelism. By decoupling memory footprint from communication overhead, the framework fundamentally redefined the scalability limits of distributed training. Its architectural elegance directly inspired industry-standard implementations like DeepSpeed and PyTorch FSDP, transforming how research groups and industry labs approach large-scale model training and effectively democratizing access to trillion-parameter scale experimentation.
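The partitioning arithmetic behind these claims is simple to reproduce. Using the mixed-precision Adam accounting from the paper (2 bytes of fp16 parameters + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer states per parameter), each ZeRO stage partitions one more of those buffers across the data-parallel group:

```python
def zero_memory_gb(params_b, n_gpus, stage):
    """Per-GPU model-state memory under ZeRO (mixed-precision Adam).

    params_b: model size in billions of parameters.
    stage 0: plain data parallelism (everything replicated).
    stage 1: optimizer states partitioned; stage 2: + gradients;
    stage 3: + parameters themselves.
    """
    P = params_b * 1e9
    if stage == 0:
        bytes_per = 16                 # 2 (fp16 w) + 2 (fp16 g) + 12 (opt)
    elif stage == 1:
        bytes_per = 4 + 12 / n_gpus    # partition optimizer states
    elif stage == 2:
        bytes_per = 2 + 14 / n_gpus    # + partition gradients
    else:
        bytes_per = 16 / n_gpus        # + partition parameters
    return P * bytes_per / 2**30
```

For a 7.5B-parameter model on 64 GPUs this drops per-GPU model-state memory from roughly 112 GB (untrainable on any single device) to under 2 GB at stage 3, which is the mechanism behind the "model size scales with device count" claim.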
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (perplexity of 10.8 compared to the SOTA of 15.8) and LAMBADA (accuracy of 66.5% compared to the SOTA of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (accuracy of 90.9% compared to the SOTA of 89.4%).
This work introduces a practical intra-layer model parallelism strategy that splits transformer matrix operations across GPUs with minimal communication overhead, enabling efficient training of multi-billion parameter language models. The approach stands out for its elegant simplicity and seamless integration into native PyTorch, avoiding the need for specialized compilers or extensive framework modifications. By decoupling tensor parallelism from pipeline parallelism, it established a modular paradigm that has since become the de facto standard for scaling large language models across industry and academia. The demonstrated scaling efficiency and empirical validation on major benchmarks proved that memory-bound training bottlenecks could be systematically overcome through careful communication scheduling and architectural adjustments like layer normalization placement. While the underlying distributed computing concepts predate this work, the specific formulation for transformer architectures directly catalyzed the modern era of large-scale model training, serving as the foundational blueprint for subsequent frameworks and research in distributed deep learning. Its enduring influence on how practitioners design and deploy training infrastructure justifies its placement among the most consequential systems contributions in recent machine learning history.
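The core Megatron trick for the transformer MLP is to split the first weight matrix by columns and the second by rows, so each device computes an independent shard and a single all-reduce combines the results. The numpy sketch below simulates the math on one process (the sum at the end plays the role of the all-reduce); shard counts and shapes are illustrative.

```python
import numpy as np

def tp_mlp(x, W1, W2, n_parts):
    """Megatron-style tensor parallelism for a two-layer ReLU MLP (sketch).

    W1 is split by columns, W2 by matching rows; each 'device' computes
    its shard independently, and summing the partial outputs corresponds
    to the single all-reduce the real implementation inserts.
    """
    cols = np.array_split(np.arange(W1.shape[1]), n_parts)
    partials = []
    for part in cols:
        h = np.maximum(x @ W1[:, part], 0.0)   # column-parallel GEMM + ReLU
        partials.append(h @ W2[part, :])       # row-parallel GEMM
    return np.sum(partials, axis=0)            # the all-reduce
```

Because ReLU is elementwise, applying it per shard is exact, which is why this particular column/row split needs no communication between the two GEMMs.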
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
Introduces an IO-aware tiling algorithm that minimizes high-bandwidth memory traffic during self-attention, enabling exact computation with significantly reduced memory footprint and faster wall-clock speeds. The work fundamentally shifts the optimization paradigm for neural network kernels from FLOP-centric to memory-bandwidth-centric, correctly identifying that modern GPU bottlenecks lie in data movement rather than arithmetic throughput. By leveraging blocked tiling combined with strategic on-chip recomputation, it circumvents the quadratic memory overhead that historically forced practitioners toward lossy approximate attention methods. This architectural insight has rapidly become a foundational primitive across the ML systems ecosystem, directly enabling the practical scaling of context windows in large language models and influencing the design of subsequent compiler passes and custom operators. Compared to broader infrastructure frameworks, this paper delivers a highly targeted, mathematically grounded kernel optimization that bridges the gap between theoretical algorithmic complexity and real-world hardware constraints, establishing a new standard for how attention is implemented in both training and inference pipelines.
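The tiling described above rests on an online-softmax recurrence: each KV tile updates a running row maximum, a running denominator, and a rescaled output accumulator, so the exact softmax is recovered without ever materializing the full attention matrix. The numpy sketch below reproduces that math; the real kernel's contribution is keeping each tile in on-chip SRAM, which plain numpy cannot express.

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Tiled exact attention via the online-softmax recurrence (sketch)."""
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)        # running row max
    l = np.zeros(n)                # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale           # scores for one KV tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)                  # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ V[j:j + block]
        m = m_new
    return out / l[:, None]
```

Memory for the scores is O(n·block) instead of O(n²), which is the linear-memory property the abstract cites.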
Lin et al.; better quantization by protecting salient weights
AWQ introduces an activation-aware weight quantization strategy that identifies and preserves a small subset of salient weights at higher precision, enabling highly accurate post-training quantization for large language models. The work addresses a critical bottleneck in LLM deployment by demonstrating that quantization error is not uniformly distributed across parameters, but rather concentrated around channels that produce activation outliers. By leveraging activation statistics to selectively protect these critical weights while aggressively quantizing the remainder, the method achieves a remarkable balance between memory efficiency and model fidelity without requiring costly fine-tuning. This insight directly bridges the gap between theoretical compression limits and practical deployment constraints, offering a hardware-friendly solution that has been rapidly integrated into major inference frameworks and compilers. Compared to prior post-training techniques, AWQ provides a robust, scalable approach that has effectively democratized access to large-scale models on commodity hardware, cementing its status as a foundational contribution to modern ML systems.
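The salient-weight protection can be demonstrated on a single quantization group. All weights in a group share one rounding step set by the group's largest magnitude; scaling an activation-heavy weight up by s before rounding and dividing back afterward shrinks its rounding error roughly s-fold, provided the scaled value stays below the group max so the shared step is unchanged. This is a simplified sketch: the bit width, group layout, and fixed scale s here are illustrative, whereas AWQ searches for the scales and folds them into the preceding operator.

```python
import numpy as np

def awq_group_quantize(w, salient, s, bits=4):
    """Quantize one weight group, protecting a salient index (sketch).

    w: 1-D group of weights sharing one quantization step.
    salient: index of the activation-heavy weight to protect.
    s: protection scale applied before rounding, removed after.
    """
    scales = np.ones_like(w)
    scales[salient] = s
    ws = w * scales
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(ws).max() / qmax          # shared per-group step
    return np.round(ws / step) * step / scales
```

The effect is that the protected weight is rounded on a grid s times finer than its neighbors while the stored integers stay at the same bit width.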
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4$\times$ with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
The paper introduces PagedAttention, a memory management algorithm that adapts operating system virtual memory paging to dynamically allocate and share transformer key-value caches, enabling the vLLM serving system to drastically reduce memory fragmentation and boost inference throughput. By recognizing that KV cache allocation patterns mirror classical OS page faults and fragmentation, the authors successfully bridge foundational systems concepts with modern generative AI workloads, yielding a highly practical and rapidly adopted solution. The work fundamentally shifts how practitioners approach LLM deployment, replacing rigid, pre-allocated memory schemes with a flexible, demand-driven architecture that scales efficiently across diverse model sizes and complex decoding strategies. Its design directly addresses the critical memory bottleneck in autoregressive generation, establishing a new baseline for continuous batching and serving efficiency that outperforms prior specialized frameworks while remaining highly accessible to the broader research and engineering community.
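The OS analogy maps directly onto code: the KV cache is carved into fixed-size physical blocks, a free list hands them out on demand, and each request keeps a block table translating its logical token positions to physical blocks. This sketch shows only the bookkeeping (the block and pool sizes are illustrative); the actual attention kernel that gathers KV entries through the block table is omitted.

```python
class PagedKVCache:
    """Sketch of PagedAttention-style KV-cache block management."""

    def __init__(self, num_blocks=1024, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of physical blocks
        self.tables = {}                      # request id -> block table
        self.lengths = {}                     # request id -> tokens written

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # current block full: allocate
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Internal fragmentation is bounded by one partially filled block per request, which is where the "near-zero waste" claim comes from; sharing a prefix across requests amounts to pointing multiple block tables at the same physical blocks.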
Frantar et al.; 3/4-bit quantization with minimal quality loss; widely used
GPTQ introduces a layer-wise, Hessian-based post-training quantization algorithm that sequentially corrects quantization errors, enabling 3–4 bit compression of large language models with negligible accuracy degradation. The method elegantly bridges numerical optimization and practical deployment by leveraging approximate second-order information to bypass the need for expensive quantization-aware training. While post-training quantization is a mature area, this approach’s mathematical formulation and computational efficiency represent a substantial algorithmic advance over prior first-order or heuristic methods. Compared to contemporaneous techniques that rely on activation smoothing or mixed-precision heuristics, GPTQ’s weight-only, layer-sequential correction achieves superior memory compression without sacrificing perplexity. Its immediate and widespread integration into production inference stacks fundamentally shifted how practitioners deploy generative models, effectively democratizing access to multi-billion parameter systems on commodity hardware and establishing a new baseline for memory-efficient LLM serving.
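The sequential error correction at the heart of this scheme fits in a few lines for a single weight row. This is a simplified OBQ-style sketch: columns are quantized left to right and each rounding error is pushed into the not-yet-quantized weights via the inverse Hessian, which is assumed given here (GPTQ derives it from calibration activations and processes columns in blocks with Cholesky updates for efficiency).

```python
import numpy as np

def gptq_row(w, H_inv, grid):
    """Quantize one weight row column-by-column with error feedback (sketch).

    w: 1-D weight row; H_inv: inverse Hessian of the layer's inputs;
    grid: allowed quantization levels.
    """
    w = w.astype(float).copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = grid[np.abs(grid - w[j]).argmin()]   # nearest grid point
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]         # compensate later columns
    return q
```

With an identity inverse Hessian this degenerates to plain nearest rounding; correlated inputs (off-diagonal H_inv terms) are exactly where the compensation changes later rounding decisions and recovers accuracy.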
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.
The paper introduces a ring-based communication protocol that seamlessly overlaps key-value block transfers with blockwise attention computation, enabling linear scaling of context length with device count. This work addresses a critical systems bottleneck in modern large language models by eliminating the memory constraints that previously forced practitioners to rely on approximate attention mechanisms or suffer severe communication overheads in distributed settings. By extending blockwise computation principles to a multi-device ring topology, the authors achieve exact attention over millions of tokens while maintaining high hardware utilization. Compared to prior sequence parallelism and context-parallel approaches, this method offers a cleaner algorithmic formulation that integrates naturally with existing frameworks and scales efficiently across modern GPU clusters. The contribution sits at the intersection of algorithmic optimization and distributed systems, providing a practical pathway for training and serving long-context models that will likely become standard infrastructure in both academic and industrial pipelines. While it builds upon established blockwise attention foundations rather than introducing a fundamentally new architectural paradigm, its elegant systems design and immediate applicability to the pressing challenge of context scaling justify its high standing in the ML systems landscape.
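The ring schedule can be simulated on one machine: each "device" owns one query shard, KV blocks rotate around the ring one hop per step, and the same online-softmax accumulation used by blockwise attention keeps the result exact. What this sketch cannot show is the system contribution, namely overlapping each KV transfer with the previous block's computation; shard counts are illustrative.

```python
import numpy as np

def ring_attention(Q, K, V, n_devices):
    """Simulated Ring Attention: exact attention over rotating KV blocks."""
    Qs = np.array_split(Q, n_devices)
    KVs = list(zip(np.array_split(K, n_devices),
                   np.array_split(V, n_devices)))
    outs = []
    scale = 1.0 / np.sqrt(Q.shape[1])
    for i, q in enumerate(Qs):
        m = np.full(len(q), -np.inf)       # running row max
        l = np.zeros(len(q))               # running denominator
        acc = np.zeros_like(q)
        for step in range(n_devices):      # one KV block arrives per hop
            k, v = KVs[(i + step) % n_devices]
            S = q @ k.T * scale
            m_new = np.maximum(m, S.max(axis=1))
            alpha = np.exp(m - m_new)
            P = np.exp(S - m_new[:, None])
            l = l * alpha + P.sum(axis=1)
            acc = acc * alpha[:, None] + P @ v
            m = m_new
        outs.append(acc / l[:, None])
    return np.vstack(outs)
```

Each device only ever holds one KV block at a time, so per-device memory is constant in total sequence length, which is why context scales linearly with device count.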
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4$\times$ compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2$\times$ speedup compared to FlashAttention, reaching 50-73\% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72\% model FLOPs utilization).
FlashAttention-2 introduces refined GPU work partitioning and thread-level parallelism strategies that double the throughput of its predecessor, pushing attention computation closer to hardware limits and drastically reducing training costs for long-context models. The paper addresses a critical bottleneck in modern deep learning by systematically diagnosing and resolving occupancy and memory bandwidth inefficiencies in the original FlashAttention kernel. Rather than proposing a new mathematical formulation, it delivers a rigorous hardware-aware algorithmic redesign, demonstrating how careful redistribution of computation across thread blocks and warps can unlock substantial performance gains without approximation. Its impact is immediate and pervasive, as the kernel has rapidly become the default attention implementation across major training frameworks and inference engines, directly enabling the economic feasibility of longer sequence lengths and larger model scales. While the conceptual leap is evolutionary rather than revolutionary, the engineering precision and empirical validation place it among the most consequential systems contributions in recent years, fundamentally altering how the community approaches transformer scaling and establishing a new baseline for compute efficiency in the field.