98 papers · scored by Qwen 3.6 · all categories
The paper introduces the Transformer architecture, replacing recurrent and convolutional sequence models with a purely attention-based design that enables unprecedented parallelization and scalability. By demonstrating that self-attention alone can capture long-range dependencies…
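The core mechanism is scaled dot-product attention. A minimal sketch (a toy pure-Python illustration, not the paper's implementation; the function names are mine):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V,
    for Q, K, V given as lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # each output row is a convex combination of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```

Because every query attends to every key in one matrix product, the computation parallelizes across positions, which is the property the paper exploits.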
LoRA introduces a parameter-efficient fine-tuning paradigm that freezes pre-trained model weights and injects trainable low-rank decomposition matrices into Transformer layers, matching full fine-tuning performance while drastically reducing trainable parameters and eliminating i…
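The update is a frozen weight plus a trainable low-rank product, h = Wx + (α/r)·BAx, with B initialized to zero so training starts from the pre-trained behavior. A minimal sketch (class and helper names are mine, not the paper's code):

```python
import random

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r, alpha=1.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weight; never updated
        self.A = [[random.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init => identity update at start
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        low = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low)]
```

Only A and B (2·r·d parameters per layer instead of d²) would receive gradients, which is where the parameter savings come from.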
This work unifies discrete diffusion processes and continuous score-based generative modeling into a single stochastic differential equation framework, deriving the reverse-time SDE and establishing an equivalent probability flow ODE for exact likelihood computation. The paper pr…
Introduced the Deep Q-Network architecture, which successfully combined convolutional neural networks with Q-learning stabilized by experience replay and target networks to learn control policies directly from raw pixel inputs. This work serves as the foundational catalyst for mo…
Establishes empirical power-law relationships between compute, dataset size, and model parameters, providing a principled framework for compute-optimal scaling of neural language models. This work fundamentally shifted large-scale model development from heuristic, trial-and-error…
The paper introduces the lottery ticket hypothesis, demonstrating that dense neural networks contain sparse subnetworks that, when isolated and trained from their original random initialization, can match the performance of the full network. This work fundamentally reshaped our u…
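The pruning step behind the hypothesis is simple global magnitude pruning; the subnetwork is then rewound to its original initialization and retrained. A toy sketch of the masking step (function name mine):

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask keeping the largest-magnitude weights,
    pruning the given fraction of the smallest ones."""
    n_prune = int(len(weights) * sparsity)
    # indices of the n_prune smallest weights by absolute value
    pruned = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]
```

In the paper's iterative variant this prune-rewind-retrain loop is repeated, removing a small fraction of weights per round.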
The paper establishes a highly capable, multimodal foundation model whose performance scales predictably across orders of magnitude in compute, setting new benchmarks for factual reasoning and alignment. This work represents a watershed moment in scaling large language models, de…
The paper introduces a gradient-based meta-learning framework that optimizes model initializations for rapid task adaptation, establishing a rigorous and scalable paradigm for few-shot learning. By framing meta-learning as a differentiable optimization problem over parameter spac…
The paper establishes a compute-optimal scaling law demonstrating that model parameters and training tokens should scale proportionally, fundamentally redirecting LLM training strategies toward data-rich, parameter-efficient regimes. This work systematically dismantles the prevai…
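Under the common approximation C ≈ 6·N·D (compute ≈ 6 × parameters × tokens) and the paper's finding that N and D should grow in proportion, a rough allocation rule follows in closed form. A sketch, assuming the oft-quoted ~20 tokens-per-parameter heuristic (the exact constants in the paper differ by fit):

```python
def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal allocation: with C = 6*N*D and D = k*N,
    solving gives N = sqrt(C / (6*k)) and D = k*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

For example, a 1B-parameter model under this rule pairs with roughly 20B training tokens, far more data per parameter than the earlier scaling recipes prescribed.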
Introduces a clipped surrogate objective that enables stable, multi-epoch policy gradient updates without complex second-order optimization. The work solves a persistent instability problem in policy gradient methods by replacing computationally heavy trust-region constraints wit…
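The clipped surrogate takes a pessimistic minimum between the unclipped importance-weighted advantage and a version whose probability ratio is clamped to [1−ε, 1+ε]. A minimal sketch for a single sample (function name mine):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one (state, action) sample.
    ratio = pi_new(a|s) / pi_old(a|s); returns a loss to minimize."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # pessimistic bound: take the smaller objective, negate for descent
    return -min(unclipped, clipped)
```

Because the gradient vanishes once the ratio leaves the trust band in the beneficial direction, multiple epochs of updates on the same batch stay stable without a KL constraint.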
Direct Preference Optimization reformulates the reinforcement learning from human feedback objective to derive a closed-form optimal policy, enabling preference alignment through a simple supervised classification loss without explicit reward modeling or complex reinforcement lea…
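The resulting loss is a logistic regression on the difference of policy-vs-reference log-ratios for the chosen and rejected responses. A minimal sketch (function name mine):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference, the margin is zero and the loss is log 2; increasing the chosen response's likelihood relative to the reference drives the loss down, which is exactly the implicit reward the paper derives.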
QLoRA introduces a theoretically grounded 4-bit quantization scheme combined with targeted memory management techniques that enable full-parameter-equivalent fine-tuning of massive language models on single consumer-grade GPUs. This work fundamentally democratizes access to large…
Introduces a scalable, first-order approximation of spectral graph convolutions that enables efficient semi-supervised learning on graph-structured data. This work bridges the gap between computationally prohibitive spectral graph theory and practical deep learning by deriving a …
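The layer-wise propagation rule it derives is H' = σ(D̂^(-1/2) (A+I) D̂^(-1/2) H W). A toy dense-matrix sketch (helper names mine; real implementations use sparse operations):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, H, W):
    """One GCN layer: ReLU of the symmetrically normalized propagation
    D^-1/2 (A + I) D^-1/2 H W, with self-loops added via the identity."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, H), W)
    return [[max(0.0, v) for v in row] for row in out]
```

Each layer mixes a node's features with its one-hop neighborhood, so stacking k layers gives a k-hop receptive field without ever computing a spectral decomposition.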
Introduces selective state space models that dynamically adjust parameters based on input content, combined with a hardware-aware parallel algorithm, enabling linear-time sequence modeling that rivals Transformer performance across multiple modalities. The work successfully addre…
This work introduces Deep Deterministic Policy Gradient (DDPG), successfully adapting experience replay and target networks from discrete-action Deep Q-Networks to deterministic policy gradients for stable, model-free continuous control. The paper bridges a critical gap in deep r…
This work introduces a principled framework that unifies maximum entropy reinforcement learning with off-policy actor-critic methods to achieve stable, sample-efficient continuous control. By explicitly optimizing for both reward maximization and policy entropy, the method natura…
Introduces a scalable alignment framework that replaces human preference labels with AI-generated critiques and preferences guided by a fixed set of principles, enabling effective harmlessness training with minimal human oversight. The work addresses a critical bottleneck in pref…
The paper introduces masked self-attention to graph-structured data, enabling nodes to dynamically weight neighborhood contributions without relying on fixed spectral filters or costly matrix operations. By decoupling attention computation from graph topology, it elegantly bridge…
Reformer introduces locality-sensitive hashing (LSH) attention and reversible residual layers to reduce the computational and memory complexity of self-attention from quadratic to near-linear, enabling practical training on long sequences. The work directly tackles the primary sc…
Introduces a natively multimodal transformer family trained from scratch on mixed modalities, establishing new state-of-the-art performance across diverse reasoning and perception benchmarks while detailing scalable post-training and deployment pipelines. The work represents a su…
Default optimizer for most modern ML; Kingma & Ba
Made very deep networks trainable; Ioffe & Szegedy
The paper demonstrates that a pure Transformer architecture, applied directly to sequences of image patches without convolutional inductive biases, achieves state-of-the-art image classification when pre-trained at scale. This work fundamentally challenged the long-standing domin…
The paper establishes a mathematically principled and practically stable framework for training diffusion-based generative models by unifying variational inference with denoising score matching, achieving image quality competitive with adversarial methods. This work fundamentally…
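Training hinges on the closed-form forward process: x_t can be sampled directly from x_0 as √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β_s). A minimal sketch (function name mine):

```python
import math
import random

def ddpm_forward(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = 1.0
    for b in betas[:t]:
        abar *= (1.0 - b)
    eps = [random.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps
```

The training objective then reduces to predicting the noise ε from (x_t, t) with a simple mean-squared error, which is the stability the summary refers to.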
Introduces a scalable contrastive pre-training framework that learns robust visual representations directly from natural language supervision, enabling unprecedented zero-shot transfer across diverse vision tasks. This work fundamentally shifts computer vision from closed-set, ma…
Introduces the Region Proposal Network (RPN) to unify feature extraction, proposal generation, and bounding box regression into a single, end-to-end trainable architecture. This work fundamentally shifted object detection from multi-stage pipelines reliant on hand-crafted externa…
Introduces a symmetric encoder-decoder convolutional architecture with skip connections and a data-efficient training strategy that became the foundational backbone for dense prediction and biomedical image segmentation. U-Net fundamentally shifted the paradigm for pixel-wise pre…
DETR reformulates object detection as a direct set prediction problem using a Transformer encoder-decoder architecture, eliminating hand-crafted components like anchor boxes and non-maximum suppression. This work represents a fundamental paradigm shift in dense prediction, succes…
The paper introduces a two-stage generative framework that first maps text prompts to a CLIP image embedding via a learned prior, then synthesizes high-fidelity images from that embedding using a diffusion decoder. This architecture fundamentally restructured the text-to-image ge…
Introduces a unified autoregressive transformer that jointly models text and discrete image tokens, demonstrating that scaling data and compute enables competitive zero-shot text-to-image synthesis without task-specific architectural priors. This work fundamentally shifts the gen…
Introduces a self-distillation framework that trains Vision Transformers without labels, revealing emergent semantic properties like object segmentation and patch correspondence in attention maps. The work fundamentally shifted self-supervised representation learning by demonstra…
The paper establishes that aggressive random patch masking combined with an asymmetric encoder-decoder architecture enables highly efficient and scalable self-supervised pre-training for vision transformers. By demonstrating that high masking ratios force models to learn robust s…
Flamingo introduces a novel architecture that bridges frozen vision and language models via cross-attention and a Perceiver resampler, enabling robust in-context few-shot learning across diverse multimodal tasks. The work fundamentally shifts the paradigm for vision-language mode…
The paper introduces latent diffusion models, which shift the computationally intensive denoising process from high-dimensional pixel space to a compressed, semantically rich latent representation learned by a pretrained autoencoder, while incorporating cross-attention layers for…
The paper introduces a hierarchical vision transformer architecture that leverages shifted local windows to achieve linear computational complexity while enabling cross-window information flow, establishing a highly efficient and scalable backbone for diverse vision tasks. This w…
The paper demonstrates that stacking small 3x3 convolutional filters to increase network depth significantly improves feature representation and recognition accuracy on large-scale datasets. This work fundamentally shifted architectural design philosophy in computer vision by pro…
Introduces a promptable foundation model for image segmentation trained on a billion-mask dataset via an automated data engine, shifting the field from task-specific models to zero-shot, interactive segmentation. This work successfully translates the foundation model paradigm to …
Introduces a streamlined vision-language instruction tuning pipeline that leverages GPT-4-synthesized multimodal data to align frozen vision encoders with large language models using only a simple projection layer. The work’s significance stems from its elegant reduction of a hig…
Introduces multi-scale feature prediction and a residual-based backbone to significantly improve real-time object detection accuracy while maintaining high inference speeds. The work deliberately embraces an iterative, engineering-driven design philosophy rather than proposing a …
SDXL advances latent diffusion models through strategic architectural scaling, dual-text-encoder conditioning, multi-aspect-ratio training, and a dedicated refinement pipeline, establishing a highly capable open-source baseline for high-resolution image synthesis. While the work …
The paper establishes a highly effective, open-source vision-language baseline by refining data curation, training recipes, and architectural choices for connecting frozen vision encoders to large language models. While conceptually incremental over prior instruction-tuning appro…
Kicked off the deep learning era; ImageNet competition winner
Zeiler & Fergus; visualised what CNNs learn; their findings led to improvements over AlexNet
Ian Goodfellow et al.; introduced adversarial training
Demonstrates that scaling autoregressive language models to unprecedented parameter counts unlocks robust in-context learning capabilities, fundamentally shifting the paradigm from task-specific fine-tuning to prompt-based adaptation. This work represents a watershed moment in ar…
Introduces a soft, differentiable alignment mechanism that allows sequence-to-sequence models to dynamically focus on relevant parts of the input, fundamentally overcoming the fixed-context bottleneck of early encoder-decoder architectures. This work represents a foundational par…
Introduces a deeply bidirectional transformer pre-training framework using masked language modeling that establishes a unified, fine-tuning-based paradigm for natural language understanding. The work fundamentally shifts the research trajectory away from designing highly speciali…
This work establishes a scalable post-training pipeline that aligns large language models with human intent by combining supervised fine-tuning on demonstrations with reinforcement learning from human feedback. The paper demonstrates that targeted alignment can dramatically impro…
This work introduces the encoder-decoder LSTM architecture for sequence-to-sequence mapping, establishing the foundational paradigm that replaced statistical machine translation and ultimately enabled modern generative language models. The paper's significance lies in its elegant…
Introduced computationally efficient neural architectures (Skip-gram and CBOW) with negative sampling and hierarchical softmax to learn high-quality distributed word and phrase representations at scale. While modern contextualized models have superseded static embeddings, this wo…
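The negative-sampling objective pushes a center vector toward its observed context vector and away from a handful of sampled "noise" words. A minimal sketch of the per-pair loss (function names mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair:
    -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss
```

Replacing the full softmax over the vocabulary with k sampled negatives is what made training at billions of tokens practical.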
The paper demonstrates that explicitly prompting large language models to generate intermediate reasoning steps dramatically enhances their performance on complex multi-step tasks. This work fundamentally shifted the paradigm of how practitioners interact with generative models, …
This work establishes the modern reinforcement learning from human feedback pipeline, demonstrating that training a reward model on human preferences and optimizing a language model via proximal policy optimization yields substantially more aligned and higher-quality summaries th…
The paper introduces a unified framework that seamlessly integrates dense retrieval with autoregressive generation, enabling models to dynamically access external knowledge while maintaining end-to-end differentiability. This work addresses a fundamental limitation of purely para…
The paper introduces a prompting paradigm that interleaves verbal reasoning traces with executable actions, enabling LLMs to dynamically plan, ground their thoughts in external environments, and self-correct. This work represents a pivotal shift in how we conceptualize LLM capabi…
Introduces a comprehensive, multi-domain benchmark that evaluates language models across fifty-seven academic and professional subjects, establishing a standardized metric for assessing broad knowledge and reasoning capabilities. This work fundamentally shifted how the community …
Introduces a standardized, multi-task evaluation framework and diagnostic suite that fundamentally reshaped how natural language understanding systems are measured, compared, and developed. The work addresses a critical fragmentation problem in NLP by unifying disparate datasets …
ELECTRA introduces a highly compute-efficient pre-training objective that replaces masked language modeling with a replaced-token detection task, training a discriminator to distinguish original tokens from those swapped by a small generator. By providing a training signal for ev…
This work empirically demonstrates that language models exhibit a U-shaped performance curve across extended contexts, systematically degrading when critical information is positioned in the middle. By challenging the prevailing assumption of uniform context utilization, the stud…
The paper introduces a unified text-to-text framework that reformulates all natural language processing tasks as sequence generation, accompanied by a comprehensive scaling study and the C4 pretraining corpus. This work fundamentally shifted the NLP landscape by demonstrating tha…
This work demonstrates that systematic optimization of pretraining dynamics, rather than architectural modifications, is the primary driver of performance gains in masked language models. By rigorously ablating the original training recipe, the authors reveal that the baseline mo…
Introduces Codex, a code-specialized large language model, alongside the HumanEval benchmark and pass@k evaluation methodology, establishing the foundational framework for modern code generation research. This work fundamentally shifted the trajectory of language model research b…
The paper introduces instruction tuning, demonstrating that fine-tuning large language models on a diverse mixture of tasks described via natural language instructions dramatically enhances their zero-shot generalization to unseen tasks. This work fundamentally shifted the paradi…
The paper introduces a self-bootstrapping framework where a pretrained language model generates, filters, and refines its own instruction-tuning data, effectively bypassing the need for large-scale human annotation. This approach fundamentally shifted the paradigm for aligning op…
The paper introduces a self-supervised paradigm where language models autonomously learn to invoke external APIs by evaluating whether tool outputs improve next-token prediction, requiring only minimal demonstrations. This elegantly bridges the gap between parametric knowledge an…
The paper introduces self-consistency decoding, an inference-time strategy that samples multiple chain-of-thought reasoning paths and selects the most frequent answer to substantially improve LLM performance on complex reasoning tasks. This work stands out for its elegant simplic…
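The decoding rule itself is just a majority vote over the final answers extracted from independently sampled reasoning chains. A minimal sketch (function names mine; the answer extractor is application-specific):

```python
from collections import Counter

def self_consistent_answer(sampled_chains, extract_answer):
    """Marginalize over reasoning paths: parse each sampled chain's final
    answer and return the most frequent one."""
    answers = [extract_answer(chain) for chain in sampled_chains]
    return Counter(answers).most_common(1)[0][0]
```

Usage: sample, say, 20 chain-of-thought completions at nonzero temperature, then vote; a single wrong reasoning path no longer determines the output.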
This work establishes that scaling dense Transformers to 540 billion parameters, facilitated by the Pathways distributed training infrastructure, unlocks substantial few-shot learning capabilities, emergent multi-step reasoning, and robust multilingual and code generation perform…
The paper demonstrates that pure reinforcement learning, without supervised fine-tuning on human reasoning traces, can reliably elicit advanced self-reflective reasoning capabilities in LLMs and effectively distill them into smaller architectures. This work marks a pivotal shift …
XLNet introduces permutation language modeling with two-stream self-attention to unify autoregressive and autoencoding objectives, enabling bidirectional context capture without the independence assumption inherent in masked language models. The work delivers a rigorous theoretic…
Introduces a large-scale, community-driven benchmark to systematically evaluate and extrapolate language model capabilities beyond imitation, revealing critical insights into scaling laws, emergent abilities, and calibration. The work establishes a rigorous evaluation framework t…
This work demonstrates that carefully curated public datasets combined with compute-optimal training can yield foundation models competitive with proprietary counterparts, while openly releasing the weights to catalyze community research. The paper’s primary contribution lies in …
Introduces a highly efficient 671B-parameter Mixture-of-Experts architecture with novel attention and prediction mechanisms that achieve frontier performance at a fraction of traditional training costs. The report details architectural and systems-level innovations—most notably M…
This work introduces a structured inference framework that transforms autoregressive language model generation into a deliberate search process over intermediate reasoning steps. By treating coherent text units as nodes in a tree and integrating self-evaluation, lookahead, and ba…
Llama 2 establishes a comprehensive, transparent methodology for training and aligning large-scale open-weight language models, demonstrating that carefully curated data, iterative supervised fine-tuning, and reinforcement learning from human feedback can yield chat models compet…
The paper demonstrates that strategic architectural optimizations, specifically grouped-query and sliding window attention, enable a 7B-parameter model to surpass significantly larger predecessors while maintaining high inference efficiency and open accessibility. While the indiv…
The paper demonstrates that purported emergent abilities in scaled models are artifacts of discontinuous evaluation metrics rather than intrinsic phase transitions in model capabilities. This work provides a crucial corrective to the prevailing narrative around scaling laws and e…
Demonstrates that a carefully optimized sparse mixture-of-experts architecture can match or exceed 70B dense models while activating only ~13B parameters per token, establishing a new efficiency-performance frontier for open-weight LLMs. While the MoE paradigm predates this work,…
The paper introduces Multi-Head Latent Attention and an optimized Mixture-of-Experts routing strategy that drastically reduce KV cache memory and computational overhead while maintaining competitive language modeling performance. By compressing attention states into low-dimension…
A comprehensive family of open-weight code generation models derived from Llama 2, featuring specialized training for Python, infilling, and extended context windows that establish a strong open baseline for code synthesis. The work represents a rigorous scaling and specializatio…
This survey synthesizes and categorizes the rapidly evolving landscape of techniques for extending context windows in large language models. While it does not propose a new architecture or training paradigm, its systematic taxonomy of positional encoding adaptations, attention op…
WaveNet introduces a fully convolutional, autoregressive architecture with dilated causal convolutions to directly model raw audio waveforms at scale. This work fundamentally shifted audio generation away from handcrafted acoustic features and external vocoders toward end-to-end …
AudioLM introduces a hierarchical tokenization framework that unifies semantic structure and acoustic fidelity for autoregressive audio generation. By combining discretized activations from a masked audio model with neural codec codes, the work elegantly resolves the longstanding…
Demonstrates that scaling a standard sequence-to-sequence architecture on hundreds of thousands of hours of weakly supervised, web-scraped multilingual audio yields robust, zero-shot speech recognition that rivals human performance. This work fundamentally shifted the audio resea…
Hippo introduces a stage-tree execution model that deduplicates shared hyper-parameter prefixes across HPO trials to reduce redundant GPU computation. While the work addresses a practical bottleneck in large-scale model tuning, the core concept of checkpoint sharing and DAG-based…
Introduces an E(3)-equivariant diffusion framework that jointly generates 3D atomic coordinates and discrete atom types, establishing a new architectural standard for geometric molecular generation. The work meaningfully bridges continuous diffusion processes with geometric deep …
Introduces the TODAY benchmark and a joint learning framework that leverages differential analysis and explanation supervision to improve temporal reasoning robustness and generalization. The work addresses a critical evaluation gap by shifting from static prediction to counterfa…
The paper introduces cluster-level regularization and variance-constrained routing to mitigate overfitting and expert collapse in sparse Mixture-of-Experts models under data-limited regimes. While the work addresses a well-documented bottleneck in MoE architectures—namely, the de…
This work establishes theoretical stability bounds for low-pass graph filters under large-scale topological perturbations, demonstrating that robustness depends on community structure preservation rather than merely the count of edge rewires. While the paper provides a rigorous m…
RT-2 introduces a unified vision-language-action architecture that tokenizes robotic control signals alongside natural language, enabling direct co-fine-tuning of internet-scale vision-language models for physical robot control. The work elegantly bridges the gap between large-sc…
RT-1 demonstrates that transformer-based architectures, when trained on massive, diverse real-world robotic datasets, exhibit strong scaling properties and zero-shot generalization across hundreds of manipulation tasks. The work successfully bridges the scaling paradigm from visi…
Voyager introduces a framework for open-ended embodied agents that leverages LLMs to autonomously generate curricula, store executable code as a persistent skill library, and iteratively refine behaviors through environmental feedback. The work represents a meaningful architectur…
ZeRO introduces a systematic memory partitioning strategy that eliminates optimizer, gradient, and parameter redundancies across data-parallel workers, enabling efficient training of models with hundreds of billions of parameters without complex model parallelism. By decoupling m…
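The savings can be seen with rough byte accounting for mixed-precision Adam: about 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance), each partitioned at successive ZeRO stages. A back-of-envelope sketch (function name mine; constants are the commonly cited approximations, not exact for every setup):

```python
def zero_memory_per_gpu(n_params, n_gpus, stage):
    """Approximate per-GPU model-state bytes for mixed-precision Adam.
    Baseline: 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 states) bytes/param;
    each ZeRO stage shards one more component across the data-parallel group."""
    params, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:
        opt /= n_gpus     # stage 1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus   # stage 2: also partition gradients
    if stage >= 3:
        params /= n_gpus  # stage 3: also partition parameters
    return n_params * (params + grads + opt)
```

At stage 3 the per-GPU footprint shrinks roughly linearly with the number of workers, which is what lets plain data parallelism reach hundred-billion-parameter models.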
This work introduces a practical intra-layer model parallelism strategy that splits transformer matrix operations across GPUs with minimal communication overhead, enabling efficient training of multi-billion parameter language models. The approach stands out for its elegant simpl…
Introduces an IO-aware tiling algorithm that minimizes high-bandwidth memory traffic during self-attention, enabling exact computation with significantly reduced memory footprint and faster wall-clock speeds. The work fundamentally shifts the optimization paradigm for neural netw…
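The enabling trick is the online (streaming) softmax: processing keys blockwise while rescaling a running max, normalizer, and accumulator, so the full score vector is never materialized. A single-query toy sketch of that recurrence (function name mine; the real kernel operates on GPU tiles, not Python lists):

```python
import math

def online_softmax_attention(q, keys, values, block=2):
    """Streaming attention for one query: per key block, update running
    max m, normalizer l, and output accumulator, rescaling old state by
    exp(m_old - m_new). Exact result, O(block) working memory."""
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        for k, v in zip(keys[start:start + block], values[start:start + block]):
            s = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
            m_new = max(m, s)
            scale = math.exp(m - m_new) if m != float("-inf") else 0.0
            w = math.exp(s - m_new)
            l = l * scale + w
            acc = [a * scale + w * vi for a, vi in zip(acc, v)]
            m = m_new
    return [a / l for a in acc]
```

The output matches the naive softmax-weighted sum exactly; the IO savings come from never writing the n×n score matrix to high-bandwidth memory.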
AWQ introduces an activation-aware weight quantization strategy that identifies and preserves a small subset of salient weights at higher precision, enabling highly accurate post-training quantization for large language models. The work addresses a critical bottleneck in LLM depl…
The paper introduces PagedAttention, a memory management algorithm that adapts operating system virtual memory paging to dynamically allocate and share transformer key-value caches, enabling the vLLM serving system to drastically reduce memory fragmentation and boost inference th…
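The core data structure is a block table: each sequence's KV cache grows in fixed-size pages allocated on demand, instead of one contiguous, worst-case-sized slab. A toy sketch of that allocation pattern (class name mine; the real system maps logical blocks to shared physical GPU blocks):

```python
class PagedKVCache:
    """Toy block table: the KV cache grows page by page, so memory is
    allocated on demand rather than reserved for the maximum length."""
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = []  # each block holds up to block_size (key, value) slots

    def append(self, kv):
        if not self.blocks or len(self.blocks[-1]) == self.block_size:
            self.blocks.append([])  # allocate a new page only when needed
        self.blocks[-1].append(kv)

    def __len__(self):
        return sum(len(b) for b in self.blocks)
```

Because pages are uniform and indirectly addressed, internal fragmentation is bounded by one block per sequence, and identical prefixes can share physical pages across requests.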
GPTQ introduces a layer-wise, Hessian-based post-training quantization algorithm that sequentially corrects quantization errors, enabling 3–4 bit compression of large language models with negligible accuracy degradation. The method elegantly bridges numerical optimization and pra…
The paper introduces a ring-based communication protocol that seamlessly overlaps key-value block transfers with blockwise attention computation, enabling linear scaling of context length with device count. This work addresses a critical systems bottleneck in modern large languag…
FlashAttention-2 introduces refined GPU work partitioning and thread-level parallelism strategies that double the throughput of its predecessor, pushing attention computation closer to hardware limits and drastically reducing training costs for long-context models. The paper addr…