The most influential machine learning papers — curated for impact, novelty, and field-defining significance. Organized by publication year.
98 landmark papers · Updated April 2026
DeepSeek; o1-level reasoning via RL; open weights; major milestone
DeepSeek; MLA attention; efficient MoE; competitive open weights
DeepSeek; 671B MoE; $6M training cost; matched proprietary frontier
Zhao et al.; survey of methods for extending the LLM context window
Podell et al.; improved Stable Diffusion
Rafailov et al.; simpler RLHF alternative; widely adopted
Dettmers et al.; 4-bit quantized LoRA; democratized LLM fine-tuning
Touvron et al.; Meta; open-weights foundation; sparked open-source LLM movement
Touvron et al.; Meta; commercial open-weights with RLHF
Yao et al.; systematic search over reasoning chains
Dao; further 2x improvement over FlashAttention
Kwon et al.; PagedAttention; near-zero KV cache waste; production LLM serving
Gu & Dao; SSM alternative to Transformer; linear scaling in sequence length
Liu et al.; distributed ring attention; million-token context
Zheng et al.; LMSYS; Elo-based human preference leaderboard
Schick et al.; Meta; self-supervised tool-use learning
Wang et al.; Minecraft agent; LLM as controller with skill library
Brohan et al.; Google; VLM directly outputs robot actions
OpenAI; multimodal GPT-4; frontier model; bar-setting benchmark results
Google DeepMind; multimodal Gemini; matched GPT-4 on many benchmarks
Rozière et al.; Meta; open-weights code LLM; extends Llama 2 for code
Kirillov et al.; Meta; promptable segmentation; billion-mask dataset
Liu et al.; showed LLMs ignore middle of context; important limitation study
Lin et al.; better quantization by protecting salient weights
Liu et al.; CLIP + LLM with simple MLP projection; strong VQA baseline
OpenAI; landmark text-to-image system
Alayrac et al.; DeepMind; few-shot VLM from frozen LLM
Ouyang et al.; RLHF for LLMs; precursor to ChatGPT
Bai et al.; Anthropic; RLAIF; scalable safety
Hoffmann et al.; revised scaling laws; data matters as much as params
Chowdhery et al.; Google; 540B params; chain-of-thought abilities
Wei et al.; showed reasoning emerges with step-by-step prompting
Wang et al.; majority-vote sampling over CoT paths
Yao et al.; interleaved reasoning and tool use; foundation of agents
Dao et al.; 2-4x speedup; enabled longer contexts; universally adopted
Srivastava et al.; Google; 204-task collaborative LLM benchmark
Watson et al.; David Baker lab; generative protein design
Radford et al.; OpenAI; standard ASR; 680k hours weak supervision
Borsos et al.; Google; language model for audio tokens
Brohan et al.; Google; large-scale robot transformer; real manipulation
Hoogeboom et al.; 3D molecular generation with equivariant diffusion
Frantar et al.; 3/4-bit quantization with minimal quality loss; widely used
Wang et al.; bootstrapped instruction data; enabled Alpaca
Rombach et al.; enabled open-source text-to-image at scale
Radford et al.; zero-shot transfer; most influential vision-language model
Hu et al.; standard PEFT method; enables consumer fine-tuning
Wei et al.; instruction tuning; zero-shot generalization
Chen et al.; OpenAI; code generation benchmark
Jumper et al.; DeepMind; Nature 2021; solved protein structure prediction
Caron et al.; Meta; self-distillation; strong visual features without labels
He et al.; Meta; high masking ratio MAE; efficient ViT pretraining
Tolstikhin et al.; all-MLP architecture; showed attention not strictly necessary for vision
Dosovitskiy et al.; Transformer for vision; displaced CNN backbones
Kaplan et al.; power-law compute/data/parameter tradeoffs
Brown et al.; 175B params; in-context learning; paradigm shift
Clark et al.; compute-efficient pretraining
Ho et al.; launched the diffusion model era
Song et al.; unified view of score-matching & diffusion
Hendrycks et al.; 57-domain knowledge benchmark; standard LLM eval
Baevski et al.; Meta; self-supervised speech; standard baseline
Lewis et al.; Meta; grounded generation; production standard
Kitaev et al.; LSH attention; reduced quadratic complexity
Carion et al.; detection as set prediction; replaced anchors
Stiennon et al.; OpenAI; early RLHF demonstration on summarization
Liu et al.; showed BERT was undertrained
Raffel et al.; text-to-text framing for NLP
Yang et al.; autoregressive BERT alternative
Rajbhandari et al.; Microsoft; partitioned optimizer state / gradients / params
Shoeybi et al.; NVIDIA; tensor parallelism; standard multi-GPU training
Devlin et al.; transformed NLP; bidirectional language models
Frankle & Carbin; sparse subnetworks; influential pruning theory
Wang et al.; standard NLP benchmark suite
Haarnoja et al.; state-of-the-art continuous control
Vaswani et al.; most cited ML paper ever; foundation of modern AI
Schulman et al.; OpenAI; default RL algorithm for LLM alignment
Finn et al.; gradient-based meta-learning; few-shot adaptation
Kipf & Welling; standard graph neural network baseline
Oord et al.; DeepMind; autoregressive raw waveform; landmark TTS
Ioffe & Szegedy; made very deep networks trainable
Ronneberger et al.; standard architecture for image segmentation; 70k+ citations
Lillicrap et al.; actor-critic for continuous action spaces
Ren et al.; end-to-end detector; standard baseline for years
Kingma & Ba; default optimizer for most modern ML
Simonyan & Zisserman; established depth as a key factor in CNN performance
Sutskever et al.; foundation of seq2seq NMT
Bahdanau et al.; attention for neural machine translation; precursor to the Transformer
Zeiler & Fergus; visualized what CNNs learn; led to improvements over AlexNet
Mnih et al.; DeepMind; launched modern deep RL
Mikolov et al.; standard word embeddings for years
Krizhevsky et al.; ImageNet competition winner; kicked off the deep learning era
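The list above is a catalog rather than a tutorial, but its most-cited entry (Vaswani et al., "Attention Is All You Need") rests on one small computation. As a rough illustrative sketch in plain NumPy — not the paper's implementation, and omitting multi-head projections and masking — scaled dot-product attention looks like:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core operation of the Transformer.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key similarity, scaled
    # Row-wise softmax (shifted for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a convex combination of V's rows

# Tiny self-attention example: 3 tokens, 4-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

Everything else in the Transformer entry — and much of the 2020s portion of this list — is layered on repetitions of this operation.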