
2025

nlp 78

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek; o1-level reasoning via RL; open weights; major milestone

💬 Reddit · HN

2024

nlp 69

Mixtral of Experts

Jiang et al.; Mistral; sparse MoE; outperforms Llama 2 70B at a fraction of the cost

nlp 68

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek; MLA attention; efficient MoE; competitive open weights

nlp 76

DeepSeek-V3 Technical Report

DeepSeek; 671B MoE; ~$6M training cost; matched proprietary frontier models

nlp 40

A Survey of Long-Context Large Language Models

Zhao et al.; survey of methods for extending context window

2023

vision 67

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell et al.; Stability AI; larger backbone plus refiner model; improved Stable Diffusion

vision 71

Visual Instruction Tuning

Liu et al.; open-source multimodal instruction-following

general ml 83

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov et al.; simpler RLHF alternative; widely adopted

💬 Reddit
general ml 82

QLoRA: Efficient Finetuning of Quantized LLMs

Dettmers et al.; 4-bit quantized LoRA; democratized LLM fine-tuning

nlp 76

LLaMA: Open and Efficient Foundation Language Models

Touvron et al.; Meta; open-weights foundation; sparked open-source LLM movement

nlp 72

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron et al.; Meta; commercial open-weights with RLHF

nlp 71

Mistral 7B

Jiang et al.; efficient 7B; sliding window attention; widely deployed

nlp 74

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao et al.; systematic search over reasoning chains

systems 72

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao; further 2x improvement over FlashAttention

systems 78

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon et al.; PagedAttention; near-zero KV cache waste; production LLM serving

💬 HN
general ml 80

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu & Dao; SSM alternative to Transformer; linear scaling in sequence length

systems 73

Ring Attention with Blockwise Transformers for Near-Infinite Context

Liu et al.; distributed ring attention; million-token context

nlp 69

Are Emergent Abilities of Large Language Models a Mirage?

Schaeffer et al.; Stanford; argues emergent abilities are artifacts of the chosen evaluation metric

nlp 79

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick et al.; Meta; self-supervised tool-use learning

robotics 72

Voyager: An Open-Ended Embodied Agent with Large Language Models

Wang et al.; Minecraft agent; LLM as controller with skill library

robotics 79

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Brohan et al.; Google; VLM directly outputs robot actions

general ml 85

GPT-4 Technical Report

OpenAI; multimodal GPT-4; frontier model; bar-setting benchmark results

general ml 69

Gemini: A Family of Highly Capable Multimodal Models

Google DeepMind; multimodal Gemini; matched GPT-4 on many benchmarks

nlp 61

Code Llama: Open Foundation Models for Code

Rozière et al.; Meta; open-weights code LLM; extends Llama 2 for code

vision 79

Segment Anything (SAM)

Kirillov et al.; Meta; promptable segmentation; billion-mask dataset

💬 Reddit · HN
nlp 80

Lost in the Middle: How Language Models Use Long Contexts

Liu et al.; showed LLMs ignore middle of context; important limitation study

systems 78

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Lin et al.; better quantization by protecting salient weights

vision 62

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

Liu et al.; CLIP + LLM with simple MLP projection; strong VQA baseline

2022

vision 85

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh et al.; OpenAI; DALL·E 2; landmark text-to-image system

vision 83

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac et al.; DeepMind; few-shot VLM from frozen LLM

nlp 91

Training language models to follow instructions with human feedback

Ouyang et al.; RLHF for LLMs; precursor to ChatGPT

💬 HN
general ml 79

Constitutional AI: Harmlessness from AI Feedback

Bai et al.; Anthropic; RLAIF; scalable safety

general ml 84

Training Compute-Optimal Large Language Models

Hoffmann et al.; revised scaling laws; data matters as much as params

nlp 78

PaLM: Scaling Language Modeling with Pathways

Chowdhery et al.; Google; 540B params; chain-of-thought abilities

nlp 85

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei et al.; showed reasoning emerges with step-by-step prompting

nlp 78

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang et al.; majority-vote sampling over CoT paths

nlp 82

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al.; interleaved reasoning and tool use; foundation of agents

systems 81

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Dao et al.; 2-4x speedup; enabled longer contexts; universally adopted

💬 Reddit
nlp 76

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Srivastava et al.; Google; 204-task collaborative LLM benchmark

bio 39

Language models of protein sequences at the scale of evolution enable accurate structure prediction (ESMFold)

Lin et al.; Meta; protein LLM; fast structure prediction

bio 41

De novo design of protein structure and function with RFdiffusion

Watson et al.; David Baker lab; generative protein design

audio 78

Robust Speech Recognition via Large-Scale Weak Supervision

Radford et al.; OpenAI; standard ASR; 680k hours weak supervision

💬 Reddit
audio 78

AudioLM: a Language Modeling Approach to Audio Generation

Borsos et al.; Google; language model for audio tokens

robotics 78

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan et al.; Google; large-scale robot transformer; real manipulation

bio 70

Equivariant Diffusion for Molecule Generation in 3D (EDM)

Hoogeboom et al.; 3D molecular generation with equivariant diffusion

systems 77

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar et al.; 3/4-bit quantization with minimal quality loss; widely used

nlp 79

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Wang et al.; bootstrapped instruction data; enabled Alpaca

2021

vision 82

High-Resolution Image Synthesis with Latent Diffusion Models

Rombach et al.; enabled open-source text-to-image at scale

💬 Reddit · HN
vision 84

Zero-Shot Text-to-Image Generation

Ramesh et al.; OpenAI; DALL·E; first large-scale text-to-image model

vision 90

Learning Transferable Visual Models From Natural Language Supervision

Radford et al.; zero-shot transfer; most influential vision-language model

💬 Reddit
general ml 95

LoRA: Low-Rank Adaptation of Large Language Models

Hu et al.; standard PEFT method; enables consumer fine-tuning

💬 Reddit
nlp 79

Finetuned Language Models Are Zero-Shot Learners

Wei et al.; instruction tuning; zero-shot generalization

nlp 79

Evaluating Large Language Models Trained on Code

Chen et al.; OpenAI; code generation benchmark

bio 34

Highly accurate protein structure prediction with AlphaFold

Jumper et al.; DeepMind; Nature 2021; solved protein structure prediction

💬 Reddit · HN
vision 83

Emerging Properties in Self-Supervised Vision Transformers (DINO)

Caron et al.; Meta; self-distillation; strong visual features without labels

vision 83

Masked Autoencoders Are Scalable Vision Learners

He et al.; Meta; high masking ratio MAE; efficient ViT pretraining

vision 81

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Liu et al.; Microsoft; shifted-window attention; hierarchical backbone for dense prediction

2020

vision 92

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy et al.; Transformer for vision; displaced CNN backbones

general ml 88

Scaling Laws for Neural Language Models

Kaplan et al.; power-law compute/data/parameter tradeoffs

nlp 96

Language Models are Few-Shot Learners (GPT-3)

Brown et al.; 175B params; in-context learning; paradigm shift

💬 Reddit · HN
nlp 80

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Clark et al.; compute-efficient pretraining

vision 91

Denoising Diffusion Probabilistic Models

Ho et al.; launched the diffusion model era

💬 Reddit
general ml 94

Score-Based Generative Modeling through Stochastic Differential Equations

Song et al.; unified view of score-matching & diffusion

nlp 81

Measuring Massive Multitask Language Understanding

Hendrycks et al.; 57-domain knowledge benchmark; standard LLM eval

audio 40

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Baevski et al.; Meta; self-supervised speech; standard baseline

nlp 82

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis et al.; Meta; grounded generation; production standard

general ml 72

Reformer: The Efficient Transformer

Kitaev et al.; LSH attention; reduced quadratic complexity

vision 85

End-to-End Object Detection with Transformers (DETR)

Carion et al.; detection as set prediction; replaced anchors

nlp 83

Learning to summarize with human feedback

Stiennon et al.; OpenAI; early RLHF demonstration on summarization

2019

nlp 79

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Liu et al.; showed BERT was undertrained

nlp 76

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Yang et al.; autoregressive BERT alternative

systems 85

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Rajbhandari et al.; Microsoft; partitioned optimizer state / gradients / params

systems 84

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi et al.; NVIDIA; tensor parallelism; standard multi-GPU training

2018

nlp 92

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin et al.; transformed NLP; bidirectional language models

💬 Reddit
general ml 85

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Frankle & Carbin; sparse subnetworks; influential pruning theory

general ml 79

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Haarnoja et al.; state-of-the-art continuous control

vision 67

YOLOv3: An Incremental Improvement

Redmon & Farhadi; real-time detection; widely deployed

2017

general ml 100

Attention Is All You Need

Vaswani et al.; introduced the Transformer; foundation of modern AI

💬 Reddit · HN
general ml 83

Proximal Policy Optimization Algorithms

Schulman et al.; OpenAI; default RL algorithm for LLM alignment

general ml 84

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Finn et al.; gradient-based meta-learning; few-shot adaptation

general ml 78

Graph Attention Networks

Veličković et al.; attention on graphs; widely cited

2016

general ml 80

Semi-Supervised Classification with Graph Convolutional Networks

Kipf & Welling; standard graph neural network baseline

audio 94

WaveNet: A Generative Model for Raw Audio

van den Oord et al.; DeepMind; autoregressive raw waveform; landmark TTS

2015

general ml

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Ioffe & Szegedy; made very deep networks trainable

vision 86

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger et al.; standard architecture for image segmentation; 70k+ citations

general ml 79

Continuous control with deep reinforcement learning

Lillicrap et al.; actor-critic for continuous action spaces

vision 88

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Ren et al.; end-to-end detector; standard baseline for years

2014

vision

Generative Adversarial Networks

Goodfellow et al.; introduced adversarial training

general ml

Adam: A Method for Stochastic Optimization

Kingma & Ba; default optimizer for most modern ML

vision 80

Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG)

Simonyan & Zisserman; established depth as a key factor in CNNs

nlp 90

Sequence to Sequence Learning with Neural Networks

Sutskever et al.; foundation of seq2seq NMT

nlp 95

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau et al.; introduced attention in NMT; precursor to the Transformer

2013

vision

Visualizing and Understanding Convolutional Networks

Zeiler & Fergus; visualized what CNNs learn; informed improvements over AlexNet

general ml 90

Playing Atari with Deep Reinforcement Learning (DQN)

Mnih et al.; DeepMind; launched modern deep RL

nlp 86

Distributed Representations of Words and Phrases and their Compositionality (Word2Vec)

Mikolov et al.; standard word embeddings for years

2012

vision

ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)

Krizhevsky et al.; won ImageNet 2012; kicked off the deep learning era