Papershelf

Research papers that shaped distributed systems and machine learning over the last decade. A reading list I return to — systems that survived contact with production and ideas that became the foundations everything else is built on.

Distributed Systems & Databases

Spanner: Google's Globally-Distributed Database

2012

Corbett et al. — Google Research

The first system to distribute data at global scale while supporting externally-consistent distributed transactions. Spanner uses TrueTime—GPS and atomic clocks—to assign commit timestamps that reflect real-world causality. It redefined what production-grade globally-consistent databases could look like and influenced every geo-distributed database built since.
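A minimal sketch of the commit-wait idea behind TrueTime, assuming a simulated clock rather than GPS and atomic hardware. The names `SimulatedTrueTime`, `TTInterval`, and `commit` are hypothetical and chosen for illustration; only the rule itself (pick a timestamp at the upper bound of the uncertainty interval, then wait until the lower bound has passed it) follows the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical stand-in for TrueTime: a manual clock with a fixed error bound."""

    def __init__(self, epsilon: float) -> None:
        self.epsilon = epsilon
        self._true_time = 0.0

    def advance(self, dt: float) -> None:
        self._true_time += dt

    def now(self) -> TTInterval:
        # The true time is guaranteed to lie somewhere inside this interval.
        return TTInterval(self._true_time - self.epsilon, self._true_time + self.epsilon)


def commit(tt: SimulatedTrueTime, tick: float = 1.0) -> Tuple[float, float]:
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest
    waited = 0.0
    while tt.now().earliest <= commit_ts:  # commit wait
        tt.advance(tick)  # simulated time passing; a real system would sleep
        waited += tick
    return commit_ts, waited
```

The wait is at least twice the clock uncertainty, which is why keeping that uncertainty small matters so much for write latency.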

TAO: Facebook's Distributed Data Store for the Social Graph

2013

Bronson et al. — Facebook

Facebook's custom graph store, serving a billion reads and millions of writes per second with eventual consistency. TAO trades strict consistency for availability, using a tiered cache architecture optimized for the specific access patterns of social graph traversals. A masterclass in pragmatic distributed systems design: building for the actual workload, not the theoretical ideal.

In Search of an Understandable Consensus Algorithm (Raft)

2014

Ongaro & Ousterhout — Stanford University

Introduced Raft as a more understandable alternative to Paxos for achieving consensus in distributed systems. Decomposes the problem into leader election, log replication, and safety, each of which can be reasoned about independently. Now the backbone of etcd, CockroachDB, TiKV, and many other systems built in the last decade that need fault-tolerant coordination.
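To make the leader-election piece concrete, here is a rough sketch of the vote-granting rule and the majority check, assuming an in-memory cluster with no networking, timeouts, or persistence. `Node` and `run_election` are illustrative names, not taken from any Raft implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, term: int, candidate_id: int,
                            last_log_index: int, last_log_term: int) -> bool:
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            self.current_term = term  # newer term: forget any earlier vote
            self.voted_for = None
        # Candidate's log must be at least as up to date: compare the last
        # term first, then the last index.
        log_ok = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if self.voted_for in (None, candidate_id) and log_ok:
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: Node, peers: List[Node]) -> bool:
    """Return True if the candidate collects votes from a strict majority."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # the candidate votes for itself
    for peer in peers:
        if peer.handle_request_vote(candidate.current_term, candidate.node_id,
                                    candidate.last_log_index, candidate.last_log_term):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2
```

The log comparison is the part that ties election to safety: a node whose log is behind cannot win, so a new leader always holds every committed entry.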

The Dataflow Model

2015

Akidau et al. — Google

A unified model for batch and streaming data processing that answers the "what, where, when, and how" of data computation. Built on the experience of Google's internal MillWheel and FlumeJava systems, it gave engineers a principled framework for reasoning about event time versus processing time. The foundation for Apache Beam and a model that helped end the "Lambda architecture" era.
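A small sketch of the event-time versus processing-time distinction, assuming fixed (tumbling) windows and integer timestamps. It covers only the "what" (a sum) and the "where" (windows in event time); triggers, watermarks, and accumulation modes are left out. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sum_by_event_time_window(events: Iterable[Tuple[int, int]],
                             window_size: int) -> Dict[int, int]:
    """Sum values per fixed window, keyed by window start.

    `events` are (event_time, value) pairs listed in arrival order, which
    may differ from event-time order.
    """
    totals: Dict[int, int] = defaultdict(int)
    for event_time, value in events:
        # Assign by when the event happened, not by when it arrived.
        window_start = (event_time // window_size) * window_size
        totals[window_start] += value
    return dict(totals)
```

With arrivals `[(1, 10), (12, 5), (3, 7), (9, 1), (11, 2)]` and a window size of 10, the late events at times 3 and 9 still land in the first window, giving `{0: 18, 10: 7}`.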

The Snowflake Elastic Data Warehouse

2016

Dageville et al. — Snowflake Computing

Designed a cloud-native data warehouse that completely decouples storage from compute, enabling each to scale independently. The virtual warehouse model allows multiple concurrent workloads to share storage without I/O contention. Pioneered the multi-cluster shared data architecture that is now the standard across the cloud data warehouse industry.

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

2017

Verbitski et al. — Amazon Web Services

Reimagined MySQL for cloud environments by replacing the storage layer with a distributed, fault-tolerant log-based system. The core insight: only redo log records need to flow to storage, eliminating the write amplification of traditional databases and reducing network I/O by 7.7x compared to MySQL on EC2. A blueprint for cloud-native database design.

Zanzibar: Google's Consistent, Global Authorization System

2019

Pang et al. — Google

The authorization system enforcing permissions across hundreds of Google services, including Drive, YouTube, Maps, and more. Stores trillions of ACLs and serves millions of authorization checks per second against consistent snapshots, while keeping 95th-percentile latency under 10 ms. The relationship-based access control (ReBAC) model it popularized spawned an entire category of open-source projects including OpenFGA and SpiceDB.
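A toy sketch of a relationship-based check, assuming permissions are stored as (object, relation, subject) tuples where the subject is either a user or another object's relation (a userset). The identifiers are made up, and the namespace configuration, userset rewrite rules, and snapshot consistency that a real system needs are omitted.

```python
from typing import Optional, Set, Tuple, Union

Userset = Tuple[str, str]                 # (object, relation)
Subject = Union[str, Userset]             # a user id or a userset
RelationTuple = Tuple[str, str, Subject]  # (object, relation, subject)


def check(tuples: Set[RelationTuple], obj: str, relation: str, user: str,
          _seen: Optional[Set[Userset]] = None) -> bool:
    """Return True if `user` has `relation` on `obj`, following usersets."""
    seen = _seen if _seen is not None else set()
    if (obj, relation) in seen:  # guard against cyclic group membership
        return False
    seen.add((obj, relation))
    for t_obj, t_rel, subject in tuples:
        if (t_obj, t_rel) != (obj, relation):
            continue
        if subject == user:
            return True  # direct relationship
        if isinstance(subject, tuple) and check(tuples, subject[0], subject[1], user, seen):
            return True  # relationship inherited through a userset
    return False
```

For example, if `doc:readme` grants `viewer` to the userset `("group:eng", "member")` and `user:alice` is a `member` of `group:eng`, then `check(..., "doc:readme", "viewer", "user:alice")` is true without any tuple naming Alice on the document.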

CockroachDB: The Resilient Geo-Distributed SQL Database

2020

Taft et al. — Cockroach Labs

A distributed SQL database that brings Spanner-style, globally consistent transactions to open source. Built on RocksDB and Raft for the storage layer, with a SQL engine layered on top that speaks the PostgreSQL wire protocol. Demonstrated that serializable isolation at geo-distributed scale was practical outside of hyperscalers, a significant proof of concept for the industry.

FoundationDB: A Distributed Unbundled Transactional Key-Value Store

2021

Zhou et al. — Apple / FoundationDB

An ACID key-value store that deliberately unbundles transaction processing from storage, enabling independent scaling of each. The simulation-based testing framework—capable of reproducing and replaying any sequence of events including disk failures and network partitions—has become legendary in the distributed systems community as a gold standard for correctness testing.

Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

2022

Elhemali et al. — Amazon Web Services

A retrospective on 10+ years of operating DynamoDB at scale, revealing architectural evolutions not documented elsewhere. Details the shift to per-request cost-based admission control, the introduction of global tables, and the hard-won lessons around providing predictable single-digit millisecond latency at any scale. Essential reading for anyone building or operating a database service.

Machine Learning & AI

Efficient Estimation of Word Representations in Vector Space (Word2Vec)

2013

Mikolov et al. — Google

Introduced two architectures—CBOW and Skip-gram—for learning high-quality word vector representations from large text corpora with surprising efficiency. The resulting vectors capture semantic and syntactic relationships, famously demonstrating that "king − man + woman ≈ queen". Laid the groundwork for all subsequent representation learning research in NLP.
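The analogy arithmetic can be sketched with a handful of hand-made vectors. These are 2-D toy values picked so the example works (one axis loosely "royalty", the other "gender"), not embeddings learned by CBOW or Skip-gram, and `analogy` is an illustrative helper.

```python
import math
from typing import Dict, List

Vector = List[float]

# Hand-made toy vectors, NOT learned embeddings.
TOY_VECTORS: Dict[str, Vector] = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.9, 0.9],
    "apple": [-1.0, 0.1],
}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a: str, b: str, c: str, vectors: Dict[str, Vector]) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```

Excluding the three query words from the candidates matters in practice too: without that step the nearest neighbor of the result is often one of the inputs.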

Generative Adversarial Nets

2014

Goodfellow et al. — Université de Montréal

Proposed training two networks simultaneously, a generator creating synthetic data and a discriminator distinguishing real from fake, via a minimax adversarial game. The framework proved remarkably fertile, leading to photorealistic image synthesis and data augmentation techniques and helping set off the modern generative AI wave. One of the most-cited ML papers of all time.
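The minimax game has a simple value function: the discriminator tries to maximize the average of log D(x) on real samples plus log(1 − D(G(z))) on generated ones, and the generator tries to minimize it. A rough sketch that evaluates this quantity from lists of discriminator outputs (no networks or training loop; `gan_value` is a hypothetical helper):

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Empirical value V(D, G) from discriminator outputs in (0, 1).

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

When the generator matches the data distribution, the best the discriminator can do is output 0.5 everywhere, and the value settles at −log 4; a discriminator that separates the two sets well pushes the value back up toward 0.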

Deep Residual Learning for Image Recognition (ResNet)

2015

He et al. — Microsoft Research

Introduced residual connections (skip connections) that allow gradients to flow directly through networks, enabling training of 100+ layer deep networks without degradation. Won ILSVRC 2015 with a top-5 error rate of 3.57%, below the roughly 5% human error rate commonly cited for the benchmark. The residual connection became one of the most universally adopted building blocks in deep learning.
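The core of a residual block fits in one line: the block outputs its input plus a learned correction. A bare-bones sketch on plain lists, where `f` stands in for the block's convolutional layers (an assumption made to keep the example free of any framework):

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Compute y = x + f(x); `f` must return a vector of the same length."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]
```

If `f` outputs zeros, the block is exactly the identity, so stacking many such blocks cannot make the input harder to pass through. That is the intuition for why very deep residual networks avoid the degradation seen in plain stacks.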

Attention Is All You Need

2017

Vaswani et al. — Google Brain

Proposed the Transformer architecture, replacing recurrent and convolutional layers entirely with multi-head self-attention. Enabled massively parallel training and dramatically better handling of long-range dependencies compared to RNNs. Arguably the single most impactful ML paper of the past decade—it underpins GPT, BERT, and every modern large language model.
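A minimal sketch of the scaled dot-product attention at the heart of the architecture, softmax(QKᵀ / √d_k)·V, written on nested lists for readability. It shows a single head with no masking and no learned projections, so it is a fragment of multi-head self-attention rather than the whole layer.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Each query row attends over all key rows and mixes the value rows."""
    d_k = len(k[0])
    out: Matrix = []
    for q_row in q:
        scores = [
            sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out
```

Every query row is computed independently of the others, which is where the parallelism comes from, and every query sees every key, which is why the cost grows quadratically with sequence length.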

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

2018

Devlin et al. — Google AI Language

Demonstrated that deep bidirectional pre-training with masked language modeling, followed by task-specific fine-tuning, achieves state-of-the-art results on eleven NLP tasks. The key insight: models benefit enormously from seeing context from both directions at once. Changed how the entire NLP community approaches new tasks: fine-tune, don't train from scratch.
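A sketch of the masking step used for masked language modeling, following the recipe in the paper: about 15% of tokens are selected as prediction targets, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. Whitespace-level tokens and the `mask_tokens` helper are simplifications for illustration; the real model works on WordPiece ids.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_tokens(tokens: Sequence[str], vocab: Sequence[str], rng: random.Random,
                select_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model_inputs, labels); labels are None where nothing is predicted."""
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)  # kept as-is, but still predicted
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

Because the target can sit anywhere in the sentence, the model has to use context on both sides of it, which is what makes the pre-training bidirectional.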

Language Models are Few-Shot Learners (GPT-3)

2020

Brown et al. — OpenAI

Showed that scaling language models to 175B parameters produces emergent few-shot learning abilities: performing novel tasks from just a handful of examples in the prompt, without any gradient updates. Popularized the in-context learning that underpins prompt engineering today. The paper that made the broader world take LLMs seriously as general-purpose tools.

Denoising Diffusion Probabilistic Models

2020

Ho, Jain & Abbeel — UC Berkeley

Established the theoretical and practical foundation for modern diffusion-based generative models. By framing image generation as iterative denoising of Gaussian noise, DDPM achieved quality competitive with GANs while being significantly more stable and principled to train. The direct ancestor of Stable Diffusion, DALL-E 2, Imagen, and every modern text-to-image system.
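A sketch of the fixed forward (noising) process that the learned denoiser is trained to invert. It assumes the linear variance schedule from the paper (β from 1e-4 to 0.02 over 1000 steps) and uses the closed form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε. The reverse model, which is the part that actually gets trained, is not shown, and the function names are illustrative.

```python
import math
import random
from typing import List


def alpha_bars(num_steps: int = 1000, beta_start: float = 1e-4,
               beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) under a linear beta schedule."""
    out: List[float] = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0: List[float], alpha_bar_t: float, rng: random.Random) -> List[float]:
    """Jump straight from clean data x0 to its noised version at one timestep."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]
```

By the last step ᾱ is close to zero, so almost none of the original signal is left and the sample is close to pure Gaussian noise. Generation starts from that noise and runs the learned denoising steps in the other direction.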

Highly Accurate Protein Structure Prediction with AlphaFold

2021

Jumper et al. — Google DeepMind

Solved the 50-year-old grand challenge of computational biology: predicting a protein's 3D structure from its amino acid sequence, with near-experimental accuracy. AlphaFold2's Evoformer architecture combines multiple sequence alignments with structural reasoning through specialized attention layers. Released a database of 200M+ predicted structures, accelerating drug discovery and biology research globally.

Learning Transferable Visual Models From Natural Language Supervision (CLIP)

2021

Radford et al. — OpenAI

Trained a vision encoder and text encoder jointly via contrastive learning on 400M internet image-text pairs, producing models that generalize to new visual concepts zero-shot by comparing image embeddings to text descriptions. CLIP demonstrated that natural language is a richer supervision signal than fixed label sets. Became a foundation component for multimodal AI and semantic image search.
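Zero-shot classification then reduces to a nearest-neighbor lookup in the shared embedding space. A rough sketch, assuming the image and the candidate captions have already been embedded by the two encoders; the vectors used in the usage note are made-up placeholders and `zero_shot_classify` is a hypothetical helper, not part of any released CLIP code.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_embedding: Vector,
                       caption_embeddings: Dict[str, Vector]) -> Tuple[str, Dict[str, float]]:
    """Pick the caption whose embedding is most similar to the image embedding."""
    scores = {caption: cosine(image_embedding, emb)
              for caption, emb in caption_embeddings.items()}
    best = max(scores, key=lambda caption: scores[caption])
    return best, scores
```

Given an image embedding of `[0.9, 0.1, 0.0]` and captions "a photo of a dog" at `[1, 0, 0]` and "a photo of a cat" at `[0, 1, 0]`, the dog caption wins. Adding a new class means adding a new caption, with no retraining.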

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

2022

Ouyang et al. — OpenAI

Applied RLHF (Reinforcement Learning from Human Feedback) to align language model outputs with human intent. The pipeline fine-tunes GPT-3 on human-written demonstrations, trains a reward model on human rankings of model outputs, and then optimizes the policy against that reward model with PPO. Labelers preferred outputs from the 1.3B-parameter InstructGPT over those of the 175B-parameter GPT-3, despite its having over 100x fewer parameters, and the tuned models were more truthful and less toxic. The alignment technique behind ChatGPT and most instruction-tuned LLMs since.
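The human comparisons enter training through a reward model fitted with a pairwise loss: for a response the labeler preferred and one they rejected, minimize −log σ(r_preferred − r_rejected). A small sketch of that loss on scalar rewards; the reward model itself and the reinforcement learning step are not shown, and the function name is an assumption.

```python
import math


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)), computed in a stable form."""
    diff = r_preferred - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss is log 2 when the two rewards are equal, shrinks toward zero as the preferred response is scored higher, and grows roughly linearly when the model ranks the pair the wrong way round.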

GPT-4 Technical Report

2023

OpenAI

Documents GPT-4, OpenAI's first multimodal model accepting both text and image inputs. Reports human-level performance on a range of professional and academic benchmarks, including a simulated bar exam where it scores around the top 10% of test takers. The report also describes the risk assessment and expert red-teaming carried out before release, an approach to pre-deployment evaluation that much of the industry went on to adopt.

LLaMA: Open and Efficient Foundation Language Models

2023

Touvron et al. — Meta AI

Released a series of openly available language models (7B–65B parameters) trained exclusively on publicly available data, showing that smaller, carefully trained models could match GPT-3: LLaMA-13B outperforms the 175B-parameter GPT-3 on most benchmarks. Democratized LLM research by giving the community a capable open base to build on. Spawned an ecosystem of fine-tunes, including Alpaca, Vicuna, and hundreds more, that collectively reshaped the open-source AI landscape.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

2023

Gu & Dao — Carnegie Mellon / Princeton

Proposed a sequence model based on state space models (SSMs) with an input-dependent selection mechanism—matching or exceeding Transformer performance on language modeling while scaling linearly with sequence length instead of quadratically. Challenged the assumption that attention is necessary for high-quality sequence modeling and reignited interest in SSMs as a viable alternative architecture.
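A heavily simplified sketch of a selective state space recurrence, assuming a one-dimensional state and scalar weights. The step size, input gate, and output gate all depend on the current input, which is the "selection" idea; the actual model uses vector-valued states, learned projections, and a hardware-aware parallel scan, none of which appear here. All parameter names are hypothetical.

```python
import math
from typing import List


def softplus(x: float) -> float:
    return math.log1p(math.exp(x))


def selective_scan(xs: List[float], a: float = -1.0, w_delta: float = 1.0,
                   w_b: float = 1.0, w_c: float = 1.0) -> List[float]:
    """Single pass over the sequence: one state update per input element."""
    h = 0.0
    ys: List[float] = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # how much of the old state survives
        b_bar = delta * (w_b * x)      # input-dependent write strength
        h = a_bar * h + b_bar * x
        ys.append((w_c * x) * h)       # input-dependent readout
    return ys
```

The loop touches each element once and carries a fixed-size state, so the cost grows linearly with sequence length, in contrast to attention, where every position is compared with every other.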

Gemini: A Family of Highly Capable Multimodal Models

2023

Gemini Team — Google DeepMind

Google DeepMind's natively multimodal model family trained jointly on text, images, audio, video, and code from the ground up—rather than retrofitting vision onto a language model. Gemini Ultra was the first model to surpass human expert performance on MMLU. Demonstrated that native multimodality at training time produces substantially better cross-modal reasoning than post-hoc fusion approaches.

Anurag Sarkar Distributed systems & platform engineering
