The paper “Attention Is All You Need” (2017) introduced the Transformer — a novel neural architecture relying solely on self-attention, removing recurrence and convolutions entirely. It revolutionized machine translation by dramatically improving both training speed and translation quality (e.g., 28.4 BLEU on the WMT 2014 English-to-German translation task), setting new state-of-the-art benchmarks. Its modular, parallelizable design opened the door to large-scale pretraining and fine-tuning, ultimately laying the foundation for modern large language models like BERT and GPT. This paper reshaped the landscape of NLP and deep learning, making attention-based models the dominant paradigm across many tasks.
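To make the core operation concrete, here is a minimal sketch of scaled dot-product attention, the building block the paper composes into multi-head self-attention. It is written in plain NumPy under toy assumptions: the sequence length, model dimension, and projection matrices (Wq, Wk, Wv) are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the attention formula from the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of value vectors

# Toy example: 4 tokens, model dimension 8 (hypothetical sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In self-attention, queries, keys, and values are all projections of the same input
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Because every token attends to every other token in a single matrix product, the whole sequence is processed in parallel, which is what enables the training-speed gains over recurrent models.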
The BERT (Bidirectional Encoder Representations from Transformers) paper introduced a powerful pre-trained language model that uses deep bidirectional Transformers and masked language modeling to capture both left and right context. Unlike prior unidirectional models, BERT achieved state-of-the-art performance across 11 NLP tasks (including the GLUE benchmark and SQuAD) by enabling fine-tuning with minimal task-specific adjustments. Its impact reshaped NLP by setting a new standard for transfer learning, greatly improving accuracy on tasks such as question answering, sentiment analysis, and natural language inference, and inspiring a wave of follow-up models like RoBERTa, ALBERT, and T5.
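As a rough illustration of the masked-language-modeling objective, the sketch below prepares a training example following the recipe the paper describes: about 15% of positions are selected, of which 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged, while the labels record the original ids only at the selected positions. The helper name `mask_tokens`, the token ids, and the -100 ignore label are illustrative assumptions, not values specified by the paper.

```python
import numpy as np

def mask_tokens(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    """Build (inputs, labels) for masked language modeling, BERT-style."""
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)      # -100 = position not scored in the loss

    chosen = rng.random(token_ids.shape) < mask_prob
    labels[chosen] = token_ids[chosen]          # model must recover the original token here

    roll = rng.random(token_ids.shape)
    inputs = token_ids.copy()
    inputs[chosen & (roll < 0.8)] = mask_id     # 80% of chosen positions -> [MASK]
    random_pick = chosen & (roll >= 0.8) & (roll < 0.9)
    inputs[random_pick] = rng.integers(0, vocab_size, token_ids.shape)[random_pick]  # 10% -> random token
    # remaining 10% of chosen positions keep the original token
    return inputs, labels

# Toy example with hypothetical ids (vocab size 30522, [MASK] id 103)
rng = np.random.default_rng(42)
inputs, labels = mask_tokens([101, 2023, 2003, 1037, 7953, 102], 30522, 103, rng)
print(inputs, labels)
```

Because the model predicts masked tokens from both the words before and after them, it learns bidirectional context, which is the key difference from the left-to-right language models that preceded it.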