The paper “Attention Is All You Need” (2017) introduced the Transformer, a neural architecture that relies entirely on self-attention and dispenses with recurrence and convolutions. It revolutionized machine translation by dramatically improving both training speed and translation quality (e.g., 28.4 BLEU on the WMT 2014 English-to-German task), setting new state-of-the-art results. Its modular, parallelizable design opened the door to large-scale pretraining and fine-tuning, ultimately laying the foundation for modern large language models such as BERT and GPT. The paper reshaped the landscape of NLP and deep learning, making attention-based models the dominant paradigm across many tasks.
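The mechanism at the heart of the paper, scaled dot-product attention, is compact enough to sketch directly. The snippet below is a minimal single-head NumPy illustration of Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the paper's actual model adds multi-head projections, masking, and positional encodings, all omitted here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_q, seq_k) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

# Toy usage: 4 tokens with 8-dimensional representations; self-attention sets Q = K = V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```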
This paper introduces GPT-2, showing that a large language model trained on diverse internet text can perform a wide range of natural language tasks in a zero-shot setting, without any task-specific training. By scaling up to 1.5 billion parameters and training on the WebText corpus, GPT-2 achieves state-of-the-art or competitive results on tasks such as language modeling, reading comprehension, and question answering. Its impact has been profound: it pioneered the trend toward general-purpose, unsupervised language models and paved the way for today’s foundation models in AI.
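As a concrete illustration of zero-shot prompting, the sketch below uses the Hugging Face transformers library (not the paper's original code) to elicit a summary from the public GPT-2 checkpoint via the "TL;DR:" cue described in the paper; the article text is made up for the example.

```python
# Zero-shot summarization with GPT-2 via prompting alone (no fine-tuning).
# Uses the Hugging Face transformers library; the input passage is a made-up example.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # smallest public checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = ("The city council voted on Tuesday to expand the bike lane network, "
           "citing a sharp rise in cycling over the past two years.")
prompt = article + "\nTL;DR:"                        # the summarization cue from the paper
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
# Print only the newly generated continuation, i.e. the model's "summary".
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```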
This paper (“Scaling Laws for Neural Language Models”, 2020) reveals that language model performance improves predictably as model size, dataset size, and compute are scaled up, following smooth power-law relationships. It shows that larger models are more sample-efficient, and that compute-optimal training uses very large models on a moderate amount of data, stopping well before convergence. The work provided foundational insights that influenced the development of massive models like GPT-3 and beyond, shaping how the AI community understands trade-offs between size, data, and compute in building ever-stronger models.
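The “smooth power-law relationships” have a simple closed form. The equations below show the general shape reported in the paper, where N is the number of model parameters, D the dataset size in tokens, and C the training compute; the critical constants N_c, D_c, C_c and exponents α_N, α_D, α_C are empirical fits whose specific values are given in the paper and omitted here.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```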
This paper introduces GPT-3, a 175-billion-parameter autoregressive language model that achieves impressive zero-shot, one-shot, and few-shot performance across diverse NLP tasks without task-specific fine-tuning. Its scale allows it to generalize from natural language prompts alone, in some cases rivaling or surpassing prior state-of-the-art models that require fine-tuning. The paper’s impact is profound: it demonstrated the power of scaling, reshaped research on few-shot learning, and sparked widespread adoption of large-scale language models, influencing AI applications, ethical debates, and commercial deployments globally.
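Few-shot learning here means nothing more than packing solved examples into the prompt and letting the frozen model complete the pattern. The sketch below shows that generic prompt format on a made-up sentiment task; the task and wording are illustrative and not taken from the paper.

```python
# Generic few-shot (in-context) prompt construction in the style GPT-3 popularized.
# The demonstrations and query are illustrative, not from the paper.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("A flawed but heartfelt debut.", "positive"),
]
query = "The plot made no sense and the acting was worse."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in demonstrations:            # the "shots": solved examples placed in-context
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"      # the model is asked to complete this line

print(prompt)  # fed as-is to the language model; no gradient updates are involved
```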
This paper introduces GPT-4, a large multimodal model that processes both text and images, achieving human-level performance on many academic and professional benchmarks like the bar exam and GRE. It significantly advances language understanding, multilingual capabilities, and safety alignment over previous models, outperforming GPT-3.5 by wide margins. Its impact is profound, setting new standards for natural language processing, enabling safer and more powerful applications, and driving critical research on scaling laws, safety, bias, and the societal implications of AI deployment.
This paper presents DeepSeek-V2, a 236B-parameter open-source Mixture-of-Experts (MoE) language model that activates only 21B parameters per token, achieving top-tier bilingual (English and Chinese) performance while saving 42.5% of training costs and boosting maximum generation throughput by 5.76× relative to its dense predecessor, DeepSeek 67B. Its key innovations, Multi-head Latent Attention (MLA), which compresses the key-value cache, and DeepSeekMoE, which strengthens expert specialization, relieve the main memory and cost bottlenecks. The paper’s impact lies in advancing economical, efficient large-scale language modeling, pushing open-source models closer to closed-source leaders, and paving the way for future multimodal and AGI-aligned systems.
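The gap between 236B total and 21B activated parameters comes from sparse expert routing: each token is sent to only a few experts, so only their weights participate in the forward pass. The sketch below shows generic top-k softmax gating for intuition only; DeepSeekMoE's actual design adds shared experts and fine-grained expert segmentation, and MLA is not modeled at all.

```python
import numpy as np

def topk_moe_layer(x, expert_weights, gate_W, k=2):
    """Generic top-k mixture-of-experts routing for a single token vector x.

    Illustrative only: DeepSeekMoE additionally uses shared experts and
    fine-grained expert segmentation, which are not modeled here.
    """
    logits = gate_W @ x                        # one routing score per expert
    topk = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                       # normalize gates over the chosen experts
    # Only the k selected experts run, so only their parameters are "activated".
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, topk))

# Toy usage: 8 experts, each a 16x16 linear map; only 2 experts touch this token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=d)
experts = rng.normal(size=(n_experts, d, d))
gate_W = rng.normal(size=(n_experts, d))
y = topk_moe_layer(x, experts, gate_W, k=2)
print(y.shape)  # (16,)
```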
This paper introduces DeepSeek-V3, a 671B-parameter Mixture-of-Experts (MoE) language model that activates only 37B parameters per token for efficient training and inference. By leveraging innovations such as Multi-head Latent Attention, auxiliary-loss-free load balancing, and multi-token prediction, it achieves top-tier performance across math, code, multilingual, and reasoning tasks. Despite its massive scale, DeepSeek-V3 maintains economical training costs and outperforms other open-source models, achieving results comparable to leading closed-source models like GPT-4o and Claude 3.5 Sonnet, thereby significantly narrowing the open-source vs. closed-source performance gap.
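Multi-token prediction densifies the training signal by asking the model to predict several future tokens at each position rather than just the next one. The sketch below shows a generic version of that objective with independent extra prediction heads; DeepSeek-V3's actual MTP uses sequentially chained modules, so treat this as a simplification for intuition.

```python
import torch
import torch.nn.functional as F

def multi_token_prediction_loss(hidden, heads, targets):
    """Generic multi-token prediction objective (simplified).

    hidden:  (batch, seq, d)   final hidden states from the trunk
    heads:   list of nn.Linear, one per future offset (1, 2, ... tokens ahead)
    targets: (batch, seq)      token ids
    Note: DeepSeek-V3's MTP chains sequential modules rather than using
    independent parallel heads as done here.
    """
    losses = []
    for offset, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-offset])     # predict the token `offset` steps ahead
        labels = targets[:, offset:]
        losses.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)))
    return sum(losses) / len(losses)

# Toy usage: batch 2, sequence length 8, hidden size 16, vocab 100, depth-2 MTP.
torch.manual_seed(0)
hidden = torch.randn(2, 8, 16)
heads = [torch.nn.Linear(16, 100) for _ in range(2)]
targets = torch.randint(0, 100, (2, 8))
print(multi_token_prediction_loss(hidden, heads, targets).item())
```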
This paper introduces DeepSeek-R1, a large language model whose reasoning ability is developed primarily through reinforcement learning (RL); its precursor, DeepSeek-R1-Zero, is trained with RL alone, without any supervised fine-tuning. The work shows that reasoning behaviors such as chain-of-thought, self-reflection, and verification can emerge naturally from RL, reaching performance comparable to OpenAI’s o1 models. Its distilled smaller models outperform many open-source alternatives, democratizing advanced reasoning for smaller systems. The paper’s impact lies in demonstrating that RL alone can elicit strong reasoning and in open-sourcing both the large and the distilled models, opening new directions for scalable, cost-effective training of reasoning-focused LLMs.
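Part of what makes this RL recipe practical is that the reward can be purely rule-based rather than a learned reward model. The sketch below is a minimal example in that spirit: a format check for <think>...</think> reasoning tags plus an exact-match accuracy check on the final answer; the specific weights are illustrative, not the paper's.

```python
import re

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Minimal rule-based reward: format check plus exact-answer accuracy check.

    Mirrors the kind of verifiable rewards described in the paper; the specific
    weights (0.2 / 1.0) are illustrative, not the paper's values.
    """
    reward = 0.0
    # Format reward: reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.2
    # Accuracy reward: the final answer after the reasoning must match the reference.
    answer = completion.split("</think>")[-1].strip()
    if answer == ground_truth.strip():
        reward += 1.0
    return reward

sample = "<think>7 * 6 = 42, so the answer is 42.</think>42"
print(rule_based_reward(sample, "42"))  # 1.2
```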