This seminal paper by Alan Turing, “Computing Machinery and Intelligence,” published in 1950 in Mind, was the first to introduce what is now known as the Turing test to the general public.
Frank Rosenblatt’s 1958 paper introduced the perceptron, a probabilistic model mimicking neural connections for learning and pattern recognition. It laid the mathematical and conceptual groundwork for modern neural networks and sparked decades of research in artificial intelligence, despite its early limitations and later critiques (most famously Minsky and Papert’s 1969 analysis of what single-layer perceptrons cannot compute).
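A minimal sketch of the perceptron learning rule on a toy linearly separable problem (logical AND); the data, learning rate, and epoch count are illustrative choices, not values from Rosenblatt’s paper:

```python
import numpy as np

# Perceptron learning rule on logical AND (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0  # threshold unit
        error = target - pred              # zero when correct
        w += lr * error * xi               # nudge the decision boundary
        b += lr * error

print([1 if xi @ w + b > 0 else 0 for xi in X])  # -> [0, 0, 0, 1]
```

On separable data the rule is guaranteed to converge after finitely many mistakes; XOR, famously, is not separable and defeats a single unit, which is what the later critiques centered on.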
This paper introduces the generalized delta rule, a learning procedure for multi-layer networks with hidden units, enabling them to learn internal representations. This rule implements a gradient descent method to minimize the error between the network's output and a target output by propagating error signals backward through the network. The authors demonstrate through simulations on various problems, such as XOR and parity, that this method, often called backpropagation, can discover complex internal representations and solutions. They show it overcomes previous limitations in training such networks and rarely encounters debilitating local minima.
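A minimal sketch of the generalized delta rule on the paper’s XOR task, using a one-hidden-layer sigmoid network trained by plain gradient descent. Layer sizes, learning rate, and iteration count are illustrative assumptions, not the paper’s exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: unsolvable without hidden units, solvable with backpropagation
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for step in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate error signals from output toward input
    d_out = (out - y) * out * (1 - out)  # delta at the output unit
    d_h = (d_out @ W2.T) * h * (1 - h)   # delta at the hidden units

    # Gradient descent on the squared error
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # typically approaches [0, 1, 1, 0]
```

As the paper notes, such runs can occasionally settle into a poor local minimum, but in practice this is rare.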
This paper proposes minimizing the information content in neural network weights to enhance generalization, particularly when training data is scarce. It introduces a method where adaptable Gaussian noise is added to the weights, balancing the expected squared error against the amount of information the weights contain. Leveraging the Minimum Description Length (MDL) principle and a "bits back" argument for communicating these noisy weights, the approach enables efficient derivative computations, especially if output units are linear. The paper also explores using adaptive mixtures of Gaussians for more flexible prior distributions for weight coding. Preliminary results indicated a slight improvement over simple weight-decay on a high-dimensional task.
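A rough sketch of the core objective under strong simplifying assumptions: each weight is communicated as a Gaussian with an adaptable mean and noise level, and its information content is measured as the KL divergence from a fixed Gaussian prior. The prior width, trade-off coefficient, single-sample error (rather than the expected error), and toy data below are all illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_information(mu, sigma, prior_sigma=1.0):
    """Information content of noisy weights: KL divergence between the
    per-weight posterior N(mu, sigma^2) and an assumed prior N(0, prior_sigma^2)."""
    return (np.log(prior_sigma / sigma)
            + (sigma**2 + mu**2) / (2 * prior_sigma**2)
            - 0.5).sum()

# Noisy forward pass: sample the weights before using them
mu = rng.normal(scale=0.5, size=3)  # weight means (illustrative)
sigma = np.full(3, 0.1)             # adaptable noise levels
w = mu + sigma * rng.normal(size=3)

x = np.array([0.5, -1.0, 2.0])      # made-up input and target
target = 1.0
squared_error = (x @ w - target) ** 2

# Trade data misfit against weight information content; the coefficient
# is a free choice in this sketch
loss = squared_error + 0.01 * weight_information(mu, sigma)
print(round(float(loss), 3))
```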
The 2012 paper “ImageNet Classification with Deep Convolutional Neural Networks” by Krizhevsky, Sutskever, and Hinton introduced AlexNet, a deep CNN that dramatically improved image classification accuracy on ImageNet, cutting the top-5 error rate roughly in half (from ~26% to ~15%). Its innovations, including ReLU activations, dropout, GPU training, and data augmentation, sparked the deep learning revolution, laying the foundation for modern computer vision and advancing AI across industries.
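Two of AlexNet’s key ingredients are easy to sketch in isolation. The snippet below shows a ReLU activation and an “inverted” dropout layer (a common modern variant that rescales at train time; the original paper instead scales activations at test time). Shapes and the dropout rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Non-saturating activation; a key reason AlexNet trained quickly
    return np.maximum(0, x)

def dropout(x, p=0.5, train=True):
    # Randomly zero activations during training; rescaling here keeps the
    # expected activation equal to its test-time value (inverted dropout)
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = relu(rng.normal(size=(4, 8)))  # activations of a hypothetical layer
h = dropout(h, p=0.5)
print(h.shape)  # (4, 8)
```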
The 2014 paper “Generative Adversarial Nets” (GAN) by Ian Goodfellow et al. introduced a groundbreaking framework in which two neural networks, a generator and a discriminator, compete in a minimax game: the generator tries to produce realistic data, while the discriminator tries to distinguish real samples from fakes. The approach avoids Markov chains and approximate inference, relying solely on backpropagation. GANs revolutionized generative modeling, enabling realistic image, text, and audio generation, and sparked major advances in AI creativity, deepfake technology, and research on adversarial training and robustness.
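A toy sketch of the adversarial game under heavy simplifications: one-dimensional real data, an affine generator, and a logistic-regression discriminator, with gradients written out by hand. It uses the non-saturating generator loss the paper recommends in practice; all sizes and learning rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Real data ~ N(3, 0.5); generator G(z) = a*z + b;
# discriminator D(x) = sigmoid(w*x + c). All values illustrative.
a, b = 1.0, 0.0   # generator parameters
w, c = 0.1, 0.0   # discriminator parameters
lr = 0.05

for step in range(5000):
    x_real = rng.normal(3.0, 0.5, size=32)
    z = rng.normal(size=32)
    x_fake = a * z + b

    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w -= lr * ((d_real - 1) * x_real + d_fake * x_fake).mean()
    c -= lr * ((d_real - 1) + d_fake).mean()

    # Generator step: descend -log D(fake), the non-saturating loss
    d_fake = sigmoid(w * x_fake + c)
    g_grad = -(1.0 - d_fake) * w   # d(loss)/d(x_fake)
    a -= lr * (g_grad * z).mean()
    b -= lr * g_grad.mean()

print(round(b, 2))  # the generator's output mean typically drifts toward 3
```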
This paper presents a method for applying dropout regularization to LSTMs by restricting it to non-recurrent connections, solving prior issues with overfitting in recurrent networks. It significantly improves generalization across diverse tasks including language modeling, speech recognition, machine translation, and image captioning. The technique allows larger RNNs to be effectively trained without compromising their ability to memorize long-term dependencies. This work helped establish dropout as a viable regularization strategy for RNNs and influenced widespread adoption in sequence modeling applications.
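A sketch of the key idea, assuming a minimal NumPy LSTM: dropout is applied only to the input (non-recurrent) connection at each timestep, while the recurrent state h and cell c flow between timesteps unperturbed. All shapes and rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout(x, p, train=True):
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)   # inverted dropout (illustrative choice)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W maps the input, U maps the recurrent state."""
    i, f, g, o = np.split(x @ W + h @ U + b, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

d, n, T = 8, 16, 5   # input size, hidden size, sequence length
W = rng.normal(scale=0.1, size=(d, 4 * n))
U = rng.normal(scale=0.1, size=(n, 4 * n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(T, d)):
    # Dropout only on the non-recurrent (input -> hidden) connection;
    # h and c cross timesteps untouched, so long-term memory survives
    h, c = lstm_step(dropout(x_t, p=0.3), h, c, W, U, b)
```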
The paper “Deep Residual Learning for Image Recognition” (ResNet, 2015) introduced residual networks with shortcut connections, allowing very deep neural networks (over 100 layers) to be effectively trained by reformulating the learning task into residual functions (F(x) = H(x) − x). This innovation solved the degradation problem in deep models, achieving state-of-the-art results on ImageNet (winning ILSVRC 2015) and COCO challenges. Its impact reshaped the design of deep learning architectures across vision and non-vision tasks, becoming a foundational backbone in modern AI systems.
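A minimal sketch of a residual block using fully connected layers (the paper’s blocks use convolutions and batch normalization; this strips the idea to its core). Sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """The stacked layers learn only the residual F(x) = H(x) - x;
    the shortcut adds x back, so the block outputs H(x) = F(x) + x."""
    f = relu(x @ W1) @ W2   # F(x)
    return relu(f + x)      # shortcut connection

d = 16
x = rng.normal(size=(4, d))
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
y = residual_block(x, W1, W2)
print(y.shape)  # (4, 16)

# With W1 and W2 near zero the block is close to the identity, which is
# why stacks of hundreds of such blocks remain trainable
```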
This paper introduces the Pointer Network (Ptr-Net), a neural architecture that learns the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existing approaches such as sequence-to-sequence models or Neural Turing Machines, because the number of target classes at each output step depends on the input length, which is variable; sorting variable-sized sequences and various combinatorial optimization problems belong to this class. The model solves the problem of variable-size output dictionaries with neural attention, but differs from previous attention approaches in that, instead of using attention to blend encoder hidden states into a context vector at each decoder step, it uses attention as a pointer that selects a member of the input sequence as the output. The authors show Ptr-Nets can learn approximate solutions to three challenging geometric problems (finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem) from training examples alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also generalize to variable-size output dictionaries and beyond the maximum lengths they were trained on, encouraging broader exploration of neural learning for discrete problems.
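A sketch of the pointer mechanism, following the paper’s additive-attention scoring u_i = v · tanh(W1 e_i + W2 d): each decoder step produces one logit per input position, so the softmax is over input elements rather than a fixed vocabulary. Shapes and initializations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def pointer_scores(enc, dec, W1, W2, v):
    """One logit per input position: the output dictionary is the input
    itself and grows with the sequence length."""
    u = np.tanh(enc @ W1 + dec @ W2)  # (n, d)
    return u @ v                      # (n,)

n, d = 7, 16                     # input length, hidden size (illustrative)
enc = rng.normal(size=(n, d))    # encoder hidden states, one per input
dec = rng.normal(size=(d,))      # current decoder state
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=(d,))

scores = pointer_scores(enc, dec, W1, W2, v)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.argmax())  # the input position the decoder "points" to
```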
The paper “Attention Is All You Need” (2017) introduced the Transformer, a novel neural architecture relying solely on self-attention, removing recurrence and convolutions. It revolutionized machine translation by dramatically improving training speed and translation quality (e.g., achieving 28.4 BLEU on the WMT 2014 English-to-German task), setting new state-of-the-art benchmarks. Its modular, parallelizable design opened the door to large-scale pretraining and fine-tuning, ultimately laying the foundation for modern large language models like BERT and GPT. This paper reshaped the landscape of NLP and deep learning, making attention-based models the dominant paradigm across many tasks.
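A minimal sketch of scaled dot-product attention, the Transformer’s core operation, applied as self-attention (queries, keys, and values all derived from the same sequence). Projection matrices and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, the Transformer's core operation."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 5, 8   # sequence length, model width (illustrative)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (n, d): every position attends over all positions
```

Because every position attends to every other position in a single matrix multiplication, the computation parallelizes across the sequence, which is the property that made large-scale pretraining practical.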