Best learning resources for AI
This paper presents a method for applying dropout regularization to LSTMs by restricting it to the non-recurrent connections, sidestepping the problems that had made dropout ineffective for recurrent networks. It significantly improves generalization across diverse tasks including language modeling, speech recognition, machine translation, and image captioning. The technique allows larger RNNs to be trained effectively without compromising their ability to memorize long-term dependencies. This work helped establish dropout as a viable regularization strategy for RNNs and influenced its widespread adoption in sequence modeling applications.
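As a concrete illustration (not code from the paper): PyTorch's stacked `nn.LSTM` follows the same scheme, since its `dropout` argument affects only the activations passed between layers, never the hidden-to-hidden recurrence. Sizes below are arbitrary.

```python
import torch
import torch.nn as nn

# Dropout is applied only to the outputs passed between stacked layers
# (the non-recurrent connections), never to the recurrent state itself.
lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=2,
    dropout=0.5,   # dropped only on inter-layer (non-recurrent) activations
    batch_first=True,
)

x = torch.randn(4, 20, 128)       # (batch, time, features)
output, (h_n, c_n) = lstm(x)      # recurrent state flows through undropped
```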
This paper augments recurrent neural networks with a differentiable external memory addressed by content- and location-based attention. Trained end-to-end, it learns algorithmic tasks like copying, sorting, and associative recall from examples, demonstrating that neural nets can induce simple programs. The idea sparked extensive work on memory-augmented models, differentiable computers, neural program synthesis, and modern attention mechanisms.
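A minimal sketch of the content-based addressing step, with illustrative shapes (the real NTM combines this with location-based shifts and learned interpolation):

```python
import torch
import torch.nn.functional as F

def content_addressing(memory, key, beta):
    """Content-based read weights: cosine similarity between a query key
    and each memory row, sharpened by beta and normalized with a softmax.
    memory: (N, M) -- N slots of width M;  key: (M,);  beta: scalar > 0.
    """
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # (N,)
    return F.softmax(beta * sim, dim=0)                         # (N,)

memory = torch.randn(16, 32)
key = torch.randn(32)
w = content_addressing(memory, key, beta=5.0)
read_vector = w @ memory   # differentiable read: weighted sum of slots
```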
Stanford’s 10-week CS231n dives from first principles to state-of-the-art vision research, starting with image-classification basics, loss functions, and optimization, then building from fully-connected nets to modern CNNs, residual and vision-transformer architectures. Lectures span training tricks, regularization, visualization, transfer learning, detection, segmentation, video, 3D, and generative models. Three hands-on PyTorch assignments guide students from k-NN/SVM through deep CNNs and network visualization, and a capstone project lets teams train large-scale models on a vision task of their choice, so students graduate with the skills to design, debug, and deploy real-world deep-learning pipelines.
This tutorial explores the surprising capabilities of Recurrent Neural Networks (RNNs), particularly in generating coherent text character by character. It delves into how RNNs, especially when implemented with Long Short-Term Memory (LSTM) units, can learn complex patterns and structures in data, enabling them to produce outputs that mimic the style and syntax of the training material. The discussion includes the architecture of RNNs, their ability to handle sequences of varying lengths, and the challenges associated with training them, such as the vanishing gradient problem. Through various examples, the tutorial illustrates the potential of RNNs in tasks like language modeling and sequence prediction.
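A minimal sketch of the character-by-character sampling loop such a model uses; the architecture, vocabulary size, and temperature here are illustrative assumptions, not the tutorial's exact setup.

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, idx, state=None):
        h, state = self.lstm(self.embed(idx), state)
        return self.head(h), state

@torch.no_grad()
def sample(model, start_idx, length, temperature=1.0):
    """Feed each sampled character back in as the next input."""
    idx = torch.tensor([[start_idx]])
    state, out = None, [start_idx]
    for _ in range(length):
        logits, state = model(idx, state)
        probs = torch.softmax(logits[0, -1] / temperature, dim=0)
        idx = torch.multinomial(probs, 1).view(1, 1)  # sample next char
        out.append(idx.item())
    return out

model = CharRNN(vocab_size=65)   # e.g. a small character set
print(sample(model, start_idx=0, length=50))
```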
This tutorial explains how Long Short-Term Memory (LSTM) networks address the limitations of traditional Recurrent Neural Networks (RNNs), particularly their difficulty in learning long-term dependencies due to issues like vanishing gradients. LSTMs introduce a cell state that acts as a conveyor belt, allowing information to flow unchanged, and utilize gates (input, forget, and output) to regulate the addition, removal, and output of information. This architecture enables LSTMs to effectively capture and maintain long-term dependencies in sequential data.
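To make the gate mechanics concrete, a minimal NumPy sketch of one LSTM step (the sizes and stacked-parameter layout are illustrative conventions):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step, written out to show the three gates and the cell-state
    'conveyor belt'. W, U, b stack the input/forget/output/candidate params.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = W @ x + U @ h_prev + b                      # all pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)    # input/forget/output gates
    g = np.tanh(g)                                  # candidate values
    c = f * c_prev + i * g      # cell state: old content kept or overwritten
    h = o * np.tanh(c)          # output gate decides what the cell reveals
    return h, c

H, X = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * H, X))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, U, b)
```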
This paper explores how the order of inputs and outputs affects the performance of sequence-to-sequence (seq2seq) models, even when the data is unordered (e.g., sets). It introduces architectural extensions such as the Read-Process-Write model and proposes a training approach that searches over output permutations to improve learning. The paper shows that optimal ordering significantly impacts performance on tasks like language modeling, parsing, and combinatorial problems. This work highlights the importance of considering input/output ordering in model design and has influenced further research in permutation-invariant architectures.
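A rough sketch of the Process block's core idea, under simplifying assumptions (dot-product attention, arbitrary sizes): an LSTM that takes no external input and reads the set only through attention, so its final state ignores input order.

```python
import torch
import torch.nn as nn

class ProcessBlock(nn.Module):
    def __init__(self, dim, steps=5):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.steps = steps

    def forward(self, memories):                        # memories: (n, dim)
        n, dim = memories.shape
        h, c = torch.zeros(dim), torch.zeros(dim)
        for _ in range(self.steps):
            attn = torch.softmax(memories @ h, dim=0)   # (n,) content attention
            read = attn @ memories                      # order-independent read
            h, c = self.cell(read.unsqueeze(0),
                             (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
        return h    # a set embedding unaffected by input permutation

block = ProcessBlock(dim=16)
x = torch.randn(7, 16)
perm = x[torch.randperm(7)]
print(torch.allclose(block(x), block(perm), atol=1e-5))  # True: order ignored
```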
This paper introduces a novel module for semantic segmentation using dilated convolutions, which enables exponential expansion of the receptive field without losing resolution. By aggregating multi-scale contextual information efficiently, the proposed context module significantly improves dense prediction accuracy when integrated into existing architectures. The work has had a lasting impact on dense prediction and semantic segmentation, laying the foundation for many modern segmentation models.
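A minimal sketch of the idea with illustrative channel widths: stacking 3x3 convolutions whose dilation doubles each layer grows the receptive field exponentially while keeping the feature map the same size.

```python
import torch
import torch.nn as nn

layers = []
for dilation in (1, 2, 4, 8):
    layers += [
        nn.Conv2d(32, 32, kernel_size=3, dilation=dilation, padding=dilation),
        nn.ReLU(inplace=True),
    ]
context_module = nn.Sequential(*layers)

x = torch.randn(1, 32, 64, 64)
y = context_module(x)
print(y.shape)   # torch.Size([1, 32, 64, 64]) -- resolution is preserved
# Receptive field: 3 + 2*(2+4+8) = 31 pixels from just four 3x3 layers.
```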
This paper presents Deep Speech 2, an end-to-end deep learning system for automatic speech recognition that works across vastly different languages (English and Mandarin). It replaces traditional hand-engineered ASR pipelines with neural networks, achieving human-competitive transcription accuracy on standard datasets. The system uses high-performance-computing techniques for a 7x training speedup, enabling faster experimentation. Key innovations include Batch Normalization for RNNs, curriculum learning (SortaGrad), and GPU deployment optimization (Batch Dispatch). The approach demonstrates that end-to-end learning can handle diverse speech conditions including noise, accents, and different languages, representing a significant step toward universal speech recognition systems.
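A toy sketch of the SortaGrad idea (the paper orders minibatches by utterance length within the first epoch; this simplified version orders individual examples, and the dataset is a stand-in):

```python
import random

def epoch_order(dataset, epoch):
    """First epoch: shortest utterances first, for cleaner early gradients.
    Later epochs: shuffle as usual."""
    if epoch == 0:
        return sorted(dataset, key=lambda ex: len(ex[0]))
    shuffled = list(dataset)
    random.shuffle(shuffled)
    return shuffled

# Stand-in dataset of (audio, transcript) pairs with varying lengths.
dataset = [([0.0] * n, "text") for n in (300, 50, 120, 800)]
print([len(a) for a, _ in epoch_order(dataset, epoch=0)])  # [50, 120, 300, 800]
```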
The paper “Deep Residual Learning for Image Recognition” (ResNet, 2015) introduced residual networks with shortcut connections, allowing very deep neural networks (over 100 layers) to be effectively trained by reformulating the learning task into residual functions (F(x) = H(x) − x). This innovation solved the degradation problem in deep models, achieving state-of-the-art results on ImageNet (winning ILSVRC 2015) and COCO challenges. Its impact reshaped the design of deep learning architectures across vision and non-vision tasks, becoming a foundational backbone in modern AI systems.
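A minimal PyTorch sketch of a residual block in this spirit; the channel width is arbitrary, and the projection shortcut used when shapes change is omitted.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions learn the residual F(x); the identity shortcut
    adds x back, so the block outputs H(x) = F(x) + x."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))   # F(x), first half
        out = self.bn2(self.conv2(out))             # F(x), second half
        return torch.relu(out + x)                  # H(x) = F(x) + x

block = ResidualBlock()
y = block(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```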
The paper introduced AlphaGo, the first program to defeat a human professional Go player without handicap. It combined deep neural networks — trained with supervised learning and reinforcement learning — with Monte Carlo tree search (MCTS), enabling efficient move selection and board evaluation in Go’s massive search space. AlphaGo’s victory against European champion Fan Hui marked a historic AI milestone, showcasing that combining learning-based policies with search can surpass prior handcrafted methods, reshaping both game AI and broader AI research directions.
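For intuition, a toy sketch of the tree-policy rule AlphaGo uses to pick moves during search: each move's value estimate Q is combined with an exploration bonus proportional to the policy network's prior P and decaying with visit count N. The statistics and exploration constant below are made-up numbers, not values from the paper.

```python
import math

def select_move(stats, c_puct=1.0):
    """Pick the move maximizing Q + exploration bonus from the prior."""
    total_visits = sum(s["N"] for s in stats.values())
    def score(s):
        u = c_puct * s["P"] * math.sqrt(total_visits) / (1 + s["N"])
        return s["Q"] + u
    return max(stats, key=lambda a: score(stats[a]))

stats = {
    "D4":  {"Q": 0.52, "P": 0.40, "N": 120},
    "Q16": {"Q": 0.48, "P": 0.35, "N": 40},
    "K10": {"Q": 0.30, "P": 0.05, "N": 3},
}
print(select_move(stats))   # balances learned priors against search values
```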
This paper shows that using identity mappings for skip connections and pre-activation in residual blocks allows signals to flow unimpeded, making it easier to train very deep networks. Through theoretical analysis and ablation studies, the authors introduce a pre-activation residual unit that enables successful training of 1000-layer ResNets and improves CIFAR-10/100 and ImageNet accuracy. This design, commonly known as ResNet-v2, influenced numerous later deep vision models.
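A minimal sketch of the pre-activation unit, with an arbitrary channel width; note the contrast with the original block, where the addition was followed by a ReLU.

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """BN and ReLU come *before* each convolution, and the shortcut is a
    pure identity, so nothing follows the addition: signals and gradients
    pass through unimpeded."""
    def __init__(self, channels=64):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))   # BN -> ReLU -> conv
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x    # identity shortcut; no activation after the add

block = PreActBlock()
y = block(torch.randn(1, 64, 32, 32))
```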
This paper introduces the Pointer Network (Ptr-Net), an architecture for output sequences whose tokens are positions in the input sequence. Standard sequence-to-sequence models and Neural Turing Machines cannot handle such problems directly, because the number of target classes at each output step depends on the input length, which is variable; sorting variable-sized sequences and many combinatorial optimization problems fall into this class. Instead of using attention to blend encoder hidden states into a context vector at each decoder step, Ptr-Net uses attention as a pointer that selects a member of the input sequence as the output. Trained purely from examples, Ptr-Nets learn approximate solutions to three challenging geometric problems (planar convex hulls, Delaunay triangulations, and the planar Travelling Salesman Problem), improve over sequence-to-sequence models with input attention, and generalize to sequences longer than those seen in training, encouraging broader exploration of neural learning for discrete problems.
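A minimal sketch of the pointer mechanism described above; layer sizes are illustrative, and this is only the attention head, not the full encoder-decoder.

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """Attention scores over input positions *are* the output distribution,
    so the decoder 'points' at an input element instead of blending encoder
    states into a context vector."""
    def __init__(self, dim):
        super().__init__()
        self.W_enc = nn.Linear(dim, dim, bias=False)
        self.W_dec = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (n, dim) -- one vector per input position
        # dec_state:  (dim,)   -- current decoder hidden state
        scores = self.v(torch.tanh(self.W_enc(enc_states) +
                                   self.W_dec(dec_state))).squeeze(-1)
        return torch.softmax(scores, dim=0)   # distribution over input slots

ptr = PointerAttention(dim=32)
probs = ptr(torch.randn(10, 32), torch.randn(32))
print(probs.argmax().item())   # index of the input element pointed at
```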