Multi-Scale Context Aggregation by Dilated Convolutions

This paper introduces a convolutional module for semantic segmentation built on dilated convolutions, which expand the receptive field exponentially without losing resolution. By aggregating multi-scale contextual information, the proposed context module improves dense prediction accuracy when plugged into existing architectures. The work has had a lasting influence on dense prediction and semantic segmentation, and many later segmentation models build on its use of dilated convolutions.


Introduction

State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks originally designed for image classification. However, dense prediction problems such as semantic segmentation are structurally different from image classification: they require a label for every pixel, not one label per image. This paper develops a convolutional network module designed specifically for dense prediction. The module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution, and its receptive field can grow exponentially with depth. Adding it to existing semantic segmentation systems improves their accuracy. The authors also examine classification networks adapted for dense prediction and show that simplifying the adapted network can increase accuracy.
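To make the receptive-field claim concrete, here is a minimal sketch in plain Python. It is not the authors' context module: the helper names `receptive_field` and `dilated_conv1d` are hypothetical, the example is one-dimensional, and the dilation schedule 1, 2, 4, ... and the zero padding at the borders are assumptions chosen for illustration. It shows that stacking 3-tap filters with doubling dilation widens the receptive field exponentially while the output keeps the input's length.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (along one axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each layer extends the field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv1d(signal, kernel, dilation):
    """1-D dilated filter with zero padding, so output length == input length.

    Written in the cross-correlation form common in deep learning code.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):         # out-of-range taps count as zero
                total += w * signal[idx]
        out.append(total)
    return out


# Exponentially increasing dilation vs. ordinary (dilation 1) convolutions.
dilated_rf = receptive_field([1, 2, 4, 8])   # grows roughly as 2**depth
plain_rf = receptive_field([1, 1, 1, 1])     # grows linearly with depth

# Push a unit impulse through three dilated layers and see how far it spreads.
signal = [0.0] * 17
signal[8] = 1.0
for d in (1, 2, 4):
    signal = dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
spread = sum(1 for v in signal if v != 0.0)

print(dilated_rf, plain_rf, len(signal), spread)
```

With the same number of layers and the same number of weights per layer, only the dilation changes. The spread of the impulse response matches the receptive field computed for dilations 1, 2, 4, and no pooling or striding is needed to reach it, which is why resolution is preserved.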


Information

Website: arxiv.org
Authors: Fisher Yu, Vladlen Koltun
Published date: 2015/11/23

More Items

Video models are zero-shot learners and reasoners

2025

Thaddäus Wiedemer, Yuxuan Li +7

This paper demonstrates the zero-shot learning and reasoning abilities of the generative video model Veo 3, paralleling the evolution of Large Language Models (LLMs) in natural language processing. Veo 3 excels in diverse visual tasks without explicit training, such as object segmentation, edge detection, image editing, understanding physical properties, recognizing affordances, and simulating tool use, enabling early visual reasoning like maze solving and symmetry detection.

video vision LLM paper ai-video+3

Identity Mappings in Deep Residual Networks

2016

Kaiming He, Xiangyu Zhang +2

This paper shows that using identity mappings for skip connections and pre-activation in residual blocks allows signals to flow unimpeded, making it easier to train very deep networks. Through theoretical analysis and ablation studies, the authors introduce a pre-activation residual unit that enables successful training of 1000-layer ResNets and improves CIFAR-10/100 and ImageNet accuracy, influencing later architectures such as ResNet-v2 and numerous deep vision models.

foundation 30u30 paper vision

Generative Adversarial Networks

2014

Ian J. Goodfellow, Jean Pouget-Abadie +6

The 2014 paper “Generative Adversarial Nets” (GAN) by Ian Goodfellow et al. introduced a groundbreaking framework where two neural networks — a generator and a discriminator — compete in a minimax game: the generator tries to produce realistic data, while the discriminator tries to distinguish real from fake. This approach avoids Markov chains and approximate inference, relying solely on backpropagation. GANs revolutionized generative modeling, enabling realistic image, text, and audio generation, sparking massive advances in AI creativity, deepfake technology, and research on adversarial training and robustness.

vision AIGC paper foundation