Call for Papers
Invited Speakers (videos now available)
Matthew Zeiler, Clarifai, USA
Graham Taylor, Univ. Guelph, Canada
Cees Snoek, Univ. Amsterdam / Qualcomm, Netherlands
Anton van den Hengel, Univ. Adelaide, Australia
Andrej Karpathy, Stanford Univ., USA
Title: Empowering Deep Learning for Developers
Abstract: This talk will cover the exciting research done at Clarifai and its real-world applications with customers. When we tackle research challenges, we balance those tied directly to products with open-ended research that interests us. We also pay close attention to how end users interact with our machine learning technology, to ensure it is both simple and fast. Use cases across various industries and developer-to-consumer segments will be demonstrated to show how our technology enables interesting real-world applications.
Title: Learning deep multi-modal fusion architectures
Abstract: Advances in sensors and multi-modal deep learning have created new opportunities in automated understanding of complex human behaviour. A key challenge in multi-modal approaches is how and when to fuse modalities. The classical approach is either early fusion at the inputs or late fusion at the outputs of individual per-modality predictors. Deep learning of representations enables more flexibility in fusion strategies. However, limited labeled datasets and complex models mean that creating a monolithic architecture and hoping that an optimal fusion structure will emerge by learning simply does not work. In this talk, I will describe multiple strategies for learning fusion in the context of multi-modal gesture recognition. First, I will describe a way to guide the model by starting with a simple fusion strategy and gradually introducing additional complexity. Next, I will describe how the architecture can be made more robust to noisy or missing sensors by modality-wise Dropout-style regularization. Finally, I will describe recent work which uses Bayesian Optimization with a Graph-Induced Kernel over model architectures to learn a good fusion strategy.
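The modality-wise, Dropout-style regularization mentioned above can be sketched in a few lines of NumPy. This is a didactic illustration under assumed names and shapes, not the speaker's actual implementation:

```python
import numpy as np

def modality_dropout(features, p_drop=0.3, rng=None):
    """Randomly zero out entire modalities during training so the
    fusion network learns to cope with missing or noisy sensors.

    features: list of per-modality feature arrays (e.g. RGB, depth).
    p_drop:   probability of dropping each modality independently.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    kept = []
    for f in features:
        if rng.random() < p_drop:
            kept.append(np.zeros_like(f))  # modality treated as missing
        else:
            kept.append(f)
    # Guarantee at least one modality survives.
    if all(not k.any() for k in kept):
        i = rng.integers(len(features))
        kept[i] = features[i]
    return np.concatenate(kept, axis=-1)  # simple early-fusion concat

rgb = np.ones((4, 8))    # toy RGB features, batch of 4
depth = np.ones((4, 8))  # toy depth features
fused = modality_dropout([rgb, depth])
```

Dropping whole feature blocks, rather than individual units, is what forces the fused representation not to rely on any single sensor.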
Title: VideoLSTM Convolves, Attends and Flows for Action Recognition
Abstract: We present a new architecture for end-to-end sequence learning of actions in video, which we call VideoLSTM. Rather than adapting the video to the peculiarities of established recurrent or convolutional architectures, we adapt the architecture to fit the requirements of the video medium. Starting from the soft-Attention LSTM, VideoLSTM makes three novel contributions. First, video has a spatial layout; to exploit this spatial correlation, we hardwire convolutions into the soft-Attention LSTM architecture. Second, motion not only informs us about the action content, but also better guides attention towards the relevant spatio-temporal locations; we introduce motion-based attention. Finally, we demonstrate how the attention from VideoLSTM can be used for action localization by relying on just the action class label. Experiments and comparisons on challenging datasets for action classification and localization support our claims.
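As a rough illustration of the soft spatial attention that VideoLSTM builds on, the sketch below pools a convolutional feature map with a softmax-normalised score map. In VideoLSTM the scores would come from the recurrent network and be informed by motion; everything here, including the function name and shapes, is an invented simplification:

```python
import numpy as np

def spatial_attention(feature_map, score_map):
    """Soft spatial attention: normalise a score map over all H*W
    locations and use it to pool a convolutional feature map into a
    single feature vector.

    feature_map: (H, W, C) activations
    score_map:   (H, W) unnormalised attention scores
    """
    h, w, c = feature_map.shape
    weights = np.exp(score_map - score_map.max())
    weights /= weights.sum()  # softmax over spatial locations
    return (feature_map * weights[..., None]).reshape(h * w, c).sum(0)

feats = np.random.default_rng(0).normal(size=(7, 7, 16))
scores = np.zeros((7, 7))  # uniform scores -> plain average pooling
pooled = spatial_attention(feats, scores)
```

With uniform scores this reduces to average pooling; peaked scores instead focus the pooled vector on a few locations, which is what makes the attention maps usable for localization.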
Title: VQA vs. AI
Abstract: Deep Learning has enabled incredible developments in Vision, but primarily for a particular set of problems. Such problems are typically defined well in advance of needing to be solved, and allow the generation of large amounts of labelled training data. Visual Question Answering lies on the edge of this set, because the Visual Question is specified at run time and it is not feasible to generate enough supervised data to train a method capable of answering general Visual Questions. One potential method for reducing the training data required is to use reasoning, particularly about external sources of knowledge. This talk will describe some of the steps we have been taking towards integrating external knowledge and reasoning into the VQA process, and why this takes us a step closer to AI.
Title: Optimising nested functions using auxiliary coordinates
Abstract: Deep learning usually refers to training neural networks with several layers of weights. Mathematically, we can regard deep learning as training "nested" nonlinear parametric functions, based on the composition of multiple, simpler functions or processing layers. Beyond neural nets, many models in machine learning, computer vision or speech processing have this form: for example, reducing dimensionality or extracting features prior to classification, or encoding an image as a bit string prior to search. Composition can construct powerful, sophisticated models, but it also gives rise to nonconvex objective functions. Joint estimation of the parameters of all the layers, and selection of an optimal architecture, is widely considered a difficult numerical nonconvex optimisation problem. Usually, it is tackled with nonlinear gradient-based optimisation methods, where the gradient is computed with the chain rule. However, this requires significant expertise and effort, and is difficult to parallelise for execution in a distributed computation environment. I will describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms (considerably simplifying algorithm development), introduces significant parallelism not present in the original problem, has provable convergence with differentiable layers, and applies even when parameter derivatives are not available or not desirable (so that computing gradients with the chain rule does not apply).
I will illustrate how to use MAC to derive training algorithms for a range of problems, such as supervised dimensionality reduction, binary hashing using binary autoencoders or affinity-based objective functions, nonlinear embeddings, deep nets, and others. If time permits, I will describe ParMAC, a parallel, distributed processing model for MAC. ParMAC is suitable for large-scale problems, where distributing the computation is necessary for faster training, or where the dataset may not fit on a single machine, and where it is essential to limit the amount of communication between machines. This is based on work with my students Mehdi Alizadeh, Ramin Raziperchikolaei, Max Vladymyrov and Weiran Wang.
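To make the alternation concrete, here is a toy MAC-style sketch for a two-layer linear model y ≈ (x W1) W2. An auxiliary coordinate z_n is introduced for each data point, and the optimisation alternates a W-step (each layer fitted independently) with a Z-step (closed-form update, decoupled across points). Real MAC handles nonlinear layers and drives the penalty parameter mu to infinity; treat this purely as an illustration of the step structure:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # inputs
Y = X @ rng.normal(size=(5, 3)) @ rng.normal(size=(3, 2))  # targets

mu = 1.0                                            # penalty weight (fixed here)
Z = rng.normal(size=(100, 3))                       # auxiliary coordinates
for _ in range(50):
    # W-step: with Z fixed, each layer is an independent least-squares fit
    W1 = np.linalg.lstsq(X, Z, rcond=None)[0]       # layer 1: X -> Z
    W2 = np.linalg.lstsq(Z, Y, rcond=None)[0]       # layer 2: Z -> Y
    # Z-step: minimise ||Y - Z W2||^2 + mu ||Z - X W1||^2 over Z,
    # which decouples across data points and has a closed form
    A = W2 @ W2.T + mu * np.eye(3)
    Z = np.linalg.solve(A, W2 @ Y.T + mu * (X @ W1).T).T

err = np.linalg.norm(Y - (X @ W1) @ W2)             # nested-model residual
```

Note that neither step needs chain-rule gradients through the composition, and both the per-layer fits and the per-point Z updates are trivially parallelisable, which is the point of the method.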
Title: Connecting Images and Natural Language
Abstract: Intelligent agents require the ability to perceive their environments, understand their high-level semantics, and communicate with humans. While computer vision has recently made great strides in visual recognition, the predominant paradigm is to predict one or more fixed visual categories for each image. In this talk I will discuss recent advances that allow us to significantly expand the vocabulary of computer vision systems by treating natural language as a label space. In particular, I will talk about recent progress in areas such as image-sentence ranking, (dense) image captioning and visual Q&A, present recent state-of-the-art approaches, and discuss current limitations and avenues for future work.
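As a small illustration of the image-sentence ranking task mentioned above: once images and sentences have been mapped into a (hypothetical) joint embedding space, ranking reduces to sorting by cosine similarity. The embeddings below are made up for the example:

```python
import numpy as np

def rank_images(sentence_vec, image_vecs):
    """Image-sentence ranking in a joint embedding space: score each
    image by cosine similarity to the sentence embedding and return
    image indices sorted best match first."""
    s = sentence_vec / np.linalg.norm(sentence_vec)
    V = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = V @ s                     # cosine similarity per image
    return np.argsort(-sims)         # descending order of similarity

sent = np.array([1.0, 0.0, 0.0])     # toy sentence embedding
images = np.array([[0.0, 1.0, 0.0],  # unrelated image
                   [0.9, 0.1, 0.0],  # good match
                   [0.5, 0.5, 0.0]]) # partial match
order = rank_images(sent, images)
```

The hard part, of course, is learning the joint embedding itself; the ranking step is deliberately trivial.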
NVIDIA sponsored the best paper award, a Titan X board, which went to:
- ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation. Francesco Visin et al.
Description of the workshop
The goal of the DeepVision Workshop is to accelerate the study of deep learning algorithms in computer vision problems. With the rapid growth of digital photography and the advances in storage devices over the last decade, we have seen explosive growth in the available amount of visual data, and equally explosive growth in the computational capacity for image understanding. Instead of hand-crafting features, recent advances in deep learning suggest an emerging approach: learning useful representations directly from data for many computer vision tasks. We encourage researchers to formulate innovative learning theories, feature representations, and end-to-end vision systems based on deep learning. We also encourage new theories and processes for dealing with large-scale image datasets through deep learning architectures. We are soliciting original contributions that address a wide range of theoretical and practical issues including, but not limited to:
- Supervised and unsupervised algorithms in computer vision,
- Deep learning hardware and software architecture,
- Advancements in deep learning,
- Large scale computer vision problems including object recognition, scene analysis, industrial and medical applications.
Program Committee (tentative)
- Aaron Courville, Universite de Montreal
- Adriana Romero, Universite de Montreal
- Agata Lapedriza, Universitat Oberta de Catalunya
- Andrea Vedaldi, Oxford University
- Andreas Muller, University of Bonn
- Andrew Bagdanov, Computer Vision Center
- Angjoo Kanazawa, University of Maryland
- Baochen Sun, University of Massachusetts Lowell
- Cees Snoek, University of Amsterdam
- Chris McCool, Queensland University of Technology
- Christian Wolf, Universite de Lyon
- David Jacobs, University of Maryland
- Dumitru Erhan, Google
- Fang Wang, NICTA
- Gary Huang, University of Massachusetts Amherst
- German Ros, Computer Vision Center
- Graham Taylor, University of Guelph
- Hao Zhou, University of Maryland
- Hossein Azizpour, KTH
- Joost van de Weijer, Computer Vision Center
- Le Kang, University of Maryland
- Lisa Anne Hendricks, UC Berkeley
- Michael Pfeiffer, ETH Zurich
- Nathan Silberman, New York University
- Nicholas Leonard, Nikopia
- Ning Zhang, Snapchat
- Pierre Sermanet, Google
- Quoc Le, Stanford
- Rob Fergus, Facebook
- Serena Yeung, Stanford
- Sergio Guadarrama, Google
- Subhashini Venugopalan, U. Texas
- Ross Girshick, Facebook
- Sean Bell, Cornell University
- Kavita Bala, Cornell University