SPCOM2020

Welcome to SPCOM 2020!

INSTITUTIONAL SPONSOR

IISc

TECHNICAL CO-SPONSORS

IEEE

DIAMOND SPONSORS

Qualcomm
Cisco

GOLD SPONSORS

Texas Instruments
Amazon

SILVER SPONSOR

Tejas Networks

AWARDS SPONSOR

Springer Nature

DEEP LEARNING FOR COMPUTER VISION

Organizers: R. Venkatesh Babu and Anirban Chakraborty, Department of Computational and Data Sciences, Indian Institute of Science, Bangalore
Venkatesh Babu

Venkatesh Babu received his Ph.D. degree from the Department of Electrical Engineering, Indian Institute of Science, Bangalore. Thereafter, he held postdoctoral positions at NTNU, Norway, and IRISA/INRIA, Rennes, France, through an ERCIM fellowship. Subsequently, he worked as a research fellow at NTU, Singapore. He spent a couple of years in industry before returning to the institute in August 2010. He is currently an Associate Professor at CDS, IISc.

Anirban Chakraborty

Anirban Chakraborty has been a faculty member at CDS, IISc since June’17. From December’16 to June’17, he worked as a researcher at the Robert Bosch Research and Technology Centre, India. Before that, Dr. Chakraborty was a research fellow in the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University, Singapore, working on large-scale, deployable video surveillance systems. From October’14 to November’15, he worked as a post-doctoral fellow in the A*STAR-NUS Clinical Imaging Research Centre at the National University of Singapore. Dr. Chakraborty completed his Ph.D. in Electrical Engineering at the University of California, Riverside, under the supervision of Dr. Amit K. Roy-Chowdhury in August’14. He also obtained his M.S. from the same institution in 2010 and, prior to that, his bachelor’s degree in Electrical Engineering from Jadavpur University, India.

SPEAKER            | TITLE OF THE TALK
Arjun Jain         | What Does it Take to Build an Autonomous Car?
Varun Jampani      | Content-Adaptive Convolutional Neural Networks
Makarand Tapaswi   | Machine Understanding of Social Situations
Anand Mishra       | Reading Text in Scene Images, Bridging it to World Knowledge, and Beyond
Konda Reddy Mopuri | Dataset Condensation for Efficient DNN Training

What Does it Take to Build an Autonomous Car?

Arjun Jain, Axogyan AI
Arjun Jain

Arjun Jain is Founder and Chief Scientist at Axogyan AI. Previously, he co-founded Perceptive Code LLC, a Silicon Valley startup that builds intelligence into automobiles. He is also an Adjunct Faculty member at the CDS department at IISc, where he leads a research group in deep learning. From 2016 to 2019, he was an Adjunct Assistant Professor at IIT Bombay, where he taught Computer Vision. Before that, he was a post-doctoral researcher in the Computer Science department at New York University's Courant Institute, where he worked with Turing Award winner Dr. Yann LeCun. He received his Ph.D. in Computer Science from the Max Planck Institute for Informatics in Germany. Broadly, his research lies at the interface of computer graphics, computer vision, and machine learning, with a focus on human pose estimation and data-driven artistic content creation tools. Arjun has worked for several companies, including Yahoo! in Bangalore, Weta Digital in New Zealand, and Apple in Cupertino. He has been especially active in industries related to human sensing. Arjun served as a developer for Weta Digital’s motion capture system, which has been used in many feature films; he was credited for his work on Steven Spielberg’s The Adventures of Tintin (2011). Arjun’s work has resulted in several academic publications (currently over 3000 citations) and patents, and has been featured by mainstream media outlets including New Scientist, Discovery, BBC, Vogue, Wired, India Today, and The Hollywood Reporter.

Abstract: In this talk, we take a closer look at autonomous vehicles and their various subsystems. We start with a bit of history and then dig into the modules, such as perception, localization, prediction, planning, and control, that enable autonomous driving. We conclude with the field's current challenges and a discussion of how to overcome them.
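To make the module decomposition concrete, here is a toy Python skeleton of such a modular stack; every interface, field name, and stub behavior below is an illustrative assumption, not a description of any real system.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Command:
    steering: float  # rad
    throttle: float  # in [0, 1]
    brake: float     # in [0, 1]

class AutonomyStack:
    """Toy skeleton of the perception -> localization -> prediction ->
    planning -> control pipeline; trivial stubs stand in for each module."""

    def perception(self, sensors: Dict[str, Any]) -> List[dict]:
        return sensors.get("detections", [])  # detected and tracked agents

    def localization(self, sensors: Dict[str, Any]) -> dict:
        return sensors.get("pose", {"x": 0.0, "y": 0.0, "yaw": 0.0})

    def prediction(self, agents: List[dict], pose: dict) -> List[dict]:
        return [{**a, "future": []} for a in agents]  # forecast agent motion

    def planning(self, pose: dict, agents: List[dict]) -> List[dict]:
        return [{"x": pose["x"] + 1.0, "y": pose["y"]}]  # next waypoint(s)

    def control(self, plan: List[dict], pose: dict) -> Command:
        return Command(steering=0.0, throttle=0.2, brake=0.0)

    def step(self, sensors: Dict[str, Any]) -> Command:
        agents = self.perception(sensors)
        pose = self.localization(sensors)
        futures = self.prediction(agents, pose)
        plan = self.planning(pose, futures)
        return self.control(plan, pose)
```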

Content-Adaptive Convolutional Neural Networks

Varun Jampani, Google Research, Cambridge, US
Varun Jampani

Varun Jampani is a research scientist at Google Research in Cambridge, US. Prior to that, he was a research scientist at Nvidia Research. He works in the areas of machine learning and computer vision, and his main research interests include content-adaptive neural networks, self-supervised visual discovery, and novel view synthesis. He obtained his PhD with highest honors from the Max Planck Institute for Intelligent Systems (MPI) and the University of Tübingen in Tübingen, Germany. He obtained his BTech and MS from the International Institute of Information Technology, Hyderabad (IIIT-H), India, where he was a gold medalist. He was an outstanding reviewer at CVPR’19, CVPR’18, CVPR’17, and ECCV’16. His work on ‘SplatNet’ received the ‘Best Paper Honorable Mention’ award at CVPR’18.

Abstract: Convolutions are the basic building blocks of CNNs. We propose two generalizations of standard convolutions that make them content-adaptive.

1. Sparse high-dimensional networks: We propose a generalization of the convolution operation to process sparse, high-dimensional data (for instance, data that lies in a 5D space such as XYRGB). Specifically, we make use of permutohedral lattices and hash tables for efficient processing of sparse high-dimensional data. The ability to learn generic high-dimensional filters allows us to stack several parallel and sequential filters, as in a CNN, resulting in a new breed of neural networks that we call ‘Bilateral Neural Networks’ (BNNs). We demonstrate the use of BNNs on several 2D, video, and 3D vision tasks. (A toy sketch of the underlying splat-blur-slice pipeline follows this list.)

2. Pixel-adaptive convolutions (PAC): A simple yet effective modification of standard convolutions, in which the filter weights are multiplied by a spatially varying kernel that depends on learnable, local pixel features. PAC is a generalization of several popular filtering techniques and can thus be used for a wide range of use cases. Specifically, we demonstrate state-of-the-art performance when PAC is used for deep joint image upsampling. PAC also offers an effective alternative to the fully-connected CRF (Full-CRF), called PAC-CRF. In addition, we demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements. (A minimal PAC sketch also follows this list.)
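First, a toy illustration of the splat-blur-slice idea behind high-dimensional filtering, using a quantized cubic grid and a hash table in place of the permutohedral lattice; the names are our own, and the real method uses barycentric splatting on the lattice rather than the nearest-cell rounding used here.

```python
import numpy as np
from collections import defaultdict

def splat_blur_slice(values, features, cell=0.5):
    """Toy 'splat -> blur -> slice' filtering on a quantized cubic grid.

    values:   (N, C) signals to filter (e.g., pixel colours)
    features: (N, D) high-dimensional positions (e.g., XYRGB)
    """
    keys = np.floor(features / cell).astype(int)

    # Splat: accumulate values (plus a count) into occupied grid cells only,
    # stored in a hash table so memory scales with the occupied cells.
    grid = defaultdict(lambda: np.zeros(values.shape[1] + 1))
    for key, v in zip(map(tuple, keys), values):
        grid[key] += np.append(v, 1.0)

    # Blur: average each cell with its 2*D axis-aligned neighbours.
    blurred = {}
    for key, acc in grid.items():
        total = acc.copy()
        for d in range(len(key)):
            for step in (-1, 1):
                nb = list(key)
                nb[d] += step
                total += grid.get(tuple(nb), 0.0)
        blurred[key] = total

    # Slice: read the filtered signal back out at every input point and
    # normalize by the blurred counts (homogeneous coordinates).
    out = np.stack([blurred[tuple(k)] for k in keys])
    return out[:, :-1] / np.maximum(out[:, -1:], 1e-8)
```

With features set to XYRGB coordinates, this reduces to a coarse bilateral filter of the image; in a BNN the fixed blur is replaced by learned high-dimensional filters.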
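Second, a minimal sketch of a pixel-adaptive convolution in PyTorch, assuming a fixed Gaussian kernel on guidance-feature differences (the general form in this line of work allows other, learnable kernels); all names here are ours.

```python
import torch
import torch.nn.functional as F

def pixel_adaptive_conv(x, guide, weight):
    """Minimal pixel-adaptive convolution (PAC) sketch.

    x:      (B, C_in, H, W) input features
    guide:  (B, C_g, H, W) guidance features that adapt the kernel
    weight: (C_out, C_in, k, k) shared convolution weights
    """
    B, C, H, W = x.shape
    k = weight.shape[-1]
    pad = k // 2

    # Gather k*k neighbourhoods of the input and the guidance features.
    x_unf = F.unfold(x, k, padding=pad).view(B, C, k * k, H, W)
    g_unf = F.unfold(guide, k, padding=pad).view(B, -1, k * k, H, W)

    # Spatially varying kernel: a Gaussian on guidance-feature differences
    # between each pixel and its neighbours.
    diff = g_unf - guide.unsqueeze(2)            # (B, C_g, k*k, H, W)
    kern = torch.exp(-0.5 * (diff ** 2).sum(1))  # (B, k*k, H, W)

    # Modulate neighbourhood values by the kernel, then apply the shared
    # convolution weights; kern == 1 recovers a standard convolution.
    x_mod = (x_unf * kern.unsqueeze(1)).reshape(B, C * k * k, H, W)
    w = weight.reshape(weight.shape[0], -1)      # (C_out, C_in * k * k)
    return torch.einsum('om,bmhw->bohw', w, x_mod)

# Example: y = pixel_adaptive_conv(torch.randn(1, 8, 32, 32),
#                                  torch.randn(1, 4, 32, 32),
#                                  torch.randn(16, 8, 3, 3))  # (1, 16, 32, 32)
```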

Machine Understanding of Social Situations

Makarand Tapaswi, Inria Paris
Makarand Tapaswi

Makarand Tapaswi is a Post-Doctoral Fellow at Inria Paris, working on several topics revolving around machine understanding of videos and language, and their application to robotics. Previously, Makarand was a Post-Doctoral Fellow with the Machine Learning group at the University of Toronto, where he worked on teaching machines about human behavior and analyzing storylines. Makarand completed his PhD at Karlsruhe Institute of Technology, Germany, in 2016, introducing several novel problems, such as aligning books with movies and plot synopses (from Wikipedia) with TV episodes, visualizing these alignments in a style inspired by XKCD, and the first story question-answering challenge, MovieQA.

Abstract: There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs, which provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes that capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and with reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with timestamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We present a method for querying videos with graphs, interaction understanding via ordering, and reason understanding. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and it opens up an exciting avenue towards socially intelligent AI agents.
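To make the annotation structure concrete, here is a minimal Python sketch of what one such clip graph might look like in code; the class and field names are our illustrative guesses from the description above, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Character:
    name: str
    emotions: List[str] = field(default_factory=list)    # e.g., ["anxious"]
    attributes: List[str] = field(default_factory=list)  # e.g., ["elderly"]

@dataclass
class Interaction:
    source: str                   # character performing the interaction
    target: str
    label: str                    # e.g., "apologizes to"
    topic: Optional[str] = None   # additional detail about the interaction
    reason: Optional[str] = None  # motivation for the action
    span: Optional[Tuple[float, float]] = None  # timestamp grounding (seconds)

@dataclass
class ClipGraph:
    characters: List[Character]
    relationships: List[Tuple[str, str, str]]  # e.g., ("Ann", "parent of", "Bo")
    interactions: List[Interaction]
```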

Reading Text in Scene Images, Bridging it to World Knowledge, and Beyond

Anand Mishra, Department of Computer Science and Engineering, Indian Institute of Technology Jodhpur
Anand Mishra

Anand Mishra is a faculty member at the Department of Computer Science and Engineering at IIT Jodhpur, where his group focuses on problems intersecting vision, language, and knowledge graphs. Previously, Anand was a postdoctoral fellow at IISc Bangalore, working with Dr. Partha Talukdar and Dr. Anirban Chakraborty on knowledge-aware computer vision. He received his Ph.D. from IIIT Hyderabad, working on scene text understanding under the supervision of Prof. C. V. Jawahar and Dr. Karteek Alahari (Inria). Anand is a recipient of the prestigious Microsoft Research India Ph.D. Fellowship (2012) and was first runner-up for the XRCI doctoral dissertation award (2015).

Abstract: Reading text in scene images is an important step towards scene understanding. In the first part of my talk, I will present a discrete optimization formulation for reading text in scene images. In this formulation, we seamlessly integrate character recognition scores and language models in order to recognize text. In the second part of my talk, I will present our recent work on bridging text recognition and knowledge graphs, demonstrating knowledge-aware visual question answering (KVQA) that can read and reason. Finally, I will conclude my talk by discussing various future research avenues in the broader theme of "knowledge-aware computer vision".
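As a rough illustration of how character recognition scores and a language model can be fused in a discrete optimization, here is a chain-structured Viterbi sketch; the talk's formulation is more general, and the function names, bigram form, and trade-off weight are our simplifying assumptions.

```python
import numpy as np

def decode_word(char_logp, bigram_logp, lam=0.5):
    """Decode a word by dynamic programming over character hypotheses.

    char_logp:   (T, V) log-scores from a character classifier for T windows
    bigram_logp: (V, V) log-probabilities of character bigrams (language model)
    lam:         trade-off between recognition and language-model evidence
    Returns the highest-scoring sequence of symbol indices.
    """
    T, V = char_logp.shape
    score = char_logp[0].copy()         # best score ending in each symbol
    back = np.zeros((T, V), dtype=int)  # backpointers

    for t in range(1, T):
        # cand[i, j]: score of being at symbol i, then emitting symbol j.
        cand = score[:, None] + lam * bigram_logp
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + char_logp[t]

    # Backtrack the best path from the final best symbol.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```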

Dataset Condensation for Efficient DNN Training

Konda Reddy Mopuri, Visual Computing Group (VICO), School of Informatics, University of Edinburgh, UK
Konda Reddy Mopuri

Konda Reddy Mopuri is a post-doctoral researcher at the School of Informatics, University of Edinburgh, UK. He received his PhD from the Dept. of Computational and Data Sciences (CDS), Indian Institute of Science, Bangalore. He did his MTech in Visual Information Processing & Embedded Systems at the Dept. of ECE, IIT Kharagpur. His research interests broadly span signal/image processing, computer vision, and deep learning. In his PhD work, he studied the robustness, adaptability, and interpretability of deep neural networks. His research has resulted in high-impact publications at various top-tier international venues, such as ICML, CVPR, ECCV, BMVC, IEEE TPAMI, and IEEE TIP. For more details, please see https://sites.google.com/site/kreddymopuri/

Abstract: Model distillation aims to distill the knowledge of a complex model into a simpler one. Here, by contrast, we consider distilling the knowledge of large training datasets into smaller sets. Efficient training of deep neural networks is an increasingly important problem in the era of sophisticated architectures and large-scale datasets. We propose a training-set synthesis technique, called Dataset Condensation, that learns to produce a small set of informative samples for training deep neural networks. The resulting samples, though out-of-distribution, enable training models from scratch at a small fraction of the computational cost required on the original data, while achieving comparable results. We rigorously evaluate our method on several computer vision benchmarks and across multiple model architectures, and we demonstrate strong generalization. We also show potential applications of our method in continual learning and domain adaptation.
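One published way to learn such a synthetic set is gradient matching: update the synthetic images so that the gradients they induce in a network match those induced by real data. Below is a minimal PyTorch sketch of one such update under our own simplifying assumptions (a single network state, cosine distance per parameter tensor); the full algorithm alternates this with training on the synthetic set across many network initializations.

```python
import torch
import torch.nn.functional as F

def condense_step(model, real_x, real_y, syn_x, syn_y, syn_opt):
    """One gradient-matching update of the synthetic set (sketch).

    syn_x: small learnable image tensor (requires_grad=True)
    syn_y: fixed labels for syn_x; syn_opt: optimizer over [syn_x]
    real_x, real_y: a batch drawn from the original dataset
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradients of the training loss on real data (treated as constants).
    g_real = torch.autograd.grad(
        F.cross_entropy(model(real_x), real_y), params)
    g_real = [g.detach() for g in g_real]

    # Gradients on synthetic data, kept differentiable so the matching
    # loss can backpropagate into the synthetic images themselves.
    g_syn = torch.autograd.grad(
        F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)

    # Matching loss: cosine distance between the two gradient sets.
    match = sum(1.0 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
                for a, b in zip(g_syn, g_real))

    syn_opt.zero_grad()
    match.backward()
    syn_opt.step()
    return float(match)
```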

Copyright © SPCOM 2020