• The THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition.
• In this paper we describe the THUMOS benchmark in detail.
• We give an overview of data collection and annotation procedures.
• We present results of submissions to the THUMOS 2015 challenge and review the participating approaches.
• We conclude by proposing several directions and improvements for future THUMOS challenges.
Automatically recognizing and localizing a wide range of human actions is crucial for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including the THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include ‘background videos’, which share scenes and backgrounds similar to those of action videos but are devoid of the specific actions. The three editions of the challenge organized in 2013–2015 have made THUMOS a common benchmark for action classification and detection, and the annual challenge is widely attended by teams from around the world.
In this paper we describe the THUMOS benchmark in detail and give an overview of data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal action detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges.
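As an illustration of how temporal action detection is commonly scored, the sketch below computes the temporal intersection-over-union (IoU) between a predicted and a ground-truth segment and applies an overlap threshold. The abstract does not spell out the protocol, so the function names and the 0.5 threshold are assumptions made for illustration only, not the challenge's official evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time intervals in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    # Length of the overlap; zero when the intervals are disjoint.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred, gt, threshold=0.5):
    """Hypothetical helper: a detection counts as correct when its IoU
    with the ground-truth segment reaches the (assumed) threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(is_correct_detection((0.0, 10.0), (1.0, 10.0)))
```

In a full evaluation, detections would additionally be ranked by confidence and matched to ground truth per class before averaging precision, which this sketch omits.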
Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure, even though many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular tool for imposing such high-level intuitions in the formulation of real-world problems. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs with the sequence learning success of Recurrent Neural Networks (RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled, as it can be used for transforming any spatio-temporal graph by following a set of well-defined steps. The evaluations of the proposed approach on a diverse set of problems, ranging from modeling human motion to object interactions, show improvement over the state of the art by a large margin. We expect this method to empower new approaches to problem formulation through high-level spatio-temporal graphs and Recurrent Neural Networks.
Developing visual perception models for active agents and sensorimotor control in the physical world is cumbersome, as existing algorithms are too slow to learn efficiently in real time and robots are fragile and costly. This has given rise to learning in simulation, which in turn raises the question of whether the results transfer to the real world. In this paper, we investigate developing real-world perception for active agents, propose Gibson Environment for this purpose, and showcase a set of perceptual tasks learned therein. Gibson is based upon virtualizing real spaces, rather than artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real world and reflecting its semantic complexity, II. having an internal synthesis mechanism, "Goggles", that enables deploying the trained models in the real world without needing domain adaptation, III. embodiment of agents and making them subject to constraints of physics and space.
3D Semantic Parsing of Large-Scale Indoor Spaces Armeni, Iro; Sener, Ozan; Zamir, Amir R. ...
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
06/2016
Conference Proceeding
Open access
In this paper, we propose a method for semantically parsing the 3D point cloud of an entire building using a hierarchical approach: first, the raw data is parsed into semantically meaningful spaces (e.g. rooms, etc) that are aligned into a canonical reference coordinate system. Second, the spaces are parsed into their structural and building elements (e.g. walls, columns, etc). Performing these with a strong notion of global 3D space is the backbone of our method. The alignment in the first step injects strong 3D priors from the canonical coordinate system into the second step for discovering elements. This allows handling diverse challenging scenarios, as man-made indoor spaces often show recurrent geometric patterns while the appearance features can change drastically. We also argue that identification of structural elements in indoor spaces is essentially a detection problem, rather than segmentation, which is commonly used. We evaluated our method on a new dataset of several buildings with a covered area of over 6,000 m² and over 215 million points, demonstrating robust results readily useful for practical applications.
Robust Learning Through Cross-Task Consistency Zamir, Amir R.; Sax, Alexander; Cheerla, Nikhil ...
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Conference Proceeding
Visual perception entails solving a wide set of tasks (e.g., object detection, depth estimation, etc). The predictions made for different tasks from one image are not independent, and therefore are expected to be 'consistent'. We propose a flexible and fully computational framework for learning while enforcing Cross-Task Consistency (X-TAC). The proposed formulation is based on 'inference path invariance' over an arbitrary graph of prediction domains. We observe that learning with cross-task consistency leads to more accurate predictions, better generalization to out-of-distribution samples, and improved sample efficiency. This framework also leads to a powerful unsupervised quantity, called 'Consistency Energy', based on measuring the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), thus it can be employed as an unsupervised robustness metric as well as for detection of out-of-distribution inputs (AUC=0.99). The evaluations were performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape.
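To make the idea of 'inference path invariance' concrete, the toy sketch below compares two routes to the same prediction domain: a direct predictor from the input, and a path that goes through an intermediate domain. The three predictors are hypothetical stand-ins (simple list transforms, not neural networks), and the mean absolute difference used here is an assumed discrepancy measure rather than the paper's exact formulation.

```python
def path_discrepancy(x, f_x_y1, f_x_y2, f_y1_y2):
    """Mean absolute difference between the direct path x -> y2
    and the indirect path x -> y1 -> y2."""
    direct = f_x_y2(x)
    via_y1 = f_y1_y2(f_x_y1(x))
    return sum(abs(a - b) for a, b in zip(direct, via_y1)) / len(direct)


# Hypothetical toy predictors operating on lists of floats.
def to_y1(x):
    return [2.0 * v for v in x]


def y1_to_y2(y1):
    return [v + 1.0 for v in y1]


def consistent_to_y2(x):
    # Agrees with the composed path, so the discrepancy is zero.
    return [2.0 * v + 1.0 for v in x]


def inconsistent_to_y2(x):
    # Deviates from the composed path by a constant offset.
    return [2.0 * v + 1.5 for v in x]


if __name__ == "__main__":
    sample = [0.0, 1.0, 2.0]
    print(path_discrepancy(sample, to_y1, consistent_to_y2, y1_to_y2))
    print(path_discrepancy(sample, to_y1, inconsistent_to_y2, y1_to_y2))
```

A discrepancy of this kind can be used as a training penalty or, aggregated over many paths, as an unsupervised signal in the spirit of the Consistency Energy described above.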
Unsupervised Semantic Parsing of Video Collections Sener, Ozan; Zamir, Amir R.; Savarese, Silvio ...
2015 IEEE International Conference on Computer Vision (ICCV)
12/2015
Conference Proceeding, Journal Article
Human communication typically has an underlying structure. This is reflected in the fact that in many user-generated videos, a starting point, an ending, and certain objective steps between these two can be identified. In this paper, we propose a method for parsing a video into such semantic steps in an unsupervised way. The proposed method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish this utilizing both visual and language cues in a joint generative model. The proposed method can also provide a textual description for each of the identified semantic steps and video segments. We evaluate this method on a large number of complex YouTube videos and show results of unprecedented quality for this new and impactful problem.
Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying the existence of a structure among visual tasks. Knowing this structure has notable value; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity. We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. We provide a set of tools for computing and probing this taxonomical structure, including a solver users can employ to find supervision policies for their use cases.
A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, 3D shapes, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, shape and other attributes), rooms (e.g., function, illumination type, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations.
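One simple way to picture multi-view consistency enforcement is a majority vote: 2D detections from different camera locations are projected onto shared 3D mesh faces, and each face keeps the label most views agree on. The abstract does not describe the actual mechanism, so the voting scheme, the `fuse_labels` name, and the `(face_id, camera_id, label)` tuple layout below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fuse_labels(detections):
    """Fuse per-view labels into one label per mesh face by majority vote.

    detections: iterable of (face_id, camera_id, label) tuples, where each
    tuple is a 2D detection already projected onto a 3D mesh face
    (a hypothetical input format).
    """
    votes = defaultdict(Counter)
    for face_id, _camera_id, label in detections:
        votes[face_id][label] += 1
    # Keep the most frequent label observed for each face.
    return {face: counts.most_common(1)[0][0] for face, counts in votes.items()}


if __name__ == "__main__":
    observed = [
        (0, "cam_a", "chair"),
        (0, "cam_b", "chair"),
        (0, "cam_c", "sofa"),
        (1, "cam_a", "table"),
    ]
    print(fuse_labels(observed))
```

A practical system would likely weight votes by detector confidence and viewing geometry; the unweighted vote here is only the simplest instance of the idea.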
Feedback Networks Zamir, Amir R.; Te-Lin Wu; Lin Sun ...
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
07/2017
Conference Proceeding
Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one such successive representation. However, an alternative that can achieve the same goal is a feedback-based approach in which the representation is formed in an iterative manner based on feedback received from the previous iteration's output. We establish that a feedback-based approach has several core advantages over feedforward: it enables making early predictions at query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback develops a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We provide a general feedback-based learning architecture, instantiated using existing RNNs, with endpoint results on par with or better than existing feedforward networks, in addition to the above advantages.