Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language as humans do, is their lack of common-sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects of actions and scenes. In this work, we describe our ongoing collection of the "something-something" database of video prediction tasks whose solutions require a common-sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption templates. We also describe the challenges in crowd-sourcing this data at scale.
The first steps of visual processing are often described as a bank of oriented filters followed by divisive normalization. This approach has been tremendously successful at predicting contrast thresholds in simple visual displays. However, it is unclear to what extent this kind of architecture also supports processing in more complex visual tasks performed in natural-looking images. We used a deep generative image model to embed arc segments with different curvatures in naturalistic images. These images contain the target as part of the image scene, resulting in considerable appearance variation of the target as well as the background. Three observers localized arc targets in these images, with an average accuracy of 74.7%. Data were fit by several biologically inspired models, four standard deep convolutional neural networks (CNNs), and a five-layer CNN specifically trained for this task. Four models predicted observer responses particularly well: (1) a bank of oriented filters, similar to complex cells in primate area V1; (2) a bank of oriented filters followed by tuned gain control, incorporating knowledge about cortical surround interactions; (3) a bank of oriented filters followed by local normalization; and (4) the five-layer CNN. A control experiment with optimized stimuli based on these four models showed that the observers' data were best explained by model (2) with tuned gain control. These data suggest that standard models of early vision provide good descriptions of performance in much more complex tasks than those they were designed for, while general-purpose nonlinear models such as convolutional neural networks do not.
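The "oriented filters followed by divisive normalization" architecture mentioned above can be illustrated with a minimal sketch. This is not the authors' model: the Gabor parameterization, the phase-quadrature energy computation, the pooling over orientations at a single location, and all names and parameter values (`gabor`, `orientation_energy`, `divisive_normalization`, `wavelength`, `sigma`, `semisaturation`) are assumptions chosen only for illustration.

```python
import math

def gabor(size, theta, wavelength, sigma, phase):
    """Square Gabor kernel (odd size) whose carrier varies along direction theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the direction of luminance modulation.
            u = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * u / wavelength + phase))
        kernel.append(row)
    return kernel

def dot(a, b):
    """Sum of elementwise products of two equally sized 2D lists."""
    return sum(p * q for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def orientation_energy(patch, thetas, wavelength=6.0, sigma=3.0):
    """Phase-invariant energy per orientation (complex-cell-like, assumed form)."""
    size = len(patch)
    energies = []
    for theta in thetas:
        even = dot(patch, gabor(size, theta, wavelength, sigma, 0.0))
        odd = dot(patch, gabor(size, theta, wavelength, sigma, math.pi / 2))
        energies.append(even ** 2 + odd ** 2)
    return energies

def divisive_normalization(energies, semisaturation=1.0):
    """Divide each energy by a constant plus the energy pooled over all filters."""
    pool = semisaturation ** 2 + sum(energies)
    return [e / pool for e in energies]

# Hypothetical stimulus: a 15x15 grating that varies along x (theta = 0).
thetas = [k * math.pi / 4 for k in range(4)]
patch = [[math.cos(2.0 * math.pi * x / 6.0) for x in range(-7, 8)]
         for _ in range(-7, 8)]
energies = orientation_energy(patch, thetas)
responses = divisive_normalization(energies)
```

In this sketch the filter matched to the grating's orientation dominates, and the normalized responses are bounded because each is divided by the pooled energy plus a constant. Tuned gain control, as in model (2), would instead weight the pool by orientation and spatial surround, which is not shown here.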
The visual system is exposed to a vast number of shapes and objects. Yet, human object recognition is effortless, fast and largely independent of naturally occurring transformations such as position and scale. The precise mechanisms of shape encoding are still largely unknown. Radial frequency (RF) patterns are a special class of closed contours defined by modulation of a circle’s radius. These patterns have been frequently and successfully used as stimuli in vision science to investigate aspects of shape processing. Given their mathematical properties, RF patterns cannot represent any arbitrary shape, but the ability to generate more complex, biologically relevant shapes depicting the outlines of objects such as fruits or human heads raises the possibility that RF patterns span a representative subset of possible shapes. However, this assumption has not been tested before. Here we show that only a small fraction of all possible shapes can be represented by RF patterns and that this small fraction is perceptually distinct from the general class of all possible shapes. Specifically, we derive a general measure for the distance of a given shape’s outline from the set of RF patterns, allowing us to scan large numbers of object outlines automatically. We find that only between 1% and 6% of naturally smooth outlines can be exactly represented by RF patterns. We present results from a visual search experiment, which revealed that searching for an RF pattern among non-radial frequency patterns is efficient, whereas searching for an RF pattern among other RF patterns is inefficient (and vice versa). These results suggest that RF patterns represent only a restricted subset of possible planar shapes and that results obtained with this special class of stimuli cannot simply be expected to generalise to any arbitrary planar shape.
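To make the definition of an RF pattern concrete, the following minimal sketch generates a contour using the commonly used sinusoidal form r(θ) = r0 · (1 + A · sin(ω·θ + φ)). The function name `rf_pattern` and the parameter names and default values are assumptions for illustration; the abstract itself does not specify a parameterization.

```python
import math

def rf_pattern(r0=1.0, amplitude=0.1, frequency=5, phase=0.0, n_points=360):
    """Return (x, y) points on a closed contour whose radius is modulated
    sinusoidally around a circle of mean radius r0."""
    points = []
    for k in range(n_points):
        theta = 2.0 * math.pi * k / n_points
        # Radius as a function of polar angle: a circle plus sinusoidal modulation.
        r = r0 * (1.0 + amplitude * math.sin(frequency * theta + phase))
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# A pattern with five lobes and 10% modulation of the mean radius.
contour = rf_pattern()
radii = [math.hypot(x, y) for x, y in contour]
```

Because the radius is a single-valued function of polar angle, any contour produced this way is star-shaped about its centre. That is one way to see why such patterns cannot represent arbitrary outlines.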
Humans are remarkably well tuned to the statistical properties of natural images. However, quantitative characterization of processing within the domain of natural images has been difficult because most parametric manipulations of a natural image make that image appear less natural. We used generative adversarial networks (GANs) to constrain parametric manipulations to remain within an approximation of the manifold of natural images. In the first experiment, seven observers decided which one of two synthetic perturbed images matched a synthetic unperturbed comparison image. Observers were significantly more sensitive to perturbations that were constrained to an approximate manifold of natural images than they were to perturbations applied directly in pixel space. Trial-by-trial errors were consistent with the idea that these perturbations disrupt configural aspects of visual structure used in image segmentation. In a second experiment, five observers discriminated paths along the image manifold as recovered by the GAN. Observers were remarkably good at this task, confirming that observers are tuned to fairly detailed properties of an approximate manifold of natural images. We conclude that human tuning to natural images is more general than detecting deviations from natural appearance, and that humans have, to some extent, access to detailed interrelations between natural images.
Gamma-band oscillations (roughly 30–100 Hz) in human and animal EEG have received considerable attention in the past due to their correlations with cognitive processes. Here, we sketch how some of the higher cognitive functions can be explained by memory processes which are known to modulate gamma activity. In particular, the function of binding together the multiple features of a perceived object requires a comparison with contents stored in memory. In addition, we review recent findings about the actual behavioral relevance of human gamma-band activity. Interestingly, rather simple models of spiking neurons are not only able to generate oscillatory activity within the gamma-band range, but even show modulations of these oscillations in line with findings from human experiments.