The success of convolutional neural networks (CNNs) in computer vision applications has been accompanied by a significant increase in computation and memory costs, which prohibits their use in resource-limited environments such as mobile systems or embedded devices. To this end, research on CNN compression has recently emerged. In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speed up the computation and reduce the memory overhead of CNNs, which can be well supported by various off-the-shelf deep learning libraries. Concretely, the proposed scheme incorporates two different regularizers of structured sparsity into the original objective function of filter pruning, which fully coordinates the global output and local pruning operations to adaptively prune filters. We further propose an alternative updating with Lagrange multipliers (AULM) scheme to efficiently solve its optimization. AULM follows the principle of the alternating direction method of multipliers (ADMM) and alternates between promoting the structured sparsity of CNNs and optimizing the recognition loss, which leads to a very efficient solver (a 2.5× speedup over the most recent work that directly solves the group sparsity-based regularization). Moreover, by imposing the structured sparsity, the online inference is extremely memory-light, since the number of filters and the output feature maps are simultaneously reduced. The proposed scheme has been deployed on a variety of state-of-the-art CNN structures, including LeNet, AlexNet, VGGNet, ResNet, and GoogLeNet, over different data sets. Quantitative results demonstrate that the proposed scheme achieves superior performance over the state-of-the-art methods.
We further demonstrate the proposed compression scheme on transfer learning tasks, including domain adaptation and object detection, which also shows exciting performance gains over the state-of-the-art filter pruning methods.
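The filter-level (structured) sparsity that SSR targets can be illustrated with a minimal sketch. This is not the paper's SSR/AULM solver, which regularizes the training objective and alternates with the recognition loss; here filters are simply ranked by their L2 norm and the weakest ones are zeroed, which is the kind of whole-filter sparsity that off-the-shelf libraries can exploit at inference time:

```python
import math

def filter_l2_norms(filters):
    """Per-filter L2 norms; each filter is a flat list of weights."""
    return [math.sqrt(sum(w * w for w in f)) for f in filters]

def prune_filters(filters, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of filters by L2 norm and
    zero the rest, producing filter-level structured sparsity."""
    norms = filter_l2_norms(filters)
    k = max(1, int(len(filters) * keep_ratio))
    threshold = sorted(norms, reverse=True)[k - 1]
    return [f if n >= threshold else [0.0] * len(f)
            for f, n in zip(filters, norms)]

# Toy layer: 4 filters of 3 weights each
layer = [[0.9, 0.1, 0.2], [0.01, 0.02, 0.0],
         [0.5, 0.4, 0.3], [0.03, 0.01, 0.02]]
pruned = prune_filters(layer, keep_ratio=0.5)
kept = sum(1 for f in pruned if any(f))
```

Because whole filters are removed rather than individual weights, both the filter count and the corresponding output feature maps shrink, which is what makes the inference memory-light.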
3D shape recognition has attracted much attention recently. Its recent advances advocate the use of deep features and achieve state-of-the-art performance. However, existing deep features for 3D shape recognition are restricted to a view-to-shape setting, which learns the shape descriptor directly from the view-level features. Despite the exciting progress on view-based 3D shape description, the intrinsic hierarchical correlation and discriminability among views have not been well exploited, which is important for 3D shape representation. To tackle this issue, in this paper, we propose a group-view convolutional neural network (GVCNN) framework for hierarchical correlation modeling towards discriminative 3D shape description. The proposed GVCNN framework is composed of a hierarchical view-group-shape architecture, i.e., the view level, the group level, and the shape level, which are organized using a grouping strategy. Concretely, we first use an expanded CNN to extract a view-level descriptor. Then, a grouping module is introduced to estimate the content discrimination of each view, based on which all views can be split into different groups according to their discriminative levels. A group-level description can then be generated by pooling the view descriptors within each group. Finally, all group-level descriptors are combined into the shape-level descriptor according to their discriminative weights. Experimental results and comparison with state-of-the-art methods show that our proposed GVCNN method can achieve a significant performance gain on both the 3D shape classification and retrieval tasks.
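The view-group-shape aggregation can be sketched roughly as follows. This is an illustrative simplification, not the paper's learned grouping module: the function names, the score-quantization rule, and mean pooling are all assumptions:

```python
def group_views(view_descriptors, scores, num_bins=3):
    """Split views into groups by quantized discrimination score,
    mean-pool each group into a group-level descriptor, then combine
    groups weighted by their mean score into a shape-level descriptor."""
    groups = {}
    for desc, s in zip(view_descriptors, scores):
        b = min(int(s * num_bins), num_bins - 1)  # quantize score into a bin
        groups.setdefault(b, []).append((desc, s))
    dim = len(view_descriptors[0])
    shape_desc = [0.0] * dim
    total_w = 0.0
    for members in groups.values():
        # group-level descriptor: mean pooling over the member views
        pooled = [sum(d[i] for d, _ in members) / len(members)
                  for i in range(dim)]
        w = sum(s for _, s in members) / len(members)  # group weight
        shape_desc = [sd + w * p for sd, p in zip(shape_desc, pooled)]
        total_w += w
    return [sd / total_w for sd in shape_desc]

shape = group_views([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.1, 0.5, 0.9])
```

The key design point mirrored here is that highly discriminative view groups contribute more to the final shape descriptor than low-score groups.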
Previous works on image emotion analysis mainly focused on predicting the dominant emotion category or the average dimension values of an image for affective image classification and regression. However, this is often insufficient in various real-world applications, as the emotions that an image evokes in viewers are highly subjective and vary from person to person. In this paper, we propose to predict the continuous probability distribution of image emotions, represented in a dimensional valence-arousal space. We carried out large-scale statistical analysis on the constructed Image-Emotion-Social-Net dataset, on which we observed that the emotion distribution can be well modeled by a Gaussian mixture model. This model is estimated by an expectation-maximization algorithm with specified initializations. Then, we extract commonly used emotion features at different levels for each image. Finally, we formalize the emotion distribution prediction task as a shared sparse regression (SSR) problem and extend it to multitask settings, named multitask shared sparse regression (MTSSR), to explore the latent information between different prediction tasks. SSR and MTSSR are optimized by iteratively reweighted least squares. Experiments are conducted on the Image-Emotion-Social-Net dataset with comparisons to three alternative baselines. The quantitative results demonstrate the superiority of the proposed method.
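The iteratively reweighted least squares (IRLS) step mentioned above can be illustrated in the simplest scalar case. This is a generic IRLS sketch for an L1-penalized least-squares fit, under the standard majorization |w| ≈ w²/(2|w_old|); it is not the paper's SSR/MTSSR formulation:

```python
def irls_l1(xs, ys, lam=1.0, iters=50, eps=1e-8):
    """Scalar L1-penalized least squares, min_w sum (y - w x)^2 + lam |w|,
    solved by iteratively reweighted least squares: at each step |w| is
    majorized by w^2 / (2 |w_old|), giving a closed-form update."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    w = sxy / sxx  # ordinary least-squares start
    for _ in range(iters):
        w = sxy / (sxx + lam / (2.0 * abs(w) + eps))
    return w

w_ols = irls_l1([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], lam=0.0)
w_reg = irls_l1([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], lam=1.0)
```

With no penalty the fit recovers the least-squares slope; with the penalty the slope is shrunk toward zero, the behavior that sparsity-inducing regression relies on.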
Towards the end of 2012, artificial intelligence (AI) scientists first figured out how to impart “vision” to neural networks. Later, they also mastered how to enable neural networks to mimic human reasoning, hearing, speaking, and writing. Although AI has become comparable or even superior to humans in accomplishing specific tasks, it still does not possess the “flexibility” of the human brain, i.e., the ability to apply skills learned in one situation to another.
Taking cues from the growth process of children, we think about the following question. If senses and language can be combined, and AI can perform at a level closer to humans in collecting and processing information, will it be able to develop an understanding of the world? The answer is yes. “Multi-modal” systems, which can simultaneously acquire human senses and language, generate significantly stronger AI and make it easier for AI to adapt to new situations and solve new problems. Hence, such algorithms can be used to solve more complex problems, or be implanted into robots for communication and collaboration with humans in our daily lives. In September 2020, researchers from the Allen Institute for AI (AI2) created a model that could generate images from captions, thus demonstrating the ability of the algorithm to associate words with visual information. In November, scientists from the University of North Carolina at Chapel Hill developed a method of incorporating images into existing language models, which significantly enhanced the ability of the models to comprehend text. Early in 2021, OpenAI extended GPT-3 and released two visual language models: one associates the objects in an image with the words in its descriptions, and the other generates a digital image based on the combination of concepts it has learned. The progress made by “multi-modal” systems will, in the long run, help break through the limits of AI. It will not only unlock new AI applications, but also make these applications safer and more reliable. More sophisticated multi-modal systems will also aid the development of more advanced robot assistants. Ultimately, multi-modal systems may prove to be the first AI that we can trust.①

① Original source in Chinese: R. Ji, Multi-skilled AI, Bulletin of National Natural Science Foundation of China 35 (3) (2021) 413-415.
Body Structure Aware Deep Crowd Counting
Huang, Siyu; Li, Xi; Zhang, Zhongfei ...
IEEE Transactions on Image Processing, March 2018, Volume 27, Issue 3
Journal Article, Peer-reviewed
Crowd counting is a challenging task, mainly due to the severe occlusions among dense crowds. This paper aims to take a broader view and address crowd counting from the perspective of semantic modeling. In essence, crowd counting is a task of pedestrian semantic analysis involving three key factors: pedestrians, heads, and their context structure. The information of different body parts is an important cue for judging whether a person exists at a certain position. Existing methods usually perform crowd counting by directly modeling the visual properties of either the whole body or the heads only, without explicitly capturing the composite body-part semantic structure information that is crucial for crowd counting. In our approach, we first formulate the key factors of crowd counting as semantic scene models. Then, we convert the crowd counting problem into a multi-task learning problem, such that the semantic scene models are turned into different sub-tasks. Finally, deep convolutional neural networks are used to learn the sub-tasks in a unified scheme. Our approach encodes the semantic nature of crowd counting and provides a novel solution in terms of pedestrian semantic analysis. In experiments, our approach outperforms the state-of-the-art methods on four benchmark crowd counting data sets. The semantic structure information is demonstrated to be an effective cue for crowd counting.
The wide range of 3D applications has led to an increasing amount of 3D object data, and thus effective 3D object classification techniques have become an urgent requirement. One important and challenging task for 3D object classification is how to formulate and exploit the correlation among 3D data. Most previous works focus on learning an optimal pairwise distance metric for object comparison, which may lose the global correlation among 3D objects. Recently, transductive hypergraph learning has been investigated for classification, which can jointly explore the correlation among multiple objects, including both labeled and unlabeled data. Although these methods have shown better performance, they are still limited in that 1) a considerable amount of testing data may not be available in practice and 2) testing newly arriving data carries a high computational cost. To handle this problem, considering the multi-modal representations of 3D objects in practice, we propose an inductive multi-hypergraph learning algorithm, which aims to learn an optimal projection for the multi-modal training data. In this method, all the training data are formulated in a multi-hypergraph based on their features, and inductive learning is conducted to learn the projection matrices and the optimal multi-hypergraph combination weights simultaneously. Different from transductive learning on hypergraphs, the high-cost training process is performed off-line, and the testing process of inductive hypergraph learning is very efficient. We have conducted experiments on two 3D benchmarks, i.e., the NTU and ModelNet40 data sets, and compared the proposed algorithm with state-of-the-art methods and traditional transductive multi-hypergraph learning methods. Experimental results have demonstrated that the proposed method achieves effective and efficient classification performance.
We also note that the proposed method is a general framework and has the potential to be applied to other practical applications.
In view-based 3-D object retrieval, each object is described by a set of views, so group matching plays an important role. Previous research efforts have shown the effectiveness of the Hausdorff distance in group matching. In this paper, we propose a 3-D object retrieval scheme with Hausdorff distance learning. In our approach, relevance feedback information is employed to select positive and negative view pairs with a probabilistic strategy, and a view-level Mahalanobis distance metric is learned. This Mahalanobis distance metric is adopted in estimating the Hausdorff distances between objects, based on which the objects in the 3-D database are ranked. We conduct experiments on three testing data sets, and the results demonstrate that the proposed Hausdorff learning approach can improve 3-D object retrieval performance.
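As a concrete illustration of the combination described above, the sketch below computes a symmetric Hausdorff distance between two view sets under a Mahalanobis metric. For simplicity the metric is a hypothetical fixed diagonal matrix; in the paper the metric is learned from relevance feedback:

```python
import math

def mahalanobis(u, v, diag_m):
    """Mahalanobis distance with a diagonal metric (illustrative only)."""
    return math.sqrt(sum(m * (a - b) ** 2 for a, b, m in zip(u, v, diag_m)))

def hausdorff(views_a, views_b, diag_m):
    """Symmetric Hausdorff distance between two view sets: the larger of
    the two directed distances max_u min_v d(u, v)."""
    def directed(src, dst):
        return max(min(mahalanobis(u, v, diag_m) for v in dst) for u in src)
    return max(directed(views_a, views_b), directed(views_b, views_a))

d_plain = hausdorff([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0]], [1.0, 1.0])
d_metric = hausdorff([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0]], [4.0, 1.0])
```

Note how re-weighting the first feature dimension (4.0 instead of 1.0) changes the object-level distance, which is exactly the leverage a learned view-level metric gives to retrieval ranking.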
View-based 3-D object retrieval and recognition has become popular in practice, e.g., in computer-aided design. It is difficult to precisely estimate the distance between two objects represented by multiple views, so current view-based 3-D object retrieval and recognition methods may not perform well. In this paper, we propose a hypergraph analysis approach that addresses this problem by avoiding the estimation of the distance between objects. In particular, we construct multiple hypergraphs for a set of 3-D objects based on their 2-D views. In these hypergraphs, each vertex is an object, and each edge is a cluster of views; an edge therefore connects multiple vertices. We define the weight of each edge based on the similarities between any two views within the cluster. Retrieval and recognition are performed on the hypergraphs, so our method can explore the higher-order relationships among objects without using inter-object distances. We conduct experiments on the National Taiwan University 3-D model dataset and the ETH 3-D object collection. Experimental results demonstrate the effectiveness of the proposed method in comparison with state-of-the-art methods.
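The edge-weight definition can be sketched as follows. The Gaussian similarity kernel and the averaging over all view pairs are illustrative assumptions; the paper only specifies that the weight is derived from pairwise view similarities within the cluster:

```python
import math

def hyperedge_weight(view_cluster, sigma=1.0):
    """Weight of a hyperedge (a cluster of view feature vectors):
    average Gaussian similarity over all view pairs in the cluster."""
    pairs, total = 0, 0.0
    for i in range(len(view_cluster)):
        for j in range(i + 1, len(view_cluster)):
            d2 = sum((a - b) ** 2
                     for a, b in zip(view_cluster[i], view_cluster[j]))
            total += math.exp(-d2 / (2 * sigma ** 2))
            pairs += 1
    return total / pairs if pairs else 1.0

w_tight = hyperedge_weight([[0.0, 0.0], [0.0, 0.0]])   # identical views
w_loose = hyperedge_weight([[0.0, 0.0], [3.0, 0.0]])   # spread-out views
```

Tight clusters of mutually similar views get weights near 1 and thus dominate the hypergraph learning, while loose clusters are down-weighted.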
Hadamard Matrix Guided Online Hashing
Lin, Mingbao; Ji, Rongrong; Liu, Hong ...
International Journal of Computer Vision, September 2020, Volume 128, Issue 8-9
Journal Article, Peer-reviewed, Open Access
Online image hashing has attracted increasing research attention recently; it receives large-scale data in a streaming manner to update the hash functions on-the-fly. Its key challenge lies in the difficulty of balancing learning timeliness and model accuracy. To this end, most works follow a supervised setting, i.e., using class labels to boost the hashing performance, which is defective in two aspects: first, strong constraints, e.g., orthogonality or similarity preservation, are used, which however are typically relaxed and lead to large accuracy drops. Second, large numbers of training batches are required to learn the up-to-date hash functions, which largely increases the learning complexity. To handle the above challenges, a novel supervised online hashing scheme termed Hadamard Matrix Guided Online Hashing (HMOH) is proposed in this paper. Our key innovation lies in introducing the Hadamard matrix, an orthogonal binary matrix built via the Sylvester method. In particular, to remove the need for strong constraints, we regard each column of the Hadamard matrix as the target code for each class label, which by nature satisfies several desired properties of hashing codes. To accelerate the online training, locality-sensitive hashing (LSH) is first adopted to align the lengths of the target code and the to-be-learned binary code. We then treat the learning of hash functions as a set of binary classification problems to fit the assigned target codes. Finally, extensive experiments on four widely used benchmarks demonstrate the superior accuracy and efficiency of HMOH over various state-of-the-art methods. Code is available at https://github.com/lmbxmu/mycode.
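The Sylvester construction behind HMOH's target codes is easy to reproduce. The sketch below builds an order-2^k Hadamard matrix and assigns its columns as class target codes; the particular column-to-class assignment shown is arbitrary, and the LSH length-alignment step is omitted:

```python
def sylvester_hadamard(k):
    """Order-2**k Hadamard matrix via Sylvester's construction:
    H_{2n} = [[H_n, H_n], [H_n, -H_n]], starting from H_1 = [1]."""
    h = [[1]]
    for _ in range(k):
        h = ([row + row for row in h] +
             [row + [-x for x in row] for row in h])
    return h

def class_target_codes(num_classes, code_len_log2):
    """Assign column j of the Hadamard matrix as the +/-1 target code
    of class j (the assignment order here is an arbitrary choice)."""
    h = sylvester_hadamard(code_len_log2)
    return [[h[i][j] for i in range(len(h))] for j in range(num_classes)]

h = sylvester_hadamard(3)          # 8 x 8 matrix
codes = class_target_codes(4, 3)   # 8-bit target codes for 4 classes
```

The appeal of these target codes is visible in the test: distinct columns are exactly orthogonal and every entry is ±1, so the codes are balanced, maximally separated, and need no relaxed orthogonality constraint during learning.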
Hyperspectral image classification has attracted extensive research efforts in the recent decade. The main difficulty lies in the few labeled samples versus the high-dimensional features. To this end, exploring the relationship among different pixels is a fundamental step toward jointly handling both the lack of labels and the high dimensionality. In hyperspectral images, the classification task can benefit from the spatial layout information. In this paper, we propose a hyperspectral image classification method that addresses both the pixel spectral and spatial constraints, in which the relationship among pixels is formulated in a hypergraph structure. In the constructed hypergraph, each vertex denotes a pixel in the hyperspectral image. The hyperedges are constructed from both the distances between pixels in the feature space and the spatial locations of pixels. More specifically, a feature-based hyperedge is generated using the distances among pixels, where each pixel is connected with its K nearest neighbors in the feature space, and a spatial-based hyperedge is generated to model the layout among pixels, where each pixel is linked with its local spatial neighbors. Learning on the combined hypergraph is conducted by jointly investigating the image features and the spatial layout of pixels to seek their joint optimal partitions. Experiments on four data sets are performed to evaluate the effectiveness and efficiency of the proposed method. Comparisons with state-of-the-art methods demonstrate the superiority of the proposed method in hyperspectral image classification.
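The two hyperedge families can be sketched as follows. The 4-connected spatial neighborhood, the neighbor count k, and the toy feature vectors are illustrative assumptions rather than the paper's exact construction:

```python
def build_hyperedges(features, positions, k=2):
    """Two hyperedge families per pixel: one from its k nearest
    neighbours in feature space, one from its 4-connected spatial
    neighbours. Each hyperedge is a sorted list of pixel indices."""
    n = len(features)
    pos_index = {p: i for i, p in enumerate(positions)}
    feat_edges, spat_edges = [], []
    for i in range(n):
        # feature-based hyperedge: pixel i plus its k nearest neighbours
        ranked = sorted(range(n),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(features[i],
                                                          features[j])))
        feat_edges.append(sorted(ranked[:k + 1]))
        # spatial hyperedge: pixel i plus its 4-connected neighbours
        r, c = positions[i]
        nbrs = [pos_index[p] for p in [(r - 1, c), (r + 1, c),
                                       (r, c - 1), (r, c + 1)]
                if p in pos_index]
        spat_edges.append(sorted([i] + nbrs))
    return feat_edges, spat_edges

# A 2x2 toy image with one spectral feature per pixel
feat_edges, spat_edges = build_hyperedges(
    [[0.0], [1.0], [10.0], [11.0]],
    [(0, 0), (0, 1), (1, 0), (1, 1)], k=1)
```

Combining both families lets the hypergraph connect pixels that look alike spectrally even when far apart, while still enforcing local spatial smoothness.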