In large-scale visual recognition and image retrieval tasks, feature vectors, such as Fisher vector (FV) or the vector of locally aggregated descriptors (VLAD), have achieved state-of-the-art ...results. However, the combination of the large numbers of examples and high-dimensional vectors necessitates dimensionality reduction, in order to reduce its storage and CPU costs to a reasonable range. In spite of the popularity of various feature compression methods, this paper shows that the feature (dimension) selection is a better choice for high-dimensional FV/VLAD than the feature (dimension) compression methods, e.g., product quantization. We show that strong correlation among the feature dimensions in the FV and the VLAD may not exist, which renders feature selection a natural choice. We also show that, many dimensions in FV/VLAD are noise. Throwing them away using feature selection is better than compressing them and useful dimensions altogether using feature compression methods. To choose features, we propose an efficient importance sorting algorithm considering both the supervised and unsupervised cases, for visual recognition and image retrieval, respectively. Combining with the 1-bit quantization, feature selection has achieved both higher accuracy and less computational cost than feature compression methods, such as product quantization, on the FV and the VLAD image representations.
Histograms are used in almost every aspect of image processing and computer vision, from visual descriptors to image representations. Histogram intersection kernel (HIK) and support vector machine ...(SVM) classifiers are shown to be very effective in dealing with histograms. This paper presents contributions concerning HIK SVM for image classification. First, we propose intersection coordinate descent (ICD), a deterministic and scalable HIK SVM solver. ICD is much faster than, and has similar accuracies to, general purpose SVM solvers and other fast HIK SVM training methods. We also extend ICD to the efficient training of a broader family of kernels. Second, we show an important empirical observation that ICD is not sensitive to the C parameter in SVM, and we provide some theoretical analyses to explain this observation. ICD achieves high accuracies in many problems, using its default parameters. This is an attractive property for practitioners, because many image processing tasks are too large to choose SVM parameters using cross-validation.
•Filter selection and model fine-tuning are integrated into a single end-to-end trainable framework.•Adaptive compression ratio and multi-layer compression.•Good generalization ability.
Channel ...pruning is an important method to speed up CNN model’s inference. Previous filter pruning algorithms regard importance evaluation and model fine-tuning as two independent steps. This paper argues that combining them into a single end-to-end trainable system will lead to better results. We propose an efficient channel selection layer, namely AutoPruner, to find less important filters automatically in a joint training manner. Our AutoPruner takes previous activation responses as an input and generates a true binary index code for pruning. Hence, all the filters corresponding to zero index values can be removed safely after training. By gradually erasing several unimportant filters, we can prevent an excessive drop in model accuracy. Compared with previous state-of-the-art pruning algorithms (including training from scratch), AutoPruner achieves significantly better performance. Furthermore, ablation experiments show that the proposed novel mini-batch pooling and binarization operations are vital for the success of model pruning.
For large-scale classification tasks, especially in the classification of images, additive kernels have shown a state-of-the-art accuracy. However, even with the recent development of fast ...algorithms, learning speed and the ability to handle large-scale tasks are still open problems. This paper proposes algorithms for large-scale support vector machines (SVM) classification and other tasks using additive kernels. First, a linear regression SVM framework for general nonlinear kernel is proposed using linear regression to approximate gradient computations in the learning process. Second, we propose a power mean SVM (PmSVM) algorithm for all additive kernels using nonsymmetric explanatory variable functions. This nonsymmetric kernel approximation has advantages over the existing methods: 1) it does not require closed-form Fourier transforms and 2) it does not require extra training for the approximation either. Compared on benchmark large-scale classification data sets with millions of examples or millions of dense feature dimensions, PmSVM has achieved the highest learning speed and highest accuracy among recent algorithms in most cases.
We propose an efficient and unified framework, namely ThiNet, to simultaneously accelerate and compress CNN models in both training and inference stages. We focus on the filter level pruning, i.e., ...the whole filter would be discarded if it is less important. Our method does not change the original network structure, thus it can be perfectly supported by any off-the-shelf deep learning libraries. We formally establish filter pruning as an optimization problem, and reveal that we need to prune filters based on statistics information computed from its next layer, not the current layer, which differentiates ThiNet from existing methods. Experimental results demonstrate the effectiveness of this strategy, which has advanced the state-of-the-art. We also show the performance of ThiNet on ILSVRC-12 benchmark. ThiNet achieves 3.31 x FLOPs reduction and 16.63× compression on VGG-16, with only 0.52% top-5 accuracy drop. Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of roughly 1% top-5 accuracy drop. Moreover, the original VGG-16 model can be further pruned into a very small model with only 5.05MB model size, preserving AlexNet level accuracy but showing much stronger generalization ability.
•To the best of our knowledge, Mask-CNN is the first end-to-end model that selects deep convolutional descriptors for object recognition, especially for fine-grained image recognition.•We present a ...novel and efficient part-based three-stream model for fine-grained recognition. By discarding the fully connected layers, the proposed M-CNN is computationally efficient (cf. Table 1 and Table 4 in experiments). Additionally, comparing with state-of-the-art methods, M-CNN has smaller feature dimensionality. Beyond those, it achieves the highest classification accuracy on CUB200-2011 and Birdsnap among published methods.•The part localization performance of the proposed model outperforms other part-based finegrained approaches which requires additional bounding boxes. In particular, M-CNN is 12.76% higher than state-of-the-art for head localization on CUB200-2011.
Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations. In this paper, we prove that selecting useful deep descriptors contributes well to fine-grained image recognition. Specifically, a novel Mask-CNN model without the fully connected layers is proposed. Based on the part annotations, the proposed model consists of a fully convolutional network to both locate the discriminative parts (e.g., head and torso), and more importantly generate weighted object/part masks for selecting useful and meaningful convolutional descriptors. After that, a three-stream Mask-CNN model is built for aggregating the selected object- and part-level descriptors simultaneously. Thanks to discarding the parameter redundant fully connected layers, our Mask-CNN has a small feature dimensionality and efficient inference speed by comparing with other fine-grained approaches. Furthermore, we obtain a new state-of-the-art accuracy on two challenging fine-grained bird species categorization datasets, which validates the effectiveness of both the descriptor selection scheme and the proposed Mask-CNN model.
CENsus TRansform hISTogram (CENTRIST), a new visual descriptor for recognizing topological places or scene categories, is introduced in this paper. We show that place and scene recognition, ...especially for indoor environments, require its visual descriptor to possess properties that are different from other vision domains (e.g., object recognition). CENTRIST satisfies these properties and suits the place and scene recognition task. It is a holistic representation and has strong generalizability for category recognition. CENTRIST mainly encodes the structural properties within an image and suppresses detailed textural information. Our experiments demonstrate that CENTRIST outperforms the current state of the art in several place and scene recognition data sets, compared with other descriptors such as SIFT and Gist. Besides, it is easy to implement and evaluates extremely fast.
Deep Label Distribution Learning With Label Ambiguity Gao, Bin-Bin; Xing, Chao; Xie, Chen-Wei ...
IEEE transactions on image processing,
2017-June, 2017-Jun, 2017-6-00, 20170601, Volume:
26, Issue:
6
Journal Article
Peer reviewed
Convolutional neural networks (ConvNets) have achieved excellent recognition performance in various visual recognition tasks. A large labeled training set is one of the most important factors for its ...success. However, it is difficult to collect sufficient training images with precise labels in some domains, such as apparent age estimation, head pose estimation, multilabel classification, and semantic segmentation. Fortunately, there is ambiguous information among labels, which makes these tasks different from traditional classification. Based on this observation, we convert the label of each image into a discrete label distribution, and learn the label distribution by minimizing a Kullback-Leibler divergence between the predicted and ground-truth label distributions using deep ConvNets. The proposed deep label distribution learning (DLDL) method effectively utilizes the label ambiguity in both feature learning and classifier learning, which help prevent the network from overfitting even when the training set is small. Experimental results show that the proposed approach produces significantly better results than the state-of-the-art methods for age estimation and head pose estimation. At the same time, it also improves recognition performance for multi-label classification and semantic segmentation tasks.
Deep convolutional neural network models pre-trained for the ImageNet classification task have been successfully adopted to tasks in other domains, such as texture description and object proposal ...generation, but these tasks require annotations for images in the new domain. In this paper, we focus on a novel and challenging task in the pure unsupervised setting: fine-grained image retrieval. Even with image labels, fine-grained images are difficult to classify, letting alone the unsupervised retrieval task. We propose the selective convolutional descriptor aggregation (SCDA) method. The SCDA first localizes the main object in fine-grained images, a step that discards the noisy background and keeps useful deep descriptors. The selected descriptors are then aggregated and the dimensionality is reduced into a short feature vector using the best practices we found. The SCDA is unsupervised, using no image label or bounding box annotation. Experiments on six fine-grained data sets confirm the effectiveness of the SCDA for fine-grained image retrieval. Besides, visualization of the SCDA features shows that they correspond to visual attributes (even subtle ones), which might explain SCDA's high-mean average precision in fine-grained retrieval. Moreover, on general image retrieval data sets, the SCDA achieves comparable retrieval results with the state-of-the-art general image retrieval approaches.
mCENTRIST, a new multichannel feature generation mechanism for recognizing scene categories, is proposed in this paper. mCENTRIST explicitly captures the image properties that are encoded jointly by ...two image channels, which is different from popular multichannel descriptors. In order to avoid the curse of dimensionality, tradeoffs at both feature and channel levels have been executed to make mCENTRIST computationally practical. As a result, mCENTRIST is both efficient and easy to implement. In addition, a hyperopponent color space is proposed by embedding Sobel information into the opponent color space for further performance improvements. Experiments show that mCENTRIST outperforms established multichannel descriptors on four RGB and RGB-near infrared data sets, including aerial orthoimagery, indoor, and outdoor scene category recognition tasks. Experiments also verify that the hyper opponent color space enhances descriptors' performance effectively.