Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks by obtaining supervision from the data itself. However, some current methods tend to cheat from the background, i.e., the prediction depends heavily on the video background rather than the motion, making the model vulnerable to background changes. To mitigate the model's reliance on the background, we propose to remove the background impact by adding the background. That is, given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample. We then force the model to pull the feature of the distracting video and the feature of the original video closer, so that the model is explicitly restricted to resist the background influence and focus more on the motion changes. We term our method Background Erasing (BE). It is worth noting that our method is simple to implement and can be added to most state-of-the-art methods with little effort. Specifically, BE brings 16.4% and 19.1% improvements with MoCo on the severely biased datasets UCF101 and HMDB51, and a 14.5% improvement on the less biased dataset Diving48.
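The distracting-video construction described above can be sketched in a few lines. The blending weight `alpha` and the frame tensor layout are our assumptions, not specified in the abstract:

```python
import numpy as np

def background_erasing(video, alpha=0.3, rng=None):
    """Build a distracting video by blending one randomly chosen static
    frame into every frame (a sketch of the BE idea; `alpha` is an
    assumed blending weight)."""
    if rng is None:
        rng = np.random.default_rng()
    t = rng.integers(len(video))            # pick a random static frame
    static = video[t]
    # add the static frame to every frame via a convex combination,
    # so pixel values stay in the original range
    return (1 - alpha) * video + alpha * static[None]

video = np.random.default_rng(0).random((16, 8, 8, 3))  # (frames, H, W, C)
out = background_erasing(video, alpha=0.3, rng=np.random.default_rng(1))
```

The model would then be trained so that features of `video` and `out` stay close, discouraging background shortcuts.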
The bag-of-features (BoF) representation has been extensively applied to various computer vision applications. To extract a discriminative and descriptive BoF, one important step is to learn a good dictionary that minimizes the quantization loss between local features and codewords. While most existing visual dictionary learning approaches rely on unsupervised feature quantization, the latest trend has turned to supervised learning that harnesses the semantic labels of images or regions. However, such labels are typically too expensive to acquire, which restricts the scalability of supervised dictionary learning approaches. In this paper, we propose to leverage image attributes to weakly supervise the dictionary learning procedure without requiring any actual labels. As a key contribution, our approach establishes a generative hidden Markov random field (HMRF), which models the quantized codewords as the observed states and the image attributes as the hidden states, respectively. Dictionary learning is then performed by supervised grouping of the observed states, where the supervisory information stems from the hidden states of the HMRF. In this way, the proposed approach incorporates image attributes to learn a semantic-preserving BoF representation without any genuine supervision. Experiments on large-scale image retrieval and classification tasks corroborate that our approach significantly outperforms state-of-the-art unsupervised dictionary learning approaches.
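The quantization step the abstract refers to, assigning each local feature to its nearest codeword to form the BoF histogram, can be sketched as follows. This is plain nearest-neighbour assignment, not the proposed HMRF model:

```python
import numpy as np

def quantize(features, codebook):
    """Map each local feature to its nearest codeword and build the
    normalized bag-of-features histogram."""
    # squared Euclidean distances between all features and all codewords
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum(), assignments

features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
codebook = np.array([[0.0, 0.0], [5.0, 5.0]])
hist, assign = quantize(features, codebook)
```

A learned dictionary is one that places the codewords so the residual distances in `d` (the quantization loss) are small over the training features.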
Visual reranking has been widely deployed to refine traditional text-based image retrieval. Its current trend is to combine the retrieval results from various visual features to boost reranking precision and scalability, and its prominent challenge is how to effectively exploit the complementary properties of different features. Another significant issue arises from noisy instances, introduced by manual or automatic labeling, which make the exploration of such complementary properties difficult. This paper proposes a novel image reranking method by introducing a Co-Regularized Multi-Graph Learning (Co-RMGL) framework, in which intra-graph and inter-graph constraints are integrated to simultaneously encode the similarity within a single graph and the consistency across multiple graphs. To deal with noisy instances, weakly supervised learning via co-occurring visual attributes is utilized to select a set of graph anchors that guide multi-graph alignment and fusion, and to filter out pseudo-labeled instances to highlight the strength of individual features. After that, an edge weighting matrix learned from the fused graph is used to reorder the retrieval results. We evaluate our approach on four popular image retrieval datasets and demonstrate a significant improvement over state-of-the-art methods.
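A drastically simplified stand-in for the fusion-and-rerank step may help fix ideas: here the per-feature affinity graphs are combined with fixed weights, whereas Co-RMGL learns the fusion under intra- and inter-graph constraints:

```python
import numpy as np

def fuse_and_rerank(graphs, weights, query_idx):
    """Rerank candidates by a weighted fusion of per-feature affinity
    graphs (fixed weights stand in for the learned fusion)."""
    fused = sum(w * g for w, g in zip(weights, graphs))
    scores = fused[query_idx]               # fused similarity to the query
    order = np.argsort(-scores)             # most similar first
    return order, fused

# two 3x3 affinity graphs from two hypothetical visual features
g1 = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
g2 = np.array([[1.0, 0.8, 0.3], [0.8, 1.0, 0.1], [0.3, 0.1, 1.0]])
order, fused = fuse_and_rerank([g1, g2], [0.5, 0.5], query_idx=0)
```

In the full framework the anchor-guided alignment would additionally down-weight edges contributed by noisy instances before the fused graph is used for reordering.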
Visual patterns, i.e., high-order combinations of visual words, contribute to a discriminative abstraction of the high-dimensional bag-of-words image representation. However, existing visual patterns are built upon the 2D photographic co-occurrences of visual words, which is ill-posed compared with their real-world 3D co-occurrences, since words from different objects or different depths might be incorrectly bound into an identical pattern. On the other hand, designing compact descriptors from the mined patterns remains an open problem. To address both issues, in this paper we propose a novel compact bag-of-patterns (CBoP) descriptor with an application to low bit rate mobile landmark search. First, to overcome the ill-posed 2D photographic configuration, we build a 3D point cloud from the reference images of each landmark, so that more accurate pattern candidates can be extracted from the 3D co-occurrences of visual words. A novel gravity distance metric is then proposed to mine discriminative visual patterns. Second, we obtain a compact image description by introducing the CBoP descriptor, which is computed by sparse coding over the mined visual patterns to maximally reconstruct the original bag-of-words histogram with a minimum coding length. We developed a low bit rate mobile landmark search prototype, in which the CBoP descriptor is extracted and sent directly from the mobile end to reduce the query delivery latency. The CBoP performance is quantified on several large-scale benchmarks with comparisons to state-of-the-art compact descriptors, topic features, and hashing descriptors. We report accuracy comparable to the bag-of-words histogram over million-scale visual words, with a higher descriptor compression rate (approximately 100 bits) than state-of-the-art bag-of-words compression schemes.
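The "sparse coding over mined patterns" step can be illustrated with greedy matching pursuit. The abstract does not specify the sparse coder, and the code length `k` is an assumption:

```python
import numpy as np

def matching_pursuit(hist, patterns, k=2):
    """Greedily encode a bag-of-words histogram with at most k mined
    patterns, reconstructing it with a short code (matching pursuit
    stands in for the unspecified sparse coder)."""
    residual = hist.astype(float).copy()
    code = np.zeros(len(patterns))
    for _ in range(k):
        corr = patterns @ residual            # correlation with residual
        j = int(np.abs(corr).argmax())        # best-matching pattern
        coef = corr[j] / (patterns[j] @ patterns[j])
        code[j] += coef                       # record its coefficient
        residual -= coef * patterns[j]        # remove its contribution
    return code

# toy dictionary: each "pattern" is an indicator over three visual words
patterns = np.eye(3)
code = matching_pursuit(np.array([0.5, 0.3, 0.2]), patterns, k=2)
```

The resulting `code` is sparse (at most `k` nonzeros), which is what makes a roughly 100-bit descriptor plausible for transmission from a mobile client.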
Traditional neural architecture search (NAS) has had a significant impact on computer vision by automatically designing network architectures for various tasks. In this paper, binarized neural architecture search (BNAS), with a search space of binarized convolutions, is introduced to produce extremely compressed models that reduce the huge computational cost on embedded devices for edge computing. BNAS is more challenging than NAS due to the learning inefficiency caused by the binarization optimization requirements and the huge architecture space, as well as the performance loss when handling wild data in real-world computing applications. To address these issues, we introduce operation space reduction and channel sampling into BNAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy that is robust to wild data and is further used to abandon less promising operations. Furthermore, we introduce the upper confidence bound (UCB) to solve 1-bit BNAS. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves performance comparable to NAS on both the CIFAR and ImageNet databases. An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model and a 40% faster search than the state-of-the-art PC-DARTS. On the wild face recognition task, our binarized models achieve performance similar to their corresponding full-precision models.
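The upper-confidence-bound scoring used to rank candidate operations (so the weakest can be abandoned) can be sketched as follows. The exploration constant `c` and the example statistics are assumptions for illustration:

```python
import math

def ucb(mean_perf, n_trials, n_total, c=1.0):
    """Standard UCB score: observed mean performance plus an
    exploration bonus that shrinks as an operation is tried more."""
    return mean_perf + c * math.sqrt(math.log(n_total) / n_trials)

# three hypothetical candidate operations with (mean accuracy, #trials)
ops = {"conv3x3": (0.80, 40), "conv5x5": (0.78, 30), "skip": (0.60, 30)}
total = sum(n for _, n in ops.values())
scores = {name: ucb(m, n, total) for name, (m, n) in ops.items()}
weakest = min(scores, key=scores.get)       # candidate to abandon
```

Pruning by UCB rather than by the raw mean avoids discarding an operation that merely has not been sampled enough yet.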
Notwithstanding many years of progress, visual tracking remains a difficult but important problem. Since most top-performing tracking methods have their own strengths and weaknesses and are suited to handling only certain types of variation, one of the next challenges is to integrate all these methods and address the problem of long-term persistent tracking in ever-changing environments. Towards this goal, we consider visual tracking in a novel weakly supervised learning scenario where (possibly noisy) labels, but no ground truth, are provided by multiple imperfect oracles (i.e., different trackers). These trackers naturally have intrinsic diversity due to their different design strategies, and we propose a probabilistic method to simultaneously infer the most likely object position from the outputs of all trackers and estimate the accuracy of each tracker. An online tracker evaluation strategy and a heuristic training data selection scheme are adopted to make the inference more effective and efficient. Consequently, the proposed method avoids the pitfalls of relying on any single tracking method and obtains reliably labeled samples to incrementally update each tracker (if it is an appearance-adaptive tracker) to capture appearance changes. Extensive experiments on challenging video sequences demonstrate the robustness and effectiveness of the proposed method.
• A novel weakly supervised learning-based object tracking method is proposed.
• The method presents a natural way of fusing multiple complementary methods.
• The method can evaluate online tracking methods in the absence of ground truth.
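A minimal sketch of the idea of jointly inferring the object position and each tracker's reliability: here a simple iterative reweighting stands in for the paper's full probabilistic inference, and the tracker outputs are hypothetical:

```python
import numpy as np

def fuse_trackers(positions, n_iters=10):
    """Alternate between (a) estimating the object position as a
    reliability-weighted mean of tracker outputs and (b) re-estimating
    each tracker's reliability from its distance to that estimate."""
    positions = np.asarray(positions, dtype=float)   # (n_trackers, 2)
    w = np.full(len(positions), 1.0 / len(positions))
    for _ in range(n_iters):
        est = (w[:, None] * positions).sum(axis=0)   # weighted position
        err = np.linalg.norm(positions - est, axis=1)
        w = 1.0 / (err + 1e-6)                       # closer => more reliable
        w /= w.sum()
    return est, w

# two trackers agree near (10, 10); the third has drifted far away
positions = [[10.0, 10.0], [10.5, 9.8], [50.0, 50.0]]
est, weights = fuse_trackers(positions)
```

The estimated weights play the role of the online tracker evaluation: the drifted tracker ends up with the lowest reliability, so it contributes little to the fused position.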
In recent years, there has been ever-increasing research focus on the Bag-of-Words based near-duplicate visual search paradigm with inverted indexing. One fundamental yet unexplored challenge is how to maintain the large indexing structure within a single server subject to its memory constraint, which makes it extremely hard to scale up to millions or even billions of images. In this paper, we propose to parallelize the near-duplicate visual search architecture to index millions of images over multiple servers, distributing both the visual vocabulary and the corresponding indexing structure. We optimize the distribution of vocabulary indexing from a machine learning perspective, which provides a "memory-light" search paradigm that leverages the computational power across multiple servers to reduce search latency. In particular, our solution addresses two essential issues: "what to distribute" and "how to distribute". "What to distribute" is addressed by a "lossy" vocabulary boosting, which discards both frequent and indiscriminative words prior to distribution. "How to distribute" is addressed by learning an optimal distribution function, which maximizes the uniformity of assigning the words of a given query to multiple servers. We validate the distributed vocabulary indexing scheme in a real-world location search system over 10 million landmark images. Compared to the state-of-the-art alternatives of single-server search [5], [6], [16] and distributed search [23], our scheme yields a significant speedup of about 200% at comparable precision by distributing only 5% of the words. We also report excellent robustness even when some servers crash.
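The "how to distribute" step amounts to a word-to-server assignment function. A simple hash-based stand-in (not the learned optimal distribution from the paper) illustrates the mechanics:

```python
import hashlib

def assign_word(word, n_servers):
    """Deterministically map a visual word to one of n_servers
    (hash-based stand-in for the learned distribution function)."""
    h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return h % n_servers

def distribute_query(words, n_servers):
    """Group a query's words by the server that indexes them, so the
    per-server inverted-index lookups can run in parallel."""
    buckets = {s: [] for s in range(n_servers)}
    for w in words:
        buckets[assign_word(w, n_servers)].append(w)
    return buckets

buckets = distribute_query(["w1", "w2", "w3", "w4"], n_servers=4)
```

The learned distribution function replaces the hash so that each query's words spread as uniformly as possible across servers, which is what bounds the worst-case per-server latency.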
Polygonatum cyrtonema Hua, a rhizome‐propagating herb endemic to China, is used in many traditional Chinese medicines and foods. The hilly mountains of western and southern Anhui province are among its main natural distribution and artificial cultivation areas. We assessed the genetic diversity and structure of P. cyrtonema germplasm resources in Anhui using nine pairs of SSR primers and selected morphological characters. The results showed that the 13 sampled populations of P. cyrtonema possessed normal levels of genetic diversity but could be clustered into three distinct genetic groups. The levels of within‐group genetic diversity were similar among the three groups, but their distribution areas and morphological characters were remarkably different. Group I was confined to the Tianmu (including Jiuhua) Mountains, group II was distributed in the Huangshan Mountains, and group III was restricted to the Dabie Mountains. Furthermore, the leaf length:width ratio differed significantly among groups, and the peduncle length of group I was significantly shorter than that of the other two groups. The level of genetic differentiation among the three groups was close to that between different species within the genus. Thus, the three genetic groups of P. cyrtonema should be considered independent units for conservation and breeding management in the Anhui region.
In this letter, we propose a cross-view down/up-sampling (CDU) method for the reduced-resolution multiview depth video coding framework, which exploits cross-view information to assist up-sampling at the decoder. In the down-sampling procedure of CDU, odd-even interlaced extraction is employed to preserve more confident information from the original depth video at reduced resolution. At the decoder, the cross-view information is exploited to up-sample the reconstructed depth video, and an iterative interpolation process is proposed to eliminate the effect of compression distortion on this up-sampling. Experimental results demonstrate gains of up to 3.88 dB for the proposed algorithm and better quality of synthesized views.
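One possible realization of the odd-even interlaced extraction, halving horizontal resolution: even rows keep even columns while odd rows keep odd columns, so neighbouring rows sample complementary pixel positions. The exact sampling pattern is our assumption; the letter does not spell it out:

```python
import numpy as np

def odd_even_interlaced_downsample(depth):
    """Halve the horizontal resolution of a depth frame: even rows keep
    even columns, odd rows keep odd columns (interlaced extraction,
    one assumed pattern)."""
    h, w = depth.shape
    out = np.empty((h, w // 2), dtype=depth.dtype)
    out[0::2] = depth[0::2, 0::2][:, : w // 2]   # even rows: even columns
    out[1::2] = depth[1::2, 1::2][:, : w // 2]   # odd rows: odd columns
    return out

d = np.arange(16).reshape(4, 4)
low = odd_even_interlaced_downsample(d)
```

Because adjacent rows retain different column phases, the decoder's interpolation can recover missing samples from both the complementary phase and the cross-view reference.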