Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work as well as absolute position encoding? To clarify this question, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT [21] and DETR [1] obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO, respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablations and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
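To illustrate the general idea of relative position encoding in self-attention (a minimal 1D sketch, not the exact iRPE formulation; the clipping range and bias values are hypothetical), the snippet below adds a learned scalar bias, indexed by the clipped relative distance between query and key positions, to the attention logits:

```python
import math

def attention_with_rpe(q, k, v, rel_bias, max_dist=2):
    # q, k, v: lists of token vectors; rel_bias: dict mapping a clipped
    # relative distance (j - i) in [-max_dist, max_dist] to a scalar bias
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # attention logits: scaled dot product plus relative-position bias
        logits = []
        for j in range(n):
            dot = sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            rel = max(-max_dist, min(max_dist, j - i))
            logits.append(dot + rel_bias[rel])
        # numerically stable softmax over the logits
        m = max(logits)
        w = [math.exp(x - m) for x in logits]
        s = sum(w)
        w = [x / s for x in w]
        # output: weighted sum of value vectors
        out.append([sum(w[j] * v[j][t] for j in range(n)) for t in range(d)])
    return out
```

Because the bias depends only on the relative offset, the same table is shared across all query positions; iRPE extends this kind of scheme to directional 2D distances.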
We consider the compression artifacts reduction problem, where a compressed image is transformed into an artifact-free image. Recent approaches to this problem typically train a one-to-one mapping using a per-pixel L2 loss between the outputs and the ground truths. We point out that these approaches tend to produce overly smooth results and that PSNR does not reflect their real performance. In this paper, we propose a one-to-many network, which measures output quality using a perceptual loss, a naturalness loss, and a JPEG loss. We also avoid grid-like artifacts during deconvolution using a shift-and-average strategy. Extensive experimental results demonstrate the dramatic visual improvement of our approach over the state of the art.
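To convey the intuition behind shift-and-average (a simplified 1D stand-in, not the paper's exact 2D procedure), the sketch below averages shifted copies of a signal so that artifacts periodic with the deconvolution stride cancel out:

```python
def shift_and_average(x, stride=2):
    # average `stride` shifted copies of the signal so that periodic
    # (grid-like) upsampling artifacts with that period cancel out
    n = len(x)
    out = []
    for i in range(n):
        vals = [x[min(n - 1, i + s)] for s in range(stride)]
        out.append(sum(vals) / stride)
    return out
```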
In this paper, we study oracle character recognition and general sketch recognition. First, a dataset of oracle characters, which are the oldest hieroglyphs in China yet remain a part of modern Chinese characters, is collected for analysis. Second, typical visual representations from shape- and sketch-related works are evaluated. We analyze the problems these representations suffer from and derive several representation design criteria. Based on the analysis, we propose a novel hierarchical representation that combines a Gabor-related low-level representation and a sparse-encoder-related mid-level representation. Extensive experiments show the effectiveness of the proposed representation in both oracle character recognition and general sketch recognition. The proposed representation is also complementary to convolutional neural network (CNN)-based models. We introduce a solution that combines the proposed representation with CNN-based models and achieves better performance than either approach alone. This solution has beaten humans at recognizing general sketches.
We present a novel global stereo model designed for view interpolation. Unlike existing stereo models, which only output a disparity map, our model outputs a 3D triangular mesh that can be directly used for view interpolation. To this end, we partition the input stereo images into 2D triangles with shared vertices. Lifting the 2D triangulation to 3D naturally generates a corresponding mesh. A technical difficulty is properly splitting vertices into multiple copies when they lie on depth-discontinuity boundaries. To deal with this problem, we formulate our objective as a two-layer MRF, with the upper layer modeling the splitting properties of the vertices and the lower layer optimizing region-based stereo matching. Experiments on the Middlebury and Herodion datasets demonstrate that our model synthesizes visually coherent new views with high PSNR and outputs high-quality disparity maps that rank first on the new, challenging high-resolution Middlebury 3.0 benchmark.
The automatic standardization of nomenclature for anatomical structures in radiotherapy (RT) clinical data is a critical prerequisite for data curation and data-driven research in the era of big data and artificial intelligence, but it remains an unmet need. Existing methods either cannot handle cross-institutional datasets or suffer from heavy class imbalance and poor-quality delineation in clinical RT datasets. To solve these problems, we propose an automated structure nomenclature standardization framework, 3D Non-local Network with Voting (3DNNV). The framework consists of an improved data processing strategy, namely adaptive sampling and adaptive cropping (ASAC) with voting, and an optimized feature extraction module. It simulates clinicians' domain knowledge and recognition mechanisms to identify small-volume organs at risk (OARs) under heavily imbalanced data better than other methods. We used partial data from an open-source head-and-neck cancer dataset to train the model and then tested it on three cross-institutional datasets to demonstrate its generalizability. 3DNNV outperformed the baseline model, achieving higher average true positive rates (TPRs) over all categories on the three test datasets (+8.27%, +2.39%, and +5.53%, respectively). More importantly, 3DNNV improved the F1 score on the test dataset from 28.63% to 91.17% over the baseline for a small-volume OAR with only 9 training samples. The results show that 3DNNV can be applied to identify OARs, even error-prone ones. Furthermore, we discuss the limitations and applicability of the framework in practical scenarios. The framework can assist in standardizing structure nomenclature to facilitate data-driven clinical research in cancer radiotherapy.
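The voting step can be illustrated with a minimal sketch (a generic majority vote over per-crop predictions; the class names are hypothetical and the real ASAC strategy also controls how crops are sampled):

```python
from collections import Counter

def vote_over_crops(crop_predictions):
    # aggregate the per-crop class predictions for one structure
    # by majority vote, so a few misclassified crops are outvoted
    counts = Counter(crop_predictions)
    return counts.most_common(1)[0][0]
```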
Bezigons, i.e., closed paths composed of Bézier curves, are widely employed to describe shapes in image vectorization results. However, most existing vectorization techniques infer the bezigons by simply approximating an intermediate vector representation (such as polygons). Consequently, the resultant bezigons are sometimes imperfect due to accumulated errors, fitting ambiguities, and a lack of curve priors, especially for low-resolution images. In this paper, we describe a novel method for vectorizing clipart images. In contrast to previous methods, we directly optimize the bezigons rather than using intermediate representations; therefore, the resultant bezigons are not only of higher fidelity to the original raster image but also more plausible, as if traced by a proficient expert. To enable such optimization, we overcome several challenges and devise a differentiable data energy as well as several curve-based prior terms. To improve the efficiency of the optimization, we also take advantage of the local control property of bezigons and adopt an overlapped piecewise optimization strategy. The experimental results show that our method outperforms both the current state-of-the-art method and commonly used commercial software in terms of bezigon quality.
Identifying the same individual across different scenes is an important yet difficult task in intelligent video surveillance. Its main difficulty lies in preserving the similarity of the same person under large appearance and structural variations while discriminating between different individuals. In this paper, we present a scalable distance-driven feature learning framework based on a deep neural network for person re-identification and demonstrate its effectiveness in handling these challenges. Specifically, given training images with class labels (person IDs), we first produce a large number of triplet units, each of which contains three images: one person with a matched reference and a mismatched reference. Treating the units as input, we build a convolutional neural network to generate layered representations, followed by an L2 distance metric. Through parameter optimization, our framework maximizes the relative distance between the matched pair and the mismatched pair of each triplet unit. A nontrivial issue with this framework is that the triplet organization cubically enlarges the number of training samples, as one image can be involved in several triplet units. To overcome this problem, we develop an effective triplet generation scheme and an optimized gradient descent algorithm, making the computational load depend mainly on the number of original images rather than the number of triplets. On several challenging databases, our approach achieves very promising results and outperforms other state-of-the-art approaches.
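The relative-distance objective on each triplet unit can be sketched as a standard margin-based triplet loss (a common formulation of this idea; the margin value is hypothetical and the paper's exact objective may differ):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    # squared L2 distances between embedding vectors
    d_ap = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_an = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    # hinge on the relative distance: the mismatched pair should be
    # farther apart than the matched pair by at least `margin`
    return max(0.0, margin + d_ap - d_an)
```

The loss is zero once the mismatched distance exceeds the matched distance by the margin, so only violating triplets contribute gradients.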
• We present a novel feature learning framework for person re-identification.
• Our framework is based on maximum relative distance comparison.
• The learning algorithm is scalable to large amounts of data.
• We demonstrate superior performance over other state-of-the-art methods.
In this paper, we present a framework for object categorization via sketch graphs that incorporate shape and structure information. In this framework, we integrate the learnable And–Or graph model, a hierarchical structure that combines the reconfigurability of a stochastic context-free grammar (SCFG) with the constraints of a Markov random field (MRF). For computational efficiency, we generalize instances from the And–Or graph models and perform a set of sequential tests for cascaded object categorization, rather than inferring directly with the And–Or graph models. We study 33 categories, each consisting of a small dataset of 30 instances, and generalize 30 additional templates with varied appearance from each learned And–Or graph model. These samples better span the appearance space and form an augmented training set ΩT of 1980 (60×33) training templates. To perform recognition on a testing image, we use a set of sequential tests that project ΩT into different representation spaces to narrow the number of candidate matches in ΩT. We use "graphlets" (structural elements) as our local features and model ΩT at each stage using histograms of graphlets over categories, histograms of graphlets over object instances, histograms of pairs of graphlets over objects, and shape context. Each test is increasingly computationally expensive, and by the end of the cascade a small candidate set remains for our most powerful test, a top-down graph matching algorithm. We apply the proposed approach to a challenging public dataset of 33 object categories and achieve state-of-the-art performance.
► We present a framework for object categorization via sketch graphs.
► We generate samples from the learnable And–Or graph models for training.
► We perform a set of sequential tests for cascaded object categorization.
► Our system achieves an 81.4% classification rate on 33 object categories.
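The cascade's control flow reduces to a simple pattern (a generic sketch of sequential filtering, not the specific graphlet tests): cheap tests run first and prune the candidate set so that only a few survivors reach the expensive graph matcher.

```python
def cascaded_filter(candidates, tests):
    # `tests` are ordered from cheapest to most expensive; each stage
    # keeps only the candidates that pass, shrinking the set that the
    # next (costlier) stage must examine
    for test in tests:
        candidates = [c for c in candidates if test(c)]
        if not candidates:
            break
    return candidates
```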
Image inpainting, which completes large free-form missing regions in images, is a promising yet challenging task. State-of-the-art approaches have achieved significant progress by taking advantage of generative adversarial networks (GANs). However, these approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512×512). The challenges mainly derive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing the model to capture both informative distant image contexts and rich patterns of interest. To improve texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. This training objective forces the discriminator to distinguish the detailed appearance of real and synthesized patches, which in turn pushes the generator to synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state of the art. A user study with more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal.
Results show that our model achieves promising completions in the real world. Code and models are released at https://github.com/researchmm/AOT-GAN-for-Inpainting.
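The aggregation of transformations from various receptive fields can be illustrated in 1D (a simplified stand-in for the AOT block: parallel dilated convolutions over the same input, summed; the real block uses learned 2D convolutions and a gated aggregation):

```python
def dilated_conv1d(x, kernel, dilation):
    # 'same'-padded 1D convolution; the dilation rate widens the
    # receptive field without adding kernel weights
    n, k = len(x), len(kernel)
    out = []
    for i in range(n):
        s = 0.0
        for j in range(k):
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < n:
                s += kernel[j] * x[idx]
        out.append(s)
    return out

def aot_block(x, kernels, dilations):
    # run parallel branches with different dilation rates over the same
    # input, then aggregate (here: sum) their outputs, so the block sees
    # both nearby and distant context at once
    branches = [dilated_conv1d(k_x[0], k_x[1], k_x[2])
                for k_x in [(x, k, d) for k, d in zip(kernels, dilations)]]
    return [sum(b[i] for b in branches) for i in range(len(x))]
```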