Visual affordance reasoning studies, from an image or video, what kinds of interaction are possible and whether an interaction is reasonable in the current environment. When inferring the affordances of an object, the semantics of and relations among objects in the environment should be considered, and a graph is usually used to model the environment context for the object. Since the weight of an edge in the graph describes how much information is contributed between objects during affordance reasoning, this paper proposes VAR-Net (Visual Affordance Reasoning Network), which models the weights as graph attention coefficients and learns them from objects' semantic and visual features, which imply their affordances. VAR-Net achieves higher accuracy on the COCO-Tasks and ADE-Affordance datasets. Experiments also explain the meaning of the edge weights in VAR-Net: for a given affordance, the more an object contributes to it, the larger the weights of the edges linking it to other objects, and vice versa, which makes object features distinguishable for inferring affordances.
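The abstract's core idea, edge weights learned as graph attention coefficients over node features, can be sketched in the standard GAT style. This is a minimal illustration, not the paper's actual VAR-Net architecture; the function and parameter names are hypothetical.

```python
import numpy as np

def attention_weights(h, adj, W, a):
    """GAT-style edge weights: a learned score on concatenated
    projected node features, softmax-normalized over each node's
    neighborhood so the weights express relative contribution.
    h: (n, d) node features; adj: (n, n) 0/1 adjacency;
    W: (d, d') projection; a: (2*d',) attention vector."""
    z = h @ W                                   # project node features
    n = z.shape[0]
    scores = np.full((n, n), -np.inf)           # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                s = np.concatenate([z[i], z[j]]) @ a
                scores[i, j] = np.maximum(0.2 * s, s)   # LeakyReLU
    # softmax over each neighborhood -> edge weights sum to 1 per node
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    e[~adj.astype(bool)] = 0.0
    return e / e.sum(axis=1, keepdims=True)
```

Each row of the returned matrix gives one node's attention over its neighbors, matching the abstract's reading of edge weights as per-object information contributions.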
Long document classification has attracted tremendous attention in the field of Natural Language Processing due to the exponential increase in publications. Although common text classification methods can be extended to long document classification, they are constrained by text length and lack the expressiveness to model the structure of a long document. To solve these problems, we propose a hierarchical multiple-granularity attention network for long document classification, in which word- and section-level features are extracted and fused to represent the complex structure of the long document. Furthermore, a feature-based section pooling module is adopted to eliminate redundant text information and accelerate computation. A series of experiments is conducted to evaluate the proposed method. The experimental results verify that our method is effective, efficient, and competitive with related state-of-the-art methods.
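The hierarchical word/section attention with section pooling described above can be sketched as follows. This is an illustrative simplification under assumed shapes, not the paper's model; the query vectors and the top-k pooling rule are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_long_doc(sections, w_word, w_sec, keep=2):
    """Hierarchical attention sketch: attend over words within each
    section, pool away low-scoring sections, then attend over the
    survivors to form one document vector.
    sections: list of (n_words, d) arrays; w_word, w_sec: (d,) queries."""
    sec_vecs, sec_scores = [], []
    for words in sections:
        a = softmax(words @ w_word)          # word-level attention
        v = a @ words                        # section representation
        sec_vecs.append(v)
        sec_scores.append(float(v @ w_sec))  # score used for pooling
    order = np.argsort(sec_scores)[::-1][:keep]   # keep top-k sections
    kept = np.stack([sec_vecs[i] for i in order])
    b = softmax(kept @ w_sec)                # section-level attention
    return b @ kept                          # document representation
```

Dropping low-scoring sections before the second attention pass is what makes the pooling both a denoising step and a speedup, as the abstract claims.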
In the latest video coding standard, H.264, intra-frame prediction is employed to reduce spatial correlation between blocks. In this paper, a new algorithm is proposed to improve the performance of intra prediction in H.264/AVC. It replaces mode 2 of the standard prediction methods with a hybrid BMA (block-matching algorithm)-DC mode. Experimental results show that the proposed algorithm yields a significant improvement in coding performance compared to the standard H.264 algorithm.
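A BMA-DC hybrid intra mode of the kind the abstract describes could look like the sketch below: try block matching against already-coded pixels, and fall back to the standard DC prediction when no good match exists. The search template (the row above each block) and the threshold rule are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def bma_dc_predict(recon, top, left, bs, search, thresh):
    """Hybrid intra predictor sketch: search the already-coded area
    above the current block for the candidate whose upper context row
    best matches (lowest SAD) the row above the current block; fall
    back to DC prediction when the best match is poor.
    recon: 2-D reconstructed frame; (top, left): block origin; bs:
    block size; search: search range; thresh: SAD acceptance bound."""
    template = recon[top - 1, left:left + bs]      # row just above block
    best_sad, best = np.inf, None
    for y in range(max(1, top - search), top - bs + 1):   # fully above
        for x in range(max(0, left - search),
                       min(left + search, recon.shape[1] - bs) + 1):
            sad = np.abs(recon[y - 1, x:x + bs] - template).sum()
            if sad < best_sad:
                best_sad, best = sad, recon[y:y + bs, x:x + bs]
    if best is not None and best_sad < thresh:
        return best                                # block-matching mode
    # DC fallback: mean of the top and left neighboring pixels
    dc = np.concatenate([recon[top - 1, left:left + bs],
                         recon[top:top + bs, left - 1]]).mean()
    return np.full((bs, bs), dc)
```

Restricting candidates to the already-reconstructed area keeps the predictor causal, so the decoder can repeat the same search without side information beyond the mode flag.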
Satisfactory recognition performance has been achieved for simple and controllable printed molecular images. However, recognizing handwritten chemical structure images remains unresolved due to the inherent ambiguities in handwritten atoms and bonds, as well as the significant challenge of converting projected 2D molecular layouts into markup strings. To address these problems, this paper proposes an end-to-end framework for handwritten chemical structure image recognition, with a novel structure-specific markup language (SSML) and a random conditional guided decoder (RCGD). SSML alleviates the ambiguity and complexity of Chemfig syntax through an innovative markup language that accurately depicts molecular structures. Besides, we propose RCGD to address the issue of multiple-path decoding of molecular structures; it is composed of conditional attention guidance, memory classification, and path selection mechanisms. To fully confirm the effectiveness of the end-to-end method, a new database containing 50,000 handwritten chemical structure images (EDU-CHEMC) has been established. Experimental results demonstrate that, compared to traditional SMILES sequences, our SSML significantly reduces the semantic gap between chemical images and markup strings. It is worth noting that our method can also recognize invalid or non-existent organic molecular structures, making it highly applicable to teaching-evaluation tasks in chemistry and biology education. EDU-CHEMC will be released soon at https://github.com/iFLYTEK-CV/EDU-CHEMC.
This paper proposes a joint feature combining geometric parameters and color moments to represent speaking-mouth image frames in an image-based visual speech synthesis system. Experimental results show that the proposed joint feature effectively provides a basis for classifying speaking-mouth images into different speaking states according to lip shape and tooth visibility. To derive the geometric feature, a step-by-step point feature localization algorithm is introduced to obtain the feature points. By comparing the results of this automatic localization algorithm with manual annotation, the paper shows that the localization results meet the needs of the geometric feature representation.
3D Face Recognition Based on Facial Profiles Hengliang Tang; Yanfeng Sun; Baocai Yin
2009 International Conference on Information Engineering and Computer Science,
2009-Dec.
Conference Proceeding
In this paper, a novel 3D face recognition algorithm based on facial profile information is presented; it requires no training phase and has low computational cost. We first extract and normalize the 3D face profiles, and then the shape and texture information of the profiles is used to represent 3D faces. After that, three representations and their corresponding metrics for facial profiles are proposed to address the face recognition task: the arc length metric (PL), the feature points' location metric (PFL), and the relative distance matrix metric (PFDM). Finally, we test the algorithm on the BJUT-3D face database and achieve a best recognition rate of 97.1% in the fusion experiment.
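Of the three profile metrics named above, the arc length metric (PL) is the simplest to illustrate: compare two profile curves by their total arc lengths. The sketch below assumes profiles are given as ordered 3D point sequences; the distance definition is a plausible reading of PL, not necessarily the paper's exact formula.

```python
import numpy as np

def arc_length(profile):
    """Total arc length of a facial profile curve given as an
    ordered (n, 3) array of 3D points: sum of segment lengths."""
    diffs = np.diff(profile, axis=0)            # consecutive segments
    return np.sqrt((diffs ** 2).sum(axis=1)).sum()

def pl_distance(p1, p2):
    # Compare two profiles by the absolute difference of arc lengths
    # (sketch of the PL metric idea; assumes normalized profiles).
    return abs(arc_length(p1) - arc_length(p2))
```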
Deep reinforcement learning has achieved great success in laser-based collision avoidance because a laser senses accurate depth information without much redundant data, which keeps the algorithm robust when it is migrated from the simulation environment to the real world. However, high-cost laser devices are not only difficult to deploy across a large number of robots but also show unsatisfactory robustness to complex obstacles, including irregular obstacles such as tables, chairs, and shelves, as well as complex ground and special materials. In this paper, we propose a novel monocular-camera-based framework for complex obstacle avoidance. In particular, we transform the captured RGB images into pseudo-laser measurements for efficient deep reinforcement learning. Compared to a traditional laser measurement captured at a fixed height, which contains only one-dimensional distances to neighboring obstacles, our pseudo-laser measurement fuses the depth and semantic information of the captured RGB image, which makes our method effective for complex obstacles. We also design a feature extraction guidance module to weight the input pseudo-laser measurement so that the agent attends more reasonably to the current state, which improves the accuracy and efficiency of the obstacle avoidance policy.
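The idea of collapsing an RGB image's depth and semantics into a 1-D pseudo-laser scan can be sketched as below: keep, per image column, the nearest depth among pixels whose class is an obstacle. The exact fusion in the paper may differ; the fill value for obstacle-free columns is an assumption.

```python
import numpy as np

def pseudo_laser(depth, seg, obstacle_ids):
    """Fuse depth and semantics into a 1-D pseudo-laser measurement:
    for every image column, take the nearest depth among pixels whose
    semantic class is in obstacle_ids, so irregular obstacles (e.g.
    chair legs at any height) register in the scan.
    depth: (H, W) distances; seg: (H, W) class ids."""
    mask = np.isin(seg, obstacle_ids)           # obstacle pixels only
    d = np.where(mask, depth, np.inf)
    scan = d.min(axis=0)                        # nearest obstacle per column
    # Columns with no obstacle pixel: report the maximum sensed depth
    # (an assumed convention for "free space").
    return np.where(np.isfinite(scan), scan, depth.max())
```

Unlike a real laser scan taken at one fixed height, this per-column minimum sees an obstacle at any height, which is what makes the representation effective for tables and shelves.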
Recently, the Weisfeiler-Lehman (WL) graph isomorphism test was used to measure the expressiveness of graph neural networks (GNNs), showing that neighborhood-aggregation GNNs are at most as powerful as the 1-WL test in distinguishing graph structures. Improvements in analogy to the \(k\)-WL test (\(k>1\)) have also been proposed. However, the aggregators in these GNNs are far from injective, as the WL test requires, and suffer from weak distinguishing strength, making them expressiveness bottlenecks. In this paper, we improve expressiveness by exploring powerful aggregators. We reformulate aggregation with the corresponding aggregation coefficient matrix and then systematically analyze the requirements this matrix must satisfy to build more powerful, and even injective, aggregators. The analysis can also be viewed as a strategy for preserving the rank of hidden features, and it implies that basic aggregators correspond to a special case of low-rank transformations. We also show the necessity of applying nonlinear units ahead of aggregation, which differs from most aggregation-based GNNs. Based on our theoretical analysis, we develop two GNN layers, ExpandingConv and CombConv. Experimental results show that our models significantly boost performance, especially on large and densely connected graphs.
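The reformulation of aggregation as multiplication by a coefficient matrix, and the rank-loss view of basic aggregators, can be illustrated concretely. The snippet below (an illustration of the general idea, not the paper's layers) builds the mean aggregator's coefficient matrix C, so that aggregation is H' = C @ H and rank(H') <= rank(C).

```python
import numpy as np

def agg_matrix_mean(adj):
    """Coefficient matrix C of the mean aggregator with self-loops:
    aggregation is just H' = C @ H, so the aggregated features can
    never have higher rank than C itself."""
    a = adj + np.eye(adj.shape[0])              # add self-loops
    return a / a.sum(axis=1, keepdims=True)     # row-normalize

# On a triangle graph, every node has the same closed neighborhood,
# so all rows of C coincide: C has rank 1 and C @ H maps all nodes to
# identical features -- the non-injectivity the abstract identifies
# as an expressiveness bottleneck.
```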
Transcoding of H.264 Bitstream to AVS Bitstream Baoguo Wang; Yunhui Shi; Baocai Yin
2009 5th International Conference on Wireless Communications, Networking and Mobile Computing,
2009-Sept.
Conference Proceeding
This paper proposes a transcoding scheme from H.264 to AVS. H.264, developed by MPEG and ITU, is the highest-performance video coding standard; AVS was developed by the Audio Video Coding Standard Working Group of China. The two standards will coexist in future applications, so transcoding video from H.264 to AVS is worthwhile, and until now there has been almost no complete research on this scheme. This paper proposes an effective method that achieves efficient and fast transcoding without much quality loss by reusing intra modes, inter modes, motion vectors, SADs, etc. Detailed experimental results demonstrate that the proposed algorithm effectively reduces transcoding complexity.