Word-level sign language recognition (SLR) is a significant task that transcribes a sign language video into a word. Current deep-learning-based frameworks mostly combine spatial feature extractors based on convolutional neural networks (CNNs) with sequence learners. These methods either lack sufficient capacity to establish high-level visual semantic knowledge and incorporate image details, or perform poorly at comprehending video frame sequences. Focusing on gestures and facial expressions is essential for interpreting sign language; however, it is challenging to crop these elements from images and distill them end-to-end. In this paper, a fully self-attention framework for word-level SLR is proposed to tackle the above issues; it integrates a Vision Transformer as the spatial encoder with an improved temporal Transformer. In addition, we introduce a masking-future operation to improve the Transformer in the temporal module. The Vision Transformer first refines latent high-level semantic feature sequences from sign language videos and feeds them into the temporal module. The masking-future Transformer then enhances this sequence by making subsequent time steps invisible at each frame and generates the final recognition. This approach integrates global and local spatial information; furthermore, it can distinguish the latent semantic features contained in sign language action sequences. To validate the proposed approach, we perform extensive experiments on two datasets. The results and ablation studies demonstrate the effectiveness of the method, which achieves new state-of-the-art performance on the WLASL dataset using RGB images alone.
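To make the masking-future idea concrete, below is a minimal PyTorch sketch of a temporal Transformer whose attention is restricted to the current and earlier frames; all layer sizes, names, and the classification head are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of masked-future temporal attention over per-frame
# ViT features, assuming PyTorch; dimensions are illustrative.
import torch
import torch.nn as nn

class MaskedFutureTemporalEncoder(nn.Module):
    """Transformer encoder where each time step may attend only to
    itself and earlier frames (the "masking future" operation)."""

    def __init__(self, dim=768, heads=8, layers=4, num_classes=2000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):           # (batch, T, dim) ViT outputs
        T = frame_feats.size(1)
        # Strictly upper-triangular mask: True entries (future frames) are blocked.
        future_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        hidden = self.encoder(frame_feats, mask=future_mask)
        return self.head(hidden[:, -1])       # classify from the last time step

feats = torch.randn(2, 16, 768)               # 16 frames of dummy ViT features
logits = MaskedFutureTemporalEncoder()(feats)
```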
There are over 150 sign languages worldwide, each with numerous local variants and thousands of signs. However, collecting annotated data for each sign language to train a model is a laborious and expert-dependent task. To address this issue, this paper introduces the problem of few-shot sign language recognition (FSSLR) in a cross-lingual setting. The central motivation is to recognize a novel sign, even if it belongs to a sign language unseen during training, from a small set of examples. To tackle this problem, we propose a novel embedding-based framework that first extracts a spatio-temporal visual representation based on video and hand features, as well as hand landmark estimates. To establish a comprehensive test bed, we propose three meta-learning FSSLR benchmarks that span multiple languages, and extensively evaluate the proposed framework on them. The experimental results demonstrate the effectiveness and superiority of the proposed approach for few-shot sign language recognition in both monolingual and cross-lingual settings.
•The motivation of the problem is to recognize a novel sign based on a small set of examples.
•A novel framework that leverages signer body and hand features for embedding is proposed.
•Three novel meta-learning benchmarks that span multiple languages are introduced.
•Our embedding framework achieves the best performance on the three proposed benchmarks.
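To illustrate how an embedding-based few-shot recognizer of this kind can classify a novel sign from a handful of examples, here is a minimal prototypical-network-style sketch; it assumes each video has already been mapped to an embedding by a framework like the one above, and all names and shapes are hypothetical.

```python
# A minimal prototypical-network-style sketch of few-shot classification
# over precomputed video embeddings; shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def fewshot_classify(support, support_labels, query, n_way):
    """support: (n_way*k_shot, dim) embeddings of labelled example videos;
    query: (q, dim) embeddings of novel-sign videos to classify."""
    # One prototype per class: the mean of its support embeddings.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )
    # Nearest-prototype classification by Euclidean distance.
    dists = torch.cdist(query, prototypes)          # (q, n_way)
    return F.softmax(-dists, dim=1).argmax(dim=1)   # predicted class per query

# 5-way 1-shot episode with random embeddings standing in for real features.
sup = torch.randn(5, 256)
pred = fewshot_classify(sup, torch.arange(5), torch.randn(3, 256), n_way=5)
```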
Spoken language recognition refers to the automatic process of determining or verifying the identity of the language spoken in a speech sample. We study a computational framework that allows such a decision to be made in a quantitative manner. In recent decades, tremendous progress has been made in spoken language recognition, benefiting from technological breakthroughs in related areas such as signal processing, pattern recognition, cognitive science, and machine learning. In this paper, we attempt to provide an introductory tutorial on the fundamentals of the theory and the state-of-the-art solutions, from both phonological and computational perspectives. We also give a comprehensive review of current trends and future research directions, using the language recognition evaluation (LRE) formulated by the National Institute of Standards and Technology (NIST) as case studies.
Continuous Sign Language Recognition (CSLR) is a long-standing challenge in Computer Vision due to the difficulty of detecting explicit boundaries between the words in a sign sentence. To deal with this challenge, we propose a two-stage model. In the first stage, the predictor model, which combines a CNN, SVD, and LSTM, is trained on the isolated signs. In the second stage, we apply a post-processing algorithm to the Softmax outputs obtained from the first part of the model in order to separate the isolated signs in the continuous signs. While the proposed model is trained on isolated sign classes with similar frame numbers, it is evaluated on continuous sign videos with a different frame length for each isolated sign class. Due to the lack of a large dataset containing both sign sequences and the corresponding isolated signs, two public datasets for Isolated Sign Language Recognition (ISLR), RKS-PERSIANSIGN and ASLLVD, are used for evaluation. Results on the continuous sign videos confirm the efficiency of the proposed model in detecting isolated sign boundaries. The intuition behind the proposed post-processing methodology is to improve recognition accuracy by removing untrained and repetitive signs using a sliding-window approach during the inference phase. To our knowledge, this is the first instance of such a mechanism in this domain. We therefore present our methodology as a baseline for the research community to enrich, as well as to evaluate on other real data.
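The sliding-window intuition can be sketched as follows: the stage-one classifier's softmax outputs are averaged over windows, low-confidence (likely untrained) windows are dropped, and consecutive duplicate predictions are collapsed. Window size, stride, and threshold below are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch of sliding-window post-processing over per-frame
# softmax outputs from an isolated-sign classifier.
import numpy as np

def postprocess(frame_probs, window=20, stride=10, min_conf=0.6):
    """frame_probs: (T, num_classes) softmax outputs from stage one."""
    sentence = []
    for start in range(0, len(frame_probs) - window + 1, stride):
        window_probs = frame_probs[start:start + window].mean(axis=0)
        label, conf = int(window_probs.argmax()), float(window_probs.max())
        if conf < min_conf:
            continue                       # likely an untrained/transition segment
        if not sentence or sentence[-1] != label:
            sentence.append(label)         # collapse repetitive predictions
    return sentence

probs = np.random.dirichlet(np.ones(50), size=120)   # dummy 120-frame video
print(postprocess(probs))
```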
In this paper, we propose a novel multimodal framework for isolated Sign Language Recognition (SLR) using sensor devices. Microsoft Kinect and Leap Motion sensors are used in our framework to capture finger and palm positions from two different views during each gesture. One sensor (Leap Motion) is kept below the hand(s) while the other (Kinect) is placed in front of the signer, capturing the horizontal and vertical movement of the fingers during sign gestures. A set of features is then extracted from the raw data captured by both sensors. Recognition is performed separately by Hidden Markov Model (HMM) and Bidirectional Long Short-Term Memory Neural Network (BLSTM-NN) sequential classifiers. In the next phase, the results are combined to boost recognition performance. The framework has been tested on a dataset of 7500 Indian Sign Language (ISL) gestures comprising 50 different sign-words. Our dataset includes single- as well as double-handed gestures. We observe that accuracy improves when data from both sensors are fused, compared with single-sensor recognition. We record improvements of 2.26% (single hand) and 0.91% (both hands) using HMM, and 2.88% (single hand) and 1.67% (both hands) using BLSTM-NN classifiers. Overall accuracies of 97.85% and 94.55% are achieved by combining HMM and BLSTM-NN for single-handed and double-handed signs, respectively.
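A minimal sketch of how the two classifiers' outputs might be combined at the score level is given below; the weighted-sum rule, the normalization, and all names are illustrative assumptions, and the paper's exact combination scheme may differ.

```python
# A minimal sketch of score-level fusion between two sequential classifiers,
# in the spirit of combining HMM and BLSTM-NN results for one gesture.
import numpy as np

def fuse_scores(hmm_scores, blstm_scores, alpha=0.5):
    """Each input: (num_classes,) class scores for one gesture instance."""
    hmm = hmm_scores / hmm_scores.sum()        # normalize to comparable scales
    blstm = blstm_scores / blstm_scores.sum()
    combined = alpha * hmm + (1 - alpha) * blstm   # weighted-sum fusion
    return int(np.argmax(combined))

hmm = np.random.rand(50)       # dummy scores over 50 ISL sign-words
blstm = np.random.rand(50)
print(fuse_scores(hmm, blstm))
```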
With the increase in the number of deaf-mute people in the Arab world and the lack of Arabic sign language (ArSL) recognition benchmark data sets, there is a pressing need to publish a large-volume and realistic ArSL data set. This study presents such a data set, consisting of 150 isolated ArSL signs. The data set is challenging due to the great similarity among hand shapes and motions in the collected signs. Along with the data set, a sign language recognition algorithm is presented. The proposed method consists of three major stages: hand segmentation, hand shape sequence and body motion description, and sign classification. The hand segmentation is based on the depth and position of the hand joints. Histograms of oriented gradients (HOG) and principal component analysis (PCA) are applied to the segmented hand shapes to obtain the hand shape sequence descriptor. The covariance of the three-dimensional joints of the upper half of the skeleton, in addition to the hand states and face properties, is adopted for motion sequence description. Canonical correlation analysis and random forest classifiers are used for classification. The achieved accuracy is 55.57% over the 150 ArSL signs, which is considered promising.
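As a rough illustration of the hand shape descriptor stage, the sketch below applies HOG followed by PCA to segmented hand crops; scikit-image and scikit-learn are assumed here, and all parameter values are illustrative rather than the authors' settings.

```python
# A minimal sketch of a HOG + PCA hand-shape descriptor over a sequence
# of segmented hand crops; sizes and parameters are illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

def hand_shape_descriptors(hand_images, n_components=32):
    """hand_images: list of grayscale segmented hand crops (2-D arrays)."""
    feats = np.stack([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in hand_images
    ])
    # Project the HOG vectors onto the leading principal components.
    return PCA(n_components=n_components).fit_transform(feats)

frames = [np.random.rand(64, 64) for _ in range(40)]  # dummy hand crops
seq_descriptor = hand_shape_descriptors(frames)       # (40, 32) sequence
```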
Sign language is a form of visual communication employing hand gestures, body movements, and facial expressions. The growing prevalence of hearing impairment has driven the research community towards the domain of Continuous Sign Language Recognition (CSLR), which involves identifying successive signs in a video stream without prior knowledge of their temporal boundaries. This survey article reviews CSLR research spanning the past 25 years, offering insights into the evolution of CSLR systems. A critical analysis of 126 studies is presented and organized into a taxonomy comprising seven critical dimensions: sign language, data acquisition, input modality, sign language cues, recognition techniques, utilized datasets, and overall performance. Additionally, the article investigates the classification of deep-learning CSLR models, categorizing them by spatial, temporal, and alignment methods, while identifying their advantages and limitations. The article also explores various research aspects, including CSLR challenges, the significance of non-manual features in CSLR systems, and gaps in the existing literature. This literature taxonomy serves as a resource aiding researchers in developing and positioning novel CSLR techniques. The study emphasizes the efficacy of multi-modal deep learning systems in capturing diverse sign language cues. However, the examination of existing research uncovers numerous limitations, calling for continued research and innovation within the CSLR domain. The findings not only contribute to the broader understanding of sign language recognition but also lay the foundations for future research initiatives aimed at addressing the persistent challenges within this emerging field.
Sign Language Recognition: A Deep Survey. Rastgoo, Razieh; Kiani, Kourosh; Escalera, Sergio. Expert Systems with Applications, Volume 164, February 2021.
Sign language, as a distinct form of communication, is important to large groups of people in society. Each sign language contains many different signs, with variability in hand shape, motion profile, and the position of the hand, face, and body parts contributing to each sign. Visual sign language recognition is therefore a complex research area in computer vision. Many models have been proposed by different researchers, with significant improvements from deep learning approaches in recent years. In this survey, we review the vision-based models of sign language recognition using deep learning approaches proposed in the last five years. While the overall trend of the proposed models indicates significant improvement in recognition accuracy, some challenges remain to be solved. We present a taxonomy to categorize the proposed models for isolated and continuous sign language recognition, discussing applications, datasets, hybrid models, complexity, and future lines of research in the field.
•We perform a comprehensive review of recent works on sign language recognition.
•We define a taxonomy to group existing works and discuss their pros and cons.
•We discuss features, modalities, evaluation metrics, applications, and datasets.
•Different challenges and future lines of research in the field are presented.
The performance of existing sign language recognition approaches is typically limited by the scale of training data. To address this issue, we propose a mutual enhancement network (MEN) for joint sign language recognition and education. First, a sign language recognition system built upon a spatial-temporal network is proposed to recognize the semantic category of a given sign language video. In addition, a sign language education system is developed to detect learners' failure modes and guide them to sign correctly. Our theoretical contribution lies in formulating the two systems as an expectation-maximization (EM) framework in which they progressively boost each other. The recognition system becomes more robust and accurate with the additional training data collected by the education system, while the education system guides learners to sign more precisely, benefiting from the hand shape analysis module of the recognition system. Experimental results on three large-scale sign language recognition datasets validate the superiority of the proposed framework.
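As a toy illustration of the alternating EM-style loop described above, the sketch below lets a recognizer and an education system take turns improving each other; every class, method, and the accuracy proxy is a hypothetical placeholder, not the authors' interface.

```python
# A toy, runnable sketch of mutual enhancement: the education system
# collects corrected data, and the recognizer is refit on the larger set.
import random

class Recognizer:
    def __init__(self):
        self.data = []
    def fit(self, data):
        self.data = list(data)            # "M-step": retrain on the enlarged set
    def accuracy(self):
        return min(0.5 + 0.05 * len(self.data), 0.99)   # toy proxy metric

class EducationSystem:
    def collect(self, recognizer, n=10):
        # "E-step": guide learners, detect failure modes, return corrected samples.
        return [random.random() for _ in range(n)]

rec, edu = Recognizer(), EducationSystem()
for round_ in range(5):                   # the two systems boost each other
    rec.fit(rec.data + edu.collect(rec))
    print(f"round {round_}: proxy accuracy {rec.accuracy():.2f}")
```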