In this paper, we propose a novel script-independent approach for word spotting in printed and handwritten multi-script documents. Since each writing type and script needs to be processed by a specific spotting engine, the proposed system proceeds in two stages. First, a preliminary script identification stage recognizes, in a single step, both the writing type and the script of the input document image. Second, a specific word spotting method is used to spot query words in a large collection of documents. The proposed spotting system is based on a hybrid architecture that combines a deep bidirectional long short-term memory (BLSTM) neural network with a hidden Markov model (HMM). It takes advantage of the DNN’s strong representation learning power and the HMM’s sequential modeling ability. The global system has been evaluated on a mixed corpus of public databases such as KHATT and PKHATT for Arabic script and RIMES for Latin script. The experimental results on script identification and keyword spotting confirm the effectiveness of the proposed approach.
► We present a study on font-family and font-size recognition in the framework of an a priori approach. ► The font and size systems are based on GMMs and applied to Arabic word images at ultra-low resolution. ► We show the benefit of font recognition by comparing two HMM-based word recognition systems.
In this paper, we propose a new font and size identification method for ultra-low resolution Arabic word images using a stochastic approach. The literature has shown that Arabic text recognition systems struggle with multi-font and multi-size word images. This is due to the variability induced by some font families, in addition to the inherent difficulties of Arabic writing, including cursive representation, overlaps and ligatures. This research work proposes an efficient stochastic approach to tackle the problem of font and size recognition. Our method processes a word image with a fixed-length, overlapping sliding window. Each window is represented by 102 features whose distribution is captured by Gaussian Mixture Models (GMMs). We present three systems: (1) a font recognition system, (2) a size recognition system and (3) a font and size recognition system. We demonstrate the importance of font identification before recognizing the word images with two multi-font Arabic OCRs (cascading and global). The cascading system is about 23% better than the global multi-font system in terms of word recognition rate on the Arabic Printed Text Image (APTI) database, which is freely available to the scientific community.
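The sliding-window and GMM scoring steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window width, the shift, the function names and the one-dimensional toy features are all assumptions made for illustration (the abstract reports 102 features per window and does not give window parameters). Each class (for example a font) is modeled by a diagonal-covariance GMM, and a word is assigned to the class whose GMM gives the highest total log-likelihood over its windows.

```python
import math

def sliding_windows(width, win=8, shift=4):
    """Column ranges (start, end) of fixed-length, overlapping windows.
    win and shift are hypothetical values, not taken from the paper."""
    if win <= 0 or shift <= 0:
        raise ValueError("win and shift must be positive")
    return [(s, s + win) for s in range(0, width - win + 1, shift)]

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.
    gmm is a list of (weight, means, variances) tuples, one per component."""
    logs = []
    for weight, means, variances in gmm:
        lp = math.log(weight)
        for xi, m, v in zip(x, means, variances):
            lp += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(lp)
    top = max(logs)  # log-sum-exp over components, for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def classify(frames, models):
    """Pick the class whose GMM maximizes the summed frame log-likelihood."""
    scores = {name: sum(gmm_log_likelihood(f, gmm) for f in frames)
              for name, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy usage: two single-component "font" models over 1-D features.
toy_models = {
    "font_A": [(1.0, [0.0], [1.0])],
    "font_B": [(1.0, [5.0], [1.0])],
}
windows = sliding_windows(16, win=8, shift=4)
predicted = classify([[0.1], [-0.2], [0.3]], toy_models)
```

In a real system the frames would be the feature vectors extracted from each window, and the GMM parameters would be estimated from training data rather than written by hand.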
In this paper, an end-to-end multi-task deep neural network was proposed for simultaneous script identification and Keyword Spotting (KWS) in multi-lingual handwritten and printed document images. We introduced a unified approach which addresses both challenges cohesively, by designing a novel CNN-BLSTM architecture. The script identification stage involves local and global feature extraction to allow the network to cover more relevant information. Unlike traditional feature fusion approaches, which build a linear feature concatenation, we employed compact bilinear pooling to capture pairwise correlations between these features. The script identification result is then injected into the KWS module to eliminate characters of irrelevant scripts and perform the decoding stage in a single-script mode. All the network parameters were trained in an end-to-end fashion using multi-task learning that jointly minimizes the NLL loss for script identification and the CTC loss for KWS. Our approach was evaluated on a variety of public datasets of different languages and writing types. Experiments demonstrated the efficacy of our deep multi-task representation learning compared to state-of-the-art systems for both keyword spotting and script identification.
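The joint objective mentioned in this abstract (NLL for script identification plus CTC for KWS) can be made concrete with a small pure-Python sketch. It is an illustration only: the abstract does not say how the two losses are weighted, so the weights w_script and w_kws, the function names and the toy inputs are assumptions. The CTC term is computed with the standard forward (alpha) recursion over the blank-extended label sequence.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    top = max(xs)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(x - top) for x in xs))

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class (script label)."""
    return -log_probs[target]

def ctc_loss(log_probs, labels, blank=0):
    """CTC loss via the forward recursion.
    log_probs: T x C per-frame log-probabilities; labels: ids without blanks."""
    ext = [blank]
    for label in labels:
        ext += [label, blank]          # blank-extended target sequence
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new
    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total

def joint_loss(script_log_probs, script_target, frame_log_probs, labels,
               w_script=1.0, w_kws=1.0):
    """Weighted sum of both task losses (the weights are hypothetical)."""
    return (w_script * nll_loss(script_log_probs, script_target)
            + w_kws * ctc_loss(frame_log_probs, labels))

# Toy usage: 2 frames, 2 classes (blank=0, one character=1), uniform outputs.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
loss = joint_loss([math.log(0.9), math.log(0.1)], 0, uniform, [1])
```

With two uniform frames, three of the four alignments collapse to the single-character target, so the CTC term equals -log(0.75); the joint loss adds -log(0.9) for the script label.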
• A novel two-step OOV word detection and recovery method is proposed. • The proposed method is generic and independent of the recognition engine. • The proposed method uses various sub-lexical models to improve the detection step. • The recovery process relies on dynamic lexicons built from large text corpora. • The proposed method significantly improves the recognition results.
Today's Arabic handwriting recognition systems are able to recognize arbitrary words over a large but finite vocabulary. Systems operating with a fixed vocabulary are bound to encounter so-called out-of-vocabulary (OOV) words. The aim of this research is to propose a two-step approach that tackles the problem of OOV words in Arabic handwriting. In the first step, we exploit different types of sub-word units to detect potential OOVs. In the recovery step, a dynamic dictionary is built to extend the initial static word lexicon in order to cope with the detected OOVs. The recovery includes a selection step in which the best word candidates extracted from the external resource are kept. Experiments were conducted on the public benchmark KHATT and AHTID/MW databases. The results show that sub-word modeling provides cues that improve detection, and that the use of a dynamic dictionary significantly improves recognition performance compared to one-step approaches based on a large static dictionary or on a combination of different sub-word units. We achieve state-of-the-art results on the KHATT dataset.
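The recovery step described in this abstract (extending a static lexicon with the best candidates from an external resource) can be sketched as follows. The abstract does not state the selection criterion, so ranking candidates by Levenshtein edit distance to the recognizer's hypothesis is an assumption, as are the function names, the parameter k and the transliterated toy words.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def build_dynamic_lexicon(hypothesis, static_lexicon, external_words, k=3):
    """Keep the k external words closest to the hypothesis for a detected OOV.
    Words already in the static lexicon are skipped; ties break alphabetically."""
    candidates = set(external_words) - set(static_lexicon)
    ranked = sorted(candidates, key=lambda w: (edit_distance(hypothesis, w), w))
    return ranked[:k]

# Toy usage with Latin transliterations standing in for Arabic words.
static_lexicon = {"kitab"}
external_words = ["kitab", "kataba", "maktab", "katib"]
dynamic = build_dynamic_lexicon("katab", static_lexicon, external_words, k=2)
extended_lexicon = static_lexicon | set(dynamic)
```

The extended lexicon would then be used for a second decoding pass restricted to the regions flagged as OOV; that pass is outside the scope of this sketch.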
Writer identification/recognition from off-line Arabic handwriting at the sentence level is still a tough task. In this paper, we start by investigating the performance of textural extractors for writer identification across divergent writing types. Taking into account their strengths and limits, we propose a new method that keeps the main features of the writing and, at the pre-processing phase, handles the sensitivity of systems to the amount of available text samples. We also analyze the influence of handwriting type on the efficiency of the writer identification process. In this regard, we perform a comparative study between handcrafted and automated features under multiple classifiers (RF, XGB, KNN and SVM). We find that writers with clear, legible handwriting share fewer similarities, which yields higher identification rates. In contrast, poor handwriting presents more similarities between writers, which explains the reduction in the identification rate.
The task of assessing, grouping and arranging data into meaningful groups or clusters based on similarity/dissimilarity measures is known as cluster analysis. Numerous clustering algorithms exist, both hierarchical and partitional. In the last decade, clustering using bio-inspired algorithms has received more attention, specifically the ant clustering algorithms. However, they require a lot of processing power due to the massive amount of data generated in recent years. As a consequence, determining the computational cost of these algorithms is one of the most interesting tasks in the quest for optimal clustering solutions in a real-time system. This study presents a research guide for researchers working in this field. A series of experiments is conducted to investigate the computational complexity of the most promising algorithms applied to the student grouping problem. The results indicate two challenges that arise when using ant clustering algorithms: the difficulty of adjusting parameters and extended computation time.
Standard databases play an essential role in evaluating and comparing results obtained by different groups of researchers. In this paper, an Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) is introduced. This database can be used for research on the recognition of Arabic handwritten text with open vocabulary, word segmentation and writer identification. The AHTID/MW contains 3710 text lines and 22896 words written by 53 native writers of Arabic. In addition, ground truth annotation is provided for each text image. The database is freely available to researchers worldwide.
This paper presents a comparative study of word spotting techniques that follow a holistic approach. The work evaluates word image segmentation, characterization and matching in order to identify the most reliable techniques. All experiments are carried out on the same printed and handwritten Arabic dataset. Our aim is to build an effective information retrieval system.
Page segmentation and classification are very important steps in a document layout analysis system, before the document is passed to an OCR system or to any other subsequent processing step. In this paper, we propose an accurate and suitably designed system for complex document segmentation. This system is based on the steerable pyramid transform. The features extracted from the pyramid sub-bands serve to locate and classify regions into text (either machine-printed or handwritten) and non-text (images, graphics, drawings or paintings) in noisy, deformed, multilingual, multi-script document images. These documents contain tabular structures, logos, stamps, handwritten script blocks, photographs, etc. We present encouraging results obtained on a data set of 1,000 official complex document images. We compared our results with those of existing state-of-the-art methods. This comparison shows that the proposed method performs consistently well on large sets of complex document images.