•A Generative Adversarial Network for handwritten document image binarization.
•We perform document binarization while simultaneously ensuring text readability, by integrating a handwritten text recognition component within the proposed architecture.
•The proposed model enhances different forms of documents, independently of the text language.
•We achieve state-of-the-art performance on the public H-DIBCO datasets.
Handwritten document images can be highly affected by degradation for different reasons: paper ageing, daily-life scenarios (wrinkles, dust, etc.), a bad scanning process and so on. These artifacts raise many readability issues for current Handwritten Text Recognition (HTR) algorithms and severely degrade their performance. In this paper, we propose an end-to-end architecture based on Generative Adversarial Networks (GANs) to recover degraded documents into a clean and readable form. Unlike the most well-known document binarization methods, which only try to improve the visual quality of the degraded document, the proposed architecture integrates a handwritten text recognizer that encourages the generated document image to be more readable. To the best of our knowledge, this is the first work to use text information while binarizing handwritten documents. Extensive experiments conducted on degraded Arabic and Latin handwritten documents demonstrate the usefulness of integrating the recognizer within the GAN architecture, which improves both the visual quality and the readability of the degraded document images. Moreover, after fine-tuning our pre-trained model with synthetically degraded Latin handwritten images, we outperform the state of the art on the H-DIBCO challenges.
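The key idea above is that the generator is trained not only against the discriminator but also against a recognizer's loss, so that readability is optimized jointly with visual quality. A minimal sketch of such a combined objective, under the assumption that the recognizer contributes a CTC-style loss weighted by a hypothetical factor `lam` (the function and weighting scheme here are illustrative, not the authors' exact formulation):

```python
import numpy as np

def generator_loss(d_fake_scores, recognizer_ctc_loss, lam=1.0):
    """Combined generator objective for a readability-aware binarization GAN.

    d_fake_scores: discriminator outputs in (0, 1] for generated images
                   (the generator wants these pushed towards 1).
    recognizer_ctc_loss: CTC loss of the text recognizer on the generated
                         image, promoting readable output.
    lam: hypothetical trade-off weight between the two terms.
    """
    # non-saturating adversarial term: -log D(G(x))
    adv = -float(np.mean(np.log(np.asarray(d_fake_scores) + 1e-12)))
    return adv + lam * recognizer_ctc_loss
```

During training, the generator would minimize this sum while the discriminator and recognizer provide the two signals; a larger `lam` biases the output towards text that the recognizer can read.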
Arabic Handwritten Text Recognition (AHTR) based on deep learning approaches remains a challenging problem due to inevitable domain shifts, such as the variability among writers' styles, and the scarcity of labelled data. To alleviate such problems, we investigate in this paper different domain adaptation strategies for an AHTR system. The main idea is to exploit the knowledge of a handwriting source domain and to transfer it to another domain where only few labelled data are available. Different writer-dependent and writer-independent domain adaptation strategies are explored using a convolutional neural network (CNN) - Bidirectional Long Short-Term Memory (BLSTM) - Connectionist Temporal Classification (CTC) architecture. To assess the benefit of the proposed techniques on the target domain, we have conducted extensive experiments using three Arabic handwritten text datasets, namely MADCAT, AHTID/MW and IFN/ENIT. Concurrently, the Arabic handwritten text dataset KHATT was used as the source domain. The obtained results prove the effectiveness of the proposed strategies, especially when considering the writer's information during the supervised adaptation process.
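A common way to realize such transfer with few labelled target samples is to decide, per layer, which parameters are updated during adaptation. The sketch below is a hypothetical helper (layer names and strategy labels are assumptions, not the paper's terminology) that marks which parts of a CNN-BLSTM-CTC model would be fine-tuned:

```python
def adaptation_plan(layer_names, strategy):
    """Return {layer_name: trainable?} for a given adaptation strategy.

    "full"            -> update every layer on the target-domain data.
    "frozen-features" -> keep the CNN feature extractor fixed and adapt
                         only the BLSTM and CTC output layers, which is a
                         typical choice when target labels are scarce.
    """
    if strategy == "full":
        return {name: True for name in layer_names}
    if strategy == "frozen-features":
        return {name: not name.startswith("conv") for name in layer_names}
    raise ValueError(f"unknown strategy: {strategy}")
```

Writer-dependent adaptation would apply such a plan separately per target writer, while writer-independent adaptation fine-tunes once on the pooled target data.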
•A novel two-step OOV word detection and recovery method is proposed.
•The proposed method is generic and independent of the recognition engine.
•The proposed method uses various sub-lexical modelings to improve the detection step.
•The recovery process relies on dynamic lexicons built from large text corpora.
•The proposed method significantly improves the recognition results.
Today's Arabic handwriting recognition systems are able to recognize arbitrary words over a large but finite vocabulary. Systems operating with a fixed vocabulary are bound to encounter so-called out-of-vocabulary (OOV) words. The aim of this research is to propose a two-step approach that tackles the problem of OOV words in Arabic handwriting. In the first step, we exploit different types of sub-word units to detect potential OOVs. In the recovery stage, a dynamic dictionary is built to extend the initial static word lexicon in order to cope with the detected OOVs. The recovery includes a selection step in which the best word candidates extracted from the external resource are kept. Experiments were conducted on the public benchmarking KHATT and AHTID/MW databases. The obtained results revealed that sub-word modeling can give cues for improving the detection, and that the use of a dynamic dictionary significantly improves the recognition performance compared to one-step approaches based on a large static dictionary or on the combination of different sub-word units. We achieve state-of-the-art results on the KHATT dataset.
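The selection step of the recovery stage can be pictured as ranking candidate words from the external corpus against the sub-word hypothesis produced by the recognizer. A minimal sketch, assuming a plain edit-distance ranking (the actual scoring used in the paper may differ):

```python
def edit_distance(a, b):
    # classic Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def recover_oov(subword_hypothesis, corpus_words, top_k=3):
    """Keep the top_k corpus words closest to the concatenated
    sub-word hypothesis; these extend the static lexicon dynamically."""
    ranked = sorted(corpus_words,
                    key=lambda w: edit_distance(subword_hypothesis, w))
    return ranked[:top_k]
```

The retained candidates would then be added to the decoding lexicon so the recognizer can output the previously unseen word.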
Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties. In this age of digitization, it is important to denoise them for proper usage. To address this challenge, we present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images, in an end-to-end fashion. The encoder operates directly on the pixel patches with their positional information, without the use of any convolutional layers, while the decoder reconstructs a clean image from the encoded patches. The conducted experiments show the superiority of the proposed model over the state-of-the-art methods on several DIBCO benchmarks. Code and models will be publicly available at: https://github.com/dali92002/DocEnTR.
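The pre-processing implied by "operates directly on the pixel patches" is the standard ViT tokenization: the image is cut into non-overlapping patches that are flattened into vectors, and the decoder's patch predictions are reassembled into an image. A small numpy sketch of that round trip (patch size and shapes are illustrative, not the model's actual configuration):

```python
import numpy as np

def patchify(img, p):
    """Split an H x W image into non-overlapping p x p patches,
    flattened to vectors in row-major order (H and W divisible by p)."""
    h, w = img.shape
    return (img.reshape(h // p, p, w // p, p)
               .transpose(0, 2, 1, 3)
               .reshape(-1, p * p))

def unpatchify(patches, h, w, p):
    """Inverse of patchify: reassemble the image predicted patch-by-patch."""
    return (patches.reshape(h // p, w // p, p, p)
                   .transpose(0, 2, 1, 3)
                   .reshape(h, w))
```

In the architecture described above, the flattened patches (plus positional embeddings) feed the transformer encoder, and `unpatchify` corresponds to the final reconstruction of the clean image from the decoder's outputs.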
Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections. Nowadays, the most efficient KWS methods rely on machine learning techniques that require a large amount of annotated training data. However, in the case of historical manuscripts, there is a lack of annotated corpora for training. To handle the data scarcity issue, we investigate the merits of self-supervised learning to extract useful representations of the input data without relying on human annotations, and then use these representations in the downstream task. We propose ST-KeyS, a masked auto-encoder model based on vision transformers, whose pretraining stage follows the mask-and-predict paradigm without the need for labeled data. In the fine-tuning stage, the pre-trained encoder is integrated into a siamese neural network model that is fine-tuned to improve the feature embedding of the input images. We further improve the image representation using pyramidal histogram of characters (PHOC) embeddings to create and exploit an intermediate representation of images based on text attributes. In an exhaustive experimental evaluation on three widely used benchmark datasets (Botany, Alvermann Konzilsprotokolle and George Washington), the proposed approach outperforms state-of-the-art methods trained on the same datasets.
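The PHOC embedding mentioned above maps a word string to a fixed-length binary vector: at pyramid level k the word is split into k regions, and each region records which alphabet characters occur in it. A simplified sketch (real PHOC implementations assign a character to a region by the overlap of its normalized span; the position-bucketing below is a simplification):

```python
def phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(1, 2, 3)):
    """Simplified pyramidal histogram of characters for a word string."""
    vec = []
    n = len(word)
    for k in levels:
        # one binary presence vector over the alphabet per region
        regions = [[0] * len(alphabet) for _ in range(k)]
        for i, ch in enumerate(word):
            if ch in alphabet:
                r = min(i * k // n, k - 1)   # bucket position i into region r
                regions[r][alphabet.index(ch)] = 1
        for region in regions:
            vec.extend(region)
    return vec
```

Because the same embedding can be computed from a text query and predicted from a word image, comparing the two vectors supports both query-by-string and query-by-example spotting.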
We propose in this paper an Arabic handwriting recognition system based on multiple BLSTM-CTC combination architectures. Given several feature sets, the low-level fusion consists in projecting them into a single feature space. Mid-level combination is performed using two techniques: the first consists in averaging the a-posteriori probabilities of each individual BLSTM and injecting them into the CTC decoding; the second is based on training a new BLSTM-CTC system using the sum of the a-posteriori probabilities generated by the individual systems. The high-level fusion is based on the combination of the individual decoding outputs; lattice combination and ROVER strategies were evaluated in this context. The experiments conducted on the KHATT database show that the high-level combination method significantly improves the recognition rate compared to the other fusion strategies.
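The first mid-level technique above can be sketched concretely: average the frame-wise posterior probabilities of the individual BLSTM systems, then decode the averaged distribution. The sketch below pairs that fusion with simple best-path CTC decoding (the paper's decoder is more elaborate; this is an illustration of the averaging step only):

```python
import numpy as np

def average_posteriors(systems):
    """Mid-level fusion: frame-wise average of the a-posteriori
    probability matrices (T x classes) of the individual BLSTMs."""
    return np.mean(np.stack(systems), axis=0)

def greedy_ctc_decode(posteriors, blank=0):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
    path = np.argmax(posteriors, axis=1)
    out, prev = [], blank
    for p in path:
        if p != prev and p != blank:
            out.append(int(p))
        prev = p
    return out
```

High-level fusion, by contrast, combines the systems only after each has produced its own decoding (e.g. via lattice combination or ROVER voting), which is where the reported gains were largest.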