We propose an exact framework for online learning with a family of indefinite (not positive) kernels. As we study the case of nonpositive kernels, we first show how to extend kernel principal component analysis (KPCA) from a reproducing kernel Hilbert space to Krein space. We then formulate an incremental KPCA in Krein space that does not require the calculation of preimages and therefore is both efficient and exact. Our approach has been motivated by the application of visual tracking for which we wish to employ a robust gradient-based kernel. We use the proposed nonlinear appearance model learned online via KPCA in Krein space for visual tracking in many popular and difficult tracking scenarios. We also show applications of our kernel framework for the problem of face recognition.
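A minimal batch sketch of the core idea (illustrative only; the paper's contribution is an incremental variant, and the function name and toy Gram matrix below are assumptions): with an indefinite kernel, the centered Gram matrix is eigendecomposed as usual, but eigenvalues may be negative, so components are ranked by eigenvalue magnitude and normalized by the square root of the absolute eigenvalue.

```python
import numpy as np

def krein_kpca(K, n_components):
    """Batch KPCA for a possibly indefinite (Krein-space) Gram matrix K.
    Negative eigenvalues are kept: components are ranked by |eigenvalue|
    and normalized by sqrt(|eigenvalue|) instead of sqrt(eigenvalue)."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc = J @ K @ J                             # center in feature space
    lam, U = np.linalg.eigh(Kc)                # symmetric eigendecomposition
    keep = np.argsort(-np.abs(lam))[:n_components]
    lam, U = lam[keep], U[:, keep]
    alphas = U / np.sqrt(np.abs(lam))          # expansion coefficients
    return lam, alphas, Kc @ alphas            # projections of training points

# toy indefinite Gram matrix: one positive and one negative eigendirection
v1 = np.array([1.0, 1.0, 0.0, 0.0])
v2 = np.array([0.0, 0.0, 1.0, -1.0])
K = np.outer(v1, v1) - 0.5 * np.outer(v2, v2)
lam, alphas, Z = krein_kpca(K, 2)              # lam contains a negative eigenvalue
```

A positive-definite kernel makes this collapse to ordinary KPCA, since all retained eigenvalues are then positive.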
Scene labeling with LSTM recurrent neural networks Wonmin Byeon; Breuel, Thomas M.; Raue, Federico ...
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
06/2015
Conference Proceeding
This paper addresses the problem of pixel-level segmentation and classification of scene images with an entirely learning-based approach using Long Short-Term Memory (LSTM) recurrent neural networks, which are commonly used for sequence classification. We investigate two-dimensional (2D) LSTM networks for natural scene images, taking into account the complex spatial dependencies of labels. Prior methods have generally required separate classification and image segmentation stages and/or pre- and post-processing. In our approach, classification, segmentation, and context integration are all carried out by 2D LSTM networks, allowing texture and spatial model parameters to be learned within a single model. The networks efficiently capture local and global contextual information over raw RGB values and adapt well to complex scene images. Our approach, which has a much lower computational complexity than prior methods, achieved state-of-the-art performance on the Stanford Background and SIFT Flow datasets. In fact, when no pre- or post-processing is applied, LSTM networks outperform other state-of-the-art approaches. Even on a single-core Central Processing Unit (CPU), the running time of our approach is comparable to or better than that of the compared state-of-the-art approaches, which use a Graphics Processing Unit (GPU). Finally, our networks' ability to visualize feature maps from each layer supports the hypothesis that LSTM networks are well suited for image processing tasks overall.
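As a rough illustration of the 2D recurrence the abstract describes (a gate-free toy, not the paper's full 2D LSTM, which adds LSTM gating and combines four scan directions), the sketch below propagates context from the left and upper neighbour of every pixel and then classifies each pixel from its hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

def scan_2d(image, W_in, W_left, W_up):
    """One top-left-to-bottom-right scan of a toy 2D recurrent layer:
    each pixel's hidden state depends on its own input and on the hidden
    states of its left and upper neighbours, so context propagates
    across the whole image in two dimensions."""
    H, W, _ = image.shape
    D = W_left.shape[0]
    h = np.zeros((H, W, D))
    for y in range(H):
        for x in range(W):
            left = h[y, x - 1] if x > 0 else np.zeros(D)
            up = h[y - 1, x] if y > 0 else np.zeros(D)
            h[y, x] = np.tanh(W_in @ image[y, x] + W_left @ left + W_up @ up)
    return h

# toy 5x5 RGB image, 4 hidden units, 3 classes -> per-pixel label map
img = rng.random((5, 5, 3))
W_in = rng.normal(size=(4, 3))
W_left, W_up = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
W_out = rng.normal(size=(3, 4))
h = scan_2d(img, W_in, W_left, W_up)
labels = (h @ W_out.T).argmax(axis=-1)   # per-pixel class, shape (5, 5)
```

The per-pixel classification at the end is what lets segmentation and labeling share one set of learned parameters, as the abstract emphasizes.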
•Efficient 2D LSTM attribute learning without pre-/post-processing of the data.
•2D LSTM networks with only a small number of parameters.
•Raw noisy web images used for training without manual annotation.
•Automatic web-image analysis (unknown number of attribute classes and scene types).
•Further evaluations on a public attribute dataset (SceneAtt).
This paper describes an approach to scene analysis based on supervised training of 2D Long Short-Term Memory recurrent neural networks (LSTM networks). Unlike previous methods, our approach requires no manual construction of feature hierarchies or incorporation of other prior knowledge. Rather, like deep learning approaches using convolutional networks, our recognition networks are trained directly on raw pixel values. However, in contrast to convolutional neural networks, our approach uses 2D LSTM networks at all levels. Our networks yield per-pixel mid-level classifications of input images; since training data for such applications is not available in large numbers, we describe an approach to generating artificial training data and then evaluate the trained networks on real-world images. Our approach performed significantly better than other methods, including Convolutional Neural Networks (ConvNets), while using two orders of magnitude fewer parameters. We further report experiments on a recently published outdoor scene attribute dataset for a fair comparison of scene attribute learning, where our approach yields a significant performance improvement (ca. 21%). Finally, our approach is successfully applied to a real-world application, automatic web-image tagging.
In this paper, we extend a symbolic association framework to handle missing elements in multimodal sequences. The general scope of the work is the symbolic association of object-word mappings as it happens in language development in infants. In other words, two different representations of the same abstract concept can be associated in both directions. This scenario has long been of interest in Artificial Intelligence, Psychology, and Neuroscience. In this work, we extend a recent approach for multimodal sequences (visual and audio) to also cope with missing elements in one or both modalities. Our method uses two parallel Long Short-Term Memories (LSTMs) with a learning rule based on the EM algorithm, and aligns the outputs of both LSTMs via Dynamic Time Warping (DTW). We propose an extra combination step using the max operation to exploit the common elements between both sequences. The motivation is that this combination acts as a condition selector that chooses the best representation from the two LSTMs. We evaluated the proposed extension in the following scenarios: missing elements in one modality (visual or audio) and missing elements in both modalities (visual and audio). Our extension achieves better results than the original model, and results similar to those of individual LSTMs trained on each modality.
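A compact sketch of the alignment-and-combination step described above (helper names are hypothetical, and the real model aligns frame-wise posteriors produced by the two modality LSTMs): DTW finds the optimal warping path between the two output sequences, and an element-wise max over aligned frames lets the more confident modality dominate wherever the other is weak or missing.

```python
import numpy as np

def dtw_path(A, B):
    """Dynamic Time Warping between two sequences of class-posterior
    vectors (one row per frame). Returns the optimal alignment path
    as a list of (index-in-A, index-in-B) pairs."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                      # backtrack the cheapest path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return path[::-1]

def combine_max(A, B):
    """Align the two modality outputs with DTW, then take the element-wise
    max of the aligned posteriors (the 'condition selector')."""
    return np.array([np.maximum(A[i], B[j]) for i, j in dtw_path(A, B)])
```

For identical inputs the path is the diagonal and the combination returns the sequence unchanged; for sequences of different lengths the path stretches the shorter one, which is what makes the scheme tolerant to missing elements.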
Texture Classification Using 2D LSTM Networks Wonmin Byeon; Liwicki, Marcus; Breuel, Thomas M.
2014 22nd International Conference on Pattern Recognition,
2014-Aug.
Conference Proceeding
In this paper, we investigate the ability of the Long Short-Term Memory (LSTM) recurrent neural network architecture to perform texture classification on images. Existing approaches to texture classification rely on manually designed preprocessing steps or selected feature extractors. Since LSTM networks are able to bridge long time lags, we propose applying them directly to the image, circumventing any handcrafted preprocessing. We investigate different approaches with several input and output representations. In our experiments on a number of widely used texture benchmarking tasks (KTH-TIPS, OuTex, VisTexL, VisTexP, and Newmarket), we show that the performance is comparable to, or better than, existing state-of-the-art methods for texture classification.
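One of the simplest input/output representations the abstract alludes to can be sketched as follows (a minimal assumption-laden illustration, not the paper's exact architecture): scan the raw pixel values row by row with a 1D LSTM, pool the hidden states over the image, and classify with a linear layer, so no handcrafted texture features are needed.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """Single LSTM step; W, U, b are stacked for the 4 gates (i, f, o, g)."""
    z = W @ x + U @ h + b
    d = len(h)
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * d:(k + 1) * d])) for k in range(3))
    g = np.tanh(z[3 * d:])
    c = f * c + i * g                 # gated cell-state update
    h = o * np.tanh(c)                # gated hidden output
    return h, c

def classify_texture(img, W, U, b, W_out):
    """Run the LSTM over each pixel row, mean-pool the final hidden
    states, and classify the pooled representation linearly."""
    d = U.shape[1]
    hs = []
    for row in img:                   # each row is one pixel sequence
        h, c = np.zeros(d), np.zeros(d)
        for px in row:
            h, c = lstm_step(np.atleast_1d(px), h, c, W, U, b)
        hs.append(h)
    return int(np.argmax(W_out @ np.mean(hs, axis=0)))

rng = np.random.default_rng(1)
d, n_classes = 4, 3
W = rng.normal(scale=0.5, size=(4 * d, 1))
U = rng.normal(scale=0.5, size=(4 * d, d))
b = np.zeros(4 * d)
W_out = rng.normal(size=(n_classes, d))
img = rng.random((8, 8))              # toy grayscale texture patch
cls = classify_texture(img, W, U, b, W_out)
```

Trained weights would replace the random ones here; the point is only that the classifier consumes raw pixels end to end.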
This paper presents a deep Convolutional Neural Network (CNN) based approach for document image classification. One of the main requirements of deep CNN architectures is a huge number of training samples. To overcome this problem, we adopt a deep CNN that is pretrained on a big image dataset containing millions of samples, i.e., ImageNet. The proposed work outperforms both the traditional structural similarity methods and the CNN-based approaches proposed earlier. With merely 20 images per class, the proposed approach outperforms the state-of-the-art by achieving a classification accuracy of 68.25%. The best results on the Tobacco-3482 dataset show that our proposed method outperforms the state-of-the-art method by a significant margin, achieving a median accuracy of 77.6% with 100 samples per class used for training and validation.
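The few-samples setting above rests on transfer learning: a body pretrained on ImageNet is reused, and only a small classifier head is fitted to the document images. A sketch under that assumption (random vectors stand in for the frozen CNN features; this is not the paper's exact architecture or training recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_head(features, labels, n_classes, lr=0.1, epochs=200):
    """Train only a softmax head on frozen (pretrained) features,
    i.e. multinomial logistic regression by gradient descent."""
    n, d = features.shape
    W = np.zeros((n_classes, d))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W.T
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * (p - onehot).T @ features / n   # softmax-regression gradient
    return W

# stand-in for frozen CNN features: 20 images per class, 2 classes
feats = np.vstack([rng.normal(0, 1, (20, 8)) + 2,
                   rng.normal(0, 1, (20, 8)) - 2])
labels = np.array([0] * 20 + [1] * 20)
W = train_head(feats, labels, 2)
acc = ((feats @ W.T).argmax(axis=1) == labels).mean()
```

Because the body is frozen, only `n_classes × d` parameters are fitted, which is why a handful of labeled images per class can suffice.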
We introduce DeepDIVA: an infrastructure designed to enable quick and intuitive setup of reproducible experiments with a large range of useful analysis functionality. Reproducing scientific results can be a frustrating experience, not only in document image analysis but in machine learning in general. Using DeepDIVA, a researcher can either reproduce a given experiment or share their own experiments with others. Moreover, the framework offers a large range of functions, such as boilerplate code, keeping track of experiments, hyper-parameter optimization, and visualization of data and results. To demonstrate the effectiveness of this framework, this paper presents case studies in the area of handwritten document analysis where researchers benefit from the integrated functionality. DeepDIVA is implemented in Python and uses the deep learning framework PyTorch. It is completely open source and accessible as a Web Service through DIVAServices.
This paper presents the first Pashto text image database for scientific research, and thereby the first dataset with complete handwritten and printed text-line images that covers all alphabets of the Arabic and Persian languages. A language like Pashto, written in a complex calligraphic style, still lacks a mature Optical Character Recognition (OCR) system. Although 50 million people use this language for both oral and written communication, no significant effort has been devoted to the recognition of Pashto script. A real dataset of 17,015 images containing Pashto text lines is introduced. The images were acquired by scanning hand-scribed Pashto books. Further, in this work, we evaluate the performance of deep learning based models, namely Bidirectional and Multi-Dimensional Long Short-Term Memory (BLSTM and MDLSTM) networks, on Pashto texts and provide a baseline character error rate of 9.22%.
KHATT: A Deep Learning Benchmark on Arabic Script Ahmad, Riaz; Naz, Saeeda; Afzal, M. Zeshan ...
2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR),
2017-Nov., Volume: 7
Conference Proceeding
This work presents state-of-the-art results on one of the most complex datasets, known as KHATT. The KHATT dataset contains complex patterns of Arabic handwritten text. We have achieved better performance in terms of character recognition by implementing the most successful deep learning approach, based on Long Short-Term Memory (LSTM) networks. Connectionist Temporal Classification (CTC) is used as the final layer to align the predicted labels along the most probable path. The MDLSTM scans text lines in all directions to capture fine inflections in the horizontal and vertical directions. Further, we apply pre-processing to the text lines to prune extra white regions and de-skew the lines for accurate height normalization. The deep learning and pre-processing allow us to improve results from 46.13% to 75.8%.
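The most-probable-path alignment that CTC performs at the output layer can be illustrated with best-path decoding (a simplified stand-in for full CTC, with arbitrary label indices): take the argmax label per frame, collapse repeated labels, and drop blanks, mapping frame-wise LSTM outputs to a character sequence without any explicit segmentation.

```python
def ctc_best_path(posteriors, blank=0):
    """Best-path CTC decoding: per-frame argmax, collapse repeats,
    remove blanks. `posteriors` is a list of per-frame label
    probability vectors; index `blank` is the CTC blank symbol."""
    path = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    decoded, prev = [], blank
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# frames over {blank(0), 'a'(1), 'b'(2)}: a a _ b b  ->  "ab"
frames = [[.1, .8, .1], [.2, .7, .1], [.9, .05, .05], [.1, .2, .7], [.1, .3, .6]]
print(ctc_best_path(frames))   # -> [1, 2]
```

Note that a blank between two identical labels keeps them distinct, which is how CTC represents doubled characters.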
In this paper, we present a novel methodology for multiple script identification using the sequence-learning capabilities of Long Short-Term Memory (LSTM) networks. Our method is able to identify multiple scripts at the text-line level, where two or more scripts are present in the same text line. Unlike traditional techniques, where either shape features or bounding boxes of individual characters are extracted, the LSTM-based system learns a particular script in a supervised learning framework. Moreover, the system requires neither specific features nor preprocessing steps other than text-line extraction and text-line normalization. The proposed method works at the text-line level, identifying each character as belonging to a particular script. We have developed a database consisting of English and Greek script, and our system achieved a script recognition accuracy of 98.186% on this dataset.
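Given per-character script predictions from such a line-level classifier, a small post-processing step (illustrative only, not stated in the abstract) groups them into contiguous runs, so a mixed English/Greek line is reported as a sequence of script segments:

```python
from itertools import groupby

def script_runs(char_labels):
    """Collapse per-character script predictions into contiguous
    (script, length) runs along the text line."""
    return [(script, len(list(run))) for script, run in groupby(char_labels)]

line = ["latin"] * 5 + ["greek"] * 3 + ["latin"] * 4
print(script_runs(line))   # -> [('latin', 5), ('greek', 3), ('latin', 4)]
```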