The intelligent analysis of video data is currently in wide demand because video is a major source of sensory data in our lives. Text is a prominent and direct source of information in video, while the recent surveys of text detection and recognition in imagery focus mainly on text extraction from scene images. This paper presents a comprehensive survey of text detection, tracking, and recognition in video with three major contributions. First, a generic framework is proposed for video text extraction that uniformly describes detection, tracking, recognition, and their relations and interactions. Second, within this framework, a variety of methods, systems, and evaluation protocols for video text extraction are summarized, compared, and analyzed. Existing text tracking techniques and tracking-based detection and recognition techniques are specifically highlighted. Third, related applications, prominent challenges, and future directions for video text extraction (especially from scene videos and web videos) are thoroughly discussed.
We report an ongoing study on the statistical characteristics of texts written in different genres. At this stage, we compared Lithuanian and English texts of different genres. We used 16 indices that describe the frequency structure of a text and measure vocabulary richness. The initial study showed significant differences in the indices calculated for genre pairs of the same language. Analysis of the indices showed that the correlation between the various indices is high.
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e. words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. Taking into account the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed supervised method, tf.rf, performs consistently better than the other term weighting methods, while the other supervised methods, based on information theory or statistical metrics, perform the worst in all experiments. On the other hand, the popular tf.idf method does not perform uniformly well across the different data sets.
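The contrast between an unsupervised and a supervised term weight can be illustrated with a minimal sketch. The abstract does not state the formulas, so both are assumptions here: tf.idf is taken in its common form tf * log(N / df), and the relevance frequency factor is assumed to be rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term. The function and parameter names are hypothetical.

```python
import math


def tf_idf(tf, n_docs, doc_freq):
    """Unsupervised weight (assumed common form): tf * log(N / df)."""
    return tf * math.log(n_docs / doc_freq)


def tf_rf(tf, pos_with_term, neg_with_term):
    """Supervised weight (assumed form): tf * log2(2 + a / max(1, c)).

    pos_with_term (a): positive-category documents containing the term.
    neg_with_term (c): negative-category documents containing the term.
    max(1, c) avoids division by zero when no negative document has the term.
    """
    return tf * math.log2(2 + pos_with_term / max(1, neg_with_term))


if __name__ == "__main__":
    # A term found in every document carries no idf weight.
    print(tf_idf(tf=3, n_docs=100, doc_freq=100))
    # A term concentrated in the positive category gets a larger rf factor.
    print(tf_rf(tf=2, pos_with_term=6, neg_with_term=1))
```

Under these assumptions, the rf factor depends only on how the term is split between the positive and negative categories, whereas idf ignores category labels entirely.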
Research in computational textual aesthetics has shown that there are textual correlates of preference in prose texts. The present study investigates whether textual correlates of preference vary across different time periods (contemporary texts versus texts from the 19th and early 20th centuries). Preference is operationalized in different ways for the two periods, in terms of canonization for the earlier texts, and through sales figures for the contemporary texts. As potential textual correlates of preference, we measure degrees of (un)predictability in the distributions of two types of low-level observables, parts of speech and sentence length. Specifically, we calculate two entropy measures, Shannon Entropy as a global measure of unpredictability, and Approximate Entropy as a local measure of surprise (unpredictability in a specific context). Preferred texts from both periods (contemporary bestsellers and canonical earlier texts) are characterized by higher degrees of unpredictability. However, unlike canonicity in the earlier texts, sales figures in contemporary texts are reflected in global (text-level) distributions only (as measured with Shannon Entropy), while surprise in local distributions (as measured with Approximate Entropy) does not have an additional discriminating effect. Our findings thus suggest that there are both time-invariant correlates of preference, and period-specific correlates.
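As a rough illustration of the global measure mentioned above, the following sketch computes Shannon entropy over a sequence of observables such as part-of-speech tags. It covers only the text-level measure, not Approximate Entropy; the tag labels and the function name are hypothetical, and the study's actual preprocessing is not described in the abstract.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    # Sum -p * log2(p) over the relative frequency p of each distinct symbol.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical part-of-speech sequences for two short texts.
    uniform_tags = ["NOUN", "VERB", "NOUN", "VERB"]
    skewed_tags = ["NOUN", "NOUN", "NOUN", "VERB"]
    print(shannon_entropy(uniform_tags))
    print(shannon_entropy(skewed_tags))
```

A more even distribution of tags yields a higher value, which is the sense in which the measure captures global unpredictability.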
Today it is quite common for people to exchange hundreds of comments in online conversations (e.g., blogs). Often, it can be very difficult to analyze and gain insights from such long conversations. To address this problem, we present a visual text analytic system that tightly integrates interactive visualization with novel text mining and summarization techniques to fulfill the information needs of users exploring conversations. First, we perform a user requirement analysis for the domain of blog conversations to derive a set of design principles. Following these principles, we present an interface that visualizes a combination of various metadata and textual analysis results, helping the user interactively explore the blog conversations. We conclude with an informal user evaluation, which provides anecdotal evidence about the effectiveness of our system and directions for further design.
(ProQuest: ... denotes formulae and/or non-USASCII text omitted; see image). In this paper, we look for ... = 1, AdS_4 sourceless vacua in type IIB. While several examples exist in type IIA, only one example of such vacua exists in type IIB. Using the framework of generalized geometry, we devise a semi-algorithmic method to look for sourceless vacua. We present this method, which can easily be generalized to more complex cases, and give two new vacua in type IIB.
The final volume in the series synthesizes the research conducted by the Heidelberg Collaborative Research Center 933. The volume is systematized into six topic areas (reflecting on writing, layout and text/image, memory and the archive, material transformation, sanctification, and rule and administration), in which the CRC scholars condense the knowledge gained from twelve years of interdisciplinary work into 35 theses on a theory of material text cultures.
The impact of preprocessing on text classification. Uysal, Alper Kursat; Gunal, Serkan. Information Processing & Management, January 2014, Volume 50, Issue 1. Journal Article. Peer reviewed.
• The impact of preprocessing on text classification in terms of various aspects is extensively examined.
• Experiments are conducted on two different domains and in two different languages.
• Choosing appropriate preprocessing tasks may improve classification accuracy significantly.
Preprocessing is one of the key components in a typical text classification framework. This paper aims to extensively examine the impact of preprocessing on text classification in terms of various aspects such as classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied.
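Evaluating "all possible combinations" of preprocessing tasks amounts to enumerating every on/off setting of each task. The sketch below shows one way to generate such a grid. The abstract does not name the tasks, so the three task names used here are hypothetical placeholders, and the function name is an assumption for illustration.

```python
from itertools import product

# Hypothetical task names; the abstract does not list the tasks studied.
TASKS = ("remove_stopwords", "lowercase", "stem")


def preprocessing_configs(tasks=TASKS):
    """Return every on/off combination of the given tasks as a list of dicts."""
    return [
        dict(zip(tasks, flags))
        for flags in product((False, True), repeat=len(tasks))
    ]


if __name__ == "__main__":
    # Each configuration would be applied before training and scoring a classifier.
    for config in preprocessing_configs():
        enabled = [name for name, on in config.items() if on]
        print(enabled or ["(none)"])
```

With n binary tasks the grid has 2^n entries, so the all-enabled and all-disabled settings are just two points among the combinations compared.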
Irony is a pervasive aspect of many online texts, one made all the more difficult by the absence of face-to-face contact and vocal intonation. As our media become increasingly social, the problem of irony detection will become even more pressing. We describe here a set of textual features for recognizing irony at a linguistic level, especially in short texts created via social media such as Twitter postings or "tweets". Our experiments concern four freely available data sets that were retrieved from Twitter using content words (e.g. "Toyota") and user-generated tags (e.g. "#irony"). We construct a new model of irony detection that is assessed along two dimensions: representativeness and relevance. Initial results are largely positive, and provide valuable insights into the figurative issues facing tasks such as sentiment analysis, assessment of online reputations, or decision making.
In this paper, we are concerned with the problem of automatic scene text recognition, which involves localizing and reading characters in natural images. We investigate this problem from the perspective of representation and propose a novel multi-scale representation, which leads to accurate, robust character identification and recognition. This representation consists of a set of mid-level primitives, termed strokelets, which capture the underlying substructures of characters at different granularities. Strokelets possess four distinctive advantages: 1) usability: automatically learned from character-level annotations; 2) robustness: insensitive to interference factors; 3) generality: applicable to various languages; and 4) expressivity: effective at describing characters. Extensive experiments on standard benchmarks verify the advantages of strokelets and demonstrate the effectiveness of the text recognition algorithm built upon them. Moreover, we show how to incorporate strokelets to improve the performance of scene text detection.