Sampled string matching is an effective technique for reducing the search time for a pattern within a text, at the cost of a small amount of additional memory used for storing a partial index of the text. This approach has recently received some interest and has been applied to improve both online and offline string matching solutions, improving standard solutions by more than 50%. However, this improvement is currently only achievable for texts over large alphabets, and remains small (or absent) for small alphabets. In this article we extend a text-sampling approach known as Character Distance Sampling to the case of small alphabets, obtaining an improvement of up to 98% over standard solutions in online string matching. We also extend the approach to offline string matching, introducing a sampled version of the suffix array that yields searches up to 5 times faster than on the standard suffix array. Unlike previous solutions, our idea is not based on reducing the number of indexed suffixes, but on constructing the index directly on the sampled text.
•We extend the Character Distance Sampling approach to small alphabets by making use of condensed alphabets.
•We propose a way of constructing a sampled version of the suffix array to speed up offline searching.
•Our approach reduces space consumption by between 72% and 95% compared with previous solutions.
•Our approach obtains a speed-up of up to 98% in online string matching.
•Our approach obtains a speed-up of up to 5 times in offline string matching.
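The character-distance idea behind the approach above can be sketched as follows. This is a minimal illustration assuming a single pivot character and naive candidate scanning; the paper's condensed alphabets and sampled suffix array are not reproduced here.

```python
def sample(text, pivot):
    """Character Distance Sampling: record the positions of the pivot
    character and the distances between its consecutive occurrences."""
    positions = [i for i, c in enumerate(text) if c == pivot]
    distances = [positions[i] - positions[i - 1] for i in range(1, len(positions))]
    return positions, distances

def search(text, pattern, pivot):
    """Match the pattern's distance sequence in the sampled text,
    then verify each candidate against the original text."""
    positions, distances = sample(text, pivot)
    p_pos = [i for i, c in enumerate(pattern) if c == pivot]
    if len(p_pos) < 2:  # too few pivot occurrences: fall back to naive search
        return [i for i in range(len(text) - len(pattern) + 1)
                if text[i:i + len(pattern)] == pattern]
    p_dist = [p_pos[i] - p_pos[i - 1] for i in range(1, len(p_pos))]
    k = len(p_dist)
    matches = []
    for j in range(len(distances) - k + 1):
        if distances[j:j + k] == p_dist:  # candidate found in the sampled text
            start = positions[j] - p_pos[0]
            if start >= 0 and text[start:start + len(pattern)] == pattern:
                matches.append(start)
    return matches
```

For example, `search("abracadabra", "abra", "a")` locates both occurrences by matching the single distance 3 in the sampled sequence [3, 2, 2, 3] and verifying the two candidates.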
This study was conducted to automatically generate pantun sampiran (opening couplets) based on pattern matching and to analyze the naturalness of the resulting pantun. In the initial stage, a database containing templates and a dictionary of terms is built. The system requires user input in the form of the pantun's content, used as keywords. A template is then selected. From the keywords entered by the user, a rhyme is derived. This rhyme is matched against the term-dictionary database, and terms with the corresponding rhyme are retrieved. The final step merges the variables in the template with the selected terms to form a complete sampiran text. Naturalness is evaluated through a survey in which respondents rate the generated sampiran for readability, clarity, and appropriateness. The results show that the pattern-matching method can be used to generate pantun sampiran automatically in accordance with the rules, both in the number of lines and in rhyme. This is supported by good naturalness scores from users, with readability, clarity, and general appropriateness rated highly at 95%, 93%, and 97.5%, respectively.
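The template-and-rhyme pipeline described above can be sketched roughly as below. The template, the lexicon, the `{X}` slot syntax, and the last-two-letters rhyme rule are all illustrative assumptions, not the paper's actual resources or rhyme model.

```python
def rhyme_of(word, n=2):
    """Approximate a word's rhyme by its last n letters
    (a crude stand-in for a real rhyme extractor)."""
    return word.lower()[-n:]

def build_sampiran(keyword, template, lexicon):
    """Pick a lexicon term whose rhyme matches the keyword's rhyme,
    then substitute it into the template's variable slot."""
    target = rhyme_of(keyword)
    for term in lexicon:
        if rhyme_of(term) == target:
            return template.replace("{X}", term)
    return None  # no rhyming term found in the lexicon
```

For instance, with keyword "heart", the term "part" is selected from the lexicon and slotted into the template, so the generated line rhymes with the pantun's content line.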
Adverse drug reaction (ADR) reporting is a major component of drug safety monitoring; its input will, however, only be optimized if systems can manage its tremendous flow of information, based primarily on unstructured text fields. The aim of this study was to develop an automated system for coding ADRs from patient reports. Our system was based on a knowledge base about drugs, enriched by supervised machine learning (ML) models trained on patient reporting data. To train our models, we selected all cases of ADRs reported by patients to a French Pharmacovigilance Centre through a national web portal between March 2017 and March 2019 (n = 2,058 reports). We tested both conventional ML models and deep-learning models. We performed an external validation using a dataset consisting of a random sample of ADRs reported to the Marseille Pharmacovigilance Centre over the same period (n = 187). In terms of area under the curve (AUC) and F-measure, the best model for identifying ADRs was gradient boosting trees (LGBM), with an AUC of 0.93 (0.92–0.94) and an F-measure of 0.72 (0.68–0.75). On external validation, this model showed an AUC of 0.91 and an F-measure of 0.58. We evaluated an artificial intelligence pipeline that proved able to learn to identify ADRs correctly from unstructured data. This result allowed us to start a new study using more data to further improve performance and to offer a tool for efficiently managing drug safety information in practice.
The authorship attribution task assumes the presence of several example documents written by various authors; the goal is to determine who wrote a given anonymous text. Each author is hypothesized to have a specific writing style, with characteristics the authors themselves are not aware of. The writing style acts as a fingerprint, as various features have been shown to remain consistent for one author over the years. The Chaos Game Representation, a method for creating images from nucleotide sequences, is modified to make images from chunks of text documents. A text is transformed into a fingerprint-like representation that is used to check for similarities between the patterns in such marks from texts of distinct authors. Results indicate that this representation encodes enough of an author's writing style to make the methodology competitive in this field, with applications of both historic and current importance.
•Chaos game representation is used to make fingerprint-like images from text.
•Images are classified via machine learning techniques for authorship attribution.
•The methodology is validated on the CCAT, IMDb62 and PAN-12 data sets.
•The method is shown to perform similarly well on 3 non-English data sets.
•The Robert Galbraith pseudonym case is treated in several scenarios.
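The core Chaos Game Representation iteration can be sketched as follows, using the classic four-corner assignment for nucleotides; how the paper assigns text chunks to corners is not reproduced here, and the rasterisation grid size is an illustrative assumption.

```python
# Classic nucleotide-to-corner assignment on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq, corners=CORNERS):
    """Chaos Game Representation: start at the centre of the unit
    square and repeatedly move halfway toward the corner assigned
    to the next symbol."""
    x, y = 0.5, 0.5
    pts = []
    for s in seq:
        cx, cy = corners[s]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return pts

def cgr_image(seq, corners=CORNERS, n=8):
    """Rasterise the CGR points into an n x n count matrix:
    the fingerprint-like image that is then fed to a classifier."""
    img = [[0] * n for _ in range(n)]
    for x, y in cgr_points(seq, corners):
        img[min(int(y * n), n - 1)][min(int(x * n), n - 1)] += 1
    return img
```

Because each step halves the distance to a corner, the position after k symbols encodes the entire k-symbol suffix, which is what makes the resulting image act like a sequence fingerprint.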
Urdu is still considered a low-resource language despite being ranked as the world's <inline-formula> <tex-math notation="LaTeX">10^{th} </tex-math></inline-formula> most spoken language, with nearly 230 million speakers. The scarcity of benchmark datasets in low-resource languages has led researchers to adopt more ingenious techniques to curb the issue. One widely adopted option is to use language translation services to replicate existing datasets from resource-rich languages, such as English, in low-resource languages, such as Urdu. For most natural language processing tasks, including polarity assessment, words translated from one language to another via Google Translate often change meaning. This results in a polarity shift that degrades system performance, particularly for sentiment classification and emotion detection tasks. This study evaluates the effect of translation from a resource-rich language to a low-resource language on the sentiment classification task. It identifies and lists words causing polarity shift in five distinct categories. It further examines the correlation between languages with similar roots. Our study shows a 2-3 percentage point performance degradation in sentiment classification due to the polarity shift caused by translation from resource-rich to low-resource languages.
Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies current trends and outlines some research directions for this field.
•In summary, our proposed approach has the following contributions:
○Our proposed approach investigates the application of a hate speech detection approach to vulnerable community identification.
○We successfully identify a potentially vulnerable community in terms of hatred on social media, using the example of Amharic text data on Facebook.
○We collected and annotated Amharic data for the task of hate speech detection, aligned with multicultural societies like Ethiopia.
○We utilize the Apache Spark distributed platform for data pre-processing and feature extraction, since social media data is very noisy and large and needs efficient tools for processing.
With the rapid development of mobile computing and Web technologies, online hate speech has been increasingly spread on social network platforms, since it is easy to post any opinion. Previous studies confirm that exposure to online hate speech has serious offline consequences for historically deprived communities. Thus, research on automated hate speech detection has attracted much attention. However, the role of social networks in identifying communities vulnerable to hate is not well investigated. Hate speech can affect all population groups, but some are more vulnerable to its impact than others. For example, for ethnic groups whose languages have few computational resources, it is a challenge to automatically collect and process online texts, not to mention automatic hate speech detection on social media. In this paper, we propose a hate speech detection approach to identify hatred against vulnerable minority groups on social media. Firstly, within the Spark distributed processing framework, posts are automatically collected and pre-processed, and features are extracted using word n-grams and word embedding techniques such as Word2Vec. Secondly, deep learning classification algorithms such as the Gated Recurrent Unit (GRU), a variant of Recurrent Neural Networks (RNNs), are used for hate speech detection. Finally, hate words are clustered with methods such as Word2Vec to predict the potential target ethnic group of the hatred. In our experiments, we use the Amharic language in Ethiopia as an example. Since there was no publicly available dataset of Amharic texts, we crawled Facebook pages to prepare the corpus. Since data annotation could be biased by culture, we recruited annotators from different cultural backgrounds and achieved better inter-annotator agreement.
In our experimental results, feature extraction using word embedding techniques such as Word2Vec performs better in both classical and deep learning-based classification algorithms for hate speech detection, among which GRU achieves the best result. Our proposed approach successfully identifies the Tigre ethnic group as the community most vulnerable to hatred, compared with the Amhara and Oromo. Identifying groups vulnerable to hatred is vital to protecting them: automatic hate speech detection models can remove content that aggravates psychological harm and physical conflict, and this can also encourage the development of policies, strategies, and tools to empower and protect vulnerable communities.
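The word n-gram feature extraction in the pipeline above can be sketched as follows. This is a plain-Python illustration with an assumed whitespace tokenizer and (1, 2) n-gram range; the paper's actual pipeline runs on Apache Spark and also uses Word2Vec embeddings, which are not shown here.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """All contiguous word n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(posts, n_range=(1, 2)):
    """Bag-of-n-grams features per post: count every word unigram
    and bigram after simple lowercasing and whitespace tokenization."""
    feats = []
    for post in posts:
        tokens = post.lower().split()
        counts = Counter()
        for n in range(n_range[0], n_range[1] + 1):
            counts.update(word_ngrams(tokens, n))
        feats.append(counts)
    return feats
```

These sparse counts would then be vectorized (e.g. via a fixed vocabulary) before being fed to the classical classifiers, while the GRU consumes embedding sequences instead.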
Recent advances in low-resource abstractive summarization were largely made through the adoption of specialized pre-training, pseudo-summarization, which integrates content selection knowledge through various centrality-based sentence recovery tasks. However, despite the substantial results, there are several cases where the predecessor general-purpose pre-trained language model BART outperforms its summarization-specialized counterparts in both few-shot and fine-tuned scenarios. In this work, we investigate these performance irregularities and shed some light on the effect of pseudo-summarization pre-training in low-resource settings. We benchmarked five pre-trained abstractive summarization models on five datasets of diverse domains and analyzed their behavior in terms of extractive intuition and attention patterns. Although all models exhibit extractive behavior, some lack the prediction confidence to copy longer text fragments and have an attention distribution misaligned with the structure of real-world texts. The latter turns out to be the major factor of underperformance in the fiction, news, and scientific-article domains, as the better initial attention alignment of BART leads to the best benchmark results in all few-shot settings. A further examination reveals that BART's summarization capabilities are a side effect of the combination of the sentence permutation task and specificities of the pre-training dataset. Based on this discovery, we introduce Pegasus-SP, an improved pre-trained abstractive summarization model that unifies pseudo-summarization with sentence permutation. The new model outperforms the existing counterparts in low-resource settings and demonstrates superior adaptability. Additionally, we show that all pre-trained summarization models benefit from data-wise attention correction, achieving up to 10% relative ROUGE improvement on the model-data pairs with the largest distribution discrepancies.
In document image analysis, segmentation is the task that identifies the regions of a document. The increasing number of applications of document analysis requires a good knowledge of the available technologies. This survey highlights the variety of the approaches that have been proposed for document image segmentation since 2008. It provides a clear typology of documents and of document image segmentation algorithms. We also discuss the technical limitations of these algorithms, the way they are evaluated and the general trends of the community.
•Extensive review of the state of the art with a well-defined scope.
•Analysis of the algorithms from a scientific and an industrial point of view.
•Well-defined document and algorithm typologies.
•Discussion on the trends of the field and the evaluation of the algorithms.