The field of
succinct data structures
has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS ...2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing.
We report the following advances in string indexing and analysis: We show that the BWT of a string
T
∈ {1,…,σ}
n
can be built in deterministic
O
(
n
) time using just
O
(
n
log σ) bits of space, where σ ≤
n
. Deterministic linear time is achieved by exploiting a new
partial rank
data structure that supports queries in constant time and that might have independent interest. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of
T
. Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration and can thus be solved in deterministic
O
(
n
) time and in
O
(
n
log σ) bits of space from the input string by tailoring the enumeration algorithm to some problem-specific computations.
We also show how to build many of the existing indexes based on the BWT, such as the
compressed suffix array
, the
compressed suffix tree
, and the
bidirectional BWT index
, in
randomized
O
(
n
) time and in
O
(
n
log σ) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used
O
(
n
log σ) bits of space, took
O
(
n
log log σ) time for the first two structures and
O
(
n
log
ϵ
n
) time for the third, where ϵ is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if one was willing to spend
O
(
n
log σ log log
σ
n
) bits of space. Contrary to the state-of-the-art, our bidirectional BWT index supports every operation in constant time per element in its output.
Language is constantly developing, and the study of changes in the language's vocabulary is an important task of linguistics. The modern English language is characterized by the constant formation of ...new words. The suffix system of modern English has been actively and fruitfully studied by both domestic and foreign linguists; a significant number of works have been written considering certain aspects of the suffix system of the English language. However, the language changes over time, and its structure partially changes. The existing theoretical knowledge becomes insufficient. Therefore, it is necessary to identify new, modern trends that have been observed in recent decades. The article is devoted to the patterns of the use of noun suffixes in modern English. Also, in this article, the concept and types of word formation as one of the ways of enriching the language are studied. Thanks to the use of live suffixes, it will be easier to enrich your vocabulary. Suffixes are easily distinguished as word-forming elements. They are suffixes. They are used to form new words. The productivity of the suffix depends on how many formations they give. The article provides examples of suffixes of the actor and ways of formation from different parts of speech. Most often, nouns and verbs with the addition of suffixes are used for word formation at the base of the word. In addition, this article examines the definitions and approaches to defining the concepts of new words. New words formed by the suffix method are investigated. The most productive ways of forming new words are analyzed and identified. The principles of morphemic analysis of the basics are considered. Semantic and grammatical functions of the suffix morpheme are defined. The productivity of suffixes of modern English is studied. The process of distribution of suffixes by parts of speech is revealed.
Indexing highly repetitive texts—such as genomic databases, software repositories and versioned text collections—has become an important problem since the turn of the millennium. A relevant ...compressibility measure for repetitive texts is
r
, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used
O
(
r
) space and was able to efficiently count the number of occurrences of a pattern of length
m
in a text of length
n
(in
O
(
m
log log
n
) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of
r
. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the
occ
occurrences efficiently (in
O
(
occ
log log
n
) time) within
O
(
r
) space. By raising the space to
O
(
r
log log
n
), our index counts the occurrences in optimal time,
O
(
m
), and locates them in optimal time as well,
O
(
m
+
occ
). By further raising the space by an
O
(
w
/ log σ) factor, where σ is the alphabet size and
w
= Ω (log
n
) is the RAM machine size in bits, we support count and locate in
O
(⌈
m
log (σ)/
w
⌉) and
O
(⌈
m
log (σ)/
w
⌉ +
occ
) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using
O
(
r
log (
n
/
r
)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time
O
(log (
n
/
r
)+ℓ log (σ)/
w
). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time
O
(log (
n
/
r
)), and extend these capabilities to full suffix tree functionality, typically in
O
(log (
n
/
r
)) time per operation. Our experiments show that our
O
(
r
)-space index outperforms the space-competitive alternatives by 1--2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1--2 orders of magnitude in space and/or 2--3 in time.
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and ...especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
DNS in Real World
International journal of innovative technology and exploring engineering,
8/2019, Letnik:
8, Številka:
9S3
Journal Article
Odprti dostop
The examination of courseware has investigated the World Wide Web, and current patterns propose that the examination of setting free sentence structure will before long rise. Truth be told, few ...cyberinformaticians would differ with the investigation of virtual machines. In this paper we utilize remote originals to demonstrate that deletion coding and blockage control can conspire to answer this conundrum.
Morpheme order in Mari declension has been extensively studied in the past but attempts to explain the large amounts of alternation found here have been constrained by the difficulty of accessing ...sufficient data to properly elucidate the complexities in this domain. The paper at hand examines the prospect of using the Corpus of Literary Mari, created by an international workgroup around Trond Trosterud and his colleagues and hosted by Giellatekno, and other recently published resources on Mari to efficiently access vast amounts of data to quantitatively study this subject in a manner that had not previously been possible.
The suffix array is one of the most prevalent data structures for string indexing; it stores the lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the ...suffix tree is space efficiency. In Property Indexing, we are given a string x of length n and a property Π, i.e., a set of Π-valid intervals over x so that a pattern p occurs in x if and only if x has an occurrence of p that lies entirely within an interval of Π. A suffix-tree-like index over the valid prefixes of suffixes of x can be built in time and space O(n). We show here how to directly build a suffix-array-like index, the Property Suffix Array (PSA), in time and space O(n). We mainly draw our motivation from weighted (probabilistic) sequences: sequences of probability distributions over a given alphabet. Given a probability threshold 1/z, we say that a string p of length m matches a weighted sequence x of length n at starting position i if the product of probabilities of the letters of p at positions i, …, i+m − 1 in x is at least 1/z. Our algorithm for building the PSA can be directly applied to build an O(nz)-sized suffix-array-like index over x in time and space O(nz). Finally, we present extensive experimental results showing that our new indexing data structure is well suited for real-world applications.
Los sufijos -oso, -ento, -udo y -ón (lluvioso, hambriento, barbudo, cabezón) comparten un espacio semántico que está a disposición de los hablantes, principalmente, para formar adjetivos ...calificativos (NGLE, 2010). La cercanía entre estos sufijos provoca que establezcan una rivalidad entre sí, la cual se evidencia en la gran cantidad de dobletes que existen en la lengua (Autor, 2016a): mugroso, mugriento; caprichoso, caprichudo; barbudo, barbón; sangrón, sangriento; lloroso, llorón, etc. En este artículo, a partir de un punto de vista onomasiológico de la formación de palabras (Štekauer, 2005a, 2005b, 2016), analizaremos estos sufijos para establecer los criterios que pueden distinguir los ámbitos de actuación de cada uno y las características principales de las palabras formadas en cada caso. Seguiremos una metodología que se ha utilizado ya para estudiar este tipo de sufijos en competencia, que distingue entre factores estructurales y contextuales. Utilizaremos datos de palabras atestiguadas, tanto en diccionarios como en neologismos (Autor, 2016b) para conocer las tendencias actuales en el español de México. Al final, se demostrará que cada sufijo establece su propio dominio de acción y que los dobletes existentes se deben a que cada palabra expresa sutiles diferencias de significado.
Random subsequence forests He, Zengyou; Wang, Jiaqi; Jiang, Mudi ...
Information sciences,
20/May , Letnik:
667
Journal Article
Recenzirano
The random forest classifier is widely used in different fields due to its accuracy and robustness. Since its invention, the random forest algorithm is naturally developed for multi-dimensional ...vectorial data since features can be directly sampled during the decision tree construction procedure. In the context of discrete sequence classification, an explicit feature set is not readily available and we need to employ a feature extraction algorithm before building the random forest. However, such a predefined feature subset may limit the diversity of decision trees since the set of candidate features is composed of all subsequences. As a result, the predictive accuracy of constructed random forest classifier may be reduced. To address this, we propose a new algorithm that is able to directly build a random forest by choosing features from the set of all subsequences adaptively. To improve the running efficiency of our algorithm, the count-suffix tree is utilized to facilitate the fast frequency counting of subsequences so as to accelerate the generation of each randomized decision tree. The experimental results on 15 real datasets show that our method can outperform those state-of-the-art classification algorithms in terms of the predictive accuracy. The source code of our method can be found at: https://github.com/JiaqiWang-dlut/RSForest.
The book is a research monograph that reviews and revises the concept of linguistic pejoration, and explores the role of 15 suffixes and combining forms, such as -ie, -o, -ard, -holic, -rrhea, -itis, ...-porn, -ish, in the formation of English pejoratives.