Reverse-Safe Text Indexing Bernardini, Giulia; Chen, Huiping; Fici, Gabriele ...
The ACM journal of experimental algorithmics,
12/2021, Volume:
26
Journal Article
Peer reviewed
Open access
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure
D
...is called
z
-
reverse-safe
when there exist at least
z
datasets with the same set of answers as the ones stored by
D
. The main challenge is to ensure that
D
stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length
n
and an integer
z
, we propose an algorithm that constructs a
z
-reverse-safe data structure (
z
-RSDS) that has size
O(n)
and answers decision and counting pattern matching queries of length at most
d
optimally, where
d
is maximal for any such
z
-RSDS. The construction algorithm takes
O(nɷ log d)
time, where
ɷ
is the matrix multiplication exponent. We show that, despite the
nɷ
factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We also show that plugging our method in data analysis applications gives insignificant or no data utility loss. Furthermore, we show how our technique can be extended to support applications under realistic adversary models. Finally, we show a
z
-RSDS for decision pattern matching queries, whose size can be sublinear in
n
. A preliminary version of this article appeared in ALENEX 2020.
Compressed self-indexes are used widely in string processing applications, such as information retrieval, genome analysis, data mining, and web searching. The index not only indexes the data, but ...also encodes the data, and it is in compressed form. Moreover, the index and the data it encodes can be operated upon directly, without need to uncompress the entire index, thus saving time while maintaining small storage space. In some applications, such as in genome analysis, existing methods do not exploit the full possibilities of compressed self-indexes, and thus we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays using a hybrid encoding in the representation of the neighbor function <inline-formula><tex-math notation="LaTeX">\Phi</tex-math> <mml:math><mml:mi>Φ</mml:mi></mml:math><inline-graphic xlink:href="vitter-ieq1-3114401.gif"/> </inline-formula>. We analyze the performance in theory and practice of our recommended indexing method, called <inline-formula><tex-math notation="LaTeX">{{\sf GeCSA}}</tex-math> <mml:math><mml:mi mathvariant="sans-serif">GeCSA</mml:mi></mml:math><inline-graphic xlink:href="vitter-ieq2-3114401.gif"/> </inline-formula>. We can improve retrieval time further using an iterated version of the neighbor function. Experimental results on the tested data demonstrate that the proposed index <inline-formula><tex-math notation="LaTeX">{{\sf GeCSA}}</tex-math> <mml:math><mml:mi mathvariant="sans-serif">GeCSA</mml:mi></mml:math><inline-graphic xlink:href="vitter-ieq3-3114401.gif"/> </inline-formula> has good overall advantages in space usage and retrieval time over the state-of-the-art indexing methods, especially on the repetitive data.
Document retrieval is one of the best-established information retrieval activities since the ’60s, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most ...relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. Inthis survey, we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Full text
Available for:
IZUM, KILJ, NUK, PILJ, SAZU, UL, UM, UPUK
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes ...that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into
self-indexes
, which in addition contain enough information to reproduce any text portion, so they
replace
the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously.
In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.
Full text
Available for:
IZUM, KILJ, NUK, PILJ, SAZU, UL, UM, UPUK
•We present a simple algorithm to compute the document array.•The algorithm runs in linear time using constant workspace.•The source code is publicly available at ...https://github.com/felipelouza/document-array.
We present a simple algorithm for computing the document array given a string collection and its suffix array as input. Our algorithm runs in linear time using constant additional space for strings from constant alphabets.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPUK, ZRSKP
A weighted sequence is a sequence of probability mass functions over a finite alphabet. A weighted index is a data structure constructed for a weighted sequence and a threshold 1z that, given a ...string pattern, reports all positions where it occurs in the weighted sequence with probability at least 1z. We present an O(nz)-time construction of an O(nz)-sized weighted index for a weighted sequence of length n that answers queries in optimal time. The previous solution by Amir et al. (2008) required O(nz2logz) time and space. Our main tools are a construction of a family of ⌊z⌋ strings that carries the information about all the strings that occur in a weighted sequence and a more straightforward solution to so-called property indexing. We present applications of our weighted index, in particular in approximate and general scenarios that were introduced by Biswas et al. (2016), and provide its implementation.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPUK, ZAGLJ, ZRSKP
Let a text T1..n be the only string generated by a context-free grammar with g (terminal and nonterminal) symbols, and of size G (measured as the sum of the lengths of the right-hand sides of the ...rules). Such a grammar, called a grammar-compressed representation of T, can be encoded using GlgG bits. We introduce the first grammar-compressed index that uses O(Glgn) bits (precisely, Glgn+(2+ϵ)Glgg for any constant ϵ>0) and can find the occ occurrences of patterns P1..m in time O((m2+occ)lgG). We implement the index and demonstrate its practicality in comparison with the state of the art, on highly repetitive text collections.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPUK, ZAGLJ, ZRSKP
The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different ...combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer <inline-formula><tex-math notation="LaTeX">\ell</tex-math></inline-formula>, our mechanism selects the lexicographically smallest rotation in every length-<inline-formula><tex-math notation="LaTeX">\ell</tex-math></inline-formula> fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to <inline-formula><tex-math notation="LaTeX">\ell</tex-math></inline-formula>; and that these sizes are competitive to or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-<inline-formula><tex-math notation="LaTeX">K</tex-math></inline-formula> similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-<inline-formula><tex-math notation="LaTeX">K</tex-math></inline-formula> similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster .
Two recent lower bounds on the compressibility of repetitive sequences,
δ
≤
γ
, have received much attention. It has been shown that a length-
n
string
S
over an alphabet of size
σ
can be represented ...within the optimal
O
(
δ
log
n
log
σ
δ
log
n
)
space, and further, that within that space one can find all the
occ
occurrences in
S
of any length-
m
pattern in time
O
(
m
log
n
+
o
c
c
log
ϵ
n
)
for any constant
ϵ
>
0
. Instead, the near-optimal search time
O
(
m
+
(
o
c
c
+
1
)
log
ϵ
n
)
has been achieved only within
O
(
γ
log
n
γ
)
space. Both results are based on considerably different locally consistent parsing techniques. The question of whether the better search time could be supported within the
δ
-optimal space remained open. In this paper, we prove that both techniques can indeed be combined to obtain the best of both worlds:
O
(
m
+
(
o
c
c
+
1
)
log
ϵ
n
)
search time within
O
(
δ
log
n
log
σ
δ
log
n
)
space. Moreover, the number of occurrences can be computed in
O
(
m
+
log
2
+
ϵ
n
)
time within
O
(
δ
log
n
log
σ
δ
log
n
)
space. We also show that an extra sublogarithmic factor on top of this space enables optimal
O
(
m
+
o
c
c
)
search time, whereas an extra logarithmic factor enables optimal
O
(
m
) counting time.
Full text
Available for:
EMUNI, FIS, FZAB, GEOZS, GIS, IJS, IMTLJ, KILJ, KISLJ, MFDPS, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, SBMB, SBNM, UKNU, UL, UM, UPUK, VKSCE, ZAGLJ