In recent years, several
compressed indexes
based on variants of the Burrows–Wheeler transform have been introduced. Some of these are used to index structures far more complex than a single string, ...as was originally done with the FM-index (Ferragina and Manzini in J. ACM 52(4):552–581, https://doi.org/10.1145/1082036.1082039, 2005). As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs (Gagie et al. in Theor Comput Sci 698:67–78, https://doi.org/10.1016/j.tcs.2017.06.016, 2017). Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs, and that Wheeler graphs can be indexed in a space-efficient way. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We show:
The problem of recognizing whether a given graph
G
=
(
V
,
E
)
is a Wheeler graph is NP-complete for any edge label alphabet of size
σ
≥
2
, even when
G
is a DAG. This holds even on a restricted subset of graphs called
d
-NFAs for
d
≥
5
. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for
d
-NFAs where
d
≤
2
. We also show that the recognition problem can be solved in linear time for
σ
=
1
on graphs without self-loops;
There exists an
2
e
log
σ
+
O
(
n
+
e
)
time exact algorithm where
n
=
|
V
|
and
e
=
|
E
|
. This algorithm relies on graph isomorphism being computable in strictly sub-exponential time;
We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to identify the smallest set of edges that have to be removed from a graph to obtain a Wheeler graph. We show WGV is APX-hard, even when
G
is a DAG, implying there exists a constant
C
>
1
for which there is no
C
-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all
C
>
1
, it is NP-hard to find a
C
-approximation, implying WGV is not in APX;
We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of WGV). In contrast to WGV, we give an
O
(
σ
)
-approximation algorithm for the WS problem, implying it is in APX for
σ
=
O
(
1
)
.
The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial-time solvable, raising the question of which properties determine this problem’s difficulty.
Full text
Available for:
EMUNI, FIS, FZAB, GEOZS, GIS, IJS, IMTLJ, KILJ, KISLJ, MFDPS, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, SBMB, SBNM, UKNU, UL, UM, UPUK, VKSCE, ZAGLJ
Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate ...step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in
denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds an additional cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient >0.9 with edit distance for closely as well as distantly related sequences.
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the ...methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n(2)) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common string (ACSk) problem that takes O(n) space and O(n log(k) n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
Parameterized pattern matching is a string searching variant that was initially defined to detect duplicate code but later proved to support several other applications. In particular, two ...equal-length strings X andY are a parameterized-match if there exists a bijective function g for which every text symbol in X is equal to g(Y). Baker was the first researcher to have addressed this problem (Baker, 1993) and, since then, many others have followed her work. She did, indeed, open up a wide field of extensive research. Over the years, many variants and extensions that have been pursued include: parameterized matching under edit and Hamming distances, parameterized multi-pattern matching, two dimensional parameterized matching, structural matching, function matching, and the very recent developments in succinct and streaming models. This accelerated research could only be justified by the usefulness of its practical applications such as in software maintenance, image processing and bioinformatics to name some. Even though the problem was posed about 25 years ago, research on parameterized matching is still very active. Its extensive study over the years and its current relevance motivate us to review the most notorious contributions as road map for current and future research.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPCLJ, UPUK, ZAGLJ, ZRSKP
Let
T
1
and
T
2
be two rooted trees with an equal number of leaves. The leaves are labeled, and the labeling of the leaves in
T
2
is a permutation of those in
T
1
. Nodes are associated with weight, ...such that the weight of a node
u
, denoted by
W
(
u
), is more than the weight of its parent. A node
x
∈
T
1
and a node
y
∈
T
2
are induced, iff their subtrees have at least one common leaf label. A
heaviest induced ancestor
query
HIA
(
u
1
,
u
2
)
with input nodes
u
1
∈
T
1
and
u
2
∈
T
2
asks to output the pair
(
u
1
∗
,
u
2
∗
)
of induced nodes with the highest combined weight
W
(
u
1
∗
)
+
W
(
u
2
∗
)
, such that
u
1
∗
is an ancestor of
u
1
and
u
2
∗
is an ancestor of
u
2
. This is a useful primitive in several text processing applications. Gagie et al. (Proceedings of the 25th Canadian Conference on Computational Geometry, CCCG 2013, Waterloo, Ontario, Canada, 2013) introduced this problem and proposed three data structures with the following space-time trade-offs: (i)
O
(
n
log
2
n
)
space and
O
(
log
n
log
log
n
)
query time, (ii)
O
(
n
log
n
)
space and
O
(
log
2
n
)
query time, and (iii)
O
(
n
) space and
O
(
log
3
+
ϵ
n
)
query time. Here
n
is the number of nodes in both trees combined and
ϵ
>
0
is an arbitrarily small constant. We present two new data structures with better space-time trade-offs: (i)
O
(
n
log
n
)
space and
O
(
log
n
log
log
n
)
query time, and (ii)
O
(
n
) space and
O
(
log
2
n
/
log
log
n
)
query time. Additionally, we present new applications of these results.
Full text
Available for:
EMUNI, FIS, FZAB, GEOZS, GIS, IJS, IMTLJ, KILJ, KISLJ, MFDPS, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, SBMB, SBNM, UKNU, UL, UM, UPUK, VKSCE, ZAGLJ
Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic ...trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS
, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS
takes O(n logkn) time and hence impractical for large datasets, multiple heuristics that can approximate ACS
have been introduced.
In this paper, we present a novel linear-time heuristic to approximate ACS
, which is faster than computing the exact ACS
while being closer to the exact ACS
values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.
Our method produces a better approximation for ACS
and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs .
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring ...problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log
n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the any-case O(n
) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any , while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the k-mismatch shortest unique substring finding problem can be solved using O(n log
n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.
Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. ...While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings.
We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed.
ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package ( http://alurulab.cc.gatech.edu/phylo ).
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Text indexing is a fundamental problem in computer science. The objective is to preprocess a text
T
, so that, given a pattern
P
, we can find all starting positions (or simply, occurrences) of
P
in
...T
efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the
non-overlapping indexing
problem, and the
range non-overlapping indexing
problem. Given a text
T
having
n
characters, the non-overlapping indexing problem is defined as follows: pre-process
T
into a data structure, such that for any pattern
P
, containing |
P
| characters, we can report a set containing the maximum number of non-overlapping occurrences of
P
in
T
. Cohen and Porat (in: Algorithms and computation, 20th international symposium, ISAAC 2009, Honolulu, Hawaii. Proceedings,
2009
) showed that by maintaining a linear space index in which the suffix tree of
T
is augmented with an
O
(
n
) word data structure, a query
P
can be answered in optimal time
O
(
|
P
|
+
n
o
c
c
)
, where
nocc
is the number of occurrences reported. We present the following new result. Let
CSA
(not necessarily a compressed suffix array) be an index of
T
that can compute (i) the suffix range of
P
in
search
(
P
)
time, and (ii) a suffix array or an inverse suffix array value in
t
SA
time. By using
CSA
alone, we can answer a query
P
in
search
(
P
)
+
sort
(
n
o
c
c
)
+
O
(
n
o
c
c
·
t
SA
)
time. The function
sort
(
k
)
denotes the time for sorting
k
numbers in
{
1
,
2
,
⋯
,
n
}
. In the range non-overlapping indexing problem, along with the pattern
P
, two integers
a
and
b
,
b
≥
a
, are provided as input. The task is to report a set containing the maximum number of non-overlapping occurrences of
P
that lie within the range
a
,
b
. For any arbitrarily small positive constant
ϵ
, we present an
O
(
n
log
ϵ
n
)
word index with
O
(
|
P
|
+
n
o
c
c
a
,
b
)
query time, where
n
o
c
c
a
,
b
is the number of occurrences reported. Our index improves upon the result of Cohen and Porat
6
.
Full text
Available for:
EMUNI, FIS, FZAB, GEOZS, GIS, IJS, IMTLJ, KILJ, KISLJ, MFDPS, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, SBMB, SBNM, UKNU, UL, UM, UPUK, VKSCE, ZAGLJ
Finding substrings of a text T that match a regular expression p is a fundamental problem. Despite being the subject of extensive research, no solution with a time complexity significantly better ...than O(|T||p|) has been found. Backurs and Indyk in FOCS 2016 established conditional lower bounds for the algorithmic problem based on the Strong Exponential Time Hypothesis that helps explain this difficulty. A natural question is whether we can improve the time complexity for matching the regular expression by preprocessing the text T? We show that conditioned on the Online Matrix–Vector Multiplication (OMv) conjecture, even with arbitrary polynomial preprocessing time, a regular expression query on a text cannot be answered in strongly sublinear time, i.e., O(|T|1−ε) for any ε>0. Furthermore, if we extend the OMv conjecture to a plausible conjecture regarding Boolean matrix multiplication with polynomial preprocessing time, which we call Online Matrix–Matrix Multiplication (OMM), we can strengthen this hardness result to there being no solution with a query time that is O(|T|3/2−ε). These results hold for alphabet sizes three or greater. We then provide data structures that answer queries in O(|T||p|τ) time where τ∈1,|T| is fixed at construction. These include a solution that works for all regular expressions with Expτ·|T| preprocessing time and space. For patterns containing only ‘concatenation’ and ‘or’ operators (the same type used in the hardness result), we provide (1) a deterministic solution which requires Expτ·|T|log2|T| preprocessing time and space, and (2) when |p|≤|T|z for z=2o(log|T|), a randomized solution with amortized query time which answers queries correctly with high probability, requiring Expτ·|T|2Ωlog|T| preprocessing time and space.
Full text
Available for:
IZUM, KILJ, NUK, PILJ, PNG, SAZU, UL, UM, UPUK