•Fast assembly of large and complex metagenomic datasets up to hundreds of Gb.
•Integration of advanced assembly practices gives better assembly quality.
•CPU-based algorithms eliminate the GPU dependency, yet are more memory-efficient and faster.
The study of metagenomics has benefited greatly from low-cost, high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume substantial computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note, Li et al. (2015) [1]), the first NGS metagenome assembler able to assemble metagenomic datasets of hundreds of giga base-pairs (Gbp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in adapting the algorithms to integrate popular assembly practices so as to improve assembly quality, as well as in speeding up the software using better CPU-based algorithms (instead of GPU).
In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then present the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, compared with v0.1, improves significantly: a 36% increase in assembly size and 23% in N50. More interestingly, MEGAHIT v1.0 is no slower than before, even when running the extra modules. This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 hours, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. With its smaller memory footprint, MEGAHIT v1.0 can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at https://hku-bal.github.io/megabox.
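As a minimal illustration of the graph such assemblers build, the sketch below extracts de Bruijn graph edges from reads in plain Python. The function name and set-based representation are illustrative only; they have nothing to do with MEGAHIT's succinct, parallel SdBG construction.

```python
def debruijn_edges(reads, k):
    """Collect (k-1)-mer -> (k-1)-mer edges, one per k-mer in the reads.

    A toy statement of the de Bruijn graph definition that assemblers
    like MEGAHIT work with; real assemblers build this succinctly and
    in parallel rather than as an explicit edge set.
    """
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges
```

For example, `debruijn_edges(["ACGT"], 3)` yields the two edges `("AC", "CG")` and `("CG", "GT")`, linking overlapping (k-1)-mers.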
We consider the problem of designing succinct navigational oracles, i.e., succinct data structures supporting basic navigational queries such as degree, adjacency and neighborhood efficiently, for intersection graphs on a circle, which include graph classes such as circle graphs, k-polygon-circle graphs, circle-trapezoid graphs and trapezoid graphs. The degree query reports the number of edges incident to a given vertex, the adjacency query asks whether there is an edge between two given vertices, and the neighborhood query enumerates all the neighbors of a given vertex. We first prove a general lower bound for these intersection graph classes, and then present a uniform approach that yields matching lower and upper bounds for representing each of these graph classes. More specifically, our lower bound proofs use a unified technique to produce tight bounds for all these classes, and our data structures likewise follow a unified representation method to achieve succinctness for each class. In addition, we prove a space lower bound for representing trapezoid graphs, and give a succinct navigational oracle for this class of graphs.
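To make the adjacency query concrete for the simplest of these classes, circle graphs: two chords of a circle are adjacent exactly when their endpoints interleave. The check below is a naive sketch over explicit endpoint positions (names are illustrative); the oracles in the paper answer the same query from a succinct encoding instead.

```python
def chords_cross(a, b):
    """Adjacency test in a circle graph given explicit chords.

    Chords a = (a1, a2) and b = (b1, b2), with all four endpoints at
    distinct positions on the circle, intersect iff exactly one
    endpoint of b lies strictly between the endpoints of a.
    """
    a1, a2 = sorted(a)
    inside = sum(a1 < x < a2 for x in b)
    return inside == 1
```

Chords (1, 3) and (2, 4) interleave and so are adjacent; disjoint chords like (1, 2) and (3, 4), or nested chords like (1, 4) and (2, 3), are not.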
The massive amounts of data processed in modern computational systems pose a problem of increasing importance. Such data is commonly stored, directly or indirectly, using data exchange languages such as JSON (JavaScript Object Notation) and XML (eXtensible Markup Language), for human-readable, platform-agnostic access.
This paper explores a set of succinct representations for JSON documents, which we call SJSON, achieving reduced RAM and disk usage while supporting efficient queries on the documents. The representations we propose are based on the idea that a JSON document can be decomposed into a structural part and a raw data part. In our method, we model the structure of the JSON document as a rooted ordered tree and represent it using succinct data structures, as opposed to the usual pointer-based implementation. The remaining raw data is reorganized into arrays of attributes and values. This separation of structure and data allows for a straightforward connection between a node in the succinct tree and its corresponding name–value pair, dispensing with pointers altogether.
The proposed scheme is implemented as the SJSON library in C++ and evaluated on a number of metrics, comparing its performance with that of popular alternative JSON parsers. Empirical results show that the library represents JSON files succinctly while efficiently supporting traversal queries.
•A set of succinct representations of JSON documents is suggested.
•A succinct tree structure is used to store the structural information of the document.
•Bit string indexing is used to maintain the data part of the document.
•Experimental results based on the C++ implementation exhibit space-efficiency.
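A plain-Python sketch of the structure/data decomposition described above: the tree shape becomes a balanced-parentheses string, and the raw data goes into flat key and value arrays. This only mimics the decomposition; it is not the SJSON succinct encoding, and all names are illustrative.

```python
import json

def decompose(doc):
    """Split a parsed JSON value into (shape, keys, values).

    shape:  balanced parentheses for the rooted ordered tree,
            one "(...)" per JSON node (object, array, or leaf).
    keys:   per node, the member name under which it appears
            (None for the root and for array elements).
    values: leaf scalars, in left-to-right order.
    """
    shape, keys, values = [], [], []

    def walk(node, key):
        keys.append(key)
        shape.append("(")
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, k)
        elif isinstance(node, list):
            for v in node:
                walk(v, None)
        else:
            values.append(node)
        shape.append(")")

    walk(doc, None)
    return "".join(shape), keys, values

shape, keys, values = decompose(json.loads('{"a": 1, "b": [2, 3]}'))
```

Because the i-th "(" in `shape`, the i-th entry of `keys` and the leaves in `values` all follow the same preorder, a node's name–value pair is found by position alone, which is the pointer-free connection the abstract describes.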
Succinct Range Filters. Zhang, Huanchen; Lim, Hyeontaek; Leis, Viktor et al.
ACM Transactions on Database Systems, 07/2020, Volume 45, Issue 2. Journal article, peer reviewed, open access.
We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes while consuming only 10 bits per trie node. The false-positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters, reducing I/O by filtering requests before they access on-disk data structures. Our experiments on a 100-GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up open-seek (without upper bound) and closed-seek (with upper bound) queries by up to 1.5× and 5×, respectively, at a modest cost in worst-case (all-missing) point query throughput due to a slightly higher false-positive rate.
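The trade-off SuRF exploits, truncating keys so the filter stays small while range answers stay one-sided-correct, can be sketched with a sorted prefix list. This is a toy stand-in, not the FST, and the class and parameter names are invented for illustration.

```python
from bisect import bisect_left

class ToyRangeFilter:
    """Store only the length-p prefixes of the keys, sorted, and answer
    'might any key fall in [lo, hi]?' over the prefixes.

    Truncation can cause false positives (a prefix survives even when
    the full key falls outside the range) but never false negatives:
    if some key k is in [lo, hi], then lo[:p] <= k[:p] <= hi, so the
    first prefix >= lo[:p] is also <= hi.
    """

    def __init__(self, keys, p):
        self.p = p
        self.prefixes = sorted({k[:p] for k in keys})

    def may_contain_range(self, lo, hi):
        i = bisect_left(self.prefixes, lo[:self.p])
        return i < len(self.prefixes) and self.prefixes[i] <= hi
```

With keys {"apple", "banana", "cherry"} and p = 2, the range ["bat", "bz"] contains no key, yet the filter answers True because the prefix "ba" survives; the range ["d", "e"] is correctly rejected.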
•Using ∞ instead of 0 for encoding parameterized strings yields a simpler index.
•The presented index consists of only wavelet trees and a range maximum query index.
•The index occupies 2n lg σ + 2n + o(n) bits with O(lg σ) search time per symbol.
In parameterized string matching, a string comprises static and parameterized symbols. Two strings are said to match if there exists a one-to-one mapping on the parameterized symbols that transforms one string into the other. Traditionally, to construct a full-text index for this problem, suffixes are transformed using what is referred to as previous-occurrence encoding. Although a space-efficient data structure based on backward searching has been proposed recently, it is specialized and complex owing to the nature of the encoding scheme. In this study, we demonstrate that a slight modification of the encoding scheme, using ∞ instead of 0 to denote the first occurrence of each parameterized symbol, enables a much simpler FM-index for this problem, comprising only two wavelet trees and one range maximum query index.
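A small sketch of the modified encoding: each parameterized symbol becomes the distance to its previous occurrence, with ∞ rather than the classical 0 marking first occurrences, so two parameterized strings match exactly when their encodings are equal. The static alphabet below is an arbitrary choice for illustration.

```python
INF = float("inf")

def prev_encode(s, static=frozenset("ab")):
    """Previous-occurrence encoding with INF at first occurrences.

    Static symbols are copied through unchanged; each parameterized
    symbol is replaced by the distance back to its previous occurrence,
    or INF if it has not occurred before.
    """
    last = {}
    out = []
    for i, c in enumerate(s):
        if c in static:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else INF)
            last[c] = i
    return out
```

For instance, "xaxy" and "zazw" encode identically (to [∞, 'a', 2, ∞]) and therefore match as parameterized strings, since x↦z, y↦w is a valid one-to-one mapping.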
BSuccinct is a collection of software focused on compact and succinct data structures that are both space- and time-efficient. It is written in Rust, a programming language well suited to scientific applications owing to its emphasis on reliability and performance. BSuccinct provides libraries implementing minimal perfect hash functions, compressed static functions, bit vector manipulation and Huffman coding, together with programs to benchmark them. It also includes libraries for several auxiliary operations, including binary serialization/deserialization and accurate float summation.
Parallel construction of succinct trees. Fuentes-Sepúlveda, José; Ferres, Leo; He, Meng et al.
Theoretical Computer Science, 11/2017, Volume 700. Journal article, peer reviewed, open access.
Succinct representations of trees are an elegant solution for making large trees fit in main memory while still supporting navigational operations in constant time. However, their construction time remains a bottleneck. We introduce two parallel algorithms that improve the state of the art in succinct tree construction. Our results are presented in terms of work, the time needed to execute a parallel computation using one thread, and span, the minimum amount of time needed to execute a parallel computation with any number of threads. Given a tree on n nodes stored as a sequence of balanced parentheses, our first algorithm builds a succinct tree representation with O(n) work and O(lg n) span, and supports a rich set of operations in O(lg n) time. Our second algorithm improves the query support: it constructs a succinct representation that supports queries in O(c) time, taking O(n + (n/lg^c n) lg(n/lg^c n) + c^c) work and O(c + lg(n/(c^c lg^c n))) span, for any positive constant c. Both algorithms use O(n lg n) bits of working space. In experiments using up to 64 cores on inputs of different sizes, our first algorithm achieved good parallel speed-up. We also present an algorithm that takes O(n) work and O(lg n) span to construct the balanced parenthesis sequence of the input tree required by our succinct tree construction algorithm.
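To fix the input representation these algorithms work on, the sketch below encodes a tree as a balanced parentheses sequence and answers one navigational query by a linear scan. Succinct representations answer such queries in constant time from the same sequence; the parallel construction itself is not shown, and the names are illustrative.

```python
def to_bp(tree):
    """Encode a tree, given as nested lists of children, as a balanced
    parentheses string: '(' opens a node, ')' closes it after its
    children."""
    return "(" + "".join(to_bp(c) for c in tree) + ")"

def subtree_size(bp, i):
    """Number of nodes in the subtree whose opening parenthesis is at
    position i, found by scanning to the matching close. Succinct tree
    structures support this in O(1); the scan only shows the
    semantics."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return (j - i + 1) // 2
```

A root with a leaf child and a one-child subtree encodes as "(()(()))"; the whole tree has 4 nodes and the subtree opened at position 3 has 2.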
High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices.
In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure can store and compress a set of colored k-mers and efficiently traverse the graph. The Bloom Filter Trie was used to index and query different pan-genome datasets. Compared with another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.
We present a novel succinct data structure, the Bloom Filter Trie, for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage as well as Bloom filters for efficient trie and graph traversals. Experimental results show better performance compared with another state-of-the-art data structure.
https://www.github.com/GuillaumeHolley/BloomFilterTrie.
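The colored k-mer content a C-DBG has to store can be sketched with a plain dictionary mapping each k-mer to its set of colors; the BFT stores the same information compressed, and the function name here is illustrative.

```python
def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome indices ('colors') in which
    it occurs, i.e., the vertex content of a colored de Bruijn graph.
    A dict-based toy; the Bloom Filter Trie compresses and indexes
    this instead of materializing it."""
    colors = {}
    for color, seq in enumerate(genomes):
        for i in range(len(seq) - k + 1):
            colors.setdefault(seq[i:i + k], set()).add(color)
    return colors
```

For two toy genomes "ACGT" and "CGTT" with k = 3, the shared k-mer "CGT" gets both colors {0, 1}, while "ACG" and "GTT" each belong to a single genome.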
We introduce compact and succinct representations of a d-dimensional point set P, for any constant d ≥ 3, supporting orthogonal range searching. Our first data structure uses dn lg n + o(n lg n) bits, where n denotes the number of points in P, and supports reporting queries in O((n^((d−2)/d) + occ) lg n / lg lg n) time and counting queries in O(n^((d−2)/d) lg n / lg lg n) time, where occ denotes the number of points to report; this is faster than known algorithms. Our second data structure uses dn lg U − n lg n + o(n lg n) bits, where U is the size of the universe, asymptotically matching the information-theoretic lower bound. Its query times are worse than those of the first structure, but the algorithm runs fast in practical database searches.
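To fix the query semantics these structures support, here is the naive O(nd) scan for orthogonal range reporting; the paper's structures answer the same query far faster from a compressed encoding, and the names below are illustrative.

```python
def range_report(points, lo, hi):
    """Naive orthogonal range reporting in d dimensions: return every
    point p with lo[j] <= p[j] <= hi[j] on all axes j. A linear scan
    stating the problem, not the succinct data structures above."""
    return [p for p in points
            if all(l <= x <= h for x, l, h in zip(p, lo, hi))]
```

For example, with points (1, 2, 3) and (4, 5, 6), the query box [0, 3]^3 reports only (1, 2, 3).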