Key-value data structures have been extensively used in various applications. When a large amount of data needs to be compactly stored in a fixed memory size, a functional Bloom filter is a space-efficient key-value structure. In this letter, we propose a 2-stage functional Bloom filter structure composed of a primary functional Bloom filter and a secondary functional Bloom filter to resolve the indeterminables produced by the primary functional Bloom filter. We analytically derive the memory ratio to allocate to each of the two Bloom filters to achieve the lowest search failure rate. The analytical result is validated through experiments, demonstrating that the optimal performance is realized when the secondary functional Bloom filter uses 3% of the total memory.
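The two-stage lookup can be sketched as follows: a query that cannot be resolved in the primary filter, because all of its cells were overwritten by colliding keys, is retried in a much smaller secondary filter. The sketch below is a minimal illustration; the cell encoding, hash choice, memory split, and the insertion policy for the secondary filter are assumptions, not the letter's exact construction.

```python
import hashlib

EMPTY, CONFLICT = None, "<conflict>"

class FunctionalBloomFilter:
    """Minimal key-value (functional) Bloom filter: each cell holds a value or a conflict mark."""
    def __init__(self, m, k):
        self.cells, self.m, self.k = [EMPTY] * m, m, k

    def _indexes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, key, value):
        for idx in self._indexes(key):
            if self.cells[idx] in (EMPTY, value):
                self.cells[idx] = value
            else:
                self.cells[idx] = CONFLICT          # cell already used by a different value

    def query(self, key):
        vals = {self.cells[idx] for idx in self._indexes(key)}
        if EMPTY in vals:
            return ("negative", None)               # key was never inserted
        vals.discard(CONFLICT)
        if len(vals) == 1:
            return ("value", vals.pop())            # all usable cells agree on one value
        return ("indeterminable", None)             # every cell conflicted, or cells disagree

class TwoStageFBF:
    """Queries that are indeterminable in the primary filter fall back to the secondary filter."""
    def __init__(self, total_cells, k, secondary_ratio=0.03):   # 3% split, as in the letter's result
        m2 = max(1, int(total_cells * secondary_ratio))
        self.primary = FunctionalBloomFilter(total_cells - m2, k)
        self.secondary = FunctionalBloomFilter(m2, k)

    def insert(self, key, value):
        self.primary.insert(key, value)
        # Simplified policy: store in the secondary filter if the key is already unresolvable.
        # Keys that only become indeterminable later would need a second pass in practice.
        if self.primary.query(key)[0] == "indeterminable":
            self.secondary.insert(key, value)

    def query(self, key):
        status, value = self.primary.query(key)
        return (status, value) if status != "indeterminable" else self.secondary.query(key)
```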
Streaming triangle counting is a critical problem in graph stream mining, with applications in dense subgraph discovery, web mining, anomaly detection, and more. Recent efforts have focused on estimating triangle counts in graph streams, primarily through sampling methods. However, because memory is limited when handling high-speed streams, traditional sampling methods suffer from reduced sampling rates and thus a loss in accuracy. In this paper, we propose a new compact data structure called uHLL to process edge streams while balancing estimation accuracy against memory efficiency. Furthermore, unlike conventional triangle counting algorithms, we estimate the union set cardinality for edge-local triangle counts under both centralized and distributed frameworks, so as to efficiently estimate the global triangle count with a one-pass streaming algorithm. To the best of our knowledge, this is the first implementation of a distributed framework using a compact data structure for streaming triangle counting. We provide a theoretical proof of unbiasedness and derive the variance of the union set and global triangle count estimates. We compare our scheme with 11 algorithms, showing that under the same experimental setting, uHLL and distributed uHLL are at least 2.3 and 1.7 times more accurate than the state of the art, respectively.
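The union-cardinality step can be illustrated with plain HyperLogLog sketches: merging two sketches register-wise estimates |N(u) ∪ N(v)|, and |N(u) ∩ N(v)|, which equals the number of triangles through edge (u, v), follows by inclusion-exclusion. The code below is a simplified illustration of that idea only; uHLL itself, its bias corrections, and the distributed variant are more involved.

```python
import hashlib

class HLL:
    """Minimal HyperLogLog sketch with 2**b registers (no small/large-range corrections)."""
    def __init__(self, b=10):
        self.b, self.m = b, 1 << b
        self.reg = [0] * self.m

    def add(self, item):
        x = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        j = x & (self.m - 1)                        # register index: low b bits
        w = x >> self.b                             # remaining 64 - b bits
        rho = (64 - self.b) - w.bit_length() + 1    # position of the leftmost 1-bit
        self.reg[j] = max(self.reg[j], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)

    def union(self, other):
        u = HLL(self.b)
        u.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]  # register-wise max = set union
        return u

def edge_local_triangles(hll_u, hll_v):
    """Triangles through edge (u, v) = |N(u) ∩ N(v)| = |N(u)| + |N(v)| - |N(u) ∪ N(v)|,
    with all three cardinalities estimated from the neighbour sketches."""
    union_est = hll_u.union(hll_v).estimate()
    return max(0.0, hll_u.estimate() + hll_v.estimate() - union_est)
```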
Online 3D reconstruction is gaining newfound interest due to the availability of real-time consumer depth cameras. The basic problem takes live overlapping depth maps as input and incrementally fuses these into a single 3D model. This is particularly challenging when real-time performance is desired without trading off quality or scale. We contribute an online system for large and fine scale volumetric reconstruction based on a memory- and speed-efficient data structure. Our system uses a simple spatial hashing scheme that compresses space and allows for real-time access and updates of implicit surface data, without the need for a regular or hierarchical grid data structure. Surface data is stored densely only where measurements are observed. Additionally, data can be streamed efficiently in and out of the hash table, allowing for further scalability during sensor motion. We show interactive reconstructions of a variety of scenes, capturing both fine-grained detail and large-scale environments. We illustrate how all parts of our pipeline, from depth map pre-processing and camera pose estimation to depth map fusion and surface rendering, are performed at real-time rates on commodity graphics hardware. We conclude with a comparison to current state-of-the-art online systems, illustrating improved performance and reconstruction quality.
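The hashing idea can be sketched as follows: world space is divided into small voxel blocks, only blocks that receive measurements are allocated, and blocks are addressed through a hash of their integer block coordinates. This CPU-side sketch is illustrative only; the block size, voxel size, and the use of a Python dict in place of the paper's GPU hash table with bucket handling are assumptions, and the hash shown is a commonly used 3D spatial hash (Teschner et al.) of the kind employed by such systems.

```python
import numpy as np

BLOCK_SIZE = 8        # voxels per block side (assumed)
VOXEL_SIZE = 0.005    # metres per voxel (assumed)

def block_coords(p):
    """World-space point -> integer coordinates of the voxel block containing it."""
    return tuple(np.floor(np.asarray(p) / (VOXEL_SIZE * BLOCK_SIZE)).astype(int))

def spatial_hash(bc, n_buckets=2**20):
    """Map block coordinates to a bucket index; a dict below stands in for the bucket array."""
    x, y, z = bc
    return ((x * 73856093) ^ (y * 19349669) ^ (z * 83492791)) % n_buckets

class SparseTSDFVolume:
    """Voxel blocks are allocated lazily and looked up through a hash table (here a dict)."""
    def __init__(self):
        self.blocks = {}  # block coords -> dense (B, B, B, 2) array of (tsdf, weight)

    def get_block(self, p, allocate=False):
        bc = block_coords(p)
        if bc not in self.blocks:
            if not allocate:
                return None
            self.blocks[bc] = np.zeros((BLOCK_SIZE,) * 3 + (2,), dtype=np.float32)
        return self.blocks[bc]

    def integrate_sample(self, p, sdf, weight=1.0):
        """Weighted running-average TSDF update at the voxel containing point p."""
        block = self.get_block(p, allocate=True)
        v = np.floor(np.asarray(p) / VOXEL_SIZE).astype(int) % BLOCK_SIZE
        d, w = block[v[0], v[1], v[2]]
        block[v[0], v[1], v[2]] = ((d * w + sdf * weight) / (w + weight), w + weight)
```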
The Internet of Things (IoT) is poised to transform human life and unleash enormous economic benefit. However, inadequate data security and trust in current IoT deployments are seriously limiting adoption. Blockchain, a distributed and tamper-resistant ledger, maintains consistent records of data at different locations and has the potential to address the data security concerns of IoT networks. While providing data security to the IoT, Blockchain also encounters a number of critical challenges inherent in the IoT, such as a huge number of IoT devices, a non-homogeneous network structure, limited computing power, low communication bandwidth, and error-prone radio links. This paper presents a comprehensive survey of existing Blockchain technologies, with an emphasis on IoT applications. We identify the Blockchain technologies that can potentially address the critical challenges arising from the IoT, and hence suit IoT applications, and elaborate on potential adaptations and enhancements to Blockchain consensus protocols and data structures. Future research directions are collated for effective integration of Blockchain into IoT networks.
Over the years, three main unconventional fluorescence techniques have been introduced for the analysis of multifluorophoric mixtures: excitation–emission matrix fluorescence (EEMF), synchronous fluorescence spectroscopy (SFS), and total synchronous fluorescence spectroscopy (TSFS). The applications of EEMF, SFS and TSFS are conceptually different. The existing literature lacks a review article that gives an overview of the conceptual and analytical aspects of EEMF, SFS and TSFS for the general as well as the specialized fluorescence scientific community. The present review article attempts to address this gap and discusses various conceptual and practical aspects of EEMF, SFS and TSFS spectroscopy. It introduces several novel fluorescence parameters, the concept of a concentration-dependent red shift, and a protocol for finding the optimum wavelength offset for SFS data acquisition, and it discusses various practical aspects of integrating chemometric methods with TSFS as well as a number of successful applications of EEMF, SFS and TSFS to the analysis of simple and complex multifluorophoric mixtures.
•Conceptual and practical aspects of unconventional fluorescence techniques are discussed.
•A set of novel fluorescence parameters are introduced for the analysis of complex mixtures.
•The concept of ‘concentration red shift’ in fluorescence spectroscopy is discussed.
•Protocol for finding the optimum wavelength offset for SFS data acquisition is introduced.
•Various practical aspects of integrating chemometric methods with TSFS are discussed.
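A synchronous fluorescence spectrum is, in effect, a diagonal slice of the excitation–emission matrix (EEM) taken at a fixed offset Δλ = λ_em − λ_ex. The minimal sketch below shows how such a slice can be extracted and how candidate offsets might be compared; the function names and the naive total-intensity criterion are illustrative assumptions and not the review's optimum-offset protocol.

```python
import numpy as np

def synchronous_spectrum(eem, ex_wavelengths, em_wavelengths, delta_lambda):
    """I_SFS(lambda_ex) = EEM(lambda_ex, lambda_ex + delta_lambda).
    eem has shape (len(ex_wavelengths), len(em_wavelengths)); wavelength axes are ascending."""
    sfs = np.full(len(ex_wavelengths), np.nan)
    for i, ex in enumerate(ex_wavelengths):
        em_target = ex + delta_lambda
        if em_wavelengths[0] <= em_target <= em_wavelengths[-1]:
            sfs[i] = np.interp(em_target, em_wavelengths, eem[i, :])  # interpolate along emission axis
    return sfs

def best_offset(eem, ex_wavelengths, em_wavelengths, offsets):
    """Pick the offset whose synchronous spectrum carries the most signal (placeholder criterion)."""
    scores = [np.nansum(synchronous_spectrum(eem, ex_wavelengths, em_wavelengths, d)) for d in offsets]
    return offsets[int(np.argmax(scores))]
```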
Let T1 and T2 be two rooted trees with an equal number of leaves. The leaves are labeled, and the labeling of the leaves in T2 is a permutation of those in T1. Nodes are associated with weights, such that the weight of a node u, denoted by W(u), is more than the weight of its parent. A node x ∈ T1 and a node y ∈ T2 are induced iff their subtrees have at least one common leaf label. A heaviest induced ancestor query HIA(u1, u2), with input nodes u1 ∈ T1 and u2 ∈ T2, asks to output the pair (u1*, u2*) of induced nodes with the highest combined weight W(u1*) + W(u2*), such that u1* is an ancestor of u1 and u2* is an ancestor of u2. This is a useful primitive in several text processing applications. Gagie et al. (Proceedings of the 25th Canadian Conference on Computational Geometry, CCCG 2013, Waterloo, Ontario, Canada, 2013) introduced this problem and proposed three data structures with the following space-time trade-offs: (i) O(n log^2 n) space and O(log n log log n) query time, (ii) O(n log n) space and O(log^2 n) query time, and (iii) O(n) space and O(log^(3+ε) n) query time. Here n is the number of nodes in both trees combined and ε > 0 is an arbitrarily small constant. We present two new data structures with better space-time trade-offs: (i) O(n log n) space and O(log n log log n) query time, and (ii) O(n) space and O(log^2 n / log log n) query time. Additionally, we present new applications of these results.
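To make the query semantics concrete, here is a naive reference implementation in plain Python. It assumes trees are given as parent maps, treats a node as its own ancestor, and simply enumerates ancestor pairs, keeping the heaviest induced one; the data structures above answer the same query in polylogarithmic time.

```python
class Tree:
    """Rooted tree given by parent pointers; parent[root] is None; weight[child] > weight[parent]."""
    def __init__(self, parent, weight, leaf_label):
        self.parent, self.weight = parent, weight
        # leaf_labels[v] = set of leaf labels occurring in the subtree of v
        self.leaf_labels = {v: set() for v in parent}
        for leaf, lab in leaf_label.items():
            v = leaf
            while v is not None:
                self.leaf_labels[v].add(lab)
                v = parent[v]

    def ancestors(self, v):
        """Yield v together with all of its ancestors up to the root."""
        while v is not None:
            yield v
            v = self.parent[v]

def hia(t1, u1, t2, u2):
    """Brute-force HIA(u1, u2): heaviest pair of induced ancestors (a1, a2) of (u1, u2)."""
    best = None
    for a1 in t1.ancestors(u1):
        for a2 in t2.ancestors(u2):
            if t1.leaf_labels[a1] & t2.leaf_labels[a2]:        # induced: share a leaf label
                w = t1.weight[a1] + t2.weight[a2]
                if best is None or w > best[0]:
                    best = (w, a1, a2)
    return None if best is None else (best[1], best[2])
```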
TrustChain is capable of creating trusted transactions among strangers without central control. This enables new areas of blockchain use with a focus on building trust between individuals. Our innovative approach offers scalability, openness and Sybil-resistance while replacing proof-of-work with a mechanism to establish the validity and integrity of transactions.
TrustChain is a permissionless, tamper-proof data structure for storing the transaction records of agents. We create an immutable chain of temporally ordered interactions for each agent. The structure is inherently parallel, and every agent creates its own genesis block. TrustChain includes a novel Sybil-resistant algorithm named NetFlow to determine the trustworthiness of agents in an online community. NetFlow ensures that agents who take resources from the community also contribute back. We demonstrate that irrefutable historical transaction records offer security and seamless scalability without requiring global consensus. Experimentation shows that the transaction throughput of TrustChain surpasses that of traditional blockchain architectures like Bitcoin. Using data extracted from a live network, we show that TrustChain is informative enough to identify freeriders, so that they can be refused service.
•A tamper-proof, scalable and blockchain-based data structure (TrustChain).
•A Sybil-resistant model to determine trustworthiness (NetFlow).
•A public experiment which addresses freeriding in online communities.
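The per-agent chain structure described above can be sketched as follows, assuming each transaction appends one block to each party's chain and each block links back to the previous block of both participants. Field names are illustrative; signatures, half-block validation, and the NetFlow accounting are omitted.

```python
import hashlib
import json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

class Agent:
    """Each agent maintains its own chain, starting from its own genesis block."""
    def __init__(self, name):
        self.name = name
        self.chain = [{"agent": name, "seq": 0, "prev_own": None,
                       "prev_other": None, "counterparty": None, "payload": "genesis"}]

    def last_hash(self):
        return block_hash(self.chain[-1])

    def record_transaction(self, other, payload):
        """Append one block per participant; both blocks cross-link the two chains."""
        prev = {self.name: self.last_hash(), other.name: other.last_hash()}
        for me, peer in ((self, other), (other, self)):
            me.chain.append({
                "agent": me.name,
                "seq": len(me.chain),
                "prev_own": prev[me.name],      # link within my own chain
                "prev_other": prev[peer.name],  # cross-link into the counterparty's chain
                "counterparty": peer.name,
                "payload": payload,
            })

# Usage: two agents transact without any global ledger or consensus round.
alice, bob = Agent("alice"), Agent("bob")
alice.record_transaction(bob, {"upload_MB": 120, "download_MB": 30})
```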
The use of data fusion methodologies has increased at the same pace as the capability of modern analytical laboratories to measure samples using multiple sources. Almost all data fusion strategies can be grouped into three levels; they fuse the data differently, with the sole aim of obtaining a better response (qualitative or quantitative) than that obtained by the instruments individually. One of the key requirements for data fusion methodologies to succeed is an understanding of the data structure obtained from a particular instrument. This point is not treated exhaustively in the literature on data fusion, which sometimes pays too much attention to the algorithms instead. This manuscript explains data fusion starting from the structure of the different data obtained by different analytical platforms. Special attention is given to the nature of the data and the relationships between the samples and the variables, as well as within the variables.
•Structure of the data generated by analytical platforms.
•Discussion about the definitions of typical terms encompassing the data fusion context.
•Perspective of the major strategies in data fusion.
•One example to illustrate the multilevel fusion process in detail.
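To make the three-level grouping concrete, the sketch below contrasts low-level fusion (concatenating preprocessed raw blocks variable-wise) with mid-level fusion (concatenating features extracted per block); high-level fusion would instead combine the decisions of models built on each block separately. The data, block sizes, and choice of PCA scores as features are hypothetical illustrations, not the manuscript's example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def low_level_fusion(blocks):
    """Concatenate autoscaled data blocks variable-wise (samples stay as rows)."""
    return np.hstack([StandardScaler().fit_transform(b) for b in blocks])

def mid_level_fusion(blocks, n_components=3):
    """Extract features (here: PCA scores) from each block, then concatenate the features."""
    return np.hstack([PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(b)) for b in blocks])

# Hypothetical example: 20 samples measured on two platforms with different variable counts.
rng = np.random.default_rng(0)
X_a, X_b = rng.normal(size=(20, 100)), rng.normal(size=(20, 500))
X_low = low_level_fusion([X_a, X_b])   # shape (20, 600): all variables side by side
X_mid = mid_level_fusion([X_a, X_b])   # shape (20, 6): a few features per block
```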
•Fast assembly of large and complex metagenomic datasets up to hundreds of Gbp.
•Integration of advanced assembly practices gives better assembly quality.
•CPU-based algorithms eliminate the GPU dependency while being more memory-efficient and faster.
The study of metagenomics has benefited greatly from low-cost and high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume excessive computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note, Li et al. (2015)), the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base-pairs (Gbp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve assembly quality, as well as how to speed up the software using better CPU-based algorithms (instead of the GPU).
In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then we present the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, when compared with v0.1, shows a significant improvement, namely a 36% increase in assembly size and a 23% increase in N50. More interestingly, MEGAHIT v1.0 is no slower than before, even when running with the extra modules. This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 hours, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. With its smaller memory footprint, MEGAHIT v1.0 can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at https://hku-bal.github.io/megabox.
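For readers unfamiliar with the underlying graph, the toy sketch below builds the ordinary de Bruijn graph that an assembler traverses, using hypothetical reads and plain Python dictionaries; MEGAHIT's contribution is to construct and store the same graph in succinct form (the SdBG) so that hundreds-of-Gbp inputs fit in the memory of a single server.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Naive de Bruijn graph: nodes are (k-1)-mers, edges are k-mers observed in the reads.
    MEGAHIT encodes the same graph succinctly instead of with explicit hash maps."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1   # prefix (k-1)-mer -> suffix (k-1)-mer
    return edges

# Toy input; real metagenomic datasets are many orders of magnitude larger.
reads = ["ACGTACGT", "CGTACGTT"]
for (u, v), count in sorted(de_bruijn_graph(reads, k=4).items()):
    print(u, "->", v, "x", count)
```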
This paper presents a solution for building and implementing data processing models and experimentally evaluates new possibilities for improving ensemble methods based on multilevel data processing models. This study proposes a model to reduce the cost of retraining models when data properties change. The research objective is to improve the quality indicators of machine learning models when solving classification problems. The novelty is a method that uses a multilevel architecture of data processing models to determine the current data properties in segments at different levels and to assign the algorithms with the best quality indicators. This method differs from known ones by using several model levels that analyze data properties and assign the best models to individual segments of the data and training. The improvement consists of using unsupervised clustering of data samples; the resulting clusters serve as separate subsamples to which the best machine learning models and algorithms are assigned. Experimental values of quality indicators were obtained for different classifiers on the whole sample and on different segments. The findings show that unsupervised clustering using multilevel models can significantly improve the quality indicators of “weak” classifiers. The quality indicators of individual classifiers improve as the number of data clusters is increased up to a certain threshold. The results are applicable to classification when developing machine learning models and methods. The proposed method improved classification quality indicators by 2–9% due to segmentation and the assignment of models with the best quality indicators to individual segments.
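A minimal sketch of the segment-and-assign idea using scikit-learn is shown below. The clusterer, the candidate classifiers, and the cross-validated accuracy criterion are illustrative assumptions rather than the paper's exact configuration, and the sketch assumes each segment contains enough samples of every class for cross-validation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def fit_segmented(X, y, n_clusters=4):
    """Cluster the sample without labels, then assign to each segment the candidate
    classifier with the best cross-validated accuracy on that segment."""
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    candidates = [LogisticRegression(max_iter=1000), GaussianNB(), DecisionTreeClassifier()]
    models = {}
    for c in range(n_clusters):
        idx = clusterer.labels_ == c
        Xc, yc = X[idx], y[idx]
        scores = [cross_val_score(m, Xc, yc, cv=3).mean() for m in candidates]
        models[c] = clone(candidates[int(np.argmax(scores))]).fit(Xc, yc)  # fresh copy per segment
    return clusterer, models

def predict_segmented(clusterer, models, X):
    """Route each sample to its segment's classifier."""
    labels = clusterer.predict(X)
    return np.array([models[c].predict(x.reshape(1, -1))[0] for c, x in zip(labels, X)])
```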