Deep learning based techniques have recently been used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise's relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of contextual information present in the tuple. We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We make four major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Second, we propose how to derive sentences from such a graph that effectively "describe" the similarity across elements (tokens, attributes, rows) in the two datasets. The embeddings are learned based on such sentences. Third, we propose effective optimizations to improve the quality of the learned embeddings and the performance of integration tasks. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform an extensive set of experiments validating them against multiple baseline methods. Our experiments show that our framework, EmbDI, produces meaningful results for data integration tasks such as schema matching and entity resolution in both supervised and unsupervised settings.
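The graph-to-sentence idea described in this abstract can be illustrated with a minimal sketch: represent tokens, attributes, and row ids as nodes of a heterogeneous graph and generate random walks over it, which then serve as "sentences" for a word2vec-style trainer. The toy graph and node names below are hypothetical, not EmbDI's actual data structures or code.

```python
import random

# Hypothetical toy data: two tuples sharing the token "berlin".
# Node prefixes: t# = token, a# = attribute, r# = row id.
graph = {
    "r1": ["t#berlin", "t#germany"],
    "r2": ["t#berlin", "t#3.6M"],
    "t#berlin": ["r1", "r2", "a#city"],
    "t#germany": ["r1", "a#country"],
    "t#3.6M": ["r2", "a#population"],
    "a#city": ["t#berlin"],
    "a#country": ["t#germany"],
    "a#population": ["t#3.6M"],
}

def random_walk(graph, start, length, rng):
    """Generate one 'sentence' by walking the heterogeneous graph."""
    node, walk = start, [start]
    for _ in range(length - 1):
        node = rng.choice(graph[node])
        walk.append(node)
    return walk

rng = random.Random(0)
# Two walks per node; each walk is a training sentence.
sentences = [random_walk(graph, n, 5, rng) for n in graph for _ in range(2)]
print(len(sentences), sentences[0])
```

These sentences would then be fed to an off-the-shelf skip-gram trainer (e.g. gensim's Word2Vec) so that tokens co-occurring through shared rows or attributes end up with nearby embeddings.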
A wealth of single-cell protocols makes it possible to characterize different molecular layers at unprecedented resolution. Integrating the resulting multimodal single-cell data to find cell-to-cell correspondences remains a challenge. We argue that data integration needs to happen at a meaningful biological level of abstraction and that it is necessary to consider the inherent discrepancies between modalities to strike a balance between biological discovery and noise removal. A survey of current methods reveals that a distinction between technical and biological origins of presumed unwanted variation between datasets is not yet commonly considered. The increasing availability of paired multimodal data will aid the development of improved methods by providing a ground truth on cell-to-cell matches.
Identifying cell-to-cell correspondences between unpaired datasets from different single cell protocols promises to provide a more comprehensive view of cellular states. Integration of unpaired data from multiple modalities is more complicated than single-omics integration due to a lack of feature correspondence across modalities and ground truth information about biological differences between modalities. Retention of biological variation during multi-omic data integration has been insufficiently addressed to date, but is essential to leverage complementary information from different omics layers. Ground truth data can now be provided by new paired multi-omics assays. This will inform robust associations between features of different modalities and reveal modality-specific biological patterns that may also help to improve methods for multimodal integration of unpaired data.
Wound healing is a dynamic process over temporal and spatial scales. Key to repair outcomes are fibroblasts, yet how they modulate healing across time and in different wound regions remains incompletely understood. By integrating single-cell RNA-sequencing datasets of mouse skin and wounds, we infer that fibroblasts are the most transcriptionally dynamic skin-resident cells, evolving during postnatal skin maturation, and rapidly after injury towards distinct late scar states. We show that transcriptional dynamics in fibroblasts are largely driven by genes encoding extracellular matrix and signaling factors. Lineage trajectory inference and spatial gene mapping reveal that Prg4-expressing fibroblasts transiently emerge along early wound edges. Within days, they become replaced by long-lasting and likely non-interconverting fibroblast populations, including Col25a1-expressing and Pamr1-expressing fibroblasts that occupy subepidermal and deep scar regions, respectively, where they engage in reciprocal signaling with immune cells. Signaling inference shows that fibroblast-immune crosstalk repeatedly uses some signaling pathways across wound healing time, while use of other signaling pathways is time- and space-limited. Collectively, we uncovered high transcriptional plasticity of wound fibroblasts, with early states transiently forming distinct micro-niches along wound edges and in the fascia, followed by stable states that stratify scar tissue into molecularly dissimilar upper and lower layers.
The huge values created by Big Data and the recent advances in cloud computing have been driving data from different sources into cloud repositories for comprehensive query services. However, cloud-based data fusion makes it challenging to verify whether an untrusted server faithfully integrates data and executes queries. This is even harder for range-aggregate queries that apply aggregate operations on data within given ranges. In this article, we propose a query authentication scheme, named MARS, enabling a user to efficiently authenticate range-aggregate queries on multi-source data. Specifically, MARS creates a VG-tree by subtly integrating an Expressive Set Accumulator into a multi-dimensional G-tree while signing the root digest with a multi-source aggregate signature scheme. Compared with previous solutions, MARS has the following merits: (1) Practicality. Instead of treating range and aggregate queries separately, the user can directly verify the statistical result of selected data. (2) Scalability. Instead of authenticating the individual result from each source, the user can perform an aggregative validation on the integrated result from multiple sources. The experimental results demonstrate the effectiveness of MARS. For large-scale data fusion, the user-side verification time increases by only 103 ms as the number of data sources increases fivefold.
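The VG-tree and Expressive Set Accumulator are specific to this paper, but the general idea of binding an aggregate value to a verifiable digest can be sketched with a simpler, classical structure: a Merkle-sum tree, where each internal node commits to both the hashes and the sums of its children, so a tampered value changes the signed root. This is an illustrative analogy, not the paper's construction.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(values):
    """Build a Merkle-sum tree: each node is a (hash, subtree_sum) pair.
    The root hash commits to every leaf value AND every partial sum."""
    level = [(h(str(v).encode()), v) for v in values]
    tree = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            l = level[i]
            r = level[i + 1] if i + 1 < len(level) else l
            s = l[1] + r[1]
            nxt.append((h(l[0] + r[0] + str(s).encode()), s))
        level = nxt
        tree.append(level)
    return tree

values = [3, 1, 4, 1, 5, 9, 2, 6]          # hypothetical multi-source data
tree = build(values)
root_hash, root_sum = tree[-1][0]
print(root_sum)  # the aggregate the server claims, bound to root_hash
```

In a full scheme the data owner would sign `root_hash` once; the server then returns the aggregate plus the sibling (hash, sum) pairs along the relevant paths, and the user recomputes the root to validate the result.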
Nowadays, there is a strong demand for inspection systems integrating both high sensitivity under various testing conditions and advanced processing allowing automatic identification of the examined object state and detection of threats. This paper presents the possibility of utilizing a magnetic multi-sensor matrix transducer for characterization of defective areas in steel elements and a deep learning based algorithm for integration of data and final identification of the object state. The transducer allows sensing of a magnetic vector in a single location in different directions. Thus, it enables detecting and characterizing any material changes that affect magnetic properties regardless of their orientation in reference to the scanning direction. To assess the general application capability of the system, steel elements with rectangular-shaped artificial defects were used. First, a database was constructed considering numerical and measurement results. A finite element method was used to run a simulation process and provide transducer signal patterns for different defect arrangements. Next, the algorithm integrating responses of the transducer collected in a single position was applied, and a convolutional neural network was used for implementation of the material state evaluation model. Then, validation of the obtained model was carried out. In this paper, the procedure for updating the evaluated local state, referring to the neighboring area results, is presented. Finally, the results and future perspective are discussed.
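The integration step described here (multi-directional magnetic responses fed jointly into a CNN) can be sketched as treating each sensing direction as an input channel and convolving a single filter over all channels at once. The scan size, data, and kernel below are hypothetical placeholders, not the paper's measurement setup or network.

```python
import numpy as np

# Hypothetical scan: the matrix transducer senses the magnetic vector
# (Bx, By, Bz) at each position; the three directions become channels.
rng = np.random.default_rng(0)
bx, by, bz = (rng.normal(size=(16, 16)) for _ in range(3))
sample = np.stack([bx, by, bz])          # shape (3, 16, 16): C x H x W

def conv2d(x, k):
    """Naive valid-mode 2D convolution, summing over input channels,
    as the first layer of a CNN would do."""
    c, h, w = x.shape
    kh, kw = k.shape[1], k.shape[2]
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k)
    return out

kernel = rng.normal(size=(3, 3, 3))      # one filter spanning all 3 directions
feature_map = conv2d(sample, kernel)
print(feature_map.shape)                 # (14, 14)
```

Because the filter spans all channels, a defect signature is picked up whatever its orientation relative to the scanning direction, which is the motivation for fusing the directional responses before classification.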
We present the ggtreeExtra package for visualizing heterogeneous data with a phylogenetic tree in a circular or rectangular layout (https://www.bioconductor.org/packages/ggtreeExtra). The package supports more data types and visualization methods than other tools. It supports using the grammar of graphics syntax to present data on a tree with richly annotated layers and allows evolutionary statistics inferred by commonly used software to be integrated and visualized with external data. GgtreeExtra is a universal tool for tree data visualization. It extends the applications of the phylogenetic tree in different disciplines by making more domain-specific data available for visualization and interpretation in the evolutionary context.
With the advancement of sequencing methodologies, the acquisition of vast amounts of multi-omics data presents a significant opportunity for comprehending the intricate biological mechanisms underlying diseases and achieving precise diagnosis and treatment for complex disorders. However, as diverse omics data are integrated, extracting sample-specific features within each omics modality and exploring potential correlations among different modalities while avoiding mutual interference becomes a critical challenge in multi-omics data integration research. In this study, we propose a framework, MOSGAT, that unites specificity-aware GATs and cross-modal attention to integrate different omics data. To be specific, we devise Graph Attention Networks (GATs) tailored to each omics modality to perform feature extraction on samples. Additionally, an adaptive confidence attention weighting technique is incorporated to enhance the confidence in the extracted features. Finally, a cross-modal attention mechanism is devised based on multi-head self-attention, thoroughly uncovering potential correlations between different omics data. Extensive experiments were conducted on four publicly available medical datasets, highlighting the superiority of the proposed framework when compared to state-of-the-art methodologies, particularly in the realm of classification tasks. The experimental results underscore MOSGAT's effectiveness in extracting features and exploring potential inter-omics associations.
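The cross-modal attention mechanism mentioned in this abstract builds on standard scaled dot-product attention, where features from one omics modality act as queries against another modality's keys and values. A single-head sketch is below; the modality names, dimensions, and random features are hypothetical, and MOSGAT's actual multi-head, GAT-fed version is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, d):
    """Scaled dot-product attention: one omics modality (queries)
    attends to another (keys, also used as values here).
    Shapes: (n_samples, d)."""
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ keys, weights

rng = np.random.default_rng(1)
mrna = rng.normal(size=(5, 8))   # hypothetical mRNA sample features
meth = rng.normal(size=(5, 8))   # hypothetical methylation features
fused, w = cross_modal_attention(mrna, meth, d=8)
print(fused.shape)               # (5, 8)
```

In the full framework each modality's features would first come from its own GAT, and several such attention heads would be concatenated, letting the model surface correlated sample patterns across omics layers without forcing the modalities into a shared feature space up front.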
Soil carbon has been measured for over a century in applications ranging from understanding biogeochemical processes in natural ecosystems to quantifying the productivity and health of managed systems. Consolidating diverse soil carbon datasets is increasingly important to maximize their value, particularly with growing anthropogenic and climate change pressures. In this progress report, we describe recent advances in soil carbon data led by the International Soil Carbon Network and other networks. We highlight priority areas of research requiring soil carbon data, including (a) quantifying boreal, arctic and wetland carbon stocks, (b) understanding the timescales of soil carbon persistence using radiocarbon and chronosequence studies, (c) synthesizing long-term and experimental data to inform carbon stock vulnerability to global change, (d) quantifying root influences on soil carbon and (e) identifying gaps in model–data integration. We also describe the landscape of soil datasets currently available, highlighting their strengths, weaknesses and synergies. Now more than ever, integrated soil data are needed to inform climate mitigation, land management and agricultural practices. This report will aid new data users in navigating various soil databases and encourage scientists to make their measurements publicly available and to join forces to find soil-related solutions.
As data sets of related studies become more easily accessible, combining data sets of similar studies is often undertaken in practice to achieve a larger sample size and higher power. A major challenge arising from data integration pertains to data heterogeneity in terms of study population, study design, or study coordination. Ignoring such heterogeneity in data analysis may result in biased estimation and misleading inference. Traditional remedies for data heterogeneity include the use of interactions and random effects, which are often inadequate for achieving desirable statistical power or providing a meaningful interpretation, especially when a large number of smaller data sets are combined. In this paper, we propose a regularized fusion method that allows us to identify and merge inter-study homogeneous parameter clusters in regression analysis, without resorting to hypothesis testing. Using the fused lasso, we establish a computationally efficient procedure to deal with large-scale integrated data. Incorporating the estimated parameter ordering in the fused lasso improves computing speed with no loss of statistical power. We conduct extensive simulation studies and provide an application example to demonstrate the performance of the new method with a comparison to the conventional methods.
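The fused lasso idea behind this method can be made concrete with its objective: a least-squares loss plus a penalty on differences of adjacent coefficients, so that study-specific parameters with nearly equal values are pulled together and merged. The toy data below are illustrative only; the paper's ordering scheme and solver are not reproduced.

```python
import numpy as np

def fused_lasso_objective(y, X, beta, lam):
    """||y - X beta||^2 + lam * sum_j |beta_{j+1} - beta_j|.
    Ordering the parameters first makes adjacent coefficients the
    natural candidates for fusion into homogeneous clusters."""
    rss = np.sum((y - X @ beta) ** 2)
    fusion = lam * np.sum(np.abs(np.diff(beta)))
    return rss + fusion

# Toy check: four study-specific coefficients, two homogeneous pairs.
X = np.eye(4)
y = np.array([1.0, 1.0, 3.0, 3.0])
beta_merged = np.array([1.0, 1.0, 3.0, 3.0])   # pairs fully fused
beta_split = np.array([1.0, 1.2, 2.8, 3.0])    # pairs left apart
print(fused_lasso_objective(y, X, beta_merged, lam=1.0))  # 2.0
print(fused_lasso_objective(y, X, beta_split, lam=1.0))
```

The fused solution attains the lower objective here, which is the mechanism by which the method merges inter-study homogeneous parameters without any hypothesis test.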