•A novel method for modeling multiple data streams (SIMF) was developed.
•Implemented as a real-time recommender system based on collective matrix factorization.
•Results on synthetic and real streams confirm that data fusion improves predictions.
•The cold-start problem can be alleviated using SIMF and additional data streams.
Recommender systems are large-scale machine learning and knowledge discovery tools aimed at providing personalized recommendations to customers based on their preferences and needs. They need to handle large quantities of diverse and very sparse data in a matter of seconds. Matrix factorization techniques have proven useful and reliable for implementing recommender systems, while the data sparsity problem can be indirectly alleviated by considering multiple heterogeneous data sources. Furthermore, data fusion can result in higher predictive accuracy. For real-world applications, e.g., those with continuous user feedback, incrementally updating recommender systems upon multiple data streams remains a crucial and only partially solved problem. This paper presents one way of fusing multiple data streams through matrix factorization. Our proposed method (SIMF) models heterogeneous and asynchronous data streams and provides predictions in real time. As a result of incremental updating, the proposed method successfully adapts to changes in data concepts, while data fusion improves prediction accuracy and reduces the effects of the cold-start problem. Using the proposed methodology, we develop a streaming algorithm and show how prediction accuracy can be substantially increased by considering multiple data sources, while at the same time the negative effects of the cold-start problem can be greatly diminished. Evaluations on a large-scale real-life problem (Yelp recommendations) confirm these claims: we present a highly scalable streaming recommender system that adapts to new concepts in data and provides accurate predictions (compared to other matrix factorization techniques) in a very sparse problem domain.
Apart from the recommender system proposed in this work, the versatility of matrix factorization could further allow the presented methodology to be adapted to several other machine learning problems, such as dimensionality reduction, clustering, and classification.
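To make the incremental updating concrete, here is a minimal sketch of a single stochastic-gradient step on a streamed rating, in the spirit of streaming matrix factorization. This is an illustration only, not the SIMF algorithm itself; the function name, learning rate, and regularization constant are assumptions.

```python
import numpy as np

def stream_update(P, Q, u, i, r, lr=0.1, reg=0.01):
    """One stochastic-gradient step for a single streamed rating r = R[u, i]."""
    pu, qi = P[u].copy(), Q[i].copy()
    err = r - pu @ qi                    # prediction error on the new event
    P[u] += lr * (err * qi - reg * pu)   # adapt user factors
    Q[i] += lr * (err * pu - reg * qi)   # adapt item factors
    return err

rng = np.random.default_rng(0)
k = 4
P = rng.normal(scale=0.1, size=(10, k))  # latent user factors
Q = rng.normal(scale=0.1, size=(20, k))  # latent item factors
errors = [stream_update(P, Q, 3, 7, 1.0) for _ in range(500)]
```

Repeating the update as events arrive drives the prediction `P[u] @ Q[i]` toward the observed value, which is how a factor model can track changing data concepts without retraining from scratch.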
Established bioprocess monitoring is based on quick and reliable methods, including cell count and viability measurement, extracellular metabolite measurement, and the measurement of physicochemical qualities of the cultivation medium. These methods are sufficient for monitoring process performance, but rarely give insight into the actual physiological states of the cell culture. However, understanding the latter is essential for optimizing bioprocess development. Our study used LC‐MS metabolomics as a tool for additional resolution in bioprocess monitoring and was designed at three bioreactor scales (10 L, 100 L, and 1,000 L) to gain insight into the basal metabolic states of the Chinese hamster ovary (CHO) cell culture during fed‐batch cultivation. Metabolites characteristic of the four growth stages (early and late exponential phase, stationary phase, and the phase of decline) were identified by multivariate analysis. Enriched metabolic pathways were then established for each growth phase using the CHO metabolic network model. Biomass generation and nucleotide synthesis were enriched in the early exponential phase, followed by increased protein production and imbalanced glutathione metabolism in the late exponential phase. Glycolysis became downregulated in the stationary phase while amino‐acid metabolism increased. The phase of culture decline resulted in a rise of oxidized glutathione and fatty acid concentrations. Intracellular metabolic profiles of the CHO fed‐batch culture were also shown to be consistent across scales, demonstrating that metabolomic profiling is an informative method for gaining physiological insight into cell culture states during a bioprocess regardless of scale.
Motivation: We learn more effectively through experience and reflection than through passive reception of information. Bioinformatics offers an excellent opportunity for project-based learning. Molecular data are abundant and accessible in open repositories, and important concepts in biology can be rediscovered by reanalyzing the data.
Results: In the manuscript, we report on five hands-on assignments we designed for master’s computer science students to train them in bioinformatics for genomics. These assignments are the cornerstones of our introductory bioinformatics course and are centered around the study of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). They assume no prior knowledge of molecular biology but do require programming skills. Through these assignments, students learn about genomes and genes, discover their composition and function, relate SARS-CoV-2 to other viruses, and learn about the body’s response to infection. Student evaluation of the assignments confirms their usefulness and value, their appropriate mastery-level difficulty, and their interesting and motivating storyline.
Availability and Implementation: The course materials are freely available on GitHub at https://github.com/IB-ULFRI.
•Improved matrix completion on binary, positive-only data for recommender systems.
•Filtering for the number of intermediary objects improves data chaining.
•Evaluation on real and synthetic data with varying degrees of transitivity.
•Faster computation time using chain matrix multiplication.
Recommender systems typically work on user-item preference relation data. Recommendations can be improved by including side relations that are indirectly linked to the target relation. Data fusion by matrix co-factorization is one such method that can integrate heterogeneous representations of objects of different types. A shared latent matrix factor model is inferred, which is then used to approximate and thus predict multiple data matrices at the same time. The factor model can also be used to infer indirect relations among objects, by multiplying the corresponding factor matrices on a chain of relations that link those objects (i.e., relation chaining). We show that recommendation in binary positive-only data can be improved by relation chaining. We can filter out less reliable relations by using the number of supporting intermediate paths as an additional parameter in a multi-objective Pareto optimization. Our method outperforms other state-of-the-art chaining methods. To speed up the computation of relation chaining, we propose an approach based on chain matrix multiplication.
To evaluate our method, we have created synthetic data on transitive relations among objects, with varying degrees of noise. Results on synthetic data show that chaining indeed works on chains containing transitive relations. Results on three real datasets show that the inclusion of the number of intermediate paths improves relation chaining predictions. Compared to the full data matrix multiplication approach, the proposed relation chaining method achieves a two-fold speed-up.
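The chain matrix multiplication idea can be sketched as follows: in a co-factorization model, an indirect user–tag relation is predicted by multiplying the factor matrices along the chain, and `np.linalg.multi_dot` picks the cheapest multiplication order. The factor names and shapes (U, S1, V, S2, T) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_tags, k = 500, 400, 300, 10

# Hypothetical shared factor model: R_ui ≈ U S1 V^T and R_it ≈ V S2 T^T.
U = rng.random((n_users, k)); S1 = rng.random((k, k))
V = rng.random((n_items, k)); S2 = rng.random((k, k))
T = rng.random((n_tags, k))

# Chained relation users -> items -> tags: multiply the factor chain.
# multi_dot solves the classic chain matrix multiplication ordering
# problem, so small k x k intermediates are formed first instead of
# materializing a full n_users x n_items product.
R_ut = np.linalg.multi_dot([U, S1, V.T, V, S2, T.T])
```

Choosing the multiplication order is exactly where the speed-up over naive left-to-right multiplication of the reconstructed data matrices comes from.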
Matrix factorization methods are linear models, with limited capability to model complex relations. In our work, we use the tropical semiring to introduce non-linearity into matrix factorization models. We propose a method called Sparse Tropical Matrix Factorization (STMF) for the estimation of missing (unknown) values in sparse data.
We evaluate the efficiency of the STMF method on both synthetic data and biological data in the form of gene expression measurements downloaded from The Cancer Genome Atlas (TCGA) database. Tests on unique synthetic data showed that the STMF approximation achieves a higher correlation than non-negative matrix factorization (NMF), which is unable to recover patterns effectively. On real data, STMF outperforms NMF on six out of nine gene expression datasets. While NMF assumes a normal distribution and tends toward the mean value, STMF can better fit extreme values and distributions.
STMF is the first method to use the tropical semiring on sparse data. We show that in certain cases semirings are useful because they capture structure that is different from, and simpler to interpret than, that of standard linear algebra.
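As an illustration of the non-linearity the tropical semiring introduces: in the max-plus semiring, addition is replaced by max and multiplication by +. The helper below is a toy sketch of the tropical matrix product underlying STMF-style models, not the STMF fitting procedure itself.

```python
import numpy as np

def maxplus_dot(A, B):
    """Tropical (max-plus) matrix product: (A ⊗ B)_ij = max_k (A_ik + B_kj)."""
    # Broadcast A as (m, k, 1) against B as (1, k, n), reduce over k.
    return (A[:, :, None] + B[None, :, :]).max(axis=1)

# A tropically rank-1 matrix is exactly reproduced by its two factors:
u = np.array([[0.0], [2.0], [5.0]])   # 3 x 1 factor
v = np.array([[1.0, 3.0, 0.0]])       # 1 x 3 factor
X = maxplus_dot(u, v)                  # X_ij = u_i + v_j
```

Because max selects a single dominant term instead of summing all of them, tropical models can follow extreme values that a sum-based linear model would smooth toward the mean.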
The tropical semiring has proven successful in several research areas, including optimal control, bioinformatics, discrete event systems, and decision problems. Previous studies have applied a matrix two-factorization algorithm based on the tropical semiring to investigate bipartite and tripartite networks. Tri-factorization algorithms based on standard linear algebra are used to solve tasks such as data fusion, co-clustering, matrix completion, community detection, and more. However, there is currently no tropical matrix tri-factorization approach that would allow for the analysis of multipartite networks with many parts. To address this, we propose the triFastSTMF algorithm, which performs tri-factorization over the tropical semiring. We applied it to analyze a four-partition network structure and recover the edge lengths of the network. We show that triFastSTMF performs similarly to Fast-NMTF in terms of approximation and prediction performance when fitted on the whole network. When trained on a specific subnetwork and used to predict the entire network, triFastSTMF outperforms Fast-NMTF, achieving an error several orders of magnitude smaller. The robustness of triFastSTMF is due to tropical operations, which are less prone to predicting large values than standard operations.
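To give a feel for how a tropical tri-product composes relations across a multipartite network, here is a toy example in the min-plus variant of the tropical semiring, where the tri-product of edge-length matrices yields shortest path lengths through a four-partite chain. The matrices are made up, and triFastSTMF's actual semiring and fitting procedure are not shown.

```python
import numpy as np

def minplus_dot(A, B):
    """Tropical (min-plus) matrix product: (A ⊗ B)_ij = min_k (A_ik + B_kj)."""
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

# Edge-length matrices of a four-partite path network P1 -> P2 -> P3 -> P4.
U = np.array([[1.0, 4.0],
              [2.0, 1.0]])   # P1 -> P2 edge lengths
S = np.array([[0.0, 3.0],
              [2.0, 0.0]])   # P2 -> P3 edge lengths
V = np.array([[5.0, 1.0],
              [1.0, 2.0]])   # P3 -> P4 edge lengths

# The tri-product gives, for each (i, j), the shortest P1 -> P4 path
# length through any pair of intermediate nodes in P2 and P3.
D = minplus_dot(minplus_dot(U, S), V)
```

This is the sense in which tropical factor models relate naturally to path and edge-length structure in networks.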
Studies of spliceosomal interactions are challenging due to their dynamic nature. Here we used spliceosome iCLIP, which immunoprecipitates SmB along with small nuclear ribonucleoprotein particles and ...auxiliary RNA binding proteins, to map spliceosome engagement with pre-messenger RNAs in human cell lines. This revealed seven peaks of spliceosomal crosslinking around branchpoints (BPs) and splice sites. We identified RNA binding proteins that crosslink to each peak, including known and candidate splicing factors. Moreover, we detected the use of over 40,000 BPs with strong sequence consensus and structural accessibility, which align well to nearby crosslinking peaks. We show how the position and strength of BPs affect the crosslinking patterns of spliceosomal factors, which bind more efficiently upstream of strong or proximally located BPs and downstream of weak or distally located BPs. These insights exemplify spliceosome iCLIP as a broadly applicable method for transcriptomic studies of splicing mechanisms.
Kernel methods provide a principled way for general data representations. Multiple kernel learning and kernel approximation are often treated as separate tasks, with considerable savings in time and memory expected if the two are performed simultaneously.
Our proposed Mklaren algorithm selectively approximates multiple kernel matrices in regression. It uses Incomplete Cholesky Decomposition and Least-angle regression (LAR) to select basis functions, achieving linear complexity both in the number of data points and kernels. Since it approximates kernel matrices rather than functions, it allows combining an arbitrary set of kernels. Compared to single kernel-based approximations, it selectively approximates different kernels in different regions of the input spaces.
The LAR criterion provides a robust selection of inducing points in noisy settings, and an accurate modelling of regression functions in continuous and discrete input spaces. Among general kernel matrix decompositions, Mklaren achieves minimal approximation rank required for performance comparable to using the exact kernel matrix, at a cost lower than 1% of required operations. Finally, we demonstrate the scalability and interpretability in settings with millions of data points and thousands of kernels.
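A core building block of the approach, the incomplete (pivoted) Cholesky decomposition of a kernel matrix, can be sketched as follows. This greedy variant pivots on the largest residual diagonal entry and omits Mklaren's LAR-based selection of inducing points; the function and parameter names are assumptions.

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    """Greedy pivoted incomplete Cholesky: K ≈ G @ G.T with G of shape (n, rank)."""
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).astype(float).copy()      # residual diagonal
    for j in range(rank):
        p = int(np.argmax(d))                # pivot with largest residual
        if d[p] <= tol:
            return G[:, :j]                  # residual exhausted early
        G[:, j] = (K[:, p] - G @ G[p]) / np.sqrt(d[p])
        d -= G[:, j] ** 2
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
# RBF kernel matrix on 50 points in 2-D.
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
G = incomplete_cholesky(K, rank=15)
err = np.linalg.norm(K - G @ G.T) / np.linalg.norm(K)
```

For a smooth kernel such as the RBF, the residual norm drops quickly with rank, which is why a low-rank G can stand in for the exact kernel matrix at a small fraction of the operations.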
•A novel synthetic data stream generator (GIDS) was developed.
•Used for evaluation of incremental recommender systems and data fusion algorithms.
•Resembles real datasets in terms of data properties and recommender performance.
•Tunable parameters allow for systematic generation of streams for various problems.
Recommender systems are essential tools in modern e-commerce, streaming services, search engines, social networks and many other areas, including the scientific community. However, a lack of publicly available data hinders the development and evaluation of recommender algorithms. To address this problem, we propose a Generator of Inter-dependent Data Streams (GIDS), capable of generating multiple temporal and inter-dependent synthetic datasets of relational data. The generator is able to simulate a collection of time-changing data streams, helping to effectively evaluate a variety of recommender systems, data fusion algorithms and incremental algorithms. The evaluation using recommender and data fusion algorithms showed that our generator can successfully mimic real datasets in terms of both statistical data properties and the achieved performance of recommender systems.
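As a toy illustration of how inter-dependent, time-changing streams might be produced (this is not the GIDS generator; all shapes, the drift rate, and the noise level are assumptions), two relations can share latent user factors that drift slowly over time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_tags, k, n_steps = 100, 80, 30, 5, 3

# Shared user factors link the user-item and user-tag relations,
# making the two generated streams inter-dependent.
U = rng.random((n_users, k))
V = rng.random((n_items, k))
T = rng.random((n_tags, k))

streams = []
for t in range(n_steps):
    U += 0.05 * rng.normal(size=U.shape)   # slow concept drift in preferences
    R_ui = U @ V.T + 0.1 * rng.normal(size=(n_users, n_items))
    R_ut = U @ T.T + 0.1 * rng.normal(size=(n_users, n_tags))
    streams.append((R_ui, R_ut))
```

Because both relations are driven by the same drifting U, such data can probe whether a fusion algorithm that exploits the side relation tracks the target relation better than one that ignores it.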
What kind of questions about human mobility can computational analysis help answer? How can the findings be translated into anthropology? We analyzed a publicly available data set of road traffic counters in Slovenia to answer these questions. The data revealed information on how a population drives, how it travels for tourism, which locations it prefers, what it does during the week and the weekend, and how its habits change during the year. We conducted the empirical analysis in two parts. First, we defined traffic profile deviations and designed computational methods to find them in a large data set. As shown in the paper, traffic counters hint at potential causes and effects in driving practices that we interpreted anthropologically. Second, we used hierarchical clustering to find groups of similar traffic counters as described by their daily profiles. Clustering revealed the main features of road traffic in Slovenia. Using the two quantitative approaches, we outlined the general properties of road traffic in the country and identified and explained the outliers. We show that quantitative data analysis only partially answers anthropological questions, but it can be a valuable tool for preliminary research. We conclude that open data are a useful component in an anthropological analysis and that quantitative discovery of small local events can help us pinpoint future fieldwork sites.
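The clustering step can be illustrated with a small synthetic sketch (toy profiles, not the Slovenian counter data): Ward-linkage hierarchical clustering separates bimodal "commuter" daily profiles, peaking at the morning and afternoon rush hours, from unimodal midday "tourist" profiles.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
hours = np.arange(24)

# Toy daily profiles: commuter counters peak at 8h and 16h; tourist
# counters show a single broad midday peak.
commuter = np.exp(-0.5 * (hours - 8) ** 2) + np.exp(-0.5 * (hours - 16) ** 2)
tourist = np.exp(-0.1 * (hours - 12) ** 2)
profiles = np.vstack(
    [commuter + 0.05 * rng.normal(size=24) for _ in range(5)]
    + [tourist + 0.05 * rng.normal(size=24) for _ in range(5)]
)

Z = linkage(profiles, method="ward")            # agglomerative hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two groups
```

Cutting the dendrogram at two clusters recovers the two profile types; on real counter data the dendrogram itself is what reveals the main traffic regimes.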