The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers.
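To make the weighted minimum error correction (wMEC) objective mentioned above concrete, the following is a minimal sketch that scores a bipartition of reads and finds the best one by exhaustive search. It is not WhatsHap's algorithm: the brute-force search is exponential in the number of reads and only meant for toy inputs. The data layout (one dict per read, mapping a SNP index to an allele and a confidence weight) and the function names are assumptions made for illustration, and the sketch does not force the two haplotypes to be complementary.

```python
from itertools import product


def wmec_cost(reads, assignment):
    """Weighted MEC cost of one bipartition of the reads.

    reads: list of dicts {snp_index: (allele, weight)} with allele in {0, 1}
           (hypothetical layout, chosen for this sketch).
    assignment: tuple with one 0/1 entry per read, naming its haplotype side.
    """
    total = 0
    for side in (0, 1):
        # support[snp] = [summed weight for allele 0, summed weight for allele 1]
        support = {}
        for read, read_side in zip(reads, assignment):
            if read_side != side:
                continue
            for snp, (allele, weight) in read.items():
                support.setdefault(snp, [0, 0])[allele] += weight
        # Correcting the lighter allele at each SNP makes this side conflict-free.
        total += sum(min(w0, w1) for w0, w1 in support.values())
    return total


def brute_force_wmec(reads):
    """Try every bipartition; feasible only for a handful of reads."""
    best = min(product((0, 1), repeat=len(reads)),
               key=lambda a: wmec_cost(reads, a))
    return best, wmec_cost(reads, best)


# Toy instance: the third read agrees with the first at SNP 0 and
# disagrees with it at SNP 1 with a low-confidence call.
toy_reads = [
    {0: (0, 10), 1: (0, 10)},
    {0: (1, 10), 1: (1, 10)},
    {0: (0, 10), 1: (1, 2)},
]
best_assignment, best_cost = brute_force_wmec(toy_reads)
```

In this toy instance the cheapest solution groups the first and third reads and corrects the single low-weight entry, which illustrates how weights let confident base calls outvote uncertain ones.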
The ecosystem services and natural capital of soils are often not recognised and generally not well understood. This paper addresses this issue by drawing on scientific understanding of soil formation, functioning and classification systems and building on current thinking on ecosystem services to develop a framework to classify and quantify soil natural capital and ecosystem services. The framework consists of five main interconnected components: (1) soil natural capital, characterised by standard soil properties well known to soil scientists; (2) the processes behind soil natural capital formation, maintenance and degradation; (3) drivers (anthropogenic and natural) of soil processes; (4) provisioning, regulating and cultural ecosystem services; and (5) human needs fulfilled by soil ecosystem services.
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored.
We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps .
gpps provides new insights into the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.
The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unlike that for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to more effectively deal with any possible future pandemic. Since the SARS-CoV-2 virus contains different variants, each of them having different mutations, performing any analysis on such data becomes a difficult task, given the size of the data. It is well known that much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence, the relatively short region which codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with the appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of different measures. We use the proposed approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of different known variants that are increasing at a very high rate throughout the world. We first use a k-mers based approach to generate a fixed-length feature vector representation of the spike sequences. We then show that we can efficiently and effectively cluster the spike sequences based on the different variants with the appropriate feature selection. Using a publicly available set of SARS-CoV-2 spike sequences, we perform clustering of these sequences using both hard and soft clustering methods and show that, with our feature selection methods, we can achieve higher F1 scores for the clusters and also better clustering quality metrics compared to baselines.
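As an illustration of how a k-mers based approach turns variable-length sequences into fixed-length vectors, the following minimal sketch counts overlapping k-mers over the 20 standard amino acids. The function name, the default k = 3, and the choice to skip k-mers containing non-standard characters are assumptions made for this sketch, not details taken from the paper.

```python
from itertools import product

# The 20 standard amino acids, in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return a count vector of length len(alphabet) ** k for one sequence."""
    # Map every possible k-mer to a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    # Slide a window of width k over the sequence, one residue at a time.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:  # skip windows with characters outside the alphabet
            vector[index[kmer]] += 1
    return vector
```

Every sequence maps to a vector of the same length (8000 entries for k = 3), regardless of how long the sequence is, which is what allows standard feature selection and clustering methods to be applied afterwards.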
Most tourism-related activities require energy directly in the form of fossil fuels or indirectly in the form of electricity often generated from petroleum, coal or gas. This consumption leads to the emission of greenhouse gases, mainly carbon dioxide. Tourism is not a traditional sector in the System of National Accounts and as a result no country possesses comprehensive national statistics on the energy demand or emissions specifically resulting from tourism. This paper suggests two approaches for accounting for carbon dioxide emissions from tourism: a bottom-up analysis involving industry and tourist analyses, and a top-down analysis using environmental accounting. Using the case study of New Zealand, we demonstrate that both approaches result in similar estimates of the degree to which tourism contributes to national carbon dioxide emissions. The bottom-up analysis provides detailed information on energy end-uses and the main drivers of carbon dioxide emissions. These results can be used for the development of targeted industry-based greenhouse gas reduction strategies. The top-down analysis allows assessment of tourism as a sector within the wider economy, for example with the purpose of comparing tourism’s eco-efficiency with other sectors, or the impact of macroeconomic instruments such as carbon charges.
The Global Biogeochemical Cycles (GBCs) are extremely important biosphere functions, critical to the maintenance of conditions necessary for all life. Importantly, perturbation of the GBCs has the potential to affect the structure and functioning of the Earth system. While biogeochemistry research to date has largely focused on ‘natural’ processes, human economic activities are increasingly recognised as integral components of the GBCs. In this paper we develop a novel systems model, the Environmental Social Accounting Matrix (ESAM), of coupled GBCs (explicitly covering Carbon, Nitrogen, Phosphorus and Sulphur) with a particular focus on the environment-economy interface. We illustrate diagrammatically the level at which the global economy, through its transformation of useful resources (i.e. raw materials) into residuals (i.e. wastes, pollutants, emissions), appropriates biogeochemical processes. Then through an application lens we discuss the ESAM’s potential applications and extensions. The ESAM represents one of only a few attempts to develop an integrated model of the Earth system, explicitly capturing the interaction in the element-based GBCs between both natural and human processes.
Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially for large datasets, due to the computation of the pairwise distances between data points. The motivation behind this study is to improve efficiency by leveraging an approximate method, which is based on random kitchen sinks (RKS). This approach provides a faster way to compute the kernel matrix. Using RKS significantly reduces the computational complexity of ISOMAP while still obtaining a meaningful low-dimensional representation of the data. We compare the performance of the approximate ISOMAP approach using RKS with the traditional t-SNE algorithm. The comparison involves computing the distance matrix using the original high-dimensional data and the low-dimensional data computed from both t-SNE and ISOMAP. The quality of the low-dimensional embeddings is measured using several metrics, including mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). Additionally, the runtime of each algorithm is recorded to assess its computational efficiency. The comparison is conducted on a set of protein sequences, used in many bioinformatics tasks. We use three different embedding methods based on k-mers, minimizers, and position weight matrix (PWM) to capture various aspects of the underlying structure and the relationships between the protein sequences. By comparing different embeddings and by evaluating the effectiveness of the approximate ISOMAP approach using RKS and comparing it against t-SNE, we provide insights on the efficacy of our proposed approach. Our goal is to retain the quality of the low-dimensional embeddings while improving the computational performance.
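The idea of using random kitchen sinks to approximate a kernel matrix can be sketched with random Fourier features. The example below assumes a Gaussian (RBF) kernel exp(-gamma * ||x - y||^2) and uses NumPy; the kernel choice, the function name, and the parameter defaults are assumptions for illustration and do not reproduce how the paper couples RKS with ISOMAP.

```python
import numpy as np


def rks_features(X, n_features=500, gamma=1.0, seed=0):
    """Map rows of X to random Fourier features Z so that Z @ Z.T
    approximates the RBF kernel matrix exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n_dims = X.shape[1]
    # Frequencies are drawn from the Fourier transform of the RBF kernel,
    # a Gaussian with variance 2 * gamma in every dimension.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_dims, n_features))
    # Random phase shifts, uniform on [0, 2 * pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def exact_rbf_kernel(X, gamma=1.0):
    """Exact kernel matrix, for comparison on small inputs only."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)
```

With n points and D random features, forming Z costs O(n * d * D), and downstream steps can work with the n-by-D matrix Z instead of the full n-by-n kernel matrix. The approximation error shrinks roughly as 1 / sqrt(D), so D trades accuracy against speed.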
The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic, an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime; in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus.
Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
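For readers unfamiliar with contact maps, the following minimal sketch builds one from residue coordinates: two residues are marked as in contact when their representative atoms lie within a distance cutoff. The 8 angstrom cutoff, the use of one coordinate per residue (for example the C-alpha atom), and the function name are assumptions for illustration; the paper's own construction of embeddings from contact maps is not reproduced here.

```python
import numpy as np


def contact_map(coords, threshold=8.0):
    """Binary contact map from an (n_residues, 3) array of coordinates.

    threshold is the distance cutoff in angstroms (assumed value).
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise difference vectors between all residues, shape (n, n, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    # Euclidean distance matrix, shape (n, n).
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # 1 where two residues are within the cutoff, 0 otherwise.
    return (dist <= threshold).astype(int)


# Toy example: two neighbouring residues and one far away.
toy_coords = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [20.0, 0.0, 0.0]]
toy_map = contact_map(toy_coords)
```

The resulting matrix is symmetric with ones on the diagonal, and it depends only on inter-residue distances, so it is unchanged by rotating or translating the structure.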