Antibiotic resistance is becoming a common problem in medicine, food, and industry, with multidrug-resistant bacterial strains occurring in all regions. One possible future solution is the use of bacteriophages. Phages are the most abundant form of life in the biosphere, so it is highly likely that a specific phage can be purified against each target bacterium. Identifying and consistently characterizing individual phages, including determining their host specificity, has been a common form of phage work. With the advent of modern sequencing methods, a problem arose with the detailed characterization of phages identified in the environment by metagenome analysis. One solution may be a bioinformatic approach in the form of prediction software capable of determining the bacterial host from the phage whole-genome sequence. The result of our research is a machine-learning-based tool called PHERI. PHERI predicts a suitable bacterial host genus for the purification of individual viruses from different samples. In addition, it can identify and highlight protein sequences that are important for host selection.
With the rapid growth of massively parallel sequencing technologies, more and more laboratories are utilising sequenced DNA fragments for genomic analyses. Interpretation of sequencing data is, however, strongly dependent on bioinformatics processing, which is often too demanding for clinicians and researchers without a computational background. Another problem is the reproducibility of computational analyses across separate computational centres with inconsistent versions of installed libraries and bioinformatics tools. We propose an easily extensible set of computational pipelines, called SnakeLines, for processing sequencing reads, including mapping, assembly, variant calling, viral identification, transcriptomics, and metagenomics analysis. Individual steps of an analysis, along with methods and their parameters, can be readily modified in a single configuration file. The provided pipelines are embedded in virtual environments that ensure isolation of required resources from the host operating system, rapid deployment, and reproducibility of analysis across different Unix-based platforms. SnakeLines is a powerful framework for the automation of bioinformatics analyses, with emphasis on simple set-up, modification, extensibility, and reproducibility. The framework is already routinely used in various research projects and their applications, especially in the Slovak national surveillance of SARS-CoV-2.
The genomes of SARS-CoV-2 are classified into variants, some of which are monitored as variants of concern (e.g. the Delta variant B.1.617.2 or Omicron variant B.1.1.529). Proportions of these variants circulating in a human population are typically estimated by large-scale sequencing of individual patient samples. Sequencing a mixture of SARS-CoV-2 RNA molecules from wastewater provides a cost-effective alternative, but requires methods for estimating variant proportions in a mixed sample.
We propose a new method based on a probabilistic model of sequencing reads, capturing sequence diversity present within individual variants, as well as sequencing errors. The algorithm is implemented in an open source Python program called VirPool. We evaluate the accuracy of VirPool on several simulated and real sequencing data sets from both Illumina and nanopore sequencing platforms, including wastewater samples from Austria and France monitoring the onset of the Alpha variant.
VirPool is a versatile tool for wastewater and other mixed-sample analysis that can handle both short- and long-read sequencing data. Our approach does not require pre-selection of characteristic mutations for variant profiles, can use the entire length of reads instead of just the most informative positions, and can also capture haplotype dependencies within a single read.
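The general idea of estimating variant proportions from whole reads can be illustrated with a small mixture model. The sketch below is not VirPool's implementation: the profile format, the uniform substitution error rate, the independence of positions within a variant, the function names, and the use of a plain EM loop as the optimizer are all assumptions made for illustration only.

```python
def read_likelihood(read, start, profile, error_rate=0.01):
    """P(read | variant), assuming independent positions (illustrative only).

    profile[pos] maps each base to its frequency within the variant
    (hypothetical input format); error_rate is a uniform substitution rate.
    """
    prob = 1.0
    for offset, observed in enumerate(read):
        p_obs = 0.0
        for base, freq in profile[start + offset].items():
            if base == observed:
                p_obs += freq * (1.0 - error_rate)   # base read correctly
            else:
                p_obs += freq * error_rate / 3.0     # base misread as `observed`
        prob *= p_obs
    return prob


def estimate_proportions(reads, profiles, n_iter=100):
    """Maximum-likelihood mixture weights via EM.

    reads: list of (start_position, sequence) pairs.
    profiles: dict mapping a variant name to its profile.
    """
    names = list(profiles)
    liks = [[read_likelihood(seq, start, profiles[v]) for v in names]
            for start, seq in reads]
    weights = [1.0 / len(names)] * len(names)
    for _ in range(n_iter):
        counts = [0.0] * len(names)
        for lik in liks:
            joint = [w * l for w, l in zip(weights, lik)]
            total = sum(joint)
            for k in range(len(names)):
                counts[k] += joint[k] / total        # E-step: responsibilities
        weights = [c / len(reads) for c in counts]   # M-step: new proportions
    return dict(zip(names, weights))


# Toy data: two hypothetical variants that differ at the middle position.
profiles = {
    "variant_A": [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}],
    "variant_B": [{"A": 1.0}, {"T": 1.0}, {"G": 1.0}],
}
reads = [(0, "ACG")] * 3 + [(0, "ATG")]
print(estimate_proportions(reads, profiles))
```

On this toy input the estimated weights come out close to 0.75 and 0.25. Because each read is scored as a whole against each variant, all of its positions contribute to the estimate, not only pre-selected marker positions. For realistic read lengths the product of probabilities should be computed in log space to avoid underflow.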
The yeast Magnusiomyces capitatus is an opportunistic human pathogen causing rare yet severe infections, especially in patients with hematological malignancies. Here, we report the 20.2 megabase genome sequence of an environmental strain of this species as well as the genome sequences of eight additional isolates from human and animal sources providing an insight into intraspecies variation. The distribution of single-nucleotide variants is indicative of genetic recombination events, supporting evidence for sexual reproduction in this heterothallic yeast. Using RNAseq-aided annotation, we identified genes for 6518 proteins including several expanded families such as kexin proteases and Hsp70 molecular chaperones. Several of these families are potentially associated with the ability of M. capitatus to infect and colonize humans. For the purpose of comparative analysis, we also determined the genome sequence of a closely related yeast, Magnusiomyces ingens. The genome sequences of M. capitatus and M. ingens exhibit many distinct features and represent a basis for further comparative and functional studies.
A recent paradigm shift in bioinformatics from a single reference genome to a pangenome brought with it several graph structures. These graph structures must implement operations such as efficient construction from multiple genomes and read mapping. Read mapping is a well-studied problem on sequential data, where data structures such as the suffix array and the Burrows-Wheeler transform allow for efficient computation. Attempts to achieve comparably high performance on graphs bring many complications, since the common data structures on strings are not easily obtainable for graphs. In this work, we introduce prefix-free graphs, a novel pangenomic data structure; we show how to construct them and how to use them to obtain well-known data structures from stringology in sublinear space, allowing for many efficient operations on pangenomes.
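For the string case mentioned above, the interplay of the suffix array and the Burrows-Wheeler transform in read mapping can be sketched with a toy FM-index and backward search. This illustrates only the classical technique on a single string, not prefix-free graphs; the function names are hypothetical, the construction is deliberately naive (quadratic sorting, uncompressed tables), and the text is assumed not to contain the sentinel `$` or any character sorting below it.

```python
def build_fm_index(text):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    t = text + "$"                       # sentinel, assumed smallest character
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(t))
    # C[c] = number of characters in t strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += t.count(c)
    # occ[c][k] = number of occurrences of c in bwt[:k]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if ch == c else 0)
    return sa, C, occ


def locate(pattern, sa, C, occ):
    """Backward search: start positions of all occurrences of pattern."""
    lo, hi = 0, len(sa)                  # half-open suffix array interval
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("GATTACAGATTACA")
print(locate("TTA", sa, C, occ))   # positions are 0-based
```

Each pattern character costs two table lookups regardless of the text length, which is the kind of efficiency that is hard to carry over directly from strings to graphs.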
We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing. The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP of the text is generally much shorter than the original. To speed up the block selection for tunnelling, we apply PFP to obtain the parse and the dictionary of the text, tunnel the WG of the parse using existing heuristics, and subsequently use this tunnelled parse to construct a compact WG of the original text. Compared with constructing a WG from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of WGs as a pangenomic reference for real-world datasets.
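The parse and dictionary described above can be sketched in a few lines. This is a simplified illustration of prefix-free parsing, not the authors' code: the window length `w`, the modulus `p`, the toy hash, the padding characters `\x01` and `\x02` (assumed absent from the text), and the function names are all assumptions. Phrase boundaries are placed at windows whose hash is divisible by `p`, so consecutive phrases overlap by exactly `w` characters.

```python
def prefix_free_parse(text, w=2, p=3):
    """Return (dictionary, parse) of text; assumes w >= 1."""
    padded = "\x01" + text + "\x02" * w   # start sentinel and w end markers

    def is_trigger(window):
        h = 0
        for ch in window:                 # simple deterministic polynomial hash
            h = (h * 256 + ord(ch)) % 1000003
        return h % p == 0

    phrases, start = [], 0
    for i in range(1, len(padded) - w + 1):
        is_last = i + w == len(padded)
        if is_last or is_trigger(padded[i:i + w]):
            phrases.append(padded[start:i + w])  # phrase ends with this window
            start = i                            # next phrase starts with it
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Invert the parse: drop the w-character overlaps and the padding."""
    phrases = [dictionary[r] for r in parse]
    padded = phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
    return padded[1:-w]


text = "GATTACA" * 20
dictionary, parse = prefix_free_parse(text)
print(len(text), len(dictionary), len(parse))
```

On this highly repetitive toy text the dictionary contains only a handful of distinct phrases and the parse is shorter than the text, which is the effect the method relies on: expensive steps such as block selection can then be run on the parse instead of the full text.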
There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM -- only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact data index MARIA that stores a multiple alignment such that, given the position of one match of a pattern (or a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start.
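The answer such a query should produce can be specified by brute force. The sketch below is not the MARIA index: it scans every row of a small gapped alignment, takes the pattern itself as input instead of the position and length of one match, and assumes a non-empty pattern and `-` as the gap symbol. The function name is hypothetical.

```python
def distinct_match_columns(alignment, pattern):
    """Alignment columns in which at least one match of pattern starts."""
    columns = set()
    for row in alignment:
        # Alignment column of each non-gap character in this row.
        col_of = [c for c, ch in enumerate(row) if ch != "-"]
        seq = row.replace("-", "")
        pos = seq.find(pattern)
        while pos != -1:
            columns.add(col_of[pos])
            pos = seq.find(pattern, pos + 1)
    return sorted(columns)


alignment = [
    "ACGT-ACGT",
    "ACGTTACGT",
    "AC-TTACGT",
]
print(distinct_match_columns(alignment, "ACG"))
```

Here the pattern occurs five times across the three rows, but the matches start in only two distinct columns (0 and 5). Reporting the columns instead of every occurrence is what keeps the output small when many genomes share the same aligned region.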
Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text \(T[1..n]\) and an assignment of tags to the characters of \(T\) such that we can preprocess a pattern \(P[1..m]\) and then, given \(i\) and \(j\), quickly return all the distinct tags labeling the first characters of the occurrences of \(P[i..j]\) in \(T\). For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number \(t\) of runs in the list of tags sorted by their characters' positions in the Burrows-Wheeler Transform (BWT) of \(T\). We show how, given a straight-line program with \(g\) rules for \(T\), we can build an \(O(g + r + t)\)-space Wheeler map, where \(r\) is the number of runs in the BWT of \(T\), with which we can preprocess a pattern \(P[1..m]\) in \(O(m \log n)\) time and then return the \(k\) distinct tags for \(P[i..j]\) in optimal \(O(k)\) time for any given \(i\) and \(j\). We show various further results related to prioritizing the most frequent tags.
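The query and the two run counts can be made concrete with a brute-force sketch. It is not the compressed Wheeler map: it only specifies what a query returns and computes \(r\) and \(t\) naively. The function names are hypothetical, the text is assumed not to contain `$` or smaller characters, and the ordering of the tag list (the tag of each character placed at that character's position in the BWT) is one literal reading of the definition above, which may differ in detail from the paper's.

```python
def count_runs(seq):
    """Number of maximal runs of equal consecutive items."""
    return sum(1 for k in range(len(seq)) if k == 0 or seq[k] != seq[k - 1])


def bwt_and_tag_runs(text, tags):
    """Return (r, t): runs in the BWT and runs in the BWT-ordered tag list."""
    t = text + "$"                      # sentinel, assumed smallest character
    tg = list(tags) + [None]            # the sentinel carries no tag
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = [t[i - 1] for i in sa]        # character preceding each suffix
    tag_list = [tg[i - 1] for i in sa]  # tag of that same character
    return count_runs(bwt), count_runs(tag_list)


def distinct_tags(text, tags, pattern, i, j):
    """Distinct tags on the first characters of occurrences of P[i..j].

    i and j are 1-based and inclusive, as in the notation above.
    """
    sub = pattern[i - 1:j]
    found = []
    pos = text.find(sub)
    while pos != -1:
        if tags[pos] not in found:
            found.append(tags[pos])
        pos = text.find(sub, pos + 1)
    return found


# Two aligned copies of the same sequence; each tag is an alignment column.
text = "GATTACAGATTACA"
tags = [k % 7 for k in range(len(text))]
print(distinct_tags(text, tags, "TTACA", 1, 3))   # tags for P[1..3] = "TTA"
print(bwt_and_tag_runs(text, tags))
```

In this toy case both occurrences of "TTA" start in the same column, so a single tag is returned even though there are two matches; returning \(k\) distinct tags in \(O(k)\) time is the corresponding guarantee of the actual data structure.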