•A geometric space is established for the study of the SARS-CoV-2 data set.
•In the geometric space, SARS-CoV-2 sequences from the same variant cluster together.
•Distances between points in the geometric space reflect their biological distances.
•Points that are closer in the geometric space have closer biological relationships.
SARS-CoV-2, the cause of a severe respiratory disease, has been prevalent around the world since its first discovery in 2019. As a single-stranded RNA virus, its high mutation rate produces many variants and enables some of them to be highly pathogenic, such as the Omicron variant, currently the most prevalent. Research on the relationships among these SARS-CoV-2 variants, especially on their differences, is an active topic. In this study, we constructed a geometric space to represent all SARS-CoV-2 sequences of different variants. An alignment-free method, the natural vector method, was used to establish the genome space. The genome space of SARS-CoV-2 was constructed from the 24-dimensional natural vector, and an appropriate metric was determined through phylogenetic analyses. Phylogenetic trees of different lineages constructed under the selected natural vector and metric coincided with the lineage naming standards; that is, lineages with the same alphabetical prefix cluster together in the phylogenetic trees. Furthermore, the relationships between the various GISAID clades as depicted by the natural graph largely matched the description provided in the GISAID clade naming. These phylogenetic analysis results demonstrate the validity of our geometric space. In this research, we thus constructed a geometric space for the genomes of the novel coronavirus SARS-CoV-2, which allows us to compare the different variants. Our geometric space is valuable for resolving open issues concerning the virus.
The International Committee on Taxonomy of Viruses authorizes and organizes the taxonomic classification of viruses. Thus far, the detailed classifications for all viruses are neither complete nor free from dispute. For example, the current missing label rates in GenBank are 12.1% for family label and 30.0% for genus label. Using the proposed Natural Vector representation, all 2,044 single-segment referenced viral genomes in GenBank can be embedded in Formula: see text. Unlike other approaches, this allows us to determine phylogenetic relations for all viruses at any level (e.g., Baltimore class, family, subfamily, genus, and species) in real time. Additionally, the proposed graphical representation for virus phylogeny provides a visualization of the distribution of viruses in Formula: see text. Unlike the commonly used tree visualization methods, which suffer from uniqueness and existence problems, our representation always exists and is unique. This approach is successfully used to predict and correct viral classification information, as well as to identify viral origins; e.g., a recent public health threat, the West Nile virus, is closer to the Japanese encephalitis antigenic complex based on our visualization. Based on cross-validation results, the accuracy rates of our predictions are as high as 98.2% for Baltimore class labels, 96.6% for family labels, 99.7% for subfamily labels and 97.2% for genus labels.
Ever since the Lie algebra method was introduced to construct finite dimensional nonlinear filters by Brockett and Mitter independently, there has been intense interest in classifying all finite dimensional estimation algebras and finding new classes of finite dimensional recursive filters. The estimation algebra method has proven to be an invaluable tool in nonlinear filtering theory. This article considers the finite dimensional estimation algebras derived from a nonlinear filtering system with state dimension n, linear rank n-1, and constant Wong matrix. Related theories of underdetermined partial differential equations and the Euler operator are applied to classify the estimation algebras. It is proved that, under the abovementioned conditions, the Mitter conjecture holds and the dimension of the finite dimensional estimation algebras must be 2n or 2n+1. Therefore, we can construct the explicit solution of the filtering systems by the Wei-Norman approach. This result is of great significance because it is the first classification of nonmaximal rank finite dimensional estimation algebras with arbitrary state dimension.
Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane, which allows the DNA sequence to be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationships. Computational experiments indicate that this approach gives results comparable to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152
•Complex number encoding of DNA sequences by Chaos Game Representation is proposed.
•Fourier power spectra of DNA sequences are computed from the complex number encoding.
•Alignment-free analysis of DNA sequences using Fourier power spectra is proposed.
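The encoding described in this abstract can be illustrated with a short Python sketch. It is not the authors' MATLAB code: the corner assignment (A, C, G, T on the unit square), the starting point at the center of the square, and the function names `cgr_complex` and `power_spectrum` are assumptions made for illustration. Each nucleotide moves the current point halfway toward its corner, each point is stored as a complex number x + iy, and the power spectrum is the squared magnitude of a plain discrete Fourier transform.

```python
import cmath

# Assumed corner layout on the unit square (a common CGR convention,
# not necessarily the one used in the paper).
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}


def cgr_complex(seq, start=0.5 + 0.5j):
    """Return the CGR trajectory of `seq` as a list of complex numbers.

    Each step moves the current point halfway toward the corner of the
    next nucleotide. Symbols outside A/C/G/T raise KeyError.
    """
    z = start
    points = []
    for ch in seq.upper():
        z = (z + CORNERS[ch]) / 2
        points.append(z)
    return points


def power_spectrum(signal):
    """Squared magnitudes of the discrete Fourier transform of `signal`.

    Direct O(n^2) evaluation, kept simple on purpose; an FFT routine
    would be used for genome-length sequences.
    """
    n = len(signal)
    spectrum = []
    for k in range(n):
        x_k = sum(z * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, z in enumerate(signal))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


if __name__ == "__main__":
    trajectory = cgr_complex("ACGTTGCA")
    print(power_spectrum(trajectory))
```

Two sequences could then be compared through a distance between their power spectra; how spectra of different lengths are brought to a common length is left out of this sketch.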
Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyse the phylogeny of a whole genome. This motivates us to create a new approach to characterize genetic sequences.
To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance, which makes global comparison of genomes with the same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondria. The results show that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists' analyses.
Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than using multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas in multiple alignment methods, realignment is needed to add new sequences. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.
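The natural vector idea in this abstract can be sketched in Python. This is a minimal sketch under stated assumptions: it uses the commonly cited 12-dimensional form (for each nucleotide, its count, its mean position, and a normalized second central moment of its positions) with Euclidean distance. The abstract itself does not give these formulas, and the names `natural_vector` and `nv_distance` are hypothetical.

```python
import math


def natural_vector(seq, alphabet="ACGT"):
    """12-dimensional natural vector of a DNA sequence (assumed form).

    For each nucleotide k the vector holds:
      n_k  : number of occurrences
      mu_k : mean of its 1-based positions
      D2_k : sum((pos - mu_k)^2) / (n_k * n), with n the sequence length
    Nucleotides that never occur contribute zeros.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def nv_distance(u, v):
    """Euclidean distance between two natural vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    v1 = natural_vector("ACGTACGTAC")
    v2 = natural_vector("ACGTTCGTAC")
    print(nv_distance(v1, v2))
```

Because each sequence is mapped to a fixed-length vector once, the vectors can be stored and reused, and adding a new genome only requires computing one more vector and its distances to the stored ones.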
Ever since the technique of the Kalman filter was popularized, there has been a lot of research interest in finding more classes of finite-dimensional recursive filters. In past research, the estimation algebra method could only be used for time-invariant systems. In this paper, we extend the estimation algebra method so that it applies to a general class of time-varying filtering systems. The Wei-Norman method can then be used to derive the explicit solution of the posterior distribution of state estimation. As a special control law, tangent flow is derived for the nonlinear filtering system based on the Monge-Ampère equation in optimal transport. As a result, we propose an optimal transportation filter by applying stochastic tangent flow to Yau filtering systems. The numerical experiments demonstrate the higher efficacy and accuracy of the proposed optimal transportation filter compared to common traditional algorithms such as EKF and FPF.
In this article, we propose an efficient numerical method to solve nonlinear filtering (NLF) problems. Specifically, we use the tensor train decomposition method to solve the forward Kolmogorov equation (FKE) arising from the NLF problem. Our method consists of offline and online stages. In the offline stage, we use the finite difference method to discretize the partial differential operators involved in the FKE and extract low-dimensional structures in the solution tensor using the tensor train decomposition method. In the online stage, using the precomputed low-rank approximation tensors, we can quickly solve the FKE given new observation data. Therefore, we can solve the NLF problem in a real-time manner. Finally, we present numerical results to show the efficiency and accuracy of the proposed method in solving up to six-dimensional NLF problems.
•Presenting a geometric graph method for inverted repeat (IR) analysis.
•Finding the correlation between IR distributions and evolution events.
•Comparing the IR distributions of SARS-CoV-2 and bat and human CoVs.
•Inferring mutations on IRs as the major evolution driver in SARS-CoV-2.
The world faces a great unforeseen challenge in the COVID-19 pandemic caused by the coronavirus SARS-CoV-2. The structure and evolution of the virus genome are central to gaining insights into vaccine development, monitoring of transmission trajectories, and prevention of zoonotic infections by new coronaviruses. Of particular interest are the genomic elements known as inverted repeats (IRs), which maintain genome stability, regulate gene expression, and are targets of mutations. However, little research attention has been given to IR content analysis in the SARS-CoV-2 genome. In this study, we propose a geometric analysis method and use it to investigate the distributions of IRs in the genomes of SARS-CoV-2 and related coronaviruses. The method represents each genomic IR sequence pair as a single point and constructs the geometric shape of the genome using the IRs. Thus, the IR shape can be considered the signature of the genome. The genomes of different coronaviruses are then compared using the constructed IR shapes. The results demonstrate that the SARS-CoV-2 genome, specifically, has an abundance of IRs, and that the IRs in coronavirus genomes show an increase during evolution events.
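The idea of representing each IR pair as a single point can be illustrated with a small Python sketch. The details here are assumptions, not taken from the abstract: an IR pair is taken to be a k-mer and an exact, non-overlapping downstream copy of its reverse complement, and the point is (start of first arm, start of second arm). The function names are hypothetical.

```python
# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Reverse complement of an A/C/G/T string."""
    return s.translate(_COMPLEMENT)[::-1]


def inverted_repeat_points(seq, k):
    """List (i, j) for every exact inverted repeat with arm length k.

    i is the 0-based start of the first arm, j the start of the second
    arm, which must begin at or after i + k (no overlap). Exact matching
    and the (i, j) coordinates are illustrative assumptions.
    """
    seq = seq.upper()
    # Index all k-mer start positions once.
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)

    points = []
    for i in range(len(seq) - k + 1):
        rc = reverse_complement(seq[i:i + k])
        for j in positions.get(rc, []):
            if j >= i + k:
                points.append((i, j))
    return points


if __name__ == "__main__":
    # "GAAC" at position 0 pairs with its reverse complement "GTTC".
    print(inverted_repeat_points("GAACAAAGTTC", 4))
```

The resulting point set for a genome can be plotted or summarized, and point sets from different genomes compared, which is the kind of shape-based comparison the abstract describes.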
Chromosomal fusion is a significant form of structural variation, but research into algorithms for its identification has been limited. Most existing methods rely on synteny analysis, which necessitates manual annotations and always involves inefficient sequence alignments. In this paper, we present a novel alignment-free algorithm for chromosomal fusion recognition. Our method transforms the problem into a series of assignment problems using natural vectors and efficiently solves them with the Kuhn-Munkres algorithm. When applied to the human/gorilla and swamp buffalo/river buffalo datasets, our algorithm successfully and efficiently identifies chromosomal fusion events. Notably, our approach offers several advantages, including higher processing speeds by eliminating time-consuming alignments and removing the need for manual annotations. From an alignment-free perspective, our algorithm initially considers entire chromosomes instead of fragments to identify chromosomal structural variations, offering substantial potential to advance research in this field.
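The assignment problem that this abstract refers to can be illustrated in Python. The sketch below is not the authors' pipeline and does not implement the Kuhn-Munkres algorithm: it solves a small square assignment problem by brute force over all permutations, which gives the same optimal matching but only scales to a handful of items. The toy feature vectors stand in for natural vectors of chromosomes, and all names are hypothetical.

```python
import math
from itertools import permutations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cost_matrix(rows, cols):
    """Pairwise distances: cost[i][j] = distance(rows[i], cols[j])."""
    return [[euclidean(r, c) for c in cols] for r in rows]


def best_assignment(cost):
    """Minimum-cost one-to-one assignment for a square cost matrix.

    Brute force over all n! permutations, used here as a simple
    stand-in for Kuhn-Munkres. Returns (assignment, total_cost), where
    assignment[i] is the column matched to row i.
    """
    n = len(cost)
    best_perm, best_total = None, math.inf
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


if __name__ == "__main__":
    # Toy 2-D feature vectors for two sets of "chromosomes".
    species_a = [(0.0, 0.0), (5.0, 5.0)]
    species_b = [(5.0, 4.0), (0.0, 1.0)]
    print(best_assignment(cost_matrix(species_a, species_b)))
```

For realistic sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` handles the same problem; the brute-force version is kept here only to avoid external dependencies.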
Protein-protein interactions (PPIs) play key roles in life processes such as signal transduction, transcription regulation, and immune response. Identification of PPIs enables a better understanding of the functional networks within a cell. Common experimental methods for identifying PPIs are time consuming and expensive. However, recent developments in computational approaches for inferring PPIs from protein sequences based on coevolution theory avoid these problems. In the coevolution theory model, interacting proteins may show coevolutionary mutations and have similar phylogenetic trees. The existing coevolution methods depend on multiple sequence alignments (MSA); however, the MSA-based coevolution methods often produce many false positive interactions. In this paper, we present a computational method using an alignment-free approach to accurately detect PPIs and reduce false positives. In the method, protein sequences are numerically represented by biochemical properties of amino acids, which reflect the structural and functional differences of proteins. The Fourier transform is applied to the numerical representation of protein sequences to capture the dissimilarities of protein sequences in a biophysical context. The method is assessed for predicting PPIs in Ebola virus. The results indicate strong coevolution between the protein pairs (NP-VP24, NP-VP30, NP-VP40, VP24-VP30, VP24-VP40, and VP30-VP40). The method is also validated for PPIs in influenza and E. coli genomes. Since our method can reduce false positives and increase the specificity of PPI prediction, it offers an effective tool to understand mechanisms of disease pathogens and find potential targets for drug design. The Python programs in this study are publicly available at https://github.com/cyinbox/PPI.