Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns also occur at splice sites with important biological functions. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. Such an approach reduces false positives, but it misses non-canonical splice sites.
We have designed SpliceFinder, a splice site predictor based on a convolutional neural network (CNN). To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains a classification accuracy of 90.25%, which is 10% higher than the existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall higher than 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.
Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
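The sliding-window scan described above can be illustrated with a minimal sketch. The window size, the threshold, and the function names are assumptions made for illustration, and `toy_donor_score` is a hypothetical stand-in for a trained model, not the SpliceFinder CNN.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"
Window = List[List[float]]


def one_hot(seq: str) -> Window:
    """Encode DNA as rows of 4 floats; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def toy_donor_score(window: Window) -> float:
    """Hypothetical stand-in scorer (NOT the SpliceFinder CNN).

    Returns 1.0 when the two central positions of the window encode GT.
    """
    mid = len(window) // 2
    is_g = window[mid][BASES.index("G")] == 1.0
    is_t = window[mid + 1][BASES.index("T")] == 1.0
    return 1.0 if is_g and is_t else 0.0


def scan(sequence: str,
         score: Callable[[Window], float],
         window_size: int = 8,
         threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Slide a fixed-size window along the sequence and score each window.

    Returns (position, score) pairs, where position is the 0-based index of
    the window's central base, for every window scoring at least `threshold`.
    """
    encoded = one_hot(sequence)
    hits = []
    for start in range(len(encoded) - window_size + 1):
        window = encoded[start:start + window_size]
        s = score(window)
        if s >= threshold:
            hits.append((start + window_size // 2, s))
    return hits


hits = scan("AAAAGTAAAAAA", toy_donor_score, window_size=4)
print(hits)
```

The toy scorer recognizes only the canonical GT dimer, which is exactly the limitation the abstract attributes to dimer-based tools. In practice a trained classifier would be passed as `score`, so that the scan is not restricted to canonical sites.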
The gut microbiota (GM) is related to obesity and other metabolic diseases. To detect GM markers for obesity in patients with different metabolic abnormalities and investigate their relationships with clinical indicators, 1,914 Chinese adults were enrolled for 16S rRNA gene sequencing in this retrospective study. Based on GM composition, random forest classifiers were constructed to screen obese patients with (Group OA) or without (Group O) metabolic diseases from healthy individuals (Group H), and high accuracies were observed for discriminating Group O and Group OA from Group H (areas under the receiver operating characteristic curve (AUC) equal to 0.68 and 0.76, respectively). Furthermore, six GM markers were shared by obese patients with various metabolic disorders (Bacteroides, Parabacteroides, Blautia, Alistipes, Romboutsia and Roseburia). In contrast, discrimination between Group OA and Group O showed low accuracy (AUC = 0.57). Likewise, GM classifiers distinguishing Group O from obese patients with specific metabolic abnormalities were not accurate (AUC values from 0.59 to 0.66). Common biomarkers were identified for obese patients with high uric acid, high serum lipids and high blood pressure, such as Clostridium XIVa, Bacteroides and Roseburia. A total of 20 genera were associated with multiple significant clinical indicators. For example, Blautia, Romboutsia, Ruminococcus2, Clostridium sensu stricto and Dorea were positively correlated with indicators of bodyweight (including waistline and body mass index) and serum lipids (including low density lipoprotein, triglyceride and total cholesterol). In contrast, the aforementioned clinical indicators were negatively associated with Bacteroides, Roseburia, Butyricicoccus, Alistipes, Parasutterella, Parabacteroides and Clostridium IV.
Generally, these biomarkers hold the potential to predict obesity-related metabolic abnormalities, and interventions based on these biomarkers might be beneficial to weight loss and metabolic risk improvement.
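The AUC values quoted above can be read as the probability that a classifier scores a randomly chosen case higher than a randomly chosen control. The following is a minimal sketch of that pairwise definition; the function name and the example scores are illustrative assumptions, not data or code from the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    sample receives the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs: label 1 = obese, label 0 = healthy.
example_scores = [0.9, 0.4, 0.6, 0.2]
example_labels = [1, 1, 0, 0]
print(roc_auc(example_scores, example_labels))
```

On this scale 0.5 corresponds to random guessing and 1.0 to perfect separation, so an AUC of 0.57 is only slightly better than chance.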
Natural and artificial directional selection has resulted in significant genetic and phenotypic differences across breeds of domestic animals. However, the molecular regulation of skeletal muscle diversity remains largely unknown. Here, we conducted transcriptome profiling of skeletal muscle across 27 time points, and performed whole-genome re-sequencing in Landrace (lean-type) and Tongcheng (obese-type) pigs. Transcription activity decreased with development, and the high-resolution transcriptome precisely captured the characteristics of skeletal muscle, with distinct biological events in four developmental phases: Embryonic, Fetal, Neonatal, and Adult. A divergence in developmental timing and asynchronous development between the two breeds was observed; Landrace showed a developmental lag and stronger abilities of myoblast proliferation and cell migration, whereas Tongcheng had higher ATP synthase activity in postnatal periods. The miR-24-3p-driven network targeting the insulin signaling pathway regulated glucose metabolism. Notably, integrated analysis suggested that SATB2 and XLOC_036765 contributed to skeletal muscle diversity by regulating myoblast migration and proliferation, respectively. Overall, our results provide insights into the molecular regulation of skeletal muscle development and diversity in mammals.
Horizontal Gene Transfer (HGT) refers to the transfer of genetic material between organisms through mechanisms other than parent-offspring inheritance. HGTs may affect human health through a large number of microorganisms, especially the gut microbiome that the human body harbors. The transferred segments may lead to complicated local genome structural variations. Details of the local genome structure can elucidate the effects of the HGTs.
In this work, we propose a graph-based method to reconstruct the local strains at HGT sites from gut metagenomics data. The method is implemented in a package named LEMON. Simulation results indicate that the method can accurately identify transferred segments on reference sequences of the microbiome, and that LEMON can recover local strains with complicated structural variation. Furthermore, the gene fusion points detected in real data near HGT breakpoints validate the accuracy of LEMON. Some strains reconstructed by LEMON have a replication time profile with lower standard error, which demonstrates that the HGT events recovered by LEMON are reliable.
With LEMON, we can reconstruct the sequence structure of bacteria that harbor HGT events. This helps us to study gene flow among different microbial species.
Human papillomavirus (HPV) integration is a key genetic event in cervical carcinogenesis. By conducting whole-genome sequencing and high-throughput viral integration detection, we identified 3,667 HPV integration breakpoints in 26 cervical intraepithelial neoplasias, 104 cervical carcinomas and five cell lines. Beyond recalculating frequencies for the previously reported frequent integration sites POU5F1B (9.7%), FHIT (8.7%), KLF12 (7.8%), KLF5 (6.8%), LRP1B (5.8%) and LEPREL1 (4.9%), we discovered new hot spots HMGA2 (7.8%), DLG2 (4.9%) and SEMA3D (4.9%). Protein expression from FHIT and LRP1B was downregulated when HPV integrated in their introns. Protein expression from MYC and HMGA2 was elevated when HPV integrated into flanking regions. Moreover, microhomologous sequence between the human and HPV genomes was significantly enriched near integration breakpoints, indicating that fusion between viral and human DNA may have occurred by microhomology-mediated DNA repair pathways. Our data provide insights into HPV integration-driven cervical carcinogenesis.
Tibetan barley (Hordeum vulgare L., qingke) has been the principal cereal cultivated on the Tibetan Plateau for at least 3,500 years, but its origin and domestication remain unclear. Here, based on deep-coverage whole-genome and published exome-capture resequencing data for a total of 437 accessions, we show that contemporary qingke is derived from eastern domesticated barley and was introduced to southern Tibet, most likely via north Pakistan, India, and Nepal, between 4,500 and 3,500 years ago. The low genetic diversity of qingke suggests that Tibet can be excluded as a center of origin or domestication for barley. The rapid decrease in genetic diversity from eastern domesticated barley to qingke can be explained by a founder effect from 4,500 to 2,000 years ago. The haplotypes of the five key domestication genes of barley support a feral or hybridization origin for Tibetan weedy barley and reject the hypothesis of native Tibetan wild barley.
Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function outputs a low value if the profiles are strongly correlated, either negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, $d_a = 1 - |\rho|$, where $\rho$ is a similarity measure such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance at vector quantization, allowed fast data localization, and accelerated data clustering.
In this work, we propose $d_r = \sqrt{1 - |\rho|}$ as an alternative. We prove that $d_r$ satisfies the triangle inequality when $\rho$ represents Pearson correlation, Spearman correlation, or cosine similarity. We show $d_r$ to be better than $d_s = \sqrt{1 - \rho^2}$, another variant of $d_a$ that satisfies the triangle inequality, both analytically and experimentally. We empirically compared $d_r$ with $d_a$ in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering, under hierarchical clustering and PAM (partitioning around medoids) clustering. However, $d_r$ demonstrated more robust clustering. According to the bootstrap experiment, $d_r$ generated more robust sample pair partitions more frequently (P-value Formula: see text). The statistics on the time a class "dissolved" also support the advantage of $d_r$ in robustness.
$d_r$, as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering.
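A small numerical check makes the triangle-inequality claim concrete. The sketch below assumes that the absolute correlation distance is $1 - |\rho|$ and that the proposed alternative is its square root, $\sqrt{1 - |\rho|}$; the second formula is an assumption here, since the abstract text above does not reproduce it. The three profiles and the function names are made up for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def d_abs(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute correlation distance, 1 - |rho|."""
    return 1.0 - abs(pearson(x, y))


def d_root(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed form of the proposed distance, sqrt(1 - |rho|)."""
    # max(...) guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - abs(pearson(x, y))))


# Illustrative profiles: x and z are uncorrelated, y is "halfway" between them.
x = [3.0, 1.0, 3.0, 1.0]
y = [4.0, 2.0, 2.0, 0.0]
z = [3.0, 3.0, 1.0, 1.0]

for name, d in (("1 - |rho|", d_abs), ("sqrt(1 - |rho|)", d_root)):
    direct = d(x, z)
    detour = d(x, y) + d(y, z)
    print(f"{name}: d(x,z)={direct:.4f}, d(x,y)+d(y,z)={detour:.4f}, "
          f"triangle inequality holds: {direct <= detour}")
```

For these profiles the direct distance under $1 - |\rho|$ exceeds the detour through `y`, so the triangle inequality is violated, whereas the square-root form satisfies it on the same triple. A single triple cannot prove the general result; it only shows the kind of violation that the proof rules out.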
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
Correction: On triangle inequalities of correlation-based distances for gene expression profiles. Jiaxing Chen, Yen Kaow Ng, Lu Lin, Xianglilan Zhang, … Shuaicheng Li. BMC Bioinformatics volume 23, Article number: 571 (2022), in Volume 23 Supplement 3, Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2021): bioinformatics. Published: 31 May 2023. The original article was published on 08 February 2023. Correction to: BMC Bioinformatics (2023) 24:40, https://doi.org/10.1186/s12859-023-05161-y. Following publication of the original article [1], it was reported that the article entitled “On triangle inequalities of correlation-based distances for gene expression profiles” was published in the regular issue of this journal instead of in the supplement issue.
Persistent high-risk human papillomavirus (hrHPV) infection is the greatest risk factor for cervical cancer, which is the fourth most common cancer in women worldwide. A growing body of literature demonstrates the role of the cervicovaginal microbiome (CVM) in hrHPV susceptibility and clearance, suggesting the promise of CVM-targeted interventions in protecting against or eliminating HPV infection. Nevertheless, the CVM-HPV-host interactions are largely unknown. In this review, we summarize CVM imbalance in HPV-positive women, with or without cervical diseases, and the progress of exploring CVM resources in HPV clearance. In addition, microbe-microbe and host-microbe interactions in HPV infection and elimination are reviewed to understand the role of the CVM in remission of HPV infection. Lastly, the feasibility of CVM-modulated and CVM-derived products in promoting HPV clearance is discussed. The information in this article provides a valuable reference for researchers interested in cervical cancer prevention and therapy.
Cetaceans (whales, dolphins, and porpoises) are a group of mammals adapted to various aquatic habitats, from oceans to freshwater rivers. We report the sequencing, de novo assembly and analysis of a finless porpoise genome, and the re-sequencing of an additional 48 finless porpoise individuals. We use these data to reconstruct the demographic history of finless porpoises from their origin to their occupation of the Yangtze River. Analyses of selection between marine and freshwater porpoises identify genes associated with renal water homeostasis and the urea cycle, such as urea transporter 2 and angiotensin I-converting enzyme 2, which are likely adaptations associated with the difference in osmotic stress between oceans and rivers. Our results strongly suggest that the critically endangered Yangtze finless porpoises are reproductively isolated from other porpoise populations and harbor unique genetic adaptations, supporting their consideration as a unique incipient species.