This study compared computational approaches to parallelizing an SNP calling workflow. The data comprised DNA from five Holstein-Friesian cows sequenced on the Illumina platform. The pipeline consisted of quality control, alignment to the reference genome, post-alignment processing, and SNP calling. Three approaches to parallelization were compared: (i) a plain Bash script in which the pipeline for each cow was executed as a separate process, with all processes invoked at the same time, (ii) a Bash script wrapped in a single Nextflow process, and (iii) a Nextflow script with each component of the pipeline defined as a separate process. On average, the multi-process Nextflow script was 15-27% faster, depending on the number of assigned threads, with the largest execution time advantage over the plain Bash approach observed with 10 threads. RAM usage varied most for the multi-process Nextflow script, for which it increased with the number of assigned threads, whereas RAM consumption of the other setups depended little on the number of threads. Because of the intermediate and log files it generated, the multi-process Nextflow script used markedly more disk space than the plain Bash script and the single-process Nextflow script.
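Approach (i) above, one pipeline per animal with all pipelines started at the same time, can be sketched as follows. This is a minimal illustration in Python rather than the Bash script used in the study: the sample names, the stage commands, and the helper names `run_pipeline` and `run_all` are hypothetical placeholders, and each stage only prints a message instead of calling a real tool.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample identifiers (the study used five cows).
SAMPLES = ["cow1", "cow2", "cow3", "cow4", "cow5"]

# Stage names follow the abstract; the commands run below are placeholders.
STAGES = ["quality_control", "alignment", "post_alignment", "snp_calling"]


def run_pipeline(sample):
    """Run every stage for one sample in order, each as a child process."""
    completed = []
    for stage in STAGES:
        # Placeholder command: a real pipeline would invoke the actual tool here.
        cmd = [sys.executable, "-c", f"print('{stage} done for {sample}')"]
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        completed.append(stage)
    return completed


def run_all(samples):
    """Start one pipeline per sample at the same time and wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(samples)) as pool:
        results = pool.map(run_pipeline, samples)
        return dict(zip(samples, results))


if __name__ == "__main__":
    for sample, stages in run_all(SAMPLES).items():
        print(sample, "->", ", ".join(stages))
```

Within a sample the stages run strictly in order, while different samples proceed concurrently; this mirrors the per-cow parallelism of the plain Bash setup but not the per-stage scheduling that a multi-process workflow manager provides.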
Variations in serum amino acid levels are linked to a multitude of complex disorders. We report the largest genome-wide association study (GWAS) on nine serum amino acids in the UK Biobank participants (117 944, European descent). We identified 34 genomic loci for circulatory levels of alanine, 48 loci for glutamine, 44 loci for glycine, 16 loci for histidine, 11 loci for isoleucine, 19 loci for leucine, 9 loci for phenylalanine, 32 loci for tyrosine and 20 loci for valine. Our gene-based analysis mapped 46-293 genes associated with serum amino acids. The gene-property analysis across 30 tissues highlighted enriched expression of the identified genes in liver tissues for all studied amino acids except isoleucine and valine, in muscle tissues for serum alanine and glycine, in adrenal gland tissues for serum isoleucine and leucine, and in pancreatic tissues for serum phenylalanine. Mendelian randomization (MR) phenome-wide association study analysis and subsequent two-sample MR analysis provided evidence that every standard deviation increase in valine is associated with a 35% higher risk of type 2 diabetes, and that elevated levels of serum alanine and branched-chain amino acids are associated with higher levels of total cholesterol, triglyceride and low-density lipoprotein, and lower levels of high-density lipoprotein. In contrast to reports from observational studies, MR analysis did not support a causal association between the studied amino acids and coronary artery disease, Alzheimer's disease, breast cancer or prostate cancer. In conclusion, we explored the genetic architecture of serum amino acids and provided evidence supporting a causal role of amino acids in cardiometabolic health.
Two-component systems are key signal-transduction systems that enable bacteria to respond to a wide variety of environmental stimuli. The human pathogen Streptococcus pneumoniae (pneumococcus) encodes 13 two-component systems and a single orphan response regulator, most of which are significant for pneumococcal pathogenicity. Mapping the regulatory networks governed by these systems is key to understanding pneumococcal host adaptation. Here we employ a novel bioinformatic approach to predict the regulons of each two-component system based on publicly available whole-genome sequencing data. By employing pangenome-wide association studies (panGWAS) to predict genotype-genotype associations for each two-component system, we predicted regulon genes of 11 of the pneumococcal two-component systems. Validation by next-generation RNA sequencing of response regulator overexpression mutants confirmed several top candidate genes predicted by the panGWAS analysis as regulon genes. The present study provides novel details on multiple pneumococcal two-component systems, including an expansion of regulons, identification of candidate response regulator binding motifs, and identification of candidate response regulator-regulated small non-coding RNAs. We also demonstrate a use for panGWAS as a complementary tool in target gene identification via identification of genotype-to-genotype links. Expanding our knowledge of two-component systems in pathogens is crucial to understanding how these bacteria sense and respond to their host environment, which could prove useful in future drug development.
The large diversity of functional genomic assays allows for the characterization of non-coding and coding events at the tissue level or at single-cell resolution. However, this diversity also leads to protocol differences, widely varying sequencing depths, and substantial disparities in sample sizes and numbers of features. In this work, we have built a Python package, MUFFIN, which offers a wide variety of tools suitable for a broad range of genomic assays and brings many tools that were missing from the Python ecosystem. First, MUFFIN has specialized tools for the exploration of the non-coding regions of genomes, such as a function to identify consensus peaks in peak-called assays, as well as tools for linking genomic regions to genes and performing Gene Set Enrichment Analyses. MUFFIN also provides a robust and flexible count table processing pipeline, comprising normalization, count transformation, dimensionality reduction, differential expression, and clustering. Our tools were tested on three widely different scRNA-seq, ChIP-seq and ATAC-seq datasets. MUFFIN integrates with the popular Scanpy ecosystem and is available on Conda and at https://github.com/pdelangen/Muffin.
Delineating the intricate interplay between promoter-proximal and -distal regulators is crucial for understanding the function of transcriptional mediator complexes implicated in the regulation of gene expression. The present study aimed to develop a computational method for accurately modeling spatial proximal and distal regulatory interactions. Our method combined regression-based models, which identify key regulators through gene expression prediction, with a graph-embedding approach that detects coregulated genes. This approach enabled a detailed investigation of the gene regulatory mechanisms in germinal center B cells, whose formation is accompanied by dramatic rearrangements of the genome structure. We found that while the promoter-proximal regulatory elements were the principal regulators of gene expression, the distal regulators fine-tuned transcription. Moreover, our approach unveiled the presence of modular regulators, such as cofactors and proximal/distal transcription factors, which were co-expressed with their target genes. Some of these modules exhibited abnormal expression patterns in lymphoma. These findings suggest that the dysregulation of interactions between transcriptional and architectural factors is associated with chromatin reorganization failure, which may increase the risk of malignancy. Therefore, our computational approach helps decipher the spatially interacting transcriptional regulatory code.
In the rapidly evolving field of genomics, understanding the genetic basis of complex diseases like breast cancer, particularly its familial/hereditary forms, is crucial. Current methods often examine genomic variants, such as Single Nucleotide Variants (SNVs), insertions/deletions (Indels), and Copy Number Variations (CNVs), separately, lacking an integrated approach. Here, we introduce a robust, flexible methodology for comprehensive variant analysis using Whole Exome Sequencing (WES) data. Our approach uniquely combines meticulous validation with an effective variant filtering strategy. By reanalyzing two germline WES datasets from
negative breast cancer patients, we demonstrated our tool's efficiency and adaptability, uncovering both known and novel variants. This contributed new insights for potential diagnostic, preventive, and therapeutic strategies. Our method stands out for its comprehensive inclusion of key genomic variants in a unified analysis, and its practical resolution of technical challenges, offering a pioneering solution in genomic research. This tool presents a breakthrough in providing detailed insights into the genetic alterations in genomes, with significant implications for understanding and managing hereditary breast cancer.
While machine learning models have been successfully applied to predicting gene expression from promoter sequences, it remains a great challenge to derive an intuitive interpretation of the model and to reveal DNA motif grammar, such as motif cooperation and distance constraints between motif sites. Previous interpretation approaches are often time-consuming or have difficulty learning the combinatorial rules. In this work, we designed interpretable neural network models to predict mRNA expression levels from DNA sequences. By applying the Contextual Regression framework we developed, we extracted weighted features to cluster samples into groups with different gene expression levels. We performed motif analysis in each cluster and found motifs with activating or repressive effects on gene expression. By comparing the co-occurrence locations of the discovered motifs, we also uncovered multiple grammars of motif combination, including communities of cooperative motifs and distance constraints between motif pairs. These results provide new insights into the regulatory architecture of promoter sequences.
RNA-RNA interactions are a key feature of post-transcriptional gene regulation in all domains of life. While ever more experimental protocols are being developed to study RNA duplex formation on a genome-wide scale, computational methods for the analysis and interpretation of the underlying data are lagging behind. Here, we present ChimericFragments, an analysis framework for RNA-seq experiments that produce chimeric RNA molecules. ChimericFragments implements a novel statistical method based on the complementarity of the base-pairing RNAs around their ligation site and provides an interactive graph-based visualization for data exploration and interpretation. ChimericFragments detects true RNA-RNA interactions with high precision and is compatible with several widely used experimental procedures such as RIL-seq, LIGR-seq or CLASH. We further demonstrate that ChimericFragments enables the systematic detection of novel RNA regulators and RNA-target pairs with crucial roles in microbial physiology and virulence. ChimericFragments is written in
and available at: https://github.com/maltesie/ChimericFragments.
Malat1 is a long non-coding RNA with critical roles in gene regulation and cancer metastasis; however, its functional role in stem cells is largely unexplored. Here we perform a nuclear knockdown of Malat1 in mouse embryonic stem cells, causing the de-regulation of 320 genes and aberrant splicing of 90 transcripts, some of which potentially affect the translated protein sequence. We find evidence that Malat1 directly interacts with gene bodies and aberrantly spliced transcripts, and that it locates upstream of down-regulated genes at their putative enhancer regions, in agreement with functional genomics data. Consistent with this, we find these genes affected at both exon and intron levels, suggesting that they are transcriptionally regulated by Malat1. In addition, the down-regulated genes are regulated by specific transcription factors and bear both activating and repressive chromatin marks, suggesting that some of them might be regulated by bivalent promoters. We propose a model in which Malat1 facilitates the transcription of genes involved in chromatid dynamics and mitosis in one pathway, and affects the splicing of transcripts that are themselves involved in RNA processing in a distinct pathway. Lastly, we compare our findings with Malat1 perturbation studies performed in other cell systems and
.
Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning-based sequence classifier, DeepMicroClass, that classifies metagenomic contigs into five sequence classes: viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. DeepMicroClass achieved high performance for all sequence classes at the tested sequence lengths, which ranged from 500 bp to 100 kbp. By benchmarking on a synthetic dataset with variable sequence class composition, we showed that DeepMicroClass performed better at eukaryotic, plasmid and viral contig classification than other state-of-the-art predictors. DeepMicroClass achieved performance comparable to geNomad and VirSorter2 on viral sequence classification when benchmarked on the CAMI II marine dataset. Using a coastal daily time-series metagenomic dataset as a case study, we showed that microbial eukaryotes and prokaryotic viruses are integral to microbial communities. By analyzing monthly metagenomes collected at HOT and BATS, we found relatively higher viral read proportions in the subsurface layer in late summer, consistent with the seasonal viral infection patterns prevalent in these areas. We expect DeepMicroClass will promote metagenomic studies of under-appreciated sequence types.