Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results.
Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profiling. We use a novel containment min hash approach to pre-filter the reference database prior to alignment and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign is the only method evaluated that maintained high performance and competitive running time across all datasets.
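The pre-filtering idea can be illustrated with a minimal sketch. This is not Metalign's implementation: the function names, the k-mer length, the sketch size, and the threshold are all hypothetical choices made for illustration, and an exact k-mer set stands in for whatever membership structure a real tool would use. It estimates, for each reference genome, the fraction of its k-mers that also occur in the sample (the containment index) from a small bottom-k sketch, and keeps only references above a threshold.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmer_set, size):
    """Keep the `size` k-mers with the smallest hash values."""
    return sorted(kmer_set, key=stable_hash)[:size]


def containment_estimate(reference_sketch, sample_kmers):
    """Estimate |ref ∩ sample| / |ref| from the sketched reference k-mers."""
    if not reference_sketch:
        return 0.0
    hits = sum(1 for km in reference_sketch if km in sample_kmers)
    return hits / len(reference_sketch)


def prefilter(references, sample_reads, k=5, sketch_size=50, threshold=0.5):
    """Return names of references whose estimated containment in the sample
    reaches `threshold`. All parameter values are illustrative assumptions."""
    sample_kmers = set()
    for read in sample_reads:
        sample_kmers |= kmers(read, k)
    kept = []
    for name, genome in references.items():
        sketch = bottom_sketch(kmers(genome, k), sketch_size)
        if containment_estimate(sketch, sample_kmers) >= threshold:
            kept.append(name)
    return kept


# Toy data: reads drawn from "present" only.
references = {
    "present": "ACGTACGGTCAGTTCAGGCTAAGCTT",
    "absent": "TTTTTTTTTTGGGGGGGGGGCCCCCCCC",
}
sample_reads = ["ACGTACGGTCAGTTCAGG", "CAGTTCAGGCTAAGCTT"]
print(prefilter(references, sample_reads))
```

Only the references that survive this filter would then be passed to the more expensive alignment step, which is where the running-time saving would come from under these assumptions.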
Implementation of bioinformatics software involves numerous unique challenges; a rigorous standardized approach is needed to examine software tools prior to their publication.
The emergence of viral variants with altered phenotypes is a public health challenge underscoring the need for advanced evolutionary forecasting methods. Given extensive epistatic interactions within viral genomes and known viral evolutionary history, efficient genomic surveillance necessitates early detection of emerging viral haplotypes rather than commonly targeted single mutations. Haplotype inference, however, is a significantly more challenging problem precluding the use of traditional approaches. Here, using SARS-CoV-2 evolutionary dynamics as a case study, we show that emerging haplotypes with altered transmissibility can be linked to dense communities in coordinated substitution networks, which become discernible significantly earlier than the haplotypes become prevalent. From these insights, we develop a computational framework for inference of viral variants and validate it by successful early detection of known SARS-CoV-2 strains. Our methodology offers greater scalability than phylogenetic lineage tracing and can be applied to any rapidly evolving pathogen with adequate genomic surveillance data.
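As a rough illustration of what a dense community in a coordinated substitution network could look like computationally, the sketch below links pairs of mutations that co-occur in the same genomes more often than independence would predict, then extracts a dense subgraph by greedy peeling. The abstract does not say how coordination is scored or how communities are detected, so the observed/expected ratio, the `min_ratio` value, the peeling heuristic, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def coordinated_edges(genomes, min_ratio=1.5):
    """genomes: list of sets of mutation labels.
    Link a pair when its joint count is at least min_ratio times the count
    expected if the two mutations were independent (assumed scoring rule)."""
    n = len(genomes)
    single = Counter(m for g in genomes for m in g)
    pair = Counter(p for g in genomes for p in combinations(sorted(g), 2))
    edges = set()
    for (a, b), joint in pair.items():
        expected = single[a] * single[b] / n
        if joint >= min_ratio * expected:
            edges.add((a, b))
    return edges


def densest_subgraph(edges):
    """Greedy peeling: repeatedly drop a minimum-degree node and remember
    the node set with the highest edges-per-node ratio seen so far."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = set(adj)
    n_edges = len(edges)
    best = set(nodes)
    best_density = n_edges / len(nodes) if nodes else 0.0
    while len(nodes) > 1:
        # Sort first so ties are broken deterministically by label.
        v = min(sorted(nodes), key=lambda u: len(adj[u]))
        for u in adj[v]:
            adj[u].discard(v)
        n_edges -= len(adj[v])
        nodes.remove(v)
        del adj[v]
        density = n_edges / len(nodes)
        if density > best_density:
            best, best_density = set(nodes), density
    return best


# Toy data: mutations A, B, C travel together; D and E are loosely linked.
genomes = (
    [{"A", "B", "C"}] * 4
    + [{"D", "E"}] * 2
    + [{"A", "D"}]
    + [{"D"}] * 3
)
edges = coordinated_edges(genomes)
print(sorted(edges))
print(sorted(densest_subgraph(edges)))
```

In this toy input the tightly co-occurring triple forms the densest part of the network even though each of its members appears in fewer than half of the genomes, which mirrors the claim that such communities can be discernible before the haplotype becomes prevalent.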
As are most non-European populations, the Han Chinese are relatively understudied in population and medical genetics studies. From low-coverage whole-genome sequencing of 11,670 Han Chinese women we present a catalog of 25,057,223 variants, including 548,401 novel variants that are seen at least 10 times in our data set. Individuals from this data set came from 24 out of 33 administrative divisions across China (including 19 provinces, 4 municipalities, and 1 autonomous region), thus allowing us to study population structure, genetic ancestry, and local adaptation in Han Chinese. We identified previously unrecognized population structure along the East-West axis of China, demonstrated a general pattern of isolation-by-distance among Han Chinese, and reported unique regional signals of admixture, such as European influences among the Northwestern provinces of China. Furthermore, we identified a number of highly differentiated, putatively adaptive, loci (e.g., MTHFR, ADH7, and FADS, among others) that may be driven by immune response, climate, and diet in the Han Chinese. Finally, we have made available allele frequency estimates stratified by administrative divisions across China in the Geography of Genetic Variant browser for the broader community. By leveraging the largest currently available genetic data set for Han Chinese, we have gained insights into the history and population structure of the world's largest ethnic group.
Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and accurate profiling of the Ig repertoire, including the complementarity-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissue types from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoire information. Moreover, the flow of Ig clonotypes and inter-tissue repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an important resource for studying immunological diseases.
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Life and medical science researchers increasingly rely on applications that lack a graphical interface. Scientists who are not trained in computer science face an enormous challenge analyzing high-throughput data. We present a training model for use of command-line tools when the learner has little to no prior knowledge of UNIX.
Juvenile idiopathic arthritis (JIA) is one of the most prevalent rheumatic disorders in children and is classified as an autoimmune disease (AID). While a robust genetic contribution to JIA etiology has been established, the exact pathogenesis remains unclear.
To prioritize biologically interpretable susceptibility genes and proteins for JIA, we conducted transcriptome-wide and proteome-wide association studies (TWAS/PWAS). Then, to understand the genetic architecture of JIA, we systematically analyzed single-nucleotide polymorphism (SNP)-based heritability, a signature of natural selection, and polygenicity. Next, we conducted HLA typing using multi-ethnicity RNA sequencing data. Additionally, we examined the T cell receptor (TCR) repertoire at a single-cell level to explore the potential links between immunity and JIA risk.
We identified 19 TWAS genes and two PWAS proteins associated with JIA risk. Furthermore, heritability and cell-type enrichment analyses show enrichment in T lymphocytes and HLA regions, and JIA shows higher polygenicity than other AIDs. In multi-ancestry HLA typing, B*45:01 is more prevalent in African JIA patients than in European JIA patients, whereas DQA1*01:01, DQA1*03:01, and DRB1*04:01 exhibit a higher frequency in European JIA patients. Using single-cell immune repertoire analysis, we identify clonally expanded T cell subpopulations in JIA patients, including CXCL13+BHLHE40+ T cells, which are significantly associated with JIA risk.
Our findings shed new light on the pathogenesis of JIA and provide a strong foundation for future mechanistic studies aimed at uncovering the molecular drivers of JIA.