Motivation
The emergence of a novel strain of betacoronavirus, SARS-CoV-2, has led to a pandemic that has been associated with over 700 000 deaths as of August 5, 2020. Research is ongoing around the world to create vaccines and therapies to minimize rates of disease spread and mortality. Crucial to these efforts are molecular characterizations of neutralizing antibodies to SARS-CoV-2. Such antibodies would be valuable for measuring vaccine efficacy, diagnosing exposure and developing effective biotherapeutics. Here, we describe our new database, CoV-AbDab, which already contains data on over 1400 published/patented antibodies and nanobodies known to bind to at least one betacoronavirus. This database is the first consolidation of antibodies known to bind SARS-CoV-2 as well as other betacoronaviruses such as SARS-CoV-1 and MERS-CoV. It contains relevant metadata including evidence of cross-neutralization, antibody/nanobody origin, full variable domain sequence (where available) and germline assignments, epitope region, links to relevant PDB entries, homology models and source literature.
Results
On August 5, 2020, CoV-AbDab referenced sequence information on 1402 anti-coronavirus antibodies and nanobodies, spanning 66 papers and 21 patents. Of these, 1131 bind to SARS-CoV-2.
Availability and implementation
CoV-AbDab is free to access and download without registration at http://opig.stats.ox.ac.uk/webapps/coronavirus. Community submissions are encouraged.
Supplementary information
Supplementary data are available at Bioinformatics online.
Abstract
Summary
The limited resolution of spatial transcriptomics (ST) assays in the past has led to the development of cell type annotation methods that separate the convolved signal based on available external atlas data. In light of the rapidly increasing resolution of the ST assay technologies, we made available and investigated the performance of a deconvolution-free marker-based cell annotation method called scType. In contrast to existing methods, the spatial application of scType does not require computationally strenuous deconvolution, nor large single-cell reference atlases. We show that scType enables ultra-fast and accurate identification of abundant cell types from ST data, especially when a large enough panel of genes is detected. Examples of such assays are Visium and Slide-seq, which currently offer the best trade-off between high resolution and number of genes detected by the assay for cell type annotation.
Availability and implementation
scType source code in R and Python for spatial data is openly available on GitHub (https://github.com/kris-nader/sp-type or https://github.com/kris-nader/sc-type-py). Step-by-step tutorials for R and Python spatial data analysis can be found at https://github.com/kris-nader/sp-type and https://github.com/kris-nader/sc-type-py/blob/main/spatial_tutorial.md, respectively.
Due to the varying delivery methods of messenger RNA (mRNA) vaccines, codon optimization plays a critical role in vaccine design to improve the stability and expression of proteins in specific tissues. Considering the many-to-one relationship between synonymous codons and amino acids, the number of mRNA sequences encoding the same amino acid sequence could be enormous. Finding stable and highly expressed mRNA sequences from the vast sequence space using in silico methods can generally be viewed as a path-search problem or a machine translation problem. However, current deep learning-based methods inspired by machine translation may have limitations. For example, recurrent neural networks (RNNs) have a weak ability to capture the long-term dependencies of codon preferences.
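To make the size of this sequence space concrete, the following minimal sketch counts how many mRNA coding sequences encode a given peptide under the standard genetic code. The function name `count_synonymous_mrnas` is hypothetical and chosen for illustration only; stop codons are not considered.

```python
# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are excluded).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_mrnas(protein: str) -> int:
    """Return the number of distinct coding sequences for `protein`.

    Each residue contributes independently, so the total is the product
    of the per-residue codon counts.
    """
    total = 1
    for residue in protein:
        total *= CODON_DEGENERACY[residue]
    return total


# A three-residue peptide already has 1 * 6 * 6 = 36 encodings.
print(count_synonymous_mrnas("MLS"))  # 36
```

Because the count is a product over residues, it grows exponentially with protein length, which is why exhaustive enumeration is impractical for real proteins.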
We developed a BERT-based architecture that uses the cross-attention mechanism for codon optimization. In CodonBERT, the codon sequence is randomly masked, with each codon serving as a key and a value, while the amino acid sequence is used as the query. CodonBERT was trained on high-expression transcripts from the Human Protein Atlas mixed with different proportions of high codon adaptation index (CAI) codon sequences. The results showed that CodonBERT can effectively capture the long-term dependencies between codons and amino acids, suggesting that it can be used as a customized training framework for specific optimization targets.
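For readers unfamiliar with the codon adaptation index mentioned above, the sketch below computes it in its standard form: the geometric mean of each codon's relative adaptiveness, where relative adaptiveness is the codon's frequency in a reference set divided by the frequency of the most used synonymous codon. This is not taken from the CodonBERT code base. The codon table is a hypothetical three-amino-acid subset, the reference sequences are toy data, and codons missing from the reference are simply skipped, which is a simplification.

```python
import math
from collections import Counter

# Hypothetical toy subset of the synonymous-codon table (not the full code).
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
}


def split_codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon: its count divided by the count of the most
    frequent synonymous codon in the reference set."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    weights = {}
    for codons in SYNONYMS.values():
        best = max(counts[c] for c in codons)
        for c in codons:
            if best > 0 and counts[c] > 0:
                weights[c] = counts[c] / best
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the codons in `seq`.

    Codons without a weight are skipped (a simplification)."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference set standing in for highly expressed transcripts.
weights = relative_adaptiveness(["AAGAAGAAGAAA", "GAGGAGGAA"])
print(cai("AAGGAG", weights))  # 1.0, every codon is the preferred one
```

A sequence built only from the most frequent synonymous codons scores 1.0, and any use of rarer codons lowers the score.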
CodonBERT is freely available at https://github.com/FPPGroup/CodonBERT.
Supplementary data are available at Bioinformatics online.
Abstract
Motivation
Infinium DNA methylation BeadChips are widely used for genome-wide DNA methylation profiling at the population scale. Recent updates to probe content and naming conventions in the EPIC version 2 (EPICv2) arrays have complicated integrating new data with previous Infinium array platforms, such as the MethylationEPIC (EPIC) and the HumanMethylation450 (HM450) BeadChip.
Results
We present mLiftOver, a user-friendly tool that harmonizes probe ID, methylation level, and signal intensity data across different Infinium platforms. It manages probe replicates, missing data imputation, and platform-specific bias for accurate data conversion. We validated the tool by applying HM450-based cancer classifiers to EPICv2 cancer data, achieving high accuracy. Additionally, we successfully integrated EPICv2 healthy tissue data with legacy HM450 data for tissue identity analysis and produced consistent copy number profiles in cancer cells.
Availability and implementation
mLiftOver is implemented in R and available in the Bioconductor package SeSAMe (version 1.21.13+): https://bioconductor.org/packages/release/bioc/html/sesame.html. Analysis of EPIC and EPICv2 platform-specific bias and high-confidence mapping is available at https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Anno/EPICv2/EPICv2ToEPIC_conversion.tsv.gz. The source code is available at https://github.com/zwdzwd/sesame/blob/devel/R/mLiftOver.R under the MIT license.
Abstract
Motivation
The emergence of large chemical repositories and combinatorial chemical spaces, coupled with high-throughput docking and generative AI, has greatly expanded the chemical diversity of small molecules for drug discovery. Selecting compounds for experimental validation requires filtering these molecules based on favourable druglike properties, such as Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET).
Results
We developed ADMET-AI, a machine learning platform that provides fast and accurate ADMET predictions both as a website and as a Python package. ADMET-AI has the highest average rank on the TDC ADMET Leaderboard, and it is currently the fastest web-based ADMET predictor, with a 45% reduction in time compared to the next fastest public ADMET web server. ADMET-AI can also be run locally with predictions for one million molecules taking just 3.1 h.
Availability and implementation
The ADMET-AI platform is freely available both as a web server at admet.ai.greenstonebio.com and as an open-source Python package for local batch prediction at github.com/swansonk14/admet_ai (also archived on Zenodo at doi.org/10.5281/zenodo.10372930). All data and models are archived on Zenodo at doi.org/10.5281/zenodo.10372418.
Abstract
Motivation
Gene clusters, defined as sets of genes encoding functionally related proteins, are abundant in eukaryotic genomes. Despite the increasing availability of chromosome-level genomes, the comprehensive analysis of gene family evolution remains largely unexplored, particularly for large and highly dynamic gene families or those including very recent family members. These challenges stem from limitations in genome assembly contiguity, particularly in repetitive regions such as large gene clusters. Recent advancements in sequencing technology, such as long reads and chromatin contact mapping, hold promise in addressing these challenges.
Results
To facilitate the identification, analysis, and visualization of physically clustered gene family members within chromosome-level genomes, we introduce GALEON, a user-friendly bioinformatic tool. GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density. The pipeline also enables the simultaneous analysis and comparison of two gene families and allows the exploration of the relationship between physical and evolutionary distances. This tool offers a novel approach for studying the origin and evolution of gene families.
Availability and implementation
GALEON is freely available from https://www.ub.edu/softevol/galeon and https://github.com/molevol-ub/galeon.
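The idea of grouping family members by physical distance, described in the Results of this abstract, can be illustrated with a much simpler stand-in: genes on one chromosome are sorted by position and split wherever the gap between neighbours exceeds a fixed threshold. This is only a sketch and not GALEON's algorithm, which also uses genome-wide gene density. The names `cluster_by_distance`, `max_gap` and `min_size`, and the example coordinates, are hypothetical.

```python
def cluster_by_distance(positions, max_gap, min_size=2):
    """Group genes on a single chromosome into physical clusters.

    positions: dict mapping gene name -> coordinate in base pairs
    max_gap:   largest distance allowed between neighbouring members
    min_size:  smallest group that is reported as a cluster
    """
    clusters = []
    current = []  # list of (name, position) for the group being built
    for name, pos in sorted(positions.items(), key=lambda item: item[1]):
        # Close the current group when the next gene is too far away.
        if current and pos - current[-1][1] > max_gap:
            if len(current) >= min_size:
                clusters.append([n for n, _ in current])
            current = []
        current.append((name, pos))
    if len(current) >= min_size:
        clusters.append([n for n, _ in current])
    return clusters


# Hypothetical coordinates for six members of one gene family.
family = {
    "g1": 1_000, "g2": 5_000, "g3": 9_000,
    "g4": 500_000, "g5": 900_000, "g6": 905_000,
}
print(cluster_by_distance(family, max_gap=10_000))
# [['g1', 'g2', 'g3'], ['g5', 'g6']]
```

A fixed threshold is the crudest possible choice. Scaling the threshold to the local or genome-wide gene density, as the abstract describes, avoids calling clusters in regions that are gene-dense by chance.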
Abstract
Motivation
Errors in the processing of genetic information during protein synthesis can lead to phenotypic mutations, such as amino acid substitutions caused by transcription or translation errors. While genetic mutations can be readily identified using DNA sequencing, and mutations due to transcription errors by RNA sequencing, translation errors can only be identified proteome-wide using mass spectrometry.
Results
Here, we provide a Python package implementation of a high-throughput pipeline to detect amino acid substitutions in mass spectrometry datasets. Our tools enable users to process hundreds of mass spectrometry datasets in batch mode to detect amino acid substitutions and calculate codon-specific and site-specific translation error rates. deTELpy will facilitate the systematic understanding of amino acid misincorporation rates (translation error rates), and the inference of error models across organisms and under stress conditions, such as drug treatment or disease conditions.
Availability and implementation
deTELpy is implemented in Python 3 and is freely available with detailed documentation and practical examples at https://git.mpi-cbg.de/tothpetroczylab/detelpy and https://pypi.org/project/deTELpy/. It can be installed via pip install deTELpy.
Abstract
Summary
Identification and quantification of phosphorylation sites are essential for biological interpretation of a phosphoproteomics experiment. For data independent acquisition mass spectrometry-based (DIA-MS) phosphoproteomics, extracting a site-level report from the output of current processing software is not straightforward, as multiple peptides might contribute to a single site, multiple phosphorylation sites can occur on the same peptide, and protein isoforms complicate site specification. Currently, only limited support is available from a commercial software package via a platform-specific solution with a rather simple site quantification method. Here, we present sitereport, a software tool implemented in an extendable Python package called msproteomics to report phosphosites and phosphopeptides from a DIA-MS phosphoproteomics experiment with a proven quantification method called MaxLFQ. We demonstrate the use of sitereport for downstream data analysis at site level, allowing benchmarking of different DIA-MS processing software tools.
Availability and implementation
sitereport is available as a command line tool in the Python package msproteomics, released under the Apache License 2.0 and available from the Python Package Index (PyPI) at https://pypi.org/project/msproteomics and GitHub at https://github.com/tvpham/msproteomics.
Abstract
Summary
Protein Interaction Explorer (PIE) is a new web-based tool integrated into our database iPPI-DB, specifically crafted to support structure-based drug discovery initiatives focused on protein–protein interactions (PPIs). Drawing upon extensive structural data encompassing thousands of heterodimer complexes, including those with successful ligands, PIE provides a comprehensive suite of tools dedicated to aiding decision-making in PPI drug discovery. PIE enables researchers/bioinformaticians to identify and characterize crucial factors such as the presence of binding pockets or functional binding sites at the interface, to predict hot spots, and to find similar protein-embedded pockets for potential repurposing efforts.
Availability and implementation
PIE is user-friendly and readily accessible at https://ippidb.pasteur.fr/targetcentric/. It relies on the NGL visualizer.
Abstract
Motivation
The data independent acquisition (DIA) mass spectrometry (MS) method is increasingly popular in the field of proteomics. However, the loss of the correspondence between peptide ions and their spectra in DIA makes identification challenging. One effective approach to reduce false positive identification is to calculate the deviation between the peptide’s estimated retention time (RT) and measured RT. During this process, scaling the spectral library RT into the estimated RT, known as RT calibration, is a prerequisite for calculating the deviation. Currently, within the DIA algorithm ecosystem, there is a lack of engine-independent and readily usable RT calibration toolkits.
Results
In this work, we introduce Calib-RT, an RT calibration method tailored to the characteristics of RT data. This method can achieve nonlinear calibration across various data scales and tolerate a certain level of noise interference. Calib-RT is expected to enrich the open source DIA algorithm toolchain and assist in the development of DIA identification algorithms.
Availability and implementation
Calib-RT is released as open source software under the MIT license and can be installed from PyPI as a Python module. The source code is available on GitHub at https://github.com/chenghui03/Calib_RT.
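The calibration step described in the Motivation of this abstract, mapping a library RT to an estimated RT and then comparing it with the measured RT, can be sketched with plain piecewise-linear interpolation through a few anchor peptides. This is a generic stand-in, not the Calib-RT method or its API: the abstract does not describe the algorithm, and the names `make_calibrator` and `rt_deviation` as well as the anchor values are hypothetical.

```python
from bisect import bisect_right


def make_calibrator(anchors):
    """Build a library-RT -> estimated-RT function from anchor pairs.

    anchors: iterable of (library_rt, measured_rt) pairs with distinct
             library_rt values. At least two pairs are required.
    """
    points = sorted(anchors)
    if len(points) < 2:
        raise ValueError("need at least two anchor points")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def calibrate(library_rt):
        # Find the segment containing library_rt. Values outside the
        # anchor range are extrapolated from the nearest end segment.
        i = bisect_right(xs, library_rt) - 1
        i = max(0, min(i, len(xs) - 2))
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (library_rt - xs[i])

    return calibrate


def rt_deviation(calibrate, library_rt, measured_rt):
    """Signed difference between the measured RT and the estimated RT."""
    return measured_rt - calibrate(library_rt)


# Hypothetical anchors: (library RT, measured RT in minutes).
calibrate = make_calibrator([(0, 2), (50, 30), (100, 90)])
print(calibrate(75))                    # 60.0
print(rt_deviation(calibrate, 75, 62))  # 2.0
```

Candidate identifications whose deviation is large compared with the spread seen for confident matches can then be down-weighted or rejected. Interpolating through every anchor is sensitive to noisy anchors, so a smoothing fit is the more robust choice in practice.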