The HIV-1 Env spike is the main protein complex that facilitates HIV-1 entry into CD4+ host cells. HIV-1 entry is a multistep process that is not yet completely understood. This process involves several protein-protein interactions between HIV-1 Env and a variety of host cell receptors, along with many conformational changes within the spike. Owing to its high mutation rate and structural plasticity, HIV-1 Env has developed strategies to escape immense immune pressure and entry inhibitors. We applied a coevolution and residue-residue contact detection method to identify coevolution patterns within HIV-1 Env protein sequences representing all group M subtypes. We identified 424 coevolving residue pairs within HIV-1 Env. The majority of predicted pairs are residue-residue contacts and are proximal in 3D structure. Furthermore, many of the detected pairs have functional implications because they contribute to CD4 or coreceptor binding, or to variable loop, gp120-gp41, and interdomain interactions. This study provides a new dimension of information in HIV research. The identified residue couplings may be important not only in assisting gp120 and gp41 coordinate structure prediction, but also in designing new and effective entry inhibitors that incorporate mutation patterns of HIV-1 Env.
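To make the idea of a coevolving residue pair concrete, the sketch below scores two alignment columns by mutual information. This is a simple stand-in chosen for illustration, not the method used in the study (which the abstract does not name), and the toy alignment and function name are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # Compare the observed joint frequency with the product of marginals.
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Hypothetical toy alignment: one string per sequence, one character per position.
toy_alignment = ["AKLD", "AKLE", "GRLD", "GRLE"]
columns = list(zip(*toy_alignment))

mi_covarying = column_mutual_information(columns[0], columns[1])
mi_independent = column_mutual_information(columns[0], columns[3])
mi_conserved = column_mutual_information(columns[0], columns[2])
```

In the toy alignment, positions 0 and 1 always change together and score high, while a fully conserved position carries no covariation signal. Raw mutual information is also inflated by phylogenetic relatedness and indirect couplings, which is why dedicated contact-prediction methods apply further corrections.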
The current study explores the genetic underpinnings of cardiac arrhythmia phenotypes within Middle Eastern populations, which are under-represented in genomic medicine research.
Whole-genome sequencing data from 14,259 individuals in the Qatar Biobank were used; 47.8% of participants were of Arab ancestry, 18.4% of South Asian ancestry, and 4.6% of African ancestry. The frequency of rare functional variants within a set of 410 candidate genes for cardiac arrhythmias was assessed. Polygenic risk score (PRS) performance for atrial fibrillation (AF) prediction was evaluated.
This study identified 1196 rare functional variants, including 162 previously linked to arrhythmia phenotypes, with varying frequencies across Arab, South Asian, and African ancestries. Of these, 137 variants met the pathogenic or likely pathogenic (P/LP) criteria according to ACMG guidelines. Of these, 91 were in ACMG actionable genes and were present in 1030 individuals (~7%). Ten P/LP variants showed significant associations with atrial fibrillation (p < 2.4 × 10^…). Five out of ten existing PRSs were significantly associated with AF (e.g., PGS000727, p = 0.03, OR = 1.43 [1.03, 1.97]).
This is the largest study to date of the genetic predisposition to arrhythmia phenotypes in the Middle East using whole-genome sequence data. It underscores the importance of including diverse populations in genomic investigations to elucidate the genetic landscape of cardiac arrhythmias and mitigate health disparities in genomic medicine.
Resting electrocardiogram (ECG) is a valuable non-invasive diagnostic tool used in clinical medicine to assess the electrical activity of the heart while the patient is resting. Abnormalities in ECG may be associated with clinical biomarkers and can predict early stages of diseases. In this study, we evaluated the association between ECG traits, clinical biomarkers, and diseases and developed risk scores to predict the risk of developing coronary artery disease (CAD) in the Qatar Biobank.
This study used 12-lead ECG data from 13,827 participants. The ECG traits used for association analysis were RR, PR, QRS, QTc, PW, and JT. Association analysis using regression models was conducted between ECG variables and serum electrolytes, sugars, lipids, blood pressure (BP), blood and inflammatory biomarkers, and diseases (e.g., type 2 diabetes, CAD, and stroke). ECG-based and clinical risk scores were developed, and their performance was assessed to predict CAD. Classical regression and machine-learning models were used for risk score development.
Significant associations were observed with ECG traits. RR showed the largest number of associations: e.g., positive associations with bicarbonate, chloride, HDL-C, and monocytes, and negative associations with glucose, insulin, neutrophils, calcium, and risk of T2D. QRS was positively associated with phosphorus, bicarbonate, and risk of CAD. Elevated QTc was observed in CAD patients, whereas decreased QTc was correlated with decreased levels of calcium and potassium. Risk scores developed using regression models were outperformed by machine-learning models. The area under the receiver operating characteristic curve reached 0.84 using a machine-learning model that contains ECG traits, sugars, lipids, serum electrolytes, and cardiovascular disease risk factors. The odds ratio for the top decile of CAD risk score compared to the remaining deciles was 13.99.
ECG abnormalities were associated with serum electrolytes, sugars, lipids, and blood and inflammatory biomarkers. These abnormalities were also observed in T2D and CAD patients. The risk scores predicted CAD well.
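The top-decile comparison reported above can be illustrated with a 2×2 table. The following is a minimal sketch, assuming a standard odds ratio with a Wald 95% confidence interval; the counts are invented for illustration and are not taken from the study, and the function name is hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: cases in the top decile      b: non-cases in the top decile
    c: cases in the other deciles   d: non-cases in the other deciles
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 100 people in the top decile, 900 in the rest.
or_top, ci_low, ci_high = odds_ratio_with_ci(a=30, b=70, c=45, d=855)
print(f"OR = {or_top:.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```

The published analysis may have used a regression model with covariate adjustment instead, in which case the estimate would differ from this unadjusted calculation.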
Germline genetic variants modulate human immune response. We present analytical pipelines for assessing the contribution of hosts’ genetic background to the immune landscape of solid tumors using harmonized data from more than 9,000 patients in The Cancer Genome Atlas (TCGA). These include protocols for heritability, genome-wide association studies (GWAS), colocalization, and rare variant analyses. These workflows are developed around the structure of TCGA but can be adapted to explore other repositories or in the context of cancer immunotherapy.
For complete details on the use and execution of this protocol, please refer to Sayaman et al. (2021).
• Pipelines for assessing the contribution of germline genetics on tumor immune contexture
• Workflow for data download, processing, assembly, curation, and annotation
• Protocols for heritability, GWAS, colocalization, and rare variant analysis
• Visualization tools for exploration of the results by iAtlas and PheWeb
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Coronary heart disease (CHD) is a major cause of death in Middle Eastern (ME) populations, yet current studies of the metabolic fingerprints of CHD lack diversity. Identification of specific biomarkers to uncover potential mechanisms for developing predictive models and targeted therapies for CHD is urgently needed for the least-studied ME populations. A case-control study was carried out in a cohort of 1001 CHD patients and 2999 controls. Untargeted metabolomics was used, generating 1159 metabolites. Univariate and pathway enrichment analyses were performed to understand functional changes in CHD. A metabolite risk score (MRS) was developed using multivariate analysis and machine learning, and its predictive performance for CHD was assessed. A total of 511 metabolites were significantly different between the CHD patients and the controls (FDR p < 0.05). The enriched pathways (FDR p < 10⁻³⁰⁰) included D-arginine and D-ornithine metabolism, glycolysis, oxidation and degradation of branched chain fatty acids, and sphingolipid metabolism. MRS showed good discriminative power between the CHD cases and the controls (AUC = 0.99). In this first study in the Middle East, known and novel circulating metabolites and metabolic pathways associated with CHD were identified. A small panel of metabolites can efficiently discriminate CHD cases and controls and therefore can be used as a diagnostic/predictive tool.
Abstract
Motivation
Protein solubility plays a vital role in pharmaceutical research and production yield. For a given protein, the extent of its solubility can represent the quality of its function, and is ultimately defined by its sequence. Thus, it is imperative to develop novel, highly accurate in silico sequence-based protein solubility predictors. In this work, we propose DeepSol, a novel deep learning-based protein solubility predictor. The backbone of our framework is a convolutional neural network that exploits k-mer structure and additional sequence and structural features extracted from the protein sequence.
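For readers unfamiliar with the term, a k-mer is a contiguous stretch of k residues. The sketch below only counts overlapping k-mers in a sequence; it is an illustration of the concept, not DeepSol's network or feature pipeline, and the sequence and function name are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (window of k residues) in a sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n has n - k + 1 overlapping windows (none if k > n).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy protein sequence in one-letter amino acid code.
toy_sequence = "MKHHHKL"
counts = kmer_counts(toy_sequence, 3)
print(counts.most_common())
```

A convolutional filter of width k sliding over an encoded sequence responds to the same windows, which is one way to read the phrase "exploits k-mer structure".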
Results
DeepSol outperformed all known sequence-based state-of-the-art solubility prediction methods and attained an accuracy of 0.77 and a Matthews correlation coefficient of 0.55. The superior prediction accuracy of DeepSol allows screening for sequences with enhanced production capacity and more reliable prediction of the solubility of novel proteins.
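Accuracy and the Matthews correlation coefficient (MCC) are both derived from a binary confusion matrix. The sketch below applies the standard definitions; the counts are invented for illustration and do not reproduce DeepSol's test set, and the function names are hypothetical.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: report 0 when a row or column of the table is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion matrix for a soluble-versus-insoluble classifier.
toy_accuracy = accuracy(tp=45, tn=40, fp=10, fn=5)
toy_mcc = matthews_corrcoef(tp=45, tn=40, fp=10, fn=5)
print(f"accuracy = {toy_accuracy:.2f}, MCC = {toy_mcc:.2f}")
```

MCC ranges from -1 to 1 and stays informative when the two classes are imbalanced, which is why it is usually reported alongside accuracy.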
Availability and implementation
DeepSol's best performing models and results are publicly deposited at https://doi.org/10.5281/zenodo.1162886 (Khurana and Mall, 2018).
Supplementary information
Supplementary data are available at Bioinformatics online.
Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate, and series of trial-and-error settings, many in silico methods have been developed to predict the crystallization propensity of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can cause the feature space to explode. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins that can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not.
Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy, and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement in recall of 1.4% and 12.1% compared with its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains average improvements of 2.1% and 6.0% in F-score, 1.9% and 3.9% in accuracy, and 3.8% and 7.0% in MCC relative to Crysalis II and Crysf, respectively, on the independent test sets.
The standalone source code and models are available at https://github.com/elbasir/DeepCrystal and a web-server is also available at https://deeplearning-protein.qcri.org.
Supplementary data are available at Bioinformatics online.
Abstract
Motivation
Protein solubility can be a decisive factor in both research and production efficiency, and in silico sequence-based predictors that can accurately estimate solubility outcomes are highly sought.
Results
In this study, we present a novel approach termed PRotein SolubIlity Predictor (PaRSnIP), which uses a gradient boosting machine algorithm as well as an approximation of sequence and structural features of the protein of interest. Based on an independent test set, PaRSnIP outperformed other state-of-the-art sequence-based methods by more than 9% in accuracy and 0.17 in Matthews correlation coefficient, with an overall accuracy of 74% and a Matthews correlation coefficient of 0.48. Additionally, PaRSnIP provides importance scores for all features used in training. We observed that higher fractions of exposed residues associate positively with protein solubility, whereas tripeptide stretches with multiple histidines associate negatively with solubility. The improved prediction accuracy of PaRSnIP should enable it to predict protein solubility with greater reliability and to screen for sequence variants with enhanced manufacturability.
Availability and implementation
PaRSnIP software is available for download from GitHub (https://github.com/RedaRawi/PaRSnIP).
Supplementary information
Supplementary data are available at Bioinformatics online.
Introduction: The potential use of polygenic risk scores (PRSs) in clinical practice is tempered by concern about their portability among diverse populations. To prevent disparities in genomic medicine, there is an urgent need to conduct genome-wide association studies in non-European ancestry cohorts. Methods: We conducted whole genome sequencing (WGS) with 30x coverage on coronary heart disease (CHD) patients (n = 1,067; mean age ± SD = 59.96 ± 10.99 years; 70.32% males) and controls (n = 6,170; mean age ± SD = 40.02 ± 12.56 years; 43.45% males) in a Middle Eastern cohort to compare the performance of available PRSs for CHD (LDpred, metaGRS, lassosum, and P+T) and to identify common variants associated with CHD (via generalized linear mixed models). Results: Except for lassosum, all PRSs performed well. LDpred and metaGRS performed similarly (AUC ≈ 0.685) and outperformed P+T (AUC = 0.667). Based on the OR per 1 SD increase (OR_1sd), P+T (OR_1sd = 1.85 [1.69-2.02], P = 3.69 × 10⁻⁴¹) outperformed the other PRSs (OR_1sd = 1.61 [1.48-1.74], P = 3.02 × 10⁻³¹ for LDpred and OR_1sd = 1.61 [1.49-1.75], P = 9.47 × 10⁻³¹ for metaGRS). After binning PRSs into deciles, the odds of CHD in the top decile compared to all others were highest for metaGRS (3.87 [3.07-4.88]) and LDpred (3.45 [2.74-4.341]). Thirty-two known GWAS loci (e.g., ABCG8, CELSR2, and SLC22A4) were replicated in our analysis with P < 0.05. Seven suggestive new loci/genes (P < 10⁻⁶) with relevant biological function were identified (e.g., CORO7, RBM47, and PDE4D). The well-established 9p21 locus was not replicated. Conclusions: Genome-wide PRSs derived from European ancestry cohorts performed well in a Middle Eastern cohort. Further studies are needed to develop and validate an ancestry-specific PRS and to confirm the suggestive loci/genes.
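The AUC values used to compare PRSs can be read as the probability that a randomly chosen case has a higher score than a randomly chosen control. The following is a minimal sketch of that pairwise definition, assuming binary labels (1 = case, 0 = control); the scores are invented for illustration, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of case-control pairs ranked correctly (ties count half)."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                correct += 1.0
            elif case == control:
                correct += 0.5
    return correct / (len(case_scores) * len(control_scores))


# Hypothetical polygenic risk scores for three cases and three controls.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(f"AUC = {toy_auc:.3f}")
```

The nested loop is quadratic in cohort size, so for thousands of participants a rank-based implementation from a statistics library would be the practical choice. AUC and OR per SD measure different things, which is why P+T can rank lowest on one metric and highest on the other in the results above.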