Technical variation plays an important role in microarray-based gene expression studies, and batch effects explain a large proportion of this noise. It is therefore essential to eliminate technical variation while maintaining biological variability. Several strategies have been proposed for the removal of batch effects, although they have not been evaluated in large-scale longitudinal gene expression data. In this study, we aimed to identify a suitable method for batch effect removal in a large study of microarray-based longitudinal gene expression. Monocytic gene expression was measured in 1092 participants of the Gutenberg Health Study at baseline and at 5-year follow-up. Replicates of selected samples were measured at both time points to identify technical variability. Deming regression, Passing-Bablok regression, linear mixed models, non-linear models as well as ReplicateRUV and ComBat were applied to eliminate batch effects between replicates. In a second step, quantile normalization prior to batch effect correction was performed for each method. Technical variation between batches was evaluated by principal component analysis. Associations between body mass index and transcriptomes were calculated before and after batch removal. Results from association analyses were compared to evaluate maintenance of biological variability. Quantile normalization, separately performed in each batch, combined with ComBat successfully reduced batch effects and maintained biological variability. ReplicateRUV performed perfectly in the replicate data subset of the study, but failed when applied to all samples. None of the other methods substantially reduced batch effects in the replicate data subset. Quantile normalization plus ComBat appears to be a valuable approach for batch correction in longitudinal gene expression data.
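The per-batch quantile normalization step described above can be illustrated with a minimal sketch. It covers only the normalization step, not ComBat, and it is not the authors' pipeline. The function names, the tie-breaking rule and the toy data are assumptions made for illustration.

```python
from statistics import mean

def quantile_normalize(samples):
    """Quantile-normalize equally long expression vectors (one list per sample)."""
    n_genes = len(samples[0])
    # Reference distribution: mean across samples at each rank.
    sorted_samples = [sorted(s) for s in samples]
    reference = [mean(s[r] for s in sorted_samples) for r in range(n_genes)]
    normalized = []
    for s in samples:
        # Gene indices ordered by expression (ties broken by position: an assumption).
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized

def normalize_per_batch(samples, batches):
    """Apply quantile normalization separately within each batch label."""
    result = [None] * len(samples)
    for batch in set(batches):
        idx = [i for i, label in enumerate(batches) if label == batch]
        batch_norm = quantile_normalize([samples[i] for i in idx])
        for i, vec in zip(idx, batch_norm):
            result[i] = vec
    return result

# Hypothetical toy data: 4 samples x 3 genes, two batches.
samples = [[5, 2, 3], [4, 1, 6], [10, 20, 30], [30, 10, 20]]
batches = ["baseline", "baseline", "followup", "followup"]
normalized = normalize_per_batch(samples, batches)
```

After this step, all samples within a batch share the same empirical distribution, so any remaining between-batch shift would be left to a dedicated correction method such as ComBat.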
Variability of gene expression in humans may link gene sequence variability and phenotypes; however, non-genetic variations, alone or in combination with genetics, may also influence expression traits and have a critical role in physiological and disease processes.
To get better insight into the overall variability of gene expression, we assessed the transcriptome of circulating monocytes, a key cell involved in immunity-related diseases and atherosclerosis, in 1,490 unrelated individuals and investigated its association with >675,000 SNPs and 10 common cardiovascular risk factors. Out of 12,808 expressed genes, 2,745 expression quantitative trait loci were detected (P < 5.78 × 10⁻¹²), most of them (90%) being cis-modulated. Extensive analyses showed that associations identified by genome-wide association studies of lipids, body mass index or blood pressure were rarely compatible with a mediation by monocyte expression level at the locus. At a study-wide level (P < 3.9 × 10⁻⁷), 1,662 expression traits (13.0%) were significantly associated with at least one risk factor. Genome-wide interaction analyses suggested that genetic variability and risk factors mostly acted additively on gene expression. Because of the structure of correlation among expression traits, the variability of risk factors could be characterized by a limited set of independent gene expressions which may have biological and clinical relevance. For example, expression traits associated with cigarette smoking were more strongly associated with carotid atherosclerosis than smoking itself.
This study demonstrates that the monocyte transcriptome is a potent integrator of genetic and non-genetic influences of relevance for disease pathophysiology and risk assessment.
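The abstract does not say how its two significance thresholds were derived. One reading that reproduces both is a Bonferroni correction of 0.05 over all gene-by-SNP and gene-by-risk-factor tests. The sketch below checks that arithmetic. It assumes exactly 675,000 SNPs, although the text only gives ">675,000", and the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

n_genes = 12_808
n_snps = 675_000        # assumption: the abstract only states ">675,000"
n_risk_factors = 10

eqtl_threshold = bonferroni_threshold(0.05, n_genes * n_snps)
risk_factor_threshold = bonferroni_threshold(0.05, n_genes * n_risk_factors)

print(f"{eqtl_threshold:.2e}")         # 5.78e-12
print(f"{risk_factor_threshold:.2e}")  # 3.90e-07
```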
Microarray profiling of gene expression is widely applied in molecular biology and functional genomics. Experimental and technical variations make meta-analysis of different studies challenging. In a total of 3358 samples, all from German population-based cohorts, we investigated the effect of data preprocessing and the variability due to sample processing in whole blood cell and blood monocyte gene expression data, measured on the Illumina HumanHT-12 v3 BeadChip array. Gene expression signal intensities were similar after applying the log2 or the variance-stabilizing transformation. In all cohorts, the first principal component (PC) explained more than 95% of the total variation. Technical factors substantially influenced signal intensity values, especially the Illumina chip assignment (33-48% of the variance), the RNA amplification batch (12-24%), the RNA isolation batch (16%), and the sample storage time, in particular the time between blood donation and RNA isolation for the whole blood cell samples (2-3%), and the time between RNA isolation and amplification for the monocyte samples (2%). White blood cell composition parameters were the strongest biological factors influencing the expression signal intensities in the whole blood cell samples (3%), followed by sex (1-2%) in both sample types. Known single nucleotide polymorphisms (SNPs) were located in 38% of the analyzed probe sequences and 4% of them included common SNPs (minor allele frequency >5%). Out of the tested SNPs, 1.4% significantly modified the probe-specific expression signals (Bonferroni-corrected p-value <0.05), but in almost half of these events the signal intensities were increased despite the occurrence of the mismatch. Thus, the vast majority of SNPs within probes had no significant effect on hybridization efficiency. In summary, adjustment for a few selected technical factors greatly improved reliability of gene expression analyses.
Such adjustments are particularly required for meta-analyses.
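The abstract reports the share of variance attributable to factors such as chip assignment without stating the estimator. A minimal sketch of one common choice is the between-group sum of squares divided by the total sum of squares (eta squared) for a single categorical factor. The function name and the toy values are assumptions, not the study's method or data.

```python
from statistics import mean

def variance_explained(values, labels):
    """Fraction of total variance explained by a categorical factor (eta squared)."""
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    return ss_between / ss_total

# Hypothetical signal intensities for one probe, measured on two chips.
intensities = [1, 2, 3, 4]
chips = ["chip_A", "chip_A", "chip_B", "chip_B"]
share = variance_explained(intensities, chips)
```

In this toy case most of the spread lies between the two chips, which is the pattern the abstract describes for chip assignment and amplification batch.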
Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex linkage disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants.
This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Embryonic stem (ES) cells have the potential to differentiate into all cell types and are considered a valuable source of cells for transplantation therapies. A critical issue, however, is the risk of teratoma formation after transplantation. The effect of the immune response on the tumorigenicity of transplanted cells is poorly understood. We have systematically compared the tumorigenicity of mouse ES cells and in vitro differentiated neuronal cells in various recipients. Subcutaneous injection of 1 × 10⁶ ES or differentiated cells into syngeneic or allogeneic immunodeficient mice resulted in teratomas in about 95% of the recipients. Neither cell type gave rise to tumors in immunocompetent allogeneic mice or xenogeneic rats. However, in 61% of cyclosporine A-treated rats teratomas developed after injection of differentiated cells. Undifferentiated ES cells did not give rise to tumors in these rats. ES cells turned out to be highly susceptible to killing by rat natural killer (NK) cells due to the expression of ligands of the activating NK receptor NKG2D on ES cells. These ligands were down-regulated on differentiated cells. The activity of NK cells, which is not suppressed by cyclosporine A, might contribute to the prevention of teratomas after injection of ES cells but not after inoculation of differentiated cells. These findings clearly point to the importance of the immune response in this process. Interestingly, the differentiated cells must contain a tumorigenic cell population that is not present among ES cells and which might be resistant to NK cell-mediated killing.
Introduction Sexually transmitted infections (STIs) cause considerable morbidity worldwide and, depending on the specific pathogen, may lead to serious complications in the female reproductive tract. Incarcerated women are particularly vulnerable to health problems, with a disproportionately high rate of STIs, including infections with human papillomavirus (HPV). Methods Here, cervical swab samples collected from 299 women (18 to 64 years) living in one of the women’s prisons of São Paulo, Brazil were submitted for liquid-based cytology to determine the prevalence of precancerous lesions. Furthermore, direct detection of 30 genital HPV genotypes (18 high-risk and 12 low-risk types) and 11 additional STIs (Chlamydia trachomatis, Neisseria gonorrhoeae, Herpes simplex virus 1 and 2, Haemophilus ducreyi, Mycoplasma genitalium and hominis, Treponema pallidum, Trichomonas vaginalis, Ureaplasma parvum and urealyticum) was performed by molecular typing using two PCR-based DNA microarray systems, i.e., EUROArray HPV and EUROArray STI (EUROIMMUN), respectively. Results The overall prevalence of cytological abnormalities was 5.8%, including five women with low-grade and five women with high-grade squamous intraepithelial lesions. The overall prevalence of HPV was 62.2%, and 87.1% of the HPV-positive women were infected with oncogenic high-risk (HR) HPV types. HPV types 16 (24.1%), 33 and 52 (both 10.4%) were the most frequently detected. The prevalence of the other STIs was 72.8%. Up to four different pathogens were found in the infected women, the most frequent being Ureaplasma parvum (45.3%), Mycoplasma hominis (36.2%) and Trichomonas vaginalis (24.8%). Conclusion The high number of HR-HPV infections and other STIs described here highlights the fact that the Brazilian female prison population requires more attention in the country’s health policies.
The implementation of screening programs and treatment measures might contribute to a decrease in the incidence of STIs and cervical cancer in this vulnerable population. However, for such measures to be effective, further studies are needed to investigate the best practice to get more women to engage in in-prison prevention programs, e.g., through offering further sexual health education and self-sampling.
Prognostic models based on survival data frequently make use of the Cox proportional hazards model. Developing reliable Cox models with few events relative to the number of predictors can be challenging, even in low-dimensional datasets with a much larger number of observations than variables. In such a setting we examined the performance of methods used to estimate a Cox model, including (i) full model using all available predictors and estimated by standard techniques, (ii) backward elimination (BE), (iii) ridge regression, (iv) least absolute shrinkage and selection operator (lasso), and (v) elastic net. Based on a prospective cohort of patients with manifest coronary artery disease (CAD), we performed a simulation study to compare the predictive accuracy, calibration, and discrimination of these approaches. The candidate predictors for incident cardiovascular events included clinical variables, biomarkers, and a selection of genetic variants associated with CAD. The penalized methods, i.e., ridge, lasso, and elastic net, showed comparable performance in terms of predictive accuracy, calibration, and discrimination, and outperformed BE and the full model. Excessive shrinkage was observed in some cases for the penalized methods, mostly in the simulation scenarios with the lowest ratio of the number of events to the number of variables. We conclude that in similar settings, these three penalized methods can be used interchangeably. The full model and backward elimination are not recommended in rare event scenarios.
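The three penalized methods differ only in the penalty added to the negative log partial likelihood. The sketch below writes that penalty in one common parametrization, the one used by glmnet-style software, where a mixing weight of 0 gives ridge and 1 gives lasso. The names `lam` and `alpha`, the helper function and the coefficient values are assumptions for illustration. The abstract does not specify the parametrization or software used.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).

    alpha = 0 reduces to the ridge penalty, alpha = 1 to the lasso penalty.
    """
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical log hazard ratios for three predictors.
beta = [0.5, -1.0, 0.0]

ridge_penalty = elastic_net_penalty(beta, lam=0.1, alpha=0.0)
lasso_penalty = elastic_net_penalty(beta, lam=0.1, alpha=1.0)
mixed_penalty = elastic_net_penalty(beta, lam=0.1, alpha=0.5)
```

A larger `lam` shrinks all coefficients more strongly, which is the mechanism behind the excessive shrinkage noted for scenarios with few events per variable.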
Summary Background High plasma HDL cholesterol is associated with reduced risk of myocardial infarction, but whether this association is causal is unclear. Exploiting the fact that genotypes are randomly assigned at meiosis, are independent of non-genetic confounding, and are unmodified by disease processes, mendelian randomisation can be used to test the hypothesis that the association of a plasma biomarker with disease is causal. Methods We performed two mendelian randomisation analyses. First, we used as an instrument a single nucleotide polymorphism (SNP) in the endothelial lipase gene (LIPG Asn396Ser) and tested this SNP in 20 studies (20 913 myocardial infarction cases, 95 407 controls). Second, we used as an instrument a genetic score consisting of 14 common SNPs that exclusively associate with HDL cholesterol and tested this score in up to 12 482 cases of myocardial infarction and 41 331 controls. As a positive control, we also tested a genetic score of 13 common SNPs exclusively associated with LDL cholesterol. Findings Carriers of the LIPG 396Ser allele (2·6% frequency) had higher HDL cholesterol (0·14 mmol/L higher, p=8×10⁻¹³) but similar levels of other lipid and non-lipid risk factors for myocardial infarction compared with non-carriers. This difference in HDL cholesterol is expected to decrease risk of myocardial infarction by 13% (odds ratio [OR] 0·87, 95% CI 0·84–0·91). However, we noted that the 396Ser allele was not associated with risk of myocardial infarction (OR 0·99, 95% CI 0·88–1·11, p=0·85). From observational epidemiology, an increase of 1 SD in HDL cholesterol was associated with reduced risk of myocardial infarction (OR 0·62, 95% CI 0·58–0·66). However, a 1 SD increase in HDL cholesterol due to genetic score was not associated with risk of myocardial infarction (OR 0·93, 95% CI 0·68–1·26, p=0·63).
For LDL cholesterol, the estimate from observational epidemiology (a 1 SD increase in LDL cholesterol associated with OR 1·54, 95% CI 1·45–1·63) was concordant with that from genetic score (OR 2·13, 95% CI 1·69–2·69, p=2×10⁻¹⁰). Interpretation Some genetic mechanisms that raise plasma HDL cholesterol do not seem to lower risk of myocardial infarction. These data challenge the concept that raising of plasma HDL cholesterol will uniformly translate into reductions in risk of myocardial infarction. Funding US National Institutes of Health, The Wellcome Trust, European Union, British Heart Foundation, and the German Federal Ministry of Education and Research.
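The abstract does not spell out how the expected 13% risk reduction for LIPG 396Ser carriers follows from the observational estimate. A minimal sketch of one plausible reading is to scale the per-SD odds ratio linearly on the log-odds scale. Under that assumption, an observational OR of 0·62 per SD and an expected OR of 0·87 would correspond to a carrier difference of roughly 0·29 SD. The function names are hypothetical, and the SD of HDL cholesterol is not given in the text.

```python
import math

def expected_odds_ratio(or_per_sd, delta_in_sd):
    """Scale a per-SD odds ratio to another exposure difference (log-linear assumption)."""
    return math.exp(math.log(or_per_sd) * delta_in_sd)

def implied_delta_in_sd(or_per_sd, target_or):
    """Exposure difference, in SD units, that would yield target_or under the same assumption."""
    return math.log(target_or) / math.log(or_per_sd)

observational_or = 0.62   # per 1 SD higher HDL cholesterol (from the abstract)
expected_or = 0.87        # expected OR for 396Ser carriers (from the abstract)

delta_sd = implied_delta_in_sd(observational_or, expected_or)
round_trip = expected_odds_ratio(observational_or, delta_sd)
```

The comparison in the abstract is then between this expected OR and the observed OR of 0·99 for the same allele.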
We performed a meta-analysis of 14 genome-wide association studies of coronary artery disease (CAD) comprising 22,233 individuals with CAD (cases) and 64,762 controls of European descent followed by genotyping of top association signals in 56,682 additional individuals. This analysis identified 13 loci newly associated with CAD at P < 5 × 10⁻⁸ and confirmed the association of 10 of 12 previously reported CAD loci. The 13 new loci showed risk allele frequencies ranging from 0.13 to 0.91 and were associated with a 6% to 17% increase in the risk of CAD per allele. Notably, only three of the new loci showed significant association with traditional CAD risk factors and the majority lie in gene regions not previously implicated in the pathogenesis of CAD. Finally, five of the new CAD risk loci appear to have pleiotropic effects, showing strong association with various other human diseases or traits.
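The abstract does not state which meta-analysis model was used. A minimal sketch of one standard option is a fixed-effect inverse-variance combination of per-study log odds ratios for a single SNP, compared against the genome-wide threshold of 5 × 10⁻⁸. The function name and the per-study estimates are hypothetical.

```python
import math

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance fixed-effect pooling of per-study effect estimates."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value

# Hypothetical per-allele log odds ratios and standard errors from two studies.
pooled_beta, pooled_se, p_value = fixed_effect_meta([0.10, 0.10], [0.02, 0.02])
per_allele_or = math.exp(pooled_beta)
genome_wide_significant = p_value < 5e-8
```

With these toy inputs the pooled per-allele odds ratio is about 1.11, which falls within the 6% to 17% range of risk increases reported for the new loci.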