We review the problem of confounding in genetic association studies, which arises principally because of population structure and cryptic relatedness. Many treatments of the problem consider only a simple "island" model of population structure. We take a broader approach, which views population structure and cryptic relatedness as different aspects of a single confounder: the unobserved pedigree defining the (often distant) relationships among the study subjects. Kinship is therefore a central concept, and we review methods of defining and estimating kinship coefficients, both pedigree-based and marker-based. In this unified framework we review solutions to the problem of population structure, including family-based study designs, genomic control, structured association, regression control, principal components adjustment and linear mixed models. The last solution makes the most explicit use of the kinships among the study subjects, and has an established role in the analysis of animal and plant breeding studies. Recent computational developments mean that analyses of human genetic association data are beginning to benefit from its powerful tests for association, which protect against population structure and cryptic kinship, as well as intermediate levels of confounding by the pedigree.
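Of the solutions listed above, genomic control is the simplest to illustrate. The following is a minimal sketch, not the review's own code: it assumes 1-degree-of-freedom chi-square association statistics, estimates the inflation factor as the ratio of their observed median to the theoretical median, and rescales every statistic. The function name `genomic_control` and the input values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for 1-df association tests.

    lambda is the observed median statistic divided by the null median.
    By a common convention (an assumption here) lambda is floored at 1,
    so the adjustment never makes a test more significant.
    """
    lam = max(median(chi2_stats) / CHI2_1DF_MEDIAN, 1.0)
    return lam, [x / lam for x in chi2_stats]


# Made-up test statistics, for illustration only.
stats = [0.1, 0.5, 0.9, 1.4, 3.0, 12.0]
lam, adjusted = genomic_control(stats)
print(f"lambda = {lam:.3f}")
```

A single scalar correction of this kind treats all markers as equally confounded, which is one reason the abstract singles out linear mixed models as making more explicit use of kinship.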
There is an increasing body of work exploring the integration of random projection into algorithms for numerical linear algebra. The primary motivation is to reduce the overall computational cost of processing large datasets. A suitably chosen random projection can be used to embed the original dataset in a lower-dimensional space such that key properties of the original dataset are retained. These algorithms are often referred to as sketching algorithms, as the projected dataset can be used as a compressed representation of the full dataset. We show that random matrix theory, in particular the Tracy–Widom law, is useful for describing the operating characteristics of sketching algorithms in the tall-data regime when the sample size n is much greater than the number of variables d. Asymptotic large sample results are of particular interest as this is the regime where sketching is most useful for data compression. In particular, we develop asymptotic approximations for the success rate in generating random subspace embeddings and the convergence probability of iterative sketching algorithms. We test a number of sketching algorithms on real large high-dimensional datasets and find that the asymptotic expressions give accurate predictions of the empirical performance.
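The notion of a successful subspace embedding can be made concrete with a small simulation. The sketch below is illustrative only and assumes a Gaussian sketching matrix S of size k × n with variance 1/k; the abstract does not say which sketches were studied. S is an ε-subspace embedding for A when every squared singular value of SQ lies in [1 − ε, 1 + ε], where Q is an orthonormal basis for the column space of A. The function names, dimensions and ε are hypothetical choices.

```python
import numpy as np


def sketch_singular_values(A, k, rng):
    """Singular values of S @ Q for a Gaussian sketch S (k x n)."""
    n, _ = A.shape
    Q, _ = np.linalg.qr(A)                        # n x d orthonormal basis
    S = rng.standard_normal((k, n)) / np.sqrt(k)  # entries have variance 1/k
    return np.linalg.svd(S @ Q, compute_uv=False)


def is_subspace_embedding(A, k, eps, rng):
    """True if (1-eps)|Ax|^2 <= |SAx|^2 <= (1+eps)|Ax|^2 for all x."""
    sv2 = sketch_singular_values(A, k, rng) ** 2
    return bool(sv2.min() >= 1 - eps and sv2.max() <= 1 + eps)


# Tall data: n much greater than d (simulated, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))

# Empirical success rate over repeated independent sketches.
trials = 20
rate = np.mean([is_subspace_embedding(A, 400, 0.5, rng) for _ in range(trials)])
print(f"empirical success rate: {rate:.2f}")
```

Because success depends only on the extreme eigenvalues of a random matrix, the fluctuations of the largest eigenvalue are what a Tracy–Widom approximation would describe.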
Summary
Low platelet count, or thrombocytopenia, is a common haematological abnormality with a wide differential diagnosis, which may represent a clinically significant underlying pathology. Macrothrombocytopenia, the presence of large platelets in combination with thrombocytopenia, can be acquired or hereditary and indicative of a complex disorder. In this review, we discuss the interpretation of platelet count and volume measured by automated haematology analysers and highlight some important technical considerations relevant to the analysis of blood samples with macrothrombocytopenia. We review how large cohorts, such as the UK Biobank and INTERVAL studies, have enabled an accurate description of the distribution and co‐variation of platelet parameters in adult populations. We discuss how genome‐wide association studies have identified hundreds of genetic associations with platelet count and mean platelet volume, which in aggregate can explain large fractions of phenotypic variance, consistent with a complex genetic architecture and polygenic inheritance. Finally, we describe the large genetic diagnostic and discovery programmes, which, in parallel with genome‐wide association studies, have expanded the repertoire of genes and variants associated with extreme platelet phenotypes. These have advanced our understanding of the pathogenesis of hereditary macrothrombocytopenia and support a future clinical diagnostic strategy that utilises genotype alongside clinical and laboratory phenotype data.
Most methods for estimating differential expression from RNA-seq are based on statistics that compare normalized read counts between treatment classes. Unfortunately, reads are in general too short to be mapped unambiguously to features of interest, such as genes, isoforms or haplotype-specific isoforms. There are methods for estimating expression levels that account for this source of ambiguity. However, the uncertainty is not generally accounted for in downstream analysis of gene expression experiments. Moreover, at the individual transcript level, it can sometimes be too large to allow useful comparisons between treatment groups.
In this article we make two proposals that improve the power, specificity and versatility of expression analysis using RNA-seq data. First, we present a Bayesian method for model selection that accounts for read mapping ambiguities using random effects. This polytomous model selection approach can be used to identify many interesting patterns of gene expression and is not confined to detecting differential expression between two groups. For illustration, we use our method to detect imprinting, different types of regulatory divergence in cis and in trans and differential isoform usage, but many other applications are possible. Second, we present a novel collapsing algorithm for grouping transcripts into inferential units that exploits the posterior correlation between transcript expression levels. The aggregate expression levels of these units can be estimated with useful levels of uncertainty. Our algorithm can improve the precision of expression estimates when uncertainty is large with only a small reduction in biological resolution.
We have implemented our software in the mmdiff and mmcollapse multithreaded C++ programs as part of the open-source MMSEQ package, available on https://github.com/eturro/mmseq.
Data processing for 1D NMR spectra is a key bottleneck for metabolomic and other complex-mixture studies, particularly where quantitative data on individual metabolites are required. We present a protocol for automated metabolite deconvolution and quantification from complex NMR spectra by using the Bayesian automated metabolite analyzer for NMR (BATMAN) R package. BATMAN models resonances on the basis of a user-controllable set of templates, each of which specifies the chemical shifts, J-couplings and relative peak intensities for a single metabolite. Peaks are allowed to shift position slightly between spectra, and peak widths are allowed to vary by user-specified amounts. NMR signals not captured by the templates are modeled non-parametrically by using wavelets. The protocol covers setting up user template libraries, optimizing algorithmic input parameters, improving prior information on peak positions, quality control and evaluation of outputs. The outputs include relative concentration estimates for named metabolites together with associated Bayesian uncertainty estimates, as well as the fit of the remainder of the spectrum using wavelets. Graphical diagnostics allow the user to examine the quality of the fit for multiple spectra simultaneously. This approach offers a workflow to analyze large numbers of spectra and is expected to be useful in a wide range of metabolomics studies.
Genome-wide association studies have identified a genetic variant at 3p14.3 (SNP rs1354034) that strongly associates with platelet number and mean platelet volume in humans. While originally proposed to be intronic, analysis of mRNA expression in primary human hematopoietic subpopulations reveals that this SNP is located directly upstream of the predominantly expressed ARHGEF3 isoform in megakaryocytes (MK). We found that ARHGEF3, which encodes a Rho guanine exchange factor, is dramatically upregulated during both human and murine MK maturation. We show that the SNP (rs1354034) is located in a DNase I hypersensitive region in human MKs and is an expression quantitative trait locus (eQTL) associated with ARHGEF3 expression level in human platelets, suggesting that it may be the causal SNP that accounts for the variations observed in human platelet traits and ARHGEF3 expression. In vitro human platelet activation assays revealed that rs1354034 is highly correlated with human platelet activation by ADP. In order to test whether ARHGEF3 plays a role in MK development and/or platelet function, we developed an Arhgef3 KO/LacZ reporter mouse model. Reflecting changes in gene expression, LacZ expression increases during MK maturation in these mice. Although Arhgef3 KO mice have significantly larger platelets, loss of Arhgef3 does not affect baseline MK or platelets nor does it affect platelet function or platelet recovery in response to antibody-mediated platelet depletion compared to littermate controls. In summary, our data suggest that modulation of ARHGEF3 gene expression in humans with a promoter-localized SNP plays a role in human MKs and human platelet function, a finding resulting from the biological follow-up of human genetic studies. Arhgef3 KO mice partially recapitulate the human phenotype.
Nuclear Magnetic Resonance (NMR) spectra are widely used in metabolomics to obtain metabolite profiles in complex biological mixtures. Common methods used to assign and estimate concentrations of metabolites involve either an expert manual peak fitting or extra pre-processing steps, such as peak alignment and binning. Peak fitting is very time consuming and is subject to human error. Conversely, alignment and binning can introduce artefacts and limit immediate biological interpretation of models.
We present the Bayesian automated metabolite analyser for NMR spectra (BATMAN), an R package that deconvolutes peaks from one-dimensional NMR spectra, automatically assigns them to specific metabolites from a target list and obtains concentration estimates. The Bayesian model incorporates information on characteristic peak patterns of metabolites and is able to account for shifts in the position of peaks commonly seen in NMR spectra of biological samples. It applies a Markov chain Monte Carlo algorithm to sample from a joint posterior distribution of the model parameters and obtains concentration estimates with reduced error compared with conventional numerical integration and comparable to manual deconvolution by experienced spectroscopists.
http://www1.imperial.ac.uk/medicine/people/t.ebbels/
t.ebbels@imperial.ac.uk.
Blood cells contain functionally important intracellular structures, such as granules, critical to immunity and thrombosis. Quantitative variation in these structures has not been subjected previously to large-scale genetic analysis. We perform genome-wide association studies of 63 flow-cytometry derived cellular phenotypes (including cell-type specific measures of granularity, nucleic acid content and reactivity) in 41,515 participants in the INTERVAL study. We identify 2172 distinct variant-trait associations, including associations near genes coding for proteins in organelles implicated in inflammatory and thrombotic diseases. By integrating with epigenetic data we show that many intracellular structures are likely to be determined in immature precursor cells. By integrating with proteomic data we identify the transcription factor FOG2 as an early regulator of platelet formation and α-granularity. Finally, we show that colocalisation of our associations with disease risk signals can suggest aetiological cell-types: variants in IL2RA and ITGA4 respectively mirror the known effects of daclizumab in multiple sclerosis and vedolizumab in inflammatory bowel disease.
•PR can be predicted from scattergrams generated by hematology analyzers of a type that is in widespread clinical use.
•Genetic analysis of predicted PR reveals associations with the risk of thrombotic diseases, including stroke.
Genetic studies of platelet reactivity (PR) phenotypes may identify novel antiplatelet drug targets. However, these discoveries have been limited by small sample sizes (n < 5000) because of the complexity of measuring the PR. We trained a model to predict the PR using complete blood count (CBC) scattergrams. A genome-wide association study of this phenotype in 29 806 blood donors identified 21 distinct associations implicating 20 genes, of which 6 have been identified previously. The effect size estimates were significantly correlated with estimates from a study of flow-cytometry-measured PR and a study of the phenotype of in vitro thrombus formation. A genetic score of PR built from the 21 variants was associated with myocardial infarction and pulmonary embolism. Mendelian randomization analyses showed that PR was causally associated with the risks of coronary artery disease, stroke, and venous thromboembolism. Our approach provides a blueprint for using phenotype imputation to study the determinants of hard-to-measure but biologically important hematological traits.
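The abstract does not state which Mendelian randomization estimator was used. As an illustration of the general idea only, the sketch below assumes the widely used fixed-effect inverse-variance weighted (IVW) estimator, which combines per-variant effects on the exposure (here, PR) and on the outcome (a disease risk) into a single causal effect estimate. The function name and all numbers are hypothetical, not data from the study.

```python
from math import sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate.

    Each variant j gives a ratio estimate beta_outcome[j] / beta_exposure[j].
    These are combined with weights beta_exposure[j]**2 / se_outcome[j]**2.
    Returns (causal effect estimate, standard error).
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    total_weight = sum(weights)
    return numerator / total_weight, sqrt(1.0 / total_weight)


# Made-up summary statistics for three variants, for illustration only.
bx = [0.10, 0.20, 0.30]   # variant effects on the exposure
by = [0.05, 0.10, 0.15]   # variant effects on the outcome
se = [0.01, 0.02, 0.01]   # standard errors of the outcome effects
estimate, std_err = ivw_estimate(bx, by, se)
print(f"IVW estimate = {estimate:.3f} (SE {std_err:.3f})")
```

This estimator is valid only under the usual instrumental-variable assumptions, for example that the variants affect the outcome solely through the exposure.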
The von Willebrand receptor complex, which is composed of the glycoproteins Ibα, Ibβ, GPV, and GPIX, plays an essential role in the earliest steps in hemostasis. During the last 4 decades, it has become apparent that loss of function of any 1 of 3 of the genes encoding these glycoproteins (namely, GP1BA, GP1BB, and GP9) leads to autosomal recessive macrothrombocytopenia complicated by bleeding. A small number of variants in GP1BA have been reported to cause a milder and dominant form of macrothrombocytopenia, but only 2 tentative reports exist of such a variant in GP1BB. By analyzing data from a collection of more than 1000 genome-sequenced patients with a rare bleeding and/or platelet disorder, we have identified a significant association between rare monoallelic variants in GP1BB and macrothrombocytopenia. To strengthen our findings, we sought further cases in 2 additional collections in the United Kingdom and Japan. Across 18 families exhibiting phenotypes consistent with autosomal dominant inheritance of macrothrombocytopenia, we report on 27 affected cases carrying 1 of 9 rare variants in GP1BB.
•Variants in GP1BB can cause autosomal dominant macrothrombocytopenia.