Detecting Novel Associations in Large Data Sets
Reshef, David N.; Reshef, Yakir A.; Finucane, Hilary K. ...
Science (American Association for the Advancement of Science), 12/2011, Volume 334, Issue 6062
Journal Article
Peer reviewed
Open access
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations, both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
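The idea of a grid-based, normalized mutual information score can be illustrated with a short sketch. This is not the authors' MINE implementation: it only tries equal-frequency grids rather than searching over all grid placements, so it can only underestimate the true MIC. The function names and the grid budget `n ** alpha` with `alpha=0.6` are assumptions made for illustration.

```python
import numpy as np


def grid_mutual_information(x, y, nx, ny):
    """Mutual information (in bits) of x and y binned on an nx-by-ny equal-frequency grid."""
    n = len(x)
    # Ranks 0..n-1 mapped to bins of (nearly) equal size along each axis.
    bx = (np.argsort(np.argsort(x)) * nx) // n
    by = (np.argsort(np.argsort(y)) * ny) // n
    joint = np.zeros((nx, ny))
    np.add.at(joint, (bx, by), 1.0)          # count points per grid cell
    joint /= joint.sum()                      # joint probabilities
    px = joint.sum(axis=1, keepdims=True)     # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)     # marginal of y, shape (1, ny)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))


def mic_sketch(x, y, alpha=0.6):
    """Simplified MIC-style score: best normalized grid MI over equal-frequency grids.

    Assumption: grids are limited to nx * ny <= n ** alpha cells (hypothetical default).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** alpha
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = grid_mutual_information(x, y, nx, ny)
            # Normalizing by log2(min(nx, ny)) keeps the score in [0, 1].
            best = max(best, mi / np.log2(min(nx, ny)))
    return best
```

A noiseless functional relationship such as `y = x` scores close to 1 under this sketch, while independent noise scores much lower; a faithful implementation would additionally optimize the grid boundaries.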
Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as they do SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.
Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.
Obtaining an accurate measure of how recombination rates vary across the genome has implications for understanding the molecular basis of recombination, its evolutionary significance and the distribution of linkage disequilibrium in natural populations. Although measuring the recombination rate is experimentally challenging, good estimates can be obtained by applying population-genetic methods to DNA sequences taken from natural populations. Statistical methods are now providing insights into the nature and scale of variation in the recombination rate, particularly in humans. Such knowledge will become increasingly important owing to the growing use of population-genetic methods in biomedical research.
Models of molecular evolution that incorporate the ratio of nonsynonymous to synonymous polymorphism (dN/dS ratio) as a parameter can be used to identify sites that are under diversifying selection or functional constraint in a sample of gene sequences. However, when there has been recombination in the evolutionary history of the sequences, reconstructing a single phylogenetic tree is not appropriate, and inference based on a single tree can give misleading results. In the presence of high levels of recombination, the identification of sites experiencing diversifying selection can suffer from a false-positive rate as high as 90%. We present a model that uses a population genetics approximation to the coalescent with recombination and use reversible-jump MCMC to perform Bayesian inference on both the dN/dS ratio and the recombination rate, allowing each to vary along the sequence. We demonstrate that the method has the power to detect variation in the dN/dS ratio and the recombination rate and does not suffer from a high false-positive rate. We use the method to analyze the porB gene of Neisseria meningitidis and verify the inferences using prior sensitivity analysis and model criticism techniques.
Despite the importance of mutation in genetics, there are virtually no experimental data on the occurrence of specific nucleotide substitutions in human gametes. C>G transversions at position 755 of FGF receptor 2 (FGFR2) cause Apert syndrome; this mutation, encoding the gain-of-function substitution Ser252Trp, occurs with a birth rate elevated 200- to 800-fold above background and originates exclusively from the unaffected father. We previously demonstrated high levels of both 755C>G and 755C>T FGFR2 mutations in human sperm and proposed that these particular mutations are enriched because the encoded proteins confer a selective advantage to spermatogonial cells. Here, we examine three corollaries of this hypothesis. First, we show that mutation levels at the adjacent FGFR2 nucleotides 752-754 are low, excluding any general increase in local mutation rate. Second, we present three instances of double-nucleotide changes involving 755C, expected to be extremely rare as chance events. Two of these double-nucleotide substitutions are shown, either by assessment of the pedigree or by direct analysis of sperm, to have arisen in sequential steps; the third (encoding Ser252Tyr) was predicted from structural considerations. Finally, we demonstrate that both major alternative spliceforms of FGFR2 (Fgfr2b and Fgfr2c) are expressed in rat spermatogonial stem cell lines. Taken together, these observations show that specific FGFR2 mutations attain high levels in sperm because they encode proteins with gain-of-function properties, favoring clonal expansion of mutant spermatogonial cells. Among FGFR2 mutations, those causing Apert syndrome may be especially prevalent because they enhance signaling by FGF ligands specific for each of the major expressed isoforms.
The degree of association between alleles at different loci, or linkage disequilibrium, is widely used to infer details of evolutionary processes. Here I explore how associations between alleles relate to properties of the underlying genealogy of sequences. Under the neutral, infinite-sites assumption I show that there is a direct correspondence between the covariance in coalescence times at different parts of the genome and the degree of linkage disequilibrium. These covariances can be calculated exactly under the standard neutral model and by Monte Carlo simulation under different demographic models. I show that the effects of population growth, population bottlenecks, and population structure on linkage disequilibrium can be described through their effects on the covariance in coalescence times.
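For readers unfamiliar with how the association between alleles at two loci is quantified, the following is a minimal sketch of two standard linkage disequilibrium measures, D and r². The abstract does not say which measure the author works with, so the choice here is an assumption, as are the function name and the 0/1 haplotype encoding.

```python
import numpy as np


def linkage_disequilibrium(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: array-like of shape (n_haplotypes, 2), alleles coded 0/1
    (hypothetical encoding chosen for this example).
    """
    h = np.asarray(haplotypes)
    p_a = h[:, 0].mean()                                 # frequency of allele 1 at locus A
    p_b = h[:, 1].mean()                                 # frequency of allele 1 at locus B
    p_ab = np.mean((h[:, 0] == 1) & (h[:, 1] == 1))      # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                 # D: deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r2 is undefined when either locus is monomorphic in the sample.
    r2 = d * d / denom if denom > 0 else float("nan")
    return float(d), float(r2)
```

Perfectly associated loci give r² = 1, and loci whose alleles combine at random give D = 0 and r² = 0.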
Mosaic mutations present in the germline have important implications for reproductive risk and disease transmission. We previously demonstrated a phenomenon occurring in the male germline, whereby specific mutations arising spontaneously in stem cells (spermatogonia) lead to clonal expansion, resulting in elevated mutation levels in sperm over time. This process, termed "selfish spermatogonial selection," explains the high spontaneous birth prevalence and strong paternal age-effect of disorders such as achondroplasia and Apert, Noonan and Costello syndromes, with direct experimental evidence currently available for specific positions of six genes [gene names lost in extraction]. We present a discovery screen to identify novel mutations and genes showing evidence of positive selection in the male germline, by performing massively parallel simplex PCR using RainDance technology to interrogate mutational hotspots in 67 genes (51.5 kb in total) in 276 biopsies of testes from five men (median age, 83 yr). Following ultradeep sequencing (about 16,000×), development of a low-frequency variant prioritization strategy, and targeted validation, we identified 61 distinct variants present at frequencies as low as 0.06%, including 54 variants not previously directly associated with selfish selection. The majority (80%) of variants identified have previously been implicated in developmental disorders and/or oncogenesis and include mutations in six newly associated genes [gene names lost in extraction], all of which encode components of the RAS-MAPK pathway and activate signaling. Our findings extend the link between mutations dysregulating the RAS-MAPK pathway and selfish selection, and show that the aging male germline is a repository for such deleterious mutations.
Associations between selected alleles and the genetic backgrounds on which they are found can reduce the efficacy of selection. We consider the extent to which such interference, known as the Hill-Robertson effect, acting between weakly selected alleles, can restrict molecular adaptation and affect patterns of polymorphism and divergence. In particular, we focus on synonymous-site mutations, considering the fate of novel variants in a two-locus model and the equilibrium effects of interference with multiple loci and reversible mutation. We find that weak selection Hill-Robertson (wsHR) interference can considerably reduce adaptation, e.g., codon bias, and, to a lesser extent, levels of polymorphism, particularly in regions of low recombination. Interference causes the frequency distribution of segregating sites to resemble that expected from more weakly selected mutations and also generates specific patterns of linkage disequilibrium. While the selection coefficients involved are small, the fitness consequences of wsHR interference across the genome can be considerable. We suggest that wsHR interference is an important force in the evolution of nonrecombining genomes and may explain the unexpected constancy of codon bias across species of very different census population sizes, as well as several unusual features of codon usage in Drosophila.
Oak gallwasps (Hymenoptera, Cynipidae, Cynipini) are one of seven major animal taxa that commonly reproduce by cyclical parthenogenesis (CP). A major question in research on CP taxa is the frequency with which lineages lose their sexual generations, and diversify as purely asexual radiations. Most oak gallwasp species are only known from an asexual generation, and secondary loss of sex has been conclusively demonstrated in several species, particularly members of the holarctic genus Andricus. This raises the possibility of widespread secondary loss of sex in the Cynipini, and of diversification within purely parthenogenetic lineages. We use two approaches based on analyses of allele frequency data to test for cryptic sexual generations in eight apparently asexual European species distributed through a major western palaearctic lineage of the gallwasp genus Andricus. All species showing adequate levels of polymorphism (7/8) showed signatures of sex compatible with cyclical parthenogenesis. We also use DNA sequence data to test the hypothesis that ignorance of these sexual generations (despite extensive study on this group) results from failure to discriminate among known but morphologically indistinguishable sexual generations. This hypothesis is supported: 35 sequences attributed by leading cynipid taxonomists to a single sexual adult morphospecies, Andricus burgundus, were found to represent the sexual generations of at least six Andricus species. We confirm cryptic sexual generations in a total of 11 Andricus species, suggesting that secondary loss of sex is rare in Andricus.