Genome structural variation in human evolution
Hollox, Edward J.; Zuccherato, Luciana W.; Tucci, Serena
Trends in Genetics, January 2022, Volume 38, Issue 1
Journal Article
Peer reviewed
Open access
Structural variation (SV) is a large difference (typically >100 bp) in the genomic structure of two genomes and includes both copy number variation and variation that does not change copy number of a genomic region, such as an inversion. Improved reference genomes, combined with widespread genome sequencing using short-read sequencing technology, and increasingly using long-read sequencing, have reignited interest in SV. Recent large-scale studies and functionally focused analyses have highlighted the role of SV in human evolution. In this review, we highlight human-specific SVs involved in changes in the brain, population-specific SVs that affect response to the environment, including adaptation to diet and infectious diseases, and summarise the contribution of archaic hominin admixture to present-day human SV.
There has been an explosion in knowledge of structural variants through analysis of short-read sequencing in large population cohorts.
Long-read sequencing technology is dramatically improving our ability to detect and genotype structural variants, particularly in complex repeat-rich regions.
Structural variants are important in neurological changes involved in human evolution.
Structural variants have mediated population-specific human adaptations to diet and infectious disease exposure.
Introgression from archaic hominins has contributed structural variants to modern human populations.
Short tandem repeat (STR) variation is an often overlooked source of variation between genomes. STRs comprise about 3% of the human genome and are highly polymorphic. Some cause Mendelian disease, and others affect gene expression. Their contribution to common disease is not well understood, but recent software tools designed to genotype STRs using short-read sequencing data will help address this. Here, we compare software that genotypes common STRs and rarer STR expansions genome-wide, with the aim of applying them to population-scale genomes. By using the Genome-In-A-Bottle (GIAB) consortium and 1000 Genomes Project short-read sequencing data, we compare performance in terms of sequence length, depth, computing resources needed, genotyping accuracy and number of STRs genotyped. To ensure broad applicability of our findings, we also measure genotyping performance against a set of genomes from clinical samples with known STR expansions, and a set of STRs commonly used for forensic identification. We find that HipSTR, ExpansionHunter and GangSTR perform well in genotyping common STRs, including the CODIS 13 core STRs used for forensic analysis. GangSTR and ExpansionHunter outperform HipSTR for genotyping call rate and memory usage. ExpansionHunter denovo (EHdn), STRling and GangSTR outperformed STRetch for detecting expanded STRs, and EHdn and STRling used considerably less processor time compared to GangSTR. Analysis on shared genomic sequence data provided by the GIAB consortium allows future performance comparisons of new software approaches on a common set of data, facilitating comparisons and allowing researchers to choose the best software that fulfils their needs.
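The two headline metrics in this comparison, genotyping call rate and genotyping accuracy, can be illustrated with a minimal sketch. It assumes genotypes are already available as pairs of repeat counts per locus; the function name, the dictionary layout, and the example loci are hypothetical and do not reflect the output format of any of the tools named above.

```python
from typing import Dict, Optional, Tuple

# A diploid STR genotype as the repeat counts of the two alleles (assumed layout).
Genotype = Tuple[int, int]


def call_rate_and_concordance(
    calls: Dict[str, Optional[Genotype]],
    truth: Dict[str, Genotype],
) -> Tuple[float, float]:
    """Return (call rate, concordance among called loci) against a truth set.

    A locus counts as called when the caller reported a genotype for it.
    A called locus is concordant when both alleles match the truth genotype,
    ignoring allele order.
    """
    total = len(truth)
    called = 0
    matching = 0
    for locus, expected in truth.items():
        observed = calls.get(locus)
        if observed is None:  # missing locus or explicit no-call
            continue
        called += 1
        if sorted(observed) == sorted(expected):
            matching += 1
    call_rate = called / total if total else 0.0
    concordance = matching / called if called else 0.0
    return call_rate, concordance


# Hypothetical example data, not taken from the study.
truth_set = {"locus1": (10, 12), "locus2": (7, 7), "locus3": (15, 16), "locus4": (9, 11)}
caller_output = {"locus1": (12, 10), "locus2": (7, 8), "locus3": None}
rate, agreement = call_rate_and_concordance(caller_output, truth_set)
```

Reporting the two numbers separately keeps a tool that makes few but accurate calls distinguishable from one that calls every locus with more errors.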
Glycophorins are transmembrane proteins of red blood cells (RBCs), heavily glycosylated on their external-facing surface. In humans, there are four glycophorin proteins, glycophorins A, B, C and D. Glycophorins A and B are encoded by two similar genes, GYPA and GYPB, and glycophorin C and glycophorin D are encoded by a single gene, GYPC. The exact function of glycophorins remains unclear. However, given their abundance on the surface of RBCs, it is likely that they serve as a substrate for glycosylation, giving the RBC a negatively charged, complex glycan “coat”. GYPB and GYPE (a closely related pseudogene) were generated from GYPA by two duplication events involving a 120-kb genomic segment between 10 and 15 million years ago. Non-allelic homologous recombination between these 120-kb repeats generates a variety of duplication alleles and deletion alleles, which have been systematically catalogued from genomic sequence data. One allele, called DUP4, encodes the Dantu NE blood type and is strongly protective against malaria as it alters the surface tension of the RBC membrane. Glycophorins interact with other infectious pathogens, including viruses, as well as the malarial parasite Plasmodium falciparum, but the role of glycophorin variation in mediating the effects of these pathogens remains underexplored.
Copy number variation (CNV), where a segment of DNA differs in copy number between different individuals, is an extensive and often underappreciated source of genetic variation within species. However, reliably determining copy number of a particular DNA sequence for a large number of samples can be challenging. Here, I describe and review the paralogue ratio test (PRT) in detail. PRT was developed to robustly type the CNV of the beta-defensin locus using small amounts of genomic DNA in a high-throughput manner, and has been applied successfully at many other loci. I discuss the strategies for designing successful PRT assays using both manual and bioinformatics methods, how to optimize experimental conditions, and approaches for analyzing the data. I discuss strengths and weaknesses of the approach, and how to troubleshoot results, as well as the range of problems to which PRT can be a potential solution.
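As a rough illustration of the data-analysis step, the sketch below converts a test-to-reference signal ratio into a copy number estimate. It assumes, beyond what the abstract states, that the ratio is calibrated with a straight line fitted to control samples of known copy number run in the same experiment. The function names and all numbers are hypothetical, and this is not presented as the author's analysis procedure.

```python
from typing import Sequence, Tuple


def fit_calibration(
    ratios: Sequence[float], copy_numbers: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares line: copy_number = slope * ratio + intercept."""
    n = len(ratios)
    if n < 2 or n != len(copy_numbers):
        raise ValueError("need at least two controls with matching lengths")
    mean_x = sum(ratios) / n
    mean_y = sum(copy_numbers) / n
    sxx = sum((x - mean_x) ** 2 for x in ratios)
    if sxx == 0:
        raise ValueError("control ratios must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ratios, copy_numbers))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_copy_number(
    test_signal: float, reference_signal: float, slope: float, intercept: float
) -> float:
    """Apply the calibration line to one sample's test/reference ratio."""
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    ratio = test_signal / reference_signal
    return slope * ratio + intercept


# Hypothetical controls: observed ratios for samples with 2, 3 and 4 copies.
slope, intercept = fit_calibration([1.0, 1.5, 2.0], [2, 3, 4])
# Hypothetical sample: peak areas for the test and reference amplicons.
estimate = estimate_copy_number(3000.0, 1200.0, slope, intercept)
```

The estimate is a continuous value. Turning it into an integer copy number call needs a further step, such as rounding or clustering across the cohort, which is left out here.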
Variability in the susceptibility to infectious disease and its clinical manifestation can be determined by variation in the environment and by genetic variation in the pathogen and the host. Despite several successes based on candidate gene studies, defining the host variation affecting infectious disease has not been as successful as for other multifactorial diseases. Both single nucleotide variation and copy number variation (CNV) of the host contribute to the host’s susceptibility to infectious disease. In this review we focus on CNV, particularly on complex multiallelic CNV that is often not well characterised either directly by hybridisation methods or indirectly by analysis of genotypes and flanking single nucleotide variants. We summarise the well-known examples, such as α-globin deletion and susceptibility to severe malaria, as well as more recent controversies, such as the extensive CNV of the chemokine gene CCL3L1 and HIV infection. We discuss the potential biological mechanisms that could underlie any genetic association and reflect on the extensive complexity and functional variation generated by a combination of CNV and sequence variation, as illustrated by the Fc gamma receptor genes FCGR3A, FCGR3B and FCGR2C. We also highlight some understudied areas that might prove fruitful areas for further research.
Intelectins are ancient carbohydrate binding proteins, spanning chordate evolution and implicated in multiple human diseases. Previous GWAS have linked SNPs in ITLN1 (also known as omentin) with susceptibility to Crohn's disease (CD); however, analysis of possible functional significance of SNPs at this locus is lacking. Using the Ensembl database, pairwise linkage disequilibrium (LD) analyses indicated that several disease-associated SNPs at the ITLN1 locus, including SNPs in CD244 and Ly9, were in LD. The alleles comprising the risk haplotype are the major alleles in European (67%), but minor alleles in African superpopulations. Neither ITLN1 mRNA nor protein abundance in intestinal tissue, which we confirm as goblet-cell derived, was altered in the CD samples overall nor when samples were analyzed according to genotype. Moreover, the missense variant V109D does not influence ITLN1 glycan binding to the glycan β-D-galactofuranose or protein-protein oligomerization. Taken together, our data are an important step in defining the role(s) of the CD-risk haplotype by determining that risk is unlikely to be due to changes in ITLN1 carbohydrate recognition, protein oligomerization, or expression levels in intestinal mucosa. Our findings suggest that the relationship between the genomic data and disease arises from changes in CD244 or Ly9 biology, differences in ITLN1 expression in other tissues, or an alteration in ITLN1 interaction with other proteins.
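For readers unfamiliar with pairwise LD, the standard definitions of the coefficient D and the squared correlation r² between two biallelic SNPs can be written out as a short sketch. The haplotype and allele frequencies used here are hypothetical inputs, not values from the study, and the function name is illustrative rather than part of the Ensembl tools.

```python
from typing import Tuple


def ld_stats(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (D, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, r_squared


# Hypothetical frequencies: the two alleles always travel together.
d_full, r2_full = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)
# Hypothetical frequencies: the two loci are independent.
d_none, r2_none = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)
```

An r² near 1 means one SNP is an almost perfect proxy for the other, which is why association signals at SNPs in LD cannot be separated by genotype data alone.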
Intelectins (intestinal lectins) are highly conserved across chordate evolution and have been implicated in various human diseases, including Crohn's disease (CD). The human genome encodes two intelectin genes, intelectin‐1 (ITLN1) and intelectin‐2 (ITLN2). Other than its high sequence similarity with ITLN1, little is known about ITLN2. To address this void in knowledge, we report that ITLN2 exhibits discrete, yet notable differences from ITLN1 in primary structure, including a unique amino terminus, as well as changes in amino acid residues associated with the glycan‐binding activity of ITLN1. We identified that ITLN2 is a highly abundant Paneth cell‐specific product, which localizes to secretory granules, and is expressed as a multimeric protein in the small intestine. In surgical specimens of ileal CD, ITLN2 mRNA levels were reduced approximately five‐fold compared to control specimens. The ileal expression of ITLN2 was unaffected by previously reported disease‐associated variants in ITLN2 and CD‐associated variants in neighboring ITLN1 as well as NOD2 and ATG16L1. ITLN2 mRNA expression was undetectable in control colon tissue; however, in both ulcerative colitis (UC) and colonic CD, metaplastic Paneth cells were found to express ITLN2. Together, the data reported establish the groundwork for understanding ITLN2 function(s) in the intestine, including its possible role in CD.
Single nucleotide variants (SNVs) within and surrounding the complement receptor 1 (CR1) gene show some of the strongest genome-wide association signals with late-onset Alzheimer’s disease. Some studies have suggested that this association signal is due to a duplication allele (CR1-B) of a low copy repeat (LCR) within the CR1 gene, which increases the number of complement C3b/C4b-binding sites in the mature receptor. In this study, we develop a triplex paralogue ratio test assay for CR1 LCR copy number, allowing large numbers of samples to be typed with a limited amount of DNA. We also develop a CR1-B allele-specific PCR based on the junction generated by an historical non-allelic homologous recombination event between CR1 LCRs. We use these methods to genotype CR1 and measure CR1-B allele frequency in both late-onset and early-onset cases and unaffected controls from the United Kingdom. Our data support an association of late-onset Alzheimer’s disease with the CR1-B allele, and confirm that this allele occurs most frequently on the risk haplotype defined by SNV alleles. Furthermore, regression models incorporating CR1-B genotype provide a better fit to our data compared to incorporating the SNV-defined risk haplotype, supporting the CR1-B allele as the variant underlying the increased risk of late-onset Alzheimer’s disease.
Glycophorin A and glycophorin B are red blood cell surface proteins and are both receptors for the parasite Plasmodium falciparum, which is the principal cause of malaria in sub-Saharan Africa. DUP4 is a complex structural genomic variant that carries extra copies of a glycophorin A-glycophorin B fusion gene and has a dramatic effect on malaria risk by reducing the risk of severe malaria by up to 40%. Using fiber-FISH and Illumina sequencing, we validate the structural arrangement of the glycophorin locus in the DUP4 variant and reveal somatic variation in copy number of the glycophorin B-glycophorin A fusion gene. By developing a simple, specific, PCR-based assay for DUP4, we show that the DUP4 variant reaches a frequency of 13% in the population of a malaria-endemic village in south-eastern Tanzania. We genotype a substantial proportion of that village and demonstrate an association of DUP4 genotype with hemoglobin levels, a phenotype related to malaria, using a family-based association test. Taken together, we show that DUP4 is a complex structural variant that may be susceptible to somatic variation and show that DUP4 is associated with a malaria-related phenotype in a longitudinally followed population.
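A variant frequency such as the one reported here is typically an allele frequency obtained by counting alleles across genotyped individuals. The sketch below shows that arithmetic for a biallelic presence/absence assay, assuming each diploid individual carries zero, one, or two copies of the variant. The genotype counts are hypothetical and were not taken from the study, and simple counting ignores the relatedness between individuals that a family-based design would have to account for.

```python
def allele_frequency(n_noncarrier: int, n_heterozygote: int, n_homozygote: int) -> float:
    """Variant allele frequency from diploid genotype counts.

    Each heterozygote contributes one variant allele and each homozygote two,
    out of two alleles per genotyped individual.
    """
    n_individuals = n_noncarrier + n_heterozygote + n_homozygote
    if n_individuals == 0:
        raise ValueError("no genotyped individuals")
    variant_alleles = n_heterozygote + 2 * n_homozygote
    return variant_alleles / (2 * n_individuals)


# Hypothetical genotype counts for 100 individuals.
freq = allele_frequency(n_noncarrier=75, n_heterozygote=22, n_homozygote=3)
```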
Intelectins are a family of multimeric secreted proteins that bind microbe-specific glycans. Both genetic and functional studies have suggested that intelectins have an important role in innate immunity and are involved in the etiology of various human diseases, including inflammatory bowel disease. Experiments investigating the role of intelectins in human disease using mouse models are limited by the fact that there is not a clear one-to-one relationship between intelectin genes in humans and mice, and that the number of intelectin genes varies between different mouse strains. In this study we show by gene sequence and gene expression analysis that human intelectin-1 (ITLN1) has multiple orthologues in mice, including a functional homologue Itln1; however, human intelectin-2 has no such orthologue or homologue. We confirm that all sub-strains of the C57 mouse strain have a large deletion resulting in retention of only one intelectin gene, Itln1. The majority of laboratory strains have a full complement of six intelectin genes, except CAST, SPRET, SKIVE, MOLF and PANCEVO strains, which are derived from different mouse species/subspecies and encode different complements of intelectin genes. In wild mice, intelectin deletions are polymorphic in Mus musculus castaneus and Mus musculus domesticus. Further sequence analysis shows that Itln3 and Itln5 are polymorphic pseudogenes due to premature truncating mutations, and that mouse Itln1 has undergone recent adaptive evolution. Taken together, our study shows extensive diversity in intelectin genes in both laboratory and wild mice, suggesting a pattern of birth-and-death evolution. In addition, our data provide a foundation for further experimental investigation of the role of intelectins in disease.