Abstract
Summary
HLA*LA implements a new graph alignment model for human leukocyte antigen (HLA) type inference, based on the projection of linear alignments onto a variation graph. It enables accurate HLA type inference from whole-genome (99% accuracy) and whole-exome (93% accuracy) Illumina data; from long-read Oxford Nanopore and Pacific Biosciences data (98% accuracy for whole-genome and targeted data); and from genome assemblies. Computational requirements for a typical sample vary between 0.7 and 14 CPU hours.
Availability and implementation
HLA*LA is implemented in C++ and Perl and freely available as a bioconda package or from https://github.com/DiltheyLab/HLA-LA (GPL v3).
Supplementary information
Supplementary data are available at Bioinformatics online.
HLA class I glycoproteins contain the functional sites that bind peptide antigens and engage lymphocyte receptors. Recently, clinical application of sequence-based HLA typing has uncovered an unprecedented number of novel HLA class I alleles. Here we define the nature and extent of the variation in 3,489 HLA-A, 4,356 HLA-B and 3,111 HLA-C alleles. This analysis required development of suites of methods, having general applicability, for comparing and analyzing large numbers of homologous sequences. At least three amino-acid substitutions are present at every position in the polymorphic α1 and α2 domains of HLA-A, -B and -C. A minority of positions have an incidence >1% for the 'second' most frequent nucleotide, comprising 70 positions in HLA-A, 85 in HLA-B and 54 in HLA-C. The majority of these positions have three or four alternative nucleotides. These positions were subject to positive selection and correspond to binding sites for peptides and receptors. Most alleles of HLA class I (>80%) are very rare, often identified in one person or family, and they differ by point mutation from older, more common alleles. These alleles with single nucleotide polymorphisms reflect the germ-line mutation rate. Their frequency predicts the human population harbors 8-9 million HLA class I variants. The common alleles of human populations comprise 42 core alleles, which represent all selected polymorphism, and recombinants that have assorted this polymorphism.
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation by sequencing at a level that should allow the genome-wide detection of most variants with frequencies as low as 1%. However, in the major histocompatibility complex (MHC), only the top 10 most frequent haplotypes are in the 1% frequency range whereas thousands of haplotypes are present at lower frequencies. Given the limitation of both the coverage and the read length of the sequences generated by the 1000 Genomes Project, the highly variable positions that define HLA alleles may be difficult to identify. We used classical Sanger sequencing techniques to type the HLA-A, HLA-B, HLA-C, HLA-DRB1 and HLA-DQB1 genes in the available 1000 Genomes samples and combined the results with the 103,310 variants in the MHC region genotyped by the 1000 Genomes Project. Using pairwise identity-by-descent distances between individuals and principal component analysis, we established the relationship between ancestry and genetic diversity in the MHC region. As expected, both the MHC variants and the HLA phenotype can identify the major ancestry lineage, informed mainly by the most frequent HLA haplotypes. To some extent, regions of the genome with similar genetic diversity or similar recombination rate have similar properties. An MHC-centric analysis underlines departures between the ancestral background of the MHC and the genome-wide picture. Our analysis of linkage disequilibrium (LD) decay in these samples suggests that overestimation of pairwise LD occurs due to a limited sampling of the MHC diversity. This collection of HLA-specific MHC variants, available on the dbMHC portal, is a valuable resource for future analyses of the role of MHC in population and disease studies.
Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data.
We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30-250 CPU hours per sample) remain a significant challenge to practical application.
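The third step above, choosing the most likely pair of alleles at a locus, can be illustrated with a minimal sketch. This is not the HLA*PRG implementation. It assumes that each read has already been reduced to a mismatch count against every candidate allele, that reads are independent, that a read originates from either allele of the pair with probability 1/2, and that mismatches arise at a fixed per-base error rate. All function names and the toy input format are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def read_log_likelihood(mismatches, read_length, error_rate=0.01):
    """log P(read | allele) under a fixed per-base error rate (illustrative assumption)."""
    matches = read_length - mismatches
    return mismatches * math.log(error_rate) + matches * math.log(1.0 - error_rate)


def pair_log_likelihood(reads, allele_a, allele_b, read_length, error_rate=0.01):
    """Sum over reads of log(0.5 * P(read | a) + 0.5 * P(read | b))."""
    total = 0.0
    for read in reads:
        la = read_log_likelihood(read[allele_a], read_length, error_rate)
        lb = read_log_likelihood(read[allele_b], read_length, error_rate)
        # Log-sum-exp keeps the mixture numerically stable for long reads.
        m = max(la, lb)
        total += m + math.log(0.5 * math.exp(la - m) + 0.5 * math.exp(lb - m))
    return total


def best_allele_pair(reads, alleles, read_length, error_rate=0.01):
    """Return the unordered allele pair (homozygous pairs included) with the highest likelihood."""
    return max(
        combinations_with_replacement(alleles, 2),
        key=lambda pair: pair_log_likelihood(reads, pair[0], pair[1], read_length, error_rate),
    )


if __name__ == "__main__":
    # Toy input: each read is a dict mapping allele name -> mismatch count.
    candidate_alleles = ["A*01:01", "A*02:01", "A*03:01"]
    toy_reads = (
        [{"A*01:01": 0, "A*02:01": 3, "A*03:01": 3}] * 3
        + [{"A*01:01": 3, "A*02:01": 0, "A*03:01": 3}] * 3
    )
    print(best_allele_pair(toy_reads, candidate_alleles, read_length=100))
```

Because every pair is scored exhaustively, the cost grows quadratically with the number of candidate alleles, which gives a rough intuition for why likelihood-based typing over thousands of known alleles is computationally demanding.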
The human KIR genes are arranged in at least six major gene-content haplotypes, all of which are combinations of four centromeric and two telomeric motifs. Several less frequent or minor haplotypes also exist, including insertions, deletions, and hybridization of KIR genes derived from the major haplotypes. These haplotype structures and their concomitant linkage disequilibrium among KIR genes suggest that more meaningful correlative data from studies of KIR genetics and complex disease may be achieved by measuring haplotypes of the KIR region in total.
Towards that end, we developed a KIR haplotyping method that reports unambiguous combinations of KIR gene-content haplotypes, including both phase and copy number for each KIR. A total of 37 different gene content haplotypes were detected from 4,512 individuals and new sequence data was derived from haplotypes where the detailed structure was not previously available.
These new structures suggest a number of specific recombinant events during the course of KIR evolution, and add to an expanding diversity of potential new KIR haplotypes derived from gene duplication, deletion, and hybridization.
Human leukocyte antigen (HLA) haplotype frequency distributions in specific populations can be applied to optimize both individual stem cell donor searches and donor registry planning. We present allele and haplotype frequencies derived from a data set of 8862 German stem cell donors who were typed at high resolution for the HLA-A, HLA-B, HLA-C, and HLA-DRB1 genes upon registration. Calculated haplotype frequencies were used to estimate the probability p to find matching donors subject to donor registry size n. The impact of various matching standards on p(n) was analyzed. When high-resolution matching for HLA-A, HLA-B, HLA-C, and HLA-DRB1 is required, p(1,000,000) is 0.678. The corresponding value for n = 7,000,000 is 0.859. In a scenario with low-resolution matching and no consideration of HLA-C, p(1,000,000) is 0.863 and thus larger than p(7,000,000) in the scenario with stricter matching requirements. As recent findings support the importance of high-resolution matching of HLA-A, HLA-B, HLA-C, and HLA-DRB1 for outcomes of hematopoietic stem cell transplantation, our results are highly relevant for strategic planning and resource allocation of donor centers and registries.
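The abstract does not state how p(n) is computed from the haplotype frequencies. One common formulation, shown here as a hedged sketch and not as the authors' method, assumes Hardy-Weinberg proportions, patients and donors drawn from the same population, and a match defined as an identical unphased genotype. Under those assumptions a patient with genotype frequency f finds at least one match among n donors with probability 1 - (1 - f)^n, and p(n) is the average of that quantity over patient genotypes. The function names and the toy frequencies are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def genotype_frequencies(haplotype_freqs):
    """Unphased genotype frequencies under Hardy-Weinberg proportions (assumption).

    haplotype_freqs maps a tuple of alleles (one per locus) to its frequency.
    """
    freqs = defaultdict(float)
    items = list(haplotype_freqs.items())
    for (h1, f1), (h2, f2) in combinations_with_replacement(items, 2):
        # Phase is dropped: each locus keeps only its sorted pair of alleles, so
        # different haplotype pairs can contribute to the same genotype.
        genotype = tuple(tuple(sorted(pair)) for pair in zip(h1, h2))
        freqs[genotype] += f1 * f1 if h1 == h2 else 2.0 * f1 * f2
    return dict(freqs)


def matching_probability(haplotype_freqs, registry_size):
    """p(n): probability that a random patient has at least one matching donor among n."""
    return sum(
        f * (1.0 - (1.0 - f) ** registry_size)
        for f in genotype_frequencies(haplotype_freqs).values()
    )


if __name__ == "__main__":
    # Toy two-locus frequencies, invented for illustration only.
    toy = {("A1", "B1"): 0.4, ("A2", "B2"): 0.3, ("A1", "B2"): 0.2, ("A2", "B1"): 0.1}
    for n in (1, 10, 100):
        print(n, matching_probability(toy, n))
```

In this formulation p(n) rises quickly while common genotypes are being covered and then flattens, because the remaining unmatched patients carry rare genotypes. That shape is consistent with the modest gain reported above between n = 1,000,000 and n = 7,000,000.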
We present an electronic format for exchanging data for HLA and KIR genotyping with extensions for next-generation sequencing (NGS). This format addresses NGS data exchange by refining the Histoimmunogenetics Markup Language (HML) to conform to the proposed Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) reporting guidelines (miring.immunogenomics.org). Our refinements of HML include two major additions. First, NGS is supported by new XML structures to capture additional NGS data and metadata required to produce a genotyping result, including analysis-dependent (dynamic) and method-dependent (static) components. A full genotype, consensus sequence, and the surrounding metadata are included directly, while the raw sequence reads and platform documentation are externally referenced. Second, genotype ambiguity is fully represented by integrating Genotype List Strings, which use a hierarchical set of delimiters to represent allele and genotype ambiguity in a complete and accurate fashion. HML also continues to enable the transmission of legacy methods (e.g. site-specific oligonucleotide, sequence-specific priming, and Sequence Based Typing (SBT)), adding features such as allowing multiple group-specific sequencing primers, and fully leveraging techniques that combine multiple methods to obtain a single result, such as SBT integrated with NGS.
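The hierarchical delimiters of a Genotype List String lend themselves to a small recursive parser. The abstract does not list the delimiters. The sketch below assumes the commonly described set, in order of increasing binding strength: `^` between loci, `|` between alternative genotypes, `+` between gene copies, `~` between phased alleles of a haplotype, and `/` between ambiguous alleles. That set should be checked against the GL String specification before use. The function names and the nested-dictionary output are hypothetical and are not part of HML.

```python
# Assumed GL String delimiters, from weakest to strongest binding.
GL_DELIMITERS = ["^", "|", "+", "~", "/"]


def parse_gl_string(text, level=0):
    """Parse a GL String into a nested structure by splitting on one delimiter per level.

    Returns a plain allele name (str) or a dict {"op": delimiter, "parts": [...]}.
    Levels whose delimiter does not occur are collapsed to keep the tree compact.
    """
    if level == len(GL_DELIMITERS):
        return text
    delimiter = GL_DELIMITERS[level]
    parts = [parse_gl_string(part, level + 1) for part in text.split(delimiter)]
    return parts[0] if len(parts) == 1 else {"op": delimiter, "parts": parts}


def list_alleles(node):
    """Collect every allele name mentioned in a parsed GL String, in order."""
    if isinstance(node, str):
        return [node]
    alleles = []
    for part in node["parts"]:
        alleles.extend(list_alleles(part))
    return alleles


if __name__ == "__main__":
    # One copy is ambiguous between two alleles; the other copy is resolved.
    tree = parse_gl_string("HLA-A*01:01/HLA-A*01:02+HLA-A*24:02")
    print(tree)
    print(list_alleles(tree))
```

Splitting on the weakest delimiter first mirrors the hierarchy: allele ambiguity is resolved inside a gene copy, copies combine into a genotype, and alternative genotypes are grouped per locus.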
Regional HLA frequency differences are of potential relevance for the optimization of stem cell donor recruitment. We analyzed a very large sample (n = 123,749) of registered Polish stem cell donors. Donor figures by 1-digit postal code regions ranged from n = 5,243 (region 9) to n = 19,661 (region 8). Simulations based on region-specific haplotype frequencies showed that donor recruitment in regions 0, 2, 3 and 4 (mainly located in the south-eastern part of Poland) resulted in an above-average increase of matching probabilities for Polish patients. Regions 1, 7, 8 and 9 (mainly located in the northern part of Poland) showed the opposite behavior. However, HLA frequency differences between regions were generally small. A strong indication for regionally focused donor recruitment efforts therefore cannot be derived from our analyses. Results of haplotype frequency estimations showed sample size effects even for sizes between n≈5,000 and n≈20,000. This observation deserves further attention as most published haplotype frequency estimations are based on much smaller samples.
This communication describes our experience in large-scale G group-level high resolution HLA typing using three different DNA sequencing platforms: ABI 3730xl, Illumina MiSeq and PacBio RS II. Recent advances in DNA sequencing technologies, so-called next generation sequencing (NGS), have brought breakthroughs in deciphering the genetic information in all living species at a large scale and at an affordable level. The NGS DNA indexing system allows sequencing multiple genes for a large number of individuals in a single run. Our laboratory has adopted and used these technologies for HLA molecular testing services. We found that each sequencing technology has its own strengths and weaknesses, and their sequencing performances complement each other. HLA genes are highly complex and genotyping them is quite challenging. Using these three sequencing platforms, we were able to meet all requirements for G group-level high resolution and high volume HLA typing.