We extracted 210 412 blood group alleles (∼8.5 × 109 bp) from 1000 Genomes and matched them against official blood group reference lists.
Of 1241 nonsynonymous variants found, 241 are known blood group polymorphisms while 1000 may represent undiscovered or altered antigens.
Blood group genotyping has recently developed into a clinical tool to improve compatibility of blood transfusions and management of pregnancies. Next-generation sequencing (NGS) is rapidly moving toward routine practice for patient and donor typing and has the potential to remedy some of the limitations of currently used platforms. However, a large-scale investigation into the blood group genotypes obtained by NGS in a multiethnic cohort is lacking. The 1000 Genomes Project provides information on genome variation among 2504 individuals representing 26 populations worldwide. We extracted their NGS data for all 36 blood group systems to a custom-designed database. In total, 210 412 alleles from 43 blood group–related genes were imported and curated. Matching algorithms were developed to compare them to blood group variants identified to date. Of the 1241 non-synonymous variants identified in the coding regions, 241 are known blood group polymorphisms. Interestingly, 357 of the remaining 1000 variants are predicted to occur on extracellular portions of 31 different blood group–carrying proteins and some may represent undiscovered antigens. Of the alleles analyzed, 1504 were not previously described. The ABO/GBGT1/FUT2/FUT3 and GYPB/GYPC genes showed the highest degree of variation per kilobase coding sequence, and ACKR1 variants had the most skewed distribution across 5 continental superpopulations in the dataset. Results were exported to an online search engine, www.erythrogene.com, which presents data according to the allele nomenclature developed for clinical reporting by the International Society of Blood Transfusion. The established database deepens our knowledge on blood group polymorphism globally and provides a long-sought platform for future research.
Interindividual variation is the cornerstone of blood group antigen discovery, a field that has interested scientists and clinicians for over a century. However, identification has been based primarily on clinical case studies and subsequent serologic and molecular mapping. Cloning of the human genome opened the possibility to look for human variation on a more systematic scale, and following the Human Genome Project1 and the International HapMap project,2 the 1000 Genomes (1000G) Project was announced. The aim was to create a map of genetic variation in at least 1000 individuals from different ethnic groups with the latest DNA-sequencing technologies.3 The 1000G Project was subsequently expanded to include 2504 individuals from 26 population groups classified into 5 superpopulations (supplemental Table 1).4,5 It has already formed the basis for detailed studies in areas like coronary artery disease and chronic obstructive pulmonary disease,6,7 but not yet in transfusion medicine.
Although tools like Ensembl8 and the Exome Aggregation Consortium9 browser can visualize data from the 1000G Project, they are limited and not always based on accurately curated reference sequences. For example, the official ABO reference has been repeatedly revised after it was found to be an artificially fused sequence from different individuals.10 Furthermore, certain variants are not correctly shown in 1000G browsers. As an example, the c.1057C>A variant (rs1799805) in the ACHE gene, responsible for the YT*02 blood group allele, is not visible in the allele view in Ensembl but is listed among variants. Detailed curation of this valuable dataset by investigators with an interest in blood groups is therefore required. Otherwise extraction and interpretation of blood group data will remain difficult.
The International Society of Blood Transfusion (ISBT) defines a blood group system as ≥1 blood group antigens encoded by a single gene or by ≥2 closely linked genes.11 The importance of identifying and matching blood types is relevant not only for blood transfusions but also during pregnancy to prevent hemolytic disease of the newborn and before hematopoietic stem cell and solid organ transplantations to ensure compatibility. In addition, many blood group molecules have key roles in host–pathogen interactions, like the glycoprotein underlying the Duffy system, which acts as a receptor for Plasmodium vivax and P knowlesi malaria parasites.12 There are multiple associations between blood groups and other forms of diseases, for example cardiovascular disease, where non-O blood group correlates with higher mortality.13 Currently, there are 36 defined blood group systems, giving rise to >300 different antigens.14 The genes underlying these systems form the foundation of this study.
Until now, blood group identification and matching have mainly been carried out as red blood cell (RBC) phenotyping by hemagglutination. In recent years, technological advancements have enabled large-scale blood group genotyping, and DNA microarray platforms are used in clinical laboratories.15 These methods complement serology and improve blood group matching but have built-in limitations in the number of variants they can detect.16,17 Next-generation sequencing (NGS) has the potential to redress these limitations and is being implemented for HLA analysis in tissue typing reference laboratories.18 Similarly, NGS is rapidly moving toward routine practice for patient and donor blood group typing,19,20 yet a large worldwide survey of genomic blood group variation is lacking.
The purpose of this study was to create and dissect a database of alleles for all of the 43 currently recognized blood group–related genes from the 2504 individuals in the third and final phase of the 1000G Project. A computational algorithm was developed for matching these with previously defined alleles characterized and named by the ISBT and to recognize previously neglected variants with potential future impact. The results are made accessible to clinicians and scientists in an online search engine, Erythrogene.
Materials and methods
A custom-designed database was developed to enable import and integration of data from a wide range of different sources and to create a user-friendly interface with correct nomenclature for the hematology community. The database engine chosen was Microsoft SQL Server 2012 (www.microsoft.com/sql). Figure 1 represents a graphic working model for the project.
Locus Reference Genomic (LRG) records
LRG records (www.lrg-sequence.org/)21 were found for 41 blood group genes and the erythroid-specific transcription factor genes GATA1 and KLF1, mutations that are known to affect blood group expression. LRG provided reference sequences and coordinates for genes and exons to map them to reference assembly GRCh37 used by the 1000G Project (supplemental Table 2). Since 38 of 43 LRG files were pending and not finalized, a set of validation routines (described below) was compiled to verify their correctness. The reference sequences provided were verified against their NCBI RefSeqGene counterparts,22 and remapping functions were developed to enable conversion of sequences and variants to and from GRCh37. In addition to defining exon and intron borders, the LRG files also specify the size of 5′- and 3′-flanking regions. A standard length of 5 kilobase pairs (kb) for 5′-flanking regions, and 2 kb for 3′-flanking regions were found in 22 genes, but the variation for the remaining 21 genes was extensive (2419-98 537 bp and 1267-5742 bp for the up- and downstream regions, respectively).
Processing the 1000G Project data
We downloaded the GRCh37 human reference assembly in FASTA format from the 1000G Project FTP server and imported reference sequences for the 43 genes into the database, and alignments in the LRG records were verified. Furthermore, we downloaded variant call format files comprising single-nucleotide variation, shorter insertions or deletions (indels), and structural variants (SVs). The latter are defined as indels, copy-number variations (CNVs), and mobile element insertions ≥50 bp in length.5 The variant call format files were read with Tabix,23 and variants within the coordinates determined by LRG records were imported to the database. All imported variants were processed in numerous different ways. Finally, alleles for all 43 genes were compiled for the 2504 individuals based on the genotype phasing performed in the 1000G Project. The processing pipeline is summarized in Figure 2.24,25
Reference allele data sources
The ISBT has maintained guidelines for blood group terminology since the 1980s,26,27 and catalogs of reference alleles are provided for most blood group systems on its Web site (www.isbt-web.org).11 Tables for 40 genes were available directly from ISBT. For C4A and C4B, the genotypes listed in The Blood Group Antigen FactsBook were used.28 The Blood Group Antigen Gene Mutation Database (dbRBC29 ) provided FUT3 alleles. The reference tables used are shown in supplemental Table 3.
Allele data validation
We imported all allele tables into the database and used a set of validation procedures to make sure they were correctly formatted (exemplified in supplemental Figure 1). In the first validation procedure, reference sequences were compared with LRG records. Discrepancies were found in 8 genes (supplemental Table 4), for example ACKR1, where the LRG record refers to the FY*01 allele but the ISBT reference allele is FY*02. To compensate for these inconsistencies, the described changes were applied to the alleles before any matching was performed. The second validation procedure ensured that syntax of the nucleotide changes conformed to Human Genome Variation Society (HGVS) guidelines. The third validation procedure confirmed that the nucleotide at the position of the variant corresponded to the reference sequence nucleotide. Finally, the nucleotide change was merged with the reference sequence. The result was translated into an amino acid sequence and compared with the reference protein sequence. The difference between them was compared with the amino acid change specified for this allele in the catalog of reference alleles. These procedures generated numerous errors, as >100 alleles failed validation. The majority of errors occurred because the nucleotide changes specified in the allele tables did not conform to HGVS nomenclature or were due to simple typographical errors. Several of these inconsistencies were traced back to articles originally describing the variants. In most cases, the papers contained the correct specification, but a few errors were present there as well. All erroneous alleles were corrected before further processing in the database. The corrections made to the allele lists and the suggested reference sequence updates were submitted to ISBT for review.
The variants from the reference tables and their corresponding genotypes were categorized as matchable or unmatchable, where a matchable variant has the prerequisites for a definitive identification in 1000G. Of 1137 variants, 49 were classified as unmatchable, where 33 involved rearrangements and hybrid genes in the MNS and Rh blood group systems. The remaining 16 were also hybrid genes or just indistinctly defined. The reference alleles then went through the same classification, where alleles containing at least 1 unmatchable variant were classified as unmatchable.
Localization of extracellular regions of blood group–carrying proteins
We retrieved data on predicted extracellular regions of blood group–carrying proteins from Uniprot.30 For membrane-spanning proteins, we used amino acid positions from extracellular region annotations, or, if not available, positions corresponding to extracellular regions between annotated transmembrane regions. For glycophosphatidylinositol-anchored proteins, we used all amino acid positions remaining after signal- and propeptides were removed.
Allele matching algorithm
According to the ISBT Guidelines for Naming of Blood Group Alleles (www.isbt-web.org), neither silent (synonymous) changes nor alleles that differ only by silent polymorphisms should be included in the tables, but such examples still exist. Other types of changes (eg, promoter and splice-site mutations) must also be considered. Therefore, a simple match on protein level will not be adequate. For this reason, the following algorithm was developed: (1) Alleles containing variants known to influence transcription-initiation sites or splicing, together with the larger deletions and duplications previously acknowledged by ISBT, were matched separately. Other SVs, including duplications and deletions in GYPA/GYPB, were excluded due to the difficulty of identifying and matching them correctly. (2) Alleles were matched at DNA level, and only exact matches were accepted. (3) Alleles were matched at protein level. If there was >1 match, the reference allele with the most nucleotides in common with the allele from 1000G was chosen as the matched option. This matching algorithm was applied to all genes.
Standard deviation (SD) for allele frequency differences was calculated using the STDEV function in SQL Server.
We extracted and curated NGS data from 43 blood group–related genes in the 1000G Project for the currently existing 36 blood group systems to a custom-designed database. Matching algorithms were developed to compare them to known blood group variants. Even if NGS data covering the whole genomic length of the LRG reference sequences have been imported to the database, the results presented here focus mainly on the coding sequences (CDSs) and variants known to alter expression or splicing of transcripts.
We analyzed almost 1.75 million bp per genome in 2504 individuals. Given that some of the investigated genes are not autosomal but reside on the X chromosome, the total number of nucleotides included in this study amounts to 8 501 108 720 bp. We found 52 305 different variants that were identified at 51 955 unique variant sites. Table 1 presents a summary of the analyzed data. Out of 50 076 single-nucleotide variants, 68.6% were transitions and 31.4% were transversions. A total of 2168 insertions and deletions were observed.
Structural variants and indels
One of the major challenges in the 1000G Project has been to develop algorithms for correct assignment of SVs.5 In this study, we identified a total of 61 SVs: 8 intronic Alu element insertions and 53 duplications, large indels, or CNVs. Nine of the 53 non-Alu SVs were found in the GYPA and GYPB genes, known to have a high rate of recombination and gene conversions. However, we noted that neither the 37-bp insertion characteristic of the RHD pseudogene31 nor the 17-bp deletion in SMIM1 underlying the Vel-negative phenotype32-34 were called in 1000G. While the former allele was still detected because of a downstream stop codon (p.Tyr269Ter), the latter remained invisible. However, examination of 1000G alignment files revealed the pre-sence of the 17-bp deletion in at least 19 individuals (data not shown). This highlights the shortcomings of the 1000G Project for these kinds of clinically important alleles.
Variant characterization and gene comparison
To classify the variation according to location in the transcript, we used Sequence Ontology terms (Table 2).25 Notably, indels within CDSs are rare in the dataset, with only 10 identified variants in all 43 genes. To assess variation within each gene, the number of variants per kilobase pair was calculated for the whole gene, CDSs, exons, and introns (Table 3). BSG, CD151, and GYPB have the highest total level of polymorphism, with 49.9, 45.6, and 42.7 variants/kb, respectively. However, most variants in BSG and CD151 are quite rare. GYPB showed the most variable CDS among all protein blood groups (68.8 variants/kb), followed by GYPC (62.0 variants/kb). ABO and 3 other glycosyltransferase genes (FUT2, FUT3, and GBGT1) showed particularly high levels of CDS polymorphism (all 4 at >60 variants/kb).
The least variation was observed in C4A and C4B, with a total of 5.5 and 4.4 variants/kb CDS, respectively, compatible with a key role in the classical pathway of the complement system. Still, these genes are located in the highly polymorphic MHC region on chromosome 6, where several CNVs and recombinations are known.35
Few variants were also found in XK (8.2 variants/kb CDS) on the X chromosome. Mutations in XK may give rise to McLeod syndrome, indicating that preservation of this gene is beneficial.36,37 However, the total number of alleles included is also lower for genes on the X chromosome (3775 vs 5008 for autosomes). The GLOB system showed the third lowest degree of CDS conservation (9 variants/kb) and was the least variable glycosyltransferase gene.
Analysis of variant frequencies among populations
By studying the propagation of each variant across the 5 superpopulations, variants can be identified as candidates for being part of natural selection. Therefore, SDs for all variant frequencies in the superpopulations were calculated. The 16 most unevenly distributed variants are shown in Table 4. The GATA-motif polymorphism (c.−67T>C) in the promoter of the P vivax/P knowlesi receptor–encoding ACKR1 had the highest SD, 0.42. Another malaria-related variant,38,39 the Sl(a−) phenotype-encoding c.4801A>G (p.Arg1601Gly) in CR1 (SD 0.31), also surfaced in our analysis together with other less well understood candidates.
Matching 1000G data to reference allele data
In the next step, we matched the variants and alleles from the 1000G Project with the previously identified variants and alleles from the tables provided by ISBT, The Blood Group Antigen FactsBook, and dbRBC.11,28,29 A summary of the results from the matching is displayed in Figure 3 and supplemental Table 5. With the current matching algorithm, 90.1% of the total number of alleles (189 552 of 210 412) could be matched with previously known alleles. Still, this only includes 38.9% (958 of 2462) of all unique alleles in the 1000G dataset, indicating a better match for more common alleles. The remaining 1504 unique alleles were not previously described. Of the 1241 nonsynonymous variants found in the CDS, 241 can be classified as previously known to be involved in blood group variation as defined by ISBT. Of the remaining 1000 variants, further analysis shows that 634 of these were only found in 1 allele each in the dataset. Interestingly, 357 of the 1000 non-synonymous variants (not previously associated with blood group variation) are predicted to occur on the extracellular portions of 31 blood group–carrying proteins, and some may represent undiscovered antigens. As many as 200 of the 1000 variants targeted genes encoding glycosyltransferases, ranging from 4 in B3GALNT1 to 47 in GBGT1 (supplemental Table 5). Of the 1088 variants classified as matchable in the reference tables, only 277 could be found in 1000G, implying that a lot of the previously identified variation is rare or restricted to certain populations.
A searchable online database of blood group alleles
To provide an overview of the results as well as detailed information about each allele, we created a searchable database available through a Web interface (www.erythrogene.com). The search engine allows for quick navigation between genes and advanced queries against most of the study results. All alleles found in 1000G and alleles classified as matchable from reference tables were incorporated. Different search terms can be used, such as gene, nucleotide or amino acid change, and parts of nucleotide and amino acid sequences. The output terminology strictly follows that of ISBT and HGVS, but in rare cases, the predicted protein changes induced by certain genetic alterations (eg, splice site, promoter, and SVs) have limitations (eg, regarding nomenclature). Superpopulation allele frequencies are also shown, as well as gene and protein sequences for any given allele. A description of how to use Erythrogene is provided in supplemental Figure 2. Finally, supplemental Table 6 provides genomic coordinates for all included variants, both for GRCh37 and for the updated assembly, GRCh38.
We describe here the development of tools to specifically extract and dissect 43 blood group–related genes from 2504 individuals and to match them with the current set of blood group variants as defined by ISBT, the official body responsible for blood group nomenclature used for clinical and scientific reporting worldwide. To our knowledge, this study reports the first in-depth analysis of all 36 blood group systems in such a large and ethnically diverse NGS dataset. Blood group prediction based on NGS has been on the horizon for several years now, but interest was further sparked by introduction of this technology into tissue typing laboratories.18 Earlier this year, NGS-based blood group and platelet phenotype predictions were reported for a single individual in a proof-of-principle study.20 All blood group systems except XG were assessed, and >200 antigens were predicted, 17 of which were serologically confirmed.
In total, 90.1% of all alleles from 1000G could be matched with already known alleles, implying that 9.9% are not described before. The fact that the latter constitute 61.1% of the total number of unique alleles in the dataset indicate that they are rare and have therefore not previously been reported. Another reason might be that the variants involved do not significantly alter antigen expression and thus have gone undetected. Some alleles also show resemblance with ≥2 known alleles. Hybrid alleles are known to occur frequently in blood group systems like ABO, Rh, and MNS, but some hybrids detected in this study may also be the result of inaccurate allele imputation. On the other hand, the current catalogs of alleles from ISBT are not automatically updated, and some alleles identified here as novel might have been described elsewhere. The confusion about which reference sequence to use for some blood group genes, also reported by others,20 is a major issue. Hopefully, the effort we put into sorting it out will facilitate the process and thereby help future research.
Of the 1241 nonsynonymous variants identified in the dataset, only 19% occurred in our reference list sources. Of the remaining 1000, 20% are glycosyltransferase variants, some of which may alter carbohydrate antigen expression by modifying enzymatic activity or specificity. We also predicted that 36% of the unknown variants alter amino acids in extracellular portions of RBC proteins and hypothesize that some may constitute blood group antigens not yet discovered. This might represent a “forward genetics” approach for blood group discovery. While it is difficult to speculate which ones are more immunogenic and thereby likely candidates, new antigens are indeed recognized every year. Sixty-three percent of unknown variants occurred as single alleles. We cannot exclude that some may be due to 1000G sequencing or interpretation errors.
We also compared how much the frequency of all variants detected differed between superpopulations. The genetic diversity is known to be high in African populations,40 and in Table 4, the superpopulation of African origin shows the highest or lowest allele frequency for all entries. Among the 16 most skewed variants, only 2 are missense mutations, and only 6 genes are represented. The function of the others is unclear. Not surprisingly, malaria-related polymorphisms like the GATA-motif polymorphism in the ACKR1 promoter (c.−67T>C), which protects homozygous carriers from RBC invasion by P vivax and P knowlesi,12,41 and c.4801A>G (p.Arg1601Gly) in CR1 associated with Sl(a−) phenotype and reduced rosetting in P falciparum malaria38,39 top the list. More unexpected and worth pursuing are the variants in A4GALT and 3 erythroid transporter genes (RHD, ABCG2, and AQP1). Notably, neither GYPA nor GYPB variants appeared in Table 4 despite that GPA and GPB are known receptors for P falciparum. However, the GYPB deletion allele (GYPB*01N) that defines the S−s−U− phenotype among Africans was deemed unmatchable, while 2 alleles (GYPB*03N.01 and GYPB*03N.03) were detected at 0.04% and 0.18%, respectively, in this superpopulation.
For ABO, the 6 most common alleles in 1000G were matched to the 6 major alleles: ABO*O.01.01, ABO*O.01.02, ABO*B.01, ABO*A1.01, ABO*A1.02, and ABO*A2.01. The reference allele, ABO*A1.01, is only the fourth most common allele in this dataset (9.5%). Fifteen of the alleles directly following the 6 major alleles above are group O variants, 14 of which contain c.261delG. Only 2 of the 20 most common alleles are unmatched. In total, 110 of 5008 ABO alleles in the database, and 16 of 48 non-synonymous CDS variants, could not be found, implying that even in a well-investigated blood group system like ABO, the 1000G project reveals new findings.
The allele carrying the RHD gene deletion, RHD*01N.01, was found in 20.6% of the alleles in the 1000G dataset. However, a closer look at this SV reveals less than perfect integration with the main dataset. Some of the alleles carrying the RHD deletion also carry the c.1136C>T polymorphism normally expected in the RHD*10.00 allele cluster (RHD*DAU) but this polymorphism is located within the deleted region. Similar problems were detected elsewhere in the 1000G data, and the most obvious explanation is that these SNPs are incorrectly imputed and belong to the allele in trans instead. The RHD deletion was identified as part of a CNV with 1032 alleles carrying a copy-number loss (deletion) and 11 individuals carrying a copy-number gain (duplication). The duplication has been described previously42 but is not included in the ISBT allele lists.
The quality of the results in a study like ours depends largely on the underlying dataset, and 1000G has undertaken comprehensive measures to maintain well-qualified data.4 While the mean coverage for whole-genome sequencing was limited to 7.4×, coverage for exome-targeted regions averaged at 65.7×. Obviously, most blood group–relevant polymorphisms occur in exons. To keep the false discovery rate of SNPs and indels low, a quality score was calculated by 1000G based on PCR-free sequence data from 1 individual in each population. The most limiting factor in current NGS technology is read length (∼100 bp in 1000G), and problems may arise (eg, when generating haplotypes in regions with a lower degree of polymorphism). Long-read sequencing technologies, such as the single-molecule real-time DNA sequencing technology, could provide significant contributions in solving these problems.43
The 1000G Structural Variation Analysis Group used several algorithms to detect 9132 additional small deletions (<1 kb) compared with the primary analysis,5 but surprisingly, we did not observe any deletions between 52 and 645 bp in the 43 genes. Conversely, 14 deletions at 646 to 5000 bp and 26 at 20 to 51 bp were found. As noted above, 2 characteristic and clinically important small indels in RHD and SMIM1 were not called in 1000G. Current 1000G imputation algorithms were developed to predict cis or trans allocation of neighboring variants (ie, on the same allele or haplotype). Even if now much improved, these algorithms still tend to overestimate the reference allele frequencies in highly polymorphic regions like the HLA region on chromosome 6.44
The database and search engine described here provide a platform for a wide variety of future applications. Updated reference allele tables from ISBT can be incorporated into the database when available and this will improve matching results further. The 1000G project has now evolved into the International Genome Sample Resource, enabling scientists to submit DNA sequencing data for more populations continuously. However, the database is not limited to blood group data from 1000G. Allele data can be generated for any gene having an LRG record, including other erythroid genes, and additional datasets from other large-scale sequencing projects can be incorporated. Even if the current study focused mainly on CDS variants, our database also contains detailed information and allows deeper analyses on variants in nonexonic regions.
This study demonstrates the power of integrating large-scale NGS data with previously made findings in transfusion medicine. It also highlights some of the caveats and challenges when interpreting NGS data. In parallel with the written paper, our results are available online in a search engine, Erythrogene, which will serve as portal to the blood group data from the 1000G Project. It also provides researchers and clinical reference laboratories with a user-friendly search interface for acknowledged and correctly named blood group alleles.
The full-text version of this article contains a data supplement.
This study was supported by grant 2014.0312 from the Knut and Alice Wallenberg Foundation (M.L.O.), grant 2014-71X-14251 from the Swedish Research Council (M.L.O.), and governmental Avtal om Läkarutbildning och Forskning grants to university health care in Region Skåne, Sweden (M.J., J.R.S., and M.L.O.).
Contribution: M.M. designed and developed the database and algorithms used for matching blood group variants and alleles. M.J. made calculations on the location of nonsynonymous variants and supported M.M. in the development of algorithms. All authors interpreted data. J.R.S. and M.L.O. conceived the study. M.M. and M.L.O. drafted the manuscript. All authors read, revised, and approved the manuscript.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Martin L. Olsson, Division of Hematology and Transfusion Medicine, Department of Laboratory Medicine, Lund University, BMC C14, SE-22184 Lund, Sweden; e-mail: email@example.com.