In recent years, cancer gene discovery studies have delivered a comprehensive list of the genes recurrently mutated in cancer. Myeloid disease has been at the forefront of such discoveries, with more than 100 genes characterized. Most genes are infrequently mutated (<5%) and carry complex patterns of mutations (activating, inactivating, gain-of-function, etc.).

From a clinical perspective, these molecular findings are increasingly incorporated into diagnostic and prognostic classifiers for disease diagnosis, risk stratification and treatment decisions. To support clinical decisions, molecular profiling of gene panels is implemented at diagnosis. However, the annotation of gene mutations poses significant challenges. Not all mutations identified in a cancer gene are pathogenic; conversely, rare, previously unseen mutations within well-established cancer genes can be pathogenic. Manual curation of variants is time-consuming, requires specialized expert and technical knowledge, imposes thresholds (5% variant allele fraction) and is subject to false-positive and false-negative results. Translation of cancer genome discoveries into the clinic depends on a reproducible, evidence-based resource to support accurate variant annotation. To address this unmet clinical need, we developed a clinical diagnosis support tool that prioritizes variants identified by next-generation sequencing of myeloid cancer genes according to their putative role in myeloid pathogenesis.

Data and Methods

We assembled a database of 22,680 variants (22% oncogenic, 14% sequencing artefacts, 22% germline polymorphisms and 42% variants of unknown significance) in the 275 most frequently mutated genes in myeloid neoplasia, drawn from patients affected by AML (n=4607), MDS (n=758) and MPN (n=2041), using published and in-house data. Patient samples were sequenced by custom capture and Illumina sequencing technologies using protocols that reflect current clinical profiling workflows. Triple-blinded expert annotation of each variant was performed as previously described (Papaemmanuil et al, Blood 2013; Papaemmanuil E & Gerstung et al, NEJM 2016). The database contains 4,990 oncogenic variants, of which 65% are missense, 25% stop-gained and 10% splice-site mutations.
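As a quick arithmetic check on the composition figures above, the following sketch converts the reported percentages into approximate counts and sums the per-disease patient numbers. The rounded class counts are derived here from the percentages and are approximations, not values reported in the abstract; only the 4,990 oncogenic variants and the patient numbers are stated in the text.

```python
# Approximate variant counts implied by the reported percentages.
TOTAL_VARIANTS = 22_680

class_fractions = {
    "oncogenic": 0.22,
    "sequencing artefact": 0.14,
    "germline polymorphism": 0.22,
    "unknown significance": 0.42,
}

# Rounded counts derived from the percentages (approximate, not reported).
class_counts = {
    name: round(TOTAL_VARIANTS * fraction)
    for name, fraction in class_fractions.items()
}

# Patients per disease, as stated in the text.
patients = {"AML": 4607, "MDS": 758, "MPN": 2041}
total_patients = sum(patients.values())

for name, count in class_counts.items():
    print(f"{name}: ~{count}")
print(f"patients: {total_patients}")
```

The oncogenic count rounds to 4,990 and the patient total is 7,406, consistent with the figures quoted elsewhere in the abstract.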

We applied supervised classification to learn a model from the existing annotated variant database. The model predicts the probability that each variant is pathogenic and thus of clinical relevance.
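The abstract does not name the classifier, so the following is only a minimal sketch of supervised classification that outputs a probability per variant. It assumes a plain logistic regression trained by gradient descent, and it uses synthetic data with two hypothetical binary features (`seen_in_somatic_db`, `common_in_population`); neither the algorithm nor the features or data are taken from the study.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    # Probability that a variant with feature vector x is oncogenic.
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean log-loss.
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Synthetic training set (assumption): features are
# [seen_in_somatic_db, common_in_population], label 1 = oncogenic.
random.seed(0)
X, y = [], []
for _ in range(200):
    oncogenic = random.random() < 0.5
    if oncogenic:
        x = [float(random.random() < 0.8), float(random.random() < 0.05)]
    else:
        x = [float(random.random() < 0.1), float(random.random() < 0.5)]
    X.append(x)
    y.append(1.0 if oncogenic else 0.0)

weights, bias = train_logistic(X, y)
p_somatic = predict_proba(weights, bias, [1.0, 0.0])
p_common = predict_proba(weights, bias, [0.0, 1.0])
print(f"seen in somatic database only: {p_somatic:.2f}")
print(f"common in population only:     {p_common:.2f}")
```

In this toy setting, a variant seen in a somatic database scores higher than one that is common in the population. A real model would be trained on the expert-annotated database rather than on simulated labels.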


Our predictive model considers informative features derived from sequencing measurements (such as read depth and read directionality), metrics extracted from control panels, genetic attributes (gene, mutation type, effect), and information extracted from external databases of somatic mutations in cancer (COSMIC) and of germline variation (ExAC, 1000 Genomes). We assessed the predictive power of 60 features and identified the minimal set needed to categorize variants accurately (30 features). Model performance was similar within and across disease subsets, illustrating its generalizability. We also evaluated the model with internal cross-validation and with external validation on independently processed variants from an MDS cohort (n=944) (Haferlach et al, Leukemia 2014). Overall, the model classifies variants with an area under the ROC curve (AUC) greater than 0.9.
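For readers less familiar with the AUC metric quoted above, the sketch below computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the four example values are illustrative assumptions, not data from the study.

```python
def roc_auc(labels, scores):
    # AUC as the probability that a random positive outranks a random negative.
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Illustrative labels (1 = oncogenic) and predicted probabilities.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = roc_auc(example_labels, example_scores)
print(example_auc)  # 0.75
```

Three of the four positive-negative pairs are ranked correctly here, giving 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.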

Finally, we inspected the distributions of predicted probabilities and confirmed that the prioritization performed by our method is meaningful. Rare somatic mutations receive higher probabilities of a putative oncogenic assignment than sequencing artefacts or germline variants, and the prioritized variants correlate with expected clinical outcomes.
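One simple way to inspect such distributions is to summarize the predicted probabilities per annotated class. The sketch below does this with the median; the helper name `median_probability_by_class` and all values in `toy_records` are made up for illustration and are not results from the study.

```python
from collections import defaultdict
from statistics import median

def median_probability_by_class(records):
    # records: iterable of (annotated_class, predicted_probability) pairs.
    grouped = defaultdict(list)
    for variant_class, probability in records:
        grouped[variant_class].append(probability)
    return {cls: median(values) for cls, values in grouped.items()}

# Made-up values for illustration only; not results from the study.
toy_records = [
    ("oncogenic", 0.91), ("oncogenic", 0.84), ("oncogenic", 0.77),
    ("artefact", 0.08), ("artefact", 0.15), ("artefact", 0.03),
    ("germline", 0.12), ("germline", 0.05), ("germline", 0.20),
]

summary = median_probability_by_class(toy_records)
for variant_class, value in summary.items():
    print(f"{variant_class}: median predicted probability {value:.2f}")
```

A meaningful prioritization would show the oncogenic class clearly separated from the artefact and germline classes, as in this toy example.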


Variant annotation poses a significant challenge in the clinic. Using a reference database of 22,680 variants from 7,406 patients with myeloid disease, we deliver an automated, standardized and unbiased approach for variant annotation. Our model considers technical parameters from sequencing, accounts for germline variation and incorporates current knowledge from cancer genome profiling studies. The method will be available as an open-access online tool, to which clinical laboratories can upload variants discovered through their clinical pipelines to obtain instant and accurate variant interpretation.


Haferlach: MLL Munich Leukemia Laboratory: Employment, Equity Ownership.
