Background: Gaucher disease (GD) is a rare, autosomal recessive condition characterized by deficiency of the lysosomal enzyme β-glucocerebrosidase. The main disease features are anemia, thrombocytopenia, hepatosplenomegaly, bone infarction, osteonecrosis, and pathological fractures. However, diagnosis of GD can be challenging, especially for non-specialists, owing to wide variability in age at presentation and in the severity and type of clinical manifestations, non-specific features, and lack of awareness of the early signs and symptoms of the disease. Delayed diagnosis and misdiagnosis of GD may lead to irreversible bone disease, severe growth retardation, and a high risk of bleeding; in rare cases, misdiagnosis may be life-threatening. A system for early and accurate diagnosis of GD is thus an essential unmet need. An algorithm for early diagnosis of patients with rare diseases such as GD may help reduce delays in diagnosis, enable prompt and appropriate initiation of therapy and earlier decision-making, prevent potentially irreversible morbidities and unnecessary (sometimes invasive) tests, reduce anxiety, and facilitate genetic counseling. This study aims to develop a predictive model for the accurate diagnosis of GD using machine learning based on real-world clinical data.

Methods: This study will comprise three parts. Part 1, a retrospective observational database analysis, will use data from the electronic patient database of the Maccabi Healthcare Service (MHS), the second-largest Health Maintenance Organization in Israel. The MHS database includes 2.2 million health records from 25% of the Israeli population. Clinical records have been fully computerized for >20 years and are fully integrated with automated central laboratory, digitized imaging, and pharmacy purchase data. Patients with confirmed GD who have been enrolled in the MHS health plan for ≥1 year will be eligible for inclusion, with approximately 250 patients with GD expected to be included. Using MHS data from patients with GD, the Gaucher Earlier Diagnosis Consensus (GED-C) scoring system, which was developed by a consensus panel using Delphi methodology to identify the signs and co-variables that may be important for the diagnosis of GD, will be evaluated and compared with alternative scores derived directly from clinical data by supervised machine learning.

In Part 2, a clinical study, the best-performing modeled scores from Part 1 will be applied to the MHS database to identify individuals who may have undiagnosed GD ('GD suspects'). Samples for diagnostic testing, using a specific and sensitive biomarker (glucosylsphingosine, lyso-Gb1) followed by β-glucocerebrosidase (GBA) genotyping for positive samples, will be collected from the MHS biobank for individuals who have consented. Individuals not participating in the biobank will be asked to provide a sample. This part of the study will evaluate the predictive value of the modeled scores and assess the sensitivity and specificity of the model for the diagnosis of new patients with GD.
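The three quantities named for Part 2 (sensitivity, specificity, and positive predictive value) can all be derived from a 2×2 table of model flags against confirmatory test results. The following is a minimal sketch only, not the study's analysis code: the function name `diagnostic_metrics` and the counts in the usage lines are hypothetical and invented purely for illustration.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Summarize a 2x2 table of model flags versus confirmed diagnoses.

    tp: flagged by the model and confirmed as GD
    fp: flagged by the model but not confirmed
    fn: not flagged but confirmed as GD
    tn: not flagged and not confirmed
    """
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("each denominator needs at least one individual")
    return {
        # Share of true GD cases that the model flags.
        "sensitivity": tp / (tp + fn),
        # Share of non-GD individuals that the model leaves unflagged.
        "specificity": tn / (tn + fp),
        # Share of flagged individuals ('GD suspects') who are confirmed.
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts, chosen only to show the arithmetic.
metrics = diagnostic_metrics(tp=8, fp=32, fn=2, tn=958)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a rare disease, the positive predictive value can remain low even when sensitivity and specificity are both high, which is why the predictive value of the score is evaluated separately.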

In Part 3, data from newly diagnosed patients identified in Part 2 will be used to develop machine learning models for the diagnosis of GD (Figure 1). Signs and co-variables included in the GED-C score will be used, eliminating features that are non-informative. Features will be quantitative where possible, and interaction terms will be added for age of onset and trend for key features. Several models will be developed, and the one with the best precision at a given sensitivity level will be selected as the final model. External validation of the best model identified is planned to ensure an unbiased estimate of the model's accuracy.
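The selection criterion described above, precision at a given sensitivity level, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the function `precision_at_sensitivity`, the labels, the two sets of scores, and the names `model_a` and `model_b` are all hypothetical. The function lowers the score threshold until the target sensitivity is first reached and reports the precision at that threshold.

```python
from typing import Sequence


def precision_at_sensitivity(
    y_true: Sequence[int], scores: Sequence[float], target_sensitivity: float
) -> float:
    """Precision at the highest score threshold whose sensitivity reaches the target."""
    if not 0 < target_sensitivity <= 1:
        raise ValueError("target_sensitivity must be in (0, 1]")
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive case")

    # Rank individuals from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Individuals tied at the same score are flagged together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / total_pos >= target_sensitivity:
            return tp / (tp + fp)
    raise RuntimeError("unreachable: sensitivity is 1.0 once everyone is flagged")


# Hypothetical labels (1 = confirmed GD) and scores from two candidate models.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
candidates = {
    "model_a": [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.1, 0.05],
    "model_b": [0.9, 0.2, 0.4, 0.8, 0.7, 0.3, 0.6, 0.1],
}

# Compare the candidates at the same sensitivity level and keep the most precise.
results = {
    name: precision_at_sensitivity(labels, s, target_sensitivity=1.0)
    for name, s in candidates.items()
}
best = max(results, key=results.get)
print(results, best)
```

Fixing the sensitivity level first and then comparing precision puts all candidate models on the same footing: each must find the same share of true cases, and the preferred model is the one that does so with the fewest false alerts.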

Discussion: The main goal of the study is to develop an algorithm that helps detect patients with GD, independent of physicians' ability to recognize signs and symptoms, by applying machine learning to data from a large health database. The study is expected to result in a practical tool that will alert physicians to the possibility of GD. The resulting model will also improve our understanding of GD by showing the relative importance of features for GD prediction. Such tools will have a positive impact on patient care, quality of life, and healthcare costs, and may lead to a change in the approach to diagnosing rare diseases.


Revel-Vilk: Takeda: Honoraria; sanofi-Genzyme: Honoraria; Pfizer: Honoraria. Chodick: Novartis Pharma AG: Other: Institutional grant. Gadir: Takeda: Current Employment.
