The simultaneous quantification of thousands of genes in gene expression profiling (GEP) on DNA chips is part of the whole-genome sequencing revolution. Affymetrix® chip technology provides both a quantitative fluorescence signal and a decision of absent or present gene expression (absent or present call) based on signed-rank algorithms applied to several hybridization repeats of each gene spread on a single chip. To avoid an empirical normalization between chips of the same experiment, we developed an analysis of GEP based on Affymetrix present or absent calls. Bone marrow aspirates from newly diagnosed multiple myeloma (MM) patients were purified with CD138 automated magnetic cell sorting. Amplified RNA was run on U133A+B Affymetrix DNA microarrays for a first set of 65 patients, or on U133Plus2 microarrays for a second cohort of 40 patients. Scan files were transferred to an Oracle® database and analyzed with web-oriented scripts for both unsupervised and supervised non-parametric analysis of either the fluorescence signal or the Affymetrix call. To build a multiclass call-based predictor, the observed distribution of present calls for each probeset was first compared between predetermined sample groups using a χ² test, and probesets were kept only if the test statistic exceeded a threshold chosen to keep the list compatible with further analysis and calculation time (usually 100 to 1000 genes). The power of a probeset list to classify the different groups (number of present calls/number of probesets) was then evaluated for each sample of a group and compared with each sample of all other groups by calculating the reduced deviation (RD) in paired comparisons. Three statistics were derived from these comparisons: the overall number of nonsignificant comparisons (NS) at a chosen precision; the sum of the reduced deviations divided by the square root of the probeset number (f, independent of the list size); and the smallest RD (RDmin).
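The three list-level statistics can be illustrated with a minimal Python sketch. The abstract does not give the RD formula, so this assumes RD is the absolute two-proportion z statistic with pooled variance, computed on the present-call fractions of two samples over the same probeset list, and that "precision" is a threshold on RD (1.96 here). The function names and the toy data are hypothetical.

```python
import math
from itertools import product


def reduced_deviation(calls_a, calls_b):
    """Absolute two-proportion z statistic between two samples.

    Each argument is a list of 0/1 present calls over the same probeset
    list. The pooled-variance formula is an assumption, not taken from
    the source.
    """
    n = len(calls_a)
    pa = sum(calls_a) / n
    pb = sum(calls_b) / n
    pooled = (pa + pb) / 2
    var = pooled * (1 - pooled) * 2 / n
    if var == 0:
        return 0.0  # identical all-present or all-absent samples
    return abs(pa - pb) / math.sqrt(var)


def score_list(group_a, group_b, threshold=1.96):
    """Return (NS, f, RDmin) over every cross-group sample pair."""
    rds = [reduced_deviation(a, b) for a, b in product(group_a, group_b)]
    ns = sum(rd < threshold for rd in rds)      # nonsignificant comparisons
    f = sum(rds) / math.sqrt(len(group_a[0]))   # size-independent sum
    return ns, f, min(rds)


# Toy data: two samples per group, four probesets each (hypothetical).
group_a = [[1, 1, 1, 1], [1, 1, 1, 0]]
group_b = [[0, 0, 0, 0], [0, 0, 0, 1]]
ns, f, rd_min = score_list(group_a, group_b)
```

For more than two groups, the same scoring would be repeated for each group against all others.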
The minimum predictive probeset list was obtained by deleting each probeset in turn and computing NS, f and RDmin from the remaining probeset list. A probeset is left out if NS is reduced; or if NS is unchanged and f is increased; or if NS and f are both unchanged and RDmin is either increased or higher than the chosen precision. The process is then run again on the shortened list until no more leave-outs are possible. This method was successfully validated by determining a 22-gene sex predictor with the 65-patient series that made it possible to classify gender with no error in the 40-patient validation group. Partial loss of chromosome Y was confirmed in 3 male MM patients by short tandem repeat analysis. Significant predictors could not be generated with randomly selected patient groups. Validation was also successful with P < .001 in predicting the immunoglobulin light chain of the validation group after training on the training group (lambda to kappa-type ratio between 1/4 and 1/3). Classification of the training set according to Salmon-Durie staging made it possible to generate a 97-gene predictor with a validation error of P < .01. This normalization-free method looks particularly promising for further applications such as diagnostic classification (MGUS), prognostic grouping and prediction of response to treatment. In addition, it can be used as a powerful tool to mine generated or published data on all cancer types.
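The leave-one-out pruning rule described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: RD is again taken to be a pooled two-proportion z statistic, "precision" is treated as an RD threshold of 1.96, "unchanged" f is tested with a floating-point tolerance, and only two groups are handled. All names are hypothetical.

```python
import math
from itertools import product


def score(group_a, group_b, keep, threshold):
    """Return (NS, f, RDmin) using only the probeset indices in `keep`."""
    n = len(keep)
    rds = []
    for a, b in product(group_a, group_b):
        pa = sum(a[i] for i in keep) / n
        pb = sum(b[i] for i in keep) / n
        pooled = (pa + pb) / 2
        var = pooled * (1 - pooled) * 2 / n   # assumed pooled variance
        rds.append(abs(pa - pb) / math.sqrt(var) if var > 0 else 0.0)
    ns = sum(rd < threshold for rd in rds)
    return ns, sum(rds) / math.sqrt(n), min(rds)


def should_drop(trial, current, threshold):
    """Apply the three-level rule: NS first, then f, then RDmin."""
    ns, f, rd_min = trial
    ns0, f0, rd_min0 = current
    if ns != ns0:
        return ns < ns0
    if not math.isclose(f, f0):
        return f > f0
    return rd_min > rd_min0 or rd_min > threshold


def prune(group_a, group_b, threshold=1.96):
    """Drop probesets one at a time until no leave-out is accepted."""
    keep = list(range(len(group_a[0])))
    changed = True
    while changed:
        changed = False
        for i in list(keep):          # iterate over a snapshot of the list
            if len(keep) == 1:
                break
            trial = [j for j in keep if j != i]
            current = score(group_a, group_b, keep, threshold)
            if should_drop(score(group_a, group_b, trial, threshold),
                           current, threshold):
                keep = trial
                changed = True
    return keep


# Toy data: probesets 0-2 separate the groups, 3 and 4 are uninformative.
group_a = [[1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]
group_b = [[0, 0, 0, 1, 0], [0, 0, 0, 1, 0]]
selected = prune(group_a, group_b)
```

On this toy input the two uninformative probesets are removed first, and redundant informative ones are then dropped as long as every comparison stays above the threshold.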
