The human genome has been estimated to contain tens of thousands of genes, and the promoters of almost two thousand of these have been verified experimentally. We have examined the DNA sequences just upstream of the transcription start site, a region which includes the TATA box. Genetic control sites, such as promoters, often have a characteristic consensus sequence, but the variation about a given consensus sequence has received little attention. Such sequence variations may be related to functional differences amongst the control sites. Principal components analysis was chosen because of its generality and the variety of phenomena which it reveals. Promoter sequences were considered because of the large number available and their importance in gene expression.

The sequences of 1,977 promoters recognised by human RNA polymerase II were obtained from the Eukaryotic Promoter Database. Many of these promoters are of interest in oncology, and the database includes sequences for growth factors (e.g. GM-CSF, interleukins), oncogenes and tumour viruses, among others. Sub-sequences of 25 bases centred on position −13 relative to the transcription start site were extracted. Each base was encoded with two bits (a=11, c=00, g=10 and t=01), giving 50 binary variables per sequence, and the covariance matrix of these 50 variables was determined. The eigenvalues and eigenvectors of the covariance matrix were then calculated. Calculations were carried out in MS-Excel and SYSTAT 11.

The eigenvalues of the covariance matrix ranged from 0.571 down to 0.133. The eigenvectors were used to calculate principal components, so that 50 more or less correlated variables were transformed into 50 uncorrelated variables with the same total variance. The sequences were sorted according to each principal component to reveal which features were associated with the most variation amongst the sequences.
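The encoding and decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original calculations were done in MS-Excel and SYSTAT 11, NumPy is substituted here, the function names `encode` and `principal_components` are hypothetical, and random 25-base strings stand in for the real promoter windows, so the resulting numbers have no biological meaning.

```python
import numpy as np

# Two-bit code taken from the text: a=11, c=00, g=10, t=01.
CODE = {"a": (1, 1), "c": (0, 0), "g": (1, 0), "t": (0, 1)}


def encode(seq):
    """Turn a DNA string into a flat list of bits, two per base."""
    return [bit for base in seq.lower() for bit in CODE[base]]


def principal_components(seqs):
    """Return eigenvalues (largest first), eigenvectors (as columns) and scores."""
    X = np.array([encode(s) for s in seqs], dtype=float)
    Xc = X - X.mean(axis=0)                 # centre each of the coded variables
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the variables
    evals, evecs = np.linalg.eigh(cov)      # eigh suits a symmetric matrix
    order = np.argsort(evals)[::-1]         # eigh returns ascending order
    evals, evecs = evals[order], evecs[:, order]
    scores = Xc @ evecs                     # one principal-component score per column
    return evals, evecs, scores


# Hypothetical toy input: random sequences, NOT the promoter data.
rng = np.random.default_rng(0)
toy_seqs = ["".join(rng.choice(list("acgt"), size=25)) for _ in range(200)]

evals, evecs, scores = principal_components(toy_seqs)

# Sort the sequences by their first principal component, largest first.
ranking = np.argsort(scores[:, 0])[::-1]
```

With 25 bases per sequence the covariance matrix is 50 by 50, the eigenvalues sum to the total variance of the coded variables, and the columns of `scores` are mutually uncorrelated, which is the transformation the text describes.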
When the covariances among the coded sequences were calculated, many associations were found: for example, a purine at position 15 was associated with a purine at position 16, and a purine at position 19 with a G or C at position 20. Although these correlations were not individually strong, together they were a notable feature of the set of sequences. The consensus sequence was observed to be agggg ggggg ggc(g/c)c ggggg gcgcc. Principal components analysis identified the promoters which differed most (in opposite directions) from the consensus sequence, taking account of the correlations. Nearly all the elements of the first eigenvector were of alternating sign; thus the first principal component separated promoters which were rich in G from those rich in T. Almost all elements of the second eigenvector were positive, so the second principal component distinguished promoters rich in A from those rich in C. There was a remarkable concentration of promoters from genes for interleukins or IL receptors with large values for the second principal component: IL1A, IL2, IL4, IL6-2, IL2RA1, IL2RA2 and IL8RB were in positions 160, 43, 14, 158, 131, 101 and 158 (out of 1,977) respectively.

The variation in the sequence of promoters about their consensus sequence is thus seen not to be random but to display detectable patterns. Correlations were frequent within the promoter sequences considered here; in the absence of correlations (and with equal variances for the coded variables) all the eigenvalues would have been equal. The major principal components separated promoters with markedly different sequences, and it is to be expected that the other principal components would yield further separations.
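The reading of the first two eigenvectors follows directly from the two-bit code (a=11, c=00, g=10, t=01), and a small Python sketch may make the link explicit. The weight pairs below are idealised stand-ins, assumed for illustration only, and are not the eigenvectors actually obtained; mean centring is omitted because it shifts every score by the same constant.

```python
# Two-bit code taken from the text: a=11, c=00, g=10, t=01.
CODE = {"a": (1, 1), "c": (0, 0), "g": (1, 0), "t": (0, 1)}


def score(seq, weights_per_base):
    """Project a sequence onto an idealised eigenvector that repeats the
    same (first-bit, second-bit) weight pair at every base position."""
    w1, w2 = weights_per_base
    return sum(w1 * b1 + w2 * b2 for b1, b2 in (CODE[b] for b in seq.lower()))


ALTERNATING = (1, -1)   # hypothetical stand-in for the first eigenvector
ALL_POSITIVE = (1, 1)   # hypothetical stand-in for the second eigenvector

per_base_alt = {b: score(b, ALTERNATING) for b in "acgt"}
per_base_pos = {b: score(b, ALL_POSITIVE) for b in "acgt"}

print(per_base_alt)  # {'a': 0, 'c': 0, 'g': 1, 't': -1}
print(per_base_pos)  # {'a': 2, 'c': 0, 'g': 1, 't': 1}
```

Weights of alternating sign give G the highest contribution and T the lowest, with A and C neutral, whereas weights that are all positive give A the highest contribution and C the lowest, with G and T in between.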
