Next: Flipping the light on Up: Consensus Sequence Zen Previous: Statistical effects of making

Missing the trees in the data forest.

Counting mismatches to a consensus can cause a binding site to be missed entirely. One of the most striking examples is a Fis site in the tgt/sec promoter of E. coli (Fig. 4). We carefully collected 60 sequences shown by DNase I footprinting to be bound by Fis, used information theory to maximize the information content of the alignment [Schneider & Mastronarde, 1996], and produced a model of how Fis binds [Hengen et al., 1997]. The model did a good job of predicting where Fis binds in the original footprint regions. However, the model also predicted a site in tgt/sec that was not noted by the original authors. Surprisingly, four pieces of data support the existence of a Fis site centered at position relative to the start of transcription [Slany & Kersten, 1992]:

1. Fis often induces DNase I hypersensitive phosphates; these are seen between bases and , corresponding to the Fis site at position . In addition there are hypersensitive positions between and , which correspond to the site shown by the walker at .

2. The end of the original DNase I footprints was not well determined, but could have extended to cover the Fis site at .

3. Gel mobility shift assays showed two band shifts when the entire region was used, one band shift when the region up to the conveniently (!) placed ClaI site was removed (which would eliminate the proposed Fis site), and none when both sites were removed.

4. Assays with these same nested DNAs showed that both sites activate transcription.

The authors were aware that there was a second binding site, but placed its location somewhere in the 69 bases upstream of the ClaI site. Why did they miss the site? Two positions (indicated by arrows in the figure) did not match the 'accepted' consensus sequence [Hübner & Arber, 1989]. The consensus method gave these positions far more weight than was appropriate. The information for the site at is 10.8 bits, which is 2 bits more than the average. To determine whether there really is a site there, we performed a gel shift experiment using a DNA containing only the proposed Fis site at and showed that the sequence is indeed bound by Fis [Hengen et al., 1997]. Because the consensus sequence failed to predict a site that had been documented experimentally, that site could not be seen, and to the scientists it did not exist [Kuhn, 1970].
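The difference between the two ways of scoring a candidate site can be sketched in Python. This is only a minimal illustration, not the procedure of Hengen et al.: the aligned sites are invented toy data, the function names (weight_matrix, individual_information, consensus, mismatches) are hypothetical, and a simple pseudocount is assumed in place of the small-sample correction used in the information-theory method.

```python
import math
from collections import Counter

BASES = "ACGT"

# Invented toy alignment (NOT real Fis sites): positions 1, 2, 5 and 6 are
# strongly conserved, positions 3 and 4 are variable.
SITES = [
    "GTAACA", "GTAGCA", "GTGACA", "GTAACA",
    "GTGGCA", "GTATCA", "GTGACA", "GTCACA",
]

def weight_matrix(sites, pseudocount=1.0):
    """Per-position weights 2 + log2 f(b, l), in bits.

    The pseudocount is an assumption made for this sketch so that unseen
    bases get a finite (negative) weight.
    """
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append(
            {b: 2.0 + math.log2((counts[b] + pseudocount) / total) for b in BASES}
        )
    return matrix

def individual_information(seq, matrix):
    """Sum the weights of the bases actually present in seq."""
    return sum(column[base] for column, base in zip(matrix, seq))

def consensus(sites):
    """Most frequent base at each position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def mismatches(seq, cons):
    """Number of positions where seq differs from the consensus."""
    return sum(a != b for a, b in zip(seq, cons))

matrix = weight_matrix(SITES)
cons = consensus(SITES)

# Two hypothetical candidates: one differs at the two variable positions,
# the other at a single strongly conserved position.
for candidate in ("GTGGCA", "GTAACT"):
    print(candidate,
          "mismatches:", mismatches(candidate, cons),
          "bits:", round(individual_information(candidate, matrix), 2))
```

In this toy alignment the candidate with two mismatches at weakly conserved positions retains more information than the candidate with a single mismatch at a strongly conserved position. A mismatch count weights every position equally and so ranks them the other way around.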

A more critical example is in the hMSH2 gene, which is associated with familial nonpolyposis colon cancer [Rogan & Schneider, 1995]. A 'T' to 'C' transition occurred at position of an acceptor site and this change was proposed to be the cause of the disease [Fishel et al., 1993]. Inspection of the logo in Fig. 1 shows that the consensus at position (base zero is just to the left of the vertical bar, the first base on the intron side) is a T, but that close to half of the bases in the polypyrimidine tract are C. When the transition is made, the individual information changes by only 0.2 bits, which is not a significant difference. A study of 20 normal people found that 2 of them had this change [Leach et al., 1993], so the change is a polymorphism unrelated to the disease.
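The size of such a change can be illustrated with a short calculation. When one base is substituted for another at a single position, the individual information changes by the log ratio of the two base frequencies at that position. The frequencies below are hypothetical values chosen only to mimic a position where T and C are both common; they are not read from the logo, and delta_information is a name introduced for this sketch.

```python
import math

def delta_information(f_before, f_after):
    """Change in individual information (bits) when a base with frequency
    f_before at a position is replaced by one with frequency f_after."""
    return math.log2(f_after / f_before)

# Hypothetical frequencies, for illustration only.
f_T = 0.50      # consensus base
f_C = 0.44      # nearly as common as the consensus
f_rare = 0.02   # a base almost never seen at this position

common_change = delta_information(f_T, f_C)     # small loss, a fraction of a bit
rare_change = delta_information(f_T, f_rare)    # large loss, several bits

print(f"T -> common base: {common_change:.2f} bits")
print(f"T -> rare base:   {rare_change:.2f} bits")
```

Both substitutions count as one mismatch to the consensus, yet only the second removes a substantial amount of information from the site.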

Why did this potential 'misdiagnosis' happen? We suppose that T was taken to be the consensus sequence. Given this, one would interpret any change from that consensus as detrimental. In this case the consensus sequence was so rigid that it could not handle a subtle change, and a site disappeared from the scientist's view even though it was still functional. As DNA sequencing technologies become widely available to doctors, this situation will come up repeatedly. Serious malpractice suits could occur as a result of using the consensus model.


Tom Schneider 2002-12-05