6. PERFORMANCE
A. Database statistics
The following is a synopsis of the score distributions for the PDB and coiled-coil databases. The score distributions are approximated by Gaussians and the means and standard deviations of the Gaussians are given. PDB is a database of globular sequences from The Protein Data Bank (32,592 res.) described in Science 252:1162. The combined coiled-coil database contains 26,965 residues from various coiled-coil proteins (see Section 4: SCORING OPTIONS) and will be described in detail in print. Obviously, every family of coiled-coil proteins was scored with a scoring matrix that excluded residue frequencies from that family.
                       28 residue scan    21 residue scan    14 residue scan
                       mean  std.dev.     mean  std.dev.     mean  std.dev.
PDB           MTK      0.77  0.20         0.83  0.24         0.94  0.29
              MTIDK    0.80  0.18         0.86  0.21         0.95  0.26
              MTK_W    0.79  0.23         0.86  0.26         1.00  0.33
              MTIDK_W  0.86  0.18         0.92  0.22         1.04  0.27
Coiled coils  MTK      1.63  0.22         1.70  0.25         1.79  0.30
              MTIDK    1.69  0.18         1.74  0.23         1.82  0.28
              MTK_W    1.70  0.24         1.76  0.28         1.88  0.34
              MTIDK_W  1.74  0.20         1.79  0.24         1.89  0.30
From these numbers, several conclusions can be drawn:
- The difference between the mean scores in PDB and in coiled coils is slightly larger with the MTIDK matrix than with the MTK matrix. More importantly, the standard deviation of the score distribution is lower with the MTIDK matrix for both databases. This means that the MTIDK matrix yields a more consistent evaluation of globular and coiled-coil sequences and provides for a better resolution between the two score distributions. Not shown here is that the MTIDK matrix also improves the score of intermediate filament sequences relative to the scores of other coiled-coil sequences, thus providing for a more balanced scoring of the different families of coiled-coil proteins than the MTK matrix.
- For both matrices, weighting slightly decreases the resolution between the globular and coiled-coil score distributions.
- For all scoring methods, the resolution between the globular and coiled-coil score distributions decreases strongly with decreasing size of the scanning window.
- The difference in performance between the MTK matrix and the MTIDK matrix is small although the MTIDK matrix is derived from over twice the number of residues and many more protein families. I conclude that little further progress can be expected from even larger coiled-coil databases.
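The comparisons in this list can be checked numerically. The following Python sketch takes the means and standard deviations from the table above and computes a separation index for each scoring method and window size. This document does not define "resolution" formally; the index used here (difference of the two means divided by the root-mean-square of the two standard deviations) is an assumption chosen for illustration, and the names STATS and separation are hypothetical.

```python
from math import sqrt

# (mean, std.dev.) copied from the table in part A.
# Layout: STATS[database][method][window size]
STATS = {
    "PDB": {
        "MTK":     {28: (0.77, 0.20), 21: (0.83, 0.24), 14: (0.94, 0.29)},
        "MTIDK":   {28: (0.80, 0.18), 21: (0.86, 0.21), 14: (0.95, 0.26)},
        "MTK_W":   {28: (0.79, 0.23), 21: (0.86, 0.26), 14: (1.00, 0.33)},
        "MTIDK_W": {28: (0.86, 0.18), 21: (0.92, 0.22), 14: (1.04, 0.27)},
    },
    "coiled coils": {
        "MTK":     {28: (1.63, 0.22), 21: (1.70, 0.25), 14: (1.79, 0.30)},
        "MTIDK":   {28: (1.69, 0.18), 21: (1.74, 0.23), 14: (1.82, 0.28)},
        "MTK_W":   {28: (1.70, 0.24), 21: (1.76, 0.28), 14: (1.88, 0.34)},
        "MTIDK_W": {28: (1.74, 0.20), 21: (1.79, 0.24), 14: (1.89, 0.30)},
    },
}

METHODS = ("MTK", "MTIDK", "MTK_W", "MTIDK_W")
WINDOWS = (28, 21, 14)


def separation(method, window):
    """Assumed resolution measure: mean difference over RMS std.dev."""
    mu_g, sd_g = STATS["PDB"][method][window]
    mu_c, sd_c = STATS["coiled coils"][method][window]
    return (mu_c - mu_g) / sqrt((sd_g ** 2 + sd_c ** 2) / 2)


# One row per method, one column per window size (28, 21, 14).
for method in METHODS:
    row = "  ".join(f"{separation(method, w):5.2f}" for w in WINDOWS)
    print(f"{method:8s}{row}")
```

Under this assumed measure the index is higher for MTIDK than for MTK, lower for each weighted matrix than for its unweighted counterpart, and falls steadily from the 28 to the 14 residue window.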
B. High-scoring sequences in globular proteins
I scored release 13.0 (8/93) of the NRL_3D database (containing the sequences of proteins of known structure from PDB) with all four scoring methods and counted the number of segments obtaining probabilities >10%. The database contained 539 nonredundant protein sequences and excluded the coiled-coil proteins tropomyosin, hemagglutinin, GCN4, Gal4 and apolipoprotein E. Apolipoprotein E was included with the coiled-coil subset because its helices are very long compared to those of other helical bundles and because it forms a partly three-stranded structure. All other helical bundles were included with the globular proteins because their helices are short and frequently packed at irregular angles. These features generally prevent their detection by this algorithm, although several helices from four-helix bundles appear as high-scoring segments in the following table. Results are compared to the number of segments obtained in a database of sequences generated by means of a random number generator (see Science 252:1162).
(1 - MTK; 2 - MTIDK; 3 - MTK_W; 4 - MTIDK_W)

           PDB (NRL_3D)                                            RANDOM SEQUENCES
                 28 res.           21 res.           14 res.          28        21        14
           1   2   3   4     1   2   3   4     1   2   3   4       1   2     1   2     1   2
10-19%     8   5  11  13    37  22  24  35    96  85  99  85       1   2    12  10    51  60
20-29%     4   1   5   3    18  14  23  14    47  33  51  45       2   1    10   5    21  26
30-39%     2   0   2   4    14   8   9   9    29  35  42  21       2   0     7   4    14  14
40-49%     4   0   2   5     6   2  15  10    21  14  17  19       1   0     2   1     8   9
50-59%     2   2   1   1     1   4   4   7    11   9  11  14       0   0     1   0    10   9
60-69%     1   0   3   6     3   4   7   5     9  11  12  14       0   0     0   0     5   6
70-79%     3   2   2   1     4   1   6   1    12   7  12  13       0   0     2   1     6   4
80-89%     1   2   3   1     3   4   3   4    10  14   8  18       0   0     1   2     2   5
>= 90%     1   3   1   1     4   9   6   7    11  20   8  15       2   2     2   2     5   7
In this table, the number of segments per 10% increment levels off above 50% rather than decreasing continuously. This is due to the sigmoid shape of the curve that relates scores to probabilities which masks a continuing decrease in number of segments per score interval. Above 50%, the number of segments per 10% increment doubles from around 2 in the 28 res. scan to around 4 in the 21 res. scan and then triples to around 12 in the 14 res. scan. A similar progression at a lower level is observed for the random sequence database. This progression is due to the significantly poorer resolution of smaller scanning windows. The difference in numbers between PDB and random sequences is attributable to amphipathic helices that are frequently present in native proteins but are not a preferred element of random sequences. Outside the tail end of the score distribution seen in this table, the score distributions of PDB and random sequences are superimposable (see Science 252:1162). This means that the real resolution between the globular and coiled-coil score distributions is slightly lower than the nominal resolution.
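The levelling-off described above follows from how a score is converted to a probability. As a minimal sketch, assume that the probability is the coiled-coil Gaussian density divided by the sum of that density and the globular density weighted by a prior ratio of globular to coiled-coil residues. The function names and the value of PRIOR_RATIO (30 here) are assumptions made for illustration and are not taken from this document; the Gaussian parameters are the MTK values for the 28 residue scan from part A.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, sd):
    """Normal probability density at x."""
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))


# (mean, std.dev.) for MTK, 28 residue scan, from part A of this section.
GLOBULAR = (0.77, 0.20)
COILED = (1.63, 0.22)
PRIOR_RATIO = 30.0  # ASSUMED ratio of globular to coiled-coil residues


def probability(score):
    """Assumed score-to-probability conversion (two-Gaussian model)."""
    g_cc = gaussian(score, *COILED)
    g_gl = gaussian(score, *GLOBULAR)
    return g_cc / (PRIOR_RATIO * g_gl + g_cc)


# Equal steps in score give very unequal steps in probability.
for i in range(10, 19):
    score = i / 10
    print(f"score {score:.1f} -> {100 * probability(score):5.1f}%")
```

Under these assumptions the curve rises steeply over a narrow range of scores and flattens at both ends, so a 10% probability bin near the top of the scale covers a wider score interval than a bin in the middle.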
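The weighted matrices are less reliable than the unweighted matrices: in every window size, each weighted matrix assigns probabilities >10% to more globular segments than its unweighted counterpart.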
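The comparison of weighted and unweighted matrices can be reproduced by totalling the PDB columns of the table above. The Python sketch below does this; the counts are copied from the table, and the names PDB_COUNTS and totals are hypothetical.

```python
METHODS = ("MTK", "MTIDK", "MTK_W", "MTIDK_W")

# Segments per probability bin (10-19% ... >=90%) in the PDB scan.
# Each tuple lists the four methods in the order given in METHODS.
PDB_COUNTS = {
    28: [(8, 5, 11, 13), (4, 1, 5, 3), (2, 0, 2, 4), (4, 0, 2, 5),
         (2, 2, 1, 1), (1, 0, 3, 6), (3, 2, 2, 1), (1, 2, 3, 1),
         (1, 3, 1, 1)],
    21: [(37, 22, 24, 35), (18, 14, 23, 14), (14, 8, 9, 9), (6, 2, 15, 10),
         (1, 4, 4, 7), (3, 4, 7, 5), (4, 1, 6, 1), (3, 4, 3, 4),
         (4, 9, 6, 7)],
    14: [(96, 85, 99, 85), (47, 33, 51, 45), (29, 35, 42, 21),
         (21, 14, 17, 19), (11, 9, 11, 14), (9, 11, 12, 14),
         (12, 7, 12, 13), (10, 14, 8, 18), (11, 20, 8, 15)],
}


def totals(window):
    """Total number of segments above 10% for each scoring method."""
    rows = PDB_COUNTS[window]
    return {m: sum(row[i] for row in rows) for i, m in enumerate(METHODS)}


for window in (28, 21, 14):
    print(window, totals(window))
```

For the 28 residue scan this gives 26, 15, 30 and 35 segments for MTK, MTIDK, MTK_W and MTIDK_W respectively.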
The MTK matrix yields fewer high-scoring segments at probabilities >90% than the MTIDK matrix and thus appears more reliable even though its nominal resolution is poorer. This is probably an incorrect conclusion. As is detailed in the next paragraph, there are now several examples of sequences that do not assume a coiled-coil (or even alpha-helical!) structure under normal circumstances but that have the potential to do so if their context is changed. It therefore appears likely that the sequences which are assigned elevated coiled-coil probabilities by the COILS program actually do have the potential to form coiled coils, even though they do not do so in the protein context or under the conditions in which the structure was determined. The larger number of high-scoring segments with the MTIDK matrix would then be the result of an increased sensitivity of this matrix.
Virtually all segments with scores above 50% in 21 and 28 residue scans are centered on a surface helix, although several contain two discontinuous helices rather than one continuous helix. Several of the helices are from four-helix bundles and thus have coiled-coil characteristics. Following recent developments, it is increasingly likely that most (if not all) of these high-scoring sequences have an elevated coiled-coil-forming potential and could form coiled coils in a different context. This follows from three recent results:
- A loop segment of influenza hemagglutinin, pH7, which was predicted by COILS to have elevated coiled-coil potential, in fact forms a coiled coil in the pH4 structure (Bullough et al., Nature 371:37, 1994).
- The basic region of bZip transcription factors, which is not even alpha-helical in the absence of DNA, can be converted into a coiled coil by a designed peptide (Krylov et al., EMBO J. 14:5329, 1995).
- A peptide from topoisomerase II, which was identified using COILS, forms a coiled coil in solution but not in the structure of the full protein (Frere et al., J. Biol. Chem. 270:17502, 1995).
Nevertheless, the decreased coiled-coil-forming potential of these sequences relative to "constitutive" coiled coils can be seen from the fact that they score highly in one method but generally much lower in at least one of the other methods; example: 5LDH - lactate dehydrogenase:
seq      CAISILGKSLTDELALVDVLEDKLKGEMMDLQHGSLFLQTP
MTK      00112444444444444444444444444444411000000
MTK_W    35678999999999999999999999999999911000000
MTIDK    00000000000000000000000000000000000000000
MTIDK_W  00012333333333333333333333333333300000000
and several segments drop considerably in score from a 28 residue scan to a 21 residue scan; example: 2TS1 - tyrosyl-tRNA synthetase:
seq  PEKRAAQKTLAEEVTKLVHGEEALRQAIRIS
14   0001111111111111100000000000000
21   0222222222222222222222222220000
28   0777777777777777777777777777721
The latter effect is observed particularly if a segment contains two discontinuous helices. These effects can be taken as indicators for a decreased likelihood of coiled-coil formation since neither effect is normally observed in coiled coils, as can be seen in part C of this section.
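The two indicators just described can be turned into a simple check. The Python sketch below is an illustration only: it assumes that each digit in the rows above stands for a per-residue probability in tenths, and the function names peak and spread are hypothetical and not part of COILS. The rows are the 5LDH and 2TS1 examples given above.

```python
def peak(row):
    """Highest single-digit value in a per-residue probability row."""
    return max(int(ch) for ch in row if ch.isdigit())


def spread(rows):
    """Difference between the highest and lowest peak among several rows."""
    peaks = [peak(row) for row in rows]
    return max(peaks) - min(peaks)


# 5LDH, lactate dehydrogenase: one row per scoring method.
ldh = {
    "MTK":     "00112444444444444444444444444444411000000",
    "MTK_W":   "35678999999999999999999999999999911000000",
    "MTIDK":   "00000000000000000000000000000000000000000",
    "MTIDK_W": "00012333333333333333333333333333300000000",
}

# 2TS1, tyrosyl-tRNA synthetase: one row per scanning window.
ts1 = {
    14: "0001111111111111100000000000000",
    21: "0222222222222222222222222220000",
    28: "0777777777777777777777777777721",
}

print("5LDH spread across methods:", spread(ldh.values()))
print("2TS1 drop from 28 to 21 residues:", peak(ts1[28]) - peak(ts1[21]))
```

A large spread across methods, or a large drop between window sizes, would mark a segment as a less likely coiled coil in the sense discussed above; what counts as "large" is left to the user.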
C. Performance on coiled coils
In the following, secondary structure (c = coiled-coil helix) and coiled-coil-forming probabilities are shown beneath the sequences as scored by MTK, MTIDK, MTK_W and MTIDK_W, in that order. The values were obtained with a 21 residue scanning window, which appears to spot the ends of coiled-coil segments somewhat more accurately than a 28 residue window. (For spotting the ends of coiled-coil helices, see also the documentation for the auxiliary program CAPS.) The coiled coils in Gal4, GreA and human mannose-binding protein were analyzed with a 14 residue window because of their short length. Tropomyosin is not shown; it obtains probabilities >99% over its entire length except for the C-terminal 20 residues.
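When reading the probability rows in this part, it can help to extract the predicted segments programmatically. The following Python sketch assumes that each digit is a per-residue probability in tenths; the function name segments, its default threshold and the example row are hypothetical.

```python
import re


def segments(row, threshold=5):
    """Return (start, end) pairs, 1-based and inclusive, for every run
    of digits >= threshold in a per-residue probability row.
    threshold must be a single digit from 0 to 9."""
    pattern = re.compile(f"[{threshold}-9]+")
    return [(m.start() + 1, m.end()) for m in pattern.finditer(row)]


# Made-up row for illustration, not taken from any protein shown here.
toy_row = "0002799998300"
print(segments(toy_row))
print(segments(toy_row, threshold=9))
```

The start and end positions can then be compared with the stretch marked "c" in the secondary structure line to see how far a prediction overshoots the observed coiled coil.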
(C1) parallel, two-stranded structures
>GCN4 bZip (Cell 71:1223)
MKDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
hhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccccccccccccccccccccccccc
0000000000222779999999999999999999999999999999999999988330
0000011111777999999999999999999999999999999999999999988110
0000000000000224555566699999999999999999999999999999999770
0000000000000889999999999999999999999999999999999999999770
Similar probabilities (>99%) are obtained for the bZip regions of Fos and Jun (see Meth. Enzymology 266:513). As seen here, the ends of coiled-coil segments may be overpredicted significantly in the absence of strong flanking helix-breaking residues. This is a particular problem in bZip proteins, where the coiled coil follows continuously out of the basic-region helix. Note, though, that the basic region also has some coiled-coil-forming potential, as demonstrated by Krylov et al. (EMBO J. 14:5329, 1995).
>Max b-HLH-Zip (Nature 363:38)
ADKRAHHNALERKRRDHIKDSFHSLRDSVPSLQGEKASRAQILDKATEYIQYMRRKNDTH
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhh hhhhhhhhhhhhhhccccccc
000000000000000000000000000000000000000000111112288889999999
000000000000000000000000000000000000000000000001199999999999
000000000000000000000000000000000000000000000011155556888999
000000000000000000000000000000000000000000111113388889999999
QQDIDDLKRQNALLEQQVRALEKARSSAQLQT
ccccccccccccccccccccc
99999999999999999999999999999884
99999999999999999999999999999996
99999999999999999999999999988771
99999999999999999999999999999992
>Gal4 (Nature 356:408)
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEF
hhhhhhhh hhhhhhhh ccccccccccccccc
000000000000000000000000000000000000000000000000014888888888888882
000000000000000000000000000000000000000000000000017999999999999992
000000000000000000000000000000000000000000000000006888888888888884
000000000000000000000000000000000000000000000000008999999999999995
COILS works well for parallel two-stranded structures (independently of the scoring method used) if they are solvent-exposed. The parallel two-stranded coiled coil buried in CAP is entirely invisible to this program because of the absence of a heptad repeat.
(C2) antiparallel, two-stranded structures
>Seryl-tRNA synthetase - Escherichia coli (Nature 347:249)
MLDPNLLRNEPDAVAEKLARRGFKLDVDKLGALEERRKVLQVKTENLQAERNSRSKSIGQ
cccccccccccccccccccccccccccccccchh
000000000000000000000000003888888888888888888888888882100000
000000000000000000000000003999999999999999999999999993000000
000000000000000000000000003777777777777777777777773330000000
000000000000000000000000004889999999999999999999998880000000
AKARGEDIEPLRLEVNKLGEELDAAKAELDALQAEIRDIALTIPNLPADEVPVG......
hhhh cccccccccccccccccccccccccccccccccc
000000000099999999999999999999999999999999900000000000
000007788899999999999999999999999999999999988800000000
000000000099999999999999999999999999999999955500000000
000089999999999999999999999999999999999999999933100000
>Seryl-tRNA synthetase - Thermus thermophilus (JMB 234:222)
MVDRKRLRQEPEVFHRAIREKGVALDLEALLALDREVQELKKRLQEVQTERNQVAKRVPK
ccccccccccccccccccccccccccccccc
000000000000000000011124599999999999999999999999999999999910
000000000000000000000013499999999999999999999999999999999986
000000000000000000022236699999999999999999999999999998887700
000000000000000000000014599999999999999999999999999999999954
APPEEKEALIARGKALGEEAKRLEEALREKEARLEALLLQVPLPPWPGAPVG........
ccccccccccccccccccccccccccccccccccccc
0008888888888999999999999999999999999999920000000000
4009999999999999999999999999999999999999997000000000
0002224444444999999999999999999999999999932000000000
1005556677777999999999999999999999999999999000000000
>GreA transcript cleavage factor (Nature 373:636)
MQAIPMTLRGAEKLREELDFLKSVRRPEIIAAIAEAREHGDLKENAEYHAAREQQGFCEGRIKDIEAKLSNAQVID
sscccccccccccccccc-ccccccccccccc cccccccccccccccccccccccccc ss
0000011366666666666666664200000000000000000000000000000002999999999999998730
0000011388888888888888888500000001111111111111100000000004999999999999997710
0000022688888888888888885300000000000000000000000000000000777777777777776630
0000033899999999999999999800000000000000000000000000000000777777777777776620
In its structural organization, GreA resembles seryl-tRNA synthetase. It is currently the only known coiled-coil structure with a true skip residue (Val34). The high scores in the two coiled-coil helices correspond to the segment of coiled coil that is located between the skip and the globular part of the protein.
>Replication terminator protein (Cell 80:651)
MKEEKRSSTGFLVKQRAFLKLYMITMTEQERLYGLKLLEVLRSEFKEIGFKPNHTEVYRSL
hhhhhhhhhhhhhhhh ssss hhhhhhhhhhh hhhhhhhh
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
HELLDDGILKQIKVKKEGAKLQEVVLYQFKDYEAAKLYKKQLKVELDRCKKLIEKALSDNF
hhhhh sssssss sssssss hhhhhhhhhhccccccccccccccccccccc
0000000000000000000000000001111133666666666666666666666655540
0000000000000000000000000000033333444488888888888888888888880
0000000000000000000000000002222233555555555555555555555533320
0000000000000000000000000001133344555588888888888888888888882
COILS is also generally reliable in the analysis of antiparallel two-stranded coiled coils, but does not detect the DNA-binding coiled coil in serum response factor (Nature 376:490), which, because of its special function, has a very distinct residue distribution.
(C3) parallel, three-stranded structures
>hemagglutinin (Nature 333:426 and 371:37)
GLFGAIAGFIENGWEGMIDGWYGFRHQNSEGTGQAADLKSTQAAIDQINGKLNRVIEKTN
hhhhhhhhhhhhhhhhhh pH7
ccccccccccccccccccccc pH4
000000000000000000000000000000001223466666666666666666666658
000000000000000000000000000000000222455555555555667888888889
000000000000000000000000000000000122344444444444444444444402
000000000000000000000000000000000111222222222222222222222211
EKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFE
ccccccccccccccccccccccccccccccccccccccccccccc pH7
ccccccccccccccccccccccccccccccccccccccccccccc hhhhhhhh pH4
999999999999999999999999999988800000000000000000000144444444
999999999999999999999999999766611111110000000000000288888888
333377777788888888888888888888800000000000000000000000000000
333355555555555555555555555555533333331000000000000033333333
KTRRQLRENAEEMGNGCFKIYHKCDNACIESIRNGTYDHDVYRDEALNNRFQIKG
cccccc pH7
hhhhhhhhh pH4
4444444444444220000000000000000000000000000000000000000
8888888888888440000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000
3333333333333110000000000000000000000000000000000000000
Influenza hemagglutinin is a complex structure that undergoes a large structural transition between pH7 and pH4. Several lines of evidence indicate that the structure at pH7 is only metastable.
>Mannose-binding protein A, rat (Structure 2:1227)
AIEVKLANMEAEINTLKSKLELTNKLHAFSMGKKSGKKFFVTNHERMPFSKVKALCSELRGTVAIPRNAEENKAI
cccccccccccccccccccccccccccccc sssssssss hhhhhhhhhh ss hhhhhhh
999999999999999999999999997731000000000000000000000000000000000000000000000
999999999999999999999999998830000000000000000000000000000000000000000000000
999999999999999999999999993320000000000000000000000000000000000000000000000
999999999999999999999999995520000000000000000000000000000000000000000000000
QEVAKTSAFLGITDEVTEGQFMYVTGGRLTYSNWKKDEPNDHGSGEDCVTIVDNGLWNDISCQASHTAVCEFPA
hhhh ssssss ss sssss ssss sssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
>Mannose-binding protein C, human (Nature Struct. Biol. 1:789)
AASERKALQTEMARIKKWLTFSLGKQVGNKFFLTNGEIMTFEKVKALCVKFQASVATPRNAAENGAI
cccccccccccccccccccc sss ssssssssssshhhhhhhhhh ss hhhhhhh
2246666666666666600000000000000000000000000000000000000000000000000
5579999999999999900000000000000000000000000000000000000000000000000
2222222222222222200000000000000000000000000000000000000000000000000
5555555555555555500000000000000000000000000000000000000000000000000
QNLIKEEAFLGITDEKTEGQFVDLTGNRLTYTNWNEGEPNNAGSDEDCVLLLKNGQWNDVPCSTSHLAVCEFPI
hhh ssssss ss ssss ssss sssssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
(C4) antiparallel, three-stranded structures
>coil-Ser (Science 259:1288)
EWEALEKKLAALESKLQALEKKLEALEHG
ccccccccccccccccccccccccccccc
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999
This is an unusual homotrimeric structure that was produced incidentally to the design of a two-stranded coiled coil.
>spectrin (Science 262:2027)
NLDLQLYMRDCELAESWMSAREAFLNADDDANAGGNVEALIKKHEDFDKAINGHEQKIAA
cccccccccccccccccccccccccccc cccccccccccccccccccccccc
000000000000000000000000000000111114466667777777777777777777
000000000000000000000000000000000003355558888888888888888888
000000000000000000000000000000000000011111111117777777777777
000000000000000000000000000000000000011113333339999999999999
LQTVADQLIAQNHYASNLVDEKRKQVLERWRHLKEGLIEKRSRLGD
cccccccccc ccccccccccccccccccccccccccccccc
7777777777742220000000000000000000000000000000
8888888888863110000000022222222222222222222200
7777777777755552211110000000000000000000000000
9999999999977773322220044444444444444444444400
As an antiparallel three-helix bundle, spectrin is already fairly far removed from the reference set of parallel two-stranded structures that is used for scoring. Accordingly, as with four-helix bundles, the program has problems identifying all the helices in the structure. While this does not make the prediction of helix B as a coiled coil incorrect, it makes it rather useless and indeed misleading for model-building. In the long run, scoring matrices that are specific for helical bundles should be the answer, but my experiments with a matrix derived from four-helix bundles (Paliakasis & Kokkinidis, Prot. Eng. 5:739) show that the ones currently available have little predictive power. Even in the absence of such matrices, the prediction can be improved significantly using the auxiliary programs ALIGNED20/80 if homologous sequences are available for a protein. Their application to spectrin is shown in the documentation file ALIGNED.DOC.
One of the specific problems the program has with helix A of spectrin is the presence of Trp and Phe residues in position g of the heptad repeat. These residues are very rare at that position in both two-stranded and three-stranded coiled coils. Such residues can occur or even be important in certain structures even though they are disfavored in most others. It is therefore recommended that a protein with a single peak also be analyzed with all rare residues (W, C, P) replaced by Ala. Emergence of more peaks indicates the presence of a helical bundle. Also, if a protein suspected of forming a helical bundle has a peak that occurs only in a 14 residue scan, one should check whether replacing a single unfavorable residue (e.g. D in position a) with Ala greatly lengthens the predicted helix or significantly raises its score. Such "wrong" residues may actually help to build a model since their presence needs to be accounted for and limits the possibilities.
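The substitution recommended above is easy to script before a sequence is resubmitted to the program. The Python sketch below is a minimal illustration; the function name replace_rare and its parameters are hypothetical, and the test fragment is taken from the first line of the spectrin sequence shown above.

```python
RARE = "WCP"  # residues named in the text as rare


def replace_rare(sequence, rare=RARE, substitute="A"):
    """Return the sequence with every rare residue replaced by Ala
    (or by another single-letter residue given as substitute)."""
    table = str.maketrans(rare, substitute * len(rare))
    return sequence.translate(table)


fragment = "CELAESWMSAREAF"
print(fragment)
print(replace_rare(fragment))
```

The modified sequence is then scored in the usual way and its peaks are compared with those of the native sequence. The same function can be called with a different rare set to test the effect of a single unfavorable residue.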
(C5) other antiparallel helical bundles
>ApoE (Science 252:1817)
GQRWELALGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQL
ccccccccccccccccccc hhhhhhhhhhcccccccccccccccccccccccccccc
000000000000000000000000000000013379999999999999999999999999
000000000000000000000000000000026699999999999999999999999999
000000000000000000000000000000001129999999999999999999999999
000000000000000000000000000001689999999999999999999999999999
TPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLR
cccccccccccccccccccccccccccccccccccc ccccccccccccc
818999999999999999999999533331111000000000000111111114478999
889999999999999999999999733330000000000000000444444445589999
959999999999999999999999444441111111111111000333333336689999
999999999999999999999999433331111111111110011888888888899999
KLRKRLLRDADDLQKRLAVYQAGA
cccccccccccccccccccccc
999999999999999999988877
999999999999999999999855
999999999999999999999999
999999999999999999999999
The prediction for ApoE is good for the three-stranded part but much poorer for the four-stranded part: the short N-terminal helix 1 is not seen by the program, partly because of its length but mostly because of the three Trp residues, and the C-terminus of helix 3 and the N-terminus of helix 4 which interact with helix 1 also obtain low scores. This brings me to:
D. Limits of the method
As can be seen from the examples given, the program works well for parallel two-stranded structures that are solvent-exposed but runs progressively into problems with the addition of more helices, their antiparallel orientation and their decreasing length. The program fails entirely on buried structures. Limits are also set by the statistical noise which greatly decreases the usefulness of small scanning windows. Finally, the possibility that sequences with good coiled-coil potential do not form a coiled coil because of constraints from other parts of the sequence may add a further limit to the accuracy of the program.
Because many reasons can lead the program to miss a helix while the conditions for detection are quite stringent, the absence of a peak is not nearly as conclusive as the presence of a peak. The effects of this on interpreting scores from multiple alignments are discussed in ALIGNED.DOC. What I believe one can conclude safely from the absence of a peak is that no solvent-exposed two- or three-stranded coiled coil of length greater than approximately 20 residues is present in the protein.