The .doc format is much clearer.

BLOSUM62 Substitution Matrix
This table depicts the substitutions observed in a broad sampling of aligned segments of polypeptides. The precise method of calculation gets a bit abstruse, but an illustration may demystify some aspects. A BLOSUM matrix is calculated with a likelihood method that estimates the occurrence of each possible pairwise substitution. A very simple calculation for a very short segment is given below to illustrate the process. The polypeptides were initially aligned using an identity scoring matrix. Only aligned blocks are used to calculate the BLOSUMs.

Assume that the following 5 proteins were aligned as follows:

AVAAA
AVAAA
AVAAA
AVLAA
VVAAL

First, we have to decide whether each sequence should count equally in this process. If this database were a globally representative sample, they should. However, today's databases typically overrepresent certain classes of proteins. Therefore, the first step is to reduce this overrepresentation and make the dataset more representative. One method of doing this is to count all the identical blocks as if they were a single block, reducing the aligned database to:

AVAAA
AVLAA
VVAAL

Next, at each position, we must calculate the observed and expected pairwise substitutions. At position 1, we have:

A
A
V

AA, AV, AV are the observed substitutions. Without specifying the necessary calculations, one can see that the substitution of A for A or A for V is quite likely and that AX (where X is any other amino acid) is not!

Why is BLOSUM62 called BLOSUM62?

Basically, this is because all members of a block that shared at least 62% identity with ANY other member of that block were averaged and represented as 1 sequence. How would this work with our example?

AVAAA
AVAAA
AVAAA
AVLAA
VVAAL

Sequences 1-4 are all at least 80% identical to each other. Sequence 5 falls below the 62% threshold (it is 60% identical to sequences 1-3 and 40% identical to sequence 4). This means that the block used to make a BLOSUM62 would be (note the averaging!):
AV[A (3/4), L (1/4)]AA
VVAAL

In the first row, position 3 of the averaged sequence counts as 3/4 A and 1/4 L.

How does the matrix relate to structural similarity?

Most biochemists group the amino acids as follows:

G, A, V, L, I, M    aliphatic (though some would not include G)
S, T, C             hydroxyl, sulfhydryl, polar
N, Q                amide side chains
F, W, Y             aromatic
H, K, R             basic
D, E                acidic

Some rather anomalous substitutions relative to these groupings are highlighted below. For example, it seems VERY surprising to me that a K to E substitution is not unusual. That changes charge! This suggests that what evolution thinks is "similar" is not necessarily similar to the molecular biologist. (Of course, it is also possible that a K to E substitution in one position is often correlated with an E to K substitution elsewhere, and that what one retains is the electrostatic interaction between charged residues, much like GC or CG compensating mutations in the stem-loops of RNA secondary structures.)
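The arithmetic in the five-sequence example above can be checked with a short Python sketch. This is only an illustration of the steps described in the text (percent identity, collapsing identical blocks, pair counts at one position, and grouping at a 62% threshold). The function names are assumptions made for this example, and the sketch stops short of the full BLOSUM calculation, which turns observed and expected pair frequencies into scores.

```python
from collections import Counter
from itertools import combinations

# The five aligned sequences from the worked example.
block = ["AVAAA", "AVAAA", "AVAAA", "AVLAA", "VVAAL"]


def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length sequences match."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def column_pairs(seqs, position):
    """Count unordered residue pairs in one alignment column (0-based position)."""
    column = [s[position] for s in seqs]
    return Counter("".join(sorted(pair)) for pair in combinations(column, 2))


def cluster(seqs, threshold):
    """Group sequences so that each one is at least `threshold` percent
    identical to ANY other member of its group (hypothetical helper)."""
    clusters = []
    for s in seqs:
        # Every existing group that contains a sufficiently similar member.
        hits = [c for c in clusters
                if any(percent_identity(s, m) >= threshold for m in c)]
        merged = [m for c in hits for m in c] + [s]
        clusters = [c for c in clusters if all(c is not h for h in hits)]
        clusters.append(merged)
    return clusters


def column_weights(group, position):
    """Fraction of each residue at one position of a group (the 'averaging')."""
    counts = Counter(s[position] for s in group)
    return {residue: n / len(group) for residue, n in counts.items()}


# Step 1: count identical blocks once, preserving their order.
unique = list(dict.fromkeys(block))

# Step 2: observed pairs at position 1 of the reduced alignment.
pairs_at_position_1 = column_pairs(unique, 0)

# Step 3: group the original five sequences at the 62% threshold.
groups = cluster(block, 62)

# Step 4: average position 3 of the first group.
weights_at_position_3 = column_weights(groups[0], 2)

print(unique)
print(pairs_at_position_1)
print(groups)
print(weights_at_position_3)
```

Sequence 5 ends up in a group of its own because its best match to any other sequence is 60%, just under the threshold, while sequence 4 joins sequences 1-3 at 80%.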
