Amino Acid Substitutions & Replacement Matrices

Mutation probability matrix

The previous page discussed the need for a scheme to score amino acid replacements in order to obtain an accurate sequence alignment. A scoring scheme based only on identities is not very effective: it misses much helpful information, especially when we need to detect weak homology. When discussing amino acid replacement schemes, we must remember that during evolution a position may revert to its original amino acid type after several rounds of substitution. Such reversion is plausible because selection strongly favors the conservation of structure and function. This type of conservation has also been described as conservation of size and hydrophobicity (Taylor, 1986).

Margaret Dayhoff and co-workers, who pioneered the field of protein sequence analysis, databases, and bioinformatics, developed the first mutation probability matrix in the 1970s. Their model was based on the observed frequencies of substitutions of each of the 20 amino acids, derived from alignments of closely related sequences. With 20 common amino acids, we can count 210 possible replacement pairs, of which 190 pairs consist of two different amino acids and 20 pairs consist of an amino acid paired with itself. In the resulting so-called mutation probability matrix, each element Mij provides an estimate of the probability that the amino acid in column j will be mutated to the amino acid in row i after a particular evolutionary time.
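
The pair count above can be checked with a few lines of Python. This is only an illustrative sketch: the one-letter alphabet string and the variable names are chosen here for the example and do not come from Dayhoff's work.

```python
from itertools import combinations_with_replacement

# One-letter codes of the 20 common amino acids (order is arbitrary here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Unordered pairs, including each amino acid paired with itself.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical = [p for p in pairs if p[0] == p[1]]
different = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(different), len(identical))
```

The counts follow from 20 x 21 / 2 = 210 unordered pairs in total, of which 20 x 19 / 2 = 190 combine two different residues.
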
PAM250 mutation probability matrix
Image by Byron Holland.
An evolutionary unit of 100 million years was adopted, resulting in the PAM (Percentage Accepted Mutations / 100 million years) matrix. 1 PAM corresponds to an average of one amino acid substitution per 100 positions, that is, substitutions in 1% of all positions. However, 100 PAM does not mean that every amino acid in the sequence differs from the original sequence, since, as noted above, many positions will have mutated back to their original type.
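
To see why 100 PAM does not imply 100% difference, a mutation probability matrix for 1 PAM can be multiplied by itself to reach larger distances. The sketch below is a deliberately simplified toy model, not Dayhoff's data: it assumes every amino acid changes with probability 0.01 per PAM unit and that all 19 replacements are equally likely. The names `toy_pam1`, `mat_mul`, and `mat_pow` are hypothetical helpers written for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    while power > 0:
        if power & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        power >>= 1
    return result


N = 20         # number of amino acid types
CHANGE = 0.01  # ASSUMPTION: 1% chance of substitution per PAM unit, same for all residues

# Toy 1-PAM matrix: 0.99 on the diagonal, the remaining 0.01 spread evenly.
toy_pam1 = [[1 - CHANGE if i == j else CHANGE / (N - 1) for j in range(N)]
            for i in range(N)]

# The diagonal element is the probability that a position shows its original residue.
unchanged_100 = mat_pow(toy_pam1, 100)[0][0]
unchanged_250 = mat_pow(toy_pam1, 250)[0][0]
print(f"unchanged at 100 PAM: {unchanged_100:.2f}")
print(f"unchanged at 250 PAM: {unchanged_250:.2f}")
```

Under this uniform toy model, well over a third of the positions still show the original residue at 100 PAM, because repeated and reverse substitutions hit the same sites. Real matrices differ between residues, which is why the conservation figures for individual amino acids vary so widely.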

Further analysis showed that, although 80% of all amino acids will be substituted at 256 PAM, 48% of Trp, 41% of Cys, and 20% of His residues will be conserved. On the other hand, only 7% of Ser residues will remain (discussed in Barton, J, 1996). This conservation pattern presumably results from a combination of the structural and functional constraints mentioned above. For example, tryptophan has a large side chain, and if it is positioned within the core of the structure, its replacement by another amino acid may destabilize the protein. In addition, the other highly conserved residues, cysteine and histidine, are often involved in specific functions such as proton abstraction (His), metal binding (His and Cys), or disulfide bridge formation (Cys).

After constructing the mutation probability matrix, Dayhoff et al. defined the score Sij of two aligned residues i and j as 10 times the base-10 logarithm of a ratio: the probability (Mij) of these two residues being aligned, divided by the probability of these two amino acids being aligned by chance. This type of scoring gives higher numbers to the alignment of a pair of similar amino acids (e.g., Leu and Ile) and lower numbers when the amino acids are very different (e.g., Ile and Asp).
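
The log-odds definition can be sketched in Python as follows. The function name `log_odds_score` and the probabilities passed to it are made up for illustration and are not taken from the Dayhoff matrix. Rounding to the nearest integer is assumed here because published scoring matrices are usually given as integers.

```python
from math import log10


def log_odds_score(p_observed, p_chance):
    """Return 10 * log10(observed / chance), rounded to an integer.

    p_observed: probability of the two residues being aligned (Mij in the text).
    p_chance:   probability of the same two residues being aligned by chance.
    """
    return round(10 * log10(p_observed / p_chance))


# Hypothetical values, chosen only to show the sign of the score.
print(log_odds_score(0.02, 0.01))   # pair seen twice as often as chance: positive
print(log_odds_score(0.01, 0.01))   # pair seen exactly as often as chance: zero
print(log_odds_score(0.001, 0.01))  # pair seen ten times less often: negative
```

A positive score therefore means the pair is aligned more often than expected by chance, and a negative score means it is aligned less often. Because the scores are logarithms, the scores of the individual columns of an alignment can simply be added.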


Using this definition, the probability matrix (shown in the image above) can then be used to derive the so-called log-odds scoring matrix:
PAM 250 scoring matrix

PAM, PET91 & BLOSUM matrices

Different versions of the amino acid substitution matrix have been generated, and they serve different purposes. For example, low PAM matrices (20, 40, 60) are preferred for database scanning when the expected output consists of alignments of short, closely related sequence stretches. The higher the PAM number, the greater the expected evolutionary distance between the proteins. This makes high PAM matrices suitable for aligning more distantly related proteins, and they will find more distant homologs when used for database scanning. Many internet resources for sequence alignment provide the option of choosing the substitution matrix for the alignment.

The original Dayhoff matrix was based on a limited set of known protein sequences, and no statistical data could be collected for many of the 190 possible substitutions. This was corrected later when more data became available. An example is PET91, an updated Dayhoff matrix (Jones et al., 1992) based on 2,621 protein families from the SwissProt database (now UniProt, part of the Expasy server). Meanwhile, other substitution matrices were developed based on slightly different principles. One of the most popular is the BLOSUM matrix (BLOcks SUbstitution Matrix; Henikoff S, Henikoff JG, 1992). BLOSUM scores amino acid replacements based on the frequencies of amino acid substitutions in ungapped aligned blocks of sequences, filtered by percentage identity. Thus, there is a significant difference between PAM and BLOSUM matrices: PAM matrices are based on mutations observed in a global alignment, which includes both highly conserved regions and low-conservation regions with gaps, whereas BLOSUM matrices use only ungapped conserved blocks. The number associated with each matrix (e.g., BLOSUM62, BLOSUM80) is the percentage identity threshold used when building it: sequences within a block that are at least that identical are clustered and counted as a single sequence, so that closely related sequences do not dominate the statistics. Higher numbers therefore correspond to higher sequence identity and shorter evolutionary distances between the proteins. BLOSUM matrices with high numbers should be used for sequences known to be homologous with a good level of sequence identity, while matrices with low BLOSUM numbers should be used for distantly related proteins, for example, when screening databases.
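
The block-based counting behind BLOSUM can be sketched as follows. This is a simplified illustration under stated assumptions, not the Henikoff implementation: the three-column block `toy_block` is invented, the clustering step by percentage identity is left out, and the function name `blosum_like_scores` is hypothetical. Scores are scaled as 2 x log2 (half-bit units), the scaling commonly reported for BLOSUM matrices.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Derive log-odds scores from one ungapped block of aligned sequences."""
    # Count every unordered residue pair within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    # Log-odds score: observed pair frequency versus frequency expected by chance.
    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# ASSUMPTION: a made-up block of four ungapped, aligned sequence segments.
toy_block = ["AIL", "AIL", "AVL", "SVI"]
scores = blosum_like_scores(toy_block)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

A block this small is far too sparse to give meaningful values, and pairs that never occur in it receive no score at all. The real matrices pool counts over many blocks and apply clustering first, which is what the number in the matrix name controls.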

Structure-based alignment

The 3D structure of a protein contains vital information on the position and length of secondary structure elements and loop regions. Since many gaps in sequence alignment occur in regions between secondary structure elements, using information from a 3D structure will provide more accurate positioning of insertions and deletions in the alignment. Many graphics programs include the superposition of 3D structures and provide a structure-based alignment of the sequences as an output. However, this alignment still needs to be verified, e.g., by examining the 3D structure and comparing the generated alignment to an automatic multiple sequence alignment.

Concluding remarks
This brief review explains some of the basic concepts of sequence alignment. Since, as mentioned earlier, the focus here is on the practical application of tools for sequence alignment and analysis, I did not go into the details of the statistical analysis behind the construction of the replacement matrices, the data related to sequence conservation, or the calculation of alignment scores. The practical aspects of sequence alignment will be discussed in a tutorial.