Amino Acid Substitutions & Replacement Matrices

Mutation probability matrix

The previous page discussed the need for a scheme to score amino acid replacements for an accurate sequence alignment. A scoring scheme based only on identities is not very effective: it misses much helpful information, especially when we need to detect weak homology. When discussing amino acid replacement schemes, we must remember that during evolution, a substituted amino acid may revert to its original type after several rounds of mutation. Such reversion is logical, since the conservation of structure and function has higher priority in the selection process. This type of conservation has also been described as conservation of size and hydrophobicity (Taylor, 1986).

Margaret Dayhoff and co-workers, who pioneered the field of protein sequence analysis, databases, and bioinformatics, developed the first mutation probability matrix in the 1970s. Their model was based on observed frequencies of substitutions of each of the 20 amino acids, derived from alignments of closely related sequences. With 20 common amino acids, we can count 210 possible replacement pairs, of which 190 correspond to two different amino acids and 20 to pairs of identical amino acids. In the resulting so-called mutation probability matrix, each element Mij provides an estimate of the probability that the amino acid in column j mutates to the amino acid in row i within a particular evolutionary time.
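The pair count above can be checked with a few lines of Python. This is only a counting sketch; the one-letter alphabet string and the variable names are choices made here for illustration.

```python
from itertools import combinations_with_replacement

# One-letter codes of the 20 common amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Unordered pairs, including a residue paired with itself.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical = [p for p in pairs if p[0] == p[1]]
different = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(different), len(identical))  # 210 190 20
```

The total follows from 20 x 21 / 2 = 210, because the order of the two residues in a pair does not matter.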
PDM250 mutation probability matrix
Image from Byron Holland: https://slideplayer.com/slide/8888934/
An evolutionary unit of 100 million years was adopted, resulting in the PAM (Percentage Accepted Mutations / 100 million years) matrix. 1 PAM corresponds to an average amino acid substitution in 1% of all positions. However, 100 PAM does not mean that every amino acid in the sequence differs from the original, since, as noted above, many positions will have mutated back to their original type.
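To see why 100 PAM does not imply 100% change, we can multiply a 1-PAM matrix by itself repeatedly. The sketch below does not use Dayhoff's data. It assumes a deliberately simplified toy model in which every residue changes with probability 1% per PAM unit and is equally likely to become any of the other 19 types. The function names (toy_pam1, matmul, pam_n) are hypothetical and exist only for this illustration.

```python
N = 20  # number of amino acid types

def toy_pam1(change=0.01):
    """Toy 1-PAM matrix (assumption: uniform substitution rates).
    Entry [i][j] is the probability that residue j becomes residue i."""
    off = change / (N - 1)
    return [[1 - change if i == j else off for j in range(N)] for i in range(N)]

def matmul(a, b):
    """Plain matrix product of two N x N matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def pam_n(n):
    """Extrapolate to n PAM by applying the 1-PAM matrix n times."""
    step = toy_pam1()
    result = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    for _ in range(n):
        result = matmul(result, step)
    return result

for n in (1, 100, 250):
    m = pam_n(n)
    # The diagonal entry is the chance of seeing the original residue,
    # including positions that changed and later mutated back.
    print(f"{n:>3} PAM: {m[0][0]:.1%} of positions show the original residue")
```

Even in this uniform toy model, a little under 40% of positions still carry the original residue at 100 PAM. The real matrices differ because substitution rates are far from uniform, which is why some residues are conserved much more strongly than others.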

Further analysis showed that, although 80% of all amino acids will be substituted at 256 PAM, 48% of Trp, 41% of Cys, and 20% of His residues will be conserved. On the other hand, only 7% of Ser residues will remain (discussed in Barton, J, 1996). This conservation pattern presumably results from a combination of the structural and functional restraints mentioned above. For example, tryptophan has a large side chain, and if it is positioned within the structure's core, its replacement by another amino acid may destabilize the protein. In addition, the other highly conserved residues, cysteine and histidine, are often involved in specific functions such as proton abstraction (His), metal binding (His and Cys), or disulfide bridge formation (Cys).

After the construction of the mutation probability matrix, Dayhoff et al. defined the score Si,j of two aligned residues i, j according to the following equation:

$$S_{i,j} = \log \frac{M_{ij}}{p_i}$$

Here, Mij is the probability of these two residues being aligned as a result of mutation (taken from the matrix above), and pi is the probability of their being aligned by chance, i.e., the background frequency of residue i.
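As a minimal sketch of this log-odds idea, the function below turns a mutation probability and a chance probability into a score. The base-10 logarithm and the scaling factor of 10 are assumptions made here (a common convention for PAM-style tables), and the input numbers are hypothetical, chosen only to show how the sign of the score behaves.

```python
import math

def log_odds_score(m_ij, p_i, scale=10.0):
    """Log-odds score: scale * log10(M_ij / p_i).
    The base and the scale factor are assumptions, not taken from the text."""
    return scale * math.log10(m_ij / p_i)

# Hypothetical probabilities, not taken from any published matrix.
p_i = 0.05
likely = log_odds_score(m_ij=0.20, p_i=p_i)    # seen 4x more often than by chance
unlikely = log_odds_score(m_ij=0.01, p_i=p_i)  # seen 5x less often than by chance

print(round(likely), round(unlikely))  # 6 -7
```

A replacement observed more often than expected by chance gets a positive score, and one observed less often gets a negative score. A ratio of exactly 1 gives a score of 0.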

This type of scoring gives higher numbers to the alignment of a pair of similar amino acids (e.g., Leu and Ile) and lower numbers when the amino acids are different (e.g., Ile and Asp). Using this definition, the probability matrix (shown in the image above) can then be used to derive the so-called log-odds scoring matrix:
PAM 250 scoring matrix
This replacement matrix can be used to calculate the alignment score according to the equation discussed earlier: S = Σ costs (identities, replacements) - Σ penalties (number of gaps × gap penalty)
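The scoring equation can be sketched in a few lines of Python. The score table TOY_SCORES and the gap penalty of 8 are illustrative assumptions, not values from a published matrix, and the function charges the same penalty for every gap position (a simple linear gap cost).

```python
# Illustrative scores for a handful of residue pairs (hypothetical values).
# Keys are alphabetically sorted pairs of one-letter codes.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 5,
    ("I", "L"): 2, ("D", "I"): -3, ("D", "L"): -3,
}

def alignment_score(seq_a, seq_b, scores, gap_penalty):
    """Sum of pair scores minus (number of gap positions x gap penalty)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gaps += 1
        else:
            total += scores[tuple(sorted((a, b)))]
    return total - gaps * gap_penalty

# L-L (4) + I-L (2) + D-D (5) + one gap (-8) + L-L (4) = 7
result = alignment_score("LID-L", "LLDIL", TOY_SCORES, gap_penalty=8)
print(result)  # 7
```

Alignment programs usually distinguish between opening and extending a gap. The single penalty per gap position used here is the simplest possible choice.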

PAM, PET91 & BLOSUM matrices

Different versions of amino acid substitution matrices have been generated, and they serve different purposes. For example, low PAM matrices (20, 40, 60) are preferred in database scanning when short, closely related sequence stretches need to be found. On the other hand, higher PAM matrices are more suitable for aligning longer sequences. Many internet resources for sequence alignment provide the option of choosing the substitution matrix for the alignment.

The original Dayhoff matrix was based on a limited set of known protein sequences, and no statistical data could be collected for many of the 190 possible substitutions. This was corrected later when more data became available. An example is PET91, an updated Dayhoff matrix (Jones et al., 1992) based on 2,621 protein families from the SwissProt database (now UniProt, part of the Expasy server).

Meanwhile, other substitution matrices were developed based on slightly different principles. One of the most popular is the BLOSUM matrix (BLOcks of Amino Acid SUbstitution Matrix; Henikoff S, Henikoff JG, 1992). BLOSUM scores amino acid replacements based on the frequencies of amino acid substitutions in ungapped aligned blocks of sequences with a specific percentage identity. This is a significant difference from the PAM matrices, which are based on mutations observed in global alignments that include both highly conserved regions and low-conservation regions with gaps. The number associated with each matrix (e.g., BLOSUM62, BLOSUM80) refers to the percentage sequence identity used as the threshold when grouping the sequences aligned within a block: for BLOSUM62, sequences that are at least 62% identical are clustered together before substitutions are counted. Higher numbers correspond to higher sequence identity and shorter evolutionary distances between the proteins. High-numbered BLOSUM matrices should therefore be used for sequences known to be homologous with a good level of sequence identity, whereas low-numbered BLOSUM matrices should be used for distantly related proteins, for example, when screening databases.
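The block-based counting behind BLOSUM can be sketched as follows. This is a simplified illustration under stated assumptions: it processes a single toy block, skips the clustering step by percentage identity, and uses a hypothetical function name (blosum_like_scores). Scores are computed in half-bit units as 2 x log2(observed / expected), the scaling commonly used for BLOSUM tables.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Half-bit log-odds scores from one ungapped block (no clustering)."""
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in a block must have the same length")

    # Count unordered residue pairs within every column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Compare observed pair frequency with the frequency expected by chance.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

# Toy block: five copies of "ALA" and one "LLA" (made-up data).
toy_block = ["ALA"] * 5 + ["LLA"]
toy_scores = blosum_like_scores(toy_block)
print(toy_scores)  # {('A', 'A'): 1, ('A', 'L'): -4, ('L', 'L'): 2}
```

In this toy block, identical pairs score positively and the rare A/L mismatch scores negatively. Pairs that never occur in the block are simply absent from the result, and the near-identical sequences used here are exactly the kind of redundancy that the clustering threshold in the real method is meant to reduce.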

Structure-based alignment

The 3D structure of a protein contains vital information on the position and length of secondary structure elements and loop regions. Since many gaps in sequence alignment occur in regions between secondary structure elements, using information from a 3D structure will provide a more accurate positioning of insertions and deletions in the alignment. Many graphics programs include the superposition of 3D structures and provide a structure-based alignment of the sequences as an output. However, this alignment still needs to be verified, e.g., by examining the 3D structure and comparing the generated alignment to an automatic multiple sequence alignment.

Concluding remarks
This brief review explains some of the basic concepts of sequence alignment. Since, as mentioned earlier, we are focused here on the practical application of the tools for sequence alignment and analysis, I did not go into the details of the statistical analysis behind the construction of the replacement matrices, the data related to sequence conservation, or the calculation of alignment scores. In a tutorial, we will discuss the practical aspects of sequence alignment.