Imagine a deletion of a couple of residues in, for example, an α-helix. What will happen to that helix? There is a good chance that it will change its shape or even collapse, since a deletion or insertion distorts the hydrogen bonding network, the packing of side chains, the mutual adjustment of the torsion angles along the helix, etc. This, in turn, may modify the overall 3D structure of the protein and affect its function, or even result in denaturation and total loss of function, loss of the protein's ability to interact with its partners, etc. For this reason, insertions and deletions are usually found in the regions between secondary structure elements (loop regions), where they can be accommodated more easily without major distortions in the overall fold of the protein. The core of a protein, on the other hand, normally has a higher degree of sequence conservation and fewer insertions and deletions, which is reflected in a smaller number of gaps in the corresponding regions of the sequence alignment.
Generally, we cannot score an alignment based only on the number of aligned identical and similar residues; we also need to take into account the number of gaps in the alignment, as well as their length and position in the sequence. To do this, various types of gap penalties are introduced: a gap opening penalty, a gap extension penalty, etc. Gaps are then allowed only at positions where they increase the total score of the alignment even after the imposed penalties have been taken into account:
S = Σ scores (identities, replacements) − Σ penalties (number of gaps × gap creation penalty)
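The expression above can be illustrated with a minimal sketch that scores two sequences which are already aligned. The function name `alignment_score`, the flat match/mismatch values and the penalty values are assumptions made for illustration only; a real program would take the residue scores from a substitution matrix. The convention assumed here is that a gap of length L costs one opening penalty plus (L − 1) extension penalties.

```python
def alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap_open=5, gap_extend=1):
    """Score two pre-aligned, equal-length strings in which '-' marks a gap.

    All numeric values are illustrative assumptions, not recommended settings.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    score = 0
    in_gap_a = in_gap_b = False  # are we inside a run of gaps in row a / row b?
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue  # a column with gaps in both rows carries no information
        if a == "-":
            score -= gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == "-":
            score -= gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score


# Both alignments pair eight identical residues, but the gaps are placed differently.
one_long_gap = alignment_score("HEAGAWGHEE", "HEA--WGHEE")
two_short_gaps = alignment_score("HEAGAWGHEE", "HE-G-WGHEE")
print(one_long_gap, two_short_gaps)
```

With these assumed values, one gap of length two is penalised less than two separate gaps of length one, which is the usual motivation for separating the opening penalty from the extension penalty.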
The scores for identities and replacements used to calculate the overall alignment score in the expression above are usually presented in the form of a 20 × 20 matrix (20 being the number of the most common amino acids). Counting each pair and its reverse only once, there are in total 210 possible replacement pairs (residues replacing each other): 190 pairs of different amino acids plus 20 pairs of identical ones (an amino acid may be replaced back after several replacement cycles during evolution).
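The pair count can be checked with a few lines of code. This is only a counting sketch; it assumes that a pair and its reverse (for example A/S and S/A) are counted once, as in a symmetric scoring matrix.

```python
from itertools import combinations_with_replacement

# One-letter codes of the 20 common amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Unordered pairs, including a residue paired with itself
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical = [p for p in pairs if p[0] == p[1]]
different = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(identical), len(different))  # 210 20 190
```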
Margaret Dayhoff and co-workers, who pioneered the field of protein sequence analysis, databases and bioinformatics, developed the first matrix of this type in the 1970s. Their scoring model was based on the observed frequencies of substitutions of each of the 20 amino acids, derived from alignments of closely related sequences. In the resulting mutation data (or probability) matrix Mij, each element provides an estimate of the probability that the amino acid in column i is mutated to the amino acid in row j after a certain evolutionary time. An evolutionary unit of 100 million years was adopted, resulting in the PAM (percentage accepted mutations / 100 million years) matrix. 1 PAM corresponds to an average of one amino acid substitution per 100 positions, i.e. in 1% of all positions. However, 100 PAM does not mean that every amino acid in the sequence differs from the original sequence, since the same position may be hit more than once and many residues will be mutated back to their original type. This is logical to assume, since preservation of structure and function always has higher priority in the selection process. For this reason there is a limited number of possible replacements at a certain position of a sequence, and the original amino acid is always one of the possible “choices”.
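Why 100 PAM does not change every residue can be seen in a toy calculation. The sketch below assumes a deliberately unrealistic 1 PAM matrix in which every amino acid has a 1% chance of being replaced and all 19 replacements are equally likely; the real Dayhoff matrix has different values for each pair. Longer distances are obtained in the standard way, by multiplying the 1 PAM matrix by itself. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N = len(AMINO_ACIDS)


def uniform_pam1(change=0.01):
    """Toy 1 PAM matrix (assumption): 1% change, spread evenly over 19 residues."""
    stay = 1.0 - change
    move = change / (N - 1)
    return [[stay if i == j else move for j in range(N)] for i in range(N)]


def matmul(a, b):
    """Plain matrix product of two N x N lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def pam(n):
    """Extrapolate to n PAM by repeated multiplication with the 1 PAM matrix."""
    step = uniform_pam1()
    result = step
    for _ in range(n - 1):
        result = matmul(result, step)
    return result


pam100 = pam(100)
# Diagonal element: probability that a position shows its original residue
print(f"unchanged after 100 PAM (toy model): {pam100[0][0]:.2f}")  # about 0.38
```

Even in this simplified model more than a third of the positions still carry the original residue after 100 PAM, because repeated and reverse substitutions at the same position are counted as separate mutations.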
Image by Byron Holland: https://slideplayer.com/slide/8888934/
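Dayhoff et al. defined the score S(i,j) of two aligned residues i and j as 10 times the logarithm of the probability of observing these two residues aligned (based on the empirical observation of how often they are aligned in nature) divided by the probability of finding these amino acids aligned by chance. Written out, S(i,j) = 10 × log10[q(i,j) / (p(i) × p(j))], where q(i,j) denotes the observed probability of the aligned pair and p(i), p(j) denote the overall frequencies of the two amino acids. A positive score thus means that the pair is observed more often than expected by chance, and a negative score that it is observed less often.
Using this definition, the probability matrix can then be converted into the so-called log-odds scoring matrix.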
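The log-odds definition can be tried out on a few numbers. In the sketch below the function name `log_odds_score` and all frequencies are hypothetical values chosen only to show how the sign of the score arises; they are not taken from the Dayhoff data. The chance probability is assumed to be the simple product of the two background frequencies.

```python
import math


def log_odds_score(pair_freq, freq_i, freq_j):
    """10 * log10(observed pair frequency / frequency expected by chance), rounded."""
    return round(10 * math.log10(pair_freq / (freq_i * freq_j)))


# Hypothetical background frequencies and observed pair frequencies
background = {"W": 0.01, "L": 0.09, "I": 0.05}
observed = {("W", "W"): 0.005, ("L", "I"): 0.009, ("W", "L"): 0.0002}

for (a, b), q in observed.items():
    print(a, b, log_odds_score(q, background[a], background[b]))
# W W 17   (seen 50 times more often than expected by chance)
# L I 3    (seen twice as often as expected)
# W L -7   (seen less often than expected)
```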
Different versions of the amino acid substitution matrix can be used for different purposes. For example, low PAM (20, 40, 60) may be preferred in database scanning, which uses the so-called local alignment algorithms and outputs short alignments of the most closely related sequence segments. The higher the number associated with PAM, the longer the evolutionary distance. Thus, high PAM will be suitable for aligning more distant proteins and, if used for database scanning, will find more distant homologues. It has been shown that at 256 PAM 80% of all amino acids will be substituted, although to various degrees: 48% of Trp, 41% of Cys and 20% of His would be unchanged, but only 7% of Ser will remain. In other words, different amino acids have different propensities for change, presumably for both structural and functional reasons. For example, as mentioned earlier, tryptophan has a large side chain, and if it is located within the core of the structure, it is not easy to replace it with another amino acid. A replacement may leave a cavity inside the structure, which may destabilize the protein as a whole. Cys and His are often involved in specific functions such as proton abstraction (His), metal binding (both), disulfide bridges (Cys), etc., and their replacement will affect the activity of the protein.
The Dayhoff matrix was based on the limited set of protein sequences known at that time, and no statistical data could be collected for many of the 190 possible substitutions. This was corrected in the more recent PET91 substitution matrix, essentially an updated Dayhoff matrix (Jones et al., 1992). PET91 was constructed from a study that included 2,621 protein families from the SwissProt database (now UniProt, part of the Expasy server). Meanwhile, other types of substitution matrices were developed, based on slightly different principles. One of the most popular is the BLOSUM matrix (BLOcks SUbstitution Matrix; Henikoff S, Henikoff JG. 1992). BLOSUM scores amino acid replacements based on the frequencies of amino acid substitutions in ungapped aligned blocks of sequences with a certain percentage sequence identity. This constitutes a major difference between PAM and BLOSUM matrices, since PAM matrices are based on mutations observed in a global alignment, which includes highly conserved regions as well as low-conservation regions with gapped alignment. The number associated with each matrix (e.g. BLOSUM62, BLOSUM80, etc.) refers to the minimum percentage sequence identity at which sequences within a block are grouped together. Thus, higher numbers correspond to higher sequence identity and shorter evolutionary distance between the proteins. In other words, BLOSUM matrices with high numbers should be used for closely related sequences, while those with low numbers should be used for distantly related proteins, for example in screening databases.
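The idea of deriving scores from ungapped blocks can be sketched as follows. This is a simplified illustration, not the published BLOSUM procedure: it counts residue pairs column by column in one tiny made-up block, omits the step in which sequences above the identity threshold are grouped, and uses the half-bit scaling (2 × log2) commonly quoted for BLOSUM. The function name and the block are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Half-bit log-odds scores from column pair counts of an ungapped block."""
    # Count every unordered residue pair within each column
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pair frequencies
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # A mixed pair can arise in two orders, hence the factor of 2
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Made-up ungapped block of four aligned segments (assumption for illustration)
block = ["AVLK", "AVLR", "AILK", "SVLK"]
scores = blosum_like_scores(block)
for pair, value in sorted(scores.items()):
    print(pair, value)
```

With a block this small the individual numbers are not meaningful, and pairs that never occur in a column receive no score at all; realistic matrices are derived from a large collection of blocks.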
Based on the comparison (superposition) of three-dimensional structures, a structure-based sequence alignment can be generated and used in the construction of substitution matrices. The 3D structure provides vital information on the position and length of secondary structure elements and loop regions, which allows a more accurate positioning of insertions and deletions in the alignment. As will be discussed later, structure-based alignment is especially important in homology modelling. Many graphics programs include superposition of 3D structures and output a structure-based alignment of the sequences as an option. However, this alignment still needs to be verified, e.g. by comparing it to the structure and to an automatic multiple sequence alignment.