Imagine a deletion of a couple of residues, for example, in an α-helix. What will happen to that helix? There is a good chance that it will change its shape or even collapse, since a deletion or insertion introduces a distortion in the hydrogen bonding network, in the packing of side chains, in the mutual adjustment of the torsion angles along the helix, etc. This, in turn, may modify the overall 3D structure of the protein, affecting its function or possibly resulting in denaturation and total loss of function, loss of the protein’s ability to interact with its partners, etc. For this reason, insertions and deletions are usually found in the regions between secondary structure elements (loop regions), where they can be accommodated more easily without major distortions in the overall fold of the protein. The core of a protein, on the other hand, normally has a higher degree of sequence conservation and fewer insertions and deletions, which is reflected by a smaller number of gaps in such regions of the sequence alignment.
Generally, we cannot score an alignment based only on the number of aligned identical and similar residues; we also need to take into account the number of gaps in the alignment, as well as their length and position in the sequence. To do this, various types of gap penalties are introduced, such as a gap opening penalty and a gap extension penalty. Gaps are then allowed only at positions where they increase the total score of the alignment even after the imposed penalties are taken into account:
S = Σ costs (identities, replacements) − Σ penalties (number of gaps × gap creation penalty)
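The expression above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring scheme of any particular program: the function names, the toy match/mismatch scores and the penalty values are all assumptions made for the example. It uses one common affine convention, in which a gap of length L costs one opening penalty plus (L − 1) extension penalties.

```python
def toy_subst(x, y):
    """Hypothetical stand-in for a real 20 x 20 substitution matrix."""
    return 2 if x == y else -1


def score_alignment(a, b, subst, gap_open=10, gap_extend=1):
    """Score two already-aligned sequences ('-' marks a gap position)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence the currently open gap is in, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap is cheaper than opening a new one.
            score -= gap_extend if gap_side == side else gap_open
            gap_side = side
        else:
            score += subst(x, y)
            gap_side = None
    return score


# Seven identities (+14), one gap of length 2 (-4 - 1), one of length 1 (-4).
print(score_alignment("HEAGAWGHEE", "HEA--WGH-E", toy_subst,
                      gap_open=4, gap_extend=1))  # prints 5
```

With these assumed values the two gaps cost 9 points in total, so they are only worth introducing if the residues they bring into register gain more than that.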
The numbers for identities and replacements used for calculating the overall alignment score in the expression above are usually presented in the form of a 20 × 20 matrix (20 is the number of the most common amino acids). In total there are 210 possible replacement pairs of amino acids (residues replacing each other): 190 pairs of different amino acids plus 20 pairs of identical ones (an amino acid may be replaced back after several replacement cycles during evolution). One example is the so-called Gonnet matrix.
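The pair count quoted above can be checked by simple enumeration. The snippet below only counts unordered pairs of the 20 one-letter amino acid codes; it is not the Gonnet matrix itself and contains no scores.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 common amino acids

# Unordered pairs, because a symmetric matrix scores A->S the same as S->A.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical = [p for p in pairs if p[0] == p[1]]
different = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(different), len(identical))  # prints 210 190 20
```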
Margaret Dayhoff and co-workers, who pioneered the field of protein sequence analysis, databases and bioinformatics, developed the first matrix of this type in the 1970s. Their scoring model was based on observed frequencies of substitutions of each of the 20 amino acids, derived from alignments of closely related sequences. In the resulting mutation data (or probability) matrix Mij, each element provides an estimate of the probability that the amino acid in column i is mutated to the amino acid in row j after a certain evolutionary time. The evolutionary unit adopted was the PAM (point accepted mutation, often read as percent accepted mutations): 1 PAM corresponds to an average of one amino acid substitution per 100 positions, i.e. in 1% of all positions. However, 100 PAM does not mean that every amino acid in the sequence differs from the original, since some positions are substituted more than once and many are mutated back to their original type. This is reasonable to assume, since preservation of structure and function has high priority in the selection process; for this reason there is a limited number of possible replacements at a given position of a sequence, and the original amino acid is always one of the possible “choices”.
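The point that 100 PAM does not mean 100% difference can be illustrated with a toy calculation. The sketch below assumes a deliberately unrealistic model in which all 20 amino acids are equally mutable and every replacement is equally likely; it does not use Dayhoff's observed frequencies. A 1 PAM matrix is built under that assumption and raised to the n-th power, which is how PAM matrices for larger distances are extrapolated from the 1 PAM matrix.

```python
N = 20           # number of amino acid types
P_CHANGE = 0.01  # 1 PAM: 1% of positions substituted (toy uniform model)

# Toy PAM1: probability 0.99 of staying, 0.01 spread evenly over the other 19.
PAM1 = [[1 - P_CHANGE if i == j else P_CHANGE / (N - 1) for j in range(N)]
        for i in range(N)]


def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def fraction_unchanged(n_pam):
    """Average probability that a position shows its original residue."""
    pam_n = mat_pow(PAM1, n_pam)
    return sum(pam_n[i][i] for i in range(N)) / N


for n_pam in (1, 100, 250):
    print(n_pam, round(fraction_unchanged(n_pam), 3))
```

Even in this crude model more than a third of the positions still carry their original residue at 100 PAM, because repeated and back substitutions at the same position do not add new differences. Real matrices differ in detail, since amino acids are not equally mutable.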
Different versions of the amino acid substitution matrix can be used for different purposes. For example, low PAM (20, 40, 60) may be preferred in database scanning, which uses so-called local alignment algorithms and outputs short alignments of the most closely related sequence segments. The higher the number associated with PAM, the longer the evolutionary distance. Thus, high PAM is suitable for aligning more distant proteins and, if used for database scanning, will find more distant homologues. It has been shown that at 256 PAM 80% of all amino acids will be substituted, although to varying degrees: 48% of Trp, 41% of Cys and 20% of His would be unchanged, but only 7% of Ser would remain. In other words, different amino acids have different propensities for change, presumably for both structural and functional reasons. For example, as mentioned earlier, tryptophan has a large side chain, and if it is located within the core of the structure it is not easy to replace with another amino acid: the replacement may leave a cavity inside the structure, which may destabilize the protein as a whole. Cys and His are often involved in specific functions such as proton abstraction (His), metal binding (both) and disulfide bridges (Cys), and their replacement will affect the activity of the protein.
The Dayhoff matrix was based on the limited set of protein sequences known at that time, and no statistical data could be collected for many of the 190 possible substitutions. This was corrected in the more recent PET91 substitution matrix, essentially an updated Dayhoff matrix (Jones et al., 1992). PET91 was constructed from a study that included 2,621 protein families from the SwissProt database (now UniProt, part of the Expasy server). Meanwhile, other types of substitution matrices were developed, based on slightly different principles. One of the most popular is the BLOSUM matrix (BLOcks SUbstitution Matrix; Henikoff S, Henikoff JG, 1992). BLOSUM scores amino acid replacements based on the frequencies of amino acid substitutions in ungapped aligned blocks of sequences with a certain percentage sequence identity. This constitutes a major difference between PAM and BLOSUM matrices, since PAM matrices are based on mutations observed in global alignments, which include highly conserved regions as well as low-conservation regions with gapped alignment. The number associated with each matrix (e.g. BLOSUM62, BLOSUM80) refers to the minimum percentage sequence identity of the sequences grouped within a certain block. Thus, higher numbers correspond to higher sequence identity and shorter evolutionary distance between the proteins. In other words, BLOSUM matrices with high numbers should be used for closely related sequences, while low BLOSUM numbers should be used for distantly related proteins, for example in screening databases.
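How substitution frequencies in aligned blocks become matrix scores can be sketched as a log-odds calculation. The example below follows the general scheme usually described for BLOSUM (observed pair frequency divided by the frequency expected by chance, expressed in half-bit units and rounded). The pair counts are invented for a three-letter alphabet, and the grouping of sequences by percentage identity is left out, so this is an illustration of the principle rather than a reconstruction of any published matrix.

```python
import math

# Hypothetical counts of unordered residue pairs seen in block columns.
pair_counts = {
    ("A", "A"): 60, ("A", "S"): 20, ("S", "S"): 30,
    ("A", "W"): 2, ("S", "W"): 3, ("W", "W"): 10,
}

total = sum(pair_counts.values())
q = {pair: count / total for pair, count in pair_counts.items()}  # observed

# Background frequency of each residue, derived from the pair frequencies.
residues = sorted({r for pair in pair_counts for r in pair})
p = {r: 0.0 for r in residues}
for (x, y), freq in q.items():
    if x == y:
        p[x] += freq
    else:
        p[x] += freq / 2
        p[y] += freq / 2

scores = {}
for (x, y), freq in q.items():
    # Expected frequency of the pair if residues were paired at random.
    expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
    scores[(x, y)] = round(2 * math.log2(freq / expected))

for pair in sorted(scores):
    print(pair, scores[pair])
```

With these made-up counts the rare but strongly conserved residue (W paired with W) receives the highest score, while pairs seen less often than expected by chance receive negative scores.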
A number of substitution matrices have also been developed based on the comparison of three-dimensional structures (structure-based alignment). The 3D structure provides information on the position and length of secondary structure elements as well as loop regions, allowing a more precise positioning of gaps. To generate a structure-based sequence alignment it is possible to use a superposition of the 3D structures of the proteins in question (if structures are available for both) or to use the 3D structure of one member of the protein family to guide and correct placement of gaps in a multiple sequence alignment. Many graphics programs include superposition of 3D structures and structure-based alignment of the sequences as an option.
Correct sequence alignment is crucial for many applications, for example successful homology modeling. In the next sections we will consider some examples of sequence alignment (Tutorial 1 and Tutorial 2). The results of these tutorials will be used later in the tutorials on homology modeling.