For an amino acid sequence alignment to be as realistic as possible, we need to understand the processes that control amino acid substitutions, as well as the effects of insertions and deletions. As mentioned in the chapter on protein structures, the three-dimensional structure of a protein is conserved to a much higher degree than its sequence. The reason is simply that the tertiary structure to a large degree determines protein function. Now, imagine a deletion of a couple of residues in, for example, an alpha-helix. What will happen to that helix? There is a good chance that it will collapse, since the deletion may distort the hydrogen bonding network, the packing of side chains, the mutual adjustment of the torsion angles along the helix, etc. This, in turn, may modify the overall 3D structure of the protein, which will affect its function or may even result in denaturation and total loss of function, loss of the protein's ability to interact with its partners in the cell, etc. I think that one of the consequences of the principle of preservation of protein tertiary structure is the observation that insertions and deletions are most often found in the regions between secondary structure elements - the loop regions. Insertions and deletions within secondary structure elements are likely to affect protein structure and function in some undesirable way. At the same time, the core region of a protein normally has a higher degree of sequence conservation and fewer insertions and deletions, which can be recognized by a smaller number of gaps in a sequence alignment.
As discussed on the previous page, insertions and deletions show up as gaps in a computer-generated alignment. If we compute the score of the alignment from the number of aligned identical and similar residues, we also need a way to limit the number of gaps while maximizing that score. Common sense tells us that the number of insertions and deletions must be limited. To control the number of gaps, their size, position, etc., gap penalties are introduced. The score used to judge the correctness of a given alignment is then modified accordingly, and a gap is allowed only if it increases the score despite the negative effect of its penalty:
S = Sum of costs (identities, replacements) - Sum of penalties (number of gaps x gap creation penalty)
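To make the expression concrete, here is a minimal Python sketch that scores an already aligned pair of sequences. The match, similarity and mismatch values, the gap creation penalty of 4 and the groups of similar residues are all assumptions chosen for illustration; they are not taken from any published matrix. Each run of consecutive '-' characters is counted as one gap, which is only one possible reading of "number of gaps".

```python
# Toy scoring values: illustrative assumptions, not entries from PAM,
# BLOSUM or Gonnet.
MATCH_SCORE = 2
SIMILAR_SCORE = 1
MISMATCH_SCORE = -1
GAP_CREATION_PENALTY = 4

# Hypothetical groups of "similar" residues, for illustration only.
SIMILAR_GROUPS = [set("ILVM"), set("DE"), set("KR"), set("ST"), set("FYW")]


def pair_score(a: str, b: str) -> int:
    """Score one aligned residue pair (identity, similarity or mismatch)."""
    if a == b:
        return MATCH_SCORE
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return SIMILAR_SCORE
    return MISMATCH_SCORE


def count_gaps(row: str) -> int:
    """Count gaps in one alignment row; a run of '-' counts as one gap."""
    gaps = 0
    in_gap = False
    for ch in row:
        if ch == "-":
            if not in_gap:
                gaps += 1
            in_gap = True
        else:
            in_gap = False
    return gaps


def alignment_score(row1: str, row2: str) -> int:
    """S = sum of pair scores - number of gaps x gap creation penalty."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    substitution_sum = sum(
        pair_score(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"
    )
    n_gaps = count_gaps(row1) + count_gaps(row2)
    return substitution_sum - n_gaps * GAP_CREATION_PENALTY


# M/M = 2, K/R = 1, V/V = 2, E/E = 2 gives 7; one gap costs 4.
print(alignment_score("MKVLDE", "MRV--E"))  # 7 - 1 * 4 = 3
```

With these toy values the gap is "worth it" only if the residues it brings into register add more than 4 to the substitution sum, which is the trade-off described above.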
The values for identities and amino acid replacements in the expression are pre-calculated numbers, which are usually presented in the form of a 20 x 20 matrix (20 is the number of the most common amino acids). Because the matrix is symmetric, there are in total 210 distinct pairs of amino acids: 190 pairs of different amino acids + 20 pairs of identical amino acids. An example is presented below, the Gonnet matrix:
The first matrix of this type was developed in the 1970s by Margaret Dayhoff and co-workers, who pioneered the field of protein sequence analysis, databases and bioinformatics. Their scoring model was based on the observed frequencies of substitutions of each of the 20 amino acids, derived from alignments of closely related sequences. In the resulting mutation data (or probability) matrix Mij, each element provides an estimate of the probability that the amino acid in column i is mutated to the amino acid in row j after a certain evolutionary time. An evolutionary unit of 100 million years was adopted, resulting in the PAM (percentage accepted mutations / 100 million years) matrix; the acronym is also commonly expanded as "point accepted mutation". 1 PAM corresponds to an average of one amino acid substitution per 100 positions, i.e. in 1% of all positions. However, 100 PAM does not mean that all the amino acids in the sequence have been changed, since the same position can be substituted more than once and many residues will be mutated back to their original type. This is logical to assume, since preservation of structure and function always has higher priority in selection.
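The point that 100 PAM does not mean 100% change can be illustrated with a small Python sketch. It assumes a deliberately simplified 1 PAM matrix in which every amino acid changes with probability 0.01 and is replaced by any of the other 19 with equal probability; the real Dayhoff matrix is not uniform, so the numbers below are only indicative. Longer distances are obtained by multiplying the 1 PAM matrix by itself, and the function names are hypothetical.

```python
N = 20           # number of amino acid types
P_CHANGE = 0.01  # 1 PAM: on average 1% of positions change per step


def uniform_pam1():
    """Toy 1 PAM matrix: uniform replacement (an assumption, not Dayhoff's data)."""
    off_diagonal = P_CHANGE / (N - 1)
    return [
        [1.0 - P_CHANGE if i == j else off_diagonal for j in range(N)]
        for i in range(N)
    ]


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def matrix_power(m, n):
    """Raise the square matrix m to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


def fraction_unchanged(n_pam):
    """Probability that a position shows its original residue after n_pam steps.

    In the uniform toy model all diagonal elements are equal, so [0][0] is enough.
    """
    return matrix_power(uniform_pam1(), n_pam)[0][0]


for n_pam in (1, 100, 250):
    print(n_pam, round(fraction_unchanged(n_pam), 3))
```

In this toy model roughly 38% of positions still show their original residue at 100 PAM and about 12% at 250 PAM, because repeated and reverse substitutions at the same position do not add new differences. Real PAM matrices give residue-specific values, since substitution probabilities differ between amino acids.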
Different versions of the amino acid substitution matrix can be used for different purposes. For example, low PAM matrices (20, 40, 60) may be preferred in database scanning, which usually outputs short alignments of the most closely related sequence segments. The higher the number associated with PAM, the longer the evolutionary distance. Thus, high PAM matrices are suitable for aligning more distant proteins, and if used for database scanning they will find weaker and longer alignments. It has been shown that at 256 PAM 80% of all amino acids will be substituted, although to various degrees: 48% of Trp, 41% of Cys and 20% of His would be unchanged, but only 7% of Ser would remain. In other words, different amino acids have different propensities for change, presumably for both structural and functional reasons. For example, as mentioned earlier, tryptophan has a large side chain, and if it is located within the core of the structure it is not easy to replace it with another amino acid. Such a replacement may leave a cavity inside the structure, which may destabilize the protein as a whole.
Since the Dayhoff matrix was based on the limited set of protein sequences known at that time, no statistical data could be collected for many of the 190 possible substitutions. This was corrected in the more recent PET91 substitution matrix, essentially an updated Dayhoff matrix (Jones et al., 1992). PET91 was constructed from a study that included 2,621 protein families from the SwissProt database (part of the Expasy server). Meanwhile, other types of substitution matrices were developed, based on slightly different principles. One of the most popular is the BLOSUM matrix (BLOcks SUbstitution Matrix; Henikoff S, Henikoff JG, 1992). BLOSUM scores amino acid replacements based on the frequencies of amino acid substitutions in ungapped aligned blocks of sequences with a certain percentage sequence identity. This constitutes a major difference between the PAM and BLOSUM matrices, since PAM matrices are based on mutations observed throughout a global alignment, which includes highly conserved regions as well as low-conservation regions with gapped alignment. The number associated with each matrix (e.g. BLOSUM62, BLOSUM80, etc.) refers to the minimum percentage sequence identity of the sequences grouped within a certain block. Thus, higher numbers correspond to shorter evolutionary distances. In other words, BLOSUM matrices with high numbers are used for closely related sequences, while those with low numbers are used for distantly related proteins, for example in screening databases.
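The idea of deriving scores from substitution frequencies in ungapped blocks can be sketched in a few lines of Python. This is a simplified illustration of the general log-odds scheme, not a reproduction of the published procedure: the block is a made-up toy example, the grouping of sequences at an identity threshold is omitted, and the function names are hypothetical. Scores are expressed as 2 x log2(observed / expected pair frequency), rounded to integers, which is an assumption based on the usual description of BLOSUM rather than on the text above.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(block):
    """Count unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(block):
    """Return rounded log-odds scores for every pair observed in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies:
    # an identical pair contributes fully, a mixed pair half to each residue.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Expected frequency if residues paired up purely by chance.
        if a == b:
            expected = background[a] * background[b]
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# Toy block of three aligned, ungapped sequences (invented for illustration).
toy_block = ["AALLA", "AALLL", "AALLA"]
print(log_odds_scores(toy_block))
# {('A', 'A'): 1, ('L', 'L'): 2, ('A', 'L'): -4}
```

With this toy block the identities A/A and L/L receive positive scores, because they are observed more often than expected by chance, while the rarely observed A/L pair receives a negative score. Pairs that never occur in the block get no score at all, which is one reason real matrices are built from a large collection of blocks.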
A number of substitution matrices have also been developed based on the comparison of three-dimensional structures (structure-based alignment). The 3D structure provides information on the position and length of secondary structure elements as well as loop regions, allowing a more precise positioning of gaps. To generate a structure-based sequence alignment it is possible to use a superposition of the 3D structures of the proteins in question (if structures are available for both) or to use the 3D structure of one member of the protein family to guide and correct placement of gaps in a multiple sequence alignment. Many graphics programs include superposition of 3D structures as an option.
Correct sequence alignment is crucial, e.g. for successful homology modeling. In the next sections we will consider some examples of sequence alignment (Tutorial 1 and Tutorial 2). The results of these tutorials will be used later in the tutorials on homology modeling.