Sequence Alignment and Analysis

Sequence alignment is crucial in any analysis of evolutionary relationships and in extracting functional and even tertiary-structure information from a protein's amino acid sequence. Since evolutionary relationships imply that a certain number of the amino acid residues in a protein sequence are conserved, the simplest way to assess the relationship between two sequences is to count the numbers of identical and similar amino acids. This is done by sequence alignment. The number of identical and similar residues may then be compared to the total number of amino acids in the protein, which gives the percentage of sequence identity and the percentage of sequence similarity.

Similar residues are those with similar chemical characteristics, such as the positively charged Lys and Arg, or the hydrophobic Leu and Val. Substituting an amino acid with a chemically equivalent one often has no dramatic effect on the structure or function of the protein. For example, Leu and Val will be equally tolerated within a hydrophobic core, provided there is room for the slightly larger side chain of leucine. The same applies to Lys and Arg, which are usually located on the surface and primarily interact with solvent or with the acidic side chains of Glu or Asp. On the other hand, substituting Val with Arg may have a dramatic effect: it may destabilize or even denature the protein.
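To make the two percentages concrete, here is a minimal Python sketch that counts identical and similar positions in two already aligned sequences. The similarity groups, the function names and the example sequences are assumptions made for illustration only; real programs derive similarity from a substitution matrix. The sketch also assumes that the percentages are taken relative to the alignment length and that the similarity percentage includes the identical positions.

```python
# Illustrative similarity groups (an assumption, not a standard definition).
SIMILAR_GROUPS = [
    set("KR"),   # positively charged: Lys, Arg
    set("DE"),   # acidic: Asp, Glu
    set("LVI"),  # hydrophobic: Leu, Val, Ile
    set("ST"),   # hydroxyl-containing: Ser, Thr
]


def are_similar(a, b):
    """True if both residues belong to the same illustrative group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned1, aligned2, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Gapped positions count as neither identical nor similar.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned1, aligned2):
        if gap in (a, b):
            continue
        if a == b:
            identical += 1
        elif are_similar(a, b):
            similar += 1
    total = len(aligned1)
    return 100.0 * identical / total, 100.0 * (identical + similar) / total


# Made-up example: 3 identical and 3 similar positions out of 6.
print(identity_and_similarity("MKLVSE", "MRLISD"))  # (50.0, 100.0)
```

Other conventions for the denominator exist, for example the length of the shorter sequence, so reported percentages may differ slightly between programs.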

To count the number of identities and similarities in a sequence alignment, we need to establish some rules describing how the alignment is performed. Clearly, we want to align as many identical or similar amino acid residues against each other as possible. Nevertheless, one should be aware that an alignment generated by a computer program represents only one of many possibilities. One reason is that while identical amino acids are easy to recognize and align, aligning similar amino acids is not that straightforward. For example, how should we score and prioritize the substitutions Val-Leu, Leu-Ile, Ser-Thr or Lys-Arg? The score (or weight) we give to each of these substitutions may affect the entire alignment.
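The effect of the weights can be seen in a small Python sketch. It slides a short query along a longer sequence without gaps and reports the best-scoring position under a given set of weights. The sequences, the weights and the function names are all invented for this illustration and are not taken from any real scoring scheme.

```python
# Pairs treated as "similar" here; the four pairs mentioned in the text.
SIMILAR_PAIRS = {frozenset("VL"), frozenset("LI"), frozenset("ST"), frozenset("KR")}


def pair_score(a, b, identical, similar, other):
    """Score one aligned pair of residues with the given weights."""
    if a == b:
        return identical
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return similar
    return other


def best_offset(query, subject, **weights):
    """Try every ungapped placement of query on subject; return (offset, score)."""
    scores = {}
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        scores[offset] = sum(pair_score(a, b, **weights) for a, b in zip(query, window))
    best = max(scores, key=scores.get)
    return best, scores[best]


# Similar residues weighted like mismatches: the single identity (V-V) wins.
print(best_offset("VKS", "VAALRT", identical=2, similar=-1, other=-1))  # (0, 0)
# Similar residues rewarded: the placement V-L, K-R, S-T wins instead.
print(best_offset("VKS", "VAALRT", identical=2, similar=1, other=-1))   # (3, 3)
```

The same two sequences are aligned at different positions depending only on the weight given to similar residues.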

Additional factors to take into account when analyzing sequences are insertions and deletions. When comparing the sequences of members of a protein family, it is quite common to find that at some positions some of the sequences have extra residues (insertions) or missing residues (deletions). For example, when a group of bacterial sequences is compared to a group of eukaryotic sequences there will often be some regions of insertions and deletions. Sometimes even larger segments or a whole domain may be inserted into or deleted from a protein. Depending on how we handle these insertions and deletions, different sequence alignments may be generated. To illustrate the concept, an example of a simple alignment of a short stretch of two sequences is shown below. It was extracted from a ClustalW-generated sequence alignment produced on the EBI (European Bioinformatics Institute) server:

[Image: ClustalW alignment of a short stretch of two sequences]

The amino acid residues that are identical in the two sequences are marked in the third row by their one-letter codes (GCP and P), while the positions of those that differ are marked by x. One of the residues (a cysteine) in the second sequence does not seem to have a corresponding mate in the first sequence. This position is marked by a dash. The percentage of identity for this sequence alignment is simply 4/12, or about 33%. The score of the alignment can then be assessed, for example, by a simple expression:

(Score) S = number of matches - number of mismatches = 4 - (12 - 4) = 4 - 8 = -4
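The same expression can be written as a short Python sketch. The two aligned strings below are made up for illustration and are not the sequences from the alignment above; the function name is hypothetical, and the sketch assumes that a gapped position is counted as a mismatch.

```python
def simple_score(aligned1, aligned2, gap="-"):
    """Return (matches, mismatches, score) with score = matches - mismatches.

    Every position that is not an identical pair, including a gapped
    position, is counted as a mismatch (an assumption of this sketch).
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b and a != gap for a, b in zip(aligned1, aligned2))
    mismatches = len(aligned1) - matches
    return matches, mismatches, matches - mismatches


# Made-up alignment: 4 matches, 1 mismatch (A/S) and 1 gapped position.
print(simple_score("GCPA-P", "GCPSCP"))  # (4, 2, 2)
```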

Everything looks nice, except that to maximize the number of matches we introduced a gap (marked by a dash in the first sequence). A gap in one of the sequences simply means that one or more amino acid residues have been deleted from that sequence; equivalently, we could say that there is an insertion in the other sequence. Introducing a gap raises several questions: How many gaps can we introduce? How do we decide where to place them? How long can they be? Clearly, by introducing a large number of gaps here and there we could keep increasing the identity, but would that be biologically relevant? Intuitively, something must be wrong with this approach, and answering these questions correctly is crucial for a correct sequence alignment. For example, in homology modeling the correct placement of gaps is one of the key factors that ensure the correctness of the model; a badly placed gap may result in a totally meaningless model. Normally, when we run sequence alignment software, we will notice that the number of gaps is limited. Evidently, the program has some instructions on how to limit the number of gaps and where to place them. What are these instructions?
They are gap penalties. Each time the program introduces a gap, it incurs a penalty that reduces the total score of the alignment. Introducing a gap is therefore only worthwhile if the matches it makes possible raise the total score by more than the penalty costs. In this simple way we can limit the number of gaps and increase their significance. The gap penalty is a parameter that can be changed each time an alignment is run; changing it affects the number of gaps, their length and their position in the sequence alignment.
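The trade-off between gaps and matches can be sketched with a textbook dynamic-programming global alignment (in the style of Needleman-Wunsch) using a fixed penalty per gap position. This is only an illustration of the principle, not a description of how ClustalW itself works; the scores of +1 for a match and -1 for a mismatch, the function name and the two short sequences are assumptions chosen for the example.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-2):
    """Globally align s1 and s2; return (aligned1, aligned2, score).

    gap is the (negative) score added for every gapped position.
    """
    n, m = len(s1), len(s2)

    def sub(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # score[i][j] = best score for aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in s2
                score[i][j - 1] + gap,            # gap in s1
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), score[n][m]


# Mild penalty: two gaps are accepted because they create three matches.
print(global_align("AGCP", "GCPA", gap=-1))  # ('AGCP-', '-GCPA', 1)
# Severe penalty: the gaps cost more than the matches gain, so none are used.
print(global_align("AGCP", "GCPA", gap=-5))  # ('AGCP', 'GCPA', -4)
```

With the mild penalty the three matches (+3) outweigh the two gaps (-2); with the severe penalty the same gaps would cost -10, so the ungapped alignment with four mismatches scores higher.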
On the next page we will discuss the substitution matrices in more detail.