Alignment Scoring, Gaps & Similarities

Alignment Score

There are many ways to align two protein sequences against each other. Any alignment produced by alignment software is one of many possible options. These options can be ranked by an alignment score, and the software presents the best alignment, the one with the highest score. The simplest score for assessing how closely two sequences are related is based on the number of identical amino acids that align against each other. This number gives the percentage of identical residues, called the percentage of sequence identity. In general, the higher this percentage, the closer the compared sequences are in terms of their evolutionary origin.
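As a minimal sketch, the percentage of sequence identity can be computed from two already aligned sequences. The function name, the example sequences and the choice of the full alignment length as the denominator are assumptions made for illustration. Alignment programs differ in which denominator they report.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns that hold the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as an identity only if both residues are equal
    # and the column is not a gap.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    # Assumption: the denominator is the number of alignment columns.
    return 100.0 * matches / len(aligned_a)


# Made-up sequences: 8 of the 10 columns are identical.
print(percent_identity("ACDEFGHIKL", "ACDQFGHVKL"))
```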

Even though many of the amino acids in a protein sequence can be invariant, there will always be a number of residue substitutions caused by mutations, and this number grows with the evolutionary distance between the proteins. Many of the replaced residues will be chemically equivalent to the “original” ones. This type of conservation is called similarity, and it reflects the demand for conservation of structure and function. For example, L and V will be equally tolerated within the hydrophobic core of a protein, assuming that there is enough space for the slightly larger side chain of leucine. The same applies, e.g., to a K → R substitution, since both these residues are usually located on the surface and primarily interact with solvent or with the acidic side chains of E or D. On the other hand, a substitution of V by R may have a dramatic negative effect and may destabilize or even denature a protein.

This suggests that in the calculation of the alignment score we need to consider both identities and similarities between the amino acids. But then a question arises: if we assign a score of 1 to each pair of identical residues, what score should we assign to a substitution like K → R, or to V → L compared with V → I or V → A? The software will optimize the score over the possible alignments, so we need to tell it how to count the contribution of each of these and many other similar substitutions.

To illustrate the concept, an example of a simple alignment of a short segment of two sequences is shown below:


The amino acids that are identical (invariant) in the two sequences are shown in the third row by their names (GCP, SP and A), while the positions of those that differ are marked by x. One of the residues (a cysteine) in the second sequence does not seem to have a corresponding mate in the first. This position is marked by a dash. The percentage of identity for this sequence alignment is simply 6/12, which is 50%. The score of the alignment can then be calculated, for example, by a simple expression:

(Score) S = number of matches - number of mismatches = 6 - 6 = 0 (counting the gap position as a mismatch)
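The matches-minus-mismatches expression can be sketched in a few lines. The function name and the example sequences are invented for illustration, and the sketch assumes that a column containing a gap counts as a mismatch.

```python
def simple_score(aligned_a: str, aligned_b: str, gap: str = "-") -> int:
    """Score an alignment as (number of matches) - (number of mismatches)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    # Assumption: every column that is not a match, including a gap column,
    # is counted as a mismatch.
    mismatches = len(aligned_a) - matches
    return matches - mismatches


# Made-up alignment with 7 matching columns, 2 substitutions and 1 gap.
print(simple_score("ACDEF-HIKL", "ACDQFGHVKL"))
```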

One shortcoming of this expression is that it does not consider conservative substitutions. For example, F in the first sequence is replaced by a chemically equivalent Y, and E is replaced by a chemically equivalent D. Clearly, for a more accurate calculation of an alignment score we need a score for each such replacement. This is the subject of substitution matrices, discussed later in the text.
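To illustrate how conservative replacements could enter the score, here is a toy sketch. The score values (2, 1 and -1) and the short list of equivalent pairs are hypothetical. They are taken from the examples in this text, not from any published substitution matrix.

```python
# Hypothetical toy values, chosen only to illustrate the idea.
IDENTITY_SCORE = 2
CONSERVATIVE_SCORE = 1
MISMATCH_SCORE = -1

# Pairs described in the text as chemically equivalent.
CONSERVATIVE_PAIRS = {frozenset(pair) for pair in ("LV", "KR", "FY", "DE")}


def pair_score(a: str, b: str) -> int:
    """Score one aligned pair of residues."""
    if a == b:
        return IDENTITY_SCORE
    if frozenset((a, b)) in CONSERVATIVE_PAIRS:
        return CONSERVATIVE_SCORE
    return MISMATCH_SCORE


def ungapped_similarity_score(seq_a: str, seq_b: str) -> int:
    """Sum the pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


# Made-up segments: G/G identity, F/Y and E/D conservative, V/R mismatch.
print(ungapped_similarity_score("GFEV", "GYDR"))
```

With these toy values a V → R replacement lowers the score, while F → Y and E → D still contribute positively, though less than an identity does.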

Introducing gaps

An additional factor to consider when analyzing sequences is the presence of insertions and deletions. When comparing the sequences of members of a protein family, it is quite common to find that at some positions some of the sequences have one or more extra residues (insertion) or some missing residues (deletion). For example, when a group of bacterial sequences is compared to a group of eukaryotic sequences, there will often be some relatively large segments of insertions and deletions. Sometimes even a whole domain may be inserted into or deleted from a protein. Depending on how we handle these insertions and deletions, different sequence alignments may be generated.

In the small example alignment above, we introduced a gap (marked by a dash in the first sequence) to maximize the number of matches. A gap in one sequence means that one or more amino acid residues have been deleted from that sequence or, equivalently, that there is an insertion in the other sequence. Introducing a gap raises several questions: How many gaps can we introduce? Where should they be placed? How long can they be? By introducing many gaps here and there we could try to improve the score of the alignment, but would the result be biologically relevant? Intuitively, something must be wrong with this approach.

In automatically generated sequence alignments the number of gaps is always limited. The program follows a rule that restricts the number of gaps and their positions in the sequence. That rule is called the "gap penalty". Each time the program introduces a gap, a penalty is subtracted from the total score of the alignment. A gap is therefore introduced only if the additional matches it makes possible outweigh the penalty and substantially increase the total score. This simple rule limits the number of gaps and increases their significance. The gap penalty is a parameter that can be changed each time an alignment is run. By increasing or reducing the penalty, the number of gaps, their length and their position in the sequence alignment can be controlled. In other words, these penalties also need to be included in the expression used for calculating the alignment score.

The expression for calculating the alignment score can be modified to include these considerations:

(Score) S = Σ of scores (identities, replacements) - Σ of penalties (number of gaps x gap penalty)
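The modified expression can be sketched as follows. All names, the example sequences and the numeric values are assumptions made for illustration. In particular, the sketch charges the same penalty for every gap position, whereas many alignment programs use separate penalties for opening and for extending a gap.

```python
from typing import Callable


def default_pair_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 1 if a == b else -1


def alignment_score(
    aligned_a: str,
    aligned_b: str,
    gap_penalty: int = 2,
    pair_score: Callable[[str, str], int] = default_pair_score,
    gap: str = "-",
) -> int:
    """Sum of pair scores minus (number of gap positions x gap penalty)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gap_positions = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gap_positions += 1  # penalized below, not scored as a pair
        else:
            total += pair_score(a, b)
    return total - gap_positions * gap_penalty


# Made-up alignment: 7 identities, 2 replacements and 1 gap position.
print(alignment_score("ACDEF-HIKL", "ACDQFGHVKL", gap_penalty=2))
print(alignment_score("ACDEF-HIKL", "ACDQFGHVKL", gap_penalty=5))
```

Raising the gap penalty lowers the score of this gapped alignment. That is how a higher penalty makes the program less willing to introduce gaps.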

The numbers required for the first term of the equation, where we count the score generated by identities and similarities, are presented in the form of a 20 x 20 matrix called a substitution matrix. This will be discussed below in more detail.