Alignment Scoring, Gaps & Similarities

Alignment Score

There are many ways to align two protein sequences against each other. First, however, we must remember that an alignment generated by software will represent only one of many different possible alignments. The alignment software sorts the generated alignments according to a calculated score, with the output being the one with the highest score. This suggests that the alignment score is essential, and its calculation needs careful consideration. The most straightforward score to assess how closely related two sequences are can be based on the number of identical amino acids that align against each other. Using this number, we can count the percentage of identical residues – called the percentage of sequence identity. The higher this percentage, the closer the compared sequences will be in terms of their evolutionary origin.

Even though many amino acids in a protein sequence can be invariant, depending on the evolutionary distance between the proteins, there will always be a substantial number of residue substitutions caused by mutations. Many replaced residues will be chemically equivalent to the "original" ones. For this reason, this type of conservation is called similarity, and it depends on the demand for the conservation of structure and function. For example, L and V will be equally tolerated within a protein's hydrophobic core, assuming enough space is available for the slightly larger side chain of leucine to be accommodated. The same applies, e.g., to K and R substitution, since both these residues are usually located on the surface and primarily interact with solvent or with the acidic side chains of E or D. On the other hand, substituting V with R may have a dramatic negative effect and destabilize or denature a protein.

The above suggests that we must consider both identities and similarities between the amino acids in calculating the alignment score. However, a question will arise: if we assign a score of 1 to each pair of identical residues, what score should we assign to a substitution like K with R or V with L compared to V with I or V with A? Our software will optimize the score for each possible alignment, and we will need to tell it how to count the contribution for each of the above and many other similar substitutions.

As an example, let us have a look at a simple alignment of a short segment of two sequences:

GCPFS-SPNVEA
GCPYGCSPEADA
GCPxx-SPxxxA

The identical (invariant) amino acids (matches) in the two sequences are highlighted in the third raw (GCP, SP, and A), while the differences (mismatches) are marked by an x. The cysteine residue in the second sequence does not seem to have a corresponding mate in the first. A dash marks this position. The percentage of identity for this sequence alignment is simply 6/12, which is 50%. Then, the score of the alignment can be calculated by a simple expression:

(Score) S= No of matches - length of sequence = 6 - 12 = -6

One shortcoming of this expression is that it does not consider the number of conserved substitutions. So, for example, F in the first sequence is replaced by a chemically equivalent Y, and E is replaced by a chemically equivalent D. This shows that for a more accurate calculation of an alignment score, we need a score for each such replacement.

Introducing gaps

Additional factors to consider when analyzing sequences are insertions and deletions. It is expected that when comparing sequences of members of a protein family, we will find that at some positions in some of the sequences, there will be one or more extra residues (insertion) or some missing residues (deletion). For example, when a group of bacterial sequences is compared to a group of eukaryotic sequences, there will often be some relatively large segments of insertions and deletions. Sometimes, a whole domain may be inserted into or deleted from a protein. Different sequence alignments may be generated depending on how we handle these insertions and deletions.

In the example alignment above, we introduced a gap (marked by a dash in the first sequence) to maximize the number of matches. A gap in one of the sequences means that one or more amino acid residues have been deleted from the sequence, or we could also say that there is an insertion in the second sequence. When introducing a gap, several questions may arise: How many gaps can we introduce? How to decide where to place them? How long can they be? Apparently, we could try to improve the alignment score by introducing many gaps here and there, but would that be biologically relevant? Intuitively one would think that something must be wrong with this approach. The number of gaps is always limited when we look at automatically generated sequence alignments (example in the image below).
Example of multiple sequence alignment
Example of a multiple sequence alignment (MSA) of the sequence of subunit BchI (four first sequences) with a homologous domain of subunit BchD (four lower sequences). Invariant residues are highlighted with dark grey boxes, while light grey highlights conserved regions. Secondary structure elements, as determined in the X-ray crystallographic structure of Rhodobacter Capsulatus BchI are shown on top of the alignment. When examining this alignment, it is clear that some regions have a higher degree of conservation than others. We can also see that a number of residues are invariant in both proteins. There are also several regions of long insertions and deletions (sometimes called indels). Also highlighted by stars under the alignment is a region (residues 52-58 in BachI in the highest row on the right). These constitute the consensus sequence of the so-called P-loop found in ATP/GTP binding proteins: GxRGTGK. The P-loop is involved in ATP/GTP binding.

Apparently, the program has some instructions forcing it to limit the number of gaps and their position in the sequence. That instruction is called the "gap penalty". Each time the program introduces a gap, it triggers a penalty score, which may decrease or increase the total score of the alignment. Gaps are introduced only if they substantially increase the total score of the alignment. By this simple rule, we can limit the number of gaps and increase their significance. The gap penalty is a parameter that can be changed each time an alignment is run. By increasing or reducing the value of the gap penalties, the total number of gaps, their length, and their position in the sequence alignment may be controlled.

The expression for calculating the alignment score can be modified accordingly to include gap penalties:

(score) S= Σ of costs (identities, replacements) - Σ of penalties (number of gaps x gap penalties)

The numbers required for the left-hand side of the equation, where we count the score generated by identities and similarities, are presented as a 20 x 20 matrix. This matrix is called a substitution matrix. The details are discussed on the next page.