Introduction to Protein Sequence Alignment and Analysis

Amino acid sequence alignment and analysis are central to most biochemical and molecular biology applications. Although it should in principle be possible to retrieve all the information we need about a protein directly from its sequence, looking at a sequence without prior knowledge and experience is like reading a text in a foreign language: we may recognize the letters, but we do not understand the meaning and cannot extract the information. Still, where proteins are concerned, we have learned to extract a substantial part of the information through detailed sequence analysis, for example by multiple sequence alignment. In a multiple sequence alignment, a given sequence is compared to a group of other sequences from related organisms. By "related" we mean "evolutionarily related": the sequences belong to the same protein family, the members of which usually perform a similar function in different organisms. We know that when proteins are evolutionarily related, the main characteristic features of the sequence and the tertiary structure are conserved. Since conservation of function normally requires that a certain number of amino acid residues be conserved within a protein family, we need tools to assess the degree of conservation among the members of the family. For this purpose, alignment techniques and scoring schemes for sequence alignment have been developed. Here we discuss the basic concepts behind these techniques and provide some examples to guide you in making sequence alignments using Internet resources. Since we focus on structural bioinformatics, we will also need to learn how to interpret sequence alignments in terms of the three-dimensional structure of the protein, and to relate sequence and structural information. We can even use available structural data to make better sequence alignments!
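
To make the idea of "degree of conservation" a little more concrete, the following minimal sketch computes one very simple measure for each column of a multiple sequence alignment: the fraction of sequences that share the most common residue in that column. The function name column_conservation and the four short aligned sequences are invented for illustration only; real tools typically use more refined measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per column, the fraction of rows sharing the most common residue.

    `alignment` is a list of equal-length strings; '-' marks a gap.
    Gaps are not counted as residues, but they still count in the denominator.
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned rows must have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # A column consisting only of gaps carries no conservation signal.
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical toy alignment (not real protein sequences).
toy_alignment = [
    "MKVLDE",
    "MKI-DE",
    "MRVLDQ",
    "MKVIDE",
]
conservation = column_conservation(toy_alignment)
for position, value in enumerate(conservation, start=1):
    print(f"column {position}: {value:.2f}")
```

In this toy case the first and fifth columns are fully conserved, while the fourth column, which contains a gap and a substitution, scores lowest.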

In a sequence alignment we try to align identical amino acids in the sequences against each other. However, since there are normally also many amino acid substitutions, we need to know how to handle the substitution of one amino acid by another in the sequences being aligned (amino acid substitutions are caused by mutations in the gene coding for the protein in question). Some substitutions are conservative, i.e., they do not cause any substantial disturbance of the protein structure that would affect the protein function, whereas other substitutions may have a dramatic effect on the structure and function of the protein. To handle amino acid substitutions in sequence alignment, specially designed substitution matrices are used. They are part of the alignment scoring scheme and are used to calculate the score of an alignment, which makes it possible to distinguish between several possible alignments. Structural information may also be used in making a correct alignment, for example in correctly placing insertions and deletions. Insertions and deletions are very common in sequences belonging to the same family and often occur in loop regions. In other words, an insertion or deletion suggests that the corresponding region of the sequence may be a surface loop.
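
As a rough illustration of how a scoring scheme distinguishes between possible alignments, the sketch below scores two candidate alignments of the same pair of sequences. It is only a toy: the residue groups and the numerical values (identity, conservative substitution, non-conservative substitution, gap) are assumptions chosen for readability and are not taken from any published substitution matrix. The sequences and the function names pair_score and alignment_score are likewise hypothetical.

```python
# Toy scoring scheme. All numbers and groupings are illustrative assumptions,
# not values from a published substitution matrix.
GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("ST"),     # small, hydroxyl-containing
    set("NQ"),     # amide
    set("DE"),     # acidic
    set("KRH"),    # basic
]
MATCH, CONSERVATIVE, MISMATCH, GAP = 4, 1, -2, -4


def pair_score(a, b):
    """Score one alignment column containing residues (or gaps) a and b."""
    if a == "-" or b == "-":
        return GAP
    if a == b:
        return MATCH
    if any(a in group and b in group for group in GROUPS):
        return CONSERVATIVE  # substitution within the same residue group
    return MISMATCH


def alignment_score(row1, row2):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row1, row2))


# Two candidate alignments of the hypothetical sequences MKVLDE and MKIDE.
score_a = alignment_score("MKVLDE", "MKI-DE")  # gap placed inside the sequence
score_b = alignment_score("MKVLDE", "-MKIDE")  # gap placed at the start
print("candidate A:", score_a)
print("candidate B:", score_b)
```

Under these assumed values candidate A scores higher than candidate B, because it pairs identical residues and one conservative substitution (V with I) at the cost of a single gap, whereas candidate B shifts the residues out of register.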