Introduction to Protein Sequence Alignment and Analysis

Amino acid sequence alignment and analysis are central to most biochemical and molecular biology applications. Although it should be possible to retrieve all the information we need about a protein directly from its sequence, looking at a sequence without prior knowledge and experience is like reading a text in a foreign language: we may recognize the letters, but we do not understand the meaning and cannot extract the information. Still, where proteins are concerned, we have learned to extract a substantial part of the information from detailed sequence analysis, using for example multiple sequence alignment. In a multiple sequence alignment a given sequence is compared to a group of other sequences from related organisms. By "related" we mean "evolutionarily related": the sequences belong to the same protein family, the members of which usually perform a similar function in different organisms. We know that when proteins are evolutionarily related, the main characteristic features of the sequence and the tertiary structure are conserved. Since conservation of function normally requires that a certain number of amino acid residues within a protein family are conserved, we need tools to assess the degree of conservation of each member of the protein family. For this purpose, alignment techniques and scoring schemes for sequence alignment have been developed. Here we will discuss the basic concepts behind these techniques and provide some examples to guide you in making sequence alignments using Internet resources. Since we focus on structural bioinformatics, we will also need to learn how to interpret sequence alignments in terms of the three-dimensional structure of the protein, and how to relate sequence and structural information. We might even use available structural data to make a better sequence alignment!
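
To make the idea of "degree of conservation" more concrete, the sketch below computes a very simple per-column conservation value for an already aligned set of sequences: the fraction of sequences that carry the most common residue in each column. The function name column_conservation, the toy sequences, and this particular measure are illustrative assumptions, not a method prescribed by the text; established alignment tools use more elaborate measures.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return the fraction of sequences that share
    the most common residue (gap characters '-' are not counted as residues)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            # Column made of gaps only: nothing is conserved here.
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        # Divide by the total number of sequences, so gaps lower the value.
        scores.append(most_common_count / len(alignment))
    return scores


# Hypothetical, made-up aligned sequences (not real proteins).
toy_alignment = [
    "MKV-LA",
    "MKVGLA",
    "MRV-LS",
    "MKI-LA",
]

conservation = column_conservation(toy_alignment)
for position, value in enumerate(conservation, start=1):
    print(f"column {position}: {value:.2f}")
```

In this toy example, columns 1 and 5 are fully conserved, while column 4, where only one sequence has a residue, gets the lowest value.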

In a sequence alignment we try to align identical amino acids in the sequences against each other. However, since there are normally also many amino acid substitutions (caused by mutations in the gene coding for the protein in question), we need to know how to handle the substitution of one amino acid by another in the sequences being aligned. Some substitutions are conservative, i.e., they do not substantially disturb the protein structure and therefore do not affect the protein function; other substitutions, if they occur, may have a dramatic effect on protein structure and function. To handle amino acid substitutions in sequence alignment, specially designed substitution matrices are used. They are part of the alignment scoring scheme and help in calculating the score of an alignment, which makes it possible to distinguish between several possible alignments. Structural information may also be used to make a correct alignment, for example to place insertion and deletion regions correctly. Insertions and deletions are very common in sequences belonging to the same family and often occur in loop regions. In other words, an insertion or deletion may indicate that a certain region of the sequence has a loop structure.
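
The role of a substitution matrix in choosing between possible alignments can be sketched as follows. All numbers here (MATCH, MISMATCH, GAP and the entries of CONSERVATIVE_PAIRS) are made-up values chosen only for illustration and are not taken from any published matrix; the function names and the two short sequences are likewise hypothetical. The sketch assumes a simple scheme in which every gap position costs the same fixed penalty.

```python
# Hypothetical scoring values, for illustration only.
MATCH = 4        # identical residues
MISMATCH = -2    # non-conservative substitution
GAP = -3         # residue aligned against a gap ('-')

# A few substitutions treated as conservative in this toy scheme.
CONSERVATIVE_PAIRS = {
    ("L", "I"): 2,
    ("K", "R"): 2,
    ("D", "E"): 2,
}


def pair_score(a, b):
    """Score one aligned residue pair using the toy substitution values."""
    if a == b:
        return MATCH
    # The table is symmetric, so look the pair up in both orders.
    return CONSERVATIVE_PAIRS.get((a, b), CONSERVATIVE_PAIRS.get((b, a), MISMATCH))


def alignment_score(seq1, seq2):
    """Sum the column scores of two already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += pair_score(a, b)
    return total


# Two candidate alignments of the same pair of made-up sequences,
# differing only in where the gap is placed.
score_a = alignment_score("MKLD-A", "MRIEGA")
score_b = alignment_score("MKLDA-", "MRIEGA")
print("alignment A:", score_a)
print("alignment B:", score_b)
```

Under these assumed values, the alignment with the higher total score would be preferred; moving the gap by one position changes which residues are paired and therefore changes the score.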