Introduction to Protein Sequence Alignment and Analysis

In theory, most of the information we need about a protein should be retrievable directly from its amino acid sequence, and sequence analysis is how we extract it. However, sequence alignment can be challenging and requires prior knowledge and experience. Without them, it is like trying to read a text in a foreign language we do not speak: we may recognize the letters and even some words, but we will not understand the meaning of the text. Since our focus here is on structural biology and the relationships between amino acid sequence, structure, and function, this chapter gives only an overview of the basic concepts and techniques used in sequence alignment and analysis, without going into the theory and algorithms behind them. We use sequence alignment in the exercises (and in our work) primarily to extract structural and functional information from a sequence and to understand the evolutionary relationships between proteins.

The first step in sequence analysis is to place the protein we are interested in within a specific frame: the protein family to which it belongs. Within a family, proteins perform a similar function and share conserved, easily recognizable sequence features and conserved three-dimensional structures. An alignment helps us reveal these characteristic features, find the conserved and variable regions in the sequence, identify functionally essential residues, and extract information on the secondary and tertiary structure. Later in the chapter, we will discuss the techniques used for performing this type of analysis.
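
To make the idea of conserved and variable regions concrete, here is a minimal Python sketch that scores each column of a small alignment by the frequency of its most common residue. The four sequences are invented for illustration and do not correspond to any real protein family, and the function name column_conservation is hypothetical. Real analyses use substitution-aware conservation scores, so treat this only as a toy.

```python
from collections import Counter

# Hypothetical toy alignment: four made-up sequences of equal length.
# '-' marks a gap introduced by the alignment.
toy_msa = [
    "MKTAGS-LV",
    "MKSAGSDLV",
    "MRTAGT-LI",
    "MKTAGSELV",
]


def column_conservation(msa):
    """For each column, return (most common residue, fraction of sequences carrying it)."""
    n_sequences = len(msa)
    profile = []
    for column in zip(*msa):  # iterate over alignment columns
        residues = [r for r in column if r != "-"]  # ignore gaps when counting
        if not residues:
            profile.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        profile.append((residue, count / n_sequences))
    return profile


profile = column_conservation(toy_msa)

# Columns (1-based) where every sequence carries the same residue.
conserved = [i + 1 for i, (_, freq) in enumerate(profile) if freq == 1.0]
print("Fully conserved columns:", conserved)
```

In this toy case the fully conserved columns stand out immediately, while the gapped column scores lowest. In a real family, such invariant positions are the first candidates to check for functional or structural importance.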

To find related sequences, we first need to run a database search. The software compares our sequence (the query sequence) to all other sequences in the database using a so-called local alignment (see image below) and, based on specific criteria, outputs a number of amino acid sequences related to our protein. Local alignment compares short segments of the query sequence to the sequences in the database to find matching amino acid segments. Once we are happy with the search results and have a list of proteins to start the analysis, we switch to a different type of sequence alignment, the so-called global alignment. Here we compare the entire query sequence either to one other sequence (pairwise alignment) or to a group of sequences (multiple sequence alignment, MSA). MSA constitutes the basis of any sequence analysis and provides much more information than the alignment of just two sequences. It is used extensively in secondary and tertiary structure prediction and modeling, in identifying protein families and domains, and in identifying conserved residues essential for function.

An example of a local sequence alignment (covering residues 1 to 66), generated by a UniProt search using the sequence of the magnesium chelatase BchI subunit.
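
The difference between local and global alignment can also be seen in a few lines of code. The sketch below is a simplified illustration, not what database search servers actually run: it computes only the alignment score, using a textbook dynamic-programming scheme (Smith-Waterman style for local, Needleman-Wunsch style for global). The function name alignment_score, the scoring values (+1 match, -1 mismatch, -2 per gap position), and the two short sequences are assumptions chosen for clarity. Real tools use substitution matrices such as BLOSUM and more refined gap penalties.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Score the best local or global alignment of a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a segment may start anywhere, so never go below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Local: best-scoring segment anywhere; global: end-to-end score.
    return best if local else score[-1][-1]


# Hypothetical sequences: a short query embedded in a longer, otherwise unrelated one.
query = "GWYCK"
subject = "TTTGWYCKTTT"

local_score = alignment_score(query, subject, local=True)
global_score = alignment_score(query, subject, local=False)
print("local:", local_score, "global:", global_score)
```

The local score rewards the shared five-residue segment and ignores the flanking regions, which is why local alignment suits database searches. The global score is dragged down by the gaps needed to span the whole subject sequence, which is why global alignment makes sense only once we already have sequences we expect to be related along their full length.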

There are many sequence alignment and analysis tools on the web. I usually use the Expasy and EBI servers. To guide you through the jungle, I will provide some examples showing how I carry out this analysis. It is then up to you to decide whether to follow my examples or use other tools.

Recommended reading:
Geoffrey J. Barton, Protein sequence alignment and database scanning

Geoffrey J. Barton, Protein Sequence Alignment Techniques
Acta Cryst. (1998). D54, 1139-1146