Introduction to Protein Sequence Alignment and Analysis

In principle, most of the information we need about a protein should be retrievable directly from its amino acid sequence. To get at it, we need sequence analysis. However, sequence alignment can be challenging and requires prior knowledge and experience. It can be compared to trying to read a text in a foreign language we do not speak: we may recognize the letters and even some words, but we will not understand the meaning of the text. Since our focus here is on structural biology and the relationships between amino acid sequence, structure, and function, this section gives only an overview of the basic concepts and techniques used in sequence alignment and analysis, without going into the theory and algorithms behind the process. We primarily use sequence alignment to extract structural and functional information from a sequence and to understand the evolutionary relationships between proteins.

Before running sequence alignment and analysis, we need to understand the essential characteristics of the 20 most common amino acids that build up the proteins we find in nature. The amino acids can be roughly classified into three main groups: hydrophobic, polar, and charged. Different amino acids are responsible for different functions in proteins. Some are often found at functional sites; hydrophobic amino acids usually build up the core of protein molecules, while charged residues are mostly distributed on the protein surface, in contact with the solvent. A sequence alignment will quickly reveal the distribution of these residues along the amino acid sequence and will also show their conservation pattern in a protein family.
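To make the three groups concrete, here is a minimal Python sketch that labels each residue of a sequence by its group. The grouping itself is an assumption made for illustration: textbooks place borderline residues such as Gly, Cys, Tyr, and His differently. The function names and the short test peptide are hypothetical.

```python
from collections import Counter

# Illustrative grouping only (an assumption): borderline residues such as
# G, C, Y and H are assigned differently in different textbooks.
HYDROPHOBIC = set("AVLIMFWPGC")
POLAR = set("STNQY")
CHARGED = set("DEKRH")


def residue_class(residue: str) -> str:
    """Return 'h' (hydrophobic), 'p' (polar), 'c' (charged) or '?' (unknown)."""
    r = residue.upper()
    if r in HYDROPHOBIC:
        return "h"
    if r in POLAR:
        return "p"
    if r in CHARGED:
        return "c"
    return "?"


def class_profile(sequence: str) -> str:
    """Translate a one-letter amino acid sequence into a string of class labels."""
    return "".join(residue_class(r) for r in sequence)


def class_counts(sequence: str) -> Counter:
    """Count how many residues of each class the sequence contains."""
    return Counter(class_profile(sequence))


# Hypothetical seven-residue peptide, used only to show the output format.
peptide = "MKTLLDE"
print(class_profile(peptide))  # hcphhcc
print(class_counts(peptide))
```

Printing the profile under each sequence of an alignment is a quick way to see where hydrophobic stretches or charged patches line up.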

Many positions in a protein sequence change during evolution due to mutations in the gene that encodes the protein. However, many of these substitutions are so-called conservative substitutions. This means that a hydrophobic amino acid is usually replaced by another hydrophobic amino acid, and the same applies to charged and polar amino acids. In other words, the chemical nature of the amino acid is conserved. This raises the question of how we account for these substitutions when running and scoring a sequence alignment. Mutation probability matrices (substitution matrices) were introduced to answer this question. These matrices take into account how frequently one amino acid is replaced by another when calculating an alignment score.
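The idea of scoring conservative substitutions more favorably than other changes can be sketched in a few lines of Python. The scores below are made-up values chosen for illustration, not a real substitution matrix; real matrices such as PAM and BLOSUM are derived from observed substitution frequencies and have a separate score for every pair of amino acids. The grouping and the function names are also assumptions.

```python
# Illustrative grouping (an assumption). Note that it treats all charged
# residues alike, although D/E are negatively and K/R positively charged.
GROUPS = {
    "hydrophobic": "AVLIMFWPGC",
    "polar": "STNQY",
    "charged": "DEKRH",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}

# Made-up scores, for illustration only.
IDENTITY = 4       # same amino acid
CONSERVATIVE = 1   # different amino acid, same chemical group
MISMATCH = -2      # different chemical group
GAP = -4           # residue aligned to a gap character '-'


def pair_score(a: str, b: str) -> int:
    """Score one alignment column (raises KeyError for unknown letters)."""
    if a == "-" or b == "-":
        return GAP
    if a == b:
        return IDENTITY
    if CLASS_OF[a] == CLASS_OF[b]:
        return CONSERVATIVE
    return MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two already aligned, equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))


print(alignment_score("LKD", "IKE"))  # 6: two conservative changes, one identity
print(alignment_score("LKD", "DKL"))  # 0: two non-conservative changes, one identity
```

Both pairs differ at two of three positions, yet the first scores higher because its changes keep the chemical nature of the residues.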

The first step in sequence alignment and analysis is placing the protein we are interested in within a specific frame, which is the protein family to which it belongs. Within a family, proteins perform a similar function and have conserved sequence features and conserved three-dimensional structures. An alignment will help us reveal these characteristic features: it shows the conserved and variable regions in the sequence, points to functionally essential residues, and provides information on the secondary and tertiary structure. We will discuss the techniques used for performing sequence alignment and analysis and demonstrate the practical aspects of the process in a tutorial.
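As a rough illustration of how conserved and variable positions can be read from an alignment, the Python sketch below reports, for each column, the most common residue and the fraction of sequences that carry it. The four aligned sequences are invented for this example, and the simple "fraction of the most common residue" measure is an assumption; alignment viewers typically use more refined conservation scores.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (most common symbol, fraction of rows with it) for each column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    n_rows = len(alignment)
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / n_rows))
    return result


def conserved_positions(alignment, threshold=1.0):
    """1-based columns whose most common residue reaches the threshold.

    Columns dominated by the gap character '-' are never reported.
    """
    return [
        position
        for position, (symbol, fraction) in enumerate(
            column_conservation(alignment), start=1
        )
        if symbol != "-" and fraction >= threshold
    ]


# Invented toy alignment; '-' marks a gap.
toy_alignment = [
    "MKTAYD",
    "MRTGYD",
    "MKSAYE",
    "MKT-YD",
]

print(conserved_positions(toy_alignment))        # [1, 5]
print(conserved_positions(toy_alignment, 0.75))  # [1, 2, 3, 5, 6]
```

Columns 1 and 5 are identical in all four sequences, while column 4 is the most variable; in a real family, fully conserved columns are the first candidates for functionally essential residues.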

For an alignment, we first need to find related sequences. To do this, we need to run a database search. The software compares our sequence (the query sequence) to all other sequences in the database using a local alignment tool such as BLAST and, based on specific criteria, outputs the amino acid sequences found to be related to our protein. Local alignment compares small segments of the query sequence to other sequences in the database to find matching amino acid segments. If we are happy with the search results and have a list of proteins to start the analysis with, we can run a so-called global alignment, which aligns the sequences over their entire length. In this case, we can run a pairwise alignment of only two sequences or a multiple sequence alignment (MSA) of a group of sequences. MSA constitutes the basis of any sequence analysis and provides much more information than the alignment of a pair of sequences. It is extensively used in secondary and tertiary structure prediction and modeling, and to identify protein families and domains.
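The difference between local and global alignment can be seen in a small scoring sketch. This is not what BLAST does internally: BLAST is a fast heuristic, while the code below uses the classic dynamic programming approaches (Needleman-Wunsch for global, Smith-Waterman for local). The match, mismatch, and gap values and the two test sequences are assumptions chosen to keep the example readable; a real run would use a substitution matrix.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score of sequences a and b.

    local=False: global score (Needleman-Wunsch), both sequences are
                 aligned end to end and every gap is penalized.
    local=True:  local score (Smith-Waterman), the best-scoring pair of
                 segments; scores are never allowed to drop below zero.
    """
    # First row of the table: empty prefix of `a` against prefixes of `b`.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# Two invented sequences that share only the segment WKLD.
seq_a = "GGGGWKLD"
seq_b = "WKLDTTTT"

print(alignment_score(seq_a, seq_b, local=True))   # 8: four matches in WKLD
print(alignment_score(seq_a, seq_b, local=False))  # lower: the unrelated ends are penalized
```

The local score picks out the shared segment and ignores the rest, which is what we want in a database search; the global score only makes sense once we already know the sequences are related along their whole length.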

The image below shows an example of a BLAST-generated local alignment (66 residues) made on the UniProt server.

An example of a local sequence alignment (residues 1 to 66), generated by a UniProt search using the sequence of the magnesium chelatase BchI subunit.

There are many sequence alignment and analysis tools on the web. I usually use the Expasy and EBI servers. To guide you through the jungle, I will provide some examples to show how I do this analysis. Then it is up to you to decide whether you want to follow my examples or use other tools.

Recommended reading:
Geoffrey J. Barton, Protein sequence alignment and database scanning

Geoffrey J. Barton, Protein Sequence Alignment Techniques
Acta Cryst. (1998). D54, 1139–1146