Before running sequence alignment and analysis, we need to understand the essential characteristics of the 20 most common amino acids that make up the proteins we find in nature. The amino acids can be classified into three main groups: hydrophobic, polar, and charged. Different amino acids are responsible for different functions in proteins. Some are often found at functional sites; hydrophobic amino acids usually form the core of protein molecules, while charged residues are distributed on the protein surface, in contact with the solvent. A sequence alignment will quickly reveal the distribution of these residues along the protein amino acid sequence and will also show their conservation pattern in a protein family.
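As a rough illustration of the three groups, the sketch below assigns each one-letter residue code to a class and counts the classes in a sequence. The exact membership is an assumption: textbooks differ on borderline residues such as Gly, Cys, Tyr, and His (histidine, for example, is often counted as charged), and the names `RESIDUE_CLASSES`, `classify`, and `class_composition` are made up for this example.

```python
from collections import Counter

# One simplified grouping of the 20 standard amino acids (one-letter codes).
# ASSUMPTION: borderline residues (G, C, Y, H) are placed differently by
# different authors; adjust the sets to match the scheme you follow.
RESIDUE_CLASSES = {
    "hydrophobic": set("AVLIMFWPGC"),
    "polar": set("STNQYH"),
    "charged": set("DEKR"),
}


def classify(residue):
    """Return the class name for a one-letter residue code, or 'unknown'."""
    residue = residue.upper()
    for name, members in RESIDUE_CLASSES.items():
        if residue in members:
            return name
    return "unknown"


def class_composition(sequence):
    """Count how many residues of each class occur in a sequence."""
    return Counter(classify(residue) for residue in sequence)
```

For instance, `class_composition("MKDS")` counts one hydrophobic, two charged, and one polar residue.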
Many positions in a protein sequence change during evolution because of mutations in the underlying gene. However, many of these substitutions are so-called conservative substitutions (also called conserved substitutions): a hydrophobic amino acid is usually replaced by another hydrophobic amino acid, and the same holds for charged and polar amino acids. In other words, we observe the conservation of the chemical nature of an amino acid. This raises the question of how we account for these substitutions when running and scoring a sequence alignment.
Mutation probability matrices (substitution matrices) were introduced to answer this question. These matrices take into account how frequently one amino acid is replaced by another when calculating an alignment score.
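To make the idea concrete, here is a minimal sketch of how a substitution matrix feeds into an alignment score. The numbers in `TOY_SCORES` are illustrative values chosen for this example and cover only four residues; they are not taken from a published matrix such as PAM or BLOSUM. The function names are likewise hypothetical.

```python
# Toy substitution scores for four residues only.
# ASSUMPTION: illustrative values, not a published matrix.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 4, ("E", "E"): 4,
    ("L", "I"): 2,   # hydrophobic replaced by hydrophobic
    ("D", "E"): 2,   # charged replaced by charged
    ("L", "D"): -3, ("L", "E"): -3,   # hydrophobic replaced by charged
    ("I", "D"): -3, ("I", "E"): -3,
}


def pair_score(a, b):
    """Look up the score for one aligned residue pair (matrix is symmetric)."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def alignment_score(seq_a, seq_b):
    """Sum the pair scores of two equal-length, ungapped aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))
```

With these toy values, `alignment_score("LIDE", "ILED")` gives 8, because every position is a substitution within the same chemical class, while `alignment_score("LIDE", "DELI")` gives -12, because every position swaps a hydrophobic residue for a charged one.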
The first step in sequence alignment and analysis is placing the protein we are interested in within a specific frame, namely the protein family to which it belongs. Within a family, proteins perform a similar function and share conserved sequence features and conserved three-dimensional structures. An alignment will help us reveal all these characteristic features: find the conserved and variable regions in the sequence, show functionally essential residues, and extract information on the secondary and tertiary structure. We will discuss the techniques used for performing sequence alignment and analysis and demonstrate the practical aspects of the process in a tutorial.
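One simple way to spot conserved and variable regions in an existing alignment is to measure, for each column, how many sequences share the most common residue. The sketch below does only that; the function name `column_conservation` and the choice to treat the gap character `-` as "no residue" are assumptions for this example, and real tools use more refined conservation measures.

```python
from collections import Counter


def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences that
    carry the most common residue. Gaps ('-') never count as a residue,
    but they still count in the denominator, so gappy columns score lower.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    fractions = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            fractions.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        fractions.append(top_count / len(column))
    return fractions
```

For three made-up aligned sequences `["MKLV", "MRLV", "MK-I"]`, the first column is fully conserved (1.0), while each of the remaining columns is shared by two of the three sequences.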
For an alignment, we first need to find related sequences. To do this, we need to run a database search. The software will compare our sequence (the query sequence) to all other sequences in the database using a local alignment algorithm, BLAST, and, based on specific criteria, will output the amino acid sequences found to be related to our protein. Local alignment compares small segments of the query sequence to other sequences in the database to find matching amino acid segments. If we are happy with the search results and have a list of proteins to start the analysis with, we can run a so-called global alignment, which compares the sequences over their full length. This can be a pair-wise alignment of only two sequences or a multiple sequence alignment (MSA) of a group of sequences. MSA constitutes the basis of any sequence analysis and provides much more information than the alignment of a pair of sequences. It is extensively used in secondary and tertiary structure prediction and modeling, and to identify protein families and domains.
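The difference between local and global alignment can be sketched with textbook dynamic programming (in the style of Needleman-Wunsch for global and Smith-Waterman for local alignment). This is not how BLAST works internally, since BLAST is a faster heuristic; the simple match/mismatch/gap scores and the function name `align_score` are assumptions chosen to keep the example short, where a real run would use a substitution matrix.

```python
def align_score(a, b, local=False, match=2, mismatch=-1, gap=-2):
    """Return the best pairwise alignment score of sequences a and b.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of sub-segments.
    ASSUMPTION: simple match/mismatch scores and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)

    return best if local else score[-1][-1]
```

For two made-up sequences that share only a short segment, such as `"WWWACDEWWW"` and `"KKACDEKK"`, the local score is 8 (the four matching residues `ACDE`), while the global score is lower because the unrelated flanking residues must also be aligned.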
To run the local alignment search, I usually use the
Expasy and
EBI servers, although many others perform the same function. The image below shows an example of a BLAST-generated local alignment (66 residues) made on the UniProt server.