Sequence alignment tutorial 1

Aligning amino acid sequences is often straightforward, but some cases need extra attention: proteins that have diverged considerably and carry many insertions and deletions, or multidomain proteins, especially when not all of the domains are present in the protein of interest, as can happen in homology modeling. Information from the tertiary structure, such as the positions of helices, strands and loops, is always of great help for placing insertions and deletions correctly in the alignment. In this first tutorial we will explore an easy way of making a sequence alignment, focusing on the tools available at the Expasy and EBI servers, although many other servers will do the same job. We start with a protein that has a highly conserved sequence: subunit BchI of the enzyme magnesium chelatase. It is one of the three subunits required for the function of the enzyme, which catalyzes the insertion of a Mg2+ ion into protoporphyrin IX. This reaction is the first committed enzymatic step in chlorophyll biosynthesis.

In the second tutorial we will go through a more elaborate case. We will first try to identify the domains of subunit BchD (the second subunit of magnesium chelatase) and its potential relation to BchI. Then we will make an alignment and closely examine the conservation and the differences between the two proteins.

First, we need to choose the sequences for the alignment. For this we will use the UniProtKB database, which is accessible through the Expasy server. To start, simply type the name of the protein (BchI) into the UniProt or Expasy search window, and you will be taken to a list of BchI sequences from different organisms:

My Image

The figure above shows just the first few sequences; the actual list is much longer. On the left, under “Filtered by”, you may also notice that the sequences are divided into “Reviewed” and “Unreviewed”. We prefer to use reviewed sequences as much as possible, since their annotation has been checked by curators. Many of the unreviewed sequences are annotated automatically, and they sometimes contain assignment errors.
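The same filtering can also be expressed as a search URL. The sketch below only builds such a URL; it assumes the UniProt REST search endpoint at rest.uniprot.org and its `gene` and `reviewed` query fields, and the function name `build_search_url` is made up for this illustration.

```python
from urllib.parse import urlencode

# Assumed base address of the UniProtKB search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(gene, reviewed_only=True, size=25):
    """Build a UniProtKB search URL that asks for results in FASTA format.

    gene          -- gene name to search for, e.g. "bchI"
    reviewed_only -- if True, restrict the hits to "Reviewed" entries
    size          -- maximum number of entries requested
    """
    query = f"gene:{gene}"
    if reviewed_only:
        query += " AND reviewed:true"
    params = {"query": query, "format": "fasta", "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


# Nothing is downloaded here; the URL can be pasted into a browser.
print(build_search_url("bchI"))
```

Setting `reviewed_only=False` drops the filter, which corresponds to looking at the full list of reviewed and unreviewed entries.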

In this list we need to choose BCHI_RHOCB (entry P26239), which is subunit BchI from Rhodobacter capsulatus. On the page that opens after clicking the entry we find information on the biological function (photosynthesis, magnesium chelatase activity), the type of ligand or substrate it binds (ATP), the catalytic function (ATP hydrolysis), Protein Data Bank (PDB) entries (if available), links to published works, links to entries related to this particular protein in other databases, and of course the amino acid sequence of the protein. A very useful link is the one to the InterPro database. It provides plenty of information on the protein: its domain content, biological function, the family to which it belongs, etc. The actual amino acid sequence is presented in the following format:

My Image

To make the alignment, however, we need the sequence in FASTA format, which is required by the server. To get it we simply need to click the FASTA button on the left, then copy the sequence and paste it into the alignment window. At UniProt, though, sequences marked in the list are copied to the alignment window in FASTA format automatically.
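For those who prefer a script to clicking and copying, here is a minimal sketch of how the FASTA text of a single entry could be downloaded. It assumes that UniProt serves entries at `https://rest.uniprot.org/uniprotkb/<accession>.fasta`; the helper names `fasta_url` and `fetch_fasta` are our own.

```python
from urllib.request import urlopen


def fasta_url(accession):
    """Return the assumed UniProt address of an entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession, timeout=30):
    """Download one entry and return its FASTA text (needs network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the address is printed here, so the block runs without a connection.
print(fasta_url("P26239"))
```

Calling `fetch_fasta("P26239")` on a computer with internet access should return the BchI entry from Rhodobacter capsulatus, ready to be pasted into an alignment window.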

Normally one needs to spend a few minutes thinking about which sequences to include in the alignment. A good strategy is to pick 5-6 sequences from distantly related organisms, since this gives a better idea of the conservation of the most important amino acids in the protein. We can mark the sequences we want to include and click “add to basket” on the top menu. Don’t forget to choose “Reviewed” and to include the protein from Rhodobacter capsulatus. The figure below shows the open basket drop-down window. Clicking the “Full view” button opens the basket as a full web page.

My Image

On the full page mark all the sequences and click the “Align” button, which will start the alignment. The results look like this:

My Image

I have chosen 4 sequences here. You may notice that residues involved in ATP binding (nucleotide binding) are marked green, while the rest are coloured in various shades of grey. Other colouring options are available, and we may explore them if required. The protein is highly conserved, although the last two sequences have a much longer N-terminal part. These two sequences originate from plants, while the first two are from bacteria.
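Conservation can also be put into numbers. The sketch below scores each alignment column by the fraction of sequences that carry the most common residue, with gaps counting against the score. This is only one simple measure and not necessarily the one behind the colouring on the server, and the four short aligned sequences are made up: two without and two with an N-terminal extension.

```python
from collections import Counter


def column_conservation(aligned):
    """Return one score per alignment column (0.0 to 1.0).

    The score is the count of the most common residue in the column
    divided by the number of sequences; gap characters ("-") are ignored
    when counting residues, so gappy columns score low.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical toy alignment, not real BchI sequences.
toy_alignment = [
    "--MKTAG",
    "--MRTAG",
    "MSMKTSG",
    "MAMKTAG",
]
for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, score)
```

Columns with a score of 1.0 are fully conserved, while the first two columns score low because only half of the toy sequences have residues there.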

One disadvantage of using this server for alignment is that we cannot change the alignment parameters, such as the amino acid replacement matrix or the gap penalties discussed earlier. If this is required, one could use the EBI server (European Bioinformatics Institute). There are several alignment programs to choose from at the EBI site. Kalign works well for our purposes and provides options for modifying the gap penalties.
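To get a feeling for what a gap penalty does, here is a minimal sketch of a global pairwise alignment (the Needleman-Wunsch algorithm) with a linear gap penalty. It is not how Kalign works internally, and for brevity a plain match/mismatch score stands in for a real amino acid replacement matrix. The two short peptides are made up.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Align two sequences globally; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair the two residues
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical peptides: the second one lacks a single residue.
for penalty in (-2, -5):
    total, top, bottom = global_align("MKTAG", "MKAG", gap=penalty)
    print(f"gap penalty {penalty}: score {total}")
    print(top)
    print(bottom)
```

With these toy peptides one gap is unavoidable, so a harsher penalty only lowers the total score. With real, more divergent sequences the penalty decides whether the program opens a gap or accepts a run of mismatches instead.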

We can simply copy the list of sequences in FASTA format (provided on the page showing the alignment) and paste it into the alignment window of the EBI server. EBI also offers Jalview, a Java-based application with which we can colour the alignment in different ways, change its appearance and save a jpg image, e.g. for a publication or a presentation. Jalview can be used directly on EBI, but it is recommended to download and install the application on your own computer. The installed application has many more options, for example saving the alignment as an image file for later use, which is not available in the web version of the viewer.

A final note: The FASTA format
Many applications require the amino acid sequence to be in FASTA format. The FASTA format contains the amino acid sequence in one-letter code, usually with 60 letters per line. Most important is the ">" (“greater than”) sign at the start of the first line. Alignment programs like Kalign at the EBI server use the text after the ">" sign on that line as the title of the sequence in the alignment. For convenience, one could write the name of the protein on that line, which makes it a useful sequence identifier after running the alignment. The BchI sequence in FASTA format is shown in the image below:

My Image
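The format is simple enough to write and read with a few lines of code. The sketch below is one possible way of doing it, not an official parser; the function names are our own, and the demo sequence is just the 20 standard amino acid letters repeated, not the real BchI sequence.

```python
def format_fasta(header, sequence, width=60):
    """Return a FASTA record: a ">" title line, then the sequence in lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + chunks) + "\n"


def parse_fasta(text):
    """Return a list of (title, sequence) pairs found in FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up demo data: 140 letters, so the record has lines of 60, 60 and 20.
demo_sequence = "ACDEFGHIKLMNPQRSTVWY" * 7
demo_record = format_fasta("demo_protein", demo_sequence)
print(demo_record)
print(parse_fasta(demo_record)[0][0])
```

The title given after the ">" sign ("demo_protein" here) is what an alignment program would show as the name of the sequence.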