Amino acid sequence alignment is often simple to run, but some cases need extra attention: proteins that have diverged considerably and contain many insertions and deletions, or multidomain proteins, especially when not all domains are present in one of the proteins being compared, as can happen in homology modeling. Information from the tertiary structure, such as the positions of helices, strands and loops, is of great help in placing insertions and deletions correctly in the alignment.
In this first tutorial we will explore an easy way of making a sequence alignment, focusing on the tools available at the Expasy and EBI servers, although many other servers will do the same job. We start with a protein of highly conserved sequence, subunit BchI of the enzyme magnesium chelatase. It is one of three subunits required for this enzyme to catalyze the first committed step in chlorophyll biosynthesis, the insertion of a Mg2+ ion into protoporphyrin IX. In the second tutorial we will go through a slightly more complicated case: we will first identify the domain of BchD (the second subunit of magnesium chelatase) that is homologous to subunit BchI, and then align the sequences of the two domains to closely examine conservation and differences between the two proteins.
To make the alignment we first need to choose and retrieve the sequences. For this we will use the UniProtKB database within the Expasy group of servers. To start, simply type the name of the protein (BchI) into the UniProt or Expasy search window, and you will be taken to a list of BchI sequences from different organisms:
The figure shows just the first few sequences; the actual list contains many more. You may also notice on the left, under “Filtered by”, that the sequences are divided into “Reviewed” and “Unreviewed”. It is better to use reviewed sequences whenever possible, since these have been verified to be what we expect them to be. Many of the unreviewed sequences are annotated automatically, and they sometimes contain assignment errors.
From this list we need to choose BCHI_RHOCB (entry P26239), which is subunit BchI from Rhodobacter capsulatus. The page that opens contains information on the biological function (photosynthesis, magnesium chelatase activity), the type of ligand or substrate the protein binds (ATP), its catalytic function (ATP hydrolysis), Protein Data Bank (PDB) entries, if available, links to published works, links to entries for this protein in other databases, and of course the amino acid sequence of the protein. A very useful link is the one to the InterPro database, which provides plenty of information about the protein: its domain content, biological function, the family to which it belongs, etc. For sequence alignment we first need to retrieve the sequences of BchI from different organisms. Normally the sequence is presented in the following format:
However, to make an alignment we need the sequences in FASTA format. The alignment tool on this server does the conversion automatically when it is run. Later we can take the chosen sequences to another server that offers a wider choice of alignment parameters.
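The same retrieval can also be scripted. The following is a minimal sketch, assuming that the UniProt REST service returns a single entry as FASTA at the URL pattern shown; the function names fasta_url and fetch_fasta are made up for this illustration and are not part of any UniProt library.

```python
from urllib.request import urlopen

# Assumption: one UniProtKB entry can be downloaded as FASTA from this URL pattern.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession: str) -> str:
    """Build the download URL for a single UniProtKB entry (hypothetical helper)."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the FASTA record for one accession. Requires network access."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_fasta("P26239") should then return the BchI entry from Rhodobacter capsulatus as text, provided the server is reachable and the URL pattern is still valid.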
To run the alignment we first need to choose some additional sequences. It is worth spending a few minutes thinking about which sequences to include. A good strategy is to include sequences from distantly related organisms, since this gives a better idea of which amino acids are most important and therefore conserved. We can mark the sequences we want to include and click “add to basket” on the top menu. Don’t forget to choose “Reviewed” and to include the protein from Rhodobacter capsulatus. The figure below shows the open basket drop-down window. Clicking the “Full view” button opens the basket as a full web page.
On the full page mark all the sequences and click the “Align” button, which will start the alignment. The results I got look like this:
I have chosen 4 sequences here. You may notice that residues involved in ATP binding (nucleotide binding) are marked green, while the rest are coloured in various shades of grey. Other colouring options are available, which we may explore if required. We can see that the protein is highly conserved, although the last two sequences have a much longer N-terminal part. These two sequences originate from plants, while the first two are from bacteria.
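Conservation can also be put into numbers once the alignment is available as text. The sketch below scores each alignment column by the fraction of sequences that share its most common residue. The four short sequences are invented for illustration and are not the BchI alignment; the function name column_conservation is likewise just a label chosen here.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column: fraction of all sequences sharing the most common residue.

    Gap characters ("-") never count as the shared residue, but they do
    count in the denominator, so gapped columns score lower.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment (made-up sequences): the last two have an N-terminal extension.
toy = [
    "--MKTGAV",
    "--MRTGAV",
    "MSMKSGAV",
    "MAMKTGAL",
]
scores = column_conservation(toy)
# 1-based positions of columns where every sequence has the same residue.
fully_conserved = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)  # [3, 6, 7]
```

This simple identity measure ignores the similarity between amino acids (for example K and R), which a replacement matrix would take into account.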
One disadvantage of using this server for alignment is that we cannot change the alignment parameters discussed earlier, such as the amino acid replacement matrix or the gap penalties. If this is required, one could use the EBI (European Bioinformatics Institute) server. We can simply copy the list of sequences in FASTA format (provided on the page showing the alignment) and paste it into the alignment window of the EBI server. EBI also offers Jalview, a Java-based application with which we can colour the alignment in different ways, change its appearance and save a jpg image, e.g. for a publication or a presentation. Jalview can be used directly on EBI; however, it is recommended to download and install the application on your own computer. The installed application has many more options, for example saving the alignment to an image file for later use, which is not available in the web version of the viewer.
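To see why the gap penalty matters, here is a minimal sketch of a global (Needleman-Wunsch) alignment score for two sequences. It makes two simplifying assumptions that differ from real alignment servers: a plain match/mismatch score stands in for an amino acid replacement matrix, and each gap position costs the same (a linear penalty) instead of separate gap-opening and gap-extension penalties. The sequences are made up.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j letters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i letters of a aligned against gaps only
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] opposite a gap in b
                curr[j - 1] + gap,   # b[j-1] opposite a gap in a
            ))
        prev = curr
    return prev[-1]


# Toy example: the second sequence lacks one residue (T).
mild = global_alignment_score("MKTGAV", "MKGAV", gap=-2)
harsh = global_alignment_score("MKTGAV", "MKGAV", gap=-10)
print(mild, harsh)  # 8 0
```

With the mild penalty the single deletion costs little; with the harsh one it cancels out the five matches. On real, divergent sequences a change of this kind can also alter where the gaps are placed, which is the reason for wanting control over these parameters.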
A final note: The FASTA format
Many applications require the amino acid sequence to be in FASTA format. A FASTA file contains the amino acid sequence in one-letter code, usually with 60 letters per line. Most important is the ">" ("greater than") sign at the start of the first line. Alignment programs like CLUSTALW use the text after the > sign on that line as the label for that particular sequence in the alignment. For convenience, one can leave the name of the protein on that line, where it serves as a sequence identifier after the alignment has been run. The BchI sequence in FASTA format is shown in the image below:
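The layout described above is simple enough to write and read with a few lines of code. The following sketch wraps a sequence at 60 letters per line under a ">" header and reads such records back. The header text and the repeated placeholder sequence are invented for illustration and are not the real BchI entry; to_fasta and parse_fasta are names chosen here.

```python
def to_fasta(header, sequence, width=60):
    """Format one sequence as a FASTA record with `width` letters per line."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    lines = [">" + header]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder: 130 letters of a repeated pattern, not a real protein.
placeholder = "ACDEFGHIKL" * 13
record = to_fasta("example_protein placeholder sequence", placeholder)
print(record)
```

The printed record has the header line followed by two lines of 60 letters and one of 10. Several records written one after another form the multi-sequence input that alignment programs expect.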