Multiple Sequence Alignment Tutorial: Understanding & Analyzing the Alignment
Background
Amino acid sequence alignment is a valuable tool for studying proteins. Nowadays, various bioinformatics databases provide a wealth of information about a protein using only its sequence. With the introduction of AlphaFold and the ESM Metagenomic Atlas, even the three-dimensional structures of many proteins for which no experimental structure is available can be used in the analysis. The insights provided by amino acid sequence alignment into a protein’s conservation pattern, the presence of functionally important conserved motifs, domain structure, and evolutionary relationships remain highly valuable for studying proteins and complement the three-dimensional structure perfectly.
In this tutorial, I will analyze two related proteins mentioned earlier when I discussed sequence alignment: BchI and BchD, two subunits of the magnesium chelatase enzyme. I will use the tools available at UniProt and EBI multiple sequence alignment (MSA) servers. Although many other servers will do the same job, I am used to these two. We will explore how to make a sequence alignment while examining the information in UniProt and other related databases to learn more about these two proteins.
Subunits BchI (35 kDa) and BchD (70 kDa) together build up a large 600 kDa complex (see image below); there is also a third subunit required for the enzyme’s function called BchH (about 120 kDa). However, we are not going to discuss it here. Magnesium chelatase catalyzes the insertion of an Mg2+ ion into protoporphyrin IX at the first committed step of chlorophyll biosynthesis.
Based on their size, it is likely that BchI and BchD are multidomain proteins. In these instances, it’s crucial to identify the specific domains and analyze conservation patterns (motifs), shedding light on their evolutionary origin. Insights gained from this analysis could enhance our understanding of how these enzymes function.
Cryo-electron microscopy reconstruction of the complex of subunits BchI and BchD of Rhodobacter capsulatus magnesium chelatase. Where appropriate, the available X-ray structure of subunit BchI of the enzyme (shown in ribbon representation) was docked into the EM density. Other domains were modeled based on known structures from other proteins. Published in Lundqvist et al., Structure 2010.

Finding The Sequence & Initial Analysis
First, we need to choose the sequences for the alignment. We type the name of the protein (BchD) into the UniProt search box, which takes us to a list of protein sequences from different organisms.
The site shows a large number of sequences. On the left, under “Status,” you can filter for “Reviewed” and “Unreviewed” sequences. This is one of my favorite features: when there is a sufficient number of reviewed sequences, I usually choose them for further analysis, because they have been verified to be what we expect them to be. Many of the Unreviewed sequences are annotated automatically, and they sometimes contain assignment errors. The left panel offers other helpful filters as well, such as the availability of a 3D structure, catalytic activity, etc.
We have selected BCHD_RHOCB (entry P26175), which refers to the BchD subunit from Rhodobacter capsulatus. In plants, the equivalent subunit is called ChlD. The “B” in BchD indicates that the protein is involved in bacteriochlorophyll synthesis. Upon clicking on the entry ID P26175, the page that opens will contain detailed information on magnesium chelatase, including its biological function (photosynthesis, magnesium chelatase activity), the type of ligands/substrates it binds, its catalytic function (insertion of magnesium and ATP hydrolysis), links to published works, links to related entries in other databases, and, of course, the amino acid sequence of BchD.
If you click “Family & Domains” on the left menu of the UniProt page, you’ll see BchD’s domain content. In this case, only the vWFA domain (von Willebrand Factor A-like domain superfamily) at the C-terminal end of the protein (residues 379-559) is shown. However, the InterPro database link directs us to the BCHD_RHOCB page, where we find a more detailed analysis of the domain composition of the protein. Apart from the vWFA domain, InterPro also identifies a P-loop NTPase family domain in the N-terminal part of the sequence, residues 78-235. The characteristic P-loop sequence motif, which is [AG]-x(4)-G-K-[ST] as defined in the Prosite database, is not conserved in R. capsulatus BchD. We will look closer into this when we analyze the sequence alignment. There is also a link to the AlphaFold predicted structure of the subunit, where we can explore the domains and see their proposed arrangement in three dimensions.
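To make the Prosite notation concrete, here is a minimal Python sketch that translates a simple Prosite-style pattern into a regular expression and scans a sequence with it. The function names are my own, and the translation covers only the pattern elements used in this tutorial; it is not a full Prosite parser. The test sequence is the first 60 residues of BchI, copied from the FASTA record at the end of this page.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Only a subset is handled: 'x' (any residue), '[..]' (allowed residues),
    '{..}' (forbidden residues) and '(n)' or '(n,m)' repeat counts.
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as "(4)" or "(2,4)".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (start, end, matched_text) tuples with 1-based residue numbers."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


# First 60 residues of R. capsulatus BchI (from the FASTA record in this tutorial).
BCHI_NTERM = "MTTAVARLQPSASGAKTRPVFPFSAIVGQEDMKLALLLTAVDPGIGGVLVFGDRGTGKST"
P_LOOP = "[AG]-x(4)-G-K-[ST]"

print(prosite_to_regex(P_LOOP))
print(find_motif(BCHI_NTERM, P_LOOP))
```

In this BchI fragment the pattern matches residues 52-59 (GDRGTGKS). Running the same function on the full BchD sequence downloaded from UniProt is a quick way to check the statement that the motif is not conserved there.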
The vWFA domain is characterized by the conservation of the MIDAS motif, which stands for metal ion-dependent adhesion site. This motif includes the DXSXS sequence (X means any residue) and additional threonine and aspartate (T and D) residues located further down in the sequence. In the BchD protein, the DXSXS motif residues are D385, S387, and S389, located close to the N-terminus of the vWFA domain. Threonine T452 and aspartate D482 are found further down in the sequence. These residues, marked by yellow boxes and a # on top of the alignment in the image below, are involved in binding metal ions such as Ca2+ or Mg2+. The alignment can also be found on NCBI’s Conserved Protein Domain Family server. The exact function of this domain within the magnesium chelatase complex is still unknown. For a detailed description of the vWFA domain, refer to the paper by Lacy et al. (2004).
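The DXSXS part of the MIDAS motif is easy to search for in any sequence. The sketch below is only an illustration: the function name and the toy sequence are hypothetical, and a DXSXS hit on its own does not prove that a MIDAS site is present, since the downstream T and D residues and the structural context also matter.

```python
import re

# A lookahead is used so that overlapping candidates are all reported.
DXSXS = re.compile(r"(?=(D.S.S))")


def find_dxsxs(sequence: str):
    """Return 1-based positions of the D, S and S residues of each DxSxS hit."""
    hits = []
    for m in DXSXS.finditer(sequence.upper()):
        d_pos = m.start() + 1
        hits.append((d_pos, d_pos + 2, d_pos + 4))
    return hits


# Hypothetical toy sequence, not a real protein.
toy_sequence = "MKDASGSLTDKSESQ"
print(find_dxsxs(toy_sequence))
```

For the full BchD sequence, the text above leads us to expect a hit at positions 385, 387 and 389.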

Closer to the bottom of the UniProt page, we will find the sequence of the protein:

To use the sequence for a database search, you can download it by clicking “Download” at the top. The downloaded sequence is in a different format from the one displayed on the page, known as the FASTA format. A brief description of this format can be found at the bottom of this page. Since multiple sequences are needed for the multiple sequence alignment, BLAST (Basic Local Alignment Search Tool) is typically used to obtain a list of related sequences. To run BLAST, select “Tools” and click “Blast.” You can also choose the substitution matrix for the search on the BLAST page. BLAST searches the entire database and provides a list of sequences, each aligned with our query sequence (pairwise alignment). Once the BLAST search has finished, you can select several sequences from the list based on the organism and percentage sequence identity and add them to the Basket for later use.
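If you prefer to fetch the sequence from a script rather than through the “Download” button, a minimal sketch using only the Python standard library could look like this. I am assuming here that the UniProt REST service serves FASTA records at the URL pattern shown; check the current UniProt help pages before relying on it. The function names are my own.

```python
from urllib.request import urlopen

# Assumed URL pattern of the UniProt REST service (verify against UniProt help).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession: str) -> str:
    """Build the download URL for one UniProt accession, e.g. P26175."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the FASTA record as text (requires network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example, left commented out because it needs a network connection:
# print(fetch_fasta("P26175"))
```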
The image below shows part of the list after running BLAST. On the BLAST search page, you need to click the dark bar on the right to view the alignment of the query sequence with that protein.
Making The Amino Acid Sequence Alignment
To perform the alignment, choosing 3-4 sequences is best, as this will provide a clearer idea of the conservation pattern within the protein family. To run the alignment, I select the sequences I added to the Basket and then choose “Align” from the Basket menu. This will take us to the alignment page. An important option is the “Output sequence order” under “Advanced parameters,” where I usually select “input order” to group bacterial and plant sequences separately, making the analysis of the conservation pattern easier. We can also edit the sequences in the alignment window if necessary. The sequences in the alignment window are in FASTA format, which differs from the default format shown on the UniProt page. When using the FASTA format, I usually remove most of the information on the top row of the sequence (everything that comes after the “>” sign) and only leave the name of the protein, for example, BCHD_RHOCB, to simplify the alignment analysis. Don’t forget to include the protein from Rhodobacter capsulatus (the top row in the image below):
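Trimming the header lines by hand gets tedious with more than a few sequences, so here is a small sketch that does it automatically. It assumes the usual UniProt header layout, “db|accession|entry name” followed by a description, and keeps only the entry name; the function names are my own.

```python
def simplify_header(header: str) -> str:
    """Reduce a UniProt-style FASTA header to '>ENTRY_NAME'."""
    text = header.lstrip(">").strip()
    first_token = text.split()[0] if text else ""
    fields = first_token.split("|")
    # UniProt headers look like 'sp|ACCESSION|ENTRY_NAME description...'.
    # Anything else falls back to the first word of the header.
    name = fields[2] if len(fields) >= 3 else first_token
    return ">" + name


def simplify_fasta(fasta_text: str) -> str:
    """Apply simplify_header to every header line of a multi-FASTA text."""
    lines = []
    for line in fasta_text.splitlines():
        lines.append(simplify_header(line) if line.startswith(">") else line)
    return "\n".join(lines)


example = ">sp|P26239|BCHI_RHOCB Magnesium-chelatase 38 kDa subunit OS=Rhodobacter capsulatus"
print(simplify_header(example))
```

For the BchI header used as the example, this leaves just “>BCHI_RHOCB”.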

Analyzing The Sequence Alignment
The alignment is colored according to identity/similarity. Another option is to color by residue type, which highlights hydrophobic, charged, and polar amino acids. We can see that the sequence (and structure) of BchD is well-conserved among different bacteria, with just a few insertions and deletions.
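The identity coloring can be mimicked with a very simple per-column score: the fraction of sequences that share the most common residue in that column. The sketch below uses this score, which is my own simplification and not the scheme used by the alignment viewer, on a hypothetical toy alignment.

```python
from collections import Counter


def column_conservation(aligned_sequences):
    """Per-column fraction of sequences sharing the most common residue.

    Gaps ('-') are never counted as the conserved residue, but they do
    count in the denominator, so gappy columns score lower.
    """
    lengths = {len(s) for s in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned_sequences):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical toy alignment (three sequences, seven columns).
toy_alignment = [
    "MKD-SGS",
    "MRDASGS",
    "MKDATGS",
]
scores = column_conservation(toy_alignment)
fully_conserved = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)
```

Columns with a score of 1.0 are identical in all sequences; these are the positions worth checking first against known motifs.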
We can also examine the AlphaFold predicted structure of BchD (image below) to get additional insights into the sequence-structure relationships for this protein. As mentioned, the C-terminal domain (residues 375-559) was earlier recognized as a vWFA-type domain. The predicted structure shows that a long stretch of the sequence, which does not have much secondary structure, connects the C-terminal vWFA domain to the N-terminal domain. The residues from approximately 232 to 309 (the orange color means the prediction is unreliable) are followed by a region with some secondary structure, which continues until Met375, where the vWFA domain starts. Looking at the sequence alignment, we can see that up to Glu309 we have a so-called low-complexity region: it is not well conserved and contains many repeats and many acidic residues. Such regions are usually very flexible and are often involved in protein-protein interactions. This suggests that this region may interact with other subunits within the large 600 kDa magnesium chelatase complex. The rest of the sequence, up to the vWFA domain, appears to have more secondary structure, although judging by the color, the reliability of the prediction is still low. The N-terminal domain, which contains residues 1-232, has the conserved fold of the so-called AAA+ family of proteins, as demonstrated in the publication by Fodje et al.; the smallest magnesium chelatase subunit, BchI, belongs to the same family. It is possible to verify this by submitting the AlphaFold predicted structure to a fold recognition server such as DALI.
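One rough way to spot an acidic linker like this without looking at the alignment is to slide a window along the sequence and record the fraction of Asp and Glu residues. This is a crude sketch: the window size is an arbitrary assumption and the toy sequence is hypothetical, and dedicated low-complexity tools do a more careful job.

```python
def acidic_fraction_windows(sequence: str, window: int = 20):
    """Return (start, fraction) per window; start is a 1-based residue number.

    The fraction is the share of D and E residues inside the window.
    Sequences shorter than the window give an empty list.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    acidic = [1 if residue in "DE" else 0 for residue in sequence.upper()]
    results = []
    for start in range(0, len(sequence) - window + 1):
        fraction = sum(acidic[start:start + window]) / window
        results.append((start + 1, fraction))
    return results


# Hypothetical toy sequence with an acidic stretch in the middle.
toy_sequence = "AAAAADEDEDAAAAA"
profile = acidic_fraction_windows(toy_sequence, window=5)
richest_start, richest_fraction = max(profile, key=lambda item: item[1])
print(richest_start, richest_fraction)
```

Applied to the full BchD sequence, windows with a high acidic fraction would be expected to fall within the linker region described above.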

I also made an amino acid sequence alignment of BchD with the smaller magnesium chelatase subunit, BchI. The sequence similarity is about 29%, which is not very high. The alignment (shown below; only part of the BchD sequence is included) also shows that the ATP binding site residues (the P-loop motif, marked on the alignment) of BchI are not conserved in BchD. This suggests that R. capsulatus BchD is unlikely to hydrolyze ATP. However, it is still possible that ATP binds to the protein and drives oligomerization of the complex, as it does for BchI. The alignment also shows large insertions in the BchI sequence compared to the BchD N-terminal domain. This is a good reason for submitting the predicted structure of the BchD N-terminal domain to a fold recognition server. This way, we can find out which AAA+ proteins its three-dimensional structure is closest to.
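Percentages like the one quoted above depend on how they are counted. As an illustration, the sketch below computes percent identity over the columns where both sequences have a residue. This is only one of several conventions, and it measures identity rather than similarity, so it will not necessarily reproduce the number reported by the server. The aligned strings are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical pair of aligned sequences (gaps shown as '-').
seq_a = "MKT-AVL"
seq_b = "MRTGA-L"
print(percent_identity(seq_a, seq_b))
```

Here five columns are compared and four are identical, giving 80%. Dividing by the full alignment length or by the length of the shorter sequence instead would give a different value, which is worth remembering when comparing numbers from different tools.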

Concluding remarks
This tutorial showcases the capabilities of sequence analysis tools at UniProt and other servers, including the AlphaFold structure prediction. Unfortunately, AlphaFold predictions are not currently classified in the CATH database, but we hope this will change soon. Until then, we will need to rely on our analysis to identify the domain fold in a predicted structure.
There are likely many other methods to combine sequence and structure analysis. Additionally, our analysis will depend on our goals and what we aim to achieve. I hope this introduction will assist you in your search for information.
A note on the FASTA format of an amino acid sequence
When working with various applications, it’s often necessary to have the amino acid sequence in FASTA format. This format consists of the amino acid sequence in one-letter code, typically with 60 letters per line. The most important feature is the “>” symbol at the beginning of the first line. When using alignment programs at the EBI MSA server, the text after the “>” symbol on that line is used as the alignment title for the sequence. To make things more convenient, you can write the name of the protein on that line, which will serve as a helpful sequence identifier after the alignment is completed. Below is the BchI sequence in FASTA format:
>sp|P26239|BCHI_RHOCB Magnesium-chelatase 38 kDa subunit OS=Rhodobacter capsulatus (strain ATCC BAA-309 / NBRC 16581 / SB1003) OX=272942 GN=bchI PE=1 SV=1
MTTAVARLQPSASGAKTRPVFPFSAIVGQEDMKLALLLTAVDPGIGGVLVFGDRGTGKST
AVRALAALLPEIEAVEGCPVSSPNVEMIPDWATVLSTNVIRKPTPVVDLPLGVSEDRVVG
ALDIERAISKGEKAFEPGLLARANRGYLYIDECNLLEDHIVDLLLDVAQSGENVVERDGL
SIRHPARFVLVGSGNPEEGDLRPQLLDRFGLSVEVLSPRDVETRVEVIRRRDTYDADPKA
FLEEWRPKDMDIRNQILEARERLPKVEAPNTALYDCAALCIALGSDGLRGELTLLRSARA
LAALEGATAVGRDHLKRVATMALSHRLRRDPLDEAGSTARVARTVEETLP
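Because the format is so simple, it can be read and written with a few lines of Python. The following is a minimal sketch with function names of my own choosing; for real work, an established library such as Biopython is the safer option.

```python
def parse_fasta(text: str):
    """Parse FASTA text into a list of (header, sequence) tuples.

    The header is returned without the leading '>' and the sequence
    lines are joined into one string.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Write one record with the sequence wrapped at `width` letters per line."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Hypothetical two-record example.
example_text = ">protA some description\nMKT\nAVL\n>protB\nGG\n"
records = parse_fasta(example_text)
print(records)
print(format_fasta("protA", "MKTAVL", width=4))
```

Parsing the BchI record above with parse_fasta gives a single record whose sequence is 350 residues long.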