Protein Structure Databases: Short Overview

There are many protein and structural bioinformatics-related resources on the Internet. Some of them are of general character; some are dedicated to specific aspects of proteins and protein families, specific functions, metabolic pathways, etc. Here we will discuss just two general-type databases.

The first questions to ask when exploring a protein and its function are probably whether a 3D structure exists and where to get the coordinate file. When working with coordinate files, one would also like to know what information is stored in them. The primary database for protein structures is the Protein Data Bank (PDB), created at the beginning of the 1970s. Only a few structures existed at that time, and the only experimental method then available for protein structure determination was X-ray crystallography. As we can see from the image below, the growth of PDB content has been accelerating since the 1990s:

[Image: growth of PDB content over time]

One reason for this structural revolution was that cloning techniques entered the lab, and both the number of proteins available for crystallization and the amounts in which they could be obtained increased drastically. Before the cloning era, proteins were purified directly from cells, which substantially limited availability, since a cell contains only a limited number of copies of any given protein. To obtain a few milligrams of a protein for crystallization, large volumes of cells had to be grown. Cloning solved this problem: proteins could be expressed in large quantities and purified for crystallization.

Another substantial factor was the introduction of synchrotron radiation for X-ray data collection. A number of synchrotrons around the world currently provide high-intensity X-rays for collecting high-quality diffraction data. This substantially reduced the time spent optimizing crystallization conditions, which had been necessary to grow crystals large enough for the relatively low-intensity laboratory X-ray sources.

The third factor, I believe, was the introduction of low-cost personal computers with ever-increasing computational and graphics processing power. Cheaper computers also brought new software, which gradually became more user-friendly. In the early days of crystallography, a computer with a proper graphics monitor, needed for model building and refinement of a protein structure, would cost around 100 thousand dollars, obviously unaffordable for personal use. Now a decent PC or Mac is all we need.

Then came the era of structural genomics: large consortia were formed with the aim of developing new technologies for solving large numbers of protein structures. As the number of structures grew, the number of protein databases also increased, and new tools for the analysis of protein sequence and structure were rapidly developed.

Although the number of structures in the PDB is rapidly increasing, one should remember that far from all PDB entries are unique. In many cases the database holds many entries for the same protein: some are mutant variants, others are complexes with ligands (substrate analogues, inhibitors, co-factors), complexes with other proteins, etc. This may cause confusion when fetching a structure from the PDB: which one should we choose if there are many entries for the same protein? For now we need to remember that not all structures in the PDB are of equal quality, and that we need to identify the one with the best available quality.
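Choosing among several entries for the same protein can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it takes resolution (in Å, lower is better) as the main quality indicator and the free R-factor as a tie-breaker, which are two commonly reported measures for crystal structures but not the only ones. The `Entry` type, the `best_entry` function and the entry labels are hypothetical, and the numbers are made up rather than taken from real PDB entries.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """Minimal, hypothetical summary of one database entry."""
    label: str
    resolution: Optional[float]  # in Angstrom; None if not reported
    r_free: Optional[float]      # free R-factor; None if not reported


def best_entry(entries: List[Entry]) -> Entry:
    """Pick the entry with the lowest resolution value, then the lowest R-free.

    Assumption: lower resolution value and lower R-free mean better quality.
    Entries without a reported resolution are ignored.
    """
    candidates = [e for e in entries if e.resolution is not None]
    if not candidates:
        raise ValueError("no entry reports a resolution")
    return min(
        candidates,
        key=lambda e: (e.resolution,
                       e.r_free if e.r_free is not None else float("inf")),
    )


# Made-up entries for one protein: wild type, a mutant and a ligand complex.
entries = [
    Entry("wild type", 2.9, 0.28),
    Entry("mutant", 2.1, 0.24),
    Entry("inhibitor complex", 2.1, 0.21),
    Entry("no resolution reported", None, None),
]
chosen = best_entry(entries)
print(chosen.label)
```

In practice the choice also depends on the question being asked: a lower-resolution complex with the ligand of interest may be more useful than the best-resolved apo structure.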

Using the PDB we can easily find the structure of the protein of interest and assess its quality. We just need to type its name into the search window on the PDB web site, for example "pyruvate kinase". Generally one gets many hits, and some of them will be unrelated to the search. PDBsum and PDBe (PDB Europe) usually give more accurate search results. It is also possible to refine the search using the options provided by the PDB site.
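Once an entry has been found, its coordinate file can also be retrieved and read programmatically. The sketch below makes two assumptions: that the RCSB file server serves legacy-format files at `https://files.rcsb.org/download/<ID>.pdb`, and that the file follows the fixed-column layout of ATOM/HETATM records in the legacy PDB format. The function names and the `Atom` type are illustrative, `1ABC` is a placeholder ID, and the example record is made up to show the column layout.

```python
import re
from dataclasses import dataclass
from typing import List
from urllib.request import urlopen

# Assumed download URL pattern of the RCSB file server.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def coordinate_file_url(pdb_id: str) -> str:
    """Build the download URL for a classic four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def fetch_coordinates(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the coordinate file as text (requires network access)."""
    with urlopen(coordinate_file_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def read_atoms(pdb_text: str) -> List[Atom]:
    """Collect all ATOM/HETATM records from the text of a coordinate file."""
    return [parse_atom_line(line) for line in pdb_text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


# Made-up record, assembled field by field to show the column layout.
EXAMPLE_RECORD = (
    "ATOM  "    # columns 1-6: record name
    "    1"     # columns 7-11: atom serial number
    " "
    " CA "      # columns 13-16: atom name
    " "         # column 17: alternate location indicator
    "MET"       # columns 18-20: residue name
    " "
    "A"         # column 22: chain identifier
    "   1"      # columns 23-26: residue sequence number
    "    "      # columns 27-30: insertion code and padding
    "  27.340"  # columns 31-38: x
    "  24.430"  # columns 39-46: y
    "   2.614"  # columns 47-54: z
)
atom = parse_atom_line(EXAMPLE_RECORD)
print(atom.name, atom.res_name, atom.chain_id, atom.x, atom.y, atom.z)
```

With network access, `read_atoms(fetch_coordinates("1ABC"))` would return the atoms of that entry, with `1ABC` replaced by a real ID. For anything beyond a quick look, an established parser is a safer choice than hand-written column slicing.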

RCSB PDB, PDBe and PDBsum all provide plenty of additional data, including links to other databases where more information can be found. Below is an example from the PDBsum link page. We may, for example, be interested in the links to the CATH and SCOP databases, or to others.

[Image: PDBsum link page]

We also need to remember that PDB files contain the so-called asymmetric unit of the crystal. The biological functional unit in solution may contain several subunits of the same protein, arranged as dimers, trimers etc., as discussed earlier. Often the subunits in these quaternary structures are related by some symmetry, for example a two-fold, three-fold or four-fold rotation for a dimer, trimer or tetramer, respectively. When the molecules are crystallized, they are arranged in a certain type of space lattice, within which all molecules are ordered and related to each other by the symmetry operations of the space group of the crystal (the possible space groups are listed in the International Tables for Crystallography).

The symmetry in solution, for example 2-, 3-, or 4-fold, may become part of the crystallographic symmetry. In such cases one subunit of, for example, a trimer will become the asymmetric unit of a crystal with a 3-fold symmetry axis. Crystallographic calculations are usually performed using the asymmetric unit, since the other subunits, related to the first by symmetry, will be exactly the same. This is reflected in the content of PDB files: they only contain the atomic coordinates of the asymmetric unit. The PDB server reconstructs the biological unit in cases when it is known to be different from the asymmetric unit. The biological unit may be chosen when viewing the 3D structure in the graphics display on the site, or it may be downloaded. For clarity, the concept of the asymmetric unit is illustrated in the image below:

[Image: the asymmetric unit in a protein crystal]

On the left, the asymmetric unit of the crystal is just one subunit, and all molecules in the lattice are related to each other by simple translation. In the middle example there are two subunits in the unit cell, related to each other by a two-fold rotation axis. In this case there is a good chance that the biological unit of the protein in solution is actually a dimer. The example on the right shows molecules in the unit cell related by a 4-fold crystallographic symmetry axis. Again, it cannot be excluded that the biological unit is a tetramer, but in all three cases the asymmetric unit is a monomer.
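The way symmetry-related subunits are generated from the asymmetric unit can be illustrated with a small calculation. The sketch below assumes an n-fold rotation axis that runs along z through the origin, and it uses made-up coordinates. In a real entry the transformation matrices for the biological unit are supplied with the coordinate file (in the legacy PDB format, as BIOMT records in REMARK 350) and generally include a translation as well. The function names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def rotate_z(point: Point, angle_deg: float) -> Point:
    """Rotate a point about the z axis by the given angle in degrees."""
    angle = math.radians(angle_deg)
    x, y, z = point
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
        z,
    )


def build_oligomer(asymmetric_unit: List[Point], n_fold: int) -> List[List[Point]]:
    """Generate n_fold copies of a subunit related by an n-fold rotation.

    Assumption: the symmetry axis is the z axis through the origin.
    The first copy (rotation by 0 degrees) is the asymmetric unit itself.
    """
    if n_fold < 1:
        raise ValueError("n_fold must be at least 1")
    return [
        [rotate_z(p, 360.0 * k / n_fold) for p in asymmetric_unit]
        for k in range(n_fold)
    ]


# Made-up "subunit" of two atoms; a 3-fold axis turns it into a trimer.
monomer = [(10.0, 0.0, 0.0), (12.0, 1.0, 3.0)]
trimer = build_oligomer(monomer, 3)
for copy_number, subunit in enumerate(trimer, start=1):
    print(copy_number, [tuple(round(c, 3) for c in p) for p in subunit])
```

Only the coordinates of `monomer` need to be stored. The other two subunits follow from the symmetry operation, which is why a coordinate file can restrict itself to the asymmetric unit.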