Protein Structure Databases: Short Overview

There are many structural bioinformatics resources on the Internet. Some are general in character, while others are dedicated to specific aspects of proteins and protein families, specific functions, metabolic pathways, etc. A good example of a general educational resource, to which you will find several links on this site, is Proteopedia. It provides in-depth discussions of the structural and functional aspects of a large number of proteins. Here we focus on the databases that are used most often and are essential for structural biology and structural bioinformatics: the RCSB Protein Data Bank (PDB) and two European resources, PDBe and PDBsum.

The Protein Data Bank: PDB, PDBe & PDBsum
The first questions to ask when exploring a protein should probably be: is there a 3D structure, and where can we get the coordinate file? When working with coordinate files, it is good to know what information is stored in the file. For a long period of time the primary database for protein structures was the RCSB Protein Data Bank, created at the beginning of the 1970s. Only a few structures existed at the time, and the only experimental method then available for protein structure determination was X-ray crystallography. As we can see from the image below, PDB content growth has been accelerating since the 1990s:

The structural revolution
There are of course several reasons for this structural revolution. One of them was that cloning techniques entered the lab, and both the number of different proteins and the quantities available for crystallization increased drastically. Before the cloning era proteins were purified directly from cells, which substantially limited availability, since there is always a limited number of copies of a given protein in a cell. To obtain a few milligrams of a protein for crystallization, large cell volumes had to be grown. Cloning solved the problem: proteins could be expressed in large quantities and purified for crystallization.

Another essential factor was the introduction of synchrotron radiation for X-ray data collection. Several synchrotrons around the world currently provide high-intensity X-rays for high-quality diffraction data collection. Before synchrotrons, crystallization conditions had to be optimized to grow crystals large enough for the relatively low-intensity laboratory X-ray sources available at that time. Synchrotron radiation reduced the time required for this optimization.

The third factor, I believe, was the introduction of low-cost personal computers with ever-increasing computational and graphics processing power. Cheaper computers also meant new software, which started to become user-friendly. In the mid-1980s a computer with a proper graphics monitor, which was needed for model building and refinement, would cost around 100,000 dollars, obviously unaffordable for personal use. Now a good PC or Mac is all we need.

Then came the era of structural genomics: large consortia were formed with the aim of developing new technologies for crystallization and for solving large numbers of protein structures. With the growing number of structures, the number of protein databases also started to increase, and new tools for the analysis of protein sequence and structure were rapidly developed.

Although the number of structures in the PDB is rapidly increasing, one should remember that far from all PDB entries are unique. Often there are many entries for the same protein in the database - some are mutant variants, others may be complexes with ligands (substrate analogues, inhibitors, co-factors), complexes with other proteins, etc. This may be a source of confusion when we try to fetch a structure from the PDB - which one should we choose if there are many entries for the same protein? For our purposes, we need to remember that not all structures in the PDB are created equal, and we need to identify the one with the best available quality (see the discussions on the Ramachandran plot and on Homology Modeling).
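As a rough illustration of comparing several entries of the same protein, the sketch below reads the reported resolution from the header of each coordinate file and picks the entry with the lowest value. It assumes files in the legacy PDB text format, where the resolution is given on a "REMARK   2 RESOLUTION." line. The function names, the entry labels and the header fragments are made up for this example. Resolution is only one of several quality indicators, so this is a first filter and not a full quality assessment.

```python
import re

# Matches the legacy PDB header line, e.g. "REMARK   2 RESOLUTION.    1.50 ANGSTROMS."
RESOLUTION_RE = re.compile(
    r"^REMARK\s+2\s+RESOLUTION\.\s+(\d+\.\d+)\s+ANGSTROMS", re.MULTILINE
)


def resolution_from_header(pdb_text):
    """Return the reported resolution in angstroms, or None if there is none."""
    match = RESOLUTION_RE.search(pdb_text)
    return float(match.group(1)) if match else None


def best_resolved(entries):
    """Return the label of the entry with the lowest (best) resolution value.

    entries maps a label to the text of a PDB file. Entries without a
    reported resolution are skipped; None is returned if none qualify.
    """
    scored = {}
    for label, text in entries.items():
        resolution = resolution_from_header(text)
        if resolution is not None:
            scored[label] = resolution
    if not scored:
        return None
    return min(scored, key=scored.get)


# Made-up header fragments for illustration only (not real PDB entries).
example_entries = {
    "entry_a": "REMARK   2 RESOLUTION.    2.80 ANGSTROMS.\n",
    "entry_b": "REMARK   2 RESOLUTION.    1.50 ANGSTROMS.\n",
    "entry_c": "REMARK   2 RESOLUTION. NOT APPLICABLE.\n",
}

print(best_resolved(example_entries))
```

In practice one would also look at the other quality measures reported for the entry before settling on a structure.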

Using the PDB we can easily find the structure of the protein of interest and assess its quality. We just need to type the name of the protein into the search window on the PDB site. Generally, one gets many hits, and some of them may be unrelated to the search. PDBsum and PDBe (PDB Europe) usually give narrower search results. It is also possible to refine the search using the options provided at the PDB site.
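Once an entry has been chosen, it helps to see what a coordinate file actually stores. The sketch below reads the fixed-width ATOM/HETATM lines of a legacy-format PDB file into named fields. The function names and the sample line are illustrative, and the download URL pattern is an assumption about the RCSB file server at the time of writing, so check it against the site before relying on it. Nothing is downloaded here.

```python
def coordinate_file_url(pdb_id):
    """Build a download URL for an entry in legacy PDB format.

    The URL pattern is an assumption; verify it against the RCSB site.
    """
    return "https://files.rcsb.org/download/" + pdb_id.upper() + ".pdb"


def parse_atom_record(line):
    """Split one fixed-width ATOM/HETATM line into named fields."""
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "atom_name": line[12:16].strip(),   # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],               # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),            # columns 31-38, angstroms
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66
    }


def read_atoms(pdb_text):
    """Return the parsed ATOM and HETATM records of a PDB file's text."""
    return [
        parse_atom_record(line)
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]


# Illustrative line in the standard column layout.
SAMPLE_LINE = (
    "ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N"
)

atom = parse_atom_record(SAMPLE_LINE)
print(atom["atom_name"], atom["residue_name"], atom["chain_id"], atom["x"])
print(coordinate_file_url("1crn"))
```

Slicing by column position rather than splitting on whitespace is deliberate: in this format neighbouring fields may touch each other with no space in between.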

Both the PDB (including PDBe) and PDBsum provide plenty of additional data, including links to other databases where more information can be found. Below is an example from the PDBsum link page.

[Image: example from the PDBsum link page]

The asymmetric unit - what is it?
We also need to remember that PDB files contain the so-called asymmetric unit of the crystal. The biological functional unit (the quaternary structure) in solution may contain several subunits of the same protein, arranged as dimers, trimers, etc., as discussed earlier. Often the subunits in these quaternary structures are related by some symmetry, for example a two-fold, three-fold or four-fold rotation for a dimer, trimer or tetramer, respectively. When the molecules are crystallized, on the other hand, they are arranged in the space lattice of the crystal. Within this lattice all molecules are ordered and related to each other by the so-called symmetry operations of the particular symmetry group (space group) of the crystal; the possible space groups are listed in a book called International Tables for Crystallography. A symmetry operation is a mathematical transformation which, when applied to the coordinates of a molecule, transforms it into its symmetry-related mate in the crystal lattice. This is what we mean when we say that molecules in the crystal lattice are arranged according to certain symmetry operations.
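To make the idea of a symmetry operation concrete, here is a minimal sketch in which the operation is written as a rotation matrix R plus a translation vector t, so that each atom position x is mapped to x' = R x + t. The function name, the choice of the z axis and the coordinates are assumptions made only for this example. For simplicity the sketch works in Cartesian coordinates with the rotation axis through the origin, whereas crystallographic operators are normally tabulated in fractional coordinates.

```python
def apply_symmetry_operation(coords, rotation, translation):
    """Apply x' = R x + t to every (x, y, z) position in coords."""
    transformed = []
    for point in coords:
        new_point = tuple(
            sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        transformed.append(new_point)
    return transformed


# Two-fold rotation about the z axis: (x, y, z) -> (-x, -y, z).
TWO_FOLD_Z = [
    [-1, 0, 0],
    [0, -1, 0],
    [0, 0, 1],
]
NO_TRANSLATION = (0.0, 0.0, 0.0)

# Hypothetical atom positions in angstroms.
atoms = [(1.0, 2.0, 3.0), (4.0, 0.5, -1.0)]

mate = apply_symmetry_operation(atoms, TWO_FOLD_Z, NO_TRANSLATION)
print(mate)
```

Applying the two-fold operation a second time brings every atom back to its starting position, which is what makes it a two-fold axis.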

For oligomeric proteins, the symmetry in solution, if present (not all oligomeric structures are symmetric), could be represented by, for example, a 2-, 3-, or 4-fold rotation axis. When the protein is crystallized, the symmetry axis of the oligomer may become part of the crystallographic symmetry. For example, the 3-fold rotation axis of a trimer may become a 3-fold symmetry axis of the crystal. In this case one of the subunits of the trimer will be the “asymmetric unit” of the crystal. In other words, the mathematical operations required for generating the other molecules of the crystal lattice need to be applied to only one of the molecules of the trimer. For this reason, calculations in crystallography are usually performed using only the asymmetric unit of the crystal; the other subunits (of the trimer in our case) are related to the first by symmetry and are identical to it. This is reflected in the content of PDB files: they contain only the atomic coordinates of the asymmetric unit. However, the PDB server reconstructs the biological unit in cases when it is known to be different from the asymmetric unit. When we need the biological unit, we may choose it when viewing the 3D structure in the graphics display on the site, or when we download the coordinate file.
For clarity, the concept of the asymmetric unit is illustrated in the image below:

The asymmetric unit in a protein crystal

In the left panel of the figure, the asymmetric unit of the crystal is just one subunit and all molecules in the lattice are related to each other by simple translation. In the middle example there are two subunits in the unit cell, related to each other by a two-fold rotation axis (a rotation of 180 degrees around the axis). This suggests that the protein in solution (the biological unit) may be a dimer. The third example, on the right, shows molecules in the unit cell related by a 4-fold crystallographic symmetry. Again, it cannot be excluded that the biological unit in solution is a tetramer. In all these examples the asymmetric unit is a monomer, but we may also have a dimer, trimer, etc. in the asymmetric unit. In this case all the symmetry operations described will be applied to the whole dimer, trimer, etc.
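The rotational cases described above can be sketched in a few lines: starting from the coordinates of one asymmetric unit, an n-fold axis generates n copies by rotating in steps of 360/n degrees. The function names, the choice of the z axis as the rotation axis and the monomer coordinates are hypothetical, chosen only to illustrate the principle. Lattice translations are left out.

```python
import math


def rotation_about_z(angle_degrees):
    """Return the 3x3 matrix for a rotation about the z axis."""
    angle = math.radians(angle_degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ]


def rotate(point, matrix):
    """Multiply a 3x3 matrix by an (x, y, z) position."""
    return tuple(
        sum(matrix[i][j] * point[j] for j in range(3)) for i in range(3)
    )


def generate_copies(asymmetric_unit, n_fold):
    """Return n_fold copies of the unit related by an n-fold axis along z.

    The first copy (rotation by 0 degrees) is the asymmetric unit itself.
    """
    copies = []
    for k in range(n_fold):
        matrix = rotation_about_z(360.0 * k / n_fold)
        copies.append([rotate(atom, matrix) for atom in asymmetric_unit])
    return copies


# Hypothetical monomer with two atoms, coordinates in angstroms.
monomer = [(5.0, 0.0, 1.0), (6.0, 1.0, 2.0)]

tetramer = generate_copies(monomer, 4)
for index, subunit in enumerate(tetramer):
    print(index, [tuple(round(value, 3) for value in atom) for atom in subunit])
```

If the asymmetric unit were itself a dimer or trimer, the same function would simply be given all of its atoms, and the whole group would be rotated together.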