Protein Databases: Short Overview

There are many protein and structural bioinformatics resources on the Internet. Some are general in character, while others are dedicated to specific protein families, specific metabolic pathways, etc. Here we will discuss just a few general-purpose databases.

The first question when working with a protein is where to find its structure. It is also useful to know what is actually inside a structure file: what type of information is kept there and how it is presented. The primary database for protein structure information is the Protein Data Bank (PDB), created in the early 1970s. Only a few structures existed at that time, and the only experimental method available for protein structure determination was X-ray crystallography. The structural revolution started in the 1990s:

PDB growth statistics

One of the reasons for this structural revolution was that cloning techniques entered the lab, and both the number of proteins and the amounts available for crystallization increased drastically. Before the cloning era proteins had to be purified from cells, which substantially limited availability: obtaining a few milligrams of a protein for crystallization required a lot of cells. Cloning solved this problem, since proteins could now be expressed in large quantities and purified for crystallization.

Another important factor was the introduction of synchrotron radiation. Synchrotrons such as MAX IV in Lund, Sweden, ESRF in Grenoble, France, DESY in Hamburg, Germany, and many others around the world provide very high-intensity X-rays, which can be used to collect high-quality X-ray diffraction data even from small crystals. This largely eliminated the time-consuming stage of optimizing the crystallization conditions, which had been required to obtain crystals large enough for the relatively low X-ray intensity of home sources.

The third factor was probably the introduction of personal computers, relatively cheap and with ever-increasing power. Cheaper computers also meant new software, which started to become user-friendly, and new graphics capabilities of monitors became available. In the early days of crystallography, a computer with a proper graphics monitor for model building would cost around 50-60 thousand dollars! Now a good PC or Mac is all we need. That was when the number of protein structures started to increase dramatically.

Then came the era of structural genomics: large consortia were formed with the aim of developing new technology for solving large numbers of protein structures. One such consortium is the Structural Genomics Consortium (SGC). With the increasing number of structures, the number of protein databases also started to grow, and new tools for the analysis of protein sequence and structure were rapidly developed.

Every newly determined protein structure now has to be deposited with the Protein Data Bank before the scientific paper describing it can be published. The number of structures in the PDB has exceeded 100 000. However, one should remember that not all structures in the PDB are unique. In many cases there are many entries for the same protein in the database: some are mutant variants, others are complexes with ligands (substrate analogues, inhibitors, cofactors), complexes with other proteins, etc. This may be a source of confusion when fetching a structure from the PDB: which one should you choose if there are many entries for the same protein? This will be discussed later in the chapter on homology modeling. For modeling it is important to choose the right structure with the best available quality.

Coming back to our initial questions: how do we download a structure, and what is inside a PDB file? First we need to check whether there is a structure for the protein we are interested in. This part is easy: go to the PDB and type the name of the protein into the search window. For example, enter the name of a protein called magnesium chelatase. Generally one would get several hits; in the case of magnesium chelatase, however, there is only one X-ray structure, for one of the subunits of the enzyme. Some other proteins may be listed in the output; some of them come from electron microscopy modeling, others may be totally unrelated. PDBsum gives clearer results: entering the name of the same protein, we get a single hit (PDB ID 1g8p). Of course, you may refine your search using the options that show up on the PDB page when you enter the name of the protein. Among these options, you can choose the organism from which the protein originates, a particular subunit, the experimental method, etc.
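
Once the PDB ID is known, the coordinate file can also be fetched programmatically. The following is only a minimal sketch in Python: the function names pdb_download_url and fetch_pdb are invented for illustration, and the address pattern assumes the file download service of the RCSB PDB site (files.rcsb.org), which may change over time.

```python
import re
import urllib.request

# Assumption: the RCSB site serves PDB-format files under this address pattern.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id: str) -> str:
    """Build the download address for a four-character PDB ID such as '1g8p'."""
    pdb_id = pdb_id.strip()
    # Classic PDB IDs: one digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Za-z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id.upper())


def fetch_pdb(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the entry and return its text (needs network access)."""
    url = pdb_download_url(pdb_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_pdb("1g8p") would then return the text of the magnesium chelatase entry, provided the address pattern assumed above is still valid.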

Both PDB and PDBsum provide additional data on the entry, including links to other databases where more information can be found. Here is an example from the PDBsum links page:

PDBsum links page

For our purposes we may be interested in the links to CATH and SCOP (structural classification). The PQS (Protein Quaternary Structure) database is also of interest. However, when you click on this link the database will inform you that it has not been updated since 2009. The reason is that the information that used to be found in PQS is now generated by the PISA server (Protein Interfaces, Surfaces and Assemblies).

Such a service is needed because PDB files usually contain the crystallographic unit, the so-called asymmetric unit. The biological unit in solution may contain several subunits of the same protein, arranged as dimers, trimers or higher-order oligomers. In these oligomers the subunits are usually related by some kind of symmetry: two-fold rotation for dimers, three-fold rotation for trimers, four-fold rotation for tetramers, etc. When the molecules are crystallized, they become arranged in a lattice within which all molecules are ordered and related to each other by the symmetry operations of the space group of the crystal (the possible space groups are listed in the International Tables for Crystallography). The symmetry axes present in the molecule in solution, which could be 2-, 3-, or 4-fold, may become part of the crystallographic symmetry. In such cases one subunit of, for example, a trimer becomes the asymmetric unit of the crystal. Crystallography operates with asymmetric units, since the other units are exactly the same and are related to it by the symmetry operations of the crystal. This is reflected in the content of the files in the PDB: in such cases they contain coordinates only for the atoms of one subunit, the asymmetric unit.

The PISA server reconstructs the biological unit in cases when it is known to be different from the asymmetric unit or when there are other indications that need to be taken into account. The file generated by the PISA server may also be downloaded from the PDB. The concept of the asymmetric unit is illustrated in the figure below:

The asymmetric unit in a protein crystal

In the left figure the asymmetric unit of the crystal is just one subunit and all molecules in the lattice are related to each other by simple translation. In the middle figure there are two subunits in the unit cell, related to each other by a two-fold rotation axis. In this case there is a good chance that the biological unit of the protein in solution is a dimer. In the figure on the right the molecules in the unit cell are related by a 4-fold crystallographic symmetry axis. Again, it cannot be excluded that the biological unit is a tetramer.
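
The way symmetry-related copies are generated from a single asymmetric unit can be sketched in a few lines of Python. This is a toy illustration and not the procedure used by PISA: the function names are invented here, the coordinates are made-up points rather than real atoms, and the rotation axis is assumed to coincide with the z axis through the origin. Real crystallographic operations are defined by the space group of the crystal and may also include translations.

```python
import math


def rotate_about_z(point, angle_deg):
    """Rotate an (x, y, z) point about the z axis by angle_deg degrees."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def build_oligomer(asymmetric_unit, n_fold):
    """Return n_fold copies of the subunit related by an n-fold rotation axis.

    asymmetric_unit is a list of (x, y, z) coordinates for one subunit.
    Copy 0 is the original subunit; copy k is rotated by k * 360 / n_fold.
    """
    if n_fold < 1:
        raise ValueError("n_fold must be a positive integer")
    step = 360.0 / n_fold
    return [[rotate_about_z(atom, k * step) for atom in asymmetric_unit]
            for k in range(n_fold)]


# Toy subunit with two made-up points (hypothetical, not real atom positions).
subunit = [(1.0, 0.0, 0.0), (2.0, 0.5, 1.0)]
trimer = build_oligomer(subunit, 3)      # three copies, 120 degrees apart
tetramer = build_oligomer(subunit, 4)    # four copies, 90 degrees apart
```

With n_fold set to 2, 3 or 4 the same function produces a dimer, trimer or tetramer, which mirrors how one deposited subunit can stand for the whole oligomer when the molecular symmetry axis is part of the crystal symmetry.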

One essential feature is a description of the amino acid sequence in relation to the secondary structure of the protein. This is provided by both PDB and PDBsum. The image below shows the page from PDBsum:

PDBsum structure presentation

The information on this page is useful for quickly identifying the position of amino acids within the structure and for getting an idea of the type of the protein (all alpha, alpha/beta, etc.). There is also information on the publication that describes the structure (rather detailed in PDBsum, with links to citing papers). On the next page we will continue with the discussion of the PDB and the information that can be found in PDB files.