The Protein Databank (PDB): File Format and Content

Here we continue the discussion of the PDB and look inside the coordinate file. Downloading a PDB file is easy (and free!). After finding the protein of interest and clicking on its name, we are taken to the protein-specific page. In the right corner of that page there is a drop-down menu where we can choose the format of the PDB file we want to download (shown in the image below). PDB files are plain text files and can be opened in any text editor (or even in MS Word). The file is called a "coordinate file" simply because it contains a list of the coordinates of all atoms of the protein structure in a conventional orthogonal coordinate system. Each atom position is defined by its x, y, z coordinates.

PDB search results

The file contains important information on the method used to solve the structure and on various parameters related to the quality of the X-ray data (such as the resolution and the R-factor). The R-factor and the resolution are the two most central parameters for assessing the quality of a structure. Higher resolution (that is, a smaller value in Å) and a lower R-factor indicate a better fit of the model to the experimental electron density. In my experience, high-quality, well-refined protein structures have a resolution of 2.2 Å or better and R-factor values below 20%. The PDB file also contains the symmetry operations for the specific space group of the crystal, data on the quality of the geometry of the model (deviations of bond lengths, bond angles and torsion angles from ideal values), the secondary structure content, a description of missing regions in the structure (a result of weak electron density due to the flexibility of those regions), etc.
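To make the R-factor more concrete: the conventional crystallographic R-factor compares the observed structure-factor amplitudes with those calculated from the model, R = Σ| |F_obs| - |F_calc| | / Σ|F_obs|. The following is only a minimal sketch of that formula. The function name `r_factor` and the amplitudes are made up for illustration, and the scaling between observed and calculated data that refinement programs apply is left out.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor: sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    if len(f_obs) != len(f_calc):
        raise ValueError("amplitude lists must have the same length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Made-up amplitudes, for illustration only (not taken from any real data set).
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 50.0, 45.0]

r_value = r_factor(f_obs, f_calc)
print(f"R = {r_value:.3f}")  # prints "R = 0.107"
```

A perfect model would give R = 0, and the value is usually quoted either as a fraction (0.20) or as a percentage (20%).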

Below is part of a PDB file header showing data such as the resolution (2.1 Å), the resolution range of the data, the number of reflections collected from the crystal, the R-factor, etc. After that come the start and end residues of the helices and β-strands in the structure, and last in this view are the coordinates of the atoms:

REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : CNS 1.0
REMARK 3 AUTHORS : BRUNGER,ADAMS,CLORE,DELANO,GROS,GROSSE-
REMARK 3 : KUNSTLEVE,JIANG,KUSZEWSKI,NILGES, PANNU,
REMARK 3 : READ,RICE,SIMONSON,WARREN
REMARK 3
REMARK 3 REFINEMENT TARGET : ENGH & HUBER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 2.10
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 29.55
REMARK 3 DATA CUTOFF (SIGMA(F)) : 0.000
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 312841.620
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 0.0000
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 97.9
REMARK 3 NUMBER OF REFLECTIONS : 22179
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
REMARK 3 FREE R VALUE TEST SET COUNT : 2207
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.005
REMARK 3
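The quality indicators in such a header can also be picked out by a short script. The sketch below uses regular expressions that tolerate variable spacing, so it should work on lines like those shown above. The names `PATTERNS` and `read_quality` are our own, and the wording of REMARK 3 differs between refinement programs, so treat the patterns as assumptions to be adapted.

```python
import re

# A few header lines copied from the listing above.
header = """\
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
"""

# Hypothetical patterns; \s+ absorbs the variable spacing of real files.
PATTERNS = {
    "resolution": r"^REMARK\s+2\s+RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS",
    "r_work": r"^REMARK\s+3\s+R VALUE\s+\(WORKING SET\)\s*:\s*([\d.]+)",
    "r_free": r"^REMARK\s+3\s+FREE R VALUE\s*:\s*([\d.]+)",
}


def read_quality(text):
    """Return the quality values found in the header text as floats."""
    values = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            values[key] = float(match.group(1))
    return values


quality = read_quality(header)
print(quality)
```

For this structure the free R value (0.247) is only slightly higher than the working R value (0.214), which is the expected relationship between the two.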

Further down there is a list of the secondary structure elements within the structure, also showing the first and last residue in each element:

HELIX 1 1 PRO A 22 ILE A 26 5 5
HELIX 2 2 GLN A 29 ASP A 42 1 14
HELIX 3 3 PRO A 43 GLY A 46 5 4
HELIX 4 4 ASP A 53 GLY A 57 5 5
HELIX 5 5 SER A 59 LEU A 69 1 11
HELIX 6 6 ASN A 84 ILE A 88 5 5
HELIX 7 7 SER A 114 GLY A 120 1 7
HELIX 8 8 ASP A 123 GLY A 131 1 9
HELIX 9 9 GLY A 138 ASN A 144 1 7
HELIX 10 10 GLU A 152 LEU A 156 5 5
HELIX 11 11 GLU A 157 GLY A 171 1 15
HELIX 12 12 ARG A 202 ASP A 207 1 6
HELIX 13 13 ASP A 220 ASP A 237 1 18
HELIX 14 14 ASP A 237 LEU A 263 1 27
HELIX 15 15 PRO A 264 VAL A 266 5 3
HELIX 16 16 PRO A 269 LEU A 283 1 15
HELIX 17 17 GLY A 287 GLU A 305 1 19
HELIX 18 18 GLY A 311 SER A 324 1 14
HELIX 19 19 HIS A 325 LEU A 327 5 3
HELIX 20 20 VAL A 341 LEU A 349 1 9
SHEET 1 A 5 VAL A 106 LEU A 109 0
SHEET 2 A 5 GLY A 146 ILE A 150 1 O TYR A 147 N VAL A 107
SHEET 3 A 5 PHE A 188 GLY A 194 1 O VAL A 189 N LEU A 148
SHEET 4 A 5 VAL A 48 PHE A 51 1 N VAL A 48 O LEU A 190
SHEET 5 A 5 LEU A 211 GLU A 214 1 O LEU A 211 N LEU A 49
SHEET 1 B 2 ILE A 72 VAL A 75 0
SHEET 2 B 2 VAL A 99 LYS A 102 -1 N ILE A 100 O ALA A 74
SHEET 1 C 2 ALA A 121 LEU A 122 0
SHEET 2 C 2 PHE A 135 GLU A 136 -1 N GLU A 136 O ALA A 121
SHEET 1 D 2 GLU A 172 VAL A 175 0
SHEET 2 D 2 ILE A 182 PRO A 185 -1 O ILE A 182 N VAL A 175
CRYST1 90.259 90.259 83.716 90.00 90.00 120.00 P 65 6
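The HELIX records can be summarized in the same way. In each record the last two numbers are the helix class (1 is a right-handed α-helix, 5 a right-handed 3-10 helix) and the length of the helix in residues. The sketch below simply splits each line on whitespace, which is enough for the records shown here. This is an assumption of the example: real files use fixed columns and may carry insertion codes or comments, in which case slicing by column is safer. The names `parse_helix` and `HELIX_CLASSES` are illustrative.

```python
from collections import Counter

# The first three HELIX records from the listing above.
records = """\
HELIX 1 1 PRO A 22 ILE A 26 5 5
HELIX 2 2 GLN A 29 ASP A 42 1 14
HELIX 3 3 PRO A 43 GLY A 46 5 4
"""

# Only the two helix classes that occur in this file are named here.
HELIX_CLASSES = {1: "alpha", 5: "3-10"}


def parse_helix(line):
    """Parse one whitespace-separated HELIX record into a dictionary."""
    tokens = line.split()
    if len(tokens) != 11 or tokens[0] != "HELIX":
        raise ValueError(f"unexpected HELIX record: {line!r}")
    return {
        "chain": tokens[4],
        "start": int(tokens[5]),
        "end": int(tokens[8]),
        "helix_class": int(tokens[9]),
        "length": int(tokens[10]),
    }


helices = [parse_helix(line) for line in records.splitlines()]

# The stated length should equal the number of residues from start to end.
consistent = all(h["end"] - h["start"] + 1 == h["length"] for h in helices)

counts = Counter(HELIX_CLASSES.get(h["helix_class"], "other") for h in helices)
print(consistent, dict(counts))
```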

After the general informational part, the x,y,z coordinates of the atoms are listed:

ATOM 1 N ARG A 18 14.699 61.369 62.050 1.00 39.19 N
ATOM 2 CA ARG A 18 14.500 62.241 60.856 1.00 38.35 C
ATOM 3 C ARG A 18 13.762 61.516 59.729 1.00 36.05 C
ATOM 4 O ARG A 18 14.354 60.740 58.982 1.00 34.91 O
ATOM 5 CB ARG A 18 15.850 62.753 60.334 1.00 42.36 C
ATOM 6 CG ARG A 18 16.537 63.770 61.247 1.00 46.92 C
ATOM 7 CD ARG A 18 17.825 64.314 60.629 1.00 51.24 C
ATOM 8 NE ARG A 18 18.442 65.347 61.462 1.00 54.15 N

When looking at the coordinates, notice that the structure starts at amino acid Arg 18! Amino acids 1 to 17 are missing. The reason is that there was no electron density into which these residues could be built (see for example the discussion on structure quality in homology modeling). This is normally a result of high flexibility of that particular region of the structure. It is essentially impossible to find the correct positions for the amino acids without the guiding electron density. We therefore need to be aware that many structures in the PDB have missing parts: sometimes loop regions, very often side chains of residues on the surface of the molecule, and in the worst cases a whole domain may be missing.
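Gaps of this kind can be found by scanning the residue numbers that actually occur in the ATOM records. The function below is a minimal sketch with a made-up residue list. It assumes that numbering is expected to start at 1 and ignores insertion codes, neither of which holds for every PDB entry. Note also that PDB files list the residues missing from the model explicitly in REMARK 465.

```python
def find_gaps(residue_numbers, first_expected=1):
    """Return (start, end) ranges of residue numbers absent from the model."""
    present = sorted(set(residue_numbers))
    gaps = []
    previous = first_expected - 1
    for number in present:
        if number > previous + 1:
            gaps.append((previous + 1, number - 1))
        previous = number
    return gaps


# Hypothetical model: it starts at residue 18 and a loop (60-64) is missing.
modelled = list(range(18, 60)) + list(range(65, 101))

gaps = find_gaps(modelled)
print(gaps)  # prints [(1, 17), (60, 64)]
```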

The number after the ATOM record name is simply the sequential number of the atom in the list. It is followed by the atom name: N is the main-chain nitrogen, and CA means C-α, the carbon atom to which the side chain of the amino acid is attached. Next is the main-chain carbonyl carbon C, followed by the carbonyl oxygen O. Side-chain atoms such as CB, CG and CD (beta, gamma, delta, etc.) are named and listed according to the Greek alphabet. After the atom name we find the name of the amino acid (ARG in this case), followed by a letter, in this file A. This is the so-called chain identifier. When the structure consists of several polypeptide chains (e.g. the tetramers of hemoglobin and pyruvate kinase discussed earlier), each chain gets its own identifier, like A, B, C, etc. The chain identifier is followed by the residue number (18 here). The 3 following numbers (e.g., 14.699, 61.369, 62.050 for the very first atom) are the x, y, z coordinates of the atom, given in Å. They describe the position of each atom in an orthogonal coordinate system. If we can describe the positions of all atoms in the protein, we will be able to draw the whole molecular structure.
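In a real PDB file every field of an ATOM record sits at a fixed column position, so a parser normally slices each line by column rather than splitting on spaces. The sketch below follows the column layout of the PDB format (for example, x in columns 31-38 and the chain identifier in column 22). The function name `parse_atom_line` and the dictionary keys are our own choices, and the helper `ATOM_FORMAT` exists only to rebuild the first atom of the listing in fixed-width form.

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM record by the fixed PDB column positions."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "alt_loc": line[16].strip(),       # column 17
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78
    }


# Helper for this example only: writes the fields back at their fixed columns.
ATOM_FORMAT = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}"
)
line = ATOM_FORMAT.format(
    "ATOM", 1, " N", "", "ARG", "A", 18, "",
    14.699, 61.369, 62.050, 1.00, 39.19, "N",
)

atom = parse_atom_line(line)
print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"])
```

Slicing by column keeps working when two neighbouring fields run into each other without a separating space, which happens for large coordinates or atom numbers.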

The x, y, z coordinates are followed by a number, which is 1.00 in most cases. This is called the occupancy. Sometimes the side chain of an amino acid may have two or more different conformations due to local flexibility. These conformations can be distinguished in the electron density map of the structure. In this case the crystallographer will build all conformations and for each atom refine the occupancy (1 for full occupancy, < 1 for partial occupancies, with the occupancies of the alternatives normally adding up to 1). In PDB files these alternate conformations are marked with an alternate location indicator, a letter such as A or B placed directly before the residue name.
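A simple consistency check is to add up the occupancies of all copies of the same atom. The sketch below uses made-up atoms of one residue, each given as (atom name, alternate-location letter, occupancy), where the letter (A, B, ...) is the one the PDB format stores just before the residue name. The data and the function name `occupancy_totals` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical residue: the side chain is modelled in two conformations
# (A and B), while the main-chain CA atom has a single, fully occupied position.
atoms = [
    ("CA", "", 1.00),
    ("CB", "A", 0.60),
    ("CB", "B", 0.40),
    ("OG", "A", 0.60),
    ("OG", "B", 0.40),
]


def occupancy_totals(atoms):
    """Sum the occupancies of all alternate copies of each atom name."""
    totals = defaultdict(float)
    for name, _alt_loc, occupancy in atoms:
        totals[name] += occupancy
    return dict(totals)


totals = occupancy_totals(atoms)
for name, total in totals.items():
    status = "ok" if abs(total - 1.0) < 0.01 else "check"
    print(name, f"{total:.2f}", status)
```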

The last numeric column in the file contains the temperature factor, or B-factor, of each atom in the structure (the letter at the very end of the line is the chemical element). The B-factor describes the displacement of the atomic position from its average (mean) value, expressed as a mean-square displacement. Higher flexibility results in larger displacements, and consequently lower electron density. This is simply because the atoms of a flexible side chain (or other part of the structure) are distributed over a larger volume, leading to lower density per unit of volume. In graphics programs we can color a protein according to B-factor value. Areas with high B-factors are then colored red (hot), while areas with low B-factors are colored blue (cold). An inspection of a PDB structure with such a coloring scheme will immediately reveal regions of high flexibility in the structure. The core of the molecule usually has low B-factors, due to tight packing of the side chains (enzyme active sites are usually located there). B-factor values are normally between 15 and 30 Å², but are often much higher than 30 Å² for flexible regions.
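For an isotropic B-factor the relation to the mean-square displacement is B = 8π²⟨u²⟩, so a B-factor can be converted into a root-mean-square displacement in Å. The sketch below does this conversion and also averages the B-factors per residue, using the eight Arg 18 atoms listed above. The function names and the threshold of 30 Å² (taken from the range mentioned in the text) are choices made for this example only.

```python
import math


def rms_displacement(b_factor):
    """RMS displacement in Å from an isotropic B-factor, B = 8*pi^2*<u^2>."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))


def mean_b_by_residue(atoms):
    """Average B-factor per residue number from (res_seq, b_factor) pairs."""
    sums = {}
    for res_seq, b_factor in atoms:
        total, count = sums.get(res_seq, (0.0, 0))
        sums[res_seq] = (total + b_factor, count + 1)
    return {res: total / count for res, (total, count) in sums.items()}


# B-factors of the eight Arg 18 atoms in the coordinate listing above.
b_values = (39.19, 38.35, 36.05, 34.91, 42.36, 46.92, 51.24, 54.15)
arg18 = [(18, b) for b in b_values]

means = mean_b_by_residue(arg18)
flexible = [res for res, b in means.items() if b > 30.0]

print(f"mean B of residue 18: {means[18]:.1f}")
print("residues above 30:", flexible)
print(f"B = 30 corresponds to about {rms_displacement(30.0):.2f} Å")
```

The B-factors of Arg 18 increase along the side chain, from about 35-39 for the main-chain atoms to above 50 at the tip, as expected for a flexible surface residue at the start of the modelled chain.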

This concludes the structural part. In the next chapter we will dive into the world of protein sequences to establish the connection between sequence and structure. In the homology modeling part we will make practical use of all we learn here about sequence and structure.