PDB File Format and Content

Here we continue the discussion of the PDB and look inside a coordinate file. Downloading a PDB file is easy (and free!). After finding the protein of interest and clicking on its name, we are taken to the protein-specific page. In the right corner of that page there is a drop-down menu where we can choose the format of the file we want to download (shown in the image below). PDB files are plain text files and can be opened in any text editor (or even in MS Word). The file is called a "coordinate file" simply because it lists the coordinates of all atoms of the protein structure in a conventional orthogonal coordinate system. As in any such coordinate system, each atom position is defined by its x, y, z coordinates, here given in Å.

[Figure: PDB search results]

Apart from the coordinates, the file also contains important information about the structure: the method used to solve it, parameters related to the quality of the X-ray data (such as resolution), the symmetry operations for the space group of the crystal, the quality of the model geometry (deviations of bond lengths, bond angles and torsion angles from ideal values), the secondary structure content, a description of regions missing from the structure (a result of weak electron density due to flexibility or disorder), etc.

The R-factor and the resolution are probably the two most central parameters for assessing the quality of a structure. The R-factor tells us how well the structural model fits the X-ray data (the electron density); the lower the value, the better the fit. Higher resolution of the X-ray data (a smaller number in Å) usually provides a better fit and results in a lower R-factor. In my experience, good-quality, well-refined protein structures have a resolution of 2.2 Å or better and an R-factor below 20% (written as 0.20 in the PDB header). At this resolution the electron density for most atoms looks nice and is well separated from that of neighboring atoms.
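For reference, the conventional crystallographic R-factor is usually written as shown below. The notation is not introduced in this text and is assumed here from standard usage: F_obs are the structure factor amplitudes measured in the X-ray experiment, F_calc are the amplitudes calculated from the model, and the sums run over all reflections hkl.

```latex
R = \frac{\sum_{hkl} \bigl|\, |F_{\mathrm{obs}}(hkl)| - |F_{\mathrm{calc}}(hkl)| \,\bigr|}
         {\sum_{hkl} |F_{\mathrm{obs}}(hkl)|}
```

With this definition R = 0 would mean perfect agreement between model and data, and a header value of 0.20 corresponds to the 20% mentioned above.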

Below is part of a PDB file header showing data such as the resolution (2.10 Å), the resolution range of the data (from the lowest, 29.55 Å, to the highest, 2.10 Å), the number of reflections collected from the crystal during the X-ray experiment (22179), the R-factor (0.214), etc. The R-factor is acceptable in this case, but not great. This is due to the high flexibility of parts of the structure, which makes reliable model building for them essentially impossible. As a result, parts of the model are not accounted for, which worsens the fit to the experimental data.

REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : CNS 1.0
REMARK 3 AUTHORS : BRUNGER,ADAMS,CLORE,DELANO,GROS,GROSSE-
REMARK 3 : KUNSTLEVE,JIANG,KUSZEWSKI,NILGES, PANNU,
REMARK 3 : READ,RICE,SIMONSON,WARREN
REMARK 3
REMARK 3 REFINEMENT TARGET : ENGH & HUBER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 2.10
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 29.55
REMARK 3 DATA CUTOFF (SIGMA(F)) : 0.000
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 312841.620
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 0.0000
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 97.9
REMARK 3 NUMBER OF REFLECTIONS : 22179
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
REMARK 3 FREE R VALUE TEST SET COUNT : 2207
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.005
REMARK 3
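The quality indicators in this header can also be pulled out programmatically. The following is a minimal Python sketch, not a full PDB parser: the function name `header_quality` and the dictionary keys are made up for illustration, and the regular expressions assume the REMARK 2 and REMARK 3 wording shown in the excerpt above (other refinement programs may word these lines differently).

```python
import re

# Three lines copied from the header excerpt above.
HEADER = """\
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
"""

# \s+ tolerates both single spaces and the wider padding of real PDB files.
PATTERNS = {
    "resolution": r"^REMARK\s+2\s+RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS",
    "r_work": r"^REMARK\s+3\s+R\s+VALUE\s+\(WORKING\s+SET\)\s*:\s*([\d.]+)",
    "r_free": r"^REMARK\s+3\s+FREE\s+R\s+VALUE\s*:\s*([\d.]+)",
}


def header_quality(text):
    """Return resolution, R and free R as floats (None if a line is absent)."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        result[key] = float(match.group(1)) if match else None
    return result


quality = header_quality(HEADER)
print(quality)
```

Returning None for a missing value, rather than raising an error, is a design choice for this sketch: structures solved by other methods may not report all of these values.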

Further down there is a list of the secondary structure elements within the structure, also showing the first and last residues in each element:

HELIX 1 1 PRO A 22 ILE A 26 5 5
HELIX 2 2 GLN A 29 ASP A 42 1 14
HELIX 3 3 PRO A 43 GLY A 46 5 4
HELIX 4 4 ASP A 53 GLY A 57 5 5
HELIX 5 5 SER A 59 LEU A 69 1 11
HELIX 6 6 ASN A 84 ILE A 88 5 5
HELIX 7 7 SER A 114 GLY A 120 1 7
HELIX 8 8 ASP A 123 GLY A 131 1 9
HELIX 9 9 GLY A 138 ASN A 144 1 7
HELIX 10 10 GLU A 152 LEU A 156 5 5
HELIX 11 11 GLU A 157 GLY A 171 1 15
HELIX 12 12 ARG A 202 ASP A 207 1 6
HELIX 13 13 ASP A 220 ASP A 237 1 18
HELIX 14 14 ASP A 237 LEU A 263 1 27
HELIX 15 15 PRO A 264 VAL A 266 5 3
HELIX 16 16 PRO A 269 LEU A 283 1 15
HELIX 17 17 GLY A 287 GLU A 305 1 19
HELIX 18 18 GLY A 311 SER A 324 1 14
HELIX 19 19 HIS A 325 LEU A 327 5 3
HELIX 20 20 VAL A 341 LEU A 349 1 9
SHEET 1 A 5 VAL A 106 LEU A 109 0
SHEET 2 A 5 GLY A 146 ILE A 150 1 O TYR A 147 N VAL A 107
SHEET 3 A 5 PHE A 188 GLY A 194 1 O VAL A 189 N LEU A 148
SHEET 4 A 5 VAL A 48 PHE A 51 1 N VAL A 48 O LEU A 190
SHEET 5 A 5 LEU A 211 GLU A 214 1 O LEU A 211 N LEU A 49
SHEET 1 B 2 ILE A 72 VAL A 75 0
SHEET 2 B 2 VAL A 99 LYS A 102 -1 N ILE A 100 O ALA A 74
SHEET 1 C 2 ALA A 121 LEU A 122 0
SHEET 2 C 2 PHE A 135 GLU A 136 -1 N GLU A 136 O ALA A 121
SHEET 1 D 2 GLU A 172 VAL A 175 0
SHEET 2 D 2 ILE A 182 PRO A 185 -1 O ILE A 182 N VAL A 175
CRYST1 90.259 90.259 83.716 90.00 90.00 120.00 P 65 6
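As a small illustration of how the HELIX records above can be read, here is a Python sketch that splits a record on whitespace. This is an assumption that works for the lines as printed here; real PDB files use fixed column positions, so a robust parser would slice by column instead. The function name `parse_helix` is hypothetical. The number after the end residue is the helix class (1 is a right-handed alpha helix and 5 a right-handed 3-10 helix in the PDB format), and the last number is the helix length.

```python
# Three HELIX records copied from the listing above.
HELIX_LINES = [
    "HELIX 1 1 PRO A 22 ILE A 26 5 5",
    "HELIX 2 2 GLN A 29 ASP A 42 1 14",
    "HELIX 14 14 ASP A 237 LEU A 263 1 27",
]

# Only the two classes that occur in this file are named here.
HELIX_CLASSES = {1: "right-handed alpha", 5: "right-handed 3-10"}


def parse_helix(line):
    """Parse one whitespace-separated HELIX record into a dictionary."""
    tokens = line.split()
    if tokens[0] != "HELIX":
        raise ValueError("not a HELIX record: " + line)
    return {
        "serial": int(tokens[1]),
        "start": (tokens[3], tokens[4], int(tokens[5])),  # residue, chain, number
        "end": (tokens[6], tokens[7], int(tokens[8])),
        "helix_class": int(tokens[9]),
        "length": int(tokens[10]),
    }


helices = [parse_helix(line) for line in HELIX_LINES]
for helix in helices:
    kind = HELIX_CLASSES.get(helix["helix_class"], "other")
    print(helix["start"], helix["end"], kind, helix["length"])
```

For each of these records the stated length equals the last residue number minus the first, plus one.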

After the general informational part, the x,y,z coordinates of the atoms are listed:

ATOM 1 N ARG A 18 14.699 61.369 62.050 1.00 39.19 N
ATOM 2 CA ARG A 18 14.500 62.241 60.856 1.00 38.35 C
ATOM 3 C ARG A 18 13.762 61.516 59.729 1.00 36.05 C
ATOM 4 O ARG A 18 14.354 60.740 58.982 1.00 34.91 O
ATOM 5 CB ARG A 18 15.850 62.753 60.334 1.00 42.36 C
ATOM 6 CG ARG A 18 16.537 63.770 61.247 1.00 46.92 C
ATOM 7 CD ARG A 18 17.825 64.314 60.629 1.00 51.24 C
ATOM 8 NE ARG A 18 18.442 65.347 61.462 1.00 54.15 N

When looking at the coordinates, notice that in this case the structure starts at amino acid Arg 18! Amino acids 1 to 17 are missing. The reason, as mentioned above, is that the electron density for these residues is too poor to build them into (see, for example, the discussion on structure quality in the homology modeling chapter later on). It is essentially impossible to find the correct positions for amino acids without the guiding electron density. We therefore need to be aware that many structures in the PDB have missing parts: sometimes loop regions, very often one or more side chains on the surface of the molecule, and in the worst cases a whole domain.

Each coordinate line starts with the record name, ATOM. The number that follows is simply the sequential number of the atom in the list. Next comes the atom name: N is the main chain nitrogen, and CA means C-α, the carbon atom to which the side chain of the amino acid is attached. Then come the main chain carbonyl carbon C and the carbonyl oxygen O. Side chain atoms CB, CG, CD, etc. (C-β, C-γ, C-δ) are named in the order of the Greek alphabet, moving away from C-α. After the atom name we find the name of the amino acid (ARG in this case), followed by a letter, in this file A. This letter is the so-called chain identifier. When the structure consists of several polypeptide chains (e.g. the tetramers of hemoglobin and pyruvate kinase discussed earlier), each chain gets its own identifier: A, B, C, etc. The number after the chain identifier (18 here) is the residue number, the position of the amino acid in the sequence. The three numbers that follow (14.699, 61.369, 62.050 for the very first atom) are the x, y, z coordinates of the atom. As mentioned above, they describe the position of each atom in an orthogonal coordinate system. If we can describe the positions of all atoms in the protein, we can draw the whole molecular structure.
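The fields described above can be extracted with a few lines of Python. In an actual PDB file every field of an ATOM record sits at fixed column positions, whereas the excerpt in this text shows the fields separated by single spaces. The sketch below therefore rebuilds the first ATOM line piece by piece with its column padding and then slices it by column. The function name `parse_atom_line` and the dictionary keys are chosen for illustration only.

```python
# First ATOM record of the excerpt, rebuilt with fixed-column padding.
SAMPLE = (
    "ATOM  "        # columns 1-6: record name
    + "    1"       # columns 7-11: atom serial number
    + " "
    + " N  "        # columns 13-16: atom name
    + " "           # column 17: alternate location indicator (blank here)
    + "ARG"         # columns 18-20: residue name
    + " "
    + "A"           # column 22: chain identifier
    + "  18"        # columns 23-26: residue number
    + "    "        # columns 27-30: insertion code and padding
    + "  14.699"    # columns 31-38: x
    + "  61.369"    # columns 39-46: y
    + "  62.050"    # columns 47-54: z
    + "  1.00"      # columns 55-60: occupancy
    + " 39.19"      # columns 61-66: B-factor
    + " " * 10      # columns 67-76: unused here
    + " N"          # columns 77-78: element symbol
)


def parse_atom_line(line):
    """Slice one fixed-column ATOM record into its fields."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "alt_loc": line[16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE)
print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters in practice, because adjacent fields in real files can run together without a space between them (for example large coordinates or four-character atom names).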

The x, y, z coordinates are followed by a number, which is 1.00 in most cases. This is called the occupancy. Sometimes the side chain of an amino acid has two or more different conformations due to local flexibility, and these conformations can be distinguished in the electron density map. In this case the crystallographer builds all conformations and refines an occupancy for each atom (1 for full occupancy, less than 1 for partial occupancy, with the occupancies of the alternate conformations normally summing to 1). In PDB files alternate conformations are marked with an alternate location indicator, a letter such as A or B placed directly in front of the residue name.
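The occupancy bookkeeping can be checked with a short Python sketch. The atom list below is invented purely for illustration (it is not taken from the file discussed here): it describes a serine side chain whose OG atom was built in two conformations, labeled A and B. The function names are hypothetical as well.

```python
from collections import defaultdict
import math

# Invented example data: (chain, residue number, atom name, alt. location, occupancy)
ATOMS = [
    ("A", 59, "CB", "", 1.00),
    ("A", 59, "OG", "A", 0.60),
    ("A", 59, "OG", "B", 0.40),
]


def occupancy_totals(atoms):
    """Sum the occupancies of all conformations of each atom site."""
    totals = defaultdict(float)
    for chain, res_seq, name, _alt_loc, occupancy in atoms:
        totals[(chain, res_seq, name)] += occupancy
    return dict(totals)


def incomplete_sites(atoms, tol=0.01):
    """Return the atom sites whose occupancies do not add up to 1."""
    return [
        site
        for site, total in occupancy_totals(atoms).items()
        if not math.isclose(total, 1.0, abs_tol=tol)
    ]


totals = occupancy_totals(ATOMS)
print(totals)
print(incomplete_sites(ATOMS))
```

A site reported by `incomplete_sites` is not necessarily an error; it is simply a place worth inspecting in the electron density.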

The number after the occupancy is the temperature factor, or B-factor, of the atom; the final column simply gives the chemical element. The B-factor describes the displacement of the atom from its average (mean) position (the mean-square displacement). Higher flexibility results in larger displacements, and ultimately in lower electron density. This is simply because the atoms of a flexible side chain (or other part of the structure) are distributed over a larger volume, leading to lower density per unit of volume. In many graphics programs we can color a protein chain according to B-factor values. Areas with high B-factors are usually colored red (hot), while areas with low B-factors are colored blue (cold). Inspecting a PDB structure with such a coloring scheme immediately reveals regions with high flexibility. The core of the molecule usually has low B-factors, due to tight packing of the side chains (enzyme active sites are usually located there). B-factor values are normally between 15 and 30 Å², but are often much higher than 30 Å² for flexible regions.
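To make this concrete, the Python sketch below takes the B-factors of the eight ARG 18 atoms from the coordinate excerpt and compares the main chain with the side chain. It also converts a B-factor into a root-mean-square displacement using the standard relation B = 8π²⟨u²⟩, which is not stated in this text and is brought in here as an assumption. The function names are made up for illustration.

```python
import math
from statistics import mean

# B-factors (in square Angstroms) of the ARG 18 atoms listed in the excerpt.
ARG18_B = {
    "N": 39.19, "CA": 38.35, "C": 36.05, "O": 34.91,
    "CB": 42.36, "CG": 46.92, "CD": 51.24, "NE": 54.15,
}

MAIN_CHAIN = {"N", "CA", "C", "O"}


def mean_b(b_factors, main_chain=True):
    """Mean B-factor of the main chain atoms or of the side chain atoms."""
    values = [
        b for name, b in b_factors.items()
        if (name in MAIN_CHAIN) == main_chain
    ]
    return mean(values)


def rms_displacement(b_factor):
    """Root-mean-square displacement in Angstroms, from B = 8 * pi^2 * <u^2>."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))


print(round(mean_b(ARG18_B, main_chain=True), 2))
print(round(mean_b(ARG18_B, main_chain=False), 2))
print(round(rms_displacement(ARG18_B["N"]), 2))
```

The side chain atoms listed have clearly higher B-factors than the main chain atoms, which is what we would expect for a long, flexible arginine side chain at the start of the ordered part of the chain.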

This concludes the structural part. In the next chapter we will dive into the world of protein sequences to establish a connection between sequence and structure. In the homology modeling chapter, we will make practical use of all we have learned about sequence and structure.