The Protein Data Bank (PDB): File Format and Content

Here we focus on the PDB. A good way to start is to download a coordinate file and open it in a text editor to look at its content. PDB files are plain text files and can be opened in any text editor (in contrast to, for example, MS Word files, which most plain text editors cannot display). The file is called a "coordinate file" simply because it contains a list of the coordinates of all the atoms of the protein structure in a conventional orthogonal coordinate system (or at least of the atoms visible in the electron density map calculated from the X-ray data). Each atom position is defined by its x, y, z coordinates. To download the file, we first search by typing the name of the protein. When we find the entry we are interested in, such as the magnesium chelatase protein we discussed briefly earlier, we open the drop-down menu in the right corner, as shown in the image below, choose a format and save the file to the hard drive (the easiest choice is PDB file (Text)):

[Image: the download drop-down menu on the PDB entry page]
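
The same download can also be scripted. The sketch below assumes the RCSB file server URL pattern `https://files.rcsb.org/download/<ID>.pdb` and a classic four-character entry ID. The function names `pdb_download_url` and `fetch_pdb_text` are illustrative, and `1ABC` is only a placeholder, not the ID of the entry discussed here.

```python
from urllib.request import urlopen


def pdb_download_url(entry_id):
    """Build the download URL for a PDB-format file (assumed RCSB URL pattern)."""
    entry_id = entry_id.strip().upper()
    # Assumption: classic entry IDs are four alphanumeric characters.
    if len(entry_id) != 4 or not entry_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {entry_id!r}")
    return f"https://files.rcsb.org/download/{entry_id}.pdb"


def fetch_pdb_text(entry_id, timeout=30):
    """Download the coordinate file and return it as text (needs network access)."""
    with urlopen(pdb_download_url(entry_id), timeout=timeout) as response:
        return response.read().decode("ascii", errors="replace")


# Placeholder ID; replace it with the ID of the entry you found in the search.
print(pdb_download_url("1abc"))
```

Calling `fetch_pdb_text` with a real entry ID should return the same text you would see after saving the file from the drop-down menu.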

The file contains plenty of important information about the structure, such as the method used to solve it, various parameters related to the quality of the X-ray data (resolution, R-factor, etc.) and properties of the structure itself (geometry, secondary structure content, regions missing in the structure, etc.). The R-factor is an essential parameter for assessing how well the structure fits the X-ray data: the lower the R-factor, the better the fit. Well-refined protein structures typically have R-factor values of around 20% or lower. In the excerpt below, the R value for the working set is 0.214 (21.4%) and the free R value is 0.247:

REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : CNS 1.0
REMARK 3 AUTHORS : BRUNGER,ADAMS,CLORE,DELANO,GROS,GROSSE-
REMARK 3 : KUNSTLEVE,JIANG,KUSZEWSKI,NILGES, PANNU,
REMARK 3 : READ,RICE,SIMONSON,WARREN
REMARK 3
REMARK 3 REFINEMENT TARGET : ENGH & HUBER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 2.10
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 29.55
REMARK 3 DATA CUTOFF (SIGMA(F)) : 0.000
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 312841.620
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 0.0000
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 97.9
REMARK 3 NUMBER OF REFLECTIONS : 22179
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
REMARK 3 FREE R VALUE TEST SET COUNT : 2207
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.005
REMARK 3
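
The quality indicators in the REMARK records above can be pulled out with a few regular expressions. This is a minimal sketch: the names `read_quality` and `REMARK_TEXT` are illustrative, and the patterns are written to tolerate both the collapsed spacing shown here and the wider spacing of a real file, but they have only been checked against the lines quoted in this section.

```python
import re

# Four lines copied from the REMARK excerpt above.
REMARK_TEXT = """\
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
"""

PATTERNS = {
    "resolution": r"^REMARK\s+2\s+RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS",
    "r_work": r"^REMARK\s+3\s+R VALUE\s+\(WORKING SET\)\s*:\s*([\d.]+)",
    "r_free": r"^REMARK\s+3\s+FREE R VALUE\s*:\s*([\d.]+)",
}


def read_quality(text):
    """Return resolution, working R and free R as floats (None if not found)."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        result[key] = float(match.group(1)) if match else None
    return result


quality = read_quality(REMARK_TEXT)
print(quality)
```

The free R pattern requires a colon directly after "FREE R VALUE", so the "TEST SET SIZE" line is not picked up by mistake.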
Further down, the file lists the secondary structure elements, giving the first and last residue of each element:

HELIX 1 1 PRO A 22 ILE A 26 5 5
HELIX 2 2 GLN A 29 ASP A 42 1 14
HELIX 3 3 PRO A 43 GLY A 46 5 4
HELIX 4 4 ASP A 53 GLY A 57 5 5
HELIX 5 5 SER A 59 LEU A 69 1 11
HELIX 6 6 ASN A 84 ILE A 88 5 5
HELIX 7 7 SER A 114 GLY A 120 1 7
HELIX 8 8 ASP A 123 GLY A 131 1 9
HELIX 9 9 GLY A 138 ASN A 144 1 7
HELIX 10 10 GLU A 152 LEU A 156 5 5
HELIX 11 11 GLU A 157 GLY A 171 1 15
HELIX 12 12 ARG A 202 ASP A 207 1 6
HELIX 13 13 ASP A 220 ASP A 237 1 18
HELIX 14 14 ASP A 237 LEU A 263 1 27
HELIX 15 15 PRO A 264 VAL A 266 5 3
HELIX 16 16 PRO A 269 LEU A 283 1 15
HELIX 17 17 GLY A 287 GLU A 305 1 19
HELIX 18 18 GLY A 311 SER A 324 1 14
HELIX 19 19 HIS A 325 LEU A 327 5 3
HELIX 20 20 VAL A 341 LEU A 349 1 9
SHEET 1 A 5 VAL A 106 LEU A 109 0
SHEET 2 A 5 GLY A 146 ILE A 150 1 O TYR A 147 N VAL A 107
SHEET 3 A 5 PHE A 188 GLY A 194 1 O VAL A 189 N LEU A 148
SHEET 4 A 5 VAL A 48 PHE A 51 1 N VAL A 48 O LEU A 190
SHEET 5 A 5 LEU A 211 GLU A 214 1 O LEU A 211 N LEU A 49
SHEET 1 B 2 ILE A 72 VAL A 75 0
SHEET 2 B 2 VAL A 99 LYS A 102 -1 N ILE A 100 O ALA A 74
SHEET 1 C 2 ALA A 121 LEU A 122 0
SHEET 2 C 2 PHE A 135 GLU A 136 -1 N GLU A 136 O ALA A 121
SHEET 1 D 2 GLU A 172 VAL A 175 0
SHEET 2 D 2 ILE A 182 PRO A 185 -1 O ILE A 182 N VAL A 175
CRYST1 90.259 90.259 83.716 90.00 90.00 120.00 P 65 6
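
As an illustration, the HELIX lines above can be read into small records. The sketch splits each line on whitespace, which works for the lines as printed here. Real PDB files are fixed-column, so a robust reader would slice by column position instead. The name `parse_helix` is illustrative, and the two class labels follow the PDB format description (class 1 is a right-handed alpha helix, class 5 a right-handed 3-10 helix).

```python
from typing import NamedTuple

# Only the two helix classes that occur in the listing above.
HELIX_CLASSES = {1: "right-handed alpha", 5: "right-handed 3-10"}


class Helix(NamedTuple):
    serial: int
    start_res: str
    chain: str
    start: int
    end_res: str
    end: int
    helix_class: int
    length: int


def parse_helix(line):
    """Parse one whitespace-separated HELIX line as printed in this section."""
    tokens = line.split()
    # Expected tokens: HELIX, serial, helix ID, first residue name, chain,
    # first residue number, last residue name, chain, last residue number,
    # helix class, length.
    if len(tokens) != 11 or tokens[0] != "HELIX":
        raise ValueError(f"unexpected HELIX line: {line!r}")
    return Helix(
        serial=int(tokens[1]),
        start_res=tokens[3],
        chain=tokens[4],
        start=int(tokens[5]),
        end_res=tokens[6],
        end=int(tokens[8]),
        helix_class=int(tokens[9]),
        length=int(tokens[10]),
    )


for text in ("HELIX 1 1 PRO A 22 ILE A 26 5 5", "HELIX 2 2 GLN A 29 ASP A 42 1 14"):
    helix = parse_helix(text)
    print(helix.start, helix.end, HELIX_CLASSES[helix.helix_class], helix.length)
```

For both lines, the last field equals the number of residues from the first to the last residue of the helix.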

After the general informational part, the x,y,z coordinates of the atoms are listed:

ATOM 1 N ARG A 18 14.699 61.369 62.050 1.00 39.19 N
ATOM 2 CA ARG A 18 14.500 62.241 60.856 1.00 38.35 C
ATOM 3 C ARG A 18 13.762 61.516 59.729 1.00 36.05 C
ATOM 4 O ARG A 18 14.354 60.740 58.982 1.00 34.91 O
ATOM 5 CB ARG A 18 15.850 62.753 60.334 1.00 42.36 C
ATOM 6 CG ARG A 18 16.537 63.770 61.247 1.00 46.92 C
ATOM 7 CD ARG A 18 17.825 64.314 60.629 1.00 51.24 C
ATOM 8 NE ARG A 18 18.442 65.347 61.462 1.00 54.15 N

First of all, notice that this structure starts at amino acid Arg 18: residues 1 to 17 are absent. The reason is that there was no electron density for these residues (see, for example, the discussion on structure quality in homology modeling). This is normally the result of high flexibility in that region of the structure. Without the guiding electron density it is essentially impossible to find the correct positions of the amino acids. We need to be aware that many structures in the PDB have missing parts: sometimes a loop region, sometimes just a side chain, and in the worst cases a whole domain.
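
Missing stretches such as residues 1 to 17 can be spotted by looking for gaps in the residue numbers of the ATOM records. The sketch below assumes that numbering starts at 1 and runs consecutively along the sequence, which is not true for every entry. The function name `missing_ranges` and the example list of observed residues are hypothetical; only the starting residue 18 comes from the file discussed here. Real files usually also list unobserved residues explicitly in REMARK 465.

```python
def missing_ranges(observed, first=1, last=None):
    """Return (start, end) ranges of residue numbers absent from `observed`."""
    present = sorted(set(observed))
    if not present:
        return [(first, last)] if last is not None else []
    if last is None:
        last = present[-1]
    gaps = []
    expected = first
    for number in present:
        if number > expected:
            gaps.append((expected, number - 1))
        expected = max(expected, number + 1)
    if expected <= last:
        gaps.append((expected, last))
    return gaps


# Hypothetical chain that starts at residue 18 and lacks a short loop.
observed = list(range(18, 40)) + list(range(45, 60))
print(missing_ranges(observed))
```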
The number after the record name, ATOM, is simply the sequential number of the atom in the structure. It is followed by the atom name: for example, CA means C-α, the carbon atom to which the side chain of the amino acid is attached. The next carbon atom is C-β, and the following atoms are named after the letters of the Greek alphabet (gamma, delta, etc.). Apart from C-α, main chain atoms do not have Greek letters attached to them. They are just C, O and N.

After the atom name you will see the name of the amino acid, followed in this file by the letter A. This is the so-called chain identifier. When the structure consists of several polypeptide chains (a multi-subunit protein), each chain gets its own identifier: A, B, C, etc. (as in the case of pyruvate kinase discussed earlier). Without chain identifiers, graphics programs would be confused by identical amino acid names and numbers in different chains (as happens in homo-multimeric proteins). The chain identifier is followed by the residue number (18 for the first atoms listed).

The three numbers that follow (e.g., 14.699, 61.369, 62.050 for the very first atom) are the x, y, z coordinates of the atom. They describe the position of the atom in an orthogonal coordinate system. If we know the position of each atom in the protein, we can draw the whole tertiary structure. Graphics programs that read the coordinates from a Protein Data Bank file simply connect the atoms to each other according to distance cut-offs, thus creating the graphics view we are accustomed to. For example, a C-C single bond is about 1.54 Å long, so two carbon atoms found at roughly this distance from each other can be connected.
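
The distance-based bonding idea can be sketched with the first five atoms of Arg 18 listed above. The single cut-off of 1.9 Å is an assumption made for this illustration, and the function name `find_bonds` is hypothetical. Actual graphics programs typically use element-dependent cut-offs or residue templates.

```python
import math
from itertools import combinations

# Atom names and coordinates copied from the ATOM records shown above.
ATOMS = [
    ("N", (14.699, 61.369, 62.050)),
    ("CA", (14.500, 62.241, 60.856)),
    ("C", (13.762, 61.516, 59.729)),
    ("O", (14.354, 60.740, 58.982)),
    ("CB", (15.850, 62.753, 60.334)),
]


def find_bonds(atoms, cutoff=1.9):
    """Return (name_a, name_b, distance) for every atom pair within `cutoff` Å."""
    bonds = []
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(atoms, 2):
        distance = math.dist(xyz_a, xyz_b)
        if distance <= cutoff:
            bonds.append((name_a, name_b, distance))
    return bonds


for name_a, name_b, distance in find_bonds(ATOMS):
    print(f"{name_a}-{name_b}: {distance:.2f} Å")
```

With these coordinates the sketch finds the four expected bonds (N-CA, CA-C, CA-CB and C-O). The CA-CB distance comes out close to the 1.54 Å quoted above.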

The x, y, z coordinates are followed by a number, which is 1.00 in most cases. This is the atom occupancy. Sometimes the side chain of a particular amino acid, or even main chain atoms, may adopt two or more different conformations due to local flexibility. If these conformations can be distinguished in the electron density map, the crystallographer builds each of them into the density and refines a parameter called occupancy for each conformation. In Protein Data Bank files these are called "alternate conformations" and are marked with an alternate location indicator, a letter such as A or B placed directly before the residue name. The occupancy of each alternate conformation is less than 1 (1 corresponds to 100% occupancy): for example 0.5/0.5 when both conformations are equally occupied, or 0.4/0.6, or some other combination. Ligands and metal atoms bound to proteins also often have partial occupancy, for example when the concentration of the ligand or metal that was co-crystallized with the protein or soaked into the crystal was too low.
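
A simple consistency check follows from this description: the occupancies of the alternate conformations of one atom should add up to about 1, while a partially occupied ligand or metal adds up to less. The records below are hypothetical and are not taken from the file discussed here. The field layout and the names `occupancy_totals` and `partially_occupied` are likewise only illustrative.

```python
from collections import defaultdict

# Hypothetical records: (chain, residue number, atom name, alternate label, occupancy)
RECORDS = [
    ("A", 50, "CB", "", 1.00),
    ("A", 50, "OG", "A", 0.60),   # first conformation of a side-chain atom
    ("A", 50, "OG", "B", 0.40),   # second conformation of the same atom
    ("A", 401, "MG", "", 0.70),   # metal ion with partial occupancy
]


def occupancy_totals(records):
    """Sum the occupancies of all conformations of each atom."""
    totals = defaultdict(float)
    for chain, res_seq, atom_name, _alt_label, occupancy in records:
        totals[(chain, res_seq, atom_name)] += occupancy
    return dict(totals)


def partially_occupied(records, tolerance=0.01):
    """Return the atoms whose summed occupancy differs from 1."""
    return [
        key
        for key, total in occupancy_totals(records).items()
        if abs(total - 1.0) > tolerance
    ]


print(partially_occupied(RECORDS))
```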

The numbers that follow the occupancy are the temperature factors, or B-factors, of the atoms (the last column in the listing above is the element symbol). The B-factor describes the displacement of an atom from its mean position: the more flexible an atom is, the larger its mean-square displacement and the higher its B-factor. In graphics programs we can usually color a protein according to B-factor. Areas with high B-factors are often colored red (hot), while areas with low B-factors are colored blue (cold). Inspecting a Protein Data Bank structure with such a coloring scheme immediately reveals the regions of high flexibility in the tertiary structure of the protein. B-factors are normally between 15 and 30 Å², but are often much higher than 30 Å² in flexible regions.
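
To tie the fields together, here is a minimal sketch of reading one ATOM record by column position, as laid out in the PDB format description. The listing in this section has its spacing collapsed, so the sample line is rebuilt with a helper, `format_atom`, which exists only to create correctly spaced input. Both function names are illustrative. The last function converts an isotropic B-factor to a root-mean-square displacement using the standard relation B = 8π²⟨u²⟩.

```python
import math


def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, b, element, alt_loc=""):
    """Build a fixed-width ATOM line (helper used only to create sample input)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4}{alt_loc:1}{res_name:>3} {chain:1}"
        f"{res_seq:>4}{'':4}{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
        f"{'':10}{element:>2}"
    )


def parse_atom(line):
    """Slice one ATOM line by column (0-based slices of the 1-based PDB columns)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "alt_loc": line[16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


def rms_displacement(b_factor):
    """Root-mean-square displacement in Å for an isotropic B-factor in Å²."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))


# First atom of the listing above, rebuilt with the correct column spacing.
SAMPLE = format_atom(1, " N  ", "ARG", "A", 18, 14.699, 61.369, 62.050, 1.00, 39.19, "N")
atom = parse_atom(SAMPLE)
print(atom["name"], atom["res_name"], atom["res_seq"], atom["b_factor"])
print(f"{rms_displacement(atom['b_factor']):.2f} Å")
```

For the B-factor of 39.19 Å² in the first record, the relation gives a displacement of roughly 0.7 Å.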