The Protein Data Bank (PDB): File Format and Content


Here we continue the discussion of protein databases and focus on the PDB. We start by retrieving a coordinate file, opening it in a text editor and spending some time studying its content. PDB files are plain text files and can be opened in any text editor (in contrast to, for example, an MS Word file, which a plain text editor cannot display properly). The files are called "coordinate files" simply because they contain a list of all the atoms of the protein structure (at least the ones visible in the electron density map calculated on the basis of the X-ray data). Each atom position is defined by its x,y,z coordinates in a conventional orthogonal coordinate system. To retrieve a file, we first search by typing the name of the protein. When we find the entry we are interested in, like the protein we discussed briefly on the previous page, we can open the drop-down menu in the right corner, as shown in the image below, choose a format and save the file to the hard drive (I usually choose PDB file (Text)):

BchI-PDB
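The download can also be scripted instead of using the drop-down menu. The following is only a minimal sketch: it assumes that the RCSB file server offers the text-format file at https://files.rcsb.org/download/<ID>.pdb and that you already know the four-character ID of the entry from your search. The function names and the ID "1abc" in the usage note are placeholders, not taken from this page.

```python
import urllib.request

# Assumption: the RCSB file server serves legacy text-format files at this address.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id: str) -> str:
    """Build the download URL for a four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def fetch_pdb(pdb_id: str, dest: str) -> str:
    """Save the coordinate file to `dest` and return that path (needs network access)."""
    urllib.request.urlretrieve(pdb_download_url(pdb_id), dest)
    return dest
```

Calling fetch_pdb("1abc", "1abc.pdb") would then save the file next to the script, where it can be opened in a text editor exactly like a manually downloaded file.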

If we need to look at the structure, we can use graphics software. There are many programs for visualization of protein structures, which can be freely downloaded from the Internet. Examples include SwissPDB viewer, Chimera, Yasara, Maestro, etc. We may also open the file in a text editor, as I mentioned above, and examine the data inside the file. There is plenty of important information on the structure, such as the method used to solve it and various parameters related to the quality of the data and of the protein structure (resolution (described here), geometry, the R-factor, etc.). The R-factor is an essential parameter for assessing the quality of an X-ray structure. It tells us how well the structure fits the experimental data: the lower the value of the R-factor, the better the fit. Well-refined protein structures typically have R-factor values of around 20% or below (the file lists the R value as a fraction, so 0.214 corresponds to 21.4%):

REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : CNS 1.0
REMARK 3 AUTHORS : BRUNGER,ADAMS,CLORE,DELANO,GROS,GROSSE-
REMARK 3 : KUNSTLEVE,JIANG,KUSZEWSKI,NILGES, PANNU,
REMARK 3 : READ,RICE,SIMONSON,WARREN
REMARK 3
REMARK 3 REFINEMENT TARGET : ENGH & HUBER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 2.10
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 29.55
REMARK 3 DATA CUTOFF (SIGMA(F)) : 0.000
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 312841.620
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 0.0000
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 97.9
REMARK 3 NUMBER OF REFLECTIONS : 22179
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
REMARK 3 FREE R VALUE TEST SET COUNT : 2207
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.005
REMARK 3
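These quality indicators can also be pulled out of the header programmatically. The sketch below is one possible way to do it with regular expressions; the patterns are written against the wording of the REMARK lines shown above and are an assumption, since files refined with other programs may phrase these lines differently. The names PATTERNS and quality_indicators are illustrative.

```python
import re

# A few lines copied from the header shown above.
header = """\
REMARK 2 RESOLUTION. 2.10 ANGSTROMS.
REMARK 3 R VALUE (WORKING SET) : 0.214
REMARK 3 FREE R VALUE : 0.247
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.000
"""

# \s+ tolerates both single spaces and the wider spacing of real PDB files.
PATTERNS = {
    "resolution": re.compile(r"^REMARK\s+2\s+RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS"),
    "r_work": re.compile(r"^REMARK\s+3\s+R VALUE\s+\(WORKING SET\)\s*:\s*([\d.]+)"),
    "r_free": re.compile(r"^REMARK\s+3\s+FREE R VALUE\s*:\s*([\d.]+)"),
}


def quality_indicators(text):
    """Return the first resolution, working R and free R values found in the header."""
    found = {}
    for line in text.splitlines():
        for key, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match and key not in found:
                found[key] = float(match.group(1))
    return found


indicators = quality_indicators(header)
print(indicators)  # {'resolution': 2.1, 'r_work': 0.214, 'r_free': 0.247}
```

The free R value is calculated from the test set of reflections that was kept out of refinement, which is why it is somewhat higher than the working R value here.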


Further down there is a list of the secondary structure elements within the structure, also showing the first and last residue of each element:

HELIX 1 1 PRO A 22 ILE A 26 5 5
HELIX 2 2 GLN A 29 ASP A 42 1 14
HELIX 3 3 PRO A 43 GLY A 46 5 4
HELIX 4 4 ASP A 53 GLY A 57 5 5
HELIX 5 5 SER A 59 LEU A 69 1 11
HELIX 6 6 ASN A 84 ILE A 88 5 5
HELIX 7 7 SER A 114 GLY A 120 1 7
HELIX 8 8 ASP A 123 GLY A 131 1 9
HELIX 9 9 GLY A 138 ASN A 144 1 7
HELIX 10 10 GLU A 152 LEU A 156 5 5
HELIX 11 11 GLU A 157 GLY A 171 1 15
HELIX 12 12 ARG A 202 ASP A 207 1 6
HELIX 13 13 ASP A 220 ASP A 237 1 18
HELIX 14 14 ASP A 237 LEU A 263 1 27
HELIX 15 15 PRO A 264 VAL A 266 5 3
HELIX 16 16 PRO A 269 LEU A 283 1 15
HELIX 17 17 GLY A 287 GLU A 305 1 19
HELIX 18 18 GLY A 311 SER A 324 1 14
HELIX 19 19 HIS A 325 LEU A 327 5 3
HELIX 20 20 VAL A 341 LEU A 349 1 9
SHEET 1 A 5 VAL A 106 LEU A 109 0
SHEET 2 A 5 GLY A 146 ILE A 150 1 O TYR A 147 N VAL A 107
SHEET 3 A 5 PHE A 188 GLY A 194 1 O VAL A 189 N LEU A 148
SHEET 4 A 5 VAL A 48 PHE A 51 1 N VAL A 48 O LEU A 190
SHEET 5 A 5 LEU A 211 GLU A 214 1 O LEU A 211 N LEU A 49
SHEET 1 B 2 ILE A 72 VAL A 75 0
SHEET 2 B 2 VAL A 99 LYS A 102 -1 N ILE A 100 O ALA A 74
SHEET 1 C 2 ALA A 121 LEU A 122 0
SHEET 2 C 2 PHE A 135 GLU A 136 -1 N GLU A 136 O ALA A 121
SHEET 1 D 2 GLU A 172 VAL A 175 0
SHEET 2 D 2 ILE A 182 PRO A 185 -1 O ILE A 182 N VAL A 175
CRYST1 90.259 90.259 83.716 90.00 90.00 120.00 P 65 6
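To make the layout of the HELIX lines explicit, here is a small sketch that reads one of them. It splits on whitespace, which works for the lines as printed above; this is a simplifying assumption, because real PDB files use fixed column positions and a robust reader should slice by column instead. In the PDB format the number after the last residue is the helix class (1 for a right-handed alpha helix, 5 for a right-handed 3-10 helix) and the final number is the helix length. The names Helix and parse_helix are illustrative.

```python
from typing import NamedTuple


class Helix(NamedTuple):
    serial: int
    start: str        # first residue, e.g. "GLN A 29"
    end: str          # last residue, e.g. "ASP A 42"
    helix_class: int  # 1 = right-handed alpha, 5 = right-handed 3-10
    length: int       # number of residues in the helix


def parse_helix(line: str) -> Helix:
    """Read a whitespace-separated HELIX line as printed in the listing above."""
    f = line.split()
    # Fields: HELIX, serial, helix ID, then residue name / chain / number
    # for the first and the last residue, then class and length.
    if len(f) < 11 or f[0] != "HELIX":
        raise ValueError(f"not a complete HELIX line: {line!r}")
    return Helix(int(f[1]), " ".join(f[3:6]), " ".join(f[6:9]), int(f[9]), int(f[10]))


helix = parse_helix("HELIX 2 2 GLN A 29 ASP A 42 1 14")
print(helix.start, "to", helix.end, "length", helix.length)
```

For this helix the length agrees with the residue numbers: 42 - 29 + 1 = 14.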

After the general informational part, the x,y,z coordinates of the atoms are listed:

ATOM 1 N ARG A 18 14.699 61.369 62.050 1.00 39.19 N
ATOM 2 CA ARG A 18 14.500 62.241 60.856 1.00 38.35 C
ATOM 3 C ARG A 18 13.762 61.516 59.729 1.00 36.05 C
ATOM 4 O ARG A 18 14.354 60.740 58.982 1.00 34.91 O
ATOM 5 CB ARG A 18 15.850 62.753 60.334 1.00 42.36 C
ATOM 6 CG ARG A 18 16.537 63.770 61.247 1.00 46.92 C
ATOM 7 CD ARG A 18 17.825 64.314 60.629 1.00 51.24 C
ATOM 8 NE ARG A 18 18.442 65.347 61.462 1.00 54.15 N

First of all, notice that the structure starts from amino acid Arg 18! There are no amino acids from 1 to 17. What happened? They are absent because there was no electron density for these residues (see for example the discussion on structure quality in homology modeling). This is normally a result of high flexibility in that region of the structure. It is essentially impossible to find the correct positions for these amino acids without the guiding electron density. You need to be aware that many structures in the PDB have missing parts: sometimes loop regions, sometimes just side chains, and in the worst cases a whole domain may be missing.
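A quick way to spot such missing stretches is to look for jumps in the residue numbering of a chain. The sketch below does only that; the function name and the example numbers are made up for illustration. A jump in numbering is a hint rather than proof, since some entries use non-consecutive numbering on purpose, and deposited files normally list unmodelled residues in REMARK 465.

```python
def numbering_gaps(residue_numbers, first_expected=1):
    """Return (start, end) ranges of residue numbers that do not occur in a chain."""
    gaps = []
    previous = first_expected - 1
    for number in sorted(set(residue_numbers)):
        if number > previous + 1:
            gaps.append((previous + 1, number - 1))
        previous = max(previous, number)
    return gaps


# Hypothetical chain: modelled from residue 18, with a loop missing at 90-94.
observed = list(range(18, 90)) + list(range(95, 120))
print(numbering_gaps(observed))  # [(1, 17), (90, 94)]
```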

The number after the record name, ATOM, at the start of each line is simply the sequential number of the atom in the structure. This is followed by the atom name. For example, CA means C-alpha, the carbon atom to which the side chain of the amino acid is attached. The next carbon atom is C-beta (CB), and the following side-chain atoms are named after the letters of the Greek alphabet: gamma, delta, etc. Except for C-alpha, main chain atoms do not have any Greek letters attached to them. They are just C, O and N. After the atom name, you will see the name of the amino acid, followed in this file by the letter A. This is the so-called chain identifier. When the structure consists of several polypeptide chains (a multi-subunit protein), each chain gets its own identifier, like A, B, C, etc. (as in the case of pyruvate kinase discussed earlier). Without chain identifiers, graphics programs may get confused by identical amino acid names and numbers in different chains (in homo-multimeric proteins). The chain identifier is followed by the residue number (18 in the lines above).

The three numbers which follow (e.g., 14.699, 61.369, 62.050 for the very first atom) are the x,y,z coordinates of the atom, given in Ångströms. They describe the position of each atom in an orthogonal coordinate system. If we can describe the position of each atom in the protein, we will obviously be able to draw the whole tertiary structure. Graphics programs, when they read the coordinates from the PDB file, simply connect atoms to each other according to distance cut-offs, which are often among the default parameters of the program, thus creating the graphics view we are accustomed to. For example, we know that a C-C single bond is about 1.54 Å long, so two carbon atoms found at roughly this distance from each other can be connected.
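The two ideas in this part, reading the fields of an ATOM line and connecting atoms by distance, can be sketched in a few lines. The example below rebuilds the first four atoms listed above as fixed-width lines, slices them by the column positions of the PDB format, and then pairs up atoms that are closer than a cut-off. The names TEMPLATE, parse_atom and bonded_pairs are illustrative, and the single 1.9 Å cut-off is an assumed simplification: real graphics programs usually use element-specific distances or residue templates.

```python
import math
from itertools import combinations

# Fixed-width layout of an ATOM record: serial in columns 7-11, atom name 13-16,
# residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54,
# occupancy 55-60, B-factor 61-66, element 77-78.
TEMPLATE = ("ATOM  {:>5} {:<4} {:>3} {:1}{:>4}    "
            "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}")

# The first four atoms of the listing above.
lines = [
    TEMPLATE.format(1, " N", "ARG", "A", 18, 14.699, 61.369, 62.050, 1.00, 39.19, "N"),
    TEMPLATE.format(2, " CA", "ARG", "A", 18, 14.500, 62.241, 60.856, 1.00, 38.35, "C"),
    TEMPLATE.format(3, " C", "ARG", "A", 18, 13.762, 61.516, 59.729, 1.00, 36.05, "C"),
    TEMPLATE.format(4, " O", "ARG", "A", 18, 14.354, 60.740, 58.982, 1.00, 34.91, "O"),
]


def parse_atom(line):
    """Slice one fixed-width ATOM line into its fields (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


def bonded_pairs(atoms, cutoff=1.9):
    """Name pairs of atoms lying within `cutoff` Å of each other (assumed cut-off)."""
    return [(a["name"], b["name"])
            for a, b in combinations(atoms, 2)
            if math.dist(a["xyz"], b["xyz"]) <= cutoff]


atoms = [parse_atom(line) for line in lines]
print(bonded_pairs(atoms))  # [('N', 'CA'), ('CA', 'C'), ('C', 'O')]
```

With these coordinates the N-CA, CA-C and C-O distances come out at roughly 1.49, 1.53 and 1.23 Å, while all other pairs are more than 2.4 Å apart, so the cut-off recovers exactly the main chain bonds.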

The x,y,z coordinates are followed by a number, which is 1.00 in most cases. This is called the atom occupancy. Sometimes the side chain of a particular amino acid, or even main chain atoms, may adopt two or more different conformations due to local flexibility. These conformations can be distinguished in the electron density map of the structure. In this case the crystallographer will build both conformations into the electron density and refine a parameter called occupancy for each conformation. In PDB files these are called "alternative conformations" and are marked with an alternate location indicator, usually a letter such as A or B placed directly before the residue name. The occupancy of each alternative conformation will be less than 1 (1 corresponds to 100% occupancy); for example, it may be 0.5/0.5 (50/50) when both conformations are equally occupied, or 0.4/0.6, or some other pair of values. Ligands and metal atoms bound to proteins may also have partial occupancy, for example if the concentration of the ligand or metal that was co-crystallized with the protein or soaked into the protein crystal was too low.
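As a small illustration of how occupancies of alternative conformations add up, the sketch below groups atom records by position in the chain and sums their occupancies. The records are hypothetical and not taken from the file discussed here; each carries a one-letter label (A, B, ...) for its conformation, or an empty label when there is only one. For alternative conformations of the same atom the sum is normally 1, while a sum below 1 would point to partial occupancy, as for a weakly bound ligand.

```python
from collections import defaultdict

# Hypothetical records: (chain, residue number, atom name, conformation label, occupancy).
records = [
    ("A", 45, "CB", "", 1.00),
    ("A", 45, "OG", "A", 0.60),
    ("A", 45, "OG", "B", 0.40),
]


def occupancy_by_atom(records):
    """Sum occupancies per atom and collect the conformation labels seen for it."""
    totals = defaultdict(float)
    labels = defaultdict(list)
    for chain, res_seq, name, label, occupancy in records:
        key = (chain, res_seq, name)
        totals[key] += occupancy
        if label:
            labels[key].append(label)
    return dict(totals), dict(labels)


totals, labels = occupancy_by_atom(records)
for key, conformations in labels.items():
    print(key, conformations, round(totals[key], 2))
```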

The numbers that follow the occupancy are the so-called temperature factors, or B-factors. The B-factor describes the displacement of an atom from its average (mean) position. The more flexible an atom is, the larger its displacement from the mean position (mean-square displacement) and the higher its B-factor. In graphics programs we can often color a protein according to B-factor values. Usually areas with high B-factors are colored red (hot), while areas with low B-factors are colored blue (cold). Inspecting a PDB structure with such a coloring scheme immediately reveals regions of high flexibility in the tertiary structure of the protein. B-factor values are normally between 15 and 30 Å², but are often higher than 30 Å² in more flexible regions. The last column of each line gives the chemical element of the atom.
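To get a feeling for what these numbers mean, the sketch below converts a B-factor into a root-mean-square displacement using the standard isotropic relation B = 8π²⟨u²⟩, and averages the B-factors of the eight Arg 18 atoms listed earlier. The function names are illustrative, and the threshold of 30 Å² is simply the upper end of the "normal" range quoted above, not a fixed rule.

```python
import math
from statistics import mean

# B-factors (in Å²) of the eight Arg 18 atoms from the ATOM listing.
b_factors = [39.19, 38.35, 36.05, 34.91, 42.36, 46.92, 51.24, 54.15]


def rms_displacement(b_factor):
    """Isotropic relation B = 8 * pi^2 * <u^2>, solved for sqrt(<u^2>) in Å."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))


def is_flexible(values, threshold=30.0):
    """Flag a group of atoms whose mean B-factor exceeds the assumed threshold."""
    return mean(values) > threshold


print(f"mean B-factor: {mean(b_factors):.2f} Å²")
print(f"rms displacement of the first atom: {rms_displacement(b_factors[0]):.2f} Å")
print("flexible:", is_flexible(b_factors))
```

The mean of about 42.9 Å² is above the usual range, which fits the fact that Arg 18 is the first ordered residue next to the missing N-terminal stretch.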
