Protein Biological Unit Database

    ProtBuD is a server for searching the contents of asymmetric and biological units as given by RCSB and PQS. ProtBuD uses SCOP and PSI-BLAST to provide this information for all entries with proteins in particular superfamilies or families. A user can search for a particular SCOP designation, a particular entry, or a chain in an entry, and obtain the asymmetric and biological unit contents of nearly all related proteins in the PDB (the exceptions are new entries that are not in SCOP or are too distantly related).

    The web site provides three types of searches.

    (1) A search by SCOP classification code will return a list of PDB entries in the input SCOP classification level, with their asymmetric units and biological units.

    (2) A search by PDB entry, with or without a chain ID or entity ID, will return a table of all SCOP domains for this PDB entry or for the specified chain or entity. Clicking the SCOP family, superfamily, or fold name returns a list of PDB entries in that SCOP designation and the PSI-BLAST hits homologous to the query entry (or to other sequences in the SCOP group) with E-value better than 0.1. The user can further explore entities and asymmetric chains for a PDB entry.

    If a PDB entry is not in SCOP, a list of PSI-BLAST hits and their SCOP family codes will be returned.

    (3) A search by keywords will return a list of PDB entries containing the input keyword(s). The resulting table also includes names, titles, and SCOP codes if an entry exists in SCOP.  (Standalone tool only.)


Processing of data files

    The data in ProtBuD come from four sources: protein structure files from the PDB in XML format (Berman, et al., 2000; Westbrook, et al., 2005), biological unit coordinate files from PQS (Henrick and Thornton, 1998) in the legacy PDB format, domain classification files from SCOP (Murzin, et al., 1995), and PSI-BLAST hit files from a non-redundant PDB database of our lab (Wang and Dunbrack, 2003; Wang and Dunbrack, 2005). We use the XML entity_id and asym_id identifiers for all molecules, while the other sources use the author chain IDs. The XML files provide a correspondence between these identifiers, although this occasionally presents some ambiguities that can be resolved as described below.

Parsing XML files.

    PDBML (Westbrook, et al., 2005) is part of the uniformity project (Bhat, et al., 2001) of the PDB. The PDB XML data files preserve the logical data model of the PDB Exchange Data Dictionary (Westbrook and Fitzgerald, 2003). Data can be retrieved quickly from XML files, and most software development environments provide libraries to read and write XML files. From the XML files, we retrieve the following data:

  • the entity_id and name for each type of molecule in the structure,
  • for each entity_id, the asymmetric unit contents in terms of asym_ids; there may be several asym_ids for a given entity_id,
  • the biological unit contents, consisting of symmetry operators applied to asym_ids,
  • for protein and nucleic acid polymers, the author chain IDs for each asym_id molecule, to provide links with other databases such as PQS and SCOP that use the author chain IDs; the XML files record that an author chain ID may be blank, but this information is only provided for polypeptide entities,
  • information on covalent attachments and modified residues, defined in terms of asym_ids, residue numbers, and atom names,
  • structural determination data such as experiment type, space group, transformation matrices for converting to unit cell coordinates, missing residues, resolution, and R-factors.

    We use the asym_ids in the XML files to link the required information for asymmetric and biological units. Since the biological units are defined in terms of the asym_ids and symmetry operators of the space group, the asym_ids are sufficient for defining both asymmetric and biological units in the XML files. However, ligands are not always assigned properly to specific biological units. Often when an asymmetric unit is broken up into more than one biological unit, all of the non-polymer ligands are assigned to the first unit. This is a limitation of the current state of the PDB and may be resolved in future releases of the PDB (J. Westbrook and H. Berman, personal communication). Covalent attachments and modified residues are identified uniquely in the XML files, and are connected to other data fields by asym_ids, residue numbers, and atom names. The categories used in our database are described in Table 1.
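    As a rough illustration of the first two items in the list above, the following Python sketch groups asym_ids by entity_id from a PDBML-style document. The embedded XML fragment is a hypothetical, heavily trimmed example; its namespace URI and datablock name are assumptions, and the function names are ours rather than part of ProtBuD (which is written in C#).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, trimmed PDBML-style fragment (namespace URI is an assumption).
SAMPLE = """<PDBx:datablock xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
                datablockName="XXXX-noatom">
  <PDBx:struct_asymCategory>
    <PDBx:struct_asym id="A"><PDBx:entity_id>1</PDBx:entity_id></PDBx:struct_asym>
    <PDBx:struct_asym id="B"><PDBx:entity_id>1</PDBx:entity_id></PDBx:struct_asym>
    <PDBx:struct_asym id="C"><PDBx:entity_id>2</PDBx:entity_id></PDBx:struct_asym>
  </PDBx:struct_asymCategory>
</PDBx:datablock>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def asym_ids_by_entity(xml_text):
    """Return {entity_id: [asym_id, ...]} in document order."""
    root = ET.fromstring(xml_text)
    result = defaultdict(list)
    for elem in root.iter():
        if local(elem.tag) != "struct_asym":
            continue
        # The entity_id is a child element; the asym_id is the 'id' attribute.
        entity = next((c.text for c in elem if local(c.tag) == "entity_id"), None)
        if entity is not None:
            result[entity.strip()].append(elem.get("id"))
    return dict(result)


print(asym_ids_by_entity(SAMPLE))
```

    Matching on the local tag name keeps the sketch independent of the exact schema version declared in a given file.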

Parsing PQS files.

    To compare the biological units provided by the PDB and PQS, we use the legacy-PDB format *.mmol files provided by PQS, parsing the “REMARK 300” fields to match PQS chains and PDB author-designated chains. Our goal is to use the Exchange Dictionary asym_ids and entity_ids to designate the contents of asymmetric and biological units in PQS. We therefore use the XML files to translate the author-designated chain IDs used by PQS into asym_ids and entity_ids. This assignment is not always unambiguous, however. About 1.5% of PQS entries have protein chains that cannot be matched at all to asym_ids. This usually occurs when the PDB has changed the author-approved file to the correct asymmetric unit and added or subtracted chains from the original files. Ligands are also not handled consistently by PQS. In any case, we list the non-polymer entities only in the asymmetric unit contents and not as part of the biological unit contents. Assigning the non-polymer ligands to appropriate biological units would require a significant research project outside the current scope of this database.
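    The translation step described above can be sketched as a lookup from author chain ID to candidate (asym_id, entity_id) pairs, reporting whether the match is unique, ambiguous, or missing. This is a minimal sketch under our own assumptions: the row layout, the function names, and the sample rows are hypothetical and do not reproduce the actual ProtBuD code.

```python
from collections import defaultdict


def build_chain_map(rows):
    """rows: iterable of (asym_id, entity_id, author_chain) for polymer chains."""
    by_author = defaultdict(list)
    for asym_id, entity_id, author_chain in rows:
        by_author[author_chain].append((asym_id, entity_id))
    return by_author


def translate(by_author, author_chain):
    """Return (status, candidates) for one author chain ID."""
    candidates = by_author.get(author_chain, [])
    if not candidates:
        return "unmatched", []
    if len(candidates) == 1:
        return "unique", candidates
    return "ambiguous", candidates


# Hypothetical entry: two chains labelled A and B, two chains with blank IDs.
rows = [("A", 1, "A"), ("B", 1, "B"), ("C", 2, ""), ("D", 2, "")]
chain_map = build_chain_map(rows)
print(translate(chain_map, "A"))   # one candidate
print(translate(chain_map, ""))    # two candidates share the blank chain ID
print(translate(chain_map, "Z"))   # no candidate
```

    Restricting the rows to polymer chains matters because authors sometimes reuse a chain ID for ligands, so the same author chain ID can also appear on non-polymer asym_ids.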

Parsing SCOP files.

    Domain definitions are parsed from the latest version of the SCOP classification file, dir.cla.scop.txt_1.69, and its description file, dir.des.scop.txt_1.69, available from http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html. Because SCOP uses a unique integer for each domain (Andreeva, et al., 2004), parsing these files is straightforward. SCOP also uses author chain IDs, so we use the XML data to translate these into asym_ids and entity_ids. This matching is also sometimes ambiguous. Again, this appears to happen when the original author file does not provide coordinates and chain IDs for the correct asymmetric unit. In cases where SCOP has one chain with a blank chain ID and the XML file contains two chains in the asymmetric unit, the ambiguity can usually be resolved by taking the first chain in the XML file to correspond to the SCOP chain.
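    A minimal sketch of reading one record of the classification file follows. It assumes the tab-separated layout of dir.cla.scop.txt (domain identifier, PDB code, chain and residue region, SCOP code, integer identifier, hierarchy string); the sample line is illustrative, and the class and function names are ours.

```python
from typing import List, NamedTuple, Optional, Tuple


class ScopDomain(NamedTuple):
    sid: str                        # SCOP domain identifier, e.g. d1dlwa_
    pdb_id: str
    regions: List[Tuple[str, str]]  # (author chain ID, residue range)
    sccs: str                       # class.fold.superfamily.family
    sunid: int                      # unique integer for the domain


def parse_region(region: str) -> List[Tuple[str, str]]:
    """Split 'A:', 'A:1-100,B:' or '-' into (chain, range) pairs.

    A range of '-' means the whole chain; a blank chain means no chain ID.
    """
    regions = []
    for part in region.split(","):
        chain, sep, rng = part.partition(":")
        if not sep:                 # no colon: the domain has no chain ID
            chain, rng = "", part
        regions.append((chain, rng if rng else "-"))
    return regions


def parse_cla_line(line: str) -> Optional[ScopDomain]:
    """Parse one line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    sid, pdb_id, region, sccs, sunid, _hierarchy = fields
    return ScopDomain(sid, pdb_id, parse_region(region), sccs, int(sunid))


# Illustrative record in the assumed layout.
line = ("d1dlwa_\t1dlw\tA:\ta.1.1.1\t14982\t"
        "cl=46456,cf=46457,sf=46458,fa=46459,dm=46460,sp=46461,px=14982\n")
print(parse_cla_line(line))
```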

Parsing PSI-BLAST Hit files.

    As part of our PISCES server, we create a non-redundant set of sequences of proteins in the PDB. We apply a modified PSI-BLAST (Altschul, et al., 1997) (G. Wang and R. Dunbrack, unpublished) to each sequence of this non-redundant set, searching the non-redundant protein sequence database (“nr”) available from NCBI (Wheeler, et al., 2005) to create a position-specific scoring matrix or profile. We then search the entire (redundant) PDB database with these non-redundant profiles and save hits with E-value better than 0.001. For an entry and chain not in the non-redundant set, the hits for the representative profile are assigned as hits for these entries and chains. PISCES also uses the author chains from the remediated PDB files, and we use the XML data to assign entity_ids to the sequences in the redundant and non-redundant sets. In ProtBuD, around 70% of the PDB entries not in SCOP have PSI-BLAST hits in the PDB, which are listed with their SCOP codes if available.
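    The assignment of hits to chains outside the non-redundant set can be sketched as follows. The data structures, names, and toy values are hypothetical; only the E-value cutoff of 0.001 and the idea of inheriting the representative's hits come from the description above.

```python
E_VALUE_CUTOFF = 0.001  # hits must have an E-value better (smaller) than this


def hits_for_chain(chain, representative_of, profile_hits, cutoff=E_VALUE_CUTOFF):
    """Return [(target, e_value), ...] for a chain, best hits first.

    representative_of: maps a redundant chain to its non-redundant representative.
    profile_hits: maps a representative chain to its raw (target, e_value) hits.
    """
    rep = representative_of.get(chain, chain)  # representatives map to themselves
    kept = [(target, e) for target, e in profile_hits.get(rep, []) if e < cutoff]
    return sorted(kept, key=lambda hit: hit[1])


# Toy data with made-up identifiers.
representative_of = {"dup1": "rep1"}
profile_hits = {"rep1": [("t1", 1e-30), ("t2", 0.5), ("t3", 5e-4)]}

print(hits_for_chain("rep1", representative_of, profile_hits))
print(hits_for_chain("dup1", representative_of, profile_hits))
```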

Table 1. PDB major data fields and their corresponding PDB noatom XML categories.

Data Fields | Tags | Description
PDB ID | PDB entry file name | 4-character PDB code
Biological Unit | PDBx:struct_biol_genCategory | Biological unit IDs and asymmetric chains
asym_id | PDBx:struct_asymCategory, PDBx:pdbx_poly_seq_schemeCategory | Asymmetric chain IDs and corresponding entity IDs, author chains
NumOfAsymIDs | PDBx:refine_histCategory | Number of asymmetric chain copies
entity_id | PDBx:entityCategory | Polymer status for an entity, can be nonpolymer or polymer
Polymer Type | PDBx:entity_polyCategory | Polymer type for each entity number. Polymer type can be polypeptide, polydeoxyribonucleotide, polyribonucleotide or polysaccharide, or “other”
Name | PDBx:pdbx_entity_nameCategory | Entity name: may be from different sources. Only one is used. The order is SwissProt, RCSB and PDB.
Sequence | PDBx:pdbx_poly_seq_schemeCategory | Residue sequence and PDB sequence. Missing residues may be in PDB sequence, represented by “-”.
NumOfLigandAtoms | PDBx:refine_histCategory | Number of ligand atoms
Methods | PDBx:exptlCategory | Experimental methods such as X-ray, NMR and EM
R factor | PDBx:refineCategory | Work R factor and free R factor and resolution if X-ray
Covalent Attachment | PDBx:struct_connCategory | Covalent attachments including attached chain information and covalent chain information, such as asymmetric chain ID, sequence ID, author chain ID and author sequence ID
Modified Residues | PDBx:struct_connCategory | The modified residues data include asymmetric chain ID, modified residue, standard residue, sequence ID, and author provided information.


Database design and construction.

    The ProtBuD database functionality was implemented using FireBird relational database server (http://firebird.sourceforge.net/). The database structure was designed to be modular, to avoid unnecessary redundancy and to allow fast queries. The database schema conforms to the Third Normal Form (3NF) under a set of functional dependencies designed to avoid unnecessary data duplication. Functional dependencies are considered standard practice in establishing good database designs (Silberschatz, et al., 2002). The communication between the application and the database server is performed using the ODBC protocol. The data tables are created dynamically just before data insertion.
    We do not discuss each functional dependency here, but provide one simple example. In the ScopDomain table, the functional dependency is on SunID: all other fields depend on SunID, and without SunID the other fields lose their meaning. We must preserve this functional dependency at all times. To optimize query speed, indices are added to the tables. The best tradeoff between speed and required disk space was achieved by using composite indices, which take advantage of the leftmost prefixing rule. For instance, in the ScopDomain table, besides the primary key SunID, a composite index (Class, Fold, Superfamily, Family) is added to speed up SCOP code queries. Our database can be divided into five independent modules: SCOP, PDB, PQS, PSI-BLAST hits, and biological units comparison. Each module can be created or updated individually. The whole database is connected by SCOP SunID, PDB entry ID, asym_id, and/or author chain ID.
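    The effect of the composite index can be illustrated with a small sketch. It uses SQLite from the Python standard library as a stand-in for FireBird, so the syntax and query planner are not those of ProtBuD; the PdbID column, the index name, and the rows are hypothetical. A query that constrains a leftmost prefix of (Class, Fold, Superfamily, Family) can use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ScopDomain (
        SunID       INTEGER PRIMARY KEY,  -- all other fields depend on SunID
        Class       TEXT    NOT NULL,
        Fold        INTEGER NOT NULL,
        Superfamily INTEGER NOT NULL,
        Family      INTEGER NOT NULL,
        PdbID       TEXT    NOT NULL      -- hypothetical column for this sketch
    )""")
conn.execute(
    "CREATE INDEX idx_scop_code ON ScopDomain (Class, Fold, Superfamily, Family)")

# Made-up rows; the SunID values are placeholders, not real SCOP identifiers.
conn.executemany("INSERT INTO ScopDomain VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "b", 2, 5, 2, "1kzy"),
    (2, "b", 2, 5, 2, "1gzh"),
    (3, "a", 74, 1, 1, "xxxx"),
])


def plan(sql, params=()):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " ".join(row[3] for row in rows)


# Superfamily-level query: constrains the leftmost three index columns.
sql = ("SELECT PdbID FROM ScopDomain "
       "WHERE Class = ? AND Fold = ? AND Superfamily = ? ORDER BY PdbID")
params = ("b", 2, 5)
prefix_plan = plan(sql, params)
entries = [row[0] for row in conn.execute(sql, params)]
print(prefix_plan)
print(entries)
```

    One index therefore serves class, fold, superfamily, and family queries, which is where the saving in disk space over four separate indices comes from.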

Automatic Updates

    We provide two ways to update the database. The fastest way is using “Automatic Update”, which is a one-click operation. This method involves downloading the precompiled database files from our server, unzipping and installing them to a user-specified directory. The “Advanced Update” protocol gives the power user more flexibility and independence and consists of downloading new or modified data source files from SCOP, PDB, PQS and our lab server (for PSI-BLAST hits), processing of these files, and inserting the data into the database. The advanced update functionality also removes obsolete PDB entries from PDB and PQS, and replaces obsolete PDB codes in SCOP with the replacement PDB codes. The PSI-BLAST hit file is computed from scratch each week, and so it is necessary to parse the whole file. The Automatic update procedure takes only a few minutes and is largely dependent on the internet transfer rate available to the user’s machine.
    On a machine with 1 GB of memory, a 2.8 GHz CPU, and a broadband internet connection, running the Advanced Update function weekly takes approximately one hour. Compared to rebuilding the whole database, which takes more than 10 hours, the automated update is a more efficient way to maintain the database.
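    The handling of obsolete entries during an update can be sketched as below. The mapping from obsolete to replacement codes is assumed to be available as a dictionary, with None meaning that an entry was removed without a replacement; the function names and the placeholder codes are hypothetical.

```python
def resolve_obsolete(pdb_id, replaced_by):
    """Follow replacements until a current code is reached.

    replaced_by maps an obsolete code to its replacement, or to None when
    the entry was removed without a replacement.
    """
    seen = set()
    current = pdb_id
    while current in replaced_by:
        if current in seen:
            raise ValueError(f"replacement cycle involving {current}")
        seen.add(current)
        current = replaced_by[current]
        if current is None:
            return None
    return current


def update_codes(pdb_ids, replaced_by):
    """Replace obsolete codes and drop entries with no replacement."""
    resolved = (resolve_obsolete(code, replaced_by) for code in pdb_ids)
    return [code for code in resolved if code is not None]


# Placeholder codes, not real PDB entries.
replaced_by = {"old1": "old2", "old2": "new1", "gone": None}
print(update_codes(["old1", "keep", "gone"], replaced_by))
```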

Database Interface

    The program that creates, updates and queries the database is written in C# .Net. C# is a programming language that has many similarities with C++ and Java. The ProtBuD database project has two parts: a core library that implements all processing functions and the user interface. The core library is also shared with the web-server version of the program. The standalone program has a user-friendly interface.
    Installation is straightforward. A graphical installer guides the user step by step through the setup. It automatically installs the database, configures the data source file directories, and installs all necessary libraries. The embedded FireBird database is completely hidden from the user, so a database server does not need to be installed and maintained separately. The current version can only be installed on Windows, although future ports of C# to Linux systems may enable future versions for Linux (see http://www.mono-project.com). The tool can be downloaded from our lab server (http://dunbrack.fccc.edu/ProtBuD, download file ProtBuDSetup.msi) after a simple registration and acceptance of a BSD-type license.

The Web Server

    In addition to the downloadable version of our program, we also make the data querying and retrieval functions available through a web server for the casual user. While the standalone application allows maximum flexibility including user-defined SQL queries on the database, the web server provides an easy way for the user to access the biological data without having to worry about hardware requirements, installation and maintenance of the program.
    The Web-based interface (http://dunbrack2.fccc.edu/ProtBuD/Query.aspx) follows closely the look-and-feel of the desktop application and implements the same query and display logic. From the very beginning, the software design took into account the requirement that the same functionality should be accessed via a graphical user interface (GUI) and a web page. The code development process was optimized by creating a core function library that is shared by both the standalone and web-server versions.

Query Interface

    The central feature of the program is the Query. Figures 2 and 3 show a typical session of a search on the database. The user enters a PDB entry code with or without a chain identifier (Figure 2a, top), and submits the query to the database. The returned SCOP domain definition data are displayed in a data grid (Figure 2b, bottom). A residue range designated by “-” indicates that the whole chain is a domain. To explore structures with domains in the same family, superfamily, or fold, the user clicks the cell with the appropriate SCOP designation. A new window opens and shows the asymmetric units and biological units of all PDB entries with a domain in the same family, superfamily, or fold, as shown in Figure 3. The user can also input a single SCOP code and directly get the result of Figure 3. Figure 3 gives an example of output from SCOP code input “b.2.5.2” (p53 DNA-binding domain-like) or PDB entry input “1KZY”.
    Four data formats are provided for the asymmetric and biological units: AsymID, EntityID, Author Chain and ABC formats. These are described in Table 2. The default is the “ABC” format, which is similar to that used by PQS. The other formats provide more detailed information on which sequences (entity_ids) or chains (asym_ids) in the asymmetric unit make up the biological units. In each of these formats, proteins in the asymmetric or biological unit with the same sequence are placed together in a set of parentheses. For instance, in the asym_id format for a heterotetramer of two different sequences, the form might be (A,B)(C,D), indicating that A and B have one sequence and C and D another. If the same structure were an octamer in the biological unit, the asym_id form might be (A2,B2)(C2,D2), indicating that there are two copies of each chain of the asymmetric unit. An alternative octamer might have been (A4)(C4). The difference is important because there may be some structural differences among chains with the same sequence within a single asymmetric unit.
    The user can show or hide each kind of format by clicking checkboxes at the top of the window. For most purposes, the ABC format is simplest and provides enough information.
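    The asym_id notation described above can be reproduced with a short sketch. The input layout and function names are our own. The entity-based form is inferred from the examples in the flags table, such as (1.1)(2.1), and should be read as an assumption about that notation rather than a specification.

```python
def asym_format(chains):
    """chains: list of (entity_id, asym_id, copies) in asymmetric-unit order.

    Chains with the same sequence (same entity_id) share one set of
    parentheses; a copy number is appended when it is greater than one.
    """
    groups = {}
    for entity_id, asym_id, copies in chains:
        label = asym_id if copies == 1 else f"{asym_id}{copies}"
        groups.setdefault(entity_id, []).append(label)
    return "".join("(" + ",".join(labels) + ")" for labels in groups.values())


def entity_format(chains):
    """Assumed entity form: (entity_id.total_copies) for each entity."""
    totals = {}
    for entity_id, _asym_id, copies in chains:
        totals[entity_id] = totals.get(entity_id, 0) + copies
    return "".join(f"({entity}.{count})" for entity, count in totals.items())


tetramer = [(1, "A", 1), (1, "B", 1), (2, "C", 1), (2, "D", 1)]
octamer = [(1, "A", 2), (1, "B", 2), (2, "C", 2), (2, "D", 2)]
alt_octamer = [(1, "A", 4), (2, "C", 4)]

for unit in (tetramer, octamer, alt_octamer):
    print(asym_format(unit), entity_format(unit))
```

    Under this reading, the two octamers have the same entity form but different asym_id forms, which is the distinction the text draws between them.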



Figure 2. Screenshots of a PDB entry query on the ProtBuD database. (a) PDB entry query input “1kzy”; (b) SCOP domain definition table that results from the query “1kzy”. The SCOP-defined domains for all chains in 1kzy are listed. Clicking the family cell for d1kzya_ (marked in blue) returns the asymmetric units and biological units, as well as ligand and DNA/RNA information, of all PDB entries in SCOP family b.2.5.2.

    The PDB and PQS biological units are matched based on their asymmetric chain format, not their biological unit IDs. The column SameBU indicates whether two biological units are the same, as detailed in Table 2. Two BUs are the same if they contain the same polymer entities with the same number and types of interfaces. Two BUs with the same entity contents may differ either because of a different number of interfaces, marked by difNum, or because of different interaction orientations between proteins, marked by difOrient. An example of difNum is shown in Table 3 (PDB entry 1TUI), while an example of difOrient is shown in Table 3 (PDB entry 1BUU). PQS labels some biological units as “XPACK”, and describes them as probably due to crystal packing but nevertheless of possible interest. There are currently 1220 such XPACK biological units and these are labeled “XPACK” in the SameBU column.
    To further explore the entities and asymmetric chains of a PDB entry, the user clicks the PDBID cell (leftmost column in Figure 3), and two tables appear at the bottom of the window. The first of these covers all the entities (by entity_id) that are in the asymmetric unit. From this table, the user can get a summary of the kinds of proteins in the asymmetric unit, including their names, SCOP codes, and biological species, as well as the identities of other ligands such as ions and small molecules. The example in Figure 3 shows that there are two kinds of proteins in PDB entry 1KZY, and it provides the asym_ids and author chain IDs for these proteins in the asymmetric unit. It also indicates that there is zinc and water in the asymmetric unit, and gives the entity_id and asym_ids used for these. It should be noted that the authors used A and B for the zincs as well as the p53 proteins, and no chain ID for the water molecules. In the lower grid, data are provided for each polymer asym_id in the asymmetric unit. The data include the type, the length, the missing residues, and modified residues or covalent attachments to these chains.
    The user can browse through the PDB entries in the family, superfamily, or fold returned by the query by using the UP or DOWN arrow keys in the PDBID column of the top table in Figure 3. Searching for an entry with a specific type of ligand, such as ATP, within the structures in a particular superfamily or family can be accomplished easily by navigating up and down the top table and examining the entity_id table that appears below as each PDB entry is selected.
    The user can also download the coordinate files from the PDB and PQS FTP servers by right-clicking the selected rows or cells. Selecting multiple rows is a shortcut to download ASU/BU files for multiple entries. Selecting a single cell only downloads the ASU or BU file for that cell. The compressed files are decompressed after being downloaded.
    If an input PDB entry is not in SCOP, a list of PSI-BLAST hits is returned with E-values, percent identities, and residue ranges from the PSI-BLAST alignments. Those in SCOP are listed with their SCOP designations. Any of these hits can be clicked to reveal a new window with the asymmetric and biological unit data. A right-click will produce a biological unit table with all of the hits listed.
    A query may also consist of two different SCOP codes so that a user may obtain all structures that contain members of two different SCOP families, superfamilies, or folds. We have not tested whether the two SCOP domains are in fact in contact with each other. It may be in some cases useful to find structures that contain two SCOP domains, even if they are not in contact. They may be in the same protein chain with a linker long enough to separate them, but such a template may still be useful for modeling. Information on SCOP domain contacts can be obtained from other databases such as PIBASE and PSIMAP. An example of this kind of search is shown in Figure 4, in which all PDB entries with both a kinase fold and a cyclin fold are shown. If a user does not know the SCOP codes, these can be obtained by single queries to our database with PDB entry identifiers, and then combining the results in the dual SCOP query. The figure shows that the PQS BU can be downloaded by clicking on the BU designation (“ABC”).
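    A dual SCOP query amounts to intersecting two sets of PDB entries. The sketch below uses made-up entries; the fold codes d.144 and a.74 are taken here to stand for the kinase and cyclin folds, which is our assumption rather than something stated above. As noted in the text, no check is made that the two domains are in contact.

```python
def entries_with_both(domains, code_a, code_b):
    """domains: iterable of (pdb_id, sccs) pairs, one per SCOP domain.

    A code matches a domain when it equals the domain's SCOP code or is a
    higher-level prefix of it (class, fold, or superfamily).
    """
    def entries(code):
        prefix = code + "."
        return {pdb for pdb, sccs in domains
                if sccs == code or sccs.startswith(prefix)}

    return sorted(entries(code_a) & entries(code_b))


# Made-up entries; only the shape of the data matters here.
domains = [
    ("ent1", "d.144.1.7"), ("ent1", "a.74.1.1"),   # both folds present
    ("ent2", "d.144.1.7"),                         # first fold only
    ("ent3", "a.74.1.1"),                          # second fold only
    ("ent4", "d.14.1.1"), ("ent4", "a.74.1.1"),    # d.14 must not match d.144
]
print(entries_with_both(domains, "d.144", "a.74"))
```

    Appending a dot before the prefix test keeps a code such as d.14 from matching d.144.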
    A user with SQL knowledge can query the database by inputting a query string. This provides greater opportunities for exploring the database and discovering useful information based on a user’s specific interest.

Figure 3. The asymmetric units and biological units output for SCOP family b.2.5.2 (p53 DNA-binding domain-like family). The format for the asymmetric and biological units can be selected and deselected with the checkboxes at the top of the window. The “ABC” format is shown in the figure. The two lower frames appear when a PDB ID is clicked in the first column of the upper table. The first table lists the entities in the PDB entry (all unique molecule types), and the second table lists the proteins in the asymmetric unit, including information on missing coordinates and residue modifications. By right-clicking, the user can download ASU/BU files from the PDB and PQS FTP servers, either selecting whole rows for multiple files or selecting individual cells for individual files.

Table 2. Description of Flags for column "SameBUs" in ASU/BU table

Flags | Descriptions | Example
same | Same entity contents, same orientation | 1gzh: (PDBBU-Entity: (1.1)(2.1), PQSBU-Entity: (1.1)(2.1)) same. Interfaces in PDBBU and PQSBU are the same.
difNum | Same entity contents, different number of interfaces | 1tui: (PDBBU-Entity: (1.3), PQSBU-Entity: (1.3)) difNum. The number of interfaces in 1tui.pdb1 is 2, the number of interfaces in 1tui_1.mmol is 3.
difOrient | Same entity contents, same number of interfaces, different orientation | 1buu: (PDBBU-Entity: (1.3), PQSBU-Entity: (1.3)) difOrient. The number of interfaces in both biological units is 3, but the orientations differ.
substruct | One entity content is a subset of the other one. Interfaces in the smaller biological unit are contained in the larger biological unit | 1b71: (PDBBU-Entity: (1.2), PQSBU-Entity: (1.4)) substruct. The PDB biological unit is half of the PQS biological unit.
dif | Different entity contents | 1a4p: (PDBBU-Entity: (1.2), PQSBU-Entity: (1.4)) dif. The PDB biological unit is not a substructure of the PQS biological unit.
Xpack | PQS crystal packing | 1jnx: (PDBBU-Entity: (1.1), PQSBU-Entity: (1.2)) Xpack. The PQS biological unit is a crystal packing.
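
    The flags in the table above can be read as a decision procedure, sketched below. This is our reading of the flag descriptions, not the ProtBuD implementation. The orientation comparison and the interface containment test require coordinates, so they are passed in as precomputed booleans; the class and parameter names are hypothetical, and the interface count used for 1gzh is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BioUnit:
    entities: Dict[int, int]   # entity_id -> number of copies, e.g. {1: 3}
    num_interfaces: int


def same_bu_flag(pdb, pqs, same_orientation=True,
                 interfaces_contained=False, pqs_xpack=False):
    """Return the SameBU flag for a PDB and a PQS biological unit."""
    if pqs_xpack:
        return "Xpack"
    if pdb.entities == pqs.entities:
        if pdb.num_interfaces != pqs.num_interfaces:
            return "difNum"
        return "same" if same_orientation else "difOrient"
    # Different contents: check whether the smaller unit fits inside the larger.
    smaller, larger = sorted((pdb.entities, pqs.entities),
                             key=lambda ents: sum(ents.values()))
    is_subset = all(larger.get(ent, 0) >= n for ent, n in smaller.items())
    return "substruct" if is_subset and interfaces_contained else "dif"


# Entity contents and interface counts follow the examples in the table.
print(same_bu_flag(BioUnit({1: 3}, 2), BioUnit({1: 3}, 3)))           # 1tui
print(same_bu_flag(BioUnit({1: 3}, 3), BioUnit({1: 3}, 3),
                   same_orientation=False))                           # 1buu
```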