Protein Biological Unit Database (ProtBuD)
ProtBuD is a server for
searching the content of asymmetric and biological units as given by RCSB and PQS.
ProtBuD uses SCOP and PSI-BLAST to provide this information for all entries with
proteins in particular superfamilies or families. A user can search for a
particular SCOP designation, or for a particular entry or chain in an entry, and
obtain the asymmetric and biological unit contents of nearly all related proteins
in the PDB (the exceptions are new entries that are not in SCOP or are too
distantly related).
The web site provides three types of
searches.
(1) A search by SCOP classification code
will return a list of PDB entries in the input SCOP classification level and
their asymmetric units and biological units.
(2) A search by PDB entry with or without
chain ID or entity ID will return a table of all SCOP domains for this PDB
entry or specific chain or entity. Clicking the SCOP family,
superfamily, or fold name returns a list of PDB entries in that SCOP
designation and the PSI-BLAST hits homologous to the query entry (or other
sequences in the SCOP group) with E-value better than 0.1. The user can further
explore entities and asymmetric chains for a PDB entry.
If a PDB entry is not in SCOP, a list of
PSI-BLAST hits and their SCOP family codes will be returned.
(3) A search by keywords will return a list
of PDB entries containing the input keyword(s). The resulting table also
includes names, titles, and SCOP codes if an entry exists in SCOP.
(Standalone tool only.)
|
|
Method
Processing of data files
The data in ProtBuD come from four sources:
protein structure files from the PDB in XML format (Berman, et al., 2000;
Westbrook, et al., 2005), biological unit coordinate files from PQS (Henrick
and Thornton, 1998) in the legacy PDB format, domain classification files from
SCOP (Murzin, et al., 1995), and PSI-BLAST hit files from a non-redundant PDB
database of our lab (Wang and Dunbrack, 2003; Wang and Dunbrack, 2005). We use
the XML entity_id and asym_id identifiers for all molecules, while the other
sources use the author chain IDs. The XML files provide a correspondence
between these identifiers, although this occasionally presents some ambiguities
that can be resolved as described below.
Parsing XML files.
PDBML (Westbrook, et al., 2005) is part of
the uniformity project (Bhat, et al., 2001) of the PDB. The PDB XML data files
preserve the logical data model of the PDB Exchange Data Dictionary (Westbrook
and Fitzgerald, 2003). Data can be retrieved quickly from XML files, and most
software development environments provide libraries to read and write XML
files. From the XML files, we retrieve the following data:
- the entity_id and name for each type of molecule in the structure;
- for each entity_id, the asymmetric unit contents in terms of asym_ids (there may be several asym_ids for a given entity_id);
- the biological unit contents, consisting of symmetry operators applied to asym_ids;
- for protein and nucleic acid polymers, the author chain IDs for each asym_id molecule, which provide links with other databases such as PQS and SCOP that use the author chain IDs (the XML files indicate when an author chain ID is blank, but only for polypeptide entities);
- information on covalent attachments and modified residues, defined in terms of asym_ids, residue numbers, and atom names;
- structure determination data such as experiment type, space group, transformation matrices for converting to unit cell coordinates, missing residues, resolution, and R-factors.
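The correspondence between entity_ids and asym_ids described in the list above can be sketched in a few lines of Python. This is a minimal illustration and not the ProtBuD implementation (which is written in C#): the embedded fragment is hand-written to mimic the struct_asym category of a PDBML file, the namespace URI is a placeholder, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written fragment imitating the struct_asym category of a PDBML file.
# The namespace URI below is a placeholder, not the real PDBML schema location.
SAMPLE = """<PDBx:datablock xmlns:PDBx="urn:example:pdbx" datablockName="XXXX">
  <PDBx:struct_asymCategory>
    <PDBx:struct_asym id="A"><PDBx:entity_id>1</PDBx:entity_id></PDBx:struct_asym>
    <PDBx:struct_asym id="B"><PDBx:entity_id>1</PDBx:entity_id></PDBx:struct_asym>
    <PDBx:struct_asym id="C"><PDBx:entity_id>2</PDBx:entity_id></PDBx:struct_asym>
  </PDBx:struct_asymCategory>
</PDBx:datablock>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def asym_ids_by_entity(xml_text):
    """Return {entity_id: [asym_id, ...]} in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    mapping = defaultdict(list)
    for elem in root.iter():
        if local_name(elem.tag) != "struct_asym":
            continue
        asym_id = elem.get("id")
        for child in elem:
            if local_name(child.tag) == "entity_id":
                mapping[child.text.strip()].append(asym_id)
    return dict(mapping)


if __name__ == "__main__":
    print(asym_ids_by_entity(SAMPLE))
```

Matching on local tag names rather than a fixed namespace keeps the sketch independent of the schema version declared in a given file.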
We use the asym_ids in the XML files to link
the required information for asymmetric and biological units. Since the
biological units are defined in terms of the asym_ids and symmetry operators of
the space group, the asym_ids are sufficient for defining both asymmetric and
biological units in the XML files. However, ligands are not always assigned
properly to specific biological units. Often when an asymmetric unit is broken
up into more than one biological unit, all of the non-polymer ligands are
assigned to the first unit. This is a limitation of the current state of the
PDB and may be resolved in future releases of the PDB (J. Westbrook and H.
Berman, personal communication). Covalent attachments and modified residues are
identified uniquely in the XML files, and are connected to other data fields by
asym_ids, residue numbers, and atom names. The categories used in our database
are described in Table 1.
Parsing PQS files.
To compare the biological units provided by
the PDB and PQS, we use the legacy-PDB format *.mmol files provided by PQS,
parsing the “REMARK 300” fields to match PQS chains and PDB author-designated
chains. Our goal is to use the Exchange Dictionary asym_ids and entity_ids to
designate the contents of asymmetric and biological units in PQS. We therefore
use the XML files to provide a translation of the author-designated chain IDs
used by PQS into asym_ids and entity_ids. However, this assignment is not
always unambiguous. About 1.5% of PQS entries have protein chains that
cannot be matched at all to asym_ids. This usually occurs when the PDB has
changed the author-approved file to the correct asymmetric unit and added or
removed chains relative to the original files. Ligands are also not handled
consistently by PQS. In any case, we only list the non-polymer entities in the
asymmetric unit contents and not as part of the biological unit contents.
Assigning the non-polymer ligands to appropriate biological units would require
a significant research project outside the current scope of this database.
Parsing SCOP files.
Domain definitions are parsed from the
latest version of SCOP classification files: dir.cla.scop.txt_1.69 and its
description file, dir.des.scop.txt_1.69, available from
http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html. Because SCOP assigns a unique
integer to each domain (Andreeva, et al., 2004), parsing
these files is straightforward. SCOP also uses author chain IDs, so we
use the XML data to translate these into asym_ids and entity_ids. This
matching is also sometimes ambiguous. Again, this appears to happen when the
original author file does not provide coordinates and chain IDs for the correct
asymmetric unit. For example, SCOP may have one chain with a blank chain ID while
the XML file contains two chains in the asymmetric unit; such cases can usually
be resolved by taking the first chain in the XML file to correspond to the SCOP
chain.
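As an illustration of why these files are easy to parse, the following Python sketch reads one record of a dir.cla.scop.txt file. It assumes the usual six tab-separated columns (domain identifier, PDB code, chain/residue region, SCOP code, integer identifier, hierarchy keys); the function name is hypothetical and the integer identifier in the usage example is made up.

```python
def parse_cla_line(line):
    """Parse one tab-separated record of a SCOP dir.cla file (assumed 6 columns)."""
    sid, pdb_id, region, sccs, sunid, _hierarchy = line.rstrip("\n").split("\t")
    scop_class, fold, superfamily, family = sccs.split(".")

    # The region column is a comma-separated list such as "A:", "A:1-100",
    # "1-100" (blank author chain) or "-" (whole structure, blank chain).
    segments = []
    for part in region.split(","):
        if ":" in part:
            chain, residues = part.split(":", 1)
        else:
            chain, residues = "", ("" if part == "-" else part)
        segments.append((chain, residues))

    return {
        "sid": sid,
        "pdb_id": pdb_id,
        "segments": segments,  # (author chain ID, residue range) pairs
        "class": scop_class,
        "fold": int(fold),
        "superfamily": int(superfamily),
        "family": int(family),
        "sunid": int(sunid),
    }


if __name__ == "__main__":
    # Illustrative record; the integer identifier 99999 is invented.
    demo = "d1kzya_\t1kzy\tA:\tb.2.5.2\t99999\tcl=0,cf=0,sf=0,fa=0,dm=0,sp=0,px=99999"
    print(parse_cla_line(demo))
```

The author chain IDs returned in `segments` are what must then be translated into asym_ids and entity_ids using the XML data; an empty chain string corresponds to the blank chain ID case discussed above.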
Parsing PSI-BLAST Hit files.
As part of our PISCES server, we create a
non-redundant set of sequences of proteins in the PDB. We apply a modified
PSI-BLAST (Altschul, et al., 1997) (G. Wang and R. Dunbrack, unpublished) to
each of the sequences of this non-redundant set to search the non-redundant
protein sequence database (“nr”) available from NCBI (Wheeler, et al., 2005) to
create a position-specific scoring matrix or profile. We then search the entire
(redundant) PDB database with these non-redundant profiles and save hits with
E-value better than 0.001. For an entry and chain not in the non-redundant set,
the hits for the representative profile are assigned as hits for these entries
and chains. PISCES also uses the author chains from the remediated PDB files,
and we use the XML data to assign entity_ids to the sequences in the redundant
and non-redundant sets. For around 70% of the PDB entries that are not in SCOP,
ProtBuD provides PSI-BLAST hits to other PDB entries, with their SCOP codes if available.
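The assignment of representative hits to redundant chains can be sketched as a simple lookup. This is only an illustration under assumed data structures: the dictionaries, the function name, and the chain identifiers in the usage example are hypothetical; only the E-value cutoff of 0.001 comes from the text.

```python
E_VALUE_CUTOFF = 0.001  # hits must have an E-value better (smaller) than this


def hits_for_all_chains(representative_of, hits_by_representative):
    """Give every chain the hits found with its representative's profile.

    representative_of: {chain: representative chain in the non-redundant set}
    hits_by_representative: {representative: [(hit chain, E-value), ...]}
    """
    assigned = {}
    for chain, representative in representative_of.items():
        hits = hits_by_representative.get(representative, [])
        assigned[chain] = [(hit, e) for hit, e in hits if e < E_VALUE_CUTOFF]
    return assigned


if __name__ == "__main__":
    # Made-up identifiers: "2xyzB" is redundant with representative "1abcA".
    representative_of = {"1abcA": "1abcA", "2xyzB": "1abcA"}
    hits = {"1abcA": [("3defA", 1e-20), ("4ghiA", 0.5)]}
    print(hits_for_all_chains(representative_of, hits))
```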
Table 1. PDB major data fields and their corresponding PDB noatom XML categories.
Data Fields | Tags | Description
PDB ID | PDB entry file name | 4-character PDB code
Biological Unit | PDBx:struct_biol_genCategory | Biological unit IDs and asymmetric chains
asym_id | PDBx:struct_asymCategory, PDBx:pdbx_poly_seq_schemeCategory | Asymmetric chain IDs and corresponding entity IDs, author chains
NumOfAsymIDs | PDBx:refine_histCategory | Number of asymmetric chain copies
entity_id | PDBx:entityCategory | Polymer status for an entity, can be nonpolymer or polymer
Polymer Type | PDBx:entity_polyCategory | Polymer type for each entity number. Polymer type can be polypeptide, polydeoxyribonucleotide, polyribonucleotide or polysaccharide, or “other”
Name | PDBx:pdbx_entity_nameCategory | Entity name: may be from different sources. Only one is used. The order is SwissProt, RCSB and PDB.
Sequence | PDBx:pdbx_poly_seq_schemeCategory | Residue sequence and PDB sequence. Missing residues may be in PDB sequence, represented by “-”.
NumOfLigandAtoms | PDBx:refine_histCategory | Number of ligand atoms
Methods | PDBx:exptlCategory | Experimental methods such as X-ray, NMR and EM
R factor | PDBx:refineCategory | Work R factor and free R factor and resolution if X-ray
Covalent Attachment | PDBx:struct_connCategory | Covalent attachments including attached chain information and covalent chain information, such as asymmetric chain ID, sequence ID, author chain ID and author sequence ID
Modified Residues | PDBx:struct_connCategory | The modified residues data include asymmetric chain ID, modified residue, standard residue, sequence ID, and author provided information.
Implementation
Database design and construction.
The ProtBuD database functionality was
implemented using FireBird relational database server
(http://firebird.sourceforge.net/). The database structure was designed to be
modular, to avoid unnecessary redundancy and to allow fast queries. The
database schema conforms to the Third Normal Form (3NF) under a set of
functional dependencies designed to avoid unnecessary data duplication.
Functional dependencies are considered standard practice in establishing good
database designs (Silberschatz, et al., 2002). The communication between the
application and the database server is performed using the ODBC protocol. The
data tables are created dynamically just before data insertion.
We do not discuss each functional dependency here, but
provide one simple example. In the ScopDomain table, every other field is
functionally dependent on SunID: without SunID the other
fields lose their meaning. This functional dependency must be preserved at all
times. To optimize query speed, indices are added to the tables.
The best tradeoff between speed and required disk space was achieved by
using composite indices, which take advantage of the leftmost-prefix rule.
For instance, in the ScopDomain table, besides the primary key SunID, a
composite index (Class, Fold, Superfamily, Family) is added to speed up SCOP
code queries. Our database can be divided into five independent modules: SCOP,
PDB, PQS, PSI-BLAST hits, and biological unit comparison. Each module can be
created or updated individually. The whole database is connected by SCOP SunID,
PDB entry ID, asym_id, and/or author chain ID.
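The composite index can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for the FireBird server, so it shows the idea rather than the actual ProtBuD schema: apart from SunID, Class, Fold, Superfamily, and Family, the column, index, and function names are assumptions, and the rows (including the SunID values) are invented for illustration.

```python
import sqlite3

# Invented demonstration rows: (SunID, PdbID, Class, Fold, Superfamily, Family).
DEMO_ROWS = [
    (1, "1kzy", "b", 2, 5, 2),
    (2, "1tup", "b", 2, 5, 2),
    (3, "1abc", "b", 2, 3, 1),
    (4, "2xyz", "a", 1, 1, 1),
]


def build_demo_db():
    """Create an in-memory ScopDomain table with a composite SCOP-code index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ScopDomain (
               SunID       INTEGER PRIMARY KEY,
               PdbID       TEXT    NOT NULL,
               Class       TEXT    NOT NULL,
               Fold        INTEGER NOT NULL,
               Superfamily INTEGER NOT NULL,
               Family      INTEGER NOT NULL)"""
    )
    # One index serves class, fold, superfamily, and family queries, because
    # each of these constrains a leftmost prefix of the indexed columns.
    conn.execute(
        "CREATE INDEX idx_scop_code "
        "ON ScopDomain (Class, Fold, Superfamily, Family)"
    )
    conn.executemany("INSERT INTO ScopDomain VALUES (?, ?, ?, ?, ?, ?)", DEMO_ROWS)
    return conn


def entries_in_superfamily(conn, scop_class, fold, superfamily):
    """Return PDB codes with a domain in the given superfamily (prefix query)."""
    cursor = conn.execute(
        "SELECT DISTINCT PdbID FROM ScopDomain "
        "WHERE Class = ? AND Fold = ? AND Superfamily = ? ORDER BY PdbID",
        (scop_class, fold, superfamily),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    print(entries_in_superfamily(build_demo_db(), "b", 2, 5))
```

A query that constrains only Family, without Class, Fold, and Superfamily, would not match a leftmost prefix and so would generally not benefit from this index.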
Automatic Updates
We provide two ways to update the database.
The fastest way is using “Automatic Update”, which is a one-click operation.
This method involves downloading the precompiled database files from our
server, unzipping and installing them to a user-specified directory. The
“Advanced Update” protocol gives the power user more flexibility and
independence and consists of downloading new or modified data source files from
SCOP, PDB, PQS and our lab server (for PSI-BLAST hits), processing of these
files, and inserting the data into the database. The advanced update
functionality also removes obsolete PDB entries from PDB and PQS, and replaces
obsolete PDB codes in SCOP with the replacement PDB codes. The PSI-BLAST hit
file is computed from scratch each week, and so it is necessary to parse the
whole file. The Automatic update procedure takes only a few minutes and is
largely dependent on the internet transfer rate available to the user’s
machine.
On a machine with 1 GB of memory, a 2.8 GHz CPU, and a
broadband internet connection, a weekly Advanced Update takes
approximately one hour. Compared with rebuilding the whole database, which takes
more than 10 hours, the automated update is a more efficient way to maintain
the database.
Database Interface
The program that creates, updates and
queries the database is written in C# .Net. C# is a programming language that
has many similarities with C++ and Java. The ProtBuD database project has two
parts: a core library that implements all processing functions and the user
interface. The core library is also shared with the web-server version of the
program. The standalone program has a user-friendly interface.
Installation is straightforward. A graphical installer guides the user step by step
through the setup, automatically installing the database, configuring
the data source file directories, and installing all necessary libraries. The
embedded FireBird database is completely hidden from the user, so that a
database server does not need to be installed and maintained separately by the
user. The current version can only be installed on Windows, although future
ports of C# to Linux systems may enable future versions for Linux (see
http://www.mono-project.com). The tool can be downloaded from our lab server
(http://dunbrack.fccc.edu/ProtBuD, download file ProtBuDSetup.msi) after a
simple registration and acceptance of a BSD-type license.
The Web Server
In addition to the downloadable version of
our program, we also make the data querying and retrieval functions available
through a web server for the casual user. While the standalone application
allows maximum flexibility including user-defined SQL queries on the database,
the web server provides an easy way for the user to access the biological data
without having to worry about hardware requirements, installation and
maintenance of the program.
The Web-based interface
(http://dunbrack2.fccc.edu/ProtBuD/Query.aspx) follows closely the
look-and-feel of the desktop application and implements the same query and
display logic. From the very beginning, the software design took into account
the requirement that the same functionality should be accessed via a graphical
user interface (GUI) and a web page. The code development process was optimized
by creating a core function library that is shared by both the standalone and
web-server versions.
|
|
Query Interface
The central feature of the program tool is
the Query. Figures 2 and 3 show a typical session of a search on the database.
The user enters a PDB entry code with or without a chain identifier (Figure 2a,
top), and submits the query to the database. The returned SCOP domain
definition data are displayed in a data grid (Figure 2b, bottom). A residue
range designated by “-” indicates that the whole chain is a domain. To explore
structures with domains in the same family, superfamily, or fold, the user
clicks the cell with the appropriate SCOP designation. A new window opens and
shows the asymmetric units and biological units of all PDB entries with a
domain in the same family, superfamily, or fold, as shown in Figure 3. The user
can input a single SCOP code and directly get the result of Figure 3. Figure 3
gives an example of output from SCOP code input “b.2.5.2” (p53 DNA-binding
domain-like) or PDB entry input “1KZY”. Four data formats are provided for the
asymmetric and biological units: AsymID, EntityID, Author Chain and ABC
formats. These are described in Table 2. The default is the “ABC” format, which
is similar to that used by PQS. The other formats provide more detailed
information on which sequences (entity_ids) or chains (asym_ids) in the
asymmetric unit make up the biological units. In each of these formats,
proteins in the asymmetric or biological unit with the same sequence are placed
together in a set of parentheses. For instance, in the asym_id format for a
heterotetramer of two different sequences, the form might be (A,B)(C,D),
indicating that A and B have one sequence and C and D another. If the same structure
were an octamer in the biological unit, the asym_id form might be
(A2,B2)(C2,D2), indicating that there are two copies of each chain of the
asymmetric unit. An alternative octamer might have been (A4)(C4). The
difference is important because there may be some structural differences among
chains with the same sequence within a single asymmetric unit. The user can
show or hide each kind of format by clicking checkboxes at the top of the
window. For most purposes, the ABC format is simplest and provides enough
information.
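The asym_id notation described above can be reproduced with a short Python sketch. The function name and input structures are hypothetical and only follow the three examples given in the text; this is not the ProtBuD formatting code.

```python
def format_asym_ids(chains_by_entity, copies=None):
    """Build a string such as "(A2,B2)(C2,D2)".

    chains_by_entity: {entity_id: [asym_id, ...]}; chains sharing a sequence
        (the same entity_id) end up in one set of parentheses.
    copies: {asym_id: number of copies in the unit}; defaults to 1 per chain.
        A chain with 0 copies is left out of the unit.
    """
    copies = copies or {}
    groups = []
    for entity_id in sorted(chains_by_entity):
        labels = []
        for asym_id in chains_by_entity[entity_id]:
            n = copies.get(asym_id, 1)
            if n == 0:
                continue
            labels.append(asym_id if n == 1 else f"{asym_id}{n}")
        if labels:
            groups.append("(" + ",".join(labels) + ")")
    return "".join(groups)


if __name__ == "__main__":
    entities = {1: ["A", "B"], 2: ["C", "D"]}
    print(format_asym_ids(entities))                                      # asymmetric unit
    print(format_asym_ids(entities, {"A": 2, "B": 2, "C": 2, "D": 2}))    # one octamer
    print(format_asym_ids(entities, {"A": 4, "B": 0, "C": 4, "D": 0}))    # alternative octamer
```

The two octamers contain the same number of copies of each sequence but are built from different chains of the asymmetric unit, which is the distinction the asym_id format preserves.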
|
|
Figure 2. Screenshots of a PDB entry query on the ProtBuD database. (a) PDB
entry query input “1kzy”; (b) SCOP domain definition table that results from
the query “1kzy”. The SCOP-defined domains for all chains in 1kzy are listed.
Clicking the family cell for d1kzya_ (marked in blue) returns the asymmetric
units and biological units as well as ligands, DNA/RNA information of all PDB
entries in SCOP family b.2.5.2.
|
|
The PDB and PQS biological units are matched based on their
asymmetric chain format, not their biological unit IDs. The column SameBU
indicates if two biological units are the same or not, as detailed in
Table 2. Two BUs are the same if they contain the same polymer entities
with the same number and types of interfaces. Two BUs with the same entity
contents may be different either because of a different number of interfaces,
marked by difNum, or because of different interaction orientations between
proteins, marked by difOrient. An example of difNum is shown in Table 2
(PDB entry 1TUI), while an example of difOrient is also shown in Table 2
(PDB entry 1BUU). PQS labels some biological units as “XPACK”, and describes
them as probably due to crystal packing but nevertheless of possible interest.
There are currently 1220 such XPACK biological units and these are labeled
“XPACK” in the SameBU column.
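The decision logic behind the SameBU flags can be summarized in a short Python sketch. It is a simplified reading of the flag definitions and not the comparison code used by ProtBuD: it assumes that each biological unit is summarized by its entity contents (entity_id mapped to copy number, so that (1.3) becomes {1: 3}) and by a list of interface labels in which equal labels stand for equivalent interfaces. How interfaces are actually detected and compared is not covered, and checking the XPACK label first is an assumption.

```python
from collections import Counter


def contained(small, large):
    """True if every item of the multiset `small` occurs at least as often in `large`."""
    return all(count <= large[key] for key, count in small.items())


def same_bu_flag(pdb_entities, pqs_entities, pdb_interfaces, pqs_interfaces,
                 pqs_is_xpack=False):
    """Return one of: same, difNum, difOrient, substruct, dif, Xpack."""
    if pqs_is_xpack:
        return "Xpack"

    pdb_e, pqs_e = Counter(pdb_entities), Counter(pqs_entities)
    pdb_i, pqs_i = Counter(pdb_interfaces), Counter(pqs_interfaces)

    if pdb_e == pqs_e:
        if sum(pdb_i.values()) != sum(pqs_i.values()):
            return "difNum"
        return "same" if pdb_i == pqs_i else "difOrient"

    # One unit is a substructure of the other if both its entity contents
    # and its interfaces are contained in the larger unit.
    for small_e, large_e, small_i, large_i in (
        (pdb_e, pqs_e, pdb_i, pqs_i),
        (pqs_e, pdb_e, pqs_i, pdb_i),
    ):
        if contained(small_e, large_e) and contained(small_i, large_i):
            return "substruct"
    return "dif"


if __name__ == "__main__":
    # Entity contents (1.3) in both units, with 2 versus 3 interfaces.
    print(same_bu_flag({1: 3}, {1: 3}, ["x", "x"], ["x", "x", "x"]))
```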
To further explore the entities and asymmetric chains of a
PDB entry, the user clicks the PDBID cell (leftmost column in Figure 3), and
two tables appear at the bottom of the window. The first of these covers all
the entities (by entity_id) that are in the asymmetric unit. From this table,
the user can get a summary of the kinds of proteins in the asymmetric unit,
including their names, SCOP codes, and biological species, as well as the
identities of other ligands such as ions and small molecules. The example in
Figure 3 shows that there are two kinds of proteins in PDB entry 1KZY, and it
provides the asym_ids and author chain IDs for these proteins in the
asymmetric unit. It also indicates that there is zinc and water in the
asymmetric unit and the entity_id and asym_ids used for these. It should be
noted that the authors used A and B for the zincs as well as the p53 proteins
and no chain ID for the water molecules. In the lower grid, data are provided
for each polymer asym_id in the asymmetric unit. The data include the type, the
length, the missing residues and modified residues or covalent attachments to
these chains.
The user can browse through the PDB entries in the family,
superfamily, or fold returned by the query by using the UP or DOWN arrow keys
in the PDBID column of the top table in Figure 3. Searching for an entry with a
specific type of ligand, such as ATP, within the structures in a particular
superfamily or family can be accomplished easily by navigating up and down the
top table and examining the entity_id table that appears below as each PDB
entry is selected.
The user can also download the coordinate files from the PDB and
PQS FTP servers by right-clicking the selected rows or cells. Selecting
multiple rows is a shortcut to download ASU/BU files for multiple entries.
Selecting a single cell downloads only the ASU or BU file for that cell. The
compressed files are decompressed after being downloaded.
If an input PDB entry is not in SCOP, a list of PSI-BLAST
hits is returned with E-values, percent identities, and residue ranges from
the PSI-BLAST alignments. Those in SCOP are listed with their SCOP
designations. Any of these hits can be clicked to reveal a new window with the
asymmetric and biological unit data. A right-click will produce a biological
unit table with all of the hits listed.
A query may also consist of two different SCOP codes so that
a user may obtain all structures that contain members of two different SCOP
families, superfamilies, or folds. We have not tested whether the two SCOP
domains are in fact in contact with each other. It may be in some cases useful
to find structures that contain two SCOP domains, even if they are not in
contact. They may be in the same protein chain with a linker long enough to
separate them, but such a template may still be useful for modeling.
Information on SCOP domain contacts can be obtained from other databases such
as PIBASE and PSIMAP. An example of this kind of search is shown in Figure 4,
in which all PDB entries with both a kinase fold and a cyclin fold are shown.
If a user does not know the SCOP codes, these can be obtained by single queries
to our database with PDB entry identifiers, and then combining the results in
the dual SCOP query. The figure shows that the PQS BU can be downloaded by
clicking on the BU designation (“ABC”).
A user with SQL knowledge can query the database by
inputting a query string. This provides greater opportunities for exploring the
database and discovering useful information based on a user’s specific
interest.
|
|
Figure 3. The asymmetric units and biological units output for SCOP family
b.2.5.2 (p53 DNA-binding domain-like family). The format for the asymmetric and
biological units can be selected and deselected with the checkboxes at the top
of the window. The “ABC” format is shown in the figure. The two lower frames
appear when a PDB ID is clicked in the first column of the upper table. The
first table lists the entities in the PDB entry (all unique molecule types),
and the second table lists the proteins in the asymmetric unit, including
information on missing coordinates and residue modifications. By right-clicking,
the user can download ASU/BU files from the PDB and PQS FTP servers, selecting
whole rows for multiple files or individual cells for individual
files.
|
|
Table 2. Description of Flags for column "SameBUs" in ASU/BU table
Flags | Descriptions | Example
same | Same entity contents, same orientation | 1gzh: (PDBBU-Entity: (1.1)(2.1), PQSBU-Entity: (1.1)(2.1)) same. Interfaces in PDBBU and PQSBU are the same.
difNum | Same entity contents, different number of interfaces | 1tui: (PDBBU-Entity: (1.3), PQSBU-Entity: (1.3)) difNum. The number of interfaces in 1tui.pdb1 is 2, the number of interfaces in 1tui_1.mmol is 3.
difOrient | Same entity contents, same number of interfaces, different orientation | 1buu: (PDBBU-Entity: (1.3), PQSBU-Entity: (1.3)) difOrient. The number of interfaces in both biological units is 3, but the orientations differ.
substruct | One entity content is a subset of the other one. Interfaces in the smaller biological unit are contained in the larger biological unit | 1b71: (PDBBU-Entity: (1.2), PQSBU-Entity: (1.4)) substruct. PDB biological unit is half of PQS biological unit.
dif | Different entity contents | 1a4p: (PDBBU-Entity: (1.2), PQSBU-Entity: (1.4)) dif. PDB biological unit is not a substructure of PQS biological unit.
Xpack | PQS crystal packing | 1jnx: (PDBBU-Entity: (1.1), PQSBU-Entity: (1.2)) Xpack. PQS biological unit is a crystal packing.
Copyright © 2006, Q. Xu, A.A. Canutescu &
R.L. Dunbrack Jr.
Fox Chase Cancer Center
|
| |