Protein Common Assembly Database
ProtCAD contains clusters of assemblies of homologous proteins observed in multiple independent experiments, such as unique crystal forms
and individual EM and NMR experiments. The term "Crystal forms" (CFs) refers to independent experiments that provide evidence
in favor of the structure of biological assemblies. Using EPPIC [1] and mmCIF files from the PDB, we build and identify the unique viable assemblies
that can constitute the entire crystal.
We remove assemblies that contain any disconnected chains
and add any PDB biological assemblies that are not in EPPIC. We first cluster the assemblies across crystal forms of homologous proteins
by stoichiometry and symmetry (i.e., only assemblies with the same symmetry and stoichiometry are compared), and then check
whether the unique connecting interfaces are shared for a pair of assemblies. For example, a pair of D3 hexamers are similar
when they share the asymmetric interface that creates the cyclic trimer and at least one isologous interface between the C3 cycles.
The interfaces are always sorted from largest to smallest by surface area. This is a fast and accurate way of comparing
and clustering assemblies, since it requires comparing only the small number of unique interfaces in each assembly.
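The two-step comparison described above can be sketched roughly as follows. This is an illustrative sketch only, not the ProtCAD implementation: the `Assembly` record, the interface labels, the `min_shared` threshold, and the single-linkage grouping are all assumptions made for the example. In particular, it assumes each unique interface has already been assigned a label that is comparable across homologous proteins.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assembly:
    """Hypothetical record for one assembly (all field names are assumptions)."""
    entry: str          # PDB entry ID
    crystal_form: str   # label of the crystal form the entry belongs to
    stoichiometry: str  # e.g. "A6"
    symmetry: str       # e.g. "D3"
    interfaces: tuple   # unique interface labels, largest surface area first


def similar(a, b, min_shared=2):
    """Assumed rule: two assemblies match if they share enough unique interfaces."""
    return len(set(a.interfaces) & set(b.interfaces)) >= min_shared


def cluster_assemblies(assemblies, min_shared=2):
    # Step 1: only assemblies with the same stoichiometry and symmetry are compared.
    buckets = defaultdict(list)
    for asm in assemblies:
        buckets[(asm.stoichiometry, asm.symmetry)].append(asm)

    clusters = []
    for bucket in buckets.values():
        # Step 2: single-linkage grouping on shared interfaces within a bucket.
        groups = []
        for asm in bucket:
            keep, merged = [], [asm]
            for group in groups:
                if any(similar(asm, member, min_shared) for member in group):
                    merged.extend(group)
                else:
                    keep.append(group)
            groups = keep + [merged]
        clusters.extend(groups)
    return clusters


def common_clusters(clusters, min_crystal_forms=2):
    """Keep clusters observed in at least `min_crystal_forms` distinct crystal forms."""
    return [c for c in clusters
            if len({a.crystal_form for a in c}) >= min_crystal_forms]
```

With `min_shared=2`, two D3 hexamers would be grouped when they share, for example, the trimer-forming interface and one interface between the trimers, which mirrors the hexamer example in the text.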
Assemblies occurring in two or more crystal forms comprise a common assembly cluster in ProtCAD.
About 100,000 entries (59% of all PDB entries) appear in clusters with at least two crystal forms.
About 65,000 are contained in clusters of at least five crystal forms, and the PDB annotations contain the same assembly for only ~85% of these.
In addition, the observation of many crystal forms without a common assembly (dimer or larger) is good evidence in favor of a monomeric protein.
We annotate whether the PDB, EPPIC, and PISA have the clustered assembly, which may suggest that the assembly is biologically relevant
(by the criteria of author annotations, sequence conservation in interfaces, or biophysical properties, respectively).
For each UniProt sequence, we also determine what percentage of the available crystal forms of that protein contain the same assembly,
an additional indication of the biological relevance of the assembly. With a click of a single button,
the user can download all the structures of a particular assembly across PDB entries and PyMOL scripts for aligning and visualizing them.
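The per-UniProt percentage mentioned above is a simple ratio of crystal forms. A minimal sketch, assuming crystal forms are identified by arbitrary labels (the function name and inputs are hypothetical, not part of ProtCAD):

```python
def crystal_form_coverage(forms_with_assembly, all_forms):
    """Percentage of a protein's crystal forms that contain a given assembly.

    Both arguments are iterables of crystal-form labels (hypothetical inputs).
    """
    all_forms = set(all_forms)
    if not all_forms:
        raise ValueError("the protein has no crystal forms")
    # Count only forms that actually belong to this protein.
    observed = set(forms_with_assembly) & all_forms
    return 100.0 * len(observed) / len(all_forms)
```

For example, an assembly seen in three of a protein's four crystal forms gives a coverage of 75%.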
ProtCAD follows a similar approach to our Protein Common Interface Database (ProtCID)
[2][3], which has been widely used for benchmarking interface/assembly predictors.
For example, we generated non-physiological dimers for benchmarking various interface predictors from
the 3D-BioInfo community of ELIXIR. We believe the common assembly
clusters in ProtCAD will also be used as benchmark data, as well as training and testing data sets for structure prediction of protein complexes,
especially in the rapidly developing field of deep learning structure predictors.
You can search ProtCAD using different inputs:
- PDB code. Searching by PDB code returns a list of assemblies of the structure and their clusters, if available.
  Clicking the "GroupID" goes to the cluster page.
- Pfam code. Searching by Pfam code returns a list of Pfam architecture groups that contain the query Pfam.
  Selecting one group ID from this list goes to its cluster page.
- UniProt code. Searching by UniProt code returns the list of Pfam architecture groups and clusters
  containing the input UniProt. Selecting one Pfam architecture group from this list goes to its cluster page.
Or you can browse Pfams, Pfam architectures and UniProts in the PDB.