Urease C3-A3B3C3

Protein Common Assembly Database

ProtCAD contains clusters of assemblies of homologous proteins observed in multiple independent experiments, such as unique crystal forms and individual EM and NMR experiments. The term "crystal forms" (CFs) refers to independent experiments that provide evidence in favor of the structure of biological assemblies. We build and identify the unique viable assemblies that can constitute the entire crystal using EPPIC [1] and mmCIF files from the PDB. We remove assemblies with any disconnected chains and add any PDB biological assemblies that are not in EPPIC. We first cluster the assemblies across crystal forms of homologous proteins by stoichiometry and symmetry (i.e., only assemblies with the same symmetry and stoichiometry are compared), and then check whether the unique connecting interfaces are shared by each pair of assemblies. For example, two D3 hexamers are similar when they share the asymmetric interface that creates the cyclic trimer and at least one isologous interface between the C3 cycles. The interfaces are always sorted from largest to smallest by surface area. This is a fast and accurate way of comparing and clustering assemblies, since it requires only a small number of unique interfaces per assembly.
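The two-step comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ProtCAD implementation: the `Interface` and `Assembly` classes, the `cluster_id` field (a hypothetical identifier for a group of homologous interfaces), and the greedy comparison against the first member of each cluster are all invented for the example. The similarity rule generalizes the D3 example by requiring at least one shared interface of every kind present.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    cluster_id: int  # hypothetical ID of a homologous-interface group
    kind: str        # "heterologous" (asymmetric) or "isologous" (symmetric)
    area: float      # interface surface area, in square angstroms


@dataclass
class Assembly:
    name: str
    symmetry: str        # e.g. "D3"
    stoichiometry: str   # e.g. "A6"
    interfaces: list     # unique connecting interfaces only

    def __post_init__(self):
        # Keep interfaces sorted from largest to smallest surface area.
        self.interfaces = sorted(self.interfaces, key=lambda i: i.area, reverse=True)


def similar(a, b):
    """Assumed rule: same symmetry/stoichiometry, and at least one shared
    interface of every kind that occurs in either assembly."""
    if (a.symmetry, a.stoichiometry) != (b.symmetry, b.stoichiometry):
        return False
    ids_b = {i.cluster_id for i in b.interfaces}
    shared = defaultdict(set)
    for i in a.interfaces:
        if i.cluster_id in ids_b:
            shared[i.kind].add(i.cluster_id)
    kinds = {i.kind for i in a.interfaces} | {i.kind for i in b.interfaces}
    return all(shared[k] for k in kinds)


def cluster(assemblies):
    # Step 1: group by symmetry and stoichiometry.
    groups = defaultdict(list)
    for asm in assemblies:
        groups[(asm.symmetry, asm.stoichiometry)].append(asm)
    # Step 2: within each group, greedily compare against the first
    # member of each existing cluster (a simplification).
    clusters = []
    for members in groups.values():
        local = []
        for asm in members:
            for c in local:
                if similar(c[0], asm):
                    c.append(asm)
                    break
            else:
                local.append([asm])
        clusters.extend(local)
    return clusters


# Toy data: h1 and h2 share the trimer-forming interface (1) and an
# isologous interface (2); h3 shares only the trimer-forming interface.
h1 = Assembly("h1", "D3", "A6", [Interface(1, "heterologous", 900.0), Interface(2, "isologous", 600.0)])
h2 = Assembly("h2", "D3", "A6", [Interface(2, "isologous", 580.0), Interface(1, "heterologous", 910.0)])
h3 = Assembly("h3", "D3", "A6", [Interface(1, "heterologous", 880.0), Interface(9, "isologous", 400.0)])
t1 = Assembly("t1", "C3", "A3", [Interface(1, "heterologous", 905.0)])

result = [[a.name for a in c] for c in cluster([h1, h2, h3, t1])]
print(result)
```

In this toy data set, h3 is kept apart from h1 and h2 because it shares no isologous interface with them, and the C3 trimer is never compared with the D3 hexamers at all.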

Assemblies occurring in two or more crystal forms comprise a common assembly cluster in ProtCAD. About 100,000 entries (59% of all PDB entries) appear in clusters with at least two crystal forms. About 65,000 are contained in clusters of at least five crystal forms, and the PDB annotations have the same assembly in only ~85% of these. In addition, the observation of many crystal forms without a common assembly (dimer or larger) is good evidence in favor of a monomeric protein. We annotate whether the PDB, EPPIC, and PISA have the clustered assembly, which may suggest that the assembly is biologically relevant (by the criteria of author annotations, sequence conservation in interfaces, or biophysical properties, respectively). For each UniProt sequence, we also determine what percentage of the available crystal forms of that protein contain the same assembly, an additional indication of the biological relevance of the assembly. With a single click, the user can download all the structures of a particular assembly across PDB entries, together with PyMOL scripts for aligning and visualizing them. Our Protein Common Interface Database (ProtCID) [2][3] has been widely used for benchmarking interface and assembly predictors; for example, we generated a set of non-physiological dimers for benchmarking various interface predictors from the 3D-BioInfo community of ELIXIR. In a similar manner, we believe the common assembly clusters in ProtCAD will also be used as benchmark data, as well as training and testing data sets for structure prediction of protein complexes, especially in the rapidly developing field of deep learning structure predictors.
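As a rough illustration of the two numbers discussed above (whether an assembly is seen in at least two crystal forms, and what percentage of a protein's crystal forms contain it), the following Python sketch may help. The function name `crystal_form_support` and the use of plain integer crystal-form IDs are assumptions made for the example; they do not reflect how ProtCAD stores or computes these values.

```python
def crystal_form_support(assembly_cfs, protein_cfs):
    """Return (is_common, percent) for one assembly of one protein.

    assembly_cfs: IDs of crystal forms in which the assembly is observed
    protein_cfs:  IDs of all available crystal forms of the protein
    (both are hypothetical inputs for this sketch)
    """
    all_cfs = set(protein_cfs)
    if not all_cfs:
        raise ValueError("the protein has no crystal forms")
    observed = set(assembly_cfs) & all_cfs
    # An assembly seen in two or more crystal forms counts as "common".
    is_common = len(observed) >= 2
    percent = 100.0 * len(observed) / len(all_cfs)
    return is_common, percent


# A protein with 8 crystal forms, 6 of which contain the same assembly.
support = crystal_form_support({1, 2, 3, 4, 5, 6}, range(1, 9))
print(support)
```

A high percentage across many crystal forms supports the assembly; a protein with many crystal forms and no assembly reaching the two-form threshold would instead point toward a monomer.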

You can search ProtCAD with different inputs:

  • PDB code. Searching by PDB code returns a list of the assemblies of the structure and their clusters, if available. Clicking the "GroupID" opens the cluster page.
  • Pfam code. Searching by Pfam code returns a list of Pfam architecture groups that contain the query Pfam. Selecting a group ID from this list opens its cluster page.
  • UniProt code. Searching by UniProt code returns the list of Pfam architecture groups and clusters containing the input UniProt. Selecting a Pfam architecture group from this list opens its cluster page.

Alternatively, you can browse the Pfams, Pfam architectures, and UniProt entries in the PDB.

References

1. S. Bliven, et al., Automated evaluation of quaternary structures from protein crystals. PLoS Comput Biol 14(4): e1006104 (2018). https://doi.org/10.1371/journal.pcbi.1006104

2. Q. Xu and R. Dunbrack, The protein common interface database (ProtCID) - a comprehensive database of interactions of homologous proteins in multiple crystal forms. Nucleic Acids Res 39(suppl_1): D761-D770 (2011).

3. Q. Xu and R. Dunbrack, ProtCID: a data resource for structural information on protein interactions. Nat Commun 11, 711 (2020). https://doi.org/10.1038/s41467-020-14301-4