Evolutionary inference on structure-function relationships requires grouping known structures into protein families or superfamilies, usually at the domain level. Pfam, a database of hidden Markov models (HMMs) of protein families, is widely used, contains significant biological annotation, and can be easily applied to new structures. However, publicly available Pfam assignments miss many remote relationships, and many assigned regions are shorter than the true HMM/structure alignment. In addition, the algorithms used to define domain architectures produce flawed results when domains are split by long insertions or split across chains.

We produce Pfam assignments for the PDB by aligning PSI-BLAST-derived consensus sequences to the Pfam HMMs and by HMM-HMM alignment against the Pfam HMMs. Consensus sequences and HMMs are derived for both the PDB chain sequences and their parent UniProt sequences. In addition, we use structure alignment to verify assignments that have weak E-values, short alignments relative to the HMM length, or both. We found that HMM-HMM alignments frequently scored more remotely related Pfams within a Pfam clan higher than closely related Pfams, leading to erroneous assignments at the Pfam family level. We therefore applied a greedy algorithm first to the high-confidence consensus alignments, then to the high-confidence HMM-HMM alignments, and finally to the structure alignments, taking care to join partial assignments split by large insertions or between chains. Our assignments cover 99.3% of chains longer than 50 residues and 82% of all PDB residues.
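
The tiered greedy selection described above can be illustrated with a minimal Python sketch. It is not the PDBfam implementation: the `Hit` record, the `tier` numbering (0 = consensus alignment, 1 = HMM-HMM alignment, 2 = structure alignment), the `max_overlap` tolerance of 10 residues, and the family names in the usage lines are all assumptions made for illustration. The sketch assumes hits have already been filtered to the confident ones in each tier, and it omits the joining of partial assignments split by insertions or across chains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate Pfam assignment on a chain (hypothetical record)."""
    pfam: str      # family identifier (placeholder names are used below)
    start: int     # first residue covered, 1-based inclusive
    end: int       # last residue covered, inclusive
    evalue: float  # alignment E-value; smaller is better
    tier: int      # 0 = consensus, 1 = HMM-HMM, 2 = structure (assumed order)


def overlap(a: Hit, b: Hit) -> int:
    """Number of residues shared by two hits (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits, max_overlap=10):
    """Accept hits tier by tier, best E-value first within a tier.

    A hit is accepted only if it overlaps every previously accepted hit
    by at most `max_overlap` residues (an assumed tolerance).
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: (h.tier, h.evalue)):
        if all(overlap(hit, other) <= max_overlap for other in accepted):
            accepted.append(hit)
    # Report the final assignments in sequence order.
    return sorted(accepted, key=lambda h: h.start)


# Illustrative, made-up hits on a single chain.
hits = [
    Hit("FamA", 10, 250, 1e-50, tier=0),
    Hit("FamA_relative", 12, 248, 1e-80, tier=1),  # same clan, better E-value
    Hit("FamB", 245, 295, 1e-10, tier=1),
    Hit("FamC", 300, 380, 1e-5, tier=2),
]
assigned = greedy_assign(hits)
print([h.pfam for h in assigned])
```

In this toy input the HMM-HMM hit `FamA_relative` has a better E-value than the consensus hit `FamA`, yet it is rejected because the consensus tier is processed first and already occupies that region. Processing tiers in a fixed order is one simple way to keep a remotely related clan member from displacing the closer family.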

If you use PDBfam data, please cite the reference that describes the work: Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB. Q. Xu and R. Dunbrack. Bioinformatics (2012).