Pfam is a comprehensive database of hidden Markov models (HMMs) of protein families. It is widely used by
biologists because of its wide coverage and its sensible naming convention, which reflects protein functions
and commonly used names.
Currently available Pfam assignments in the PDB either use UniProt sequences,
which are then mapped to the PDB sequences (from Pfam itself or
SIFTS),
or directly apply HMMER3
to PDB sequences (from the PDB web site).
Each of these assignments suffers from one or more of a number of problems.
First of all, those data sets miss many potential assignments that occur
when the sequence is not closely related to any single Pfam family.
Second, in some cases, these sources also provide completely overlapping
assignments.
Third, some proteins have long insertions relative to the Pfam HMM definition,
and HMMER may produce two alignment segments, one on either side of the
insertion. These two segments cover non-overlapping regions of the HMM, and
together should comprise a single Pfam assignment.
Fourth, some protein structures are composed of two chains that together
comprise a single Pfam domain.
We overcome some of the deficiencies of other Pfam assignments using several strategies.
The first is to use consensus sequences derived from PSI-BLAST profiles and to run these through the Pfam HMM library.
A consensus sequence is a sequence of the same length as the query sequence which at every position contains the most
common or highest-scoring amino acid type at that position in a multiple sequence alignment. The consensus sequence can be
searched against the Pfam HMMs with HMMER3, and it produces more complete alignments at greater statistical significance than
the original sequence.
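As an illustration of the definition above, the following minimal sketch builds a consensus sequence from a multiple sequence alignment by taking the most common residue in each column that the query occupies. It is a simplification: it uses raw residue counts rather than PSI-BLAST profile scores, and the function name `consensus_sequence`, the convention that the first row is the query, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def consensus_sequence(alignment, gap="-"):
    """Return a consensus with one residue per non-gap position of the query.

    `alignment` is a list of equal-length aligned strings; the first row is
    assumed to be the query (a convention of this sketch only).
    """
    query = alignment[0]
    if any(len(row) != len(query) for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for col, query_residue in enumerate(query):
        if query_residue == gap:
            # Columns that are insertions relative to the query are skipped,
            # so the result has the same length as the ungapped query.
            continue
        counts = Counter(row[col] for row in alignment if row[col] != gap)
        best = max(counts.values())
        if counts[query_residue] == best:
            # Keep the query residue on ties so the result is deterministic.
            consensus.append(query_residue)
        else:
            consensus.append(min(r for r, c in counts.items() if c == best))
    return "".join(consensus)


if __name__ == "__main__":
    toy_alignment = ["MK-LV", "MR-LV", "MRALI", "MRALV"]
    print(consensus_sequence(toy_alignment))
```

The resulting string could then be written to a FASTA file and searched against the Pfam HMM library in place of the original query sequence.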
Second, we use HHblits to produce profile HMMs for PDB sequences and their parent UniProt sequences,
and then use HHsearch to search the Pfam HMM library by HMM-HMM alignment.
The third approach is to structurally align weak Pfam hits with statistically confident and complete structures in the corresponding Pfam families.
Weak hits are those with weak statistical significance and/or alignments that cover only a portion of the Pfam HMM. This allows us to verify whether a weak
assignment is correct and to extend short alignments.
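The notion of a weak hit can be made concrete with a small sketch. The text does not state numerical cutoffs, so the E-value threshold (`max_evalue`) and HMM coverage threshold (`min_coverage`) below are purely hypothetical defaults, as are the `PfamHit` record and its field names.

```python
from dataclasses import dataclass


@dataclass
class PfamHit:
    pfam_id: str
    evalue: float
    hmm_start: int   # 1-based, inclusive
    hmm_end: int     # 1-based, inclusive
    hmm_length: int  # number of match states in the Pfam HMM


def hmm_coverage(hit):
    """Fraction of the Pfam HMM covered by the alignment."""
    return (hit.hmm_end - hit.hmm_start + 1) / hit.hmm_length


def is_weak_hit(hit, max_evalue=1e-5, min_coverage=0.8):
    """A hit is weak if it is statistically weak and/or covers only part of the HMM.

    Both thresholds are illustrative assumptions, not values from the text.
    """
    return hit.evalue > max_evalue or hmm_coverage(hit) < min_coverage


if __name__ == "__main__":
    partial = PfamHit("PF_EXAMPLE", 1e-30, 1, 100, 250)
    print(is_weak_hit(partial))
```

Hits flagged in this way would be the candidates for verification and extension by structure alignment.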
Finally, we developed a procedure for optimally combining assignments from these multiple sources into Pfam architectures for each protein in the PDB. The procedure
combines non-overlapping partial assignments to the same Pfam into single assignments, thus accounting for large insertions or domains split across multiple protein
chains.
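The combining step can be sketched as follows. This is a simple greedy illustration of merging partial hits that cover non-overlapping regions of the same Pfam HMM, not the optimal procedure referred to above. The `Segment` record, its field names, and the rule that pieces on the same chain must appear in the same order in the sequence and in the HMM are assumptions of this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    pfam_id: str
    chain: str
    seq_start: int
    seq_end: int
    hmm_start: int
    hmm_end: int


def combine_segments(segments):
    """Group partial hits to the same Pfam into combined assignments.

    Returns a list of assignments, each a list of Segment objects ordered
    by HMM position.
    """
    by_family = defaultdict(list)
    for seg in segments:
        by_family[seg.pfam_id].append(seg)

    assignments = []
    for family_segments in by_family.values():
        family_segments.sort(key=lambda s: s.hmm_start)
        current = [family_segments[0]]
        for seg in family_segments[1:]:
            prev = current[-1]
            # The pieces must cover disjoint regions of the HMM.
            hmm_disjoint = seg.hmm_start > prev.hmm_end
            # On one chain, the pieces must also be in sequence order;
            # pieces on different chains are only checked on the HMM.
            order_ok = seg.chain != prev.chain or seg.seq_start > prev.seq_end
            if hmm_disjoint and order_ok:
                current.append(seg)
            else:
                assignments.append(current)
                current = [seg]
        assignments.append(current)
    return assignments


if __name__ == "__main__":
    hits = [
        Segment("PF_A", "A", 10, 80, 1, 70),      # before a long insertion
        Segment("PF_A", "A", 200, 260, 75, 130),  # after the insertion
        Segment("PF_C", "A", 1, 50, 1, 50),       # domain split across
        Segment("PF_C", "B", 1, 60, 51, 110),     # two chains
    ]
    for assignment in combine_segments(hits):
        print(assignment[0].pfam_id, len(assignment))
```

In this toy input, the two `PF_A` pieces flank an insertion on one chain and the two `PF_C` pieces lie on different chains; each pair is merged into one assignment because the pieces cover disjoint parts of the same HMM.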