Split Pfam Architecture Assignments

One feature of structures that we have accounted for in our Pfam assignments is the presence of large insertions, either folded domains or long linkers, relative to the multiple sequence alignments that define the Pfam models. Such insertions often result in separate alignments from HMMER3 or HHSearch that cover different parts of the PDB sequence and different parts of the Pfam HMM. The Pfam website does not account for these cases, and often lists them as distinct architectures containing two copies of the Pfam rather than one copy split by an insertion. Our current data set contains 5,023 split domains (1.9% of the total), of which 966 are multi-chain domains.
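The situation described above, two alignments to the same Pfam that cover different parts of both the sequence and the HMM, can be illustrated with a small Python sketch. This is not the procedure used for our assignments; the `Hit` record, the function names, the coordinates in the example, and the `max_hmm_overlap` tolerance of 5 HMM positions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """One alignment of a Pfam HMM to a PDB sequence (hypothetical record)."""
    pfam: str
    seq_start: int
    seq_end: int
    hmm_start: int
    hmm_end: int


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of positions shared by two inclusive ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def find_split_pairs(hits: List[Hit], max_hmm_overlap: int = 5) -> List[Tuple[Hit, Hit]]:
    """Pair hits to the same Pfam that look like two halves of one domain.

    Two hits are paired when they do not overlap in the sequence and share
    at most max_hmm_overlap HMM positions (an assumed tolerance). Hits that
    cover the same HMM region are treated as true repeats and are not paired.
    """
    pairs = []
    ordered = sorted(hits, key=lambda h: h.seq_start)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if first.pfam != second.pfam:
                continue
            if overlap(first.seq_start, first.seq_end,
                       second.seq_start, second.seq_end) > 0:
                continue
            if overlap(first.hmm_start, first.hmm_end,
                       second.hmm_start, second.hmm_end) > max_hmm_overlap:
                continue
            pairs.append((first, second))
    return pairs


# Made-up coordinates: one split ADK domain and two full-length CBS repeats.
example_hits = [
    Hit("ADK", 1, 120, 1, 122),
    Hit("ADK", 160, 190, 123, 150),
    Hit("CBS", 200, 250, 1, 57),
    Hit("CBS", 260, 310, 1, 57),
]
split_pairs = find_split_pairs(example_hits)
```

In this example only the two ADK hits are paired. The two CBS hits cover the same HMM positions, so they are left as two separate copies of the domain.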

Our assignments take several forms, reflecting the different ways in which domains can be inserted or split in the PDB sequences. The following table lists the formats of split domains in our assignments, where X and Y are two Pfam IDs. Line 1 gives the format of the chain Pfam architecture for proteins with an inserted domain, Domain1[Start-End]_Domain2_Domain1[Start-End], where Domain1 is a split domain and Start and End are positions within the HMM. Line 2 shows the format when there is a long insertion that is not assigned to a Pfam. Line 3 represents structures in which two portions of the HMM occur in reverse order in the PDB sequence. Line 4 denotes structures in which a Pfam is split between two different chain sequences in the structure (in the example, entity_id 2 and 3 in the PDB XML file).

Format                   #Domains   #Pfam         Pfam Example (HMM)
X[s1-e1]_Y_X[s2-e2]      1,087      X=96, Y=116   (ADK[1-122])_(ADK_lid)_(ADK[123-150])
X[s1-e1]_(#)_X[s2-e2]    2,519      422           (Hpt[1-69])_(35)_(Hpt[70-88])
X[s2-e2]_X[s1-e1]        451        75            (CIMR[46-145])_(CIMR[5-48])
Multi-chain domains      966        100           ((2)Trypsin[1-132])(3)(Trypsin[135-220])
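The three single-chain formats in the table can be parsed mechanically. The Python sketch below is an illustration only, not the code used to produce the assignments. The names `Segment`, `parse_architecture` and `classify`, and the category labels, are hypothetical. It assumes that every element of the architecture string is wrapped in parentheses, as in the examples, and that a bare number such as (35) marks an insertion with no Pfam assignment. The multi-chain format on Line 4 is not handled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One parenthesised element, e.g. "(ADK[1-122])", "(ADK_lid)" or "(35)".
TOKEN_RE = re.compile(r"\(([^()]+)\)")
# A Pfam ID with an optional HMM range, e.g. "ADK[1-122]" or "ADK_lid".
SEGMENT_RE = re.compile(r"^(?P<name>.+?)(?:\[(?P<start>\d+)-(?P<end>\d+)\])?$")


@dataclass
class Segment:
    name: Optional[str]        # Pfam ID, or None for an unassigned insertion
    hmm_start: Optional[int]   # set only for pieces of a split domain
    hmm_end: Optional[int]
    gap: Optional[int] = None  # the number given in a "(#)" element


def parse_architecture(arch: str) -> List[Segment]:
    """Split a single-chain architecture string into its elements."""
    segments = []
    for token in TOKEN_RE.findall(arch):
        if token.isdigit():
            segments.append(Segment(None, None, None, gap=int(token)))
            continue
        m = SEGMENT_RE.match(token)
        start = int(m.group("start")) if m.group("start") else None
        end = int(m.group("end")) if m.group("end") else None
        segments.append(Segment(m.group("name"), start, end))
    return segments


def classify(arch: str) -> str:
    """Assign an architecture string to one of the single-chain table formats."""
    segs = parse_architecture(arch)
    pieces = [(i, s) for i, s in enumerate(segs) if s.hmm_start is not None]
    for a, (i, first) in enumerate(pieces):
        for j, second in pieces[a + 1:]:
            if first.name != second.name:
                continue
            if first.hmm_start > second.hmm_start:
                return "reversed"              # X[s2-e2]_X[s1-e1]
            between = segs[i + 1:j]
            if any(s.name is not None for s in between):
                return "inserted_domain"       # X[s1-e1]_Y_X[s2-e2]
            if between:
                return "unassigned_insertion"  # X[s1-e1]_(#)_X[s2-e2]
    return "not_split"
```

For example, `classify("(Hpt[1-69])_(35)_(Hpt[70-88])")` returns `"unassigned_insertion"`. Elements are extracted by matching parentheses rather than by splitting on underscores, because Pfam IDs such as ADK_lid contain underscores themselves.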

Example: IMPDH

The IMPDH family contains 28 PDB structures, in 15 of which the IMPDH domain is split by two CBS domains. In the Pfam v26 data file, the sequence ranges of 7 of these IMPDH domains completely overlap the two CBS domains, and the CBS domains are missed in the other 8 PDB sequences.

#PDB   Entity Pfam Architecture                    Entity Pfam Architecture in Pfam
13     (IMPDH)                                     IMPDH
8      (IMPDH[1-85])_(CBS)_(CBS)_(IMPDH[86-351])   IMPDH
7      (IMPDH[1-85])_(CBS)_(CBS)_(IMPDH[86-351])   IMPDH_CBS_CBS