General Greedy Algorithm

From any set of alignments of PDB sequences to Pfam HMMs, we use the same general procedure, based on a simple greedy algorithm, to create a unique assignment of a Pfam to each residue in a PDB sequence. Such an assignment constitutes a Pfam “architecture,” or arrangement of domains in the PDB sequence, allowing only for very short overlaps.

For a given PDB sequence, we start by assigning the hit with the best E-value. If any region of more than 30 amino acids in the query lies within the boundaries of the alignment to the best HMM but is not aligned to HMM match states, we create a “split assignment.” A split assignment indicates that match states in the HMM align to separate, non-contiguous regions of the query sequence. The residues in the inserted region of the query are then “unassigned,” which means they are available for subsequent assignments. For each additional hit, taken in order of E-value (best to worst), we check whether it overlaps the current Pfam assignments by more than 10 residues on either end. If it does not, an assignment is made. Again, long insertions in the query result in split assignments, and the inserted residues are left unassigned.
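The greedy step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hit` class, the function names, and the representation of an alignment as a list of inclusive query intervals aligned to match states are all hypothetical, and the overlap test is applied to every pair of segments as one plausible reading of "more than 10 residues on either end."

```python
from dataclasses import dataclass

MAX_OVERLAP = 10   # residues two assignments may share (from the text)
MAX_INSERT = 30    # longest insertion kept inside one segment (from the text)

@dataclass
class Hit:
    pfam: str
    evalue: float
    blocks: list  # hypothetical: (start, end) query intervals aligned to match states

def split_segments(blocks, max_insert=MAX_INSERT):
    """Merge blocks separated by at most max_insert residues; longer gaps split."""
    segments = []
    for start, end in sorted(blocks):
        if segments and start - segments[-1][1] - 1 <= max_insert:
            prev_start, prev_end = segments[-1]
            segments[-1] = (prev_start, max(prev_end, end))
        else:
            segments.append((start, end))
    return segments

def shared(a, b):
    """Number of residues common to two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def greedy_assign(hits):
    """Accept hits from best to worst E-value unless they clash with earlier ones."""
    assignments = []  # list of (pfam, segments)
    for hit in sorted(hits, key=lambda h: h.evalue):
        segments = split_segments(hit.blocks)
        clash = any(shared(s, t) > MAX_OVERLAP
                    for s in segments
                    for _, taken in assignments
                    for t in taken)
        if not clash:
            assignments.append((hit.pfam, segments))
    return assignments

hits = [
    Hit("PF_A", 1e-50, [(1, 40), (100, 150)]),  # 59-residue insertion: split
    Hit("PF_B", 1e-20, [(45, 95)]),             # fits inside the insertion
    Hit("PF_C", 1e-5, [(30, 80)]),              # overlaps PF_A by 11 residues
]
result = greedy_assign(hits)
```

Because only the aligned segments are recorded, residues in a long insertion stay free for later hits, which is how `PF_B` is placed inside the split `PF_A` assignment in this toy input.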

If at any time the same Pfam model aligns more than once to a query sequence, we check whether the HMM match states align only once to the query and in order, allowing short overlaps of fewer than 10 amino acids in the HMM. If so, we combine the hits into one assignment to the HMM. The assignment is split if there are more than 30 residues between the assigned regions, and the intervening residues are left unassigned. If the assignments to the Pfam cover the HMM match states more than once, then there is more than one copy of the Pfam in the sequence (e.g., repeated domains), and multiple assignments of the Pfam are made.
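One way to picture the distinction between fragments of a single domain and repeated copies is the sketch below. It is an illustration only: the tuple layout `(query_start, query_end, hmm_start, hmm_end)` and the function name `group_copies` are assumptions, and the rule used (a fragment joins the current copy when it overlaps the previous fragment by fewer than 10 HMM positions) is a simplified reading of the in-order check described above.

```python
MAX_HMM_OVERLAP = 10  # fragments may share fewer than this many HMM positions

def group_copies(hits, max_hmm_overlap=MAX_HMM_OVERLAP):
    """Group hits of one Pfam into domain copies.

    hits: tuples (query_start, query_end, hmm_start, hmm_end), inclusive.
    Returns a list of copies, each a list of fragments in query order.
    """
    copies = []
    for hit in sorted(hits):  # sorted by position in the query
        hmm_start = hit[2]
        if copies:
            last_hmm_end = copies[-1][-1][3]
            # Same copy if the HMM is traversed in order with a short overlap.
            if last_hmm_end - hmm_start + 1 < max_hmm_overlap:
                copies[-1].append(hit)
                continue
        # Otherwise the match states are being covered again: a new copy.
        copies.append([hit])
    return copies

fragments = [
    (1, 60, 1, 50),       # first half of the model
    (120, 180, 48, 100),  # second half, 3 HMM positions of overlap
    (200, 300, 1, 100),   # the whole model again: a repeated domain
]
copies = group_copies(fragments)
```

In this toy input the first two fragments form one split assignment, since 59 query residues separate them, and the third is reported as a second copy of the Pfam.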

We also check whether the same Pfam aligns to different sequences within the same PDB entry. In some cases, these hits do not overlap in the HMM by more than 10 amino acids, and they are then combined into a single assignment.

In our procedure, we always use HMMER hits first, then HH hits.

Improve Pfam Assignments Using Structure Alignments

We use structure alignment to verify whether Pfam-PDB alignments with weak E-values are correct and to extend short alignments to Pfam HMMs. To do so, we need to identify structures (or domains within structures) which cover Pfams in their entirety with statistically significant E-values. We call such structures exemplars for their Pfams. Only a subset of Pfams in the PDB have such high-quality alignments.

To identify exemplars, we first applied the greedy algorithm to all Pfam alignments in the six sets of sequences and consensus sequences with a conservative HMMER E-value ≤ 10⁻⁵, obtaining split and combined Pfam assignments. Some split assignments may be possible in which one component has a significant E-value while the other is much weaker. We therefore continued the greedy algorithm with alignments with E-value > 10⁻⁵, up to an E-value of 1.0, if the same Pfam had already been assigned to the PDB sequence. We then continued the greedy algorithm with the HH hits, using an E-value cutoff of 10⁻⁴. For each Pfam assigned in this procedure, we identify an exemplar structure, defined as the structure with the largest number of match states assigned to residues with Cartesian coordinates in the PDB entry, subject to a coverage of the Pfam HMM of at least 80%. HMM coverage is the number of sequence residues with coordinates that are aligned to a Pfam HMM match state, divided by the length of the model. In the event of a tie, the structure with the best E-value is used.
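The coverage definition and the tie-breaking rule can be written compactly. The following is a hedged sketch: the `Candidate` record and its field names are hypothetical, and the counts of covered match states are assumed to have been computed beforehand from the alignment and the coordinate records.

```python
from typing import NamedTuple, Optional

MIN_COVERAGE = 0.80  # minimum fraction of the HMM covered (from the text)

class Candidate(NamedTuple):
    structure_id: str   # hypothetical identifier for a structure or domain
    covered: int        # residues with coordinates aligned to match states
    model_length: int   # number of match states in the Pfam HMM
    evalue: float

def hmm_coverage(c: Candidate) -> float:
    """Residues with coordinates aligned to match states / model length."""
    return c.covered / c.model_length

def pick_exemplar(candidates) -> Optional[Candidate]:
    """Largest number of covered match states; ties go to the best E-value."""
    eligible = [c for c in candidates if hmm_coverage(c) >= MIN_COVERAGE]
    if not eligible:
        return None  # this Pfam has no exemplar
    return min(eligible, key=lambda c: (-c.covered, c.evalue))

candidates = [
    Candidate("struct_1", 90, 100, 1e-30),
    Candidate("struct_2", 90, 100, 1e-40),  # same coverage, better E-value
    Candidate("struct_3", 70, 100, 1e-80),  # best E-value but only 70% coverage
]
exemplar = pick_exemplar(candidates)
```

Returning `None` when no candidate reaches 80% coverage mirrors the statement that only a subset of Pfams in the PDB have exemplars.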

We divided the HMMER Pfam hits of all six sets into two non-overlapping sets: {Strong Hits} and {Weak Hits}. Strong Hits are those hits with E-value ≤ 10⁻⁵ and < 10 residues missing from the N- or C-terminal end of the HMM, while Weak Hits comprise the remaining alignments. For each hit in {Weak Hits}, we checked whether there are exemplar structures for that Pfam and/or other Pfams in the same clan. If there are, we perform structure alignments with the FATCAT program on the region(s) of the Weak Hit structure not previously aligned to the {Strong Hits}. We performed this procedure separately for HH Pfam hits with E-value ≤ 10⁻⁴.
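The Strong/Weak partition amounts to a simple predicate on each hit. The sketch below assumes 1-based, inclusive HMM coordinates and reads the criterion as requiring fewer than 10 match states missing at each end of the model. Both the function name and that reading are assumptions, not statements from the text.

```python
MAX_MISSING = 10  # a strong hit misses fewer than this many states at each end

def is_strong(evalue, hmm_start, hmm_end, model_length, evalue_cutoff=1e-5):
    """True if the hit is significant and nearly spans the whole HMM.

    hmm_start and hmm_end are 1-based inclusive match-state positions.
    Use evalue_cutoff=1e-4 for HH hits, following the text.
    """
    missing_n = hmm_start - 1            # states missing at the N-terminal end
    missing_c = model_length - hmm_end   # states missing at the C-terminal end
    return (evalue <= evalue_cutoff
            and missing_n < MAX_MISSING
            and missing_c < MAX_MISSING)

# Hypothetical hits against a 100-state model.
full_length = is_strong(1e-10, 3, 98, 100)     # significant and nearly complete
truncated = is_strong(1e-10, 15, 100, 100)     # 14 states missing at the N terminus
insignificant = is_strong(1e-3, 1, 100, 100)   # complete but E-value too weak
hh_hit = is_strong(5e-5, 1, 100, 100, evalue_cutoff=1e-4)
```

Every hit for which the predicate is false goes to {Weak Hits} and becomes a candidate for structure alignment against an exemplar.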

If the FATCAT p-value is better than 10⁻³, we create an alignment of the PDB query to the Pfam HMM via the exemplar structure through a transitive alignment: for residue pairs AB and BC, (A to B) + (B to C) = (A to C). Here, A to B is the HMM-to-exemplar alignment, B to C is the structure alignment of the exemplar to the weak assignment, and A to C is the alignment of the HMM to the weak assignment. Once this alignment is created, we move the alignment from {Weak Hits} to a new set {Struct Hits}.
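The transitive step is a composition of two residue mappings. As a minimal sketch, assume each alignment is stored as a dictionary from a position in one sequence to the aligned position in the other. The dictionary representation and the function name are illustrative choices, not details given in the text.

```python
def compose_alignments(hmm_to_exemplar, exemplar_to_query):
    """(A to B) + (B to C) = (A to C).

    hmm_to_exemplar:   HMM match state -> exemplar residue (A to B)
    exemplar_to_query: exemplar residue -> query residue   (B to C)
    Match states whose exemplar residue has no structural partner are dropped.
    """
    return {
        hmm_pos: exemplar_to_query[exemplar_res]
        for hmm_pos, exemplar_res in hmm_to_exemplar.items()
        if exemplar_res in exemplar_to_query
    }

# Hypothetical toy alignments.
hmm_to_exemplar = {1: 10, 2: 11, 3: 12}
exemplar_to_query = {10: 5, 12: 7}   # exemplar residue 11 is unaligned
hmm_to_query = compose_alignments(hmm_to_exemplar, exemplar_to_query)
```

Here match state 2 drops out of the result because exemplar residue 11 has no partner in the structure alignment, so the transitive alignment can only be as complete as the weaker of its two inputs.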

Full Procedure

The full procedure for creating Pfam assignments to PDB sequences is as follows. We have in hand six sets of alignments: {HMMER Strong Hits}, {HH Strong Hits}, {HMMER Struct Hits}, {HH Struct Hits}, {HMMER Weak Hits}, and {HH Weak Hits}. The last two contain those weak hits (too short and/or too weak an E-value) for which structure alignment was not possible or did not produce a significant alignment. First, we use the {HMMER Strong Hits} in the greedy algorithm until no more assignments can be made, and then continue with the {HH Strong Hits}. Second, we continue the greedy algorithm with the alignments in the {HMMER Struct Hits} and {HH Struct Hits} sets until no more assignments can be made. Third, we apply the greedy algorithm to the remaining {HMMER Weak Hits} and {HH Weak Hits} with E-value ≤ 10⁻⁵ (HMMER) or ≤ 10⁻⁴ (HH). These hits have strong statistical significance but more than 10 residues missing from the N- or C-terminal end of the HMM. Fourth, we proceed with the remaining {HMMER Weak Hits} and {HH Weak Hits} up to an E-value of 1.0, but we add these only if the same Pfam has already been assigned in one of the earlier steps. Some of these will be combined with earlier assignments to produce split assignments. Some will be repeated domains. Pfam B assignments are treated as weak hits and added only if the E-value is better than the appropriate threshold.
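The four stages can be summarized as an ordering over hits plus one extra condition for the last stage. The sketch below is illustrative only: the dictionary keys (`kind`, `method`, `evalue`, `pfam`), the function names, and the choice to sort by E-value within each stage and method are assumptions layered on the description above.

```python
CUTOFF = {"HMMER": 1e-5, "HH": 1e-4}   # significance thresholds from the text
METHOD_RANK = {"HMMER": 0, "HH": 1}    # HMMER hits are used before HH hits

def stage_of(hit):
    """Return the stage (1-4) in which a hit is considered, or None if never."""
    if hit["kind"] == "strong":
        return 1
    if hit["kind"] == "struct":
        return 2
    if hit["evalue"] <= CUTOFF[hit["method"]]:
        return 3          # significant but truncated weak hit
    if hit["evalue"] <= 1.0:
        return 4          # considered only for already-assigned Pfams
    return None

def processing_order(hits):
    """Order hits by stage, then method, then E-value (best first)."""
    staged = [(stage_of(h), h) for h in hits]
    staged = [(s, h) for s, h in staged if s is not None]
    return sorted(staged, key=lambda sh: (sh[0],
                                          METHOD_RANK[sh[1]["method"]],
                                          sh[1]["evalue"]))

def may_assign(stage, hit, assigned_pfams):
    """Stage 4 hits need the same Pfam to be assigned already."""
    return stage < 4 or hit["pfam"] in assigned_pfams

hits = [
    {"id": "a", "pfam": "PF1", "kind": "weak", "method": "HMMER", "evalue": 0.5},
    {"id": "b", "pfam": "PF2", "kind": "strong", "method": "HH", "evalue": 1e-30},
    {"id": "c", "pfam": "PF1", "kind": "strong", "method": "HMMER", "evalue": 1e-8},
    {"id": "d", "pfam": "PF3", "kind": "struct", "method": "HMMER", "evalue": 0.01},
    {"id": "e", "pfam": "PF4", "kind": "weak", "method": "HH", "evalue": 5e-5},
    {"id": "f", "pfam": "PF5", "kind": "weak", "method": "HMMER", "evalue": 3.0},
]
order = [h["id"] for _, h in processing_order(hits)]
```

Each hit that passes `may_assign` would then go through the same overlap test as in the general greedy algorithm. That step is omitted here to keep the sketch focused on the ordering.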


Flow Chart
