General Greedy Algorithm
From any set of alignments of PDB sequences to Pfam HMMs, we use the same
general procedure based on a simple greedy algorithm to create a unique
assignment of a Pfam to each residue in a PDB sequence. Such an assignment
constitutes a Pfam “architecture,” that is, an arrangement of domains in the PDB sequence
that allows only very short overlaps.
For a given PDB sequence, we start by assigning the hit with the best E-value.
If there is any region in the query of more than 30 amino acids that occurs
within the boundaries of the alignment to the best HMM but which is not aligned
to HMM match states, we create a “split assignment.” A split assignment indicates
that match states in the HMM align to separate, non-contiguous regions of the
query sequence. The residues in the inserted region of the query are then “unassigned,”
which means they are available for subsequent assignments. For each additional hit
in order of E-value (best to worst), we check whether it overlaps the current
Pfam assignments by more than 10 residues on either end. If it does not,
then an assignment is made. Again, long insertions in the query result
in split assignments and the insertions are unassigned.
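The greedy step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the Hit class, the function names, and the Pfam labels are hypothetical, each hit is assumed to be given as the query intervals aligned to HMM match states, and the overlap test is assumed to compare every new segment against every segment already assigned.

```python
from dataclasses import dataclass

MAX_INSERT = 30   # insertions longer than this split an assignment
MAX_OVERLAP = 10  # residues of overlap tolerated between assignments


@dataclass
class Hit:  # hypothetical container for one Pfam-to-query alignment
    pfam: str
    evalue: float
    # Query intervals (1-based, inclusive) aligned to HMM match states,
    # in sequence order; the gaps between them are insertions in the query.
    blocks: list


def split_segments(blocks, max_insert=MAX_INSERT):
    """Merge aligned blocks unless the insertion between them exceeds max_insert."""
    segments = [list(blocks[0])]
    for start, end in blocks[1:]:
        gap = start - segments[-1][1] - 1
        if gap > max_insert:
            segments.append([start, end])  # long insertion: split, leave it unassigned
        else:
            segments[-1][1] = end          # short insertion: keep it in the segment
    return [tuple(s) for s in segments]


def overlap(a, b):
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def greedy_assign(hits, max_overlap=MAX_OVERLAP):
    """Accept hits from best to worst E-value unless they overlap too much."""
    assigned = []  # list of (pfam, segments)
    for hit in sorted(hits, key=lambda h: h.evalue):
        segments = split_segments(hit.blocks)
        clash = any(
            overlap(seg, old) > max_overlap
            for seg in segments
            for _, old_segments in assigned
            for old in old_segments
        )
        if not clash:
            assigned.append((hit.pfam, segments))
    return assigned


example_hits = [
    Hit("PF_A", 1e-50, [(1, 100), (180, 250)]),  # 79-residue insertion: split
    Hit("PF_B", 1e-20, [(105, 175)]),            # fits inside the insertion
    Hit("PF_C", 1e-10, [(90, 160)]),             # overlaps PF_A by 11 residues
]
architecture = greedy_assign(example_hits)
```

In this toy example PF_A is split around its long insertion, PF_B is assigned to the freed region, and PF_C is rejected because it overlaps PF_A by more than 10 residues.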
If at any time the same Pfam model aligns more than once to a query sequence,
we check whether the HMM match states align only once to the query and in order,
allowing short overlaps of less than 10 amino acids in the HMM. If so,
we combine the hits into one assignment to the HMM. The assignment is split
if there are more than 30 residues between the assigned regions,
and the intervening residues are left unassigned. If the assignments to
the Pfam cover the HMM match states more than once, then there is more than
one copy of the Pfam in the sequence (e.g. repeated domains) and multiple assignments of the Pfam are made.
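The distinction between one combined assignment and repeated copies of a Pfam can be illustrated with the following sketch. It is written under stated assumptions: the PfamHit tuple and both function names are hypothetical, hits are processed in query order, and "in order" is taken to mean that both HMM endpoints of a hit lie beyond those of the previous hit in the same copy.

```python
from typing import NamedTuple


class PfamHit(NamedTuple):  # hypothetical record for one hit of a single Pfam
    q_start: int    # query coordinates (1-based, inclusive)
    q_end: int
    hmm_start: int  # HMM match-state coordinates (1-based, inclusive)
    hmm_end: int


def group_copies(hits, max_hmm_overlap=10):
    """Group hits of one Pfam into copies; each copy covers the HMM at most once."""
    copies = []
    for hit in sorted(hits, key=lambda h: h.q_start):
        if copies:
            last = copies[-1][-1]
            hmm_overlap = last.hmm_end - hit.hmm_start + 1
            in_order = hit.hmm_start > last.hmm_start and hit.hmm_end > last.hmm_end
            if in_order and hmm_overlap < max_hmm_overlap:
                copies[-1].append(hit)  # continues the same copy of the domain
                continue
        copies.append([hit])            # HMM covered again: a new (repeated) copy
    return copies


def query_segments(copy, max_insert=30):
    """Query segments of one copy; gaps longer than max_insert stay unassigned."""
    segments = [[copy[0].q_start, copy[0].q_end]]
    for hit in copy[1:]:
        if hit.q_start - segments[-1][1] - 1 > max_insert:
            segments.append([hit.q_start, hit.q_end])
        else:
            segments[-1][1] = max(segments[-1][1], hit.q_end)
    return [tuple(s) for s in segments]


# Two hits covering consecutive parts of the HMM: one combined, split assignment.
combined = group_copies([PfamHit(1, 80, 1, 75), PfamHit(130, 200, 70, 140)])
# Two hits that each cover the whole HMM: two copies (a repeated domain).
repeats = group_copies([PfamHit(1, 100, 1, 100), PfamHit(110, 210, 1, 100)])
```

Here the first pair of hits overlaps by only 6 match states and is merged into one copy whose query segments are split by a 49-residue gap, whereas the second pair yields two separate copies.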
We also check whether the same Pfam aligned to different sequences
within the same PDB entry. In some cases, these hits do not overlap
in the HMM by more than 10 amino acids, and they are then combined into a single assignment.
In our procedure, we always used HMMER hits first, then HH hits.
Improve Pfam Assignments Using Structure Alignments
We use structure alignment to verify whether Pfam-PDB alignments with weak E-values are correct
and to extend short alignments to Pfam HMMs. To do so, we need to identify structures
(or domains within structures) which cover Pfams in their entirety with statistically
significant E-values. We call such structures exemplars for their Pfams.
Only a subset of Pfams in the PDB have such high-quality alignments.
To identify exemplars, we first applied the greedy algorithm to all Pfam alignments
in the six sets of sequences and consensus sequences with a conservative HMMER E-value ≤ 10⁻⁵,
obtaining split and combined Pfam assignments. In some split assignments,
one component has a significant E-value while the other is much weaker.
We therefore continued the greedy algorithm with alignments with E-value > 10⁻⁵, up to an E-value of 1.0,
provided that the same Pfam had already been assigned to the PDB sequence.
We then continued the greedy algorithm with the HH hits with an E-value cutoff of 10⁻⁴.
For each Pfam assigned in this procedure, we identified an exemplar structure,
defined as the structure with the largest number of match states assigned to residues
with Cartesian coordinates in the PDB entry, provided that its coverage of the Pfam HMM is at least 80%.
HMM coverage is the number of sequence residues with coordinates that are aligned to a Pfam HMM match state, divided
by the length of the model. In the event of a tie, the structure with the best E-value is used.
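The exemplar selection rule can be written compactly as below. This is only a sketch: the Candidate class, its field names, and the structure labels are hypothetical, and the number of match states aligned to residues with coordinates is assumed to have been counted beforehand.

```python
from dataclasses import dataclass


@dataclass
class Candidate:  # hypothetical summary of one structure assigned to a Pfam
    structure: str
    match_states_with_coords: int  # match states aligned to residues with coordinates
    evalue: float


def hmm_coverage(candidate, model_length):
    """Fraction of the HMM covered by residues that have coordinates."""
    return candidate.match_states_with_coords / model_length


def pick_exemplar(candidates, model_length, min_coverage=0.8):
    """Most match states wins; ties are broken by the best (lowest) E-value."""
    eligible = [c for c in candidates
                if hmm_coverage(c, model_length) >= min_coverage]
    if not eligible:
        return None  # this Pfam has no exemplar
    return min(eligible, key=lambda c: (-c.match_states_with_coords, c.evalue))


candidates = [
    Candidate("struct_1", 190, 1e-40),
    Candidate("struct_2", 190, 1e-60),  # ties with struct_1, better E-value
    Candidate("struct_3", 150, 1e-80),  # coverage 0.75: below the 80% threshold
]
exemplar = pick_exemplar(candidates, model_length=200)
```

For a model of length 200, struct_3 is excluded despite its strong E-value, and struct_2 is chosen over struct_1 by the tie-breaking rule.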
We divided the HMMER Pfam hits of all six sets into two non-overlapping sets: {Strong Hits} and {Weak Hits}.
Strong Hits are those hits with E-value ≤ 10⁻⁵ and < 10 residues missing from the N or C terminal end of the HMM,
while Weak Hits comprise the remaining alignments. For each hit in {Weak Hits},
we checked whether there are exemplar structures for that Pfam and/or other Pfams in the same clan.
If there are, we perform structure alignments with the FATCAT program on the region(s)
of the Weak Hit structure not previously aligned to the {Strong Hits}.
We performed this procedure separately for HH Pfam hits with E-value ≤ 10⁻⁴.
If the FATCAT p-value is better than 10⁻³, we create an alignment of the PDB query to the Pfam HMM
via the exemplar structure through a transitive alignment.
That is, aligned residue pairs are composed as (A to B) + (B to C) = (A to C), where A to B is the HMM-to-exemplar alignment,
B to C is the structure alignment of the exemplar to the weak assignment, and A to C is the resulting alignment of the HMM to the weak assignment.
Once this alignment is created, we move the alignment from {Weak Hits} to a new set {Struct Hits}.
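The transitive alignment amounts to composing two residue mappings. The sketch below assumes each alignment is stored as a dictionary from one residue index to another; the function name and the example indices are hypothetical, and the structure alignment itself (produced by FATCAT in this work) is taken as given rather than computed.

```python
def compose_alignments(hmm_to_exemplar, exemplar_to_query):
    """Map HMM match states to query residues through shared exemplar residues."""
    return {
        state: exemplar_to_query[ex_res]
        for state, ex_res in hmm_to_exemplar.items()
        if ex_res in exemplar_to_query  # keep only residues present in both alignments
    }


# A to B: HMM match state -> exemplar residue
hmm_to_exemplar = {1: 10, 2: 11, 3: 12, 4: 14}
# B to C: exemplar residue -> query residue, from the structure alignment
exemplar_to_query = {10: 55, 11: 56, 12: 58, 13: 59}
# A to C: HMM match state -> query residue
hmm_to_query = compose_alignments(hmm_to_exemplar, exemplar_to_query)
```

Match state 4 is dropped in this example because its exemplar residue (14) has no partner in the structure alignment, so the composed alignment covers only the match states that can be traced through both mappings.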
Full Procedure
The full procedure of creating Pfam assignments to PDB sequences is as follows.
We have in hand six sets of alignments, {HMMER Strong Hits}, {HH Strong Hits},
{HMMER Struct Hits}, {HH Struct Hits}, {HMMER Weak Hits}, and {HH Weak Hits},
the last two containing those weak hits (too short and/or too weak an E-value)
for which structure alignment was not possible or did not produce a significant
alignment. First we use the {HMMER Strong Hits} in the greedy algorithm until
no more assignments can be made, and then continue with the {HH Strong Hits}.
Second, we continue the greedy algorithm with the alignments in the {HMMER Struct Hits}
and {HH Struct Hits} sets until no more assignments can be made.
Third, we apply the greedy algorithm to the remaining {HMMER Weak Hits}
and {HH Weak Hits} with E-value ≤ 10⁻⁵ (HMMER) or ≤ 10⁻⁴ (HH). These hits have
strong statistical significance but more than 10 residues missing
from the N or C terminal end of the HMM. Fourth, we proceed with the
remaining {HMMER Weak Hits} and {HH Weak Hits} up to an E-value of 1.0,
but we add these only if the same Pfam has already been assigned in one of the earlier steps.
Some of these will be combined with earlier assignments to produce split assignments.
Some will be repeated domains. Pfam B assignments are treated as weak hits and added
only if the E-value is better than the appropriate threshold.
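The order in which hits enter the greedy algorithm can be summarized as a four-tier ranking. The following is a rough sketch under several assumptions: the Hit fields and function names are hypothetical, a Strong Hit is taken to have fewer than 10 match states missing at both ends of the HMM, HMMER hits precede HH hits within every tier, and the overlap test of the greedy algorithm itself is omitted.

```python
from dataclasses import dataclass

E_CUTOFF = {"HMMER": 1e-5, "HH": 1e-4}  # significance cutoffs per method
METHOD_RANK = {"HMMER": 0, "HH": 1}     # HMMER hits are used before HH hits
MAX_MISSING = 10                        # match states missing at an HMM end


@dataclass
class Hit:  # hypothetical record; field names are illustrative
    pfam: str
    method: str               # "HMMER" or "HH"
    evalue: float
    missing_n: int            # match states missing at the N-terminal end
    missing_c: int            # match states missing at the C-terminal end
    struct_hit: bool = False  # True if realigned through an exemplar structure


def tier(hit):
    """Return the step (1 to 4) in which a hit is considered, or None."""
    significant = hit.evalue <= E_CUTOFF[hit.method]
    complete = hit.missing_n < MAX_MISSING and hit.missing_c < MAX_MISSING
    if significant and complete:
        return 1  # Strong Hits
    if hit.struct_hit:
        return 2  # Struct Hits
    if significant:
        return 3  # significant, but truncated at an end of the HMM
    if hit.evalue <= 1.0:
        return 4  # weak; only used if the Pfam is already assigned
    return None


def processing_order(hits):
    """Sort usable hits by tier, then method, then E-value."""
    ranked = [(tier(h), h) for h in hits]
    ranked = [(t, h) for t, h in ranked if t is not None]
    ranked.sort(key=lambda th: (th[0], METHOD_RANK[th[1].method], th[1].evalue))
    return ranked


def admissible(t, hit, assigned_pfams):
    """Tier-4 hits are added only when their Pfam was assigned in an earlier step."""
    return t < 4 or hit.pfam in assigned_pfams


hits = [
    Hit("PF_A", "HH", 1e-6, 0, 2),
    Hit("PF_B", "HMMER", 1e-3, 5, 5, struct_hit=True),
    Hit("PF_A", "HMMER", 0.5, 40, 0),
    Hit("PF_C", "HMMER", 1e-8, 25, 0),
    Hit("PF_D", "HMMER", 1e-30, 0, 0),
    Hit("PF_E", "HMMER", 5.0, 0, 0),
]
order = [(t, h.pfam) for t, h in processing_order(hits)]
```

In this example the weak HMMER hit to PF_A falls in the fourth tier and would be admissible only because PF_A was already assigned by the strong HH hit; the hit to PF_E exceeds an E-value of 1.0 and is never considered.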
Flow Chart