To determine the cutoffs for the HMMER E-values and the structure alignment p-values,
for each Pfam-A family present in the six sets of alignments, we collected those
Pfam hits with HMMER E-value < 10^-5 and HMM coverage > 0.9, and then selected
the alignment with the largest number of match states assigned to residues
with Cartesian coordinates in the PDB structures as the representative hit.
A total of 5,134 Pfams were selected. With HMMER3, we aligned each PDB sequence
of these representative hits to all of the 5,134 Pfam HMMs. The resulting
data points were divided into two classes: same-clan and different-clan,
depending on whether the two Pfams were in the same or different clans.
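The two-class labelling can be sketched as follows. This is only an illustration: the accessions, the `clan_of` mapping, and the `label_pair` helper are hypothetical, and skipping pairs in which a family has no clan assignment is an assumption of this sketch, not a rule stated in the text.

```python
def label_pair(query_pfam, target_pfam, clan_of):
    """Label a (query, target) Pfam pair as same-clan or different-clan."""
    q_clan = clan_of.get(query_pfam)
    t_clan = clan_of.get(target_pfam)
    # Assumption: pairs with an unassigned family are left out of both classes.
    if q_clan is None or t_clan is None:
        return None
    return "same-clan" if q_clan == t_clan else "different-clan"

# Hypothetical clan membership and hits: (query Pfam, target Pfam, log10 E-value).
clan_of = {"PF_A": "CL_1", "PF_B": "CL_1", "PF_C": "CL_2"}
hits = [("PF_A", "PF_B", -12.3), ("PF_A", "PF_C", -0.7), ("PF_A", "PF_D", -2.0)]

classes = {"same-clan": [], "different-clan": []}
for query, target, log_e in hits:
    label = label_pair(query, target, clan_of)
    if label is not None:
        classes[label].append(log_e)

print(classes)
```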
Smoothed density curves were calculated with kernel density estimation
in R (http://www.r-project.org/), giving probability density estimates for the same-clan
and different-clan classes as a function of log10(E-value).
The posterior probability p(same|A) that a hit is same-clan, given A = log10(E-value), was then calculated using Bayes' rule.
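A minimal sketch of this calculation, written in Python rather than R so that it needs only the standard library. The Gaussian kernel, the fixed bandwidth, the use of class frequencies as priors, and the toy log10(E-value) samples are all assumptions made for illustration; the text does not specify these choices.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def posterior_same(a, same, diff, bandwidth=0.5):
    """p(same | A = a) by Bayes' rule from the two class-conditional densities."""
    p_a_same = gaussian_kde(same, bandwidth)(a)
    p_a_diff = gaussian_kde(diff, bandwidth)(a)
    # Assumption: priors are the observed class frequencies.
    prior_same = len(same) / (len(same) + len(diff))
    prior_diff = 1.0 - prior_same
    numerator = p_a_same * prior_same
    denominator = numerator + p_a_diff * prior_diff
    return numerator / denominator if denominator > 0 else float("nan")

# Invented log10(E-value) samples for the two classes.
same_clan = [-30.0, -22.0, -15.0, -12.0, -9.0, -7.0]
diff_clan = [-4.0, -3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0]

print(posterior_same(-10.0, same_clan, diff_clan))
print(posterior_same(-1.0, same_clan, diff_clan))
```

With these toy samples the posterior is close to 1 deep in the same-clan region and close to 0 where only different-clan hits occur.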
HMMER E-values
From the value of A such that p(same|A) > 95%, we selected a Pfam E-value threshold of 10^-5
for HMMER alignments.
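One way to read off such a threshold from a posterior curve is a simple grid scan, sketched below. The function name `largest_a_above`, the scan range, the grid resolution, and the logistic toy posterior are assumptions for illustration only; the curve is not the one obtained from the Pfam data.

```python
import math

def largest_a_above(posterior, a_min, a_max, level=0.95, steps=1000):
    """Scan A upward from a_min on an even grid and return the last grid value
    reached before posterior(A) first fails to exceed `level`.
    Returns None if the posterior does not exceed `level` at a_min."""
    best = None
    for i in range(steps + 1):
        a = a_min + (a_max - a_min) * i / steps
        if posterior(a) > level:
            best = a
        else:
            break
    return best

# Hypothetical posterior: a logistic curve falling from 1 to 0 around A = -3.
def toy_posterior(a):
    return 1.0 / (1.0 + math.exp(2.0 * (a + 3.0)))

a_star = largest_a_above(toy_posterior, -20.0, 5.0)
# Convert the log10 value back to an E-value (or p-value) scale.
print(a_star, 10.0 ** a_star)
```

The scan stops at the first crossing, so a posterior that dips below the level and later recovers would still give the more conservative threshold.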
HHsearch E-values
To select a Pfam E-value threshold for HHsearch hits, we applied the same procedure to the HHsearch alignments,
which contain 5,387 Pfam hits. The Pfam E-value threshold for p(same|A) > 95% is 10^-4.
FATCAT P-values
We performed structure alignments with the FATCAT program (Ye and Godzik, 2003), aligning each structure
with every other structure in the 5,134 Pfam set. The data points, consisting of log10(p-values),
were classified as either same-clan or different-clan. Kernel density estimates and Bayes' rule
were used to obtain p(same|A), where A is the log10(p-value) from FATCAT.
From the value of A such that p(same|A) > 95%, we selected a threshold for the FATCAT p-values of 10^-3.