To determine cutoffs for the HMMER E-values and the structure alignment p-values, we first chose a representative hit for each Pfam-A family present in the six sets of alignments. For each family, we collected the Pfam hits with HMMER E-value < 10⁻⁵ and HMM coverage > 0.9, and selected as the representative the alignment with the largest number of match states assigned to residues with Cartesian coordinates in the PDB structure. A total of 5,134 Pfams were selected. Using HMMER3, we aligned the PDB sequence of each representative hit to all 5,134 Pfam HMMs. The resulting data points were divided into two classes, same-clan and different-clan, depending on whether the two Pfams belonged to the same clan or to different clans.
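The selection of representative hits can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `PfamHit` record, its field names, and the `representative_hits` function are hypothetical, and ties are resolved by keeping the first hit seen, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PfamHit:
    # All field names are hypothetical; they mirror the quantities named in the text.
    pfam_id: str             # Pfam family identifier
    pdb_chain: str           # PDB chain the HMM was aligned to
    evalue: float            # HMMER E-value of the hit
    hmm_coverage: float      # fraction of the HMM covered by the alignment (0..1)
    states_with_coords: int  # match states aligned to residues that have coordinates


def representative_hits(hits, max_evalue=1e-5, min_coverage=0.9):
    """Return a dict mapping each Pfam id to its representative hit.

    Hits must satisfy E-value < max_evalue and coverage > min_coverage.
    Among the remaining hits of a family, the one with the most match
    states aligned to residues with coordinates is kept.
    """
    best = {}
    for hit in hits:
        if hit.evalue >= max_evalue or hit.hmm_coverage <= min_coverage:
            continue  # fails one of the two filters
        current = best.get(hit.pfam_id)
        if current is None or hit.states_with_coords > current.states_with_coords:
            best[hit.pfam_id] = hit
    return best
```

A family with no hit passing both filters simply has no entry in the returned dictionary.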

Smoothed density curves were calculated by kernel density estimation in R (http://www.r-project.org/), giving probability density estimates for the same-clan and different-clan classes as a function of A = log10(E-value). The probability p(same|A) that a hit with a given value of A is a same-clan hit was then calculated from the two densities using Bayes' rule.
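The combination of kernel density estimates with Bayes' rule can be illustrated with a short sketch. The study used R; the version below uses only the Python standard library so that it is self-contained. Two choices are assumptions that the text does not state: the bandwidth (a normal-reference rule of thumb) and the priors (taken as the class frequencies). The function names are hypothetical.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a function x -> Gaussian kernel density estimate at x."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb (assumption; the study's bandwidth is not stated).
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def make_posterior(same_scores, diff_scores):
    """Return a function A -> p(same | A) computed with Bayes' rule.

    same_scores and diff_scores hold log10 scores of the two classes.
    Priors are the class frequencies (assumption).
    """
    f_same = gaussian_kde(same_scores)
    f_diff = gaussian_kde(diff_scores)
    prior_same = len(same_scores) / (len(same_scores) + len(diff_scores))
    prior_diff = 1.0 - prior_same

    def posterior(a):
        numerator = f_same(a) * prior_same
        denominator = numerator + f_diff(a) * prior_diff
        # Both densities can underflow to zero far from the data.
        return numerator / denominator if denominator > 0.0 else float("nan")

    return posterior
```

With class frequencies as priors, p(same|A) depends on how many different-clan pairs enter the comparison, so a different prior would shift the resulting curve.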

HMMER E-values

From the value of A at which p(same|A) > 95%, we selected a Pfam E-value threshold of 10⁻⁵ for HMMER alignments.
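One way to read a threshold off a posterior curve is to scan a grid of A values from the most significant end and keep the last point at which p(same|A) still exceeds 95%. The sketch below shows that scan; the scanning rule itself, the function names, and the logistic toy curve are assumptions for illustration and are not taken from the study's data.

```python
import math


def select_threshold(posterior, grid, level=0.95):
    """Return the largest grid value A such that posterior(x) > level
    for every grid point x <= A, or None if the first point already fails."""
    threshold = None
    for a in sorted(grid):
        if posterior(a) > level:
            threshold = a
        else:
            break  # stop at the first grid point that drops to or below the level
    return threshold


def toy_posterior(a, midpoint=-6.0, steepness=1.5):
    """Hypothetical stand-in for p(same | A): a logistic curve, not real data."""
    return 1.0 / (1.0 + math.exp(steepness * (a - midpoint)))


grid = [x / 10.0 for x in range(-200, 1)]  # A from -20.0 to 0.0 in steps of 0.1
a_star = select_threshold(toy_posterior, grid)
evalue_threshold = 10.0 ** a_star  # convert log10 value back to an E-value
```

In practice the threshold would be rounded to a convenient power of ten, as with the values reported in this section.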


HHsearch E-values

To select a Pfam E-value threshold for HHsearch (HH) hits, we applied the same procedure to the HH alignments, which contain 5,387 Pfam hits. The Pfam E-value threshold for p(same|A) > 95% is 10⁻⁴.


FATCAT P-values

Using the FATCAT program (Ye and Godzik, 2003), we aligned each structure with every other structure in the set of 5,134 Pfams. The data points, consisting of log10(p-value), were labeled as either same-clan or different-clan. Kernel density estimates and Bayes’ rule were used to obtain p(same|A), where A is the log10(p-value) from FATCAT. From the value of A at which p(same|A) > 95%, we selected a FATCAT p-value threshold of 10⁻³.
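The labeling of all-against-all pairs can be sketched as below. The function name and the identifiers in the usage example are hypothetical. How Pfams without a clan assignment were handled is not stated in the text; this sketch assumes that such a Pfam is different-clan with respect to every other Pfam.

```python
from itertools import combinations


def label_pairs(pfam_ids, clan_of):
    """Yield (pfam_a, pfam_b, label) for every unordered pair of distinct Pfams.

    clan_of maps a Pfam id to its clan id. Pfams missing from clan_of are
    treated as belonging to no clan (assumption), so any pair involving
    them is labeled different-clan.
    """
    for a, b in combinations(pfam_ids, 2):
        clan_a, clan_b = clan_of.get(a), clan_of.get(b)
        same = clan_a is not None and clan_a == clan_b
        yield a, b, "same-clan" if same else "different-clan"


# Hypothetical identifiers, for illustration only.
clan_of = {"PF_A": "CL_X", "PF_B": "CL_X", "PF_C": "CL_Y"}
pairs = list(label_pairs(["PF_A", "PF_B", "PF_C", "PF_D"], clan_of))
```

Each labeled pair would then be joined with its log10(p-value) to form the two samples used for the density estimates.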
