We compiled our data sets from PDB sequences and from UniProt sequences in the PDB, and used SIFTS to map UniProt sequences to the PDB.

For each PDB sequence, we used one iteration of our modified PSI-BLAST to generate a profile from sequences in the UniRef90 database. The parameters for PSI-BLAST were "-e 10 -h 0.0001 -v 5000 -b 5000 -N 25 -f 16". A PSI-BLAST profile is a position-specific scoring matrix (PSSM), which provides a log-odds score and a percentage of occurrences for each of the 20 amino acid types at each position in the query sequence. A consensus sequence is a one-dimensional simplification of a PSI-BLAST profile, obtained by replacing the 20-dimensional vector at each residue position with either the highest-scoring or the most common amino acid observed at that position. In this paper, a “percentage consensus sequence” is composed of the most frequent residue in each column, while a “PSSM consensus sequence” is composed of the highest-scoring amino acid at each position. We applied the same procedure to the full UniProt sequences from which the PDB sequences are derived, as identified by SIFTS. We thus have six sets of sequences: PDB sequences, PDB percentage consensus sequences, PDB PSSM consensus sequences, UniProt sequences, UniProt percentage consensus sequences and UniProt PSSM consensus sequences. We denote these sets as PDB, PDB-percent, PDB-pssm, UNP, UNP-percent and UNP-pssm, respectively.
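The two consensus definitions can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names, the toy profile values, and the tie-breaking rule (the first amino acid in column order wins) are assumptions made for illustration. The column order is assumed to follow the ASCII PSSM output of PSI-BLAST.

```python
# Assumed column order (that of PSI-BLAST ASCII PSSM output).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def consensus(matrix):
    """Collapse an L x 20 matrix into a string of length L.

    Each row holds one value per amino acid type (log-odds scores or
    percentages). The residue with the largest value is chosen; ties go
    to the first amino acid in AMINO_ACIDS (an assumption of this sketch).
    """
    residues = []
    for row in matrix:
        if len(row) != len(AMINO_ACIDS):
            raise ValueError("each row must contain 20 values")
        # max() returns the first index that attains the maximum value.
        best = max(range(len(row)), key=row.__getitem__)
        residues.append(AMINO_ACIDS[best])
    return "".join(residues)


def make_row(default, **overrides):
    """Hypothetical helper: build a 20-value row from a few named entries."""
    row = [default] * len(AMINO_ACIDS)
    for amino_acid, value in overrides.items():
        row[AMINO_ACIDS.index(amino_acid)] = value
    return row


# Toy three-position profile with invented values, for illustration only.
toy_scores = [
    make_row(-2, M=6),
    make_row(-1, K=3, R=4),
    make_row(-3, W=11, F=2),
]
toy_percents = [
    make_row(0, M=100),
    make_row(0, K=55, R=45),
    make_row(0, W=40, F=60),
]

pssm_consensus = consensus(toy_scores)        # highest-scoring residues
percent_consensus = consensus(toy_percents)   # most frequent residues
```

In this toy profile the two consensus sequences differ at the second and third positions, because the highest-scoring residue in a column is not necessarily the most frequent one.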

We ran HMMER3 on all six sets of sequences against the Pfam A and Pfam B HMMs. We refer to the resulting six sets of alignments as “HMMER hits”.

We ran HHblits on the unique sequences in the PDB and on the UniProt sequences to generate HMMs, searching against uniprot20_29Mar11, a database of HMMs created from a clustering of UniProt sequences at 20% identity. We then used HHsearch to search the Pfam HMMs with the HHblits-derived PDB and UniProt HMMs, generating Pfam-to-PDB alignments via HMM-HMM alignment. We refer to these two sets of alignments as “HH hits”.