We compiled our data sets from PDB sequences
and from the UniProt sequences represented in the PDB,
and used SIFTS to map UniProt sequences to the PDB.
For each PDB sequence, we used one iteration of our modified PSI-BLAST to generate
a profile from sequences in the UniRef90 database. The parameters for PSI-BLAST were
"-e 10 -h 0.0001 -v 5000 -b 5000 -N 25 -f 16". A PSI-BLAST profile is a position-specific
scoring matrix (PSSM), which provides a log-odds score and percentage of occurrences
for each of the 20 amino acid types at each position in the query sequence.
A consensus sequence is a one-dimensional simplification of a PSI-BLAST profile
obtained by replacing the 20-dimensional vector at each residue position with the
highest scoring or most common amino acid observed at that position. In this paper,
a “percentage consensus sequence” is composed of the most frequent residues in each
column, while a “PSSM consensus sequence” is composed of the highest scoring amino
acid at each position. We also applied the same procedure to the full UniProt
sequences from which PDB sequences are derived, as identified by SIFTS. We thus
have six sets of sequences: PDB sequences, PDB percentage consensus sequences,
PDB PSSM consensus sequences, UniProt sequences, UniProt percentage consensus
sequences and UniProt PSSM consensus sequences. In this paper, we denote those
sequences as PDB, PDB-percent, PDB-pssm, UNP,
UNP-percent and UNP-pssm, respectively.
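
The two kinds of consensus sequence defined above can be illustrated with a minimal sketch. It is not the code used in this work: it assumes the profile has already been parsed into one row of 20 numbers per position (log-odds scores for the PSSM consensus, percentages for the percentage consensus), that the columns follow the order ARNDCQEGHILKMFPSTWYV, and that ties go to the first residue in that order. The names consensus_from_rows and make_row are hypothetical.

```python
from typing import Dict, List, Sequence

# Assumed column order of the 20 amino acid types in each profile row.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def consensus_from_rows(rows: Sequence[Sequence[float]]) -> str:
    """Return the residue with the largest value at each position.

    Pass log-odds rows to get a "PSSM consensus sequence" and
    percentage rows to get a "percentage consensus sequence".
    Ties are broken by column order (an assumption of this sketch).
    """
    residues: List[str] = []
    for position, row in enumerate(rows, start=1):
        if len(row) != len(AMINO_ACIDS):
            raise ValueError(f"position {position}: expected 20 values, got {len(row)}")
        best = max(range(len(row)), key=lambda i: row[i])
        residues.append(AMINO_ACIDS[best])
    return "".join(residues)


def make_row(default: float, values: Dict[str, float]) -> List[float]:
    """Build a 20-value row from a default and per-residue overrides (toy helper)."""
    return [values.get(aa, default) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy two-position profile: at position 1, W scores highest but A is most frequent.
    scores = [make_row(-1, {"A": 4, "W": 5}), make_row(-1, {"L": 4})]
    percents = [make_row(0, {"A": 60, "W": 40}), make_row(0, {"L": 100})]
    print("PSSM consensus:      ", consensus_from_rows(scores))
    print("percentage consensus:", consensus_from_rows(percents))
```

In the toy profile the two consensus sequences differ at the first position, which is why the two variants are kept as separate sequence sets.
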
We ran HMMER3 on all six sets of sequences against the Pfam A and Pfam B HMMs.
We refer to these six sets of alignments as “HMMER hits”.
We ran HHblits on the unique sequences in the PDB and on the UniProt sequences
to generate HMMs, searching the database uniprot20_29Mar11, which is a database
of HMMs created from a clustering of UniProt sequences at 20% identity. We then
used HHsearch to search the Pfam HMMs with the HHblits-derived PDB and UniProt
HMMs, generating Pfam-to-PDB alignments via HMM-HMM alignment. We refer to these
two sets of alignments as “HH hits”.