Protein Sequence Datasets

Below are links to the Pfam datasets used in the Active Clustering of Biological Sequences study.

Each line in these files contains three tab-delimited fields, in this order: family-id, protein-id, protein-sequence. See the paper for more details about these datasets.
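
The three-field layout described above could be read with a short script like the following. This is only a minimal sketch: the names `Record` and `read_records` and the sample identifiers are made up for illustration, and it assumes the files have no header row.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Record(NamedTuple):
    """One line of a dataset file (field names are illustrative)."""
    family_id: str
    protein_id: str
    sequence: str


def read_records(handle: TextIO) -> Iterator[Record]:
    """Yield one Record per non-empty, tab-delimited line."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        yield Record(*fields)


# Made-up sample data in the same layout; not taken from the real datasets.
sample = io.StringIO("FAM_A\tprot_1\tMKVLA\nFAM_B\tprot_2\tGSHMT\n")
records = list(read_records(sample))
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="ascii")` with `path` pointing at a downloaded dataset.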

For clustering purposes, you may compute similarities/distances between these sequences using BLAST.
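
BLAST tools generally expect FASTA input rather than tab-delimited text, so a conversion step is likely needed first. The sketch below shows one way to do it. The function name `tsv_to_fasta` and the choice to put the family-id after the protein-id in the header line are assumptions made for illustration, not part of the original datasets or study.

```python
import io
from typing import TextIO


def tsv_to_fasta(src: TextIO, dst: TextIO) -> int:
    """Write each (family-id, protein-id, protein-sequence) line as a FASTA
    record and return the number of records written."""
    count = 0
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        family_id, protein_id, sequence = line.split("\t")
        # Header: protein-id first, so it serves as the sequence identifier;
        # the family-id is kept as a free-text description.
        dst.write(f">{protein_id} {family_id}\n{sequence}\n")
        count += 1
    return count


# Made-up sample data in the same layout; not taken from the real datasets.
source = io.StringIO("FAM_A\tprot_1\tMKVLA\nFAM_B\tprot_2\tGSHMT\n")
fasta = io.StringIO()
written = tsv_to_fasta(source, fasta)
```

With NCBI BLAST+ installed, a FASTA file produced this way (here hypothetically named `proteins.fasta`) could be used for an all-against-all search with commands such as `makeblastdb -in proteins.fasta -dbtype prot` followed by `blastp -query proteins.fasta -db proteins.fasta -outfmt 6`. Check the options against your installed version before relying on them.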

Dataset 1