Below are links to the Pfam datasets used in the Active Clustering of Biological Sequences study.
Each line in these files contains three tab-delimited fields: family-id, protein-id, and protein-sequence. See the paper for more details about these datasets.
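As a rough illustration of the file format described above, the following sketch reads such lines and groups the proteins by family. The function names and the sample identifiers and sequences are made up for this example; they are not taken from the datasets.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (family-id, protein-id, protein-sequence)


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one (family-id, protein-id, protein-sequence) tuple per line."""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines (an assumption, not part of the stated format)
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-delimited fields, got {len(fields)}")
        yield fields[0], fields[1], fields[2]


def group_by_family(records: Iterable[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each family-id to its list of (protein-id, protein-sequence) pairs."""
    families: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for family_id, protein_id, sequence in records:
        families[family_id].append((protein_id, sequence))
    return dict(families)


# Hypothetical sample lines in the three-field format.
sample_lines = [
    "FAM_A\tprot1\tMKVLAAGIVG\n",
    "FAM_A\tprot2\tMKVLSAGIVG\n",
    "FAM_B\tprot3\tGSHMTTPLQE\n",
]
families = group_by_family(parse_records(sample_lines))
```

To read a real file, pass an open file handle instead of the sample list, for example `parse_records(open(path))`, where `path` is wherever you saved the dataset.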
For clustering purposes, you may compute similarities/distances between these sequences using BLAST.
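One possible way to do this, sketched under several assumptions: the NCBI BLAST+ command-line tools (`makeblastdb`, `blastp`) are installed, an all-against-all search is run with tabular output (`-outfmt 6`), and the bit score is used as the similarity. The file names, function names, and the choice of bit score are illustrative assumptions, not necessarily what the study used.

```python
from typing import Dict, Iterable, Tuple

# Assumed BLAST+ workflow (file names are hypothetical):
#   makeblastdb -in pfam.fasta -dbtype prot -out pfam_db
#   blastp -query pfam.fasta -db pfam_db -outfmt 6 -out hits.tsv


def to_fasta(records: Iterable[Tuple[str, str, str]]) -> str:
    """Convert (family-id, protein-id, protein-sequence) records to FASTA text."""
    return "".join(f">{protein_id}\n{sequence}\n" for _family, protein_id, sequence in records)


def bitscore_similarities(blast_lines: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Build a symmetric similarity map from default `-outfmt 6` lines.

    The default tabular columns are: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    For each unordered pair, the highest bit score seen is kept.
    """
    sims: Dict[Tuple[str, str], float] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query == subject:
            continue  # ignore self-hits
        key = tuple(sorted((query, subject)))
        sims[key] = max(sims.get(key, 0.0), bitscore)
    return sims


# Hypothetical hits, for illustration only.
example_hits = [
    "prot1\tprot2\t90.0\t10\t1\t0\t1\t10\t1\t10\t1e-5\t25.0",
    "prot2\tprot1\t90.0\t10\t1\t0\t1\t10\t1\t10\t1e-5\t27.5",
    "prot1\tprot1\t100.0\t10\t0\t0\t1\t10\t1\t10\t1e-8\t40.0",
]
similarities = bitscore_similarities(example_hits)
```

Pairs with no reported hit are absent from the map; a clustering routine could treat them as having similarity zero. A distance, if needed, has to be derived from these scores by a transformation of your choosing.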