Protein Sequence Datasets

Below are links to the Pfam datasets used in the Active Clustering of Biological Sequences study.

Each line in these files contains three tab-delimited fields: family-id, protein-id, and protein-sequence. See the paper for more details about these datasets.
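As an illustration of the file layout described above, here is a minimal Python sketch for reading the three tab-delimited fields. The function names (`parse_records`, `group_by_family`) are hypothetical and not part of the datasets or the paper. The sketch also assumes the files have no header row and skips blank lines, neither of which is confirmed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (family_id, protein_id, sequence)


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (family_id, protein_id, sequence) from tab-delimited lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines, if any, carry no data
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        family_id, protein_id, sequence = fields
        yield family_id, protein_id, sequence


def group_by_family(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Map each family id to the protein ids listed under it, in file order."""
    families: Dict[str, List[str]] = defaultdict(list)
    for family_id, protein_id, _sequence in records:
        families[family_id].append(protein_id)
    return dict(families)
```

With a downloaded file, usage might look like `with open(path) as handle: records = list(parse_records(handle))`, where `path` points to whichever dataset file you saved locally. The family ids then serve as the reference labels for a clustering of the sequences.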

For clustering purposes, you can compute pairwise similarities or distances between these sequences using BLAST.
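BLAST tools read sequences in FASTA format, so the tab-delimited records need to be converted first. The following is a minimal sketch of that conversion, not the procedure used in the study. The function name `to_fasta`, the header layout (protein id followed by family id), and the 60-character line width are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

Record = Tuple[str, str, str]  # (family_id, protein_id, sequence)


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Render records as FASTA text, wrapping sequences at `width` characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    for family_id, protein_id, sequence in records:
        # Assumed header layout: protein id as identifier, family id as description.
        lines.append(f">{protein_id} {family_id}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    # End with a trailing newline unless there is nothing to write.
    return ("\n".join(lines) + "\n") if lines else ""
```

One possible next step, assuming the BLAST+ command-line tools are installed, is to write this text to a file, build a protein database from it with `makeblastdb`, and run an all-against-all `blastp` search using the same file as the query. How the resulting scores are turned into similarities or distances is a modeling choice; see the paper for the approach taken there.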

Dataset 1

Dataset 2

Dataset 3

Dataset 4

Dataset 5

Dataset 6

Dataset 7

Dataset 8

Dataset 9

Dataset 10