Benchmark datasets used in proFold

1. DD-dataset

The DD-dataset was proposed by Ding and Dubchak in 2001 and modified by Shen and Chou in 2006. Since then, it has been used in many protein fold classification studies. The training set contains 311 protein sequences and the testing set contains 386, with no two proteins sharing more than 35% sequence identity. The sequences were selected from 27 SCOP folds, which together span the α, β, α/β, and α + β structural classes.

Links: DD-train.dataset, DD-test.dataset

2. EDD-dataset

Like the DD-dataset, the EDD-dataset contains 27 SCOP folds. It has 3418 protein sequences, with no two proteins sharing more than 40% sequence identity.

Link: EDD.dataset

3. TG-dataset

The TG-dataset contains 30 SCOP folds and 1612 protein sequences, with no two proteins sharing more than 25% sequence identity.

Link: TG.dataset