Benchmark datasets used in proFold

1. DD-dataset

DD-dataset was proposed by Ding and Dubchak in 2001 and modified by Shen and Chou in 2006. Since then, it has been used in many protein fold classification studies. The training set contains 311 protein sequences and the testing set contains 386, with no two proteins sharing more than 35% sequence identity. The sequences cover 27 SCOP folds, which span the structural classes α, β, α/β, and α + β.

C.H. Ding and I. Dubchak, “Multi-class protein fold recognition using support vector machines and neural networks,” Bioinformatics, vol. 17, no. 4, pp. 349–358, 2001.
H.-B. Shen and K.-C. Chou, “Ensemble classifier for protein fold pattern recognition,” Bioinformatics, vol. 22, no. 14, pp. 1717–1722, 2006.

Links: DD-train.dataset, DD-test.dataset

2. EDD-dataset

Like DD-dataset, EDD-dataset covers 27 SCOP folds. It contains 3418 protein sequences, with no two proteins sharing more than 40% sequence identity.

Q. Dong, S. Zhou, and J. Guan, “A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation,” Bioinformatics, vol. 25, no. 20, pp. 2655–2662, 2009.

Link: EDD.dataset

3. TG-dataset

TG-dataset covers 30 SCOP folds and contains 1612 protein sequences, with no two proteins sharing more than 25% sequence identity.

J. Lyons, N. Biswas, A. Sharma, A. Dehzangi, and K.K. Paliwal, “Protein fold recognition by alignment of amino acid residues using kernelized dynamic time warping,” Journal of Theoretical Biology, vol. 354, pp. 137–145, 2014.

Link: TG.dataset
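
Note on the sequence identity cutoffs

The identity cutoffs quoted for the three datasets (35%, 40%, and 25%) can be illustrated with a minimal Python sketch. It assumes the two sequences have already been aligned to equal length, with '-' marking gaps, and it uses one common definition of percent identity: identical residues divided by aligned columns, ignoring columns where both sequences have a gap. The cited papers may define identity differently. The names percent_identity, within_cutoff, and CUTOFFS are hypothetical and are not part of proFold.

```python
# Cutoffs (in percent) as stated in the dataset descriptions on this page.
CUTOFFS = {"DD": 35.0, "EDD": 40.0, "TG": 25.0}


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption: '-' marks a gap. Columns where both sequences have a gap
    are skipped; every other column counts toward the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" and res_b == "-":
            continue  # column carries no information for this pair
        compared += 1
        if res_a == res_b:
            matches += 1

    return 100.0 * matches / compared if compared else 0.0


def within_cutoff(aligned_a: str, aligned_b: str, dataset: str) -> bool:
    """True if the pair does not exceed the identity cutoff of `dataset`."""
    return percent_identity(aligned_a, aligned_b) <= CUTOFFS[dataset]
```

For example, within_cutoff("ACDE", "ACDF", "DD") is False, because the two toy sequences agree in three of four columns (75% identity), well above the 35% cutoff.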