Methods in proFold

Feature Extraction Method

Definition of Secondary Structure in Proteins

The DSSP program, designed by Wolfgang Kabsch and Chris Sander, is used to standardize protein secondary structure assignment. It determines the most likely secondary structure from the three-dimensional structure of a protein. Specifically, DSSP calculates the H-bond energy between every two atoms from the atomic positions in a PDB file, and then assigns the most likely secondary structure class to each residue based on the best two H-bonds of each atom.

The DSSP feature extraction process is as follows. First, DSSP entries are calculated from PDB entries by the DSSP program. Second, the corresponding DSSP sequences are obtained from the DSSP entries. A DSSP sequence contains eight states (T, S, G, H, I, B, E, −), which can be divided into four groups, as shown in Table 1. Finally, based on the eight states and four groups, a 40D feature vector is extracted from each DSSP sequence. The description and dimension of each feature are shown in Table 2.

Table 1. The eight states of the DSSP feature in four groups

8-state SS Code Description 4 groups
3₁₀ helix (G)
alpha-helix (H)
pi-helix (I)
Alpha helix
beta-strand (E)
beta-bridge (B)
Beta bridge
beta-turn (T)
high curvature loop (S)
irregular (L) - empty, no secondary structure assigned 4th

Table 2. The description and dimension of the DSSP feature

Feature description                         Dimension
state composition                           8
group composition                           4
number of continuous state                  8
number of continuous group                  4
number of continuous state composition      8
number of continuous group composition      4
alternate frequency between groups          4
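
As an illustration of the first rows of Table 2, the following Python sketch computes the state and group compositions of a DSSP sequence and counts its continuous segments. It is a minimal sketch under stated assumptions, not the proFold implementation: the function names are hypothetical, the assignment of T and S to the third group is read from the row order of Table 1, an ASCII hyphen stands in for the unassigned state, and reading "number of continuous state" as the number of maximal runs of a state is an interpretation. The remaining features of Table 2 are not covered.

```python
from itertools import groupby

# The eight DSSP states; "-" (ASCII hyphen) is assumed to mark "no structure assigned".
STATES = "GHIEBTS-"
# Four groups as read from Table 1; placing T and S in the third group is an assumption.
GROUPS = ("GHI", "EB", "TS", "-")


def composition(sequence, classes):
    """Fraction of positions whose symbol belongs to each class."""
    n = len(sequence)
    if n == 0:
        return [0.0] * len(classes)
    return [sum(sequence.count(symbol) for symbol in cls) / n for cls in classes]


def segment_counts(sequence, classes):
    """Number of maximal runs of consecutive positions falling in each class.

    Assumes the sequence contains only the eight symbols listed in STATES.
    """
    lookup = {symbol: k for k, cls in enumerate(classes) for symbol in cls}
    labels = [lookup[symbol] for symbol in sequence]
    counts = [0] * len(classes)
    for label, _run in groupby(labels):
        counts[label] += 1
    return counts
```

Calling `composition(seq, tuple(STATES))` and `composition(seq, GROUPS)` gives the 8D and 4D composition features, and `segment_counts` with the same two arguments gives one possible reading of the two "number of continuous" features.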

Amino Acids Composition and Physicochemical Properties

As features for describing a protein, amino acid composition and physicochemical properties have each achieved good prediction results. Ding and Dubchak were the first to integrate these features and achieved a better result. Later, other researchers proposed further feature integration methods. In 2013, Lin used a 188D feature vector combining amino acid composition and physicochemical properties. This 188D feature extraction method is used in this paper.

The eight physicochemical properties of amino acids are hydrophobicity, van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility. Because amino acids differ in these properties, they can be divided into three groups for each property, as shown in Table 3.

Table 3. The 20 amino acids divided into 3 groups according to their physicochemical properties

Physicochemical property   The 1st group   The 2nd group   The 3rd group
Secondary structure        EALMQKRH        VIYCWFT         GNPSD
Solvent accessibility      ALFCGIVW        RKQEND          MPSTHY

The percentage composition of the 20 amino acids in the query protein forms a 20D feature vector. For each physicochemical property, three kinds of features are extracted: the group composition of amino acids (3D), the pairwise frequency between every two groups (3D), and the distribution pattern of constituents, that is, the positions at which the first, 25%, 50%, 75% and 100% of a given constituent are contained (5 × 3D). Each property therefore contributes 3 + 3 + 15 = 21 dimensions, and the eight physicochemical properties together give a 168D feature vector for a protein sequence. Adding the 20D amino acid composition feature to the 168D physicochemical feature gives a 188D feature vector in total. The names and dimensions of the features are listed in Table 4.

Table 4. The names and dimensions of the amino acid composition and physicochemical features

Feature Name               Dimension
Amino Acids Composition    20
Hydrophobicity             21
Van der Waals volume       21
Polarity                   21
Polarizability             21
Charge                     21
Surface tension            21
Secondary structure        21
Solvent accessibility      21
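
To make the 21 dimensions per property concrete, the following Python sketch computes the composition, pairwise frequency and distribution values for one property, using the solvent accessibility groups from Table 3. It is a minimal sketch under stated assumptions rather than the method's reference implementation: the function name is hypothetical, residues outside the three groups are skipped, the pairwise frequency is normalized by the number of adjacent residue pairs, and each distribution value is expressed as a position relative to the sequence length.

```python
import math
from itertools import combinations

# Three groups for solvent accessibility, copied from Table 3.
SOLVENT_ACCESSIBILITY = ("ALFCGIVW", "RKQEND", "MPSTHY")


def ctd_features(sequence, groups):
    """Return 21 values for one property: 3 composition, 3 pairwise, 15 distribution."""
    lookup = {aa: g for g, members in enumerate(groups) for aa in members}
    # Assumption: residues that belong to no group are ignored.
    labels = [lookup[aa] for aa in sequence if aa in lookup]
    n = len(labels)
    if n == 0:
        return [0.0] * 21

    # Group composition: fraction of residues in each group (3 values).
    comp = [labels.count(g) / n for g in range(3)]

    # Pairwise frequency: share of adjacent pairs that switch between two groups (3 values).
    pairs = list(zip(labels, labels[1:]))
    trans = []
    for a, b in combinations(range(3), 2):
        switches = sum(1 for x, y in pairs if {x, y} == {a, b})
        trans.append(switches / len(pairs) if pairs else 0.0)

    # Distribution: relative position of the first, 25%, 50%, 75% and 100%
    # occurrence of each group (5 x 3 values).
    dist = []
    for g in range(3):
        positions = [i + 1 for i, x in enumerate(labels) if x == g]
        if not positions:
            dist.extend([0.0] * 5)
            continue
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(fraction * len(positions)))
            dist.append(positions[k - 1] / n)

    return comp + trans + dist
```

Repeating the call for each of the eight properties and concatenating the results with the 20D amino acid composition would give the 188D layout of Table 4. Only the two groupings shown in Table 3 are available here, so the other six would have to be supplied.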

Position Specific Scoring Matrix

The PSSM is derived from PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) by taking the multiple sequence alignment of sequences in a non-redundant protein sequence database (nrdb90). The number of iterations is 3 and the E-value cutoff is 0.001. PSI-BLAST yields two L × 20 matrices, where L is the length of the query amino acid sequence and 20 corresponds to the 20 amino acids. One matrix contains the conservation score of a given amino acid at a given position in the sequence, and the other gives the probability of occurrence of a given amino acid at a given position. The PSSM feature is extracted from the former matrix: the average value of each column is calculated to form a 20D feature vector.
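
The column averaging can be sketched in a few lines of Python. This is a minimal illustration that assumes the L × 20 conservation-score matrix has already been parsed into a list of rows. The function name is hypothetical, and parsing the PSI-BLAST output is left out.

```python
def pssm_feature(pssm):
    """Average each of the 20 columns of an L x 20 score matrix."""
    if not pssm:
        raise ValueError("the PSSM must contain at least one row")
    if any(len(row) != 20 for row in pssm):
        raise ValueError("every PSSM row must contain 20 scores")
    length = len(pssm)
    # One mean per amino acid column, taken over all L sequence positions.
    return [sum(row[j] for row in pssm) / length for j in range(20)]
```

Because the mean is taken over the L positions, proteins of different lengths all map to a vector of the same size.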

Functional Domain Composition

First, the RPS-BLAST program is used to compare the protein sequence with each of the 17402 domain sequences. Second, if the significance threshold value (expect value) for a domain is no more than 0.001, the corresponding component of the 17402D feature vector is set to 1; otherwise, it is set to 0. In this way, a 17402D feature vector is extracted in which each component is either 1 or 0.
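
The thresholding step can be sketched as follows. This is a minimal Python illustration that assumes the RPS-BLAST results have already been reduced to a mapping from a 0-based domain index to its expect value. That mapping, the function name and the constant names are hypothetical.

```python
NUM_DOMAINS = 17402      # number of domain sequences stated in the text
E_VALUE_CUTOFF = 0.001   # expect value threshold stated in the text


def fund_feature(hits, num_domains=NUM_DOMAINS, cutoff=E_VALUE_CUTOFF):
    """Build the binary functional domain vector.

    hits: dict mapping a 0-based domain index to the expect value of its hit
          (hypothetical input format; domains without a hit are simply absent).
    """
    vector = [0] * num_domains
    for index, e_value in hits.items():
        # "No more than 0.001" is read as an inclusive comparison.
        if e_value <= cutoff:
            vector[index] = 1
    return vector
```

Most proteins match only a handful of domains, so the resulting vector is very sparse.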

The Proposed Ensemble Classifier

In this study, we propose a novel ensemble strategy.

Step 1: Ten widely used machine learning classifiers (LMT, RandomForest, LibSVM, SimpleLogistic, RotationForest, SMO, NaiveBayes, RandomTree, FT and SimpleCart) are selected, and 5-fold cross-validation is performed on the DD dataset.

Step 2: For each feature group, the classifier with the highest accuracy is chosen.

Step 3: The corresponding models are obtained by training each feature group with its chosen classifier. The four models are the DSSP classification model, the AAsCPP classification model, the PSSM classification model and the FunD classification model. The detailed process is shown in Figure 1.

Step 4: Features are extracted from the test dataset, and the classification result Pij is obtained from the corresponding models, where i indexes the classification model, ranging from 1 to 4, and j indexes the fold class, ranging from 1 to the total number of fold classes (for example, j ranges from 1 to 27 on the DD dataset).

Step 5: For each fold class, the average of the probabilities from the four models is calculated. The fold class with the highest average probability is chosen as the classification result. The detailed process is shown in Figure 2.

Figure 1: The training process of the four feature groups through the corresponding classifier

Figure 2: The ensemble process of calculating the test data through the models
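
Steps 4 and 5 amount to averaging the per-model probabilities Pij over i and taking the fold class j with the largest average. The Python sketch below illustrates this for one test protein. It assumes each trained model has already produced a list of class probabilities. The function name and input layout are hypothetical.

```python
def ensemble_predict(model_probs):
    """Average class probabilities over models and pick the best fold class.

    model_probs: list with one entry per model (four in the text), each a list
                 of probabilities P[i][j] over the fold classes.
    Returns the 1-based index of the chosen fold class and the averaged probabilities.
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_models = len(model_probs)
    n_folds = len(model_probs[0])
    if any(len(probs) != n_folds for probs in model_probs):
        raise ValueError("all models must score the same fold classes")
    # Step 5: mean probability of each fold class across the models.
    averaged = [sum(probs[j] for probs in model_probs) / n_models
                for j in range(n_folds)]
    best = max(range(n_folds), key=averaged.__getitem__)
    return best + 1, averaged
```

On the DD dataset each inner list would hold 27 probabilities. If two classes tie, this sketch returns the lower index, which is a choice made here and not one stated in the text.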

Application scope

The proFold server is based on the DD dataset, so it covers the following 27 protein fold types: (1) Globin-like, (2) Cytochrome c, (3) DNA-binding 3-helical bundle, (4) 4-helical up-and-down bundle, (5) 4-helical cytokines, (6) EF-hand, (7) Immunoglobulin-like beta-sandwich, (8) Cupredoxins, (9) Viral coat and capsid proteins, (10) ConA-like lectins/glucanases, (11) SH3-like barrel, (12) OB-fold, (13) beta-Trefoil, (14) Trypsin-like serine proteases, (15) Lipocalins, (16) (TIM)-barrel, (17) FAD (also NAD)-binding motif, (18) Flavodoxin-like, (19) NAD(P)-binding Rossmann-fold domains, (20) P-loop containing nucleotide triphosphate hydrolases, (21) Thioredoxin-like, (22) Ribonuclease H-like motif, (23) Hydrolases, (24) Periplasmic binding protein-like, (25) beta-Grasp, (26) Ferredoxin-like, and (27) Small inhibitors, toxins, lectins.