Methods in bSiteFinder

Definitions of operations

Rules of Five

Protein entries in the PDB database are filtered using the following rules:

1.    The macromolecule type is protein only, with no DNA or RNA.

2.    The experimental method is X-ray.

3.    The X-ray resolution is between 0 and 3.0 Å.

4.    Has free ligands = yes.

5.    The sequence length is over 20 residues.
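As a minimal sketch (not the bSiteFinder source code), the five rules can be written as a single predicate over per-entry metadata. The `EntryMeta` class, its field names, and the treatment of the resolution bounds (0 excluded, 3.0 included) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class EntryMeta:
    """Hypothetical per-entry metadata; the field names are illustrative."""
    macromolecule_types: FrozenSet[str]  # e.g. frozenset({"protein"})
    method: str                          # e.g. "X-RAY DIFFRACTION"
    resolution: Optional[float]          # in angstroms, None if not reported
    has_free_ligands: bool
    sequence_length: int                 # in residues


def passes_rules_of_five(entry: EntryMeta) -> bool:
    """Return True only if the entry satisfies all five rules."""
    return (
        entry.macromolecule_types == {"protein"}            # rule 1
        and entry.method.upper().startswith("X-RAY")        # rule 2
        and entry.resolution is not None
        and 0 < entry.resolution <= 3.0                     # rule 3 (assumed bounds)
        and entry.has_free_ligands                          # rule 4
        and entry.sequence_length > 20                      # rule 5
    )
```

Keeping the rules in one predicate makes it easy to apply them to a whole list of entries with a list comprehension.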

Number of Ligand Atoms

When the databases are built, the database a protein finally falls into depends on whether it contains ligands and whether these ligands have enough atoms. Ligand identification, which follows the rules below, therefore plays a key role. Every HETATM residue is recognized through the HET records in the header of the PDB file. Notably, some of these residues are modified residues on normal chains; because they are present in the MODRES records, they are not counted as true ligands. Hence, the selected ligands come only from HET records that are not listed in the MODRES records. Water molecules are included in HETATM records but are not regarded as ligands. Based on our analysis of the data, we require a ligand to possess 6 or more atoms as the basic rule for identifying a ligand.
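The selection rule above can be sketched as follows, assuming the HET and MODRES records have already been parsed. The input structures (a dictionary keyed by residue name, chain ID and sequence number, and a set of MODRES keys) and the list of water residue names are hypothetical choices for illustration, not the bSiteFinder implementation.

```python
WATER_NAMES = {"HOH", "WAT", "DOD"}  # assumed water residue names
MIN_LIGAND_ATOMS = 6                 # threshold stated in the text


def select_ligands(het_atom_counts, modres_keys):
    """Return the HET residues that count as true ligands.

    het_atom_counts: {(res_name, chain_id, res_seq): number_of_atoms}
    modres_keys:     set of (res_name, chain_id, res_seq) listed in MODRES
    """
    return sorted(
        key
        for key, n_atoms in het_atom_counts.items()
        if key not in modres_keys          # modified residues are not ligands
        and key[0] not in WATER_NAMES      # water is not a ligand
        and n_atoms >= MIN_LIGAND_ATOMS    # 6 or more atoms
    )
```

For example, with one ATP residue (31 atoms), one selenomethionine listed in MODRES, one water and one sulfate ion (5 atoms), only the ATP residue is kept.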

Stability of Complex

The binding site check criterion is used as the standard for judging the stability of a bound structure. The structure of the complex is considered stable only if at least one atom of the ligand lies within 4 Å of the geometric center of the calculated binding site.
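A minimal sketch of this check is given below. It assumes that atoms are plain (x, y, z) tuples in Å, that the geometric center is the unweighted mean of the binding site atom coordinates, and that "within 4 Å" includes the boundary; the function names are illustrative.

```python
import math


def geometric_center(points):
    """Unweighted mean of a non-empty list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def is_stable_complex(ligand_atoms, site_atoms, cutoff=4.0):
    """True if any ligand atom lies within `cutoff` of the site center."""
    center = geometric_center(site_atoms)
    return any(math.dist(atom, center) <= cutoff for atom in ligand_atoms)
```

Only one ligand atom needs to satisfy the cutoff, so `any` stops at the first atom that is close enough.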

Homology Indexing

Homology Indexing is implemented using SCOPe, version 2.03 (Fox et al., 2014). First, the four-digit classification number of the query chain is looked up by its PDB ID and chain ID. Then, all protein chains with the same classification number are retrieved and used to constitute the template database for subsequent structural alignment.
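The lookup can be illustrated with two small dictionaries, as sketched below. The table layout (one classification string per chain), the exclusion of the query chain from its own template list, and the PDB IDs used in the usage note are assumptions for illustration; SCOPe itself classifies domains, so a chain may carry more than one classification in practice.

```python
from collections import defaultdict


def build_class_index(scop_table):
    """Invert {(pdb_id, chain_id): classification} into
    {classification: [(pdb_id, chain_id), ...]}."""
    index = defaultdict(list)
    for chain_key, classification in scop_table.items():
        index[classification].append(chain_key)
    return index


def homologous_templates(query_key, scop_table, class_index, exclude_query=True):
    """Return all chains sharing the query's four-digit classification."""
    classification = scop_table.get(query_key)
    if classification is None:
        return []  # query is not classified; no homology-based templates
    return [
        key
        for key in class_index.get(classification, [])
        if not (exclude_query and key == query_key)
    ]
```

With a table such as `{("1abc", "A"): "a.1.1.1", ("2xyz", "B"): "a.1.1.1", ("3def", "A"): "c.2.1.3"}` (made-up IDs), the query `("1abc", "A")` yields `[("2xyz", "B")]`.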

Chain Length Indexing

Only chains whose length differs from that of the query chain by less than 30% are used as candidates for subsequent structural alignment.
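The text does not state which length the 30% refers to. The sketch below assumes the difference is measured relative to the query length; this choice and the function names are illustrative.

```python
def within_length_window(query_len, template_len, max_rel_diff=0.30):
    """True if the template length differs from the query length by
    less than `max_rel_diff`, relative to the query length (assumed)."""
    return abs(template_len - query_len) / query_len < max_rel_diff


def length_matched(query_len, template_lengths, max_rel_diff=0.30):
    """Filter {template_id: length} down to the ids inside the window."""
    return [
        template_id
        for template_id, length in template_lengths.items()
        if within_length_window(query_len, length, max_rel_diff)
    ]
```

For a 200-residue query, a 250-residue template (25% longer) passes, whereas a 260-residue template (exactly 30% longer) does not, because the inequality is strict.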

Structural Alignment

The structural alignment between the query and the templates in bSiteFinder is implemented using the Combinatorial Extension (CE) algorithm, which is provided by BioJava (Prlic et al., 2012). Unlike traditional dynamic programming and Monte Carlo algorithms, the CE algorithm defines runs of consecutive residues in the sequences as aligned fragment pairs (AFPs), which are used for local alignment between the query and a template. The optimized alignment is then obtained by extending or discarding the local AFPs.

Optimized Multiple-Templates Clustering

After structural alignment, each template is mapped onto the query. The templates that meet the requirement of Stability of Complex are then ranked by their similarity to the query chain, and the ligands of at most the top 20 templates are picked out. After these (up to 20) structural alignments, all ligands of the selected templates have been mapped onto the query. These ligands are then grouped into clusters. For each ligand, the number of other ligand geometric centers that lie less than 3 Å from its own geometric center is counted. The ligand with the largest count is defined as the center of the Top1 binding site (Figure 1). This ligand and all other ligands within 3 Å of it are then removed, and the centers of the Top2 and Top3 binding sites are found in the same way.

Figure 1. Workflow of Optimized Multiple-Templates Clustering. The template (b) is mapped onto the query (a) by structural alignment to form a query-template complex (c). The template chain is then removed and the ligand is retained (d). After up to 20 structural alignments, the ligands of the templates have been mapped onto the query (e). For each ligand, the number of other ligand geometric centers that lie less than 3 Å from its own geometric center is counted (f). The ligand with the largest count is defined as the center of the Top1 binding site (g).
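The clustering step can be sketched as a greedy procedure over the geometric centers of the mapped ligands. This is an illustration under stated assumptions, not the bSiteFinder code: centers are (x, y, z) tuples in Å, neighbor counts are recomputed on the remaining ligands after each removal, and ties are broken by input order.

```python
import math


def top_binding_site_centers(ligand_centers, radius=3.0, n_sites=3):
    """Greedily pick up to `n_sites` centers with the most neighbors.

    Returns a list of (center, neighbor_count), Top1 first.
    """
    remaining = list(ligand_centers)
    sites = []
    while remaining and len(sites) < n_sites:
        best_idx, best_members = None, []
        for i, center in enumerate(remaining):
            # Indices of the other centers closer than `radius` to this one.
            members = [
                j
                for j, other in enumerate(remaining)
                if j != i and math.dist(center, other) < radius
            ]
            if best_idx is None or len(members) > len(best_members):
                best_idx, best_members = i, members
        sites.append((remaining[best_idx], len(best_members)))
        # Remove the chosen ligand and its neighbors before the next round.
        dropped = set(best_members) | {best_idx}
        remaining = [c for j, c in enumerate(remaining) if j not in dropped]
    return sites
```

With at most 20 mapped ligands per query, the quadratic neighbor count is negligible, so no spatial index is needed.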

Detection of Binding Sites

When a protein chain has ligands, all residues within 8 Å of the ligands are defined as the components of the binding site. When the binding site is instead detected by structural alignment with templates, all residues within 10 Å of the mapped ligands are defined as the components of the binding site. Note that if a bound protein does not pass the Stability of Complex evaluation, it is treated as an unbound protein and its original ligands are removed.
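The two cutoffs can be illustrated with the sketch below. It assumes that a residue counts as "within" the cutoff when any of its atoms lies at or inside that distance from any ligand atom; the text does not specify this detail, and the data layout and names are hypothetical.

```python
import math

BOUND_CUTOFF = 8.0    # chain carries its own ligand (angstroms)
MAPPED_CUTOFF = 10.0  # ligand mapped from a template (angstroms)


def binding_site_residues(residue_atoms, ligand_atoms, ligand_is_mapped):
    """Return the ids of residues that form the binding site.

    residue_atoms: {residue_id: [(x, y, z), ...]}
    ligand_atoms:  [(x, y, z), ...]
    """
    cutoff = MAPPED_CUTOFF if ligand_is_mapped else BOUND_CUTOFF
    return sorted(
        res_id
        for res_id, atoms in residue_atoms.items()
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms)
    )
```

The larger cutoff for mapped ligands tolerates the positional uncertainty introduced by the structural alignment.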

Algorithm

Figure 2. Workflow of creating template database.

Create template database

Our algorithm aims to make maximal use of the information in bound proteins. To this end, we built what is so far the largest database of bound templates from the PDB database, with stringent quality control. Figure 2 shows the workflow of creating the template database, which includes four steps:

1.    The 97591 complex structures in the PDB database (February 11, 2014) were filtered according to the Rules of Five, and 62487 complex structures were obtained.

2.    Proteins were divided into chains, and chains shorter than 20 residues were removed. After that, 146089 chains were obtained.

3.    Number of Ligand Atoms was employed to ensure that each chain has at least one ligand in its complex structure, and 117823 chains were obtained.

4.    Stability of Complex was employed to ensure that each chain forms a stable bound structure with its ligand. Finally, 101315 chains were obtained for building the database of bound templates.

Figure 3. Workflow of binding site detection. Each submitted protein chain is processed successively by the following steps: 1. Binding site prediction for a high-quality bound protein (Part 1); otherwise, proceed to the next step. 2. Binding site prediction for an unbound protein using bound templates with the same Homology Indexing (Part 2); otherwise, proceed to the next step. 3. Binding site prediction for an unbound protein using bound templates selected by Chain Length Indexing (Part 3). Any protein chain submitted to our system can thus receive binding site results through efficient computation.

Workflow of binding sites detection

When a user submits a query protein for binding site prediction, it is first divided into chains. The prediction is then performed for each chain. Figure 3 shows the workflow of binding site detection. Each protein chain is processed by the following steps:

1.    Binding site prediction for a high-quality bound protein (Part 1)

When a protein chain meets the requirements of Number of Ligand Atoms and Stability of Complex, Detection of Binding Sites is employed for binding site detection. Otherwise, the chain proceeds to the next step.

2.    Binding site prediction for an unbound protein using bound templates with the same Homology Indexing (Part 2)

If the query chain has a four-digit classification number in SCOPe and the template database contains bound templates with the same Homology Indexing, the binding site of the query chain is detected by the following procedure. First, structural alignments between the query chain and the templates are performed, and the top 20 bound templates most similar to the query are selected. The locations of the ligands are then determined by mapping the ligands of the templates onto the query, and the binding sites are optimized with the newly developed Optimized Multiple-Templates Clustering method. Finally, Detection of Binding Sites is employed for binding site detection. Otherwise, the chain proceeds to the next step.

3.    Binding site prediction for an unbound protein using bound templates selected by Chain Length Indexing (Part 3)

If the query chain has no satisfactory homologous bound template, its binding site is detected by the following procedure. Chain Length Indexing is employed to search the template database for bound templates whose length differs from that of the query chain by less than 30%. The process then continues as described above (Part 2 of “Workflow of binding sites detection”) with the top 20 most similar bound templates.
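The three-part cascade above can be summarized as a short dispatch function. This is a sketch of the control flow only: every callable on the `steps` object is a hypothetical placeholder for the corresponding operation, "no satisfactory homologous bound template" is simplified to an empty template list, and the removal of original ligands from chains that fail Stability of Complex is not modeled.

```python
from types import SimpleNamespace

MAX_TEMPLATES = 20  # top templates used for ligand mapping


def detect_binding_sites(chain, steps):
    """Return (part_label, sites) for one query chain."""
    # Part 1: high-quality bound protein, use its own ligand.
    if steps.has_qualifying_ligand(chain) and steps.is_stable(chain):
        return "Part 1", steps.sites_from_own_ligand(chain)
    # Part 2: bound templates with the same Homology Indexing.
    templates, part = steps.homologous_templates(chain), "Part 2"
    if not templates:
        # Part 3: fall back to Chain Length Indexing.
        templates, part = steps.length_matched_templates(chain), "Part 3"
    top = steps.rank_by_similarity(chain, templates)[:MAX_TEMPLATES]
    return part, steps.sites_from_templates(chain, top)


# Stub operations, for demonstration only.
stub = SimpleNamespace(
    has_qualifying_ligand=lambda c: c["ligand"],
    is_stable=lambda c: c["stable"],
    sites_from_own_ligand=lambda c: ["own-site"],
    homologous_templates=lambda c: c["homologs"],
    length_matched_templates=lambda c: ["t%d" % i for i in range(30)],
    rank_by_similarity=lambda c, ts: list(ts),
    sites_from_templates=lambda c, top: ["site-from-%d-templates" % len(top)],
)
```

Passing the operations in as callables keeps the control flow testable without a structure database; in the stub, a chain with no ligand and no homologs falls through to Part 3 and uses 20 of the 30 length-matched templates.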

Source Code

Download Link: BSite.zip