## Users.cecs.anu.edu.au

**PROTEIN-CHEMICAL INTERACTION PREDICTION VIA KERNELIZED**
**SPARSE LEARNING SVM**
YI SHI*1, XINHUA ZHANG1, XIAOPING LIAO2, GUOHUI LIN1, DALE SCHUURMANS1
1

*Department of Computing Science, University of Alberta,*
*E-mail: {ys3,xinhua2,guohui,daes}*@

*ualberta.ca*
2

*Department of Agricultural, Food and Nutritional Science, University of Alberta,*
Given the diﬃculty of experimental determination of drug-protein interactions, there is a signiﬁcantmotivation to develop eﬀective

*in silico *prediction methods that can provide both new predictions forexperimental veriﬁcation and supporting evidence for experimental results. Most recently, classiﬁca-tion methods such as support vector machines (SVMs) have been applied to drug-target prediction.

Unfortunately, these methods generally rely on measures of the maximum “local similarity” betweentwo protein sequences, which could mask important drug-protein interaction information since drugsare much smaller molecules than proteins and drug-target binding regions must comprise only smalllocal regions of the proteins. We therefore develop a novel sparse learning method that considers setsof short peptides. Our method integrates feature selection, multi-instance learning, and Gaussiankernelization into an

*L*1 norm support vector machine classiﬁer. Experimental results show that it notonly outperformed the previous methods but also pointed to an optimal subset of potential bindingregions. Supplementary materials are available at “www.cs.ualberta.ca/~ys3/drug_target”.

*Keywords*: Drug-target interaction; SVM; Sparse learning; Kernelization.

**1. Introduction**
Proteins operate in highly interconnected networks (“interactome networks”) that play a cen-tral role in governing cell functions. If a protein’s conformation is changed, its function canbe altered, thus aﬀecting cell function. Drugs are small molecules that bind to target pro-teins to intensionally change the protein conformation, ultimately achieving treatment eﬀects.

The function of many classes of pharmaceutically useful protein targets, such as enzymes, ionchannels, G protein coupled receptors (GPCRs), and nuclear receptors, can be modulated byligand interaction. Identifying interaction between ligands and proteins is therefore a key togenomic drug discovery.

Various high-throughput technologies for analyzing the genome, the transcriptome, and
the proteome have enhanced our understanding of the space populated by protein classes.

Meanwhile, the development of high-throughput screening technology has enabled broaderexploration of the space of chemical compounds.1–3 The goal of the chemical genomics re-search is to identify potentially useful compounds, such as imaging probes and drug leads,by relating the chemical space to the genomic space. Unfortunately, our understanding of therelationship between the chemical and the genomic spaces remains insuﬃcient. For example,the PubChem database at NCBI4 contains information of millions of chemical compounds,but the number of compounds with known target proteins is limited. The lack of documentedprotein-chemical interactions suggests that many remain to be discovered, which motivates
the need for improved methods for inferring potential drug-target interactions automaticallyand eﬃciently. To facilitate the study of protein-chemical interactions, Kuhn et al. created aprotein-chemical interaction database called STITCH,5 which, up to now, contains interactionsfor between 300,000 small molecules and 2.6 million proteins from 1,133 organisms.

By elucidating the interaction between proteins and drug molecules, 3D-structure based
“docking analysis” has been the principle method for drug discovery.6–8 In docking analysis,drug-protein binding aﬃnities are modeled by non-covalent intermolecular interactions, such ashydrogen bonding, electrostatic interactions, hydrophobic and Van der Waals forces. Throughestablishing equations that model the physical interaction between a receptor and potentialligand, the potential energy of binding can be calculated. There are many docking softwaretools available, including DOCK,8 GOLD,6 and AutoDock.7 All these methods require com-plete 3D structural information for the target, which might not be available in practice. Sucha major disadvantage makes docking analyses infeasible for genome wide application.

Given the diﬃculty of experimental determination of compound-protein interactions,9,10
there is a signiﬁcant motivation to develop eﬀective

*in silico *prediction methods that canprovide both new predictions for experimental veriﬁcation and supporting evidence for exper-imental results. To predict compound-protein interactions various computational approacheshave been developed. Keiser et al.11 propose using the known structure of a set of ligandsto predict target protein families. This method does not take advantage of available pro-tein sequence information, and is thus limited to those between known ligands and proteinfamilies. Campillos et al.12 propose predicting drug-target interaction based on similaritiesbetween side-eﬀects of known drugs. Some results of this approach have been veriﬁed by

*invitro *binding assays, but the approach remains limited to predictions involving drugs withknown side-eﬀects. Yamanishi et al.13 have investigated the relationship between drug chem-ical structure, target protein sequence, and drug-target network topology, and developed aregression-based learning method for predicting unknown drug-target interactions. In partic-ular, they integrated the chemical and the genomic spaces into a uniﬁed space, referred toas the “pharmacological space”, wherein chemical-chemical, protein-protein, and chemical-protein similarities can be modeled. Perlman et al.14 used a combination of Smith-Watermanscore, protein-protein interaction, and Gene Ontology information to measure the gene-genesimilarity (similarity between targets), but these ancillary information is not always availablemaking the prediction hard to extend to general case, and the way of combining diﬀerentinformation sources is somehow tricky.

Most recently, classiﬁcation methods have been adopted in drug-target prediction.15–17
These methods ﬁrstly calculate the similarities between targets and/or drugs, then use thesesimilarities to construct kernel matrices for the classiﬁers, such as the support vector machines(SVMs) for predicting novel drug-target interactions. The prediction can be cast into twoways, one for drug side or drug-to-target and the other for target side. For drug-to-targetprediction, drug-drug similarities are ﬁrst obtained, based on structural or pharmacologicalinformation; then a bipartite known drug-target interaction graph is constructed; for a newdrug with known structural or pharmacological information, its similarities to known drugs arecalculated to predict its interactions with known targets using the bipartite interaction graph.

Similarly for target-to-drug prediction, target-target similarities are ﬁrst obtained using theprimary amino acid sequences;13,17,18 then for a new target with known primary sequence, itssimilarities to known targets are calculated to predict its interactions with known drugs againusing the bipartite interaction graph.

It should be pointed out that in the state-of-the-art works of target-to-drug prediction,
the target-target similarity is deﬁned out of the normalized Smith-Waterman score.17 ThisS-W score measures the maximum “local similarity” between two protein sequences,19 thusreasonable, but the local similarity still uses the whole sequences and consequently mightinvolve

*long *substrings, which is unreasonable. In fact, long substrings could mask importantinteraction information, since drugs are usually much smaller molecules than proteins and thedrug-target binding sites mostly comprise of only small local regions of the target.

In this work, we focus on the latter target-to-drug prediction to address the issues in the
existing works. We ﬁrst attempt to identify key local binding regions from the

*common short*substrings shared by proteins that interact with the same drug. These key short substringsare then used to construct a vector representation for a protein sequence, to be used in thetraining and testing phases of a classiﬁer. The use of key short substrings (i.e. potential bindingregions) as features for the targets is a more direct and meaningful representation for druginteraction prediction. Additionally, the explicit vector representation of targets, as opposedto assessing similarity based on the S-W score, maps the targets into higher dimensionalspaces, thus increasing the eﬀectiveness of kernel-based classiﬁers. We remark that our use ofcommon short substrings diﬀer from the substring composition representation for proteins,15which uses all substrings while disregarding whether interactions exist.

The rest of the paper is organized as follows. In Section 2, we introduce the details of our
prediction method, in which we focus on the SVM classiﬁers. We demonstrate in Section 3the performance of our method compared against the existing ones. Lastly, in Section 4, wediscuss the advantages and disadvantages of our method and propose future work.

**2. Methods**
The drug-target interaction prediction framework is the same as in Bleakley et al.,17 in whichwe assume a dataset containing

*m *drugs

*d*1

*, d*2

*, . . . , dm *and

*n *targets

*t*1

*, t*2

*, . . . , tn*, and the binaryindicator on whether or not drug

*di *interacts target

*tj*. The goal is to predict which of thedrugs a new target

*tc *will interact.

**2.1. ***Target Vectorization*
In the

*bipartite local model *(BLM) by Bleakley et al.,17 to which our method will compareagainst, the similarity between two targets

*t *and

*t′ *is deﬁned as the normalized Smith-Waterman score:17
where

*SW *(

*·, ·*) denotes the original Smith-Waterman score.19 As we mentioned in the intro-duction, such a similarity measure might overlook the key short sequence regions to which adrug binds.

To address this issue, we want to identify the common short substrings of the targets
that interact the same drug. We consider one drug, say

*di*, at a time. From the dataset, weﬁrst retrieve the set of targets

*Ti *=

*{ti*1

*, ti*2

*, . . . , tin } *interacting with

*d*
target

*tc*, we obtain another set

*Ti ∪ {tc}*. Using a substring length lower bound, we computefor each of the two sets

*Ti *and

*Ti ∪ {tc} *the multi-set of pairwise maximal common substrings,denoted as

*withoutSS *=

*{si*1

*, si*2

*, . . . , siq′} *and

*withSS *=

*{si*1

*, si*2

*, . . . , sip′}*, respectively. In eachof the two multi-sets, if two substrings diﬀer at at most one position, they are merged intoone and their frequencies are summed together. This way, we obtain two

*reduced *sets

*with-outSS *=

*{si*1

*, si*2

*, . . . , siq} *and

*withSS *=

*{si*1

*, si*2

*, . . . , sip}*, containing

*q *and

*p *unique substringsrespectively, and each substring is associated with its number of occurences.

Using the substrings in set

*withSS *and their occurrences, we can map the

*n *training targets
and the new target

*tc *into the

*p *dimensional Euclidean space, where each substring representsa dimension and the coordinate of target

*t *in dimension

*s *is calculated as the normalizedmatch score between

*t *and

*s *in set

*withSS *:
where

*L*(

*·, ·*) is length of the longest common substring between the two sequences and

*cs *isthe number of occurrence of substring

*s*. Intuitively, if target

*tc *contains a long substring thatis also frequent in the binding targets, then its match score for this feature substring willbe high indicating a high likelihood of binding. We use (

*M *(

*t, s*1)

*, M *(

*t, s*2)

*, . . . , M *(

*t, sp*)) as thevector representation for target

*t*.

This way we obtain an

*n × p *training matrix

*X*, where each row represents a training
target, and a

*p × *1 testing vector

**x***c *representing the new target

*tc*, along with the

*n × *1 binary

training label vector

**y **(with 1 indicating the target interacts with drug

*di *and

*−*1 otherwise).

The task is to construct a classiﬁer to return 1 if the new target

*tc *interacts with drug

*di*, or

*−*1 otherwise.

The classiﬁcation problem can be analogously formulated using set

*withoutSS *substring
set. Next we show how to construct a classiﬁer from the training data.

**2.2. ***Classification with Feature Selection*
In any classiﬁcation problem, the quality of features used determines the accuracy of pre-dictions. Here, features correspond to substrings of target proteins, which comprise potentialbinding regions between the proteins and drugs. Thus, selecting good features not only im-proves classiﬁcation accuracy, but also provides candidate drug-target binding sites for furtherinvestigation. We investigated an approach that integrates feature selection in

*L*1-norm basedsupport vector machine (SVM) classiﬁcation method.

The primal form of

*L*1-norm SVM is:
min

*β∥***w***∥*1 +

**1***T ***ξ**
**w***,***b***,***ξ**
**ξ **≥ **1 ***− △*(

**y**)(

*X***w ***− b***1**)

*,*
**ξ **≥ **0***.*
where

*△*(

**y**) denotes putting the vector

**y **on the main diagonal of a square matrix. Here

*X ∈ *R

*n×p*,

**y ***∈ {*+1

*, −*1

*}n*,

*n *is the number of data points (targets), and

*p *is the number of

features. Since by Micchelli et al.20

**w***∥*1 = min

+

*γj*) = min (

**w***T △*(

**γ**)

*−*1

**w **+

**γ**T **1**)

*,*
**γ**≥0 2

**γ**≥0 2

(

**w***T △*(

**γ**)

*−*1

**w **+

**γ**T **1**) +

**1***T ***ξ**
**w***,***b***,***ξ**,**γ **2

**ξ **≥ **1 ***− △*(

**y**)(

*X***w ***− b***1**)

*,*
**ξ **≥ **0***, ***γ **≥ **0***.*
By introducing Lagrangian multipliers

**λ **≥ **0 **and

**µ **≥ **0**, (4) becomes

(

**w***T △*(

**γ**)

*−*1

**w **+

**γ**T **1**) +

**1***T ***ξ **+

**λ**T (

**1 ***− △*(

**y**)(

*X***w ***− b***1**)

*− ***ξ**)

*− ***µ**T **ξ**
**w***,***b***,***ξ**,**γ λ**,**µ**
**λ **≥ **0***, ***µ **≥ **0***, ***γ **≥ **0***.*
Let the objective function of (5) be

*L*1, and let

*∂L*1 = 0, we get

**λ **=

**1 ***− ***µ**. Therefore, since

*∂***ξ**
**µ **≥ **0**, we conclude that

**λ **≤ **1**, hence

**0 ***≤ ***λ **≤ **1**. By substitution, (5) becomes

(

**w***T △*(

**γ**)

*−*1

**w **+

**γ**T **1**) +

**λ**T **1 ***− ***λ**T △(

**y**)

*X***w **+

*b***λ**T △(

**y**)

**1**
**w***,***b***,***γ**
**λ**
**0 ***≤ ***λ **≤ **1***,*
**γ **≥ **0***.*
Let the objective function of (6) be

*L*2, and let

*∂L*2 = 0. We get

**λ**T **y **= 0, so (6) becomes

(

**w***T △*(

**γ**)

*−*1

**w **+

**γ**T **1**) +

**λ**T **1 ***− ***λ**T △(

**y**)

*X***w**
**w***,***γ**
**λ**
**0 ***≤ ***λ **≤ **1***,*
**λ**T **y **= 0

*,*
**γ **≥ **0***.*
Let the objective function of (7) be

*L*3, and let

*∂L*3 = 0, we get

*β△*(

**γ**)

*−*1

**w ***− XT △*(

**y**)

*λ *=

**0**,

*∂***w**
so that

**w **= 1

*△*(

**γ**)

*XT △*(

**y**)

**λ**. By substitution, (7) becomes

min max

**λ**T **1 ***− *1

**λ**T △(

**y**)

*X△*(

**γ**)

*XT △*(

**y**)

**λ **+

**γ**T **1**
**γ**
**λ**
**0 ***≤ ***λ **≤ **1***,*
**λ**T **y **= 0

*,*
**γ **≥ **0***.*
Note that

**γ **is the feature selection vector. Crucially, this problem is convex in

**γ **and has no

local minima,21 hence it provides an optimal form of feature selection that can be eﬃcientlyobtained in conjunction with SVM training. Because a drug may bind to diﬀerent regionsof diﬀerent proteins, i.e., diﬀerent regions on diﬀerent targets can bind to the same drug,
each positive data point may correspond to a diﬀerent set of important features (substrings).

Therefore, the nature of this drug-target classiﬁcation problem is essentially a multi-instanceclassiﬁcation problem. To address this, we consider two ideas:

**Idea (a) **Use a radial basis function (RBF) kernel (Gaussian kernel), rather than a linear

kernel since this addresses the multi-instance classiﬁcation problem more eﬀectively afterimplicitly mapping data points to an inﬁnite dimensional space. After Gaussian kernelization,
the original linear kernel matrix

*K *=

*X△*(

**γ**)

*XT *becomes

*K′ *=

*e*
(

**x***i−***x***j*)

*T △*(

**γ**)(

**x***i−***x***j*).

**Idea (b) **Because each positive data point may correspond to a unique set of important

features, in principle each positive example

**x***i *should employ its own feature selection vector

**γ**+

while all negative examples should share a same vector

**γ**−. So we get

*K′′ *=

*e*
*∥***γ**i⊙**x***i−***γ**j⊙**x***j∥*2

for all

*i *and

*j*, where

**γ**i =

**γ**+ if

*y*
*i *= +1, and

**γ**i =

**γ**− if

*yi *=

*−*1. Here

*⊙ *stands for element-wise

Idea (a) can be easily applied to (8) at the sacriﬁce of convexity, while applying Idea
(b) to (8) will introduce too many extra coeﬃcients which makes the model computationally

expensive. To circumvent these issues, we introduce an eﬃcient approach to re-weight the

features. Intuitively, we wish to down-weight the features that are false positive indicator of

binding, i.e. features that have a high score/value at some negative training examples (not

bind). This motivation is similar to the case in multi-instance learning, where false positive

indicators call for more strict control than true positive indicators. Towards this end, we

introduce a

*p*-dimensional weight vector

**c **corresponding to the

*p *features, and re-scale the

feature matrix

*X *by ˜

*X *=

*X△*(

**c**). A simple formula of

**c **that concretizes our intuition is

*ij *, where

*aij *= 1 if

*xij ≤ *1

*− ϵ *and

*yi *= 1, and

*aij *= 0 otherwise. Here

*ϵ *is a
small positive number, and all elements in

*X *are assumed to have been normalized to [0

*, *1].

Therefore by replacing

*X *with ˜

*X *in (8), we encourage using features that indicate less false
min max

**λ**T **1 ***− *1

**λ**T △(

**y**)

*K′△*(

**y**)

**λ **+

**γ**T **1**
**γ**
**λ**
**0 ***≤ ***λ **≤ **1***,*
**λ**T **y **= 0

*,*
**γ **≥ **0***,*
where

*K′ *= exp

*−*1 (

**˜**
*i − ***˜**
**x***j*)

*T △*(

**γ**)(

**˜**
**x***i − ***˜**
**x***j*)

We solve (9) by using a combination of L-BFGS-B (Limited-memory Broyden-Fletcher-
Goldfarb-Shanno Bounded Optimization) and gradient decent method over

**γ**. After optimiza-

tion, we get solutions for

**γ **and

**λ**.

**γ **serves as a useful feature selector, with

*γj > ϵ *indicating

the

*j*’s features should be selected and otherwise not.

**λ **can be used to construct the hyper-

plane in the SVM and to predict new data points. Given a test data point (target)

**x***c*, we can

predict its label (binding to the drug or not) based on the sign of the classiﬁer’s output:

**x***c − ***˜**
**x***i*)

*T △*(

**γ**)(

**˜**
**x***c − ***˜**
**x***i*)

*− b.*
As a key step for solving (9), we need the partial derivative of the objective function in (9)
(denoted as

*L*4) with respect to the

*k*’s feature selector

*γk*:

**x***i − ***˜**
**x***j*)

*T △*(

**γ**)(

**˜**
**x***i − ***˜**
**x***j*)

*T*
**3. Experimental Results and Discussion**
**3.1. ***Methods under Comparison*
We compared our method with the state-of-the-art method proposed by Bleakley et al.17 Inparticular, we focused on target-side prediction of their method to make the two approachescomparable. Bleakley et al.17 used the normalized Smith-Waterman score in (1) to evaluatethe similarity between two target sequences. In the context of SVM classiﬁcation, they usedthis target-target similarity matrix as the kernel matrix, i.e. the kernel matrix was ﬁxed intheir method. Based on this similarity measure, nearest neighbor (NN) classiﬁers can also beconstructed as a baseline. We refer to Bleakley et al.’s approach as BLM SVM and BLM NNrespectively. On the other hand, our methods include:

*• *SS L1-SVM: L1-SVM with

*withSS *feature (the main model of this paper),

*• *SS L1-SVM: the classic L2 norm SVM with

*withSS *feature,

*• *SS NN FS: nearest neighbor classiﬁer based on the features selected by SS L1-SVM,

*• *SS NN noFS: nearest neighbor classiﬁer based on all

*withSS *features.

**3.2. ***Experiment Settings*
The framework of our experiment is similar to Bleakley et al.17 Speciﬁcally, we enumeratedall pairs of drug

*di *and protein

*t *in the whole data set. For each (

*di, t*) pair, we treated

*t *asthe single test example while the remaining proteins were used as training examples. To learnan L1 and L2 SVM, we chose the hyper-parameters (e.g.

*β *and

*ϵ*) by using three-fold crossvalidation on the training set, making sure that all the three folders contain at least one targetthat binds to the drug (i.e., at least one positive example). After the classiﬁcation model waslearned, we applied it to protein

*t *in a way like (10), and obtained a score

*yit *that could besubsequently used to compute useful performance measures (see Section 3.4). All

*yit *calculatedby ranging over all drugs

*di *and target

*t *in the data set constituted a drug-by-target scoretable.

We set the minimum length of a feature sub-sequence to 5 after trying all lengths from
4 to 12, noting that a too small cutoﬀ generated excessively many features while a too bigcutoﬀ generated an insuﬃcient number of features.

**3.3. ***Datasets*
We used drug-target interaction information from Bleakley et al.,17 which was collected fromthe KEGG BRITE,2 BRENDA,22 SuperTarget23 and DrugBank24 databases. In particular, we
The precision-recall curves of the four methods SS L1-SVM, SS SVM, BLM SVM, BLM NN,
SS NN FS, SS NN noFS on three data set. The results are based on training data with drug interactingwith at least 2 targets.

used three data sets—nuclear receptors, GPCRS, and ion channel—which have 54, 223, 210drugs, 26, 95, 204 targets, and 90, 635, 1476 interactions, respectively. The three data setsused in this article are identical to those used in the state-of-the-art study,17 which facilitatesa fair comparison between the two methods. Since we only focused on target-side prediction,we did not require any drug structural or pharmacological information to obtain drug-drugsimilarity information. The amino acid sequences of the target proteins were obtained fromthe KEGG GENES database.2

**3.4. ***Classification results*
We used ﬁve measurements to evaluate the quality of drug-target prediction: Area under thePrecision-Recall Curve (AUPR), Area under the ROC Curve (AUC), F-Measure, Precision(or Speciﬁcation), and Recall (or Sensitivity). With the prediction score table

*yit *availablefrom Section 3.2, these performance measures were all computed in a micro-average fashion.

That is, given a cutoﬀ point, all

*yit *could be converted into a binary label via thresholding(i.e., binding or not). By comparing these labels with the ground truth over the whole drug-by-target score table, we derived the number of false positive and false negative, which led toPrecision, Recall, and F-Measure. The AUPR and AUC were derived by increasing the cutoﬀswith a ﬁne step size, which led to thousands of points in the precision-recall curve. Of the ﬁvemeasurements, AUPR, AUC, and F-Measure are more robust than the others.

We only demonstrate the results based on

*withSS *feature because the

*withoutSS *feature
set did not result in as good performance. Tables 1, 2, and 3 demonstrate the eﬀectivenessof the diﬀerent drug-target prediction methods over the ﬁve evaluation quantities. The F-Measure, Precision, and Recall scores reported in these tables were obtained at the cutoﬀ pointwhere F-Measure was maximized for respective methods. Figure 1 demonstrates the precision-recall curves of SS L1-SVM and SS SVM compared to BLM SVM, BLM NN, SS NN FS, andSS NN noFS on three data sets, namely Nuclear, GPCR, and Ion Channel from left to right.

Based on these evaluation, the SVM approaches that use

*withSS *feature set (i.e.,
SS L1-SVM and SS SVM) outperform the current state-of-the-art methods BLM SVM andBLM NN. Moreover, the L1 norm feature selection method SS L1-SVM is more eﬀective thanthe traditional SVM method; it uses only 72

*.*85%, 85

*.*02%, and 62

*.*86% of the original features
Evaluations of classiﬁcation quality on Nuclear data set. The F-Measure,
Precision, and Recall scores were obtained at the cutoﬀ point where F-Measure wasmaximized for respective prediction methods.

**Performance comparison:**
Evaluations of classiﬁcation quality on GPCR data set. The F-Measure,
Precision, and Recall scores were obtained at the cutoﬀ point where F-Measure wasmaximized for respective prediction methods.

**Performance comparison:**
Evaluations of classiﬁcation quality on Ion data set. The F-Measure, Preci-
sion, and Recall scores were obtained at the cutoﬀ point where F-Measure was maxi-mized for respective prediction methods.

**Performance comparison:**
in the Nuclear, GPCR, and Ion Channels datasets, respectively. The signiﬁcant reduction infeature set size can not only make the classiﬁcation more eﬃcient and eﬀective, it can alsohelp biological practitioners to identify important features more accurately.

We further investigated the prediction result generated by the SS L1-SVM method and
the BLM SVM method. At the prediction cutoﬀ where both methods attained their ownmaximum F-Measure score, there were 8, 127, and 78 true positive interactions that SS L1-SVM managed to identify but were missed by BLM SVM. This was in comparison to 7, 16,
52 true positives that were identiﬁed by BLM SVM but not by SS L1-SVM. False positive isanother important measurement of a method. On the three datasets Nuclear, GPCR, and IonChannels, SS L1-SVM generated 0, 73, and 139 false positive interactions, compared to 2, 85,117 false positive interactions generated by BLM SVM.

Some interesting case studies are in order. On the Nuclear dataset, the two nearest neigh-
bors of the target protein RORB (KEGG Homo sapiens protein ID “hsa6096”) under the nor-malized Smith-Waterman score are RORA (“hsa6095”) and RORC (“hsa6097”), with scores0.578 and 0.458 respectively. RORB and RORC share a common interacting drug

*Tretinoin*(KEGG drug ID “D00094”) while RORB and RORA do not. According to the BLM method,RORB will be predicted to have no interaction with

*Tretinoin *because its nearest neighborRORA does not interact with

*Tretinoin*. On the contrary, our method can correctly identifythe interaction between RORB and

*Tretinoin *because the

*withSS *feature set based methodcan discover two important substrings “EVVLVRMCRA-N” and “N-TV-FEGKYGGM” thatexist in both RORB and RORC. Therefore, although the overall match score between RORBand RORC is not the highest, their feature vectors (with respect to the two feature substrings)are the most similar.

On the GPCR dataset, the ﬁve nearest neighbors of the target protein CHRM1 (KEGG
Homo sapiens protein ID “hsa1128”) under the normalized Smith-Waterman scores areCHRM5 (“hsa1133”), CHRM3 (“hsa1131”), CHRM4 (“hsa1132”), CHRM2 (“hsa1129”),and HRH3 (“hsa11255”), with scores 0.4707, 0.4536, 0.4237, 0.4228, and 0.2446 respec-tively. Although CHRM1 is supposed to bind to drug

*Metoclopramide *(KEGG drug ID“D00726”), none of its ﬁve nearest neighbors bind to this drug. In fact binding occursonly with the 6-th nearest neighbor, HRH2 (“hsa3274”), whose

*SWnorm *score with re-spect to CHRM1 is 0.2137. Therefore, the BLM methods can hardly predict that CHRM1binds to

*Metoclopramide*. In contrast, our method can correctly predict this interac-tion because the important substrings such as “KRTPRRAA”, “Y-AKRTP-RAA-MI-L-W”, and “NYFL-SLA-AD” are present in both CHRM1 and several proteins that bindto

*Metoclopramide*, e.g., HTR1A (“hsa3350”), HTR1B (“hsa3351”), HTR1D (“hsa3352”),HTR1E (“hsa3354”), HTR1F (“hsa3355”), HTR2A (“hsa3356”), HTR2B (“hsa3357”),HTR2C(“hsa3358”), HTR4(“hsa3360”), HTR5A(“hsa3361”), and HTR6(“hsa3362”), whichare all considered as faraway neighbors according to the

*SWnorm *scores.

The binding regions discovered by our computational model can also be found on the
Ion dataset. To provide potential drug-target binding regions for further investigation, weproduced all the important common substrings selected by the SS L1-SVM method, whichare made available online at “http://www.cs.ualberta.ca/~ys3/drug_target”.

**4. Conclusions**
In this article, we proposed a novel drug-target interaction prediction method based on poten-tial drug-target binding regions. According to the evaluation metrics, the proposed methodsigniﬁcantly outperformed the current state-of-the-art methods. More importantly, it identi-ﬁed a number of drug-target interactions that were missed by previous methods. We believethat the poor recall of previous methods is due to the use of a target kernel matrix based
on Smith-Waterman score: a low overall similarity between two protein sequences does notmean they do not share common drug binding regions. This drawback was overcome in ourapproach by collecting a large number of candidate binding regions (i.e., common substrings)that subsequently played the primary role in interaction prediction. In addition, the use of anexplicit vector representation, as opposed to implicit similarity measure, enabled the easy useof non-linear kernel expansions that were not possible for ﬁxed kernel methods like BLM.

Besides the kernels based on substring feature vector, we believe the techniques of string
kernel proposed in25 could be useful in this problem. One straightforward beneﬁt of using thestring kernel is that it will automatically consider all substrings of a given sequence pair. It canalso provide more intuitive understanding of substring-based sequence similarities than usingGaussian kernel. However, to employ the string kernel, one needs to customize the featureselection function into the model and to extend the non-gapped matching in string kernels.

We presented a feature selection method based on

*L*1-norm SVM that could not only pre-
dict the binding relations more accurately, but also ﬁnd important candidate binding regions(features). It integrated feature selection directly into

*L*1-norm SVM and kernelized the op-timization model. A drawback was that the sparse regularization term tended to select onlya single feature from the candidate set. This is a well known problem with

*L*1 based regu-larization.26 To avoid this limitation, we will investigate a combination of

*L*1 and

*L*2 normregularizers, known as the elastic net,26 which is generally more eﬀective at group featureselection. Another possible extension is to adopt the OSCAR model,27 which appears evenmore eﬀective. We also discovered that the inference problem of drug-target interaction—insome cases—can be considered as a multi-instance learning problem. So we proposed usingmultiple feature selection vectors for each positive training example in theory and applied thefeature cost vector to address the multi-instance problem in practice. We hypothesize thatmore advanced machine learning methods speciﬁcally tailored for multi-instance classiﬁcationcan further improve the accuracy of drug-target interaction prediction. Moreover, consideringthat protein 3D structures carry the essential binding information and an increasing amount ofprotein 3D structure is being made available (e.g., PSI Nature Structural Biology Knowledge-base28), incorporating protein 3D information in the prediction model in addition to sequenceinformation would lead to promising improvement.

**References**
1. C. M. Dobson, Chemical space and biology,

*Nature ***432**, 824 (2004).

2. M. Kanehisa, S. Goto, M. Hattori and et al., From genomics to chemical genomics: New devel-

opments in kegg,

*Nucleic Acids Res. ***34**, D354 (2006).

3. B. R. Stockwell, Chemical genetics: Ligand-based discovery of gene function,

*Nat. Rev. Genet.*
**1**, 116 (2000).

4. D. L. Wheeler, T. Barrett, D. A. Benson and et al., Database resources of the national center
for biotechnology information,

*Nucleic Acids Res. ***34**, D173 (2006).

5. M. Kuhn, D. Szklarczyk, A. Franceschini, C. von Mering, L. J. Jensen and P. Bork, Stitch 3:
zooming in on protein-chemical interactions,

*Nucleic Acids Res. ***40**, 876 (2012).

6. G. Jones, P. Willett, R. C. Glen and et al., Development and validation for a genetic algorithm
for ﬂexible docking,

*J. Mol. Biol. ***267**, 727 (1997).

7. G. M. Morris, D. S. Goodsell, R. S. Halliday and et al., Automated docking using a lamarckian
genetic algorithm and empirical binding free energy function,

*J. Comput. Chem. ***19**, 1639 (1998).

8. B. K. Shoichet, D. L. Bodian and I. D. Kuntz, Molecular docking using shape descriptors,

*J.*
*Comput. Chem. ***13**, 380 (1992).

9. S. J. Haggarty, K. M. Koeller, J. C. Wong and et al., Multidimensional chemical genetic analysis
of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays,

*Chem. Biol.*

**10**, 383 (2003).

10. F. G. Kuruvilla, A. F. Shamji, S. M. Sternson and et al., Dissecting glucose signalling with
diversity-oriented synthesis and small-molecule microarrays,

*Nature ***416**, 653 (2002).

11. M. J. Keiser, B. L. Roth, B. N. Armbruster and et al., Relating protein pharmacology by ligand
chemistry,

*Nat. Biotech. ***25**, 197 (2007).

12. M. Campillos, M. Kuhn, A. C. Gavin and et al., Drug target identiﬁcation using side-eﬀect
similarity,

*Science ***321**, 263 (2008).

13. Y. Yamanishi, M. Araki, A. Gutteridge and et al., Prediction of drug-target interaction networks
from the integration of chemical and genomic spaces,

*Bioinformatics ***24**, i232 (2008).

14. L. Perlman, A. Gottlieb, N. Atias, E. Ruppin and R. Sharan, Combining drug and gene similarity
measures for drug-target elucidation,

*J. Comput. Biol. ***18(2)**, 133 (2011).

15. N. Nagamine and Y. Sakakibara, Statistical prediction of protein-chemical interactions based on
chemical structure and mass spectrometry data,

*Bioinformatics ***23**, 2004 (2007).

16. L. Jacob and J. P. Vert, Protein-ligand interaction prediction: An improved chemogenomics
approach,

*Bioinformatics ***24**, 2149 (2008).

17. K. Bleakley and Y. Yamanishi, Supervised prediction of drug-target interactions using bipartite
local models,

*Bioinformatics ***25**, 2397 (2009).

18. Y. Yamanishi, M. Kotera, M. Kanehisa and et al., Drug-target interaction prediction from chem-
ical, genomic and pharmacological data in an integrated framework,

*Bioinformatics ***26**, i246

(2010).

19. T. F. Smith and M. Waterman, Identiﬁcation of common molecular subsequences,

*J. Mol. Biol*
**147**, 195 (1981).

20. C. Micchelli and M. Pontil, Learning the kernel function via regularization,

*J. Mach. Learn. Res.*
**6**, 1099 (2006).

21. G. Lanckriet, M. Cristianini, P. Bartlett and et al., Learning the kernel matrix with semi-deﬁnite
programming,

*J. Mach. Learn. Res. ***5**, 27 (2004).

22. I. Schomburg, A. Chang, C. Ebeling and et al., Brenda, the enzyme database: Updates and
major new developments,

*Nucleic Acids Res. ***32**, D431 (2004).

23. S. Gunther, M. Kuhn, M. Dunkel and et al., Supertarget and matador: Resources for exploring
drug-target relationships,

*Nucleic Acids Res. ***36**, D919 (2008).

24. D. S. Wishart, C. Knox, A. C. Guo and et al., Drugbank: A knowledgebase for drugs, drug
actions and drug targets,

*Nucleic Acids Res. ***36**, D901 (2007).

25. S. V. N. Vishwanathan and A. J. Smola, Fast kernels for string and tree matching,

*NIPS ***15**,

26. H. Zou and T. Hastie, Regularization and variable selection via the elastic net,

*Journal of the*
*Royal Statistical Society B ***67**, 301 (2005).

27. W. Zhong and J. Kwok, Eﬃcient sparse modeling with automatic feature grouping,

*ICML ***28**,

28. H. M. Berman, Psi nature structural biology knowledgebase (www.sbkb.org/kb/index.html)

Source: http://users.cecs.anu.edu.au/~xzhang/papers/Shietal13.pdf

Communiqué de presse 11 Novembre 2011 Lancement d’Equistone Partners Europe Reprise de la société de gestion Barclays Private Equity (« BPE ») par ses équipes Equistone Partners Europe (« Equistone ») annonce la reprise de la société de gestion Barclays Private Equity, un des leaders européen du capital investissement sur le segment du mid-market. Equistone, sociét

DIGITISATION OF THE MINGANA-LEWIS PALIMPSEST, CAMBRIDGE UNIVERSITY LIBRARY. FINAL REPORT OF THE PROJECT FUNDED BY TIMA The main purpose of this project was the digitisation of the Mingana-Lewis palimpsest (Cambridge University Library Or. 1287), applying ultraviolet lighting in order to read the scriptio inferior of the parchment, that bears some fragments of Qur’anic text. Further aim of