INTEGRATING MICROARRAY AND PROTEOMICS DATA
ANNELEEN DAEMEN1∗, OLIVIER GEVAERT1, TIJL DE BIE2, ANNELIES
DEBUCQUOY3, JEAN-PASCAL MACHIELS4, BART DE MOOR1 AND
1Katholieke Universiteit Leuven, Department of Electrical Engineering (ESAT), SCD-SISTA (BIOI), Kasteelpark Arenberg 10 - bus 2446,
2University of Bristol, Department of Engineering Mathematics, Queen's Building, University Walk, Bristol, BS8 1TR, UK
3Katholieke Universiteit Leuven / University Hospital Gasthuisberg Leuven, Department of Radiation Oncology and Experimental Radiation, Herestraat 49, B-3000 Leuven, Belgium
4Université Catholique de Louvain, St Luc University Hospital, Department of Medical Oncology, Ave. Hippocrate 10,
To investigate the combination of cetuximab, capecitabine and radiotherapy in the preoperative treatment of patients with rectal cancer, forty tumour samples were gathered before treatment (T0), after one dose of cetuximab but before radiotherapy with capecitabine (T1) and at the moment of surgery (T2). The tumour and plasma samples were subjected at all timepoints to Affymetrix microarray and Luminex proteomics analysis, respectively. At surgery, the Rectal Cancer Regression Grade (RCRG) was registered. We used a kernel-based method with Least Squares Support Vector Machines to predict the RCRG based on the integration of microarray and proteomics data at T0 and T1. We demonstrated that combining multiple data sources improves the predictive power. The best model was based on 5 genes and 10 proteins at T0 and T1 and could predict the RCRG with an accuracy of 91.7%, sensitivity of 96.2% and specificity of 80%.
A recent challenge for genomics is the integration of the complementary views of the genome provided by various types of genome-wide data. It is likely
∗To whom correspondence should be addressed: [email protected]
that these multiple views contain different, partly independent and complementary information. In the near future the amount of available data will increase further (e.g. methylation, alternative splicing, metabolomics). This makes data fusion an increasingly important topic in bioinformatics.
Kernel methods, and in particular Support Vector Machines (SVMs) for supervised classification, are a powerful class of methods for pattern analysis, and in recent years they have become a standard tool in data analysis, computational statistics and machine learning.1-2 Built on a strong theoretical framework, their rapid uptake in applications such as bioinformatics, chemoinformatics and even computational linguistics is due to their reliability, accuracy and computational efficiency, demonstrated in countless applications, as well as to their ability to handle a very wide range of data types and to combine them (kernel methods have, for example, been used to analyze sequences, vectors, networks and phylogenetic trees). Kernel methods work by mapping input items of any kind (be they sequences, numeric vectors or molecular structures) into a high-dimensional space. The embedding of the data into a vector space is performed by a mathematical object called a 'kernel function' that can efficiently compute the inner product between all pairs of data items in the embedding space, resulting in the so-called kernel matrix. Through these inner products, every data set is represented by this real-valued square matrix, independent of the nature or complexity of the objects to be analyzed, which makes all types of data equally treatable and easily comparable.
Their ability to deal with complexly structured data made kernel methods ideally positioned for heterogeneous data integration. This was first understood and demonstrated in 2002, when a crucial paper integrated amino-acid sequence information (and similarity statistics), expression data, protein-protein interaction data and other types of genomic information to solve a single classification problem: the classification of transmembrane versus non-transmembrane proteins.3 Thanks to this integration of information, a higher accuracy was achieved than was possible based on any of the data sources separately. This and related approaches are now widely used in bioinformatics.4-6
Inspired by this idea, we adapted this framework, which is based on a convex optimization problem solvable with semi-definite programming (SDP). As supervised classification algorithm we used Least Squares Support Vector Machines (LS-SVMs) instead of SVMs. First, LS-SVMs are easier and faster for high-dimensional data because the quadratic programming problem is converted into a linear problem. Secondly, LS-SVMs are also more suitable as they contain regularization, which allows tackling the problem of overfitting. We have shown that regularization is very important when applying classification methods to high-dimensional data.7
The algorithm described in this paper will be applied to data of patients with rectal cancer. To investigate the combination of cetuximab, capecitabine and radiotherapy in the preoperative treatment of patients with rectal cancer, microarray and proteomics data were gathered from forty rectal cancer patients at three timepoints during therapy. At surgery, several outcomes were registered, but here we focus on the Rectal Cancer Regression Grade8 (RCRG), a pathological staging system based on Wheeler for irradiated rectal cancer that includes a measurement of tumour response after preoperative therapy. The patients were divided into two groups that we would like to distinguish: the positive group (RCRG pos) contained Wheeler 1 (good responsiveness; the tumour is sterilized or only microscopic foci of adenocarcinoma remain); the negative group (RCRG neg) consisted of Wheeler 2 (moderate responsiveness; marked fibrosis but still a macroscopic tumour) and Wheeler 3 (poor responsiveness; little or no fibrosis with abundant macroscopic tumour). We refer the reader to Ref. 9 for more details about the study and the patient characteristics.
In this paper, we would like to demonstrate that integrating multiple available data sources in an appropriate way using kernel methods increases the predictive power compared to models built on only one data set. The developed algorithm will be demonstrated on the rectal cancer patient data. The goal is to predict the RCRG at T1 (i.e. before the start of radiotherapy).
Forty patients with rectal cancer (T3-T4 and/or N+) from seven Belgian centers were enrolled in a phase I/II study investigating the combination of cetuximab, capecitabine and radiotherapy in the preoperative treatment of patients with rectal cancer.9 Tissue and plasma samples were gathered before treatment (T0), after one dose of cetuximab but before radiotherapy with capecitabine (T1) and at the moment of surgery (T2). At all three timepoints, the frozen tissues were used for Affymetrix microarray analysis while the plasma samples were used for Luminex proteomics analysis. Because some patients had to be excluded, the final data set contained 36 patients.
The samples were hybridized to Affymetrix human U133 2.0 plus gene chip arrays. The resulting data were first preprocessed for each timepoint separately using RMA.10 Secondly, the probe sets were mapped to Entrez Gene Ids by taking the median of all probe sets that matched the same gene. Probe sets that matched multiple genes were excluded and unknown probe sets were given an arbitrary Entrez Gene Id. This reduced the number of features from 54613 probe sets to 27650 genes. Because the number of differentially expressed genes can be expected to be much lower than 27650, a prefiltering without reference to phenotype can be used to reduce the number of genes. Taking into account the low signal-to-noise ratio of microarray data, we decided to filter out genes that show low variation across all samples. Retaining only the genes with a variance in the top 25% reduced the number of features at each timepoint to 6913 genes.
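The unsupervised variance filter described above can be sketched in a few lines of numpy (function and variable names are ours, not the authors'; the 25% threshold follows the text):

```python
import numpy as np

def variance_filter(expr, top_fraction=0.25):
    """Keep only the genes whose variance across samples is in the top fraction.

    expr: (n_genes, n_samples) expression matrix, already RMA-normalized.
    Returns the filtered matrix and the indices of the retained genes.
    """
    variances = expr.var(axis=1)
    # genes at or above the (1 - top_fraction) quantile of the variance are kept
    threshold = np.quantile(variances, 1.0 - top_fraction)
    keep = variances >= threshold
    return expr[keep], np.flatnonzero(keep)

# toy example: 8 genes measured in 5 samples; top 25% -> 2 genes survive
rng = np.random.default_rng(0)
expr = rng.normal(size=(8, 5))
filtered, idx = variance_filter(expr, top_fraction=0.25)
```

Because the filter never looks at the class labels, it can be applied once to the whole data set without biasing the later cross-validation.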
The proteomics data consist of 96 proteins, previously known to be involved in cancer, measured for all patients on a Luminex 100 instrument. Proteins that had absolute values above the detection limit in less than 20% of the samples were excluded for each timepoint separately. This resulted in the exclusion of six proteins at T0, four at T1 and six at T2. The proteomics expression values of transforming growth factor alpha (TGFα), which also had too many values below the detection limit, were replaced by the results of ELISA tests performed at the Department of Experimental Oncology in Leuven. For the remaining proteins, missing values were replaced by half of the minimum detected for each protein over all samples, and values exceeding the upper limit were replaced by the upper limit value. Because most of the proteins had a positively skewed distribution, a log transformation (base 2) was performed.
In this paper, only the data sets at T0 and T1 were used because the goal of the models is to predict the RCRG before the start of chemoradiation.
Kernel methods are a group of algorithms that do not depend on the nature of the data because they represent data entities through a set of pairwise comparisons called the kernel matrix. The size of this matrix is determined only by the number of data entities, whatever the nature or the complexity of these entities. For example, a set of 100 patients each characterized by 6913 gene expression values is still represented by a 100 × 100 kernel matrix.4 Similarly, a set of 96 proteins characterized by their 3D structure is represented by a 96 × 96 kernel matrix. The kernel matrix can be expressed geometrically as a transformation of each data point x to a high-dimensional feature space with the mapping function Φ(x). By defining a kernel function k(xk, xl) as the inner product ⟨Φ(xk), Φ(xl)⟩ of two data points xk and xl, an explicit representation of Φ(x) in the feature space is no longer needed. Any symmetric, positive semidefinite function is a valid kernel function, resulting in many possible kernels, e.g. linear, polynomial and diffusion kernels. Each corresponds to a different transformation of the data, meaning that each extracts a specific type of information from the data set. The kernel representation can therefore be applied to many different types of data and is not limited to vectorial or matrix form.
An example of a kernel algorithm for supervised classification is the Support Vector Machine (SVM) developed by Vapnik and others.11 Contrary to most other classification methods, and due to the way data are represented through kernels, SVMs can tackle high-dimensional data (e.g. microarray data). The SVM forms a linear discriminant boundary in feature space with maximum distance between samples of the two considered classes, which corresponds to a non-linear discriminant function in the original input space. A modified version of the SVM, the Least Squares Support Vector Machine (LS-SVM), was developed by Suykens et al.12-13 On high-dimensional data sets this modified version is much faster for classification because a linear system instead of a quadratic programming problem needs to be solved. The LS-SVM also contains regularization, which tackles the problem of overfitting. In the next section we describe the use of LS-SVMs with a normalized linear kernel to predict the RCRG in rectal cancer patients based on the kernel integration of microarray and proteomics data at T0 and T1.
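The "linear system instead of quadratic program" point can be made concrete with a small numpy sketch of the Suykens dual formulation (our own naming, not the authors' code): training amounts to solving one bordered linear system in the bias b and the dual coefficients α.

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Train an LS-SVM classifier by solving a single linear system.

    K: (n, n) kernel matrix, y: labels in {-1, +1}, gamma: regularization.
    Solves [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
    with Omega_kl = y_k * y_l * K_kl (Suykens' dual formulation).
    """
    n = len(y)
    Omega = np.outer(y, y) * K
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # bias b, dual coefficients alpha

def lssvm_predict(K_test, y_train, alpha, b):
    """K_test: (n_test, n_train) kernel between test and training points."""
    return np.sign(K_test @ (alpha * y_train) + b)
```

On a separable toy problem (e.g. four 1D points with a linear kernel), the trained model reproduces the training labels; the only tuning knob besides the kernel is γ, which is exactly the parameter optimized on the grid later in the paper.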
There exist three ways to learn simultaneously from multiple data sources using kernel methods: early, intermediate and late integration.14 Figure 1 gives a global overview of these three methods in the case of two available data sets. In this paper, intermediate integration was chosen: kernel functions can then be better adapted to each data set separately, and by adding the kernel matrices before training the LS-SVM, only one predicted outcome per patient is obtained, which makes an extra decision function unnecessary.

Figure 1. Three methods to learn from multiple data sources. In early integration, an LS-SVM is trained on the kernel matrix computed from the concatenated data set. In intermediate integration, a kernel matrix is computed for each data set and an LS-SVM is trained on the sum of the kernel matrices. In late integration, two LS-SVMs are trained, one for each data set; a decision function then yields a single outcome for each patient.
In this paper, the normalized linear kernel function was used instead of the plain linear kernel function k(xk, xl) = xk^T xl:

    k̂(xk, xl) = k(xk, xl) / √(k(xk, xk) k(xl, xl))    (1)

With the normalized version, the values in the kernel matrix are bounded because the data points are projected onto the unit sphere, while without normalization these elements can take very large values. Normalization is thus required when combining multiple data sources, to guarantee the same order of magnitude for the kernel matrices of the different data sets.
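The normalized linear kernel is simply cosine similarity between sample vectors; a sketch (our own naming) shows the bounding effect directly:

```python
import numpy as np

def normalized_linear_kernel(X, Y=None):
    """Normalized linear kernel between rows of X and rows of Y:
    khat(x, z) = (x . z) / sqrt((x . x)(z . z)),
    i.e. every data point is projected onto the unit sphere, so all
    kernel values lie in [-1, 1] and the diagonal is exactly 1.
    """
    if Y is None:
        Y = X
    G = X @ Y.T                                   # plain linear kernel x^T z
    nx = np.sqrt(np.einsum('ij,ij->i', X, X))     # row norms of X
    ny = np.sqrt(np.einsum('ij,ij->i', Y, Y))     # row norms of Y
    return G / np.outer(nx, ny)
```

Scaling any sample by a positive constant leaves its row and column of the kernel matrix unchanged, which is what puts kernels from differently scaled data sources on the same order of magnitude.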
There are four data sets that have to be combined: microarray data at T0 and at T1, and proteomics data at T0 and at T1. Because each data set is represented by a kernel matrix, these data sources can be integrated in a straightforward way by adding the kernel matrices according to the intermediate integration method explained previously. In this combination, each of the matrices is given a specific weight µi. The resulting kernel matrix is given in Eq. 2; positive semidefiniteness of the linear combination of kernel matrices is guaranteed when the weights µi are constrained to be non-negative:

    K = µ1K1 + µ2K2 + µ3K3 + µ4K4    (2)
The choice of the weights is important. Previous studies have shown that optimizing the weights only leads to better performance when some of the available data sets are redundant or very noisy.3 In our case we believe that the microarray and proteomics data sets are equally reliable, based on our results with LS-SVMs on each data source separately (data not shown). Therefore, to avoid optimizing the weights, they were chosen equal: µ1 = µ2 = µ3 = µ4 = 0.25.
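Eq. 2 is a one-liner once the per-source kernel matrices exist; a minimal sketch (our own naming) with the non-negativity check that preserves positive semidefiniteness:

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices (intermediate integration, Eq. 2).

    Non-negative weights guarantee that the combined matrix is again a
    valid (positive semidefinite) kernel matrix.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(w * K for w, K in zip(weights, kernels))
```

With the paper's choice µ1 = µ2 = µ3 = µ4 = 0.25, the call would be `combine_kernels([K1, K2, K3, K4], [0.25] * 4)`.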
Due to the small size of the data set, we chose a leave-one-out cross-validation (LOO-CV) strategy to estimate the generalization performance (see Fig. 2). Since the two classes were unbalanced (26 RCRG pos and 10 RCRG neg), the minority class was resampled in each LOO iteration by randomly duplicating a sample from the minority class and adding uniform noise ([0, 0.1]). This was repeated until the number of samples in the minority class was at least 70% of the majority class (a value chosen without optimization).
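The resampling step inside each LOO iteration can be sketched as follows (our own function and parameter names; the 70% ratio and [0, 0.1] noise follow the text):

```python
import numpy as np

def oversample_minority(X, y, minority_label, ratio=0.7, noise=0.1, rng=None):
    """Duplicate random minority-class samples with small uniform noise until
    the minority class reaches `ratio` times the majority class size.

    X: (n_samples, n_features), y: sequence of class labels.
    Returns the augmented matrix and label array.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    y = list(y)
    minority = [i for i, lab in enumerate(y) if lab == minority_label]
    n_minor, n_major = len(minority), len(y) - len(minority)
    rows = [X]
    while n_minor < ratio * n_major:
        i = rng.choice(minority)                  # pick a random minority sample
        jitter = rng.uniform(0.0, noise, size=(1, X.shape[1]))
        rows.append(X[i:i + 1] + jitter)          # duplicate with uniform noise
        y.append(minority_label)
        n_minor += 1
    return np.vstack(rows), np.array(y)
```

Because the duplication is redone inside every LOO iteration (after the test sample is set aside), the left-out sample never leaks into the augmented training set.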
With the weights fixed, three parameters are left to be optimized: the regularization parameter γ of the LS-SVM, the number of genes used from the microarray data sets at T0 and T1, and the number of proteins used from the proteomics data sets. To accomplish this, a three-dimensional grid was defined, as shown in Fig. 2, on which the parameters are optimized by maximizing a criterion on the training set. The possible values for γ on this grid range from 10^-10 to 10^10 on a logarithmic scale. The numbers of genes that were tested are 5, 10, 30, 50, 100, 300, 500, 1000, 3000 and all genes; the numbers of proteins are 5, 10, 25, 50 and all proteins. Genes and proteins were selected by ranking these features using the Wilcoxon rank sum test. In each LOO-CV iteration, a model is built for each possible combination of parameters on the 3D grid and evaluated on the left-out sample. This whole procedure is repeated for all samples in the set, and the model with the highest accuracy is chosen. If multiple models have equal accuracy, the model with the highest sum of sensitivity and specificity is chosen.

Figure 2. Methodology for developing a classifier. The available data contain microarray data and proteomics data at T0 and T1. The regularization parameter γ and the number of genes (GS) and proteins (PS) are determined with a leave-one-out cross-validation strategy on the complete set. In each leave-one-out iteration, an LS-SVM model is trained on the most significant genes and proteins for all possible combinations of γ and the number of features. This gives a globally best parameter combination (γ, GS, PS).
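The feature-ranking step of the grid search can be sketched with a numpy-only version of the Wilcoxon rank-sum statistic (our own naming; ties are ignored for brevity, so this is an approximation of the exact test used in the paper):

```python
import numpy as np

def ranksum_scores(X, y):
    """Score each feature by the absolute standardized Wilcoxon rank-sum
    statistic between the two classes (labels +1 and -1).

    X: (n_samples, n_features), y: array of labels in {-1, +1}.
    Larger score = stronger class separation for that feature.
    """
    pos, neg = X[y == 1], X[y == -1]
    n1, n2 = len(pos), len(neg)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # ranks (1-based) of the pooled sample; no tie correction
        ranks = np.argsort(np.argsort(np.r_[pos[:, j], neg[:, j]])) + 1
        w = ranks[:n1].sum()                       # rank sum of the +1 class
        mu = n1 * (n1 + n2 + 1) / 2.0              # mean of W under H0
        sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
        scores[j] = abs((w - mu) / sigma)
    return scores

def select_top_features(X, y, k):
    """Indices of the k highest-ranking features."""
    return np.argsort(ranksum_scores(X, y))[::-1][:k]
```

Inside each LOO iteration, `select_top_features` would be called on the training folds for every candidate gene and protein count on the 3D grid, with the corresponding LS-SVM then trained for each γ.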
We evaluated the methodology described in Sec. 3.3 on the rectal cancer data set to predict the Rectal Cancer Regression Grade. The model with the highest accuracy and the highest sum of sensitivity and specificity was built on the five most significant genes and the ten most significant proteins at T0 and T1 according to the RCRG. From now on, we refer to this model as MPIM (Microarray and Proteomics Integration Model). To evaluate its performance, six other models were built on different combinations of data sources using the same model-building strategy: MMT0 (Microarray Model at T0: all microarray data at T0), MMT1 (Microarray Model at T1: all microarray data at T1), MIM (Microarray Integration Model: microarray data at both timepoints), PMT0 (Proteomics Model at T0: all proteomics data at T0), PMT1 (Proteomics Model at T1: all proteomics data at T1) and PIM (Proteomics Integration Model: proteomics data at both timepoints).
Table 1 gives an overview of the performances of all these models. MPIM predicts the RCRG correctly in 33 of the 36 patients (91.7%). Almost all patients with positive RCRG are predicted correctly, with a sensitivity of 96.2% and a positive predictive value of 0.926. Of the patients with negative RCRG, 80% are classified correctly. None of the other models performs better on any of the performance parameters shown in Table 1.
Table 1. Performance of MPIM compared to models based on different combinations of data sources. TP, true positive; FP, false positive; FN, false negative; TN, true negative; Sens, sensitivity; Spec, specificity; PPV, positive predictive value; NPV, negative predictive value; Acc, predictive accuracy.
The MPIM is built on 5 genes that differ between T0 and T1, 9 proteins that differ between T0 and T1, and 1 protein (ferritin) selected at both timepoints.
Among the 10 genes, several are related to cancer. Bone morphogenetic protein 4 (BMP4) is involved in development, morphogenesis, cell proliferation and apoptosis. This protein, upregulated in colorectal tumours, seems to help initiate the metastasis of colorectal cancer without maintaining these metastases.15 Integrin alpha V (ITGAV) is a cell-surface receptor for extracellular matrix proteins. Integrins play important roles in cell-cell and cell-matrix interactions during, among others, immune reactions, tumour growth and progression, and cell survival. ITGAV is related to many cancer types, among which prostate and breast cancer, for which it is important in the bone environment for the growth and pathogenesis of bone metastases.16
Several of the proteins have known associations with rectal and colon cancer, such as ferritin, TGFα, MMP-2 and TNFα. Ferritin, the major intracellular iron storage protein, is an indicator of iron deficiency anemia. This disease is recognized as a presenting feature of right-sided colon cancer and, in men, significantly increases the risk of having colon cancer.17 Transforming growth factor alpha (TGFα) is upregulated in some human cancers, among which rectal cancer.18 In colon cancer, it promotes depletion of tumour-associated macrophages and secretion of amphoterin.19 TGFα is closely related to epidermal growth factor (EGF), one of the other proteins on which MPIM is built. EGF plays an important role in the regulation of cell growth, proliferation and differentiation. Matrix metalloproteinase-2 (MMP-2), known to be implicated in rectal and colon cancer invasion and metastasis, is associated with reduced survival of these patients when more highly expressed in the malignant epithelium and in the surrounding stroma.20 Tumour necrosis factor TNFα has important roles in immunity and cellular remodelling and influences apoptosis and cell survival. Dysregulation and especially overproduction of TNFα have been observed in colorectal cancer.21 Some of the other proteins, such as IL-4 and IL-6, are important for the immune system, whose function depends to a large extent on interleukins. IL-4 is involved in the proliferation of B cells and the development of T cells and mast cells, and also has an important role in the allergic response. IL-6 regulates the immune response and modulates normal and cancer cell growth, differentiation and cell survival.22 It causes increased steady-state levels of TGFα mRNA in macrophage-like cells.23
We presented a framework for the combination of multiple genome-wide data sources in disease management using a kernel-based approach (see Fig. 2). Each data set is represented by a kernel matrix based on a normalized linear kernel function. These matrices are combined according to the intermediate integration method illustrated in Fig. 1, after which an LS-SVM is trained on the combined kernel matrix. In this paper, we evaluated the resulting algorithm on our data set consisting of microarray and proteomics data of rectal cancer patients to predict the Rectal Cancer Regression Grade after a combination therapy of cetuximab, capecitabine and radiotherapy. The best model (MPIM) is based on 5 genes and 10 proteins at T0 and at T1 and can predict the RCRG with an accuracy of 91.7%, sensitivity of 96.2% and specificity of 80%. Table 1 shows that the performance parameters of MPIM are better than or equal to those of the other models. This demonstrates that microarray and proteomics data are partly complementary and that our algorithm, which integrates these various views on the genome, improves the prediction of response to therapy compared with LS-SVMs trained on fewer data sources. Many of the genes and proteins on which the MPIM is built are related to rectal cancer or cancer in general.
We were inspired by the idea of Lanckriet3 and others4-6 to integrate multiple types of genomic information in order to solve a single classification problem with a higher accuracy than is possible based on any of the information sources separately. In the framework of Lanckriet, the problem of optimal kernel combination is formulated as a convex optimization problem using SVMs and solved with semi-definite programming (SDP) techniques. LS-SVMs, however, are easier and faster for high-dimensional data because the problem is formulated as a linear problem instead of a quadratic programming problem, and LS-SVMs contain regularization, which tackles the problem of overfitting. Instead of applying this approach to protein function in yeast, which requires reformulating the problem as 13 binary classification problems (equal to the number of functional classes), we applied a modified version of this framework in the patient space, where many prediction problems are already binary. To the authors' knowledge, this is the first time that a kernel-based integration method has been applied to multiple high-dimensional data sets in the patient domain for studying cancer. Our results show that using information from different levels of the central dogma improves the classification performance.
We already mentioned that kernel methods have a large scope due to their representation of the data. However, as the amount of available data increases in the near future, the choice of the weights becomes more important, especially when applying the algorithm to problems where the reliability of the data sources differs greatly or is not known a priori. In this paper, we chose the weights equal. We cannot guarantee that, without optimizing the weights of the different data sources, we obtain the most optimal model; such an optimization, however, increases the computational burden significantly.
When more data sources become available in the future, they can easily be added to this framework. Additionally, we are currently investigating ways to improve the optimization algorithm, especially for the choice of the weights. We will also apply more advanced feature selection techniques; at this moment a simple statistical test is used. Finally, we will compare kernel methods with other integration frameworks (e.g. Bayesian techniques).24
AD is a research assistant of the Fund for Scientific Research - Flanders (FWO-Vlaanderen). This work is partially supported by: 1. Research Council KUL: GOA AMBioRICS, CoE EF/05/007 SymBioSys. 2. Flemish Government: FWO: PhD/postdoc grants, G.0499.04 (Statistics), G.0302.07 (SVM/Kernel). 3. Belgian Federal Science Policy Office: IUAP P6/25 (BioMaGNet, 2007-2011). 4. EU-RTD: FP6-NoE Biopattern; FP6-IP e-Tumours; FP6-MC-EST Bioptrain.
1. N Cristianini and J Shawe-Taylor, Cambridge University Press, (2000).
2. J Shawe-Taylor and N Cristianini, Cambridge University Press, (2004).
3. G Lanckriet, T De Bie et al., Bioinformatics, 20(16), 2626 (2004).
4. B Schölkopf, K Tsuda and J-P Vert, MIT Press, (2004).
5. W Stafford Noble, Nature Biotechnology, 24(12), 1565 (2006).
6. T De Bie, L-C Tranchevent et al., Accepted for Bioinformatics, (2007).
7. N Pochet, F De Smet et al., Bioinformatics, 20(17), 3185 (2004).
8. J M D Wheeler, B F Warren et al., Dis Colon Rectum, 45(8), 1051 (2002).
9. J-P Machiels, C Sempoux et al., Ann Oncol, in press (2007).
10. R A Irizarry, B Hobbs et al., Biostatistics, 4, 249 (2003).
11. V Vapnik, Wiley, New York (1998).
12. J Suykens and J Vandewalle, Neural Processing Letters, 9(3), 293 (1999).
13. J Suykens, T Van Gestel et al., World Scientific Publishing Co., Pte Ltd.
14. P Pavlidis, J Weston et al., Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, 242 (2001).
15. H Deng, R Makizumi et al., Exp Cell Res, 313, 1033 (2007).
16. J A Nemeth, M L Cher et al., Clin Exp Metastasis, 20, 413 (2003).
17. D Raje, H Mukhtar et al., Dis Colon Rectum, 50, 1 (2007).
18. T Shimizu, S Tanaka et al., Oncology, 59, 229 (2000).
19. T Sasahira, T Sasaki and H Kuniyasu, J Exp Clin Cancer Res, 24(1), 69
20. T-D Kim, K-S Song et al., BMC Cancer, 6, 211 (2006).
21. K Zins, D Abraham et al., Cancer Res, 67(3), 1038 (2007).
22. S O Lee, J Y Chun et al., The Prostate, 67, 764 (2007).
23. A L Hallbeck, T M Walz and A Wasteson, Bioscience Reports, 21(3), 325
24. O Gevaert, F De Smet et al., Bioinformatics, 22(14), e184 (2006).