ENBIS-8 in Athens
21 – 25 September 2008
Abstract submission: 14 March – 11 August 2008
Multivariate Class Prediction with Gene Expression Data
23 September 2008, 15:20 – 15:40
- Submitted by
- Marco P. Seabra dos Reis
- Marco S. Reis
- University of Coimbra
Gene expression profiling has been widely used to perform genome-wide studies with several purposes, such as studying the molecular mechanisms of diseases and cell biology, finding biomarkers for certain organism malfunctions, and classifying certain traits on the basis of gene expression patterns or discovering new ones.
Gene expression data are acquired through DNA microarray technology, where genomic DNA sequences from genes immobilized on a solid matrix (probes) are hybridized with labelled mRNA representative of different cell states (targets). The magnitude of the signal intensity at each probe location is then interpreted as a measure of the expression level of that particular gene, in the state corresponding to the label being analyzed.
Microarray data have been widely analyzed through univariate techniques. This class of techniques tries to identify the genes that best differentiate between the states under analysis (usually two), through F and t statistics, or through other univariate methodologies, such as the “signal to noise ratio” (Golub et al., 1999) and the SAM (“Significance Analysis of Microarrays”; Tusher et al., 2001) methods. The simplicity underlying these methodologies enables them to adequately control classification error rates, such as the False Positive Rate (FPR), the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). However, they tend to disregard the cooperative behaviour of gene expression, i.e., the combined activity of genes under certain cell conditions. This turns out to be a significant drawback of the univariate methodologies, as it is well known that gene activity is rarely an isolated result of the action of a single gene, but rather a consequence of a cascade of events in which several gene clusters participate.
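As an illustration of the univariate ranking described above, the following minimal sketch computes a signal-to-noise score of the form (mean_a - mean_b) / (sd_a + sd_b) for each gene and ranks genes by its absolute value. The function names, the toy data and the use of the sample standard deviation are assumptions made for this example; they are not taken from the cited papers' software.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Score (mean_a - mean_b) / (sd_a + sd_b) for one gene.

    Uses the sample standard deviation; this choice is an assumption.
    """
    spread = stdev(class_a) + stdev(class_b)
    if spread == 0:
        raise ValueError("both classes have zero spread; score is undefined")
    return (mean(class_a) - mean(class_b)) / spread


def rank_genes(expression, labels, class_a, class_b):
    """Rank gene names by absolute signal-to-noise score, largest first.

    expression: dict mapping gene name -> list of values, one per sample
    labels: list of class labels aligned with the sample order
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked, scores


# Toy values invented for illustration (not from the Golub data set).
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # clearly separates the classes
    "gene_2": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],  # same mean in both classes
}
ranked, scores = rank_genes(expression, labels, "ALL", "AML")
```

Each gene is scored independently of all the others, which is why a ranking of this kind cannot capture the combined activity of groups of genes.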
In this context, multivariate approaches offer more flexibility for describing gene co-expression patterns, but they also present some methodological limitations. For instance, Fisher Discriminant Analysis (FDA) requires the number of variables (genes, in microarray data) to be less than the number of observations, a condition not met in practice. Therefore, such multivariate techniques require a preliminary stage of variable selection, usually based on univariate approaches, where data dimensionality is reduced until the necessary condition for applying multivariate methods is met. On the other hand, not all genes are expected to participate in each physiological response, but only clusters of functionally related genes, and therefore the methods should be able to identify such clusters of genes.
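The dimensionality restriction on FDA can be made explicit with a short rank argument. This is a standard textbook sketch, not part of the original abstract; the symbols $n$ (observations), $p$ (genes), $K$ (classes), $C_k$ (index set of class $k$) and $\bar{x}_k$ (class mean) are notation introduced here.

```latex
% Fisher criterion: find the direction w that maximizes between-class
% scatter relative to within-class scatter.
J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_W = \sum_{k=1}^{K} \sum_{i \in C_k}
      (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top} \in \mathbb{R}^{p \times p}.

% S_W is built from n centred observations subject to K linear
% constraints (one per class mean), so
\operatorname{rank}(S_W) \le n - K .

% Hence, whenever p > n - K, S_W is singular and the usual solution
% through the eigenvectors of S_W^{-1} S_B is not available.
```

With thousands of genes and only tens of samples, $S_W$ is therefore far from invertible unless the number of variables is reduced first or the data are projected onto a low-dimensional subspace.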
In this work, an intrinsically multivariate approach is presented in which the preliminary variable reduction stage is not required, although it can still be carried out after a first run of the proposed methodology, on the basis of the multivariate information generated in that first run. The approach combines PLS-DA (“Partial Least Squares Discriminant Analysis”) and FDA, and incorporates a “non-classification” analysis that enables the uncertainty of each class prediction to be assessed, according to two measures of the distance between the expression profile under analysis and the entities in the training data set. We also propose a gene VIP (variable importance in projection) metric for the combined PLS-DA/FDA methodology, in order to identify the key genes that segregate the different classes.
The approach is illustrated using a well-known data set (Golub et al., 1999), where different expression phenotypes were measured in samples from patients with different types of leukaemia: acute lymphoblastic leukaemia (ALL), subdivided according to lineage (ALL-B and ALL-T), and acute myeloid leukaemia (AML).
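For background, the conventional VIP score used with ordinary PLS models is sketched below. This is the standard definition and is given only for orientation; it is not the combined PLS-DA/FDA gene VIP metric proposed in this work, whose form is not stated in the abstract. The symbols $p$ (number of genes), $A$ (number of PLS components), $w_a$ (weight vector of component $a$, with entries $w_{ja}$) and $\mathrm{SSY}_a$ (sum of squares of the response explained by component $a$) are notation introduced here.

```latex
\mathrm{VIP}_j =
\sqrt{\, p \;
  \frac{\displaystyle\sum_{a=1}^{A} \mathrm{SSY}_a
        \left( \frac{w_{ja}}{\lVert w_a \rVert} \right)^{2}}
       {\displaystyle\sum_{a=1}^{A} \mathrm{SSY}_a} \,},
\qquad j = 1, \dots, p .

% Because each normalized weight vector has unit length,
% the squared scores average to one:
\frac{1}{p} \sum_{j=1}^{p} \mathrm{VIP}_j^{2} = 1 .
```

Under this conventional definition, variables with a VIP score above 1 are often regarded as more influential than average, which gives a common rule of thumb for screening.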
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Tusher, V.G., Tibshirani, R., Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116-5121.