This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples using two heterogeneous data sets in an efficient way. The algorithm is able to capture the distinct classification power of two given data types without actually combining them. The package is suited to classification problems in which two different types of data are available on the same samples: one type has measurements on all samples, and the other has measurements on only some of them. The former is easy to collect and/or relatively cheap (e.g., clinical covariates) compared to the latter (high-dimensional data, e.g., gene expression). An additional application for which stepwiseCM has also proven useful is the combination of two high-dimensional data types, e.g., DNA copy number and mRNA expression. The package includes functions that project the neighborhood information in one data space onto the other in order to identify the group of samples that are likely to benefit most from measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and existing packages is that our approach aims to be cost-efficient by avoiding the measurement of additional covariates, which might be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, in diagnosis, test results for these individuals would be quickly available, which may reduce waiting times and hence lower the patients’ distress. The improvement described remedies the key limitations of existing packages and facilitates the use of the stepwiseCM package in diverse applications.
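The stepwise idea can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the stepwiseCM API. It also swaps in a simpler technique: a nearest-centroid classifier with a distance-margin threshold decides whether the cheap data suffice, whereas the package projects neighborhood information between the two data spaces. All names (`fit_centroids`, `predict_with_margin`, `stepwise_classify`, `measure_costly`, `threshold`) and the toy data are hypothetical.

```python
from math import dist

def fit_centroids(train_x, train_y):
    """Compute one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_with_margin(centroids, x):
    """Return the nearest class and a margin in [0, 1]; needs >= 2 classes."""
    ranked = sorted((dist(x, c), label) for label, c in centroids.items())
    (d1, label), (d2, _) = ranked[0], ranked[1]
    margin = (d2 - d1) / (d1 + d2) if (d1 + d2) > 0 else 0.0
    return label, margin

def stepwise_classify(cheap_model, costly_model, cheap_x, measure_costly,
                      threshold=0.2):
    """Use the cheap data first; measure the costly data only when unsure.

    Assumption: a simple margin threshold stands in for the package's
    projection-based selection of samples that benefit from the second
    data type.
    """
    label, margin = predict_with_margin(cheap_model, cheap_x)
    if margin >= threshold:
        return label, "cheap"
    costly_x = measure_costly()  # called only for uncertain samples
    label, _ = predict_with_margin(costly_model, costly_x)
    return label, "costly"

# Toy data: one cheap covariate and two costly covariates per sample.
labels = ["A", "A", "B", "B"]
cheap_model = fit_centroids([[1.0], [1.2], [3.0], [3.2]], labels)
costly_model = fit_centroids(
    [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]], labels)

# A clear-cut sample is settled by the cheap covariate alone.
clear = stepwise_classify(cheap_model, costly_model, [1.0],
                          lambda: [0.1, 0.1])
# An ambiguous sample triggers the costly measurement.
ambiguous = stepwise_classify(cheap_model, costly_model, [2.0],
                              lambda: [4.9, 5.1])
print(clear, ambiguous)  # ('A', 'cheap') ('B', 'costly')
```

In this sketch the costly measurement is passed as a callable so that it is obtained only for samples the cheap classifier cannot settle, which mirrors the cost-saving motivation described above.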
Background
Colorectal cancer develops in a multi-step manner from normal epithelium, through a pre-malignant lesion (so-called adenoma), into a malignant lesion (carcinoma), which invades surrounding tissues and eventually can spread systemically (metastasis). It is estimated that only about 5% of adenomas progress to a carcinoma.
Aim
The present study aimed to unravel the biology of adenoma to carcinoma progression by mRNA expression profiling, and to identify candidate biomarkers for adenomas that are truly at high risk of progression.
Methods
Genome-wide mRNA expression profiles were obtained from a series of 37 colorectal adenomas and 31 colorectal carcinomas using oligonucleotide microarrays. Differentially expressed genes were validated in an independent colorectal gene expression data set. Gene Set Enrichment Analysis (GSEA) was used to identify altered expression of sets of genes associated with specific biological processes, in order to better understand the biology of colorectal adenoma to carcinoma progression.
Results
mRNA expression of 248 genes was significantly different, of which 96 were upregulated and 152 downregulated in carcinomas compared to adenomas. Classification of adenomas and carcinomas using the expression of these genes proved to be very accurate, also when tested in an independent expression data set. Gene-sets associated with ageing (which is related to senescence) and chromosomal instability were upregulated, and a gene-set associated with fatty acid metabolism was downregulated in carcinomas compared to adenomas. Moreover, gene-sets associated with chromosomal location revealed chromosome 4q22 loss and chromosome 20q gain of gene-set expression as being relevant in this progression.
Concluding remark
These data are consistent with the notion that adenomas and carcinomas are distinct biological entities. Disruption of specific biological processes such as senescence (ageing), maintenance of chromosomal instability and altered metabolism is a key factor in the progression from adenoma to carcinoma.