Identifying molecular cancer subtypes from multi-omics data is an important step in the personalized medicine. We introduce CancerSubtypes, an R package for identifying cancer subtypes using ...multi-omics data, including gene expression, miRNA expression and DNA methylation data. CancerSubtypes integrates four main computational methods which are highly cited for cancer subtype identification and provides a standardized framework for data pre-processing, feature selection, and result follow-up analyses, including results computing, biology validation and visualization. The input and output of each step in the framework are packaged in the same data format, making it convenience to compare different methods. The package is useful for inferring cancer subtypes from an input genomic dataset, comparing the predictions from different well-known methods and testing new subtype discovery methods, as shown with different application scenarios in the Supplementary Material.
The package is implemented in R and available under GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/CancerSubtypes/).
thuc.le@unisa.edu.au or jiuyong.li@unisa.edu.au.
Supplementary data are available at Bioinformatics online.
In this article, we aim to develop a unified view of causal and non-causal feature selection methods. The unified view will fill in the gap in the research of the relation between the two types of ...methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective. That is to find the Markov blanket of a class attribute, the theoretically optimal feature set for classification. We then examine the assumptions made by causal and non-causal feature selection methods when searching for the optimal feature set, and unify the assumptions by mapping them to the restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximations employed by the methods in their search, which then result in the approximations in the feature sets found by the methods with respect to the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive the error bounds of both types of methods. Finally, we present practical understanding of the relation between causal and non-causal methods using extensive experiments with synthetic data and various types of real-world data.
On optimal rule discovery Li, Jiuyong
IEEE transactions on knowledge and data engineering,
04/2006, Letnik:
18, Številka:
4
Journal Article
Recenzirano
In machine learning and data mining, heuristic and association rules are two dominant schemes for rule discovery. Heuristic rule discovery usually produces a small set of accurate rules, but fails to ...find many globally optimal rules. Association rule discovery generates all rules satisfying some constraints, but yields too many rules and is infeasible when the minimum support is small. Here, we present a unified framework for the discovery of a family of optimal rule sets and characterize the relationships with other rule-discovery schemes such as nonredundant association rule discovery. We theoretically and empirically show that optimal rule discovery is significantly more efficient than association rule discovery independent of data structure and implementation. Optimal rule discovery is an efficient alternative to association rule discovery, especially when the minimum support is low.
Discover Dependencies from Data-A Review Liu, Jixue; Li, Jiuyong; Liu, Chengfei ...
IEEE transactions on knowledge and data engineering,
02/2012, Letnik:
24, Številka:
2
Journal Article
Recenzirano
Functional and inclusion dependency discovery is important to knowledge discovery, database semantics analysis, database design, and data quality assessment. Motivated by the importance of dependency ...discovery, this paper reviews the methods for functional dependency, conditional functional dependency, approximate functional dependency, and inclusion dependency discovery in relational databases and a method for discovering XML functional dependencies.
Symbolic Aggregate approXimation (SAX) as a major symbolic representation has been widely used in many time series data mining applications. However, because a symbol is mapped from the average value ...of a segment, the SAX ignores important information in a segment, namely the trend of the value change in the segment. Such a miss may cause a wrong classification in some cases, since the SAX representation cannot distinguish different time series with similar average values but different trends. In this paper, we firstly design a measure to compute the distance of trends using the starting and the ending points of segments. Then we propose a modified distance measure by integrating the SAX distance with a weighted trend distance. We show that our distance measure has a tighter lower bound to the Euclidean distance than that of the original SAX. The experimental results on diverse time series data sets demonstrate that our proposed representation significantly outperforms the original SAX representation and an improved SAX representation for classification.
•We define an intuitive trend distance measure on time series segments.•We integrate the SAX distance with our weighted trend distance.•The proposed distance has a tighter lower bound to Euclidean distance than SAX.•Our method outperforms the original SAX on classification significantly.
Causal Decision Trees Li, Jiuyong; Ma, Saisai; Le, Thuc ...
IEEE transactions on knowledge and data engineering,
02/2017, Letnik:
29, Številka:
2
Journal Article
Recenzirano
Odprti dostop
Uncovering causal relationships in data is a major objective of data analytics. Currently, there is a need for scalable and automated methods for causal relationship exploration in data. ...Classification methods are fast and they could be practical substitutes for finding causal signals in data. However, classification methods are not designed for causal discovery and a classification method may find false causal signals and miss the true ones. In this paper, we develop a causal decision tree (CDT) where nodes have causal interpretations. Our method follows a well-established causal inference framework and makes use of a classic statistical test to establish the causal relationship between a predictor variable and the outcome variable. At the same time, by taking the advantages of normal decision trees, a CDT provides a compact graphical representation of the causal relationships, and the construction of a CDT is fast as a result of the divide and conquer strategy employed, making CDTs practical for representing and finding causal signals in large data sets. Experiment results demonstrate that CDTs can identify meaningful causal relationships and the CDT algorithm is scalable.
Local Search for Efficient Causal Effect Estimation Cheng, Debo; Li, Jiuyong; Liu, Lin ...
IEEE transactions on knowledge and data engineering,
2023-Sept.-1, 2023-9-1, Letnik:
35, Številka:
9
Journal Article
Recenzirano
Odprti dostop
Causal effect estimation from observational data is a challenging problem, especially with high dimensional data and in the presence of unobserved variables. The available data-driven methods for ...tackling the problem either provide an estimation of the bounds of a causal effect (i.e., nonunique estimation) or have low efficiency. The major hurdle for achieving high efficiency while trying to obtain unique and unbiased causal effect estimation is how to find a proper adjustment set for confounding control in a fast way, given the huge covariate space and considering unobserved variables. In this paper, we approach the problem as a local search task for finding valid adjustment sets in data. We establish the theorems to support the local search for adjustment sets, and we show that unique and unbiased estimation can be achieved from observational data even when there exist unobserved variables. We then propose a data-driven algorithm that is fast and consistent under mild assumptions. We also make use of a frequent pattern mining method to further speed up the search of minimal adjustment sets for causal effect estimation. Experiments conducted on extensive synthetic and real-world datasets demonstrate that the proposed algorithm outperforms the state-of-the-art criteria/estimators in both accuracy and time-efficiency.
Smoke plumes are the first things seen from space when wildfires occur. Thus, fire smoke detection is important for early fire detection. Deep Learning (DL) models have been used to detect fire smoke ...in satellite imagery for fire detection. However, previous DL-based research only considered lower spatial resolution sensors (e.g., Moderate-Resolution Imaging Spectroradiometer (MODIS)) and only used the visible (i.e., red, green, blue (RGB)) bands. To contribute towards solutions for early fire smoke detection, we constructed a six-band imagery dataset from Landsat 5 Thematic Mapper (TM) and Landsat 8 Operational Land Imager (OLI) with a 30-metre spatial resolution. The dataset consists of 1836 images in three classes, namely “Smoke”, “Clear”, and “Other_aerosol”. To prepare for potential on-board-of-small-satellite detection, we designed a lightweight Convolutional Neural Network (CNN) model named “Variant Input Bands for Smoke Detection (VIB_SD)”, which achieved competitive accuracy with the state-of-the-art model SAFA, with less than 2% of its number of parameters. We further investigated the impact of using additional Infra-Red (IR) bands on the accuracy of fire smoke detection with VIB_SD by training it with five different band combinations. The results demonstrated that adding the Near-Infra-Red (NIR) band improved prediction accuracy compared with only using the visible bands. Adding both Short-Wave Infra-Red (SWIR) bands can further improve the model performance compared with adding only one SWIR band. The case study showed that the model trained with multispectral bands could effectively detect fire smoke mixed with cloud over small geographic extents.
Cancer is not a single disease and involves different subtypes characterized by different sets of molecules. Patients with different subtypes of cancer often react heterogeneously towards the same ...treatment. Currently, clinical diagnoses rather than molecular profiles are used to determine the most suitable treatment. A molecular level approach will allow a more precise and informed way for making treatment decisions, leading to a better survival chance and less suffering of patients. Although many computational methods have been proposed to identify cancer subtypes at molecular level, to the best of our knowledge none of them are designed to discover subtypes with heterogeneous treatment responses.
In this article we propose the Survival Causal Tree (SCT) method. SCT is designed to discover patient subgroups with heterogeneous treatment effects from censored observational data. Results on TCGA breast invasive carcinoma and glioma datasets have shown that for each subtype identified by SCT, the patients treated with radiotherapy exhibit significantly different relapse free survival pattern when compared to patients without the treatment. With the capability to identify cancer subtypes with heterogeneous treatment responses, SCT is useful in helping to choose the most suitable treatment for individual patients.
Data and code are available at https://github.com/WeijiaZhang24/SurvivalCausalTree .
weijia.zhang@mymail.uinsa.edu.au.
Supplementary data are available at Bioinformatics online.
Identifying cancer subtypes is an important component of the personalised medicine framework. An increasing number of computational methods have been developed to identify cancer subtypes. However, ...existing methods rarely use information from gene regulatory networks to facilitate the subtype identification. It is widely accepted that gene regulatory networks play crucial roles in understanding the mechanisms of diseases. Different cancer subtypes are likely caused by different regulatory mechanisms. Therefore, there are great opportunities for developing methods that can utilise network information in identifying cancer subtypes.
In this paper, we propose a method, weighted similarity network fusion (WSNF), to utilise the information in the complex miRNA-TF-mRNA regulatory network in identifying cancer subtypes. We firstly build the regulatory network where the nodes represent the features, i.e. the microRNAs (miRNAs), transcription factors (TFs) and messenger RNAs (mRNAs) and the edges indicate the interactions between the features. The interactions are retrieved from various interatomic databases. We then use the network information and the expression data of the miRNAs, TFs and mRNAs to calculate the weight of the features, representing the level of importance of the features. The feature weight is then integrated into a network fusion approach to cluster the samples (patients) and thus to identify cancer subtypes. We applied our method to the TCGA breast invasive carcinoma (BRCA) and glioblastoma multiforme (GBM) datasets. The experimental results show that WSNF performs better than the other commonly used computational methods, and the information from miRNA-TF-mRNA regulatory network contributes to the performance improvement. The WSNF method successfully identified five breast cancer subtypes and three GBM subtypes which show significantly different survival patterns. We observed that the expression patterns of the features in some miRNA-TF-mRNA sub-networks vary across different identified subtypes. In addition, pathway enrichment analyses show that the top pathways involving the most differentially expressed genes in each of the identified subtypes are different. The results would provide valuable information for understanding the mechanisms characterising different cancer subtypes and assist the design of treatment therapies. All datasets and the R scripts to reproduce the results are available online at the website: http://nugget.unisa.edu.au/Thuc/cancersubtypes/.