Recently, Raman Spectroscopy (RS) was demonstrated to be a non-destructive way of cancer diagnosis, due to the uniqueness of RS measurements in revealing molecular biochemical changes between ...cancerous vs. normal tissues and cells. In order to design computational approaches for cancer detection, the quality and quantity of tissue samples for RS are important for accurate prediction. In reality, however, obtaining skin cancer samples is difficult and expensive due to privacy and other constraints. With a small number of samples, the training of the classifier is difficult, and often results in overfitting. Therefore, it is important to have more samples to better train classifiers for accurate cancer tissue classification. To overcome these limitations, this paper presents a novel generative adversarial network based skin cancer tissue classification framework. Specifically, we design a data augmentation module that employs a Generative Adversarial Network (GAN) to generate synthetic RS data resembling the training data classes. The original tissue samples and the generated data are concatenated to train classification modules. Experiments on real-world RS data demonstrate that (1) data augmentation can help improve skin cancer tissue classification accuracy, and (2) generative adversarial network can be used to generate reliable synthetic Raman spectroscopic data.
Multi-instance learning (MIL) is a useful tool for tackling labeling ambiguity in learning because it allows a bag of instances to share one label. Bag mapping transforms a bag into a single instance ...in a new space via instance selection and has drawn significant attention recently. To date, most existing work is based on the original space, using all instances inside each bag for bag mapping, and the selected instances are not directly tied to an MIL objective. As a result, it is difficult to guarantee the distinguishing capacity of the selected instances in the new bag mapping space. In this paper, we propose a discriminative mapping approach for multi-instance learning (MILDM) that aims to identify the best instances to directly distinguish bags in the new mapping space. Accordingly, each instance bag can be mapped using the selected instances to a new feature space, and hence any generic learning algorithm, such as an instance-based learning algorithm, can be used to derive learning models for multi-instance classification. Experiments and comparisons on eight different types of real-world learning tasks (including 14 data sets) demonstrate that MILDM outperforms the state-of-the-art bag mapping multi-instance learning approaches. Results also confirm that MILDM achieves balanced performance between runtime efficiency and classification effectiveness.
Graph learning, such as node classification, is typically carried out in a
closed-world
setting. A number of nodes are labeled, and the learning goal is to correctly classify remaining (unlabeled) ...nodes into classes, represented by the labeled nodes. In reality, due to limited labeling capability or dynamic evolving nature of networks, some nodes in the networks may not belong to any existing/seen classes and therefore cannot be correctly classified by closed-world learning algorithms. In this paper, we propose a new
open-world
graph learning paradigm, where the learning goal is to correctly classify nodes belonging to labeled classes into correct categories and also classify nodes not belonging to labeled classes to an unseen class. Open-world graph learning has three major challenges: (1) Graphs do not have features to represent nodes for learning; (2) unseen class nodes do not have labels and may exist in an arbitrary form different from labeled classes; and (3) graph learning should differentiate whether a node belongs to an existing/seen class or an unseen class. To tackle the challenges, we propose an uncertain node representation learning principle to use multiple versions of node feature representation to test a classifier’s response on a node, through which we can differentiate whether a node belongs to the unseen class. Technical wise, we propose constrained variational graph autoencoder, using label loss and class uncertainty loss constraints, to ensure that node representation learning is sensitive to the unseen class. As a result, node embedding features are denoted by distributions, instead of deterministic feature vectors. In order to test the certainty of a node belonging to seen classes, a sampling process is proposed to generate multiple versions of feature vectors to represent each node, using automatic thresholding to reject nodes not belonging to seen classes as unseen class nodes. Experiments, using graph convolutional networks and graph attention networks on four real-world networks, demonstrate the algorithm performance. Case studies and ablation analysis also show the advantage of the uncertain representation learning and automatic threshold selection for open-world graph learning.
In this paper, we formulate a novel graph-based learning problem, multi-graph classification (MGC), which aims to learn a classifier from a set of labeled bags each containing a number of graphs ...inside the bag. A bag is labeled positive, if at least one graph in the bag is positive, and negative otherwise. Such a multi-graph representation can be used for many real-world applications, such as webpage classification, where a webpage can be regarded as a bag with texts and images inside the webpage being represented as graphs. This problem is a generalization of multi-instance learning (MIL) but with vital differences, mainly because instances in MIL share a common feature space whereas no feature is available to represent graphs in a multi-graph bag. To solve the problem, we propose a boosting based multi-graph classification framework (bMGC). Given a set of labeled multi-graph bags, bMGC employs dynamic weight adjustment at both bag- and graph-levels to select one subgraph in each iteration as a weak classifier. In each iteration, bag and graph weights are adjusted such that an incorrectly classified bag will receive a higher weight because its predicted bag label conflicts to the genuine label, whereas an incorrectly classified graph will receive a lower weight value if the graph is in a positive bag (or a higher weight if the graph is in a negative bag). Accordingly, bMGC is able to differentiate graphs in positive and negative bags to derive effective classifiers to form a boosting model for MGC. Experiments and comparisons on real-world multi-graph learning tasks demonstrate the algorithm performance.
Graph classification has drawn great interests in recent years due to the increasing number of applications involving objects with complex structure relationships. To date, all existing graph ...classification algorithms assume, explicitly or implicitly, that misclassifying instances in different classes incurs an equal amount of cost (or risk), which is often not the case in real-life applications (where misclassifying a certain class of samples, such as diseased patients, is subject to more expensive costs than others). Although cost-sensitive learning has been extensively studied, all methods are based on data with instance-feature representation. Graphs, however, do not have features available for learning and the feature space of graph data is likely infinite and needs to be carefully explored in order to favor classes with a higher cost. In this paper, we propose, CogBoost, a fast cost-sensitive graph classification algorithm, which aims to minimize the misclassification costs (instead of the errors) and achieve fast learning speed for large scale graph data sets. To minimize the misclassification costs, CogBoost iteratively selects the most discriminative subgraph by considering costs of different classes, and then solves a linear programming problem in each iteration by using Bayes decision rule based optimal loss function. In addition, a cutting plane algorithm is derived to speed up the solving of linear programs for fast learning on large scale data sets. Experiments and comparisons on real-world large graph data sets demonstrate the effectiveness and the efficiency of our algorithm.
Labeling malware or malware clustering is important for identifying new security threats, triaging and building reference datasets. The state-of-the-art Android malware clustering approaches rely ...heavily on the raw labels from commercial AntiVirus (AV) vendors, which causes misclustering for a substantial number of weakly-labeled malware due to the inconsistent, incomplete and overly generic labels reported by these closed-source AV engines, whose capabilities vary greatly and whose internal mechanisms are opaque (i.e., intermediate detection results are unavailable for clustering). The raw labels are thus often used as the only important source of information for clustering. To address the limitations of the existing approaches, this paper presents Andre, a new ANDroid Hybrid REpresentation Learning approach to clustering weakly-labeled Android malware by preserving heterogeneous information from multiple sources (including the results of static code analysis, the meta-information of an app, and the raw-labels of the AV vendors) to jointly learn a hybrid representation for accurate clustering. The learned representation is then fed into our outlier-aware clustering to partition the weakly-labeled malware into known and unknown families. The malware whose malicious behaviours are close to those of the existing families on the network, are further classified using a three-layer Deep Neural Network (DNN). The unknown malware are clustered using a standard density-based clustering algorithm. We have evaluated our approach using 5,416 ground-truth malware from Drebin and 9,000 malware from VirusShare (uploaded between Mar. 2017 and Feb. 2018), consisting of 3324 weakly-labeled malware. The evaluation shows that Andre effectively clusters weakly-labeled malware which cannot be clustered by the state-of-the-art approaches, while achieving comparable accuracy with those approaches for clustering ground-truth samples.
Multitask learning (MTL) is commonly used for jointly optimizing multiple learning tasks. To date, all existing MTL methods have been designed for tasks with feature-vector represented instances, but ...cannot be applied to structure data, such as graphs. More importantly, when carrying out MTL, existing methods mainly focus on exploring overall commonality or disparity between tasks for learning, but cannot explicitly capture task relationships in the feature space, so they are unable to answer important questions, such as what exactly is shared between tasks and what is the uniqueness of one task differing from others? In this paper, we formulate a new multitask graph learning problem, and propose a task sensitive feature exploration and learning algorithm for multitask graph classification. Because graphs do not have features available, we advocate a task sensitive feature exploration and learning paradigm to jointly discover discriminative subgraph features across different tasks. In addition, a feature learning process is carried out to categorize each subgraph feature into one of three categories: (1) common feature; (2) task auxiliary feature; and (3) task specific feature, indicating whether the feature is shared by all tasks, by a subset of tasks, or by only one specific task, respectively. The feature learning and the multiple task learning are iteratively optimized to form a multitask graph classification model with a global optimization goal. Experiments on real-world functional brain analysis and chemical compound categorization demonstrate the algorithm's performance. Results confirm that our method can be used to explicitly capture task correlations and uniqueness in the feature space, and explicitly answer what are shared between tasks and what is the uniqueness of a specific task.