We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and are widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition, the complicated relations among the different genes make analysis more difficult, and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each of them, saving computational time and resources.
Overfitting has been widely studied in the context of classification and regression. In this paper, we study overfitting in the context of dimensionality reduction. We show that the conventional wisdom of improving classification performance by maximising inter-class discrimination is not valid for high-dimensional datasets, and can lead to severe overfitting. In particular, we prove the theoretical existence of perfectly discriminative subspace projections, and show that for datasets with very high input dimensionality, inter-class discrimination should be reduced rather than maximised. This naturally leads to a simple dimensionality reduction technique, which we call Soft Discriminant Maps, which we use to show a direct relationship between the classification performance and the level of inter-class discrimination of feature extractors. Moreover, Soft Discriminant Maps consistently exhibit better classification performance than other comparable techniques.
• The causes of over-fitting in feature extraction for high-dimensional datasets are revealed.
• We prove the theoretical existence of perfectly discriminative subspace projections.
• A direct, inverse relationship between classification performance and the level of inter-class discrimination is shown.
• Soft Discriminant Maps consistently perform better than other comparable techniques.
Microarray databases are a large source of genetic data, which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied in order to classify different types of cancer or distinguish between cancerous and non-cancerous tissue. However, microarrays are high-dimensional datasets with high levels of noise, and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA), which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher-dimensional space onto a lower-dimensional one. We have proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database. Once the manifold has been constructed, the raw microarray data is projected onto it and clustering and classification can take place. In contrast to earlier fusion-based methods, the prior knowledge from the KEGG database is not used in, and does not bias, the classification process; it merely acts as an aid to find the best space in which to search the data. In our experiments we have found that using our new manifold method gives better classification results than using either PCA or conventional Isomap.
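The baseline pipeline the abstract compares against can be sketched with standard tools: reduce dimensionality with PCA or Isomap, then classify in the reduced space. This is a minimal illustration using synthetic data in place of real microarray measurements (the authors' a priori manifold method itself is not reproduced here); a rigorous comparison would fit the reducer inside each cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for microarray data: few samples, many features
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

for reducer in (PCA(n_components=10),
                Isomap(n_neighbors=10, n_components=10)):
    Z = reducer.fit_transform(X)          # project onto 10 dimensions
    score = cross_val_score(KNeighborsClassifier(), Z, y, cv=5).mean()
    print(type(reducer).__name__, round(score, 3))
```

Isomap builds a neighbourhood graph and preserves geodesic distances, so unlike PCA it can capture a curved (non-linear) low-dimensional structure.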
We introduce a new estimate of mutual information between a dataset and a target variable that can be maximised analytically and has broad applicability in the field of machine learning and statistical pattern recognition. This estimate has previously been employed implicitly as an approximation to quadratic mutual information. In this paper we will study the properties of these estimates of mutual information in more detail, and provide a derivation from a perspective of pairwise interactions. From this perspective, we will show a connection between our proposed estimate and Laplacian eigenmaps, which so far has not been shown to be related to mutual information. Compared with other popular measures of mutual information, which can only be maximised through an iterative process, ours can be maximised much more efficiently and reliably via closed-form eigendecomposition.
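The closed-form maximisation the abstract contrasts with iterative optimisation rests on a standard fact: a quadratic objective w^T A w under a unit-norm constraint is maximised exactly by the leading eigenvector of A. A minimal numerical illustration (generic, not the paper's specific estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M + M.T  # any symmetric matrix

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(A)
w = eigvecs[:, -1]  # eigenvector of the largest eigenvalue

# The closed-form solution attains the largest possible quadratic form
print(w @ A @ w, eigvals[-1])

# No random unit direction does better
for _ in range(1000):
    v = rng.standard_normal(6)
    v /= np.linalg.norm(v)
    assert v @ A @ v <= eigvals[-1] + 1e-9
```

This is why an objective expressible as such a quadratic form can be maximised in one eigendecomposition rather than by gradient iterations.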
In order to provide the most effective therapy for cancer, it is important to be able to diagnose whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that are acting together within sets of cancer data and associate their behaviors with cancer progression. These methods have the advantage of being multivariate and unbiased but unfortunately also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets in such a way that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train and test its classification accuracy quickly and determine whether it has valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
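The pipeline above can be sketched as follows: each pathway subset of probes is classified independently with AdaBoost over shallow decision trees, and subsets scoring well become hypothesis candidates. Data and the pathway-to-probe mapping here are synthetic placeholders (a real mapping would come from KEGG).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_probes = 80, 300
X = rng.standard_normal((n_samples, n_probes))   # toy methylation values
y = rng.integers(0, 2, n_samples)                # toy progression labels
X[y == 1, :5] += 1.0                             # make a few probes informative

# Hypothetical pathway -> probe-index mapping (would come from KEGG)
pathways = {"pathway_A": range(0, 20), "pathway_B": range(100, 130)}

scores = {}
for name, idx in pathways.items():
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=50, random_state=0)
    scores[name] = cross_val_score(clf, X[:, list(idx)], y, cv=5).mean()
print(scores)  # well-scoring pathways are candidates for hypothesis testing
```

Because each subset is small, each fit is fast, which is the point of the segmentation: many quick independent tests instead of one intractable search over all probes.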
In many biometric pattern-recognition problems, the number of training examples per class is limited, and consequently the sample group covariance matrices often used in parametric and nonparametric Bayesian classifiers are poorly estimated or singular. Thus, a considerable amount of effort has been devoted to the design of other covariance estimators, for use in limited-sample and high-dimensional classification problems. In this paper, a new covariance estimate, called the maximum entropy covariance selection (MECS) method, is proposed. It is based on combining covariance matrices under the principle of maximum uncertainty. In order to evaluate the MECS effectiveness in biometric problems, experiments on face, facial expression, and fingerprint classification were carried out and compared with popular covariance estimates, including the regularized discriminant analysis and leave-one-out covariance for the parametric classifier, and the Van Ness and Toeplitz covariance estimates for the nonparametric classifier. The results show that, in image recognition applications whenever the sample group covariance matrices are poorly estimated or ill posed, the MECS method is faster and usually more accurate than the aforementioned approaches in both parametric and nonparametric Bayesian classifiers.
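The underlying problem can be made concrete: with fewer samples than dimensions, the sample covariance is rank-deficient and thus singular, and blending it with a second estimate restores invertibility. The sketch below shows a simple convex combination in the style of regularized discriminant analysis (one of the baselines named above, not MECS itself, whose combination rule is the paper's contribution); the identity matrix stands in for a pooled estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20                     # more dimensions than samples per class
X = rng.standard_normal((n, p))

S_group = np.cov(X, rowvar=False)            # sample group covariance
print(np.linalg.matrix_rank(S_group))        # at most n-1 = 19, so singular

S_pooled = np.eye(p)                         # stand-in pooled/prior estimate
alpha = 0.5
S_mix = alpha * S_group + (1 - alpha) * S_pooled
print(np.linalg.matrix_rank(S_mix))          # full rank: usable in a classifier
```

Any Bayesian classifier needing the inverse covariance (e.g. a Gaussian quadratic discriminant) fails on S_group but works with S_mix; the estimators compared in the paper differ in how this combination is chosen.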
The United States has achieved unprecedented survival rates, as high as 98%, for casualties arriving alive at the combat hospital. Our military medical personnel are rightly proud of this achievement. Commanders and Servicemembers are confident that if wounded and moved to a Role II or III medical facility, their care will be the best in the world. Combat casualty care, however, begins at the point of injury and continues through evacuation to those facilities. With up to 25% of deaths on the battlefield being potentially preventable, the prehospital environment is the next frontier for making significant further improvements in battlefield trauma care. Strict adherence to the evidence-based Tactical Combat Casualty Care (TCCC) Guidelines has been proven to reduce morbidity and mortality on the battlefield. However, full implementation across the entire force and commitment from both line and medical leadership continue to face ongoing challenges. This report on prehospital trauma in the Combined Joint Operations Area-Afghanistan (CJOA-A) is a follow-on to the one previously conducted in November 2012 and published in January 2013. Both assessments were conducted by the US Central Command (USCENTCOM) Joint Theater Trauma System (JTTS). Observations for this report were collected from December 2013 to January 2014 and were obtained directly from deployed prehospital providers, medical leaders, and combatant leaders. Significant progress has been made between these two reports with the establishment of a Prehospital Care Division within the JTTS, development of a prehospital trauma registry and weekly prehospital trauma conferences, and CJOA-A theater guidance and enforcement of prehospital documentation. Specific prehospital trauma-care achievements include expansion of transfusion capabilities forward to the point of injury, junctional tourniquets, and universal approval of tranexamic acid.
Intensive international efforts are underway toward phenotyping the entire mouse genome by modifying all its ≈25,000 genes one-by-one for comparative studies. A workload of this scale has triggered numerous studies harnessing image informatics for the identification of morphological defects. However, existing work in this line primarily rests on abnormality detection via structural volumetrics between wild-type and gene-modified mice, which generally fails when the pathology involves no severe volume changes, such as ventricular septal defects (VSDs) in the heart. Furthermore, in embryo cardiac phenotyping, the lack of relevant work in embryonic heart segmentation, the limited availability of public atlases, and the general requirement of manual labor for the actual phenotype classification after abnormality detection, along with other limitations, have collectively restricted existing practices from meeting the high-throughput demands. This study proposes, to the best of our knowledge, the first fully automatic VSD classification framework in mouse embryo imaging. Our approach leverages a combination of atlas-based segmentation and snake evolution techniques to derive the segmentation of heart ventricles, where VSD classification is achieved by checking whether the left and right ventricles border or overlap with each other. A pilot study has validated our approach at a proof-of-concept level and achieved a classification accuracy of 100% through a series of empirical experiments on a database of 15 images.
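The final classification step described above reduces to a simple test on the segmentation output: given binary masks of the left and right ventricles, flag a VSD when the masks border or overlap. A toy 2D sketch of that check (the actual framework operates on 3D segmentations derived from atlas registration and snake evolution):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def touches_or_overlaps(left, right):
    # Dilate one mask by one voxel so direct adjacency counts as touching
    return bool(np.any(binary_dilation(left) & right))

left = np.zeros((10, 10), bool)
left[2:5, 2:5] = True            # toy left-ventricle mask
right = np.zeros((10, 10), bool)
right[5:8, 2:5] = True           # toy right-ventricle mask, adjacent to left

far = np.zeros((10, 10), bool)
far[8:10, 8:10] = True           # a clearly separated region

print(touches_or_overlaps(left, right))  # adjacent masks -> VSD flagged
print(touches_or_overlaps(left, far))    # separated masks -> normal septum
```

In a healthy heart the septum keeps the two ventricle masks apart, so the predicate is false; a breach in the septum makes them border or merge.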
This paper presents a method for automatically generating new designs from a set of existing objects of the same class using machine learning. In this particular work, we use a custom parametric chair design program to produce a large set of chairs that are tested for their physical properties using ergonomic simulations. Design schemata are found from this set of chairs and used to generate new designs by placing constraints on the generating parameters used in the program. The schemata are found by training decision trees on the chair data sets. These are automatically reverse engineered by examining the structure of the trees and creating a schema for each positive leaf. By finding a range of schemata, rather than a single solution, we maintain a diverse design space. This paper also describes how schemata for different properties can be combined to generate new designs that possess all properties required in a design brief. The method is shown to consistently produce viable designs, covering a large range of our design space, and demonstrates a significant time saving over generate-and-test using the same program and simulations.
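The tree-to-schema step above can be sketched generically: train a decision tree on parameterised designs, then walk the tree and, for each leaf predicting the desired property, collect the parameter bounds accumulated along its path — one schema per positive leaf. Parameters, the viability test, and the unit-interval ranges below are hypothetical placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 3))   # 3 hypothetical design parameters in [0, 1]
y = ((X[:, 0] > 0.4) & (X[:, 2] < 0.7)).astype(int)  # stand-in "viable" test

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = tree.tree_

def schemata(node=0, bounds=None):
    """Yield one {parameter: (lower, upper)} constraint set per positive leaf."""
    bounds = bounds or {}
    if t.children_left[node] == -1:          # reached a leaf
        if np.argmax(t.value[node]) == 1:    # leaf predicts "viable"
            yield dict(bounds)
        return
    f, thr = t.feature[node], t.threshold[node]
    lo, hi = bounds.get(f, (0.0, 1.0))
    # Left child: parameter f <= threshold; right child: parameter f > threshold
    yield from schemata(t.children_left[node], {**bounds, f: (lo, min(hi, thr))})
    yield from schemata(t.children_right[node], {**bounds, f: (max(lo, thr), hi)})

for schema in schemata():
    print(schema)   # each dict is a region of parameter space to sample from
```

Sampling new parameter vectors uniformly within any one schema yields candidate designs predicted viable, and keeping every positive leaf (rather than the single best) preserves diversity in the design space.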