Detecting divergence between oncogenic tumors plays a pivotal role in cancer diagnosis and therapy. This research work focused on designing a computational strategy to predict the class of lung cancer tumors from the structural and physicochemical properties (1497 attributes) of protein sequences obtained from genes defined by microarray analysis. The proposed methodology used hybrid feature selection techniques (gain ratio and correlation-based subset evaluators with Incremental Feature Selection) followed by Bayesian Network prediction to discriminate lung cancer tumors as Small Cell Lung Cancer (SCLC), Non-Small Cell Lung Cancer (NSCLC) and the COMMON classes. Moreover, this methodology eliminated the need for extensive data cleansing strategies on the protein properties and revealed the optimal and minimal set of features that contributed to lung cancer tumor classification, with improved accuracy compared to previous work. We also attempted to predict, via supervised clustering, the possible clusters in the lung tumor data. Our results revealed that supervised clustering algorithms performed poorly in differentiating the lung tumor classes. Hybrid feature selection identified the distribution of solvent accessibility, polarizability and hydrophobicity as the highest-ranked features, with Incremental Feature Selection and Bayesian Network prediction generating the optimal Jack-knife cross-validation accuracy of 87.6%. Precise categorization of oncogenic genes causing SCLC and NSCLC based on the structural and physicochemical properties of their protein sequences is expected to unravel the functionality of proteins that are essential in maintaining the genomic integrity of a cell. It is also expected to act as an informative source for drug design, targeting essential protein properties and their composition that are found to exist in lung cancer tumors.
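The abstract does not spell out the Incremental Feature Selection loop, so the following is only a minimal sketch under stated assumptions: the features are already ranked (the `ranked_features` list is hypothetical), accuracy is estimated by jack-knife (leave-one-out) validation, and a simple nearest-centroid classifier stands in for the Bayesian Network, which is not reproduced here. All function names and the toy data are illustrative.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT a Bayesian Network): pick the class whose mean vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))

    return min(sorted(centroids), key=lambda label: sq_dist(centroids[label]))


def jackknife_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy using only the columns listed in feature_idx."""
    proj = [[row[j] for j in feature_idx] for row in X]
    hits = 0
    for i in range(len(proj)):
        train_X = proj[:i] + proj[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, proj[i]) == y[i]
    return hits / len(proj)


def incremental_feature_selection(X, y, ranked_features):
    """Add ranked features one at a time; keep the shortest prefix with the best accuracy."""
    best_acc, best_subset = -1.0, []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        acc = jackknife_accuracy(X, y, subset)
        if acc > best_acc:  # strict comparison keeps the smaller subset on ties
            best_acc, best_subset = acc, subset
    return best_subset, best_acc


# Toy data (made up): column 0 separates the classes, column 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.0, 3.0], [1.0, 4.9], [0.9, 1.1], [1.1, 3.1]]
y = ["SCLC", "SCLC", "SCLC", "NSCLC", "NSCLC", "NSCLC"]
subset, acc = incremental_feature_selection(X, y, ranked_features=[0, 1])
print(subset, acc)
```

The strict `>` comparison is a design choice that favors the minimal feature set when several prefixes reach the same accuracy, in line with the abstract's emphasis on an "optimal and minimal" set.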
The challenging problems associated with the analysis of microarray datasets are high-dimensional features, small sample size, class imbalance, noisy data, and high-variance feature values. These lead to problems such as the curse of dimensionality, a decline in classification accuracy, and overfitting. Deep learning technology has gained massive popularity in biomedical research, and its algorithms are widely used to build models that solve complex classification problems. This study uses a deep neural network (DNN) to build classification models for microarray brain cancer data and ADPD (Alzheimer’s disease Parkinson’s disease) data from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. The small sample size and high-dimensional features of the given microarray data are addressed with a dimensionality reduction technique, namely Correlated Feature Selection (CFS). The features selected by CFS were fed into the DNN for classification. For better training of the DNN model, the learning rates of various optimization algorithms were compared. The final optimal subset selected by the CFS-DNN model includes 112 features on the brain cancer data, with an average classification accuracy of 95.83%, and 40 features on the ADPD data, with an average classification accuracy of 87.5%. The performance of the proposed model is validated using 10-fold cross-validation. The proposed approach is also evaluated using precision, recall, F1-score, and the Receiver Operating Characteristic curve. A comparative analysis against the state-of-the-art method in the literature shows that the proposed method performs better than that existing work and than conventional machine learning models.
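The abstract does not state how CFS scores a candidate subset. A minimal sketch, assuming the standard correlation-based merit heuristic (high feature-to-class correlation, low feature-to-feature correlation), is given here; whether the paper uses exactly this form is an assumption, and the function name and inputs are illustrative.

```python
from math import sqrt


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a k-feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    feature_class_corr: one correlation per feature with the class label.
    feature_feature_corr: correlations for every pair of features in the subset.
    """
    k = len(feature_class_corr)
    r_cf = sum(abs(r) for r in feature_class_corr) / k
    if feature_feature_corr:
        r_ff = sum(abs(r) for r in feature_feature_corr) / len(feature_feature_corr)
    else:
        r_ff = 0.0  # a single feature has no pairs
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Two equally predictive features: redundancy lowers the merit.
redundant = cfs_merit([0.8, 0.8], [1.0])
independent = cfs_merit([0.8, 0.8], [0.0])
print(redundant, independent)
```

Under this heuristic a second feature that is perfectly correlated with the first adds nothing to the merit, which is why such a filter tends to shrink very wide microarray data before classification.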
Autism spectrum disorder is the most used umbrella term for a myriad of neuro-degenerative/developmental conditions typified by inappropriate social behavior, lack of communication/comprehension skills, and restricted mental and emotional maturity. An intriguing aspect of this disorder is that it can be detected only by close monitoring of developmental milestones after childbirth. Moreover, the exact causes of this neurodevelopmental condition are still unknown. Besides, autism is prevalent across individuals irrespective of ethnicity, genetic/familial history, and economic/educational background. Although research suggests that autism is genetic in nature and that early detection can greatly enhance the independent lifestyle and societal adaptability of affected individuals, there is still a great dearth of information to support these statements with proven facts and figures. This research work emphasizes the application of automated machine learning combined with feature ranking techniques to generate significant feature signatures for the early detection of autism. Publicly available datasets based on the Q-chat scores of individuals across diverse age groups (toddlers, children, adolescents, and adults) have been employed in this study. A machine learning framework based on automated hyperparameter optimization is proposed to rank the potential nonclinical markers for autism. Moreover, this study ranked the AutoML models by Matthews correlation coefficient (MCC) and balanced accuracy, through which nonclinical markers were identified from these datasets. Besides, the feature signatures and their significance in distinguishing between classes are reported for the first time in autism detection. The proposed framework yielded ∼90% MCC and ∼95% balanced accuracy across all four age groups of autism datasets.
Deep learning approaches have yielded a maximum of 92.7% accuracy on the same datasets but are limited in their ability to extract significant markers, have not reported on MCC for unbalanced data, and cannot adapt automatically to new data entries. However, AutoML approaches are more flexible, easier to implement, and provide automated optimization, thereby yielding the highest accuracy with minimal user intervention.
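Because the abstract ranks models by MCC and balanced accuracy instead of plain accuracy, a short sketch of both metrics for the binary case may help. The confusion-matrix counts below are made up for illustration and are not taken from the study.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical unbalanced case: 10 positives, 90 negatives, half the positives missed.
tp, tn, fp, fn = 5, 90, 0, 5
plain_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(plain_accuracy, balanced_accuracy(tp, tn, fp, fn), mcc(tp, tn, fp, fn))
```

In this made-up case plain accuracy is 0.95 while balanced accuracy is 0.75, which illustrates why accuracy alone can be misleading on unbalanced data.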
Alzheimer’s and Parkinson’s disease are the most common forms of dementia that degenerate neurons in the brain. This paper presents a comparative study of the performance of data mining techniques on neurodegenerative data. Existing data mining algorithms give a classification accuracy of ~93% with the correlation-based feature subset selection method. The proposed Decremental Feature Selection Method yields a more optimal feature subset that gives higher prediction accuracy. Further exploration of computational methods to investigate the role of such genetic variants will aid in identifying the genetic cause of these diseases and in designing suitable drugs to target the gene property.
Background
Data mining techniques are used to mine unknown knowledge from huge data. Microarray gene expression (MGE) data plays a major role in predicting the type of cancer. But as MGE data is huge in volume, applying traditional data mining approaches is time consuming. Hence parallel programming frameworks like Hadoop, Spark and Mahout are necessary to ease the task of computation.
Objective
Not all gene expressions are necessary for prediction, so it is essential to select the important genes to improve classification accuracy. Feature selection algorithms are therefore parallelized and executed on the Spark framework to eliminate unnecessary genes and identify only predictive genes in much less time without affecting prediction accuracy.
Methods
A parallelized hybrid feature selection (HFS) method is proposed to serve this purpose. The method comprises parallelized correlation feature subset selection followed by rank-based feature selection methods. The selected subset of genes is evaluated using parallel classification algorithms. The accuracy values obtained are compared with those of the existing rank-weight feature selection and parallelized recursive feature selection methods, and also with the values obtained by executing parallelized HFS on DistributedWekaSpark.
Results
The classification accuracy obtained with the proposed parallelized HFS method is 97% for gastric cancer and 79% for childhood leukemia. The proposed parallelized HFS method produced a ~4% to ~15% improvement in classification accuracy compared with previous methods.
Conclusion
The results reveal that the proposed parallelized feature selection algorithm scales to growing medical data and predicts cancer sub-types in less time with higher accuracy.
Prediction of secondary site mutations that reinstate mutated p53 to normalcy has been the focus of intense research in the recent past, because p53 mutants have been implicated in more than half of all human cancers and restoration of p53 causes tumor regression. However, laboratory investigations are often laborious and resource intensive, whereas computational techniques could surmount these drawbacks. In view of this, we formulated a novel approach using computational techniques to predict the transcriptional activity of multiple-site (one-site to five-site) p53 mutants. The optimal MCC values obtained by the proposed approach for the prediction of one-site, two-site, three-site, four-site and five-site mutants were 0.775, 0.341, 0.784, 0.916 and 0.655 respectively, the highest reported thus far in the literature. We have also demonstrated that 2D and 3D features generate higher prediction accuracy of p53 activity, and our findings revealed the optimal results for prediction of p53 status reported to date. We believe detection of the secondary site mutations that suppress tumor growth may facilitate a better understanding of the relationship between p53 structure and function, and further knowledge of the molecular mechanisms and biological activity of p53, a targeted source for cancer therapy. We expect that our prediction methods and reported results may provide useful insights into p53 functional mechanisms and generate more avenues for using computational techniques in biological data analysis.
In DNA microarray research, the increase in gene expression samples and feature dimensions has become a challenge for feature selection. This makes it necessary to develop a more efficient and improved classification algorithm that selects optimal features in gene expression data. This study presents a new feature selection algorithm that combines Correlation Feature Selection (CFS) and the Velocity Clamping Particle Swarm Optimization (VCPSO) algorithm. This hybrid model takes advantage of both filters and wrappers. It also selects the subsets with optimal features to classify genes using different classifiers such as Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB) and Decision Tree (DT). Two bioinformatics problems form the basis of evaluation for the hybrid mechanisms: neurodegenerative brain disorder protein data and microarray cancer data. Reducing redundancy and finding optimal gene features is the need of the hour. Our experiments show that the CFS-VCPSO-SVM selection method eliminates the redundant features and classifies the gene expression data with maximum accuracy.
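The abstract names velocity clamping but does not give the update rule. A minimal sketch, assuming the textbook particle swarm velocity update with each velocity component clamped to the range [-v_max, v_max], is shown here. The parameter values (`v_max`, `inertia`, `c1`, `c2`) and function names are hypothetical and are not taken from the paper.

```python
import random


def clamp(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def update_particle(position, velocity, personal_best, global_best,
                    v_max=0.5, inertia=0.7, c1=1.5, c2=1.5, rng=random):
    """One PSO step per dimension, with the new velocity clamped before the move."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        v_new = inertia * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        v_new = clamp(v_new, v_max)  # velocity clamping keeps steps bounded
        new_velocity.append(v_new)
        new_position.append(x + v_new)
    return new_position, new_velocity


# A particle with runaway velocities is pulled back to the clamp limits.
pos, vel = update_particle(
    position=[0.0, 0.0], velocity=[10.0, -10.0],
    personal_best=[1.0, 1.0], global_best=[1.0, 1.0],
    rng=random.Random(0),
)
print(pos, vel)
```

Clamping prevents particles from overshooting the search region, which matters when each dimension encodes whether a gene is kept in the candidate subset.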
This research focuses on predicting, through Naïve Bayes learning, the possible p53 rescue mutants from amino-acid substitutions at the second, third and fourth site recombination that could reinstate normal p53 activity. The Naïve Bayes probability values of the amino-acid substitutions at the respective site-wise recombination were used to formulate the proposed Genetic Mutant Marker Extraction (GMME) technique, which could unearth the hot spot cancer, strong rescue and weak rescue mutants. The p53 mutation records depicting the amino-acid substitutions obtained by yeast assays, comprising nearly 16,700 records available at the University of California, Machine Learning Repository, were used as the training dataset for the GMME technique. The proposed GMME technique revealed the hot spot cancer mutants and the strong rescue and weak rescue mutants, leading to the detection of probable genetic markers for cancer prediction from the surface regions 96-289 constituting the second, third and fourth site recombinations. Thus far, computational approaches have been able to predict rescue markers at region-specific mutations (96-105, 114-123, 130-156 and 223-232) with respect to the second site recombination for only three hot spot cancer mutants, viz. P152L, R158L and G245S. The GMME technique aimed at predicting possible rescue markers for p53 mutants at the second, third and fourth site recombinations, revealing novel rescue markers for fourteen hot spot cancer mutants. Moreover, the GMME technique can be extended effectively to an increasing number of recombinant sites and used to predict novel rescue markers.
Biological data is prone to grow exponentially, which consumes more resources, time and manpower. Parallelization of algorithms could reduce overall execution time. There are two main challenges in parallelizing computational methods: (1) biological data is multi-dimensional in nature, and (2) parallel algorithms reduce execution time, but with the penalty of reduced prediction accuracy. This research paper targets these two issues and proposes the following approaches: (1) vertical partitioning of data along the feature space and horizontal partitioning along samples in order to ease the task of data parallelism, and (2) a parallel Multilevel Feature Selection (M-FS) algorithm to select optimal and important features for improved classification of cancer sub-types. The selected features are evaluated using parallel Random Forest on Spark and compared with previously reported results and with the results of sequential execution of the same algorithms. The proposed parallel M-FS algorithm was compared with existing parallel feature selection algorithms in terms of accuracy and execution time. The results reveal that the parallel multilevel feature selection algorithm improved cancer classification, resulting in prediction accuracy ranging from ∼85% to ∼99% with very high speed-up, with execution time in terms of seconds. On the other hand, existing sequential algorithms yielded prediction accuracy of ∼65% to ∼99% with execution times of more than 24 hours.
• Biological data keeps growing and dealing with this huge data is a challenging task.
• Parallel algorithms solve this issue with increased speed-up but affect accuracy.
• Parallel Multilevel Feature Selection method applies vertical & horizontal partition.
• Parallel Multilevel Feature Selection method selects optimal and important features.
• Classification followed by Feature Selection improves classification accuracy.
• The proposed method improved accuracy at high speed-up compared to existing methods.
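The difference between horizontal partitioning (along samples) and vertical partitioning (along the feature space) can be sketched with plain Python lists. This is only an illustration under assumptions: the work itself runs on Spark, whose API is not reproduced here, and the function names and the round-robin split are hypothetical choices.

```python
def horizontal_partitions(rows, n_parts):
    """Split the samples (rows) round-robin; every partition keeps all features."""
    return [rows[i::n_parts] for i in range(n_parts)]


def vertical_partitions(rows, n_parts):
    """Split the feature space (columns); every partition keeps all samples.

    Returns (column_indices, sub_matrix) pairs so that features selected inside
    a partition can be mapped back to the original column numbers.
    """
    n_features = len(rows[0])
    column_groups = [list(range(i, n_features, n_parts)) for i in range(n_parts)]
    return [(cols, [[row[j] for j in cols] for row in rows]) for cols in column_groups]


# Toy matrix (made up): 3 samples x 4 features.
data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12]]
by_samples = horizontal_partitions(data, 2)
by_features = vertical_partitions(data, 2)
print(by_samples)
print(by_features)
```

Vertical partitions suit feature selection on wide data, since each worker can score its own block of features over all samples, while horizontal partitions suit steps that process samples independently.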
Ontology provides an organizational framework of concepts and a system that depicts hierarchical and associative relationships pertaining to an application domain. The possibility of reuse and data sharing permitted by ontology, along with the formal structure coupled with hierarchies of concepts and their inter-relationships, offers the opportunity to draw complex inferences and reasoning. This rationale was the motivation to construct an ontology for Psoriasis Risk Assessment and Remedy (PRAR). This paper targets two issues: (i) the need for a medical database from which to derive an ontology and (ii) a methodology for the design of a Semi-Automated Ontology Construction Framework (SOCF) from pioneered data. Psoriasis is one of the most recurrent skin issues in India and the world at large, and hence this paper targeted the need to generate a Psoriasis Remedy Database and automatically infer the relations between the Symptoms, Causes and Treatment through Semi-Automated Ontology Reasoning and Inference. The proposed system incorporated two phases: (i) formulation of a novel database for Psoriasis Risk Assessment and Remedy (PRAR) and (ii) articulation of a novel framework for Psoriasis detection through computational modeling and Ontology Construction. The proposed methodology was tested on 112 samples from the authenticated UCI Machine Learning Repository. The ontology developed using the proposed SOCF mapped the risk factors and remedies for Psoriasis detection with 98.7% accuracy, which is reported here for the first time.