As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
Machine learning algorithms, when applied to sensitive data, pose a distinct threat to privacy. A growing body of prior work demonstrates that models produced by these algorithms may leak specific private information in the training data to an attacker, either through the models' structure or their observable behavior. However, the underlying cause of this privacy risk is not well understood beyond a handful of anecdotal accounts that suggest overfitting and influence might play a role. This paper examines the effect that overfitting and influence have on the ability of an attacker to learn information about the training data from machine learning models, either through training set membership inference or attribute inference attacks. Using both formal and empirical analyses, we illustrate a clear relationship between these factors and the privacy risk that arises in several popular machine learning algorithms. We find that overfitting is sufficient to allow an attacker to perform membership inference and, when the target attribute meets certain conditions about its influence, attribute inference attacks. Interestingly, our formal analysis also shows that overfitting is not necessary for these attacks and begins to shed light on what other factors may be in play. Finally, we explore the connection between membership inference and attribute inference, showing that there are deep connections between the two that lead to effective new attacks.
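The membership inference attack described above can be sketched with a confidence-thresholding attacker: an overfit model tends to assign higher confidence to its own training points, and the attacker exploits that gap. This is an illustrative toy example on synthetic data, not the paper's actual attack or experimental setup; the model, dataset, and threshold are all assumptions.

```python
# Minimal sketch of a confidence-based membership inference attack
# on synthetic data. An overfit model is typically more confident on
# its own training points; the attacker thresholds that confidence.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Target model; fully grown trees tend to memorize training points.
target = RandomForestClassifier(n_estimators=100, random_state=0)
target.fit(X_train, y_train)

def infer_membership(model, x, threshold=0.9):
    """Guess 'member' when the model's top-class confidence exceeds threshold."""
    return model.predict_proba(x.reshape(1, -1)).max() >= threshold

members = np.mean([infer_membership(target, x) for x in X_train])
nonmembers = np.mean([infer_membership(target, x) for x in X_test])
print(f"fraction flagged as members: train={members:.2f}, test={nonmembers:.2f}")
```

A gap between the two flagged fractions is exactly the leakage the abstract attributes to overfitting: the attacker's guesses are correct more often on true members than on non-members.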
Interest in machine-learning applications within medicine has been growing, but few studies have progressed to deployment in patient care. We present a framework, context and ultimately guidelines for accelerating the translation of machine-learning-based interventions in health care. To be successful, translation will require a team of engaged stakeholders and a systematic process from beginning (problem formulation) to end (widespread deployment).
Machine learning methods have been remarkably successful for a wide range of application areas in the extraction of essential information from data. An exciting and relatively recent development is the uptake of machine learning in the natural sciences, where the major goal is to obtain novel scientific insights and discoveries from observational or simulated data. A prerequisite for obtaining a scientific outcome is domain knowledge, which is needed to gain explainability but also to enhance scientific consistency. In this article, we review explainable machine learning in view of applications in the natural sciences and discuss three core elements that we identified as relevant in this context: transparency, interpretability, and explainability. With respect to these core elements, we provide a survey of recent scientific works that incorporate machine learning and discuss how explainable machine learning is used in combination with domain knowledge from the application areas.
PM2.5 refers to the total mass concentration of fine particulates in the atmosphere near the surface, obtained by means of in situ observations and satellite remote sensing. Given the highly limited number of inhomogeneously distributed ground observation stations and an ill-posed remote sensing approach, increasing efforts have been devoted to the application of machine-learning (ML) models to both ground and satellite data. A key satellite-derived parameter, aerosol optical thickness (AOD), has been most commonly used as a proxy for PM2.5, although their correlation is fraught with large uncertainties. A critical question that has been overlooked concerns how much AOD helps to improve the retrieval of PM2.5 relative to the uncertainty it incurs concurrently. The question is addressed here by taking advantage of high-density PM2.5 stations in eastern China to evaluate the contribution of AOD, determined as the difference in the accuracy of PM2.5 retrievals with and without AOD for varying densities of PM2.5 stations, using four popular ML models (Random Forest, Extra-Trees, XGBoost, and LightGBM). Our results reveal that as the density of monitoring stations decreases, both the feature importance and permutation importance of satellite AOD show a consistent upward trend (p < 0.05). Furthermore, the ML models without AOD exhibit faster declines in overall accuracy and predictive ability than the models with AOD, as assessed using sample-based and station-based (spatial) independent cross-validation. Overall, a 10% reduction in the number of stations results in an increase of 0.7–1.2% and 0.6–1.2% in the uncertainty of estimated and predicted accuracies, respectively. These findings attest to the indispensable role of satellite AOD in the ML-based PM2.5 retrieval process, because it significantly mitigates the negative impact of the sparse distribution of monitoring sites. This role becomes more important as the number of PM2.5 stations decreases.
Recent controversies about the replicability of behavioral research analyzed with statistical inference have spurred interest in developing more efficient techniques for analyzing the results of psychological experiments. Here we argue that complementing the analytical workflow of psychological experiments with machine-learning-based analysis will both maximize accuracy and minimize replicability issues. Compared with statistical inference, ML analysis of experimental data is model agnostic and focused primarily on prediction rather than inference. We also highlight potential pitfalls of adopting machine-learning-based experiment analysis: if not properly used, it can lead to over-optimistic accuracy estimates similar to those observed with statistical inference. Remedies to such pitfalls are presented, such as building models based on cross-validation and using ensemble models. Finally, because ML models are typically regarded as black boxes, we discuss strategies aimed at making their predictions more transparent.
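The two remedies named above can be sketched in a few lines: cross-validated accuracy versus the over-optimistic accuracy obtained by scoring a model on its own training data, plus an ensemble model. This is a generic illustration on synthetic data, not the authors' analysis pipeline.

```python
# Sketch of the remedies above on synthetic data: held-out
# (cross-validated) accuracy vs. over-optimistic training-set
# ("resubstitution") accuracy, and an ensemble model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
resub = tree.fit(X, y).score(X, y)             # scored on its own training data
cv = cross_val_score(tree, X, y, cv=5).mean()  # held-out estimate
print(f"tree: resubstitution={resub:.2f}, cross-validated={cv:.2f}")

forest_cv = cross_val_score(RandomForestClassifier(random_state=0),
                            X, y, cv=5).mean()
print(f"ensemble (random forest), cross-validated: {forest_cv:.2f}")
```

The resubstitution score is exactly the kind of over-optimistic estimate the abstract warns about; only the cross-validated figures are honest estimates of out-of-sample accuracy.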
Concerns over cybersecurity in critical systems have grown significantly over the last decade. The increase in successful attacks against infrastructure, major corporations, and governments has led to major investment in mitigating and preventing cyberattacks. At the same time, there has been significant interest in utilizing data in operations, with machine learning applications becoming a popular area of study. One industry exploring machine learning applications is the nuclear industry. Because of the sensitive nature of nuclear systems, the question of whether attacks on nuclear data can be detected has begun to take on urgency. This study explores the use of autoencoders to detect anomalies in nuclear data that could potentially be used to evaluate the operating status of a nuclear system. Data from a generic pressurized water reactor simulator, used in a previous study to diagnose transients, were used to train an autoencoder model in Keras. A separate portion of these data was altered by adding statistical noise for validation; four different levels of noise were used in this experiment. Once the autoencoder was trained, a threshold was calculated from the average mean square error of the predictions plus the standard deviation of that loss. Points above the threshold were classified as anomalies, while points below were considered unaltered. For the initial level of noise, the model achieved near-perfect recall, capturing all but 13 of the 13,884 altered points. However, in terms of precision, the model misclassified a number of unaltered points as altered, resulting in a score of 73.76%. To test the sensitivity of the model, the amount of noise was reduced three times, and, as expected, the performance of the model worsened with each reduction. Still, the high performance in identifying altered points at higher levels of noise is an encouraging first step in developing anomaly detection systems for nuclear data.
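The thresholding rule described above (mean reconstruction error plus one standard deviation) can be sketched without the autoencoder itself, using synthetic per-point reconstruction errors in place of the trained model's MSE. The error distributions and noise offset below are assumptions for illustration, not the study's data.

```python
# Minimal sketch of the anomaly threshold described above, with
# synthetic reconstruction errors standing in for the autoencoder's
# per-point mean square error.
import numpy as np

rng = np.random.default_rng(0)
clean_err = rng.gamma(shape=2.0, scale=0.01, size=10_000)  # unaltered points
noisy_err = clean_err[:2_000] + 0.15                       # added noise raises MSE

# Threshold = mean reconstruction error + one standard deviation,
# computed on data assumed to be unaltered.
threshold = clean_err.mean() + clean_err.std()

def is_anomaly(err):
    return err > threshold                 # True -> classified as altered

recall = is_anomaly(noisy_err).mean()      # fraction of altered points caught
false_pos = is_anomaly(clean_err).mean()   # unaltered points misclassified
print(f"recall={recall:.3f}, false-positive rate={false_pos:.3f}")
```

The trade-off in the abstract shows up directly: a mean-plus-one-sigma threshold catches virtually all strongly perturbed points (high recall) while inevitably flagging some of the upper tail of the clean-error distribution (imperfect precision).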