We propose a new method based on machine learning to
play the devil’s advocate
and investigate the impact of unknown systematic effects in a quantitative way. This method proceeds by reversing the ...measurement process and using the physics results to interpret systematic effects under the Standard Model hypothesis. We explore this idea with two alternative approaches: the first one relies on a combination of gradient descent and optimisation techniques, its application and potentiality is illustrated with an example that studies the branching fraction measurement of a heavy-flavour decay. The second method employs reinforcement learning and it is applied to the determination of the
P
5
′
angular observable in
B
0
→
K
∗
0
μ
+
μ
-
decays. We find that for the former, the size of a hypothetical hidden systematic uncertainty strongly depends on the kinematic overlap between the signal and normalisation channel, while the latter is very robust against possible mismodellings of the efficiency.
Abstract
The discovery of conservation principles is crucial for understanding the fundamental behavior of both classical and quantum physical systems across numerous domains. This paper introduces ...an innovative method that merges representation learning and topological analysis to explore the topology of conservation law spaces. Notably, the robustness of our approach to noise makes it suitable for complex experimental setups and its aptitude extends to the analysis of quantum systems, as successfully demonstrated in our paper. We exemplify our method’s potential to unearth previously unknown conservation principles and endorse interdisciplinary research through a variety of physical simulations. In conclusion, this work emphasizes the significance of data-driven techniques in deepening our comprehension of the principles governing classical and quantum physical systems.
Program code has recently become a valuable active data source for training various data science models, from code classification to controlled code synthesis. Annotating code snippets play an ...essential role in such tasks. This article presents a novel approach that leverages CodeBERT, a powerful transformer-based model, to classify code snippets extracted from Code4ML automatically. Code4ML is a comprehensive machine learning code
compiled from Kaggle, a renowned data science competition platform. The
includes code snippets and information about the respective kernels and competitions, but it is limited in the quality of the tagged data, which is ~0.2%. Our method addresses the lack of labeled snippets for supervised model training by exploiting the internal ambiguity in particular labeled snippets where multiple class labels are combined. Using a specially designed algorithm, we effectively separate these ambiguous fragments, thereby expanding the pool of training data. This data augmentation approach greatly increases the amount of labeled data and improves the overall quality of the trained models. The experimental results demonstrate the prowess of the proposed code classifier, achieving an impressive F1 test score of ~89%. This achievement not only enhances the practicality of CodeBERT for classifying code snippets but also highlights the importance of enriching large-scale annotated machine learning code datasets such as Code4ML. With a significant increase in accurately annotated code snippets, Code4ML is becoming an even more valuable resource for learning and improving various data processing models.
Anomaly detection is a challenging task that frequently arises in practically all areas of industry and science, from fraud detection and data quality monitoring to finding rare cases of diseases and ...searching for new physics. Most of the conventional approaches to anomaly detection, such as one-class SVM and Robust Auto-Encoder, are one-class classification methods,
, focus on separating normal data from the rest of the space. Such methods are based on the assumption of separability of normal and anomalous classes, and subsequently do not take into account any available samples of anomalies. Nonetheless, in practical settings, some anomalous samples are often available; however, usually in amounts far lower than required for a balanced classification task, and the separability assumption might not always hold. This leads to an important task-incorporating known anomalous samples into training procedures of anomaly detection models. In this work, we propose a novel model-agnostic training procedure to address this task. We reformulate one-class classification as a binary classification problem with normal data being distinguished from pseudo-anomalous samples. The pseudo-anomalous samples are drawn from low-density regions of a normalizing flow model by feeding tails of the latent distribution into the model. Such an approach allows to easily include known anomalies into the training process of an arbitrary classifier. We demonstrate that our approach shows comparable performance on one-class problems, and, most importantly, achieves comparable or superior results on tasks with variable amounts of known anomalies.
Simulation is one of the key components in high energy physics. Historically it relies on the Monte Carlo methods which require a tremendous amount of computation resources. These methods may have ...difficulties with the expected High Luminosity Large Hadron Collider (HL-LHC) needs, so the experiments are in urgent need of new fast simulation techniques. We introduce a new Deep Learning framework based on Generative Adversarial Networks which can be faster than traditional simulation methods by 5 orders of magnitude with reasonable simulation accuracy. This approach will allow physicists to produce a sufficient amount of simulated data needed by the next HL-LHC experiments using limited computing resources.
There are many problems in physics, biology, and other natural sciences in which symbolic regression can provide valuable insights and discover new laws of nature. Widespread deep neural networks do ...not provide interpretable solutions. Meanwhile, symbolic expressions give us a clear relation between observations and the target variable. However, at the moment, there is no dominant solution for the symbolic regression task, and we aim to reduce this gap with our algorithm. In this work, we propose a novel deep learning framework for symbolic expression generation
variational autoencoder (VAE). We suggest using a VAE to generate mathematical expressions, and our training strategy forces generated formulas to fit a given dataset. Our framework allows encoding apriori knowledge of the formulas into fast-check predicates that speed up the optimization process. We compare our method to modern symbolic regression benchmarks and show that our method outperforms the competitors under noisy conditions. The recovery rate of SEGVAE is 65% on the Ngyuen dataset with a noise level of 10%, which is better than the previously reported SOTA by 20%. We demonstrate that this value depends on the dataset and can be even higher.
The use of program code as a data source is increasingly expanding among data scientists. The purpose of the usage varies from the semantic classification of code to the automatic generation of ...programs. However, the machine learning model application is somewhat limited without annotating the code snippets. To address the lack of annotated datasets, we present the Code4ML
. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle-the leading platform for hosting data science competitions. The
consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. Code4ML dataset can help address a number of software engineering or data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.
Abstract
Modification of physical properties of materials and design of materials with on-demand characteristics is at the heart of modern technology. Rare application relies on pure materials—most ...devices and technologies require careful design of materials properties through alloying, creating heterostructures of composites, or controllable introduction of defects. At the same time, such designer materials are notoriously difficult to model. Thus, it is very tempting to apply machine learning methods to such systems. Unfortunately, there is only a handful of machine learning-friendly material databases available these days. We develop a platform for easy implementation of machine learning techniques to materials design and populate it with datasets on pristine and defected materials. Here we introduce the 2D Material Defect (2DMD) datasets that include defect properties of represented 2D materials such as MoS
2
, WSe
2
, hBN, GaSe, InSe, and black phosphorous, calculated using DFT. Our study provides a data-driven physical understanding of complex behaviors of defect properties in 2D materials, holding promise for a guide to the development of efficient machine learning models. In addition, with the increasing enrollment of datasets, our database could provide a platform for designing materials with predetermined properties.
Abstract
Two-dimensional materials offer a promising platform for the next generation of (opto-) electronic devices and other high technology applications. One of the most exciting characteristics of ...2D crystals is the ability to tune their properties via controllable introduction of defects. However, the search space for such structures is enormous, and ab-initio computations prohibitively expensive. We propose a machine learning approach for rapid estimation of the properties of 2D material given the lattice structure and defect configuration. The method suggests a way to represent configuration of 2D materials with defects that allows a neural network to train quickly and accurately. We compare our methodology with the state-of-the-art approaches and demonstrate at least 3.7 times energy prediction error drop. Also, our approach is an order of magnitude more resource-efficient than its contenders both for the training and inference part.
Adversarial Optimization provides a reliable, practical way to match two implicitly defined distributions, one of which is typically represented by a sample of real data, and the other is represented ...by a parameterized generator. Matching of the distributions is achieved by minimizing a divergence between these distribution, and estimation of the divergence involves a secondary optimization task, which, typically, requires training a model to discriminate between these distributions. The choice of the model has its trade-off: high-capacity models provide good estimations of the divergence, but, generally, require large sample sizes to be properly trained. In contrast, low-capacity models tend to require fewer samples for training; however, they might provide biased estimations. Computational costs of Adversarial Optimization becomes significant when sampling from the generator is expensive. One of the practical examples of such settings is fine-tuning parameters of complex computer simulations. In this work, we introduce a novel family of divergences that enables faster optimization convergence measured by the number of samples drawn from the generator. The variation of the underlying discriminator model capacity during optimization leads to a significant speed-up. The proposed divergence family suggests using low-capacity models to compare distant distributions (typically, at early optimization steps), and the capacity gradually grows as the distributions become closer to each other. Thus, it allows for a significant acceleration of the initial stages of optimization. This acceleration was demonstrated on two fine-tuning problems involving Pythia event generator and two of the most popular black-box optimization algorithms: Bayesian Optimization and Variational Optimization. Experiments show that, given the same budget, adaptive divergences yield results up to an order of magnitude closer to the optimum than Jensen-Shannon divergence. While we consider physics-related simulations, adaptive divergences can be applied to any stochastic simulation.