Recent advances in Deep Neural Networks (DNNs) and sensor technologies are enabling autonomous driving systems (ADSs) with an ever-increasing level of autonomy. However, assessing their dependability remains a critical concern. State-of-the-art ADS testing approaches modify the controllable attributes of a simulated driving environment until the ADS misbehaves. In such approaches, environment instances in which the ADS is successful are discarded, despite the possibility that they could contain hidden driving conditions in which the ADS may misbehave. In this paper, we present GenBo (GENerator of BOundary state pairs), a novel test generator for ADS testing. GenBo mutates the driving conditions of the ego vehicle (position, velocity and orientation), collected in a failure-free environment instance, and efficiently generates challenging driving conditions at the behavior boundary (i.e., where the model starts to misbehave) in the same environment instance. We use such boundary conditions to augment the initial training dataset and retrain the DNN model under test. Our evaluation results show that the retrained model has, on average, up to 3× higher success rate on a separate set of evaluation tracks with respect to the original DNN model.
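The core idea of searching for a boundary state pair can be illustrated with a minimal sketch. The `simulate` stub, the state representation, and the bisection over a single attribute are assumptions made for illustration; they do not reproduce GenBo's actual mutation operators or simulator interface.

```python
def simulate(state):
    # Stand-in for executing the ADS in the simulator: here the model
    # "misbehaves" once velocity exceeds an arbitrary threshold (30.0).
    return state["velocity"] <= 30.0  # True = failure-free episode

def boundary_pair(safe, failing, tol=0.1):
    """Bisect between a failure-free and a failing driving state until the
    pair straddles the behavior boundary within `tol`."""
    assert simulate(safe) and not simulate(failing)
    while abs(failing["velocity"] - safe["velocity"]) > tol:
        mid = {**safe, "velocity": (safe["velocity"] + failing["velocity"]) / 2}
        if simulate(mid):
            safe = mid       # still failure-free: move the safe side up
        else:
            failing = mid    # crossed the boundary: move the failing side down
    return safe, failing

# The resulting pair brackets the point where behavior changes; in GenBo's
# setting, such boundary states are what augments the training dataset.
s, f = boundary_pair({"velocity": 10.0}, {"velocity": 50.0})
```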
Context
Code hardening is meant to fight malicious tampering with sensitive code executed on client hosts. Code splitting is a hardening technique that moves selected chunks of code from client to server. Although widely adopted, the effective benefits of code splitting are not fully understood and thoroughly assessed.
Objective
The objective of this work is to compare unprotected code with code protected by code splitting, considering two levels of the chunk size parameter, in order to assess the effectiveness of the protection, in terms of both attack time and success rate, and to understand the attack strategies and processes used to overcome it.
Method
We conducted an experiment with master's students performing attack tasks on a small application hardened with different levels of protection. Students carried out their tasks working at the source code level.
Results
We observed a statistically significant effect of code splitting on the attack success rate, which was reduced, on average, from 89% with unprotected clear code to 52% with the most effective protection. The protection variant that moved several small code chunks turned out to be more effective than the alternative moving fewer but larger chunks. Different attack strategies were identified, yielding different success rates. Moreover, we discovered that successful attacks exhibited a different process than failed ones.
Conclusions
We found empirical evidence of the effect of code splitting, assessed its relative magnitude, and evaluated the influence of the chunk size parameter. Moreover, we extracted the process used to overcome this obfuscation technique.
Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this article, we propose a search-based approach to test such agents. Our approach, implemented in a tool called Indago, trains a classifier on failure and non-failure (i.e., pass) environment configurations resulting from the DRL training process. The classifier is used at testing time as a surrogate model for the DRL agent execution in the environment, predicting the extent to which a given environment configuration induces a failure of the DRL agent under test. The failure prediction acts as a fitness function, guiding the generation towards failure environment configurations, while saving computation time by deferring the execution of the DRL agent in the environment to those configurations that are more likely to expose failures. Experimental results show that our search-based approach finds 50% more failures of the DRL agent than state-of-the-art techniques. Moreover, such failures are, on average, 78% more diverse; similarly, the behaviors of the DRL agent induced by failure configurations are 74% more diverse.
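The surrogate-guided search described above can be sketched in a few lines. Both the surrogate scorer and the agent execution below are toy stubs with illustrative names; a trained classifier and a real DRL environment would take their places, and this is not Indago's actual API.

```python
import random

random.seed(0)

def surrogate_failure_prob(config):
    # Stand-in for the trained classifier: scores a configuration by how
    # "hard" its parameters look, without running the agent.
    return min(1.0, (config["obstacles"] + config["curvature"]) / 20)

def run_agent(config):
    # Stand-in for the expensive real execution of the DRL agent: the agent
    # passes unless the environment is sufficiently hard.
    return config["obstacles"] + config["curvature"] < 14  # True = pass

def search(budget=5, pool=200):
    # Generate many candidate configurations cheaply...
    candidates = [{"obstacles": random.randint(0, 10),
                   "curvature": random.randint(0, 10)} for _ in range(pool)]
    # ...rank them with the surrogate acting as a fitness function...
    candidates.sort(key=surrogate_failure_prob, reverse=True)
    # ...and spend the real-execution budget only on the top-ranked ones.
    return [c for c in candidates[:budget] if not run_agent(c)]

failures = search()
```

The saving comes from the last step: only `budget` configurations are ever executed in the environment, while the surrogate filters the rest.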
The dataset available for pre-release training of a machine-learning based system is often not representative of all possible execution contexts that the system will encounter in the field. Reinforcement Learning (RL) is a prominent approach among those that support continual learning, i.e., learning continually in the field, in the post-release phase. No study has so far investigated any method to test the plasticity of RL-based systems, i.e., their capability to adapt to an execution context that may deviate from the training one. We propose an approach to test the plasticity of RL-based systems. The output of our approach is a quantification of the adaptation and anti-regression capabilities of the system, obtained by computing the adaptation frontier of the system in a changed environment. We visualize such frontier as an adaptation/anti-regression heatmap in two dimensions, or as a clustered projection when more than two dimensions are involved. In this way, we provide developers with information on the amount of changes that can be accommodated by the continual learning component of the system, which is key to decide if online, in-the-field learning can be safely enabled or not.
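In the two-dimensional case, the adaptation frontier amounts to a pass/fail grid over environment changes. The sketch below assumes a hypothetical `adapt_and_test` routine and two scalar change parameters; these names are illustrative, not the approach's real interface.

```python
def adapt_and_test(dx, dy):
    # Stand-in for: change the environment by (dx, dy), let the RL
    # component adapt in the field, then check for task success and
    # absence of regressions. Here, adaptation succeeds for small changes.
    return dx + dy <= 3  # True = adapted successfully

def adaptation_heatmap(max_change=4):
    # Sweep both change dimensions and record where adaptation holds ("A")
    # versus where it fails ("F"); the A/F border is the adaptation frontier.
    grid = []
    for dy in range(max_change + 1):
        row = ["A" if adapt_and_test(dx, dy) else "F"
               for dx in range(max_change + 1)]
        grid.append(row)
    return grid

for row in adaptation_heatmap():
    print("".join(row))
```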
Safe handling of hazardous driving situations is a task of high practical relevance for building reliable and trustworthy cyber‐physical systems such as autonomous driving systems. This task necessitates an accurate prediction system of the vehicle's confidence to prevent potentially harmful system failures on the occurrence of unpredictable conditions that make it less safe to drive. In this paper, we discuss the challenges of adapting a misbehavior predictor with knowledge mined during the execution of the main system. Then, we present a framework for the continual learning of misbehavior predictors, which records in‐field behavioral data to determine what data are appropriate for adaptation. Our framework guides adaptive retraining using a novel combination of in‐field confidence metric selection and reconstruction error‐based weighing. We evaluate our framework to improve a misbehavior predictor from the literature on the Udacity simulator for self‐driving cars. Our results show that our framework can reduce the false positive rate by a large margin and can adapt to nominal behavior drifts while maintaining the original capability to predict failures up to several seconds in advance.
Anticipating hazardous driving situations has high practical relevance for reliable autonomous driving systems. We propose a framework for the continual adaptation of a misbehavior predictor of environmental uncertainty. Our framework guides adaptive retraining using a combination of in‐field confidence metric selection and reconstruction error‐based weighing. Our framework can reduce the false positive rate by a large margin and can adapt to nominal behavior drifts while maintaining the original capability to predict failures up to several seconds in advance.
Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of AI tasks. Often consisting of hundreds of millions, if not hundreds of billions, of parameters, these DNNs are too large to be deployed to or efficiently run on resource-constrained devices such as mobile phones or Internet of Things microcontrollers. Systems relying on large-scale DNNs thus have to call the corresponding model over the network, leading to substantial costs for hosting and running the large-scale remote model, costs which are often charged on a per-use basis. In this article, we propose BiSupervised, a novel architecture, where, before relying on a large remote DNN, a system attempts to make a prediction on a small-scale local model. A DNN supervisor monitors said prediction process and identifies easy inputs for which the local prediction can be trusted. For these inputs, the remote model does not have to be invoked, thus saving costs while only marginally impacting the overall system accuracy. Our architecture furthermore foresees a second supervisor to monitor the remote predictions and identify inputs for which not even these can be trusted, allowing the system to raise an exception or run a fallback strategy instead. We evaluate the cost savings and the ability to detect incorrectly predicted inputs on four diverse case studies: IMDb movie review sentiment classification, GitHub issue triaging, ImageNet image classification, and SQuADv2 free-text question answering. In all four case studies, we find that BiSupervised reduces cost by at least 30% while maintaining similar system-level prediction performance. In two case studies (IMDb and SQuADv2), we find that BiSupervised even achieves a higher system-level accuracy, at reduced cost, compared to a remote-only model. Furthermore, measurements taken on our setup indicate a large potential of BiSupervised to reduce average prediction latency.
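The two-supervisor routing logic can be sketched as a simple gate. The function, the confidence thresholds, and the toy models below are assumptions for illustration only, not the BiSupervised API; real supervisors would use richer trust signals than a raw confidence score.

```python
def bisupervised_predict(x, local_model, remote_model,
                         local_threshold=0.9, remote_threshold=0.5):
    # First supervisor: trust the cheap local model on "easy" inputs,
    # avoiding the paid remote call entirely.
    label, confidence = local_model(x)
    if confidence >= local_threshold:
        return label, "local"
    # Otherwise invoke the large remote model.
    label, confidence = remote_model(x)
    # Second supervisor: if even the remote prediction cannot be trusted,
    # raise so the caller can run a fallback strategy.
    if confidence >= remote_threshold:
        return label, "remote"
    raise ValueError("untrusted prediction")

# Toy models returning (label, confidence) pairs.
local = lambda x: ("pos", 0.95) if x == "great" else ("neg", 0.4)
remote = lambda x: ("neg", 0.8)
```

With these stubs, an "easy" input is answered locally, while a hard one triggers the remote call; only inputs both supervisors reject reach the exception path.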
With the advent of deep learning, AI components have achieved unprecedented performance on complex, human-competitive tasks, such as image, video, text and audio processing. Hence, they are increasingly integrated into sophisticated software systems, some of which (e.g., autonomous vehicles) are required to deliver certified dependability warranties. In this talk, I will consider the unique features of AI-based systems and of the faults possibly affecting them, in order to revise the testing fundamentals and redefine the overall goal of testing, taking a statistical view on the dependability warranties that can be actually delivered. Then, I will consider the key elements of a revised testing process for AI-based systems, including the test oracle and the test input generation problems. I will also introduce the notion of runtime supervision, to deal with unexpected error conditions that may occur in the field. Finally, I will identify the future steps that are essential to close the loop from testing to operation, proposing an empirical framework that reconnects the output of testing to its original goals.
This is the Replicated Computational Results (RCR) Report for our TOSEM paper “Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks”, where we propose a novel client-server architecture that leverages the high accuracy of huge neural networks running on remote servers while reducing the economic and latency costs typically incurred by using such models. As part of this RCR, we provide a replication package, which allows the full replication of all our results and is specifically designed to facilitate reuse.
Context: Both biallelic and monoallelic mutations in PROK2 or PROKR2 have been found in Kallmann syndrome (KS).
Objective: The objective of the study was to compare the phenotypes of KS patients harboring monoallelic and biallelic mutations in these genes.
Design and Patients: We studied clinical and endocrine features that reflect the functioning of the pituitary-gonadal axis, and the nonreproductive phenotype, in 55 adult KS patients (42 men and 13 women), of whom 41 had monoallelic mutations and 14 biallelic mutations in PROK2 or PROKR2.
Results: Biallelic mutations were associated with more frequent cryptorchidism (70% vs. 34%, P < 0.05) and microphallus (90% vs. 28%, P < 0.001) and lower mean testicular volume (1.2 ± 0.4 vs. 4.5 ± 6.0 ml; P < 0.01) in male patients. Likewise, the testosterone level as well as the basal FSH level and peak LH level under GnRH-stimulation were lower in males with biallelic mutations (0.2 ± 0.1 vs. 0.7 ± 0.8 ng/ml; P = 0.05, 0.3 ± 0.1 vs. 1.8 ± 3.0 IU/liter; P < 0.05, and 0.8 ± 0.8 vs. 5.2 ± 5.5 IU/liter; P < 0.05, respectively). Nonreproductive, nonolfactory anomalies were rare in both sexes and were never found in patients with biallelic mutations. The mean body mass index of the patients (23.9 ± 4.2 kg/m2 in males and 26.3 ± 6.6 kg/m2 in females) did not differ significantly from that of gender-, age-, and treatment-matched KS individuals who did not carry a mutation in PROK2 or PROKR2. Finally, circadian cortisol levels evaluated in five patients, including one with biallelic PROKR2 mutations, were normal in all cases.
Conclusion: Male patients carrying biallelic mutations in PROK2 or PROKR2 have a less variable and on average a more severe reproductive phenotype than patients carrying monoallelic mutations in these genes. Nonreproductive, nonolfactory clinical anomalies associated with KS seem to be restricted to patients with monoallelic mutations.
Patients carrying biallelic mutations in PROK2 or PROKR2 have a more severe reproductive phenotype than patients carrying monoallelic mutations in these genes.