Today’s enterprise software systems are much more complicated than in the past. Increasing numbers of dependent applications, heterogeneous technologies, and the wide usage of Service-Oriented Architectures (SOA), where numerous services communicate with each other, make testing such systems challenging. For testing these software systems, the concept of service virtualization is gaining popularity. Service virtualization is an automated technique to mimic the behavior of a given real service. Services can be classified as stateless or stateful. Many services are stateful in nature, yet virtualizing stateful services is harder than virtualizing stateless ones. In this work, we introduce two novel stateful service virtualization approaches. We employ classification-based and sequence-to-sequence-based machine learning algorithms in developing our solutions. Classification is a supervised learning method where the task is assigning given inputs to corresponding classes. A sequence-to-sequence model is a deep neural network architecture where both the input and the output are sequences. We demonstrate the validity of our approaches on three datasets. Our evaluation shows that we obtain 75% to 81% accuracy on the subject datasets with the classification-based method. Our deep neural network-based solution achieves even better accuracy, ranging from 89% to 97% on the subject datasets. Our evaluation of the training times of these techniques shows that the classification-based technique significantly outperforms the other methods.
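As a rough illustration of the classification-based approach (a sketch only, not the paper's implementation; scikit-learn, the history-window size, and the toy trace below are all assumptions), a virtual service can be trained to map an incoming request, plus a short window of session history approximating the hidden service state, onto a recorded response class:

# Minimal sketch of classification-based stateful service virtualization.
# Assumption: recorded traffic is available as (request, response) pairs and
# a sliding window of recent requests approximates the hidden service state.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

HISTORY = 2  # how many past requests stand in for the service state

def to_samples(trace):
    """Turn one recorded session into (history + request) -> response samples."""
    X, y = [], []
    for i, (req, resp) in enumerate(trace):
        past = " ".join(r for r, _ in trace[max(0, i - HISTORY):i])
        X.append(past + " || " + req)
        y.append(resp)
    return X, y

# Hypothetical recorded session of a stateful account service.
trace = [("LOGIN alice", "OK"), ("GET balance", "100"),
         ("DEPOSIT 50", "OK"), ("GET balance", "150")]
X, y = to_samples(trace)

virtual_service = make_pipeline(CountVectorizer(), DecisionTreeClassifier())
virtual_service.fit(X, y)  # the trained model now stands in for the real service
print(virtual_service.predict(["LOGIN alice GET balance || DEPOSIT 50"]))

Note that "GET balance" appears twice in the trace with different responses; it is the history window that lets the classifier tell the two states apart.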
Importance-driven deep learning system testing Gerasimou, Simos; Eniser, Hasan Ferit; Sen, Alper ...
2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE),
06/2020
Conference Proceeding
Open Access
Deep Learning (DL) systems are key enablers for engineering intelligent applications due to their ability to solve complex tasks such as image recognition and machine translation. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. Recent research in this direction focuses on adapting testing criteria from traditional software engineering as a means of increasing confidence in their correct behaviour. However, these criteria are inadequate in capturing the intrinsic properties exhibited by these systems. We bridge this gap by introducing DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC makes it possible to establish a layer-wise functional understanding of the importance of DL system components and to use this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems, across multiple DL datasets and with state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepImportance and its ability to support the engineering of more robust DL systems.
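To make the idea behind an importance-driven adequacy criterion concrete, the sketch below (illustrative only, not the authors' implementation: random arrays stand in for real layer activations, and mean absolute activation stands in for the relevance-based importance scores used in the paper) clusters the activation values of the most important neurons and measures which cluster combinations a test set reaches:

# Rough sketch of an importance-driven coverage measure.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_acts = rng.normal(size=(500, 32))  # placeholder activations (training set)
test_acts = rng.normal(size=(50, 32))    # placeholder activations (test set)

# 1) Importance: mean absolute activation as a stand-in for relevance scores.
importance = np.abs(train_acts).mean(axis=0)
top = np.argsort(importance)[-3:]        # the three most important neurons

# 2) Cluster each important neuron's training-time activation values.
kms = [KMeans(n_clusters=2, n_init=10).fit(train_acts[:, [n]]) for n in top]

# 3) Coverage: fraction of cluster combinations exercised by the test set.
labels = [km.predict(test_acts[:, [n]]) for km, n in zip(kms, top)]
combos = set(zip(*labels))
print(f"covered {len(combos)} of {2 ** len(top)} importance-cluster combinations")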
Today's enterprise software systems are much more complicated than in the past. The increasing number of dependent applications, heterogeneous technologies, and wide usage of Service-Oriented Architectures (SOA), where numerous services communicate with each other, make testing such systems challenging. For testing these software systems, the concept of service virtualization is gaining popularity. Service virtualization is an automated technique to mimic the behavior of a given real service. Services can be classified as stateless or stateful. Many services are stateful in nature. Although there are works in the literature on virtualization of stateless services, no such solution exists for stateful services. To the best of our knowledge, this is the first work on stateful service virtualization. We employ classification-based and sequence-to-sequence-based machine learning algorithms in developing our solutions. We demonstrate the validity of our approach on two datasets collected from real-life services and obtain promising results.
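The sequence-to-sequence framing can be sketched as a small encoder-decoder model (a minimal PyTorch skeleton under an assumed tokenization; the vocabulary size, hidden size, and random token ids are illustrative, not the paper's architecture):

# Minimal sketch of sequence-to-sequence service virtualization: encode the
# incoming request (plus any session history) and decode the response.
import torch
import torch.nn as nn

VOCAB, HID = 32, 64  # toy vocabulary and hidden sizes

class Seq2SeqVirtualService(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.encoder = nn.LSTM(HID, HID, batch_first=True)
        self.decoder = nn.LSTM(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, request_tokens, response_tokens):
        _, state = self.encoder(self.emb(request_tokens))        # summarize request
        dec, _ = self.decoder(self.emb(response_tokens), state)  # teacher forcing
        return self.out(dec)  # logits over the response vocabulary

model = Seq2SeqVirtualService()
request = torch.randint(0, VOCAB, (1, 10))   # one tokenized request
response = torch.randint(0, VOCAB, (1, 5))   # response prefix for training
print(model(request, response).shape)        # torch.Size([1, 5, 32])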
Importance-driven deep learning system testing Gerasimou, Simos; Eniser, Hasan Ferit; Sen, Alper ...
2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion),
06/2020
Conference Proceeding
Open Access
Deep Learning (DL) systems are key enablers for engineering intelligent applications. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. We introduce DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC makes it possible to establish a layer-wise functional understanding of the importance of DL system components and to use this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems and across multiple DL datasets demonstrates the usefulness and effectiveness of DeepImportance.
Large language models are becoming increasingly practical for translating code across programming languages, a process known as "transpiling". Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations.
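Both ideas can be illustrated together (a sketch only: translate is a hypothetical stand-in for querying a real translation model, and the function-name property is an invented example of a user-provided specification):

# Sketch of property-based checking plus property-guided search over a code
# translation model. `translate` must be wired to an actual model to run.
def translate(source: str, temperature: float) -> str:
    """Hypothetical model query; returns one candidate translation."""
    raise NotImplementedError("connect to a real code translation model")

def preserves_function_names(source: str, target: str) -> bool:
    """Example syntactic property: every function name in the source
    must reappear somewhere in the translation."""
    names = {line.split("(")[0].split()[-1]
             for line in source.splitlines() if line.lstrip().startswith("def ")}
    return all(name in target for name in names)

def property_guided_search(source, prop, temps=(0.0, 0.2, 0.5, 0.8, 1.0)):
    """Re-query the model with slightly different parameters until a
    property-satisfying translation is found."""
    for t in temps:
        candidate = translate(source, temperature=t)
        if prop(source, candidate):
            return candidate
    return None  # every candidate violated the property: report a violation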
Block-based visual programming environments play an increasingly important role in introducing computing concepts to K-12 students. In recent years, they have also gained popularity in neuro-symbolic AI, serving as a benchmark to evaluate general problem-solving and logical reasoning skills. The open-ended and conceptual nature of these visual programming tasks makes them challenging, both for state-of-the-art AI agents and for novice programmers. A natural approach to providing assistance for problem-solving is breaking down a complex task into a progression of simpler subtasks; however, this is not trivial, given that solution codes are typically nested and have non-linear execution behavior. In this paper, we formalize the problem of synthesizing such a progression for a given reference block-based visual programming task. We propose a novel synthesis algorithm that generates a progression of subtasks that are high-quality and well-spaced in terms of their complexity, such that solving the progression leads to solving the reference task. We show the utility of our synthesis algorithm in improving the efficacy of AI agents (in this case, neural program synthesizers) for solving tasks in the Karel programming environment. We then conduct a user study to demonstrate that our synthesized progression of subtasks can assist a novice programmer in solving tasks in the Hour of Code: Maze Challenge by Code.org.
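A toy sketch of the "well-spaced" requirement (heavily simplified and entirely illustrative: real candidate subtasks would come from the synthesis algorithm itself, and the complexity measure here is an invented stand-in):

# Toy sketch: greedily pick subtasks whose complexities are evenly spaced
# between trivial and the reference task's complexity.
def complexity(code: str) -> int:
    """Invented stand-in measure: token count plus maximum nesting depth."""
    depth = max((len(l) - len(l.lstrip())) // 4 for l in code.splitlines())
    return len(code.split()) + depth

def progression(candidates, reference, steps=3):
    target = complexity(reference)
    chosen = []
    for k in range(1, steps + 1):
        goal = target * k / (steps + 1)  # evenly spaced difficulty levels
        chosen.append(min(candidates, key=lambda c: abs(complexity(c) - goal)))
    return chosen + [reference]  # solving these in order builds up to the task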
Large language models (LLMs) show promise in code translation - the task of translating code written in one programming language to another - due to their ability to write code in most programming languages. However, LLMs' effectiveness on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open-source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O-equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and translated programs are not I/O-equivalent, we apply a set of automated feedback strategies, including feeding counterexamples back to the LLM. Our results show that the most successful LLM can translate 47% of our benchmarks; the results also provide insights into next steps for improvement.
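The core differential-fuzzing check can be sketched as follows (a simplification, not FLOURINE itself; ./original and ./translated are hypothetical compiled binaries, and random integers on stdin are an invented fuzzing strategy):

# Sketch of checking I/O equivalence by differential fuzzing: run both
# programs on the same random inputs and compare their outputs.
import random
import subprocess

def run(binary: str, stdin: str) -> str:
    return subprocess.run([binary], input=stdin, text=True,
                          capture_output=True, timeout=5).stdout

def io_equivalent(trials: int = 1000) -> bool:
    for _ in range(trials):
        fuzz = str(random.randint(-10**6, 10**6))  # one random test input
        if run("./original", fuzz) != run("./translated", fuzz):
            print("counterexample input:", fuzz)  # could be fed back to the LLM
            return False
    return True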
In recent years, neural networks have become the default choice for image classification and many other learning tasks, even though they are vulnerable to so-called adversarial attacks. To increase their robustness against these attacks, there have emerged numerous detection mechanisms that aim to automatically determine if an input is adversarial. However, state-of-the-art detection mechanisms either rely on being tuned for each type of attack, or they do not generalize across different attack types. To alleviate these issues, we propose a novel technique for adversarial-image detection, RAID, that trains a secondary classifier to identify differences in neuron activation values between benign and adversarial inputs. Our technique is both more reliable and more effective than the state of the art when evaluated against six popular attacks. Moreover, a straightforward extension of RAID increases its robustness against detection-aware adversaries without affecting its effectiveness.
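The detector idea can be sketched in a few lines (illustrative only: the random arrays stand in for hidden-layer activations that would come from instrumenting a real model on benign and attacked inputs):

# Sketch of a RAID-style detector: a secondary classifier trained on neuron
# activation values to separate benign from adversarial inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
benign_acts = rng.normal(0.0, 1.0, size=(200, 64))  # activations, clean inputs
adv_acts = rng.normal(0.5, 1.2, size=(200, 64))     # activations, attacked inputs

X = np.vstack([benign_acts, adv_acts])
y = np.array([0] * 200 + [1] * 200)                 # 1 = adversarial

detector = LogisticRegression(max_iter=1000).fit(X, y)

def is_adversarial(activations: np.ndarray) -> bool:
    """Flag an input as adversarial from its activation vector alone."""
    return bool(detector.predict(activations.reshape(1, -1))[0])

print(is_adversarial(rng.normal(0.5, 1.2, size=64)))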
Testing Deep Neural Network (DNN) models has become more important than ever with the increasing usage of DNN models in safety-critical domains such as autonomous cars. The traditional approach to testing DNNs is to create a test set as a random subset of the dataset for the problem of interest. This kind of approach is not enough to cover most real-world scenarios, since such test sets do not include corner cases, and corner-case inputs are generally the ones that trigger erroneous behavior. Recent works on adversarial input generation, data augmentation, and coverage-guided fuzzing (CGF) have provided new ways to extend traditional test sets. Among those, CGF aims to produce new test inputs by fuzzing existing ones to achieve high coverage on a test adequacy criterion (i.e., a coverage criterion). Given that the subject test adequacy criterion is a well-established one, CGF can potentially find error-inducing inputs with different underlying causes. In this paper, we propose a novel CGF solution for structural testing of DNNs. The proposed fuzzer employs Monte Carlo Tree Search to drive the coverage-guided search in pursuit of high coverage. Our evaluation shows that the inputs generated by our method achieve higher coverage than the inputs produced by previously introduced coverage-guided fuzzing techniques.
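A heavily simplified version of the search loop (a flat bandit-style selection over mutation operators rather than the paper's full Monte Carlo Tree Search; the mutations and the bucket-counting coverage measure are invented placeholders for real DNN coverage):

# Simplified coverage-guided fuzzing loop: pick mutations by UCB1 and
# reward those whose outputs increase coverage.
import math
import random

MUTATIONS = {
    "noise": lambda x: [v + random.gauss(0, 0.1) for v in x],
    "scale": lambda x: [v * random.uniform(0.9, 1.1) for v in x],
    "shift": lambda x: [v + 0.05 for v in x],
}

def coverage(inputs) -> int:
    """Placeholder adequacy criterion: number of distinct 0.1-wide value
    buckets touched (a real fuzzer would measure DNN coverage here)."""
    return len({round(v, 1) for x in inputs for v in x})

def fuzz(seed, rounds=200):
    corpus = [seed]
    stats = {m: [0.0, 1e-9] for m in MUTATIONS}  # [total reward, visit count]
    for t in range(1, rounds + 1):
        # UCB1: balance exploiting good mutations with exploring rare ones.
        pick = max(stats, key=lambda m: stats[m][0] / stats[m][1]
                   + math.sqrt(2 * math.log(t) / stats[m][1]))
        before = coverage(corpus)
        corpus.append(MUTATIONS[pick](random.choice(corpus)))
        stats[pick][0] += coverage(corpus) - before  # reward = coverage gain
        stats[pick][1] += 1
    return corpus

print(coverage(fuzz([0.0] * 8)))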