Green Fuzzer Benchmarking Ounjai, Jiradet; Wüstholz, Valentin; Christakis, Maria
Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis,
07/2023
Conference Proceeding
Open access
Over the last decade, fuzzing has been increasingly gaining traction due to its effectiveness in finding bugs. Nevertheless, fuzzer evaluations have been challenging during this time, mainly due to a lack of standardized benchmarking. Aiming to alleviate this issue, in 2020, Google released FuzzBench, an open-source benchmarking platform that is widely used for accurate fuzzer benchmarking.
However, a typical FuzzBench experiment takes CPU years to run. If we additionally consider that fuzzers under active development evaluate every change empirically, benchmarking becomes prohibitive in terms of both computational resources and time. In this paper, we propose GreenBench, a greener benchmarking platform that, compared to FuzzBench, significantly speeds up fuzzer evaluations while maintaining very high accuracy.
In contrast to FuzzBench, GreenBench drastically increases the number of benchmarks while drastically decreasing the duration of fuzzing campaigns. As a result, the fuzzer rankings generated by GreenBench are almost as accurate as those by FuzzBench (with very high correlation), but GreenBench is 18 to 61 times faster. We discuss the implications of these findings for the fuzzing community.
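Agreement between the rankings produced by a long reference experiment and a shortened one is naturally measured with rank correlation. The sketch below computes Spearman's rho for two hypothetical five-fuzzer rankings; the rankings and the helper function are illustrative, not data or code from the paper.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for two rankings without ties."""
    n = len(xs)
    d2 = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n * n - 1))

full = [1, 2, 3, 4, 5]   # ranking from a long, FuzzBench-style experiment
short = [1, 3, 2, 4, 5]  # ranking from a short, GreenBench-style experiment
print(spearman_rho(full, short))  # 0.9: high agreement despite one swapped pair
```

A correlation near 1 indicates that the cheap experiment would lead to nearly the same conclusions as the expensive one, which is the property GreenBench aims for.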
Metamorphic testing of Datalog engines Mansur, Muhammad Numair; Christakis, Maria; Wüstholz, Valentin
Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering,
08/2021
Conference Proceeding
Open access
Datalog is a popular query language with applications in several domains. Like any complex piece of software, Datalog engines may contain bugs. The most critical ones manifest as incorrect results when evaluating queries; we refer to these as query bugs. Given the wide applicability of the language, query bugs may have detrimental consequences, for instance, by compromising the soundness of a program analysis that is implemented and formalized in Datalog. In this paper, we present the first metamorphic-testing approach for detecting query bugs in Datalog engines. We ran our tool on three mature engines and found 13 previously unknown query bugs, some of which are deep and revealed critical semantic issues.
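The kind of metamorphic relation exploited here can be illustrated in miniature: a query and a variant with a redundant (logically implied) atom must return identical answers on any correct engine. The toy below evaluates both variants by hand over a small fact set; the facts, the two-hop query, and the transformation are invented for illustration and do not come from the paper's tool.

```python
# A tiny "edge" relation, standing in for a Datalog fact database.
edge = {(1, 2), (2, 3), (1, 3)}

def two_hop(edges):
    """Original query: pairs (x, z) connected by a path of exactly two edges."""
    return {(x, z) for (x, y1) in edges for (y2, z) in edges if y1 == y2}

def two_hop_redundant(edges):
    """Metamorphic variant: the same query with a duplicated, redundant atom.
    A correct engine must return exactly the same answers."""
    return {(x, z) for (x, y1) in edges for (y2, z) in edges
            if y1 == y2 and (x, y1) in edges}  # redundant re-check of an atom

# Oracle: any disagreement between the two results signals a query bug.
assert two_hop(edge) == two_hop_redundant(edge)
print(two_hop(edge))  # {(1, 3)}
```

Against a real engine, the same idea applies at the source level: rewrite the query text, run both versions, and diff the result sets.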
In the last decades, numerous program analyzers have been developed both in academia and industry. Despite their abundance, however, there is currently no systematic way of comparing the effectiveness of different analyzers on arbitrary code. In this paper, we present the first automated technique for differentially testing the soundness and precision of program analyzers. We used our technique to compare six mature, state-of-the-art analyzers on tens of thousands of automatically generated benchmarks. Our technique detected soundness and precision issues in most analyzers, and we evaluated the implications of these issues for both designers and users of program analyzers.
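The core differential idea, comparing analyzer verdicts against ground truth obtained by actually executing the generated programs, can be sketched as follows. The two "analyzers" here are deliberately flawed stand-in predicates over a toy assertion, not real tools; everything in this snippet is hypothetical.

```python
import random

random.seed(7)

def ground_truth(k, inputs):
    """Dynamic oracle: does assert(x != k) actually fail on these inputs?"""
    return any(x == k for x in inputs)

def analyzer_a(k, inputs):
    """Sound but imprecise stand-in: warns on every program."""
    return True

def analyzer_b(k, inputs):
    """Unsound stand-in: ignores negative inputs entirely."""
    return any(x == k for x in inputs if x >= 0)

unsound_b = imprecise_a = 0
for _ in range(1000):                      # "generate" 1000 tiny programs
    k = random.randrange(-5, 5)
    inputs = [random.randrange(-5, 5) for _ in range(3)]
    truth = ground_truth(k, inputs)
    if truth and not analyzer_b(k, inputs):
        unsound_b += 1                     # b misses a real failure: soundness issue
    if analyzer_a(k, inputs) and not truth:
        imprecise_a += 1                   # a warns spuriously: precision issue
print(unsound_b > 0, imprecise_a > 0)
```

The same harness shape scales up: generate programs, collect each analyzer's warnings, and classify disagreements as soundness or precision issues relative to the executable ground truth.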
Greybox fuzzing is a lightweight testing approach that effectively detects bugs and security vulnerabilities. However, greybox fuzzers randomly mutate program inputs to exercise new paths; this makes it challenging to cover code that is guarded by complex checks. In this paper, we present a technique that extends greybox fuzzing with a method for learning new inputs based on already explored program executions. These inputs can be learned such that they guide exploration toward specific executions, for instance, ones that increase path coverage or reveal vulnerabilities. We have evaluated our technique and compared it to traditional greybox fuzzing on 26 real-world benchmarks. In comparison, our technique significantly increases path coverage (by up to 3X) and detects more bugs (up to 38% more), often orders of magnitude faster.
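A much-simplified sketch of the idea: alongside random byte mutation, derive new inputs from already-explored executions, here by splicing two corpus inputs that covered distinct paths. Splicing is a crude stand-in for the paper's learned guidance, and the target program and its path ids are made up for illustration.

```python
import random

def target(data):
    """Toy program under test; returns the set of path ids it exercises."""
    paths = {0}
    if len(data) >= 2 and data[0] < 128:
        paths.add(1)
        if data[1] < 128:
            paths.add(2)
    return paths

def splice(a, b):
    """Derive a new input by combining two previously explored inputs."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

random.seed(0)
corpus = [bytes(random.randrange(256) for _ in range(4))]
covered = set()
for _ in range(2000):
    child = splice(random.choice(corpus), random.choice(corpus))
    i = random.randrange(len(child))                     # plus a random mutation
    child = child[:i] + bytes([random.randrange(256)]) + child[i + 1:]
    new = target(child) - covered
    if new:                       # keep inputs that discover new paths
        covered |= new
        corpus.append(child)
print(sorted(covered))
```

In the actual technique, the derivation step is a learned model over explored executions rather than blind splicing, which is what lets it penetrate complex guards.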
Testing is a promising way to gain trust in a learned action policy π, in particular if π is a neural network. A “bug” in this context constitutes undesirable or fatal policy behavior, e.g., satisfying a failure condition. But how do we distinguish whether such behavior is due to bad policy decisions, or whether it is actually unavoidable under the given circumstances? This requires knowledge about optimal solutions, which defeats the scalability of testing. Related problems occur in software testing when the correct program output is not known.
Metamorphic testing addresses this issue through metamorphic relations, specifying how a given change to the input should affect the output, thus providing an oracle for the correct output. Yet, how do we obtain such metamorphic relations for action policies? Here, we show that the well-explored concept of relaxations in the Artificial Intelligence community can serve this purpose. In particular, if state s′ is a relaxation of state s, i.e., s′ is easier to solve than s, and π fails on the easier s′ but does not fail on the harder s, then we know that π contains a bug manifested on s′.
We contribute the first exploration of this idea in the context of failure testing of neural network policies π learned by reinforcement learning in simulated environments. We design fuzzing strategies for test-case generation as well as metamorphic oracles leveraging simple, manually designed relaxations. In experiments on three single-agent games, our technology is able to effectively identify true bugs, i.e., avoidable failures of π, which has not been possible until now.
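The relaxation-based oracle can be illustrated with a toy simulation in which everything, including the environment, the policy, and the injected bug, is made up: states are (position, fuel) pairs on a line with the goal at 0 and a hazard at 8, and a state with more fuel is strictly easier. The policy below contains a bug that only fires when fuel is plentiful, so it succeeds on a harder state yet fails on its relaxation.

```python
def run(pos, fuel, hazard=8):
    """Simulate the policy; True = goal reached, False = hazard or fuel out."""
    while fuel > 0:
        if pos == 0:
            return True
        if pos == hazard:
            return False
        # Policy decision: normally step toward the goal, but the injected
        # bug wanders toward the hazard whenever fuel exceeds 12.
        step = +1 if fuel > 12 else (-1 if pos > 0 else +1)
        pos, fuel = pos + step, fuel - 1
    return False

hard = run(5, 10)         # harder state: just enough fuel, policy succeeds
easy = run(5, 20)         # relaxed (easier) state: extra fuel, policy fails
print(hard and not easy)  # True: failing on the easier state reveals the bug
```

No optimal solver is needed: succeeding on the harder state already proves the easier state is solvable, so the failure there must be avoidable, i.e., a genuine policy bug.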
Machine-learning models are becoming increasingly prevalent in our lives, for instance assisting in image-classification or decision-making tasks. Consequently, the reliability of these models is of critical importance and has resulted in the development of numerous approaches for validating and verifying their robustness and fairness. However, beyond such specific properties, it is challenging to specify, let alone check, general functional-correctness expectations from models. In this paper, we take inspiration from specifications used in formal methods, expressing functional-correctness properties by reasoning about $k$ different executions, so-called $k$-safety properties. Considering a credit-screening model of a bank, the expected property that "if a person is denied a loan and their income decreases, they should still be denied the loan" is a 2-safety property. Here, we show the wide applicability of $k$-safety properties for machine-learning models and present the first specification language for expressing them. We also operationalize the language in a framework for automatically validating such properties using metamorphic testing. Our experiments show that our framework is effective in identifying property violations, and that detected bugs could be used to train better models.
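The loan-denial 2-safety property above can be checked metamorphically by running the model on an input and on a perturbed copy and comparing the two decisions. The sketch below does exactly that against a made-up stand-in model with a deliberately injected violation; the model, thresholds, and sampling ranges are all illustrative, not from the paper.

```python
import random

def approve(income, debt):
    """Hypothetical credit model with an injected bug for tiny incomes."""
    if income < 1000:        # injected bug: tiny incomes slip through a branch
        return True
    return income >= 50000 and debt < 0.4 * income

def violates_2safety(income, debt, drop):
    """2-safety check: denied at `income` but approved at `income - drop`."""
    return (not approve(income, debt)) and approve(income - drop, debt)

random.seed(1)
violations = []
for _ in range(10000):       # metamorphic testing over random input pairs
    income = random.randrange(0, 100000)
    debt = random.randrange(0, 50000)
    drop = random.randrange(0, income + 1)
    if violates_2safety(income, debt, drop):
        violations.append((income, debt, drop))
print(len(violations) > 0)   # True: the check exposes the injected bug
```

A specification language as in the paper would let the property itself be written declaratively, with the pair generation and comparison derived automatically.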
Recently, there has been growing concern that machine-learning models, which currently assist or even automate decision making, reproduce, and in the worst case reinforce, bias present in the training data. The development of tools and techniques for certifying the fairness of these models, or for describing their biased behavior, is therefore critical. In this paper, we propose a perfectly parallel static analysis for certifying the causal fairness of feed-forward neural networks used for classification of tabular data. When certification succeeds, our approach provides definite guarantees; otherwise, it describes and quantifies the biased behavior. We design the analysis to be sound, in practice also exact, and configurable in terms of scalability and precision, thereby enabling pay-as-you-go certification. We implement our approach in an open-source tool and demonstrate its effectiveness on models trained with popular datasets.
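A deliberately naive, single-box sketch of the certification idea: fix all non-sensitive inputs, give the sensitive feature its full range as an interval, and propagate intervals through a tiny hand-made ReLU network. If the output interval does not cross the decision boundary, the decision provably cannot depend on the sensitive feature for that input region. The weights here are invented, and a real analysis (as in the paper) partitions the input space and is far more precise.

```python
def relu_interval(lo, hi):
    return (max(lo, 0.0), max(hi, 0.0))

def affine_interval(boxes, weights, bias):
    """Sound interval bounds for bias + sum(w_i * x_i) over input boxes."""
    lo = hi = bias
    for (l, h), w in zip(boxes, weights):
        lo += w * l if w >= 0 else w * h
        hi += w * h if w >= 0 else w * l
    return (lo, hi)

def certify_fair(fixed, sensitive_range, W1, b1, W2, b2):
    """True iff the decision is constant over the whole sensitive range."""
    box = fixed + [sensitive_range]
    hidden = [relu_interval(*affine_interval(box, w, b))
              for w, b in zip(W1, b1)]
    lo, hi = affine_interval(hidden, W2, b2)
    return lo > 0 or hi < 0   # output sign, and hence the class, never flips

# Toy network where the sensitive (last) feature has zero weight: fair.
W1 = [[1.0, 0.0], [0.5, 0.0]]
b1 = [0.0, -1.0]
W2, b2 = [1.0, -2.0], 0.5
print(certify_fair([(2.0, 2.0)], (0.0, 1.0), W1, b1, W2, b2))  # True
```

When the output interval does cross zero, this naive version can only say "unknown"; the paper's analysis instead refines the input partition until it can either certify fairness or precisely describe and quantify the biased region.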