We present LeJit, a template-based framework for testing Java just-in-time (JIT) compilers. Like recent template-based frameworks, LeJit executes a template -- a program with holes to be filled -- to ...generate concrete programs given as inputs to Java JIT compilers. LeJit automatically generates template programs from existing Java code by converting expressions to holes, as well as generating necessary glue code (i.e., code that generates instances of non-primitive types) to make generated templates executable. We have successfully used LeJit to test a range of popular Java JIT compilers, revealing five bugs in HotSpot, nine bugs in OpenJ9, and one bug in GraalVM. All of these bugs have been confirmed by Oracle and IBM developers, and 11 of these bugs were previously unknown, including two CVEs (Common Vulnerabilities and Exposures). Our comparison with several existing approaches shows that LeJit is complementary to them and is a powerful technique for ensuring Java JIT compiler correctness.
Regression testing---rerunning tests on each code version to detect newly-broken functionality---is important and widely practiced. But, regression testing is costly due to the large number of tests ...and the high frequency of code changes. Regression test selection (RTS) optimizes regression testing by only rerunning a subset of tests that can be affected by changes. Researchers showed that RTS based on program analysis can save substantial testing time for (medium-sized) open-source projects. Practitioners also showed that RTS based on machine learning (ML) works well on very large code repositories, e.g., in Facebook's monorepository. We combine analysis-based RTS and ML-based RTS by using the latter to choose a subset of tests selected by the former. We first train several novel ML models to learn the impact of code changes on test outcomes using a training dataset that we obtain via mutation analysis. Then, we evaluate the benefits of combining ML models with analysis-based RTS on 10 projects, compared with using each technique alone. Combining ML-based RTS with two analysis-based RTS techniques-Ekstazi and STARTS-selects 25.34% and 21.44% fewer tests, respectively.
We present JAttack, a framework that enables template-based testing for compilers. Using JAttack, a developer writes a template program that describes a set of programs to be generated and given as ...test inputs to a compiler. Such a framework enables developers to incorporate their domain knowledge on testing compilers, giving a basic program structure that allows for exploring complex programs that can trigger sophisticated compiler optimizations. A developer writes a template program in the host language (Java) that contains holes to be filled by JAttack. Each hole, written using a domain-specific language, constructs a node within an extended abstract syntax tree (eAST). An eAST node defines the search space for the hole, i.e., a set of expressions and values. JAttack generates programs by executing templates and filling each hole by randomly choosing expressions and values (available within the search space defined by the hole). Additionally, we introduce several optimizations to reduce JAttack's generation cost. While JAttack could be used to test various compiler features, we demonstrate its capabilities in helping test just-in-time (JIT) Java compilers, whose optimizations occur at runtime after a sufficient number of executions. Using JAttack, we have found six critical bugs that were confirmed by Oracle developers. Four of them were previously unknown, including two unknown CVEs (Common Vulnerabilities and Exposures). JAttack shows the power of combining developers' domain knowledge (via templates) with random testing to detect bugs in JIT compilers.
Mutation testing is widely used in research for evaluating the effectiveness of test suites. There are multiple mutation tools that perform mutation at different levels, including traditional ...mutation testing at the level of source code (SRC) and more recent mutation testing at the level of compiler intermediate representation (IR). This paper presents an extensive comparison of mutation testing at the SRC and IR levels, specifically at the C programming language and the LLVM compiler IR levels. We use a mutation testing tool called SRCIROR that implements conceptually the same mutation operators at both levels. We also employ automated techniques to account for equivalent and duplicated mutants, and to determine minimal and surface mutants. We carry out our study on 15 programs from the Coreutils library. Overall, we find mutation testing to be better at the SRC level: the SRC level produces much fewer mutants and is thus less expensive, but the SRC level still generates a similar number of minimal and surface mutants, and the mutation scores at both levels are very closely correlated. We also perform a case study on the Space program to evaluate which level's mutation score correlates better with the actual fault-detection capability of test suites sampled from Space's test pool. We find the mutation score at both levels to not be very correlated with the actual fault-detection capability of test suites.
Test-suite reduction (TSR) speeds up regression testing by removing redundant tests from the test suite, thus running fewer tests in the future builds. To decide whether to use TSR or not, a ...developer needs some way to predict how well the reduced test suite will detect real faults in the future compared to the original test suite. Prior research evaluated the cost of TSR using only program versions with seeded faults, but such evaluations do not explicitly predict the effectiveness of the reduced test suite in future builds.
We perform the first extensive study of TSR using real test failures in (failed) builds that occurred for real code changes. We analyze 1478 failed builds from 32 GitHub projects that run their tests on Travis. Each failed build can have multiple faults, so we propose a family of mappings from test failures to faults. We use these mappings to compute Failed-Build Detection Loss (FBDL), the percentage of failed builds where the reduced test suite misses to detect all the faults detected by the original test suite. We find that FBDL can be up to 52.2%, which is higher than suggested by traditional TSR metrics. Moreover, traditional TSR metrics are not good predictors of FBDL, making it difficult for developers to decide whether to use reduced test suites.
Probabilistic programming systems and machine learning frameworks like Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, such ...operations are non-deterministic. Hence, it is challenging for developers to write tests for applications that depend on such frameworks, often resulting in flaky tests – tests which fail non-deterministically when run on the same version of code.
In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study the projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize the common causes and fixes for them. This study provides developers with useful insights on dealing with flaky tests in this domain.
Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests due to assertions passing and failing in different runs on the same code. These assertions fail due to differences in the sequence of random numbers in different runs of the same test. FLASH exposes such failures, and our evaluation on 20 projects results in 11 previously-unknown flaky tests that we reported to developers.
Study on advance of sustainable agriculture and rural development Wang Haize; Lu Hao; Shi Junxin(August First Land Reclamation University, Mishan (China) Coll. of Plant Sci.)
Heilongjiang bayi nongken daxue xuebao,
(Jun 2002), Letnik:
14, Številka:
2
Journal Article
Continuous integration (CI) is a well-established technique in commercial and open-source software projects, although not routinely used in scientific publishing. In the scientific software context, ...CI can serve two functions to increase reproducibility of scientific results: providing an established platform for testing the reproducibility of these results, and demonstrating to other scientists how the code and data generate the published results. We explore scientific software testing and CI strategies using two articles published in the areas of applied mathematics and computational physics. We discuss lessons learned from reproducing these articles as well as examine and discuss existing tests. We introduce the notion of a "scientific test" as one that produces computational results from a published article. We then consider full result reproduction within a CI environment. If authors find their work too time or resource intensive to easily adapt to a CI context, we recommend the inclusion of results from reduced versions of their work (e.g., run at lower resolution, with shorter time scales, with smaller data sets) alongside their primary results within their article. While these smaller versions may be less interesting scientifically, they can serve to verify that published code and data are working properly. We demonstrate such reduction tests on the two articles studied.
Evaluating non-adequate test-case reduction Alipour, Mohammad Amin; Shi, August; Gopinath, Rahul ...
2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE),
08/2016
Conference Proceeding
Given two test cases, one larger and one smaller, the smaller test case is preferred for many purposes. A smaller test case usually runs faster, is easier to understand, and is more convenient for ...debugging. However, smaller test cases also tend to cover less code and detect fewer faults than larger test cases. Whereas traditional research focused on reducing test suites while preserving code coverage, recent work has introduced the idea of reducing individual test cases, rather than test suites, while still preserving code coverage. Other recent work has proposed non-adequately reducing test suites by not even preserving all the code coverage. This paper empirically evaluates a new combination of these two ideas, non-adequate reduction of test cases, which allows for a wide range of trade-offs between test case size and fault detection. Our study introduces and evaluates C%-coverage reduction (where a test case is reduced to retain at least C% of its original coverage) and N -mutant reduction (where a test case is reduced to kill at least N of the mutants it originally killed). We evaluate the reduction trade-offs with varying values of C% and N for four real-world C projects: Mozilla’s SpiderMonkey JavaScript engine, the YAFFS2 flash file system, Grep, and Gzip. The results show that it is possible to greatly reduce the size of many test cases while still preserving much of their fault-detection capability.