Regression testing is an important part of the development process but suffers from the presence of flaky tests. Flaky tests nondeterministically pass or fail when run on the same code, misleading ...developers about the correctness of their changes. A common type of flaky tests are async flaky tests that flakily fail due to timing-related issues such as asynchronous waits that do not return in time or different thread interleavings during execution. Developers commonly try to repair async flaky tests by inserting or increasing some wait time, but such repairs are unreliable.
We propose FlakeSync, a technique for automatically repairing async flaky tests by introducing synchronization for a specific test execution. FlakeSync works by identifying a critical point, representing some key part of code that must be executed early w.r.t. other concurrently executing code, and a barrier point, representing the part of code that should wait until the critical point has been executed. FlakeSync can modify code to check when the critical point is executed and have the barrier point keep waiting until the critical point has been executed, essentially synchronizing these two parts of code for the specific test execution. Our evaluation of FlakeSync on known flaky tests from prior work shows that FlakeSync can automatically repair 83.75% of async flaky tests, and the resulting changes add a median overhead of only 1.00X the original test runtime. We submitted 10 pull requests with our changes to developers, with 3 already accepted and none rejected.
We present SRCIROR (pronounced “sorcerer“), a toolset for performing mutation testing at the levels of C/C++ source code (SRC) and the LLVM compiler intermediate representation (IR). At the SRC ...level, SRCIROR identifies program constructs for mutation by pattern-matching on the Clang AST. At the IR level, SRCIROR directly mutates the LLVM IR instructions through LLVM passes. Our implementation enables SRCIROR to (1) handle any program that Clang can handle, extending to large programs with a minimal overhead, and (2) have a small percentage of invalid mutants that do not compile. SRCIROR enables performing mutation testing using the same classes of mutation operators at both the SRC and IR levels, and it is easily extensible to support more operators. In addition, SRCIROR can collect coverage to generate mutants only for covered code elements. Our tool is publicly available on GitHub (https://github.com/TestingResearchIllinois/srciror). We evaluate SRCIROR on Coreutils subjects. Our evaluation shows interesting differences between SRC and IR, demonstrating the value of SRCIROR in enabling mutation testing research across different levels of code representation.
Developers rely on regression testing in their continuous integration (CI) environment to find changes that introduce regression faults. While regression testing is widely practiced, it can be ...costly. Regression test selection (RTS) reduces the cost of regression testing by not running the tests that are unaffected by the changes. Industry has adopted module-level RTS for their CI environment, while researchers have proposed class-level RTS. In this paper, we compare module-and class-level RTS techniques in a cloud-based CI environment, Travis. We also develop and evaluate a hybrid RTS technique that combines aspects of the module-and class-level RTS techniques. We evaluate all the techniques on real Travis builds. We find that the RTS techniques do save testing time compared to running all tests (RetestAll), but the percentage of time for a full build using RTS (76.0%) is not as low as found in previous work, due to the extra overhead in a cloud-based CI environment. Moreover, we inspect test failures from RetestAll builds, and although we find that RTS techniques can miss to select failed tests, these test failures are almost all flaky test failures. As such, RTS techniques provide additional value in helping developers avoid wasting time debugging failures not related to the recent code changes. Overall, our results show that RTS can be beneficial for the developers in the CI environment, and RTS not only saves time but also avoids misleading developers by flaky test failures.
TSVD4J: Thread-Safety Violation Detection for Java Rahman, Shanto; Li, Chengpeng; Shi, August
2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion),
05/2023
Conference Proceeding
Concurrency bugs are difficult to detect and debug. One class of concurrency bugs are thread-safety violations, where multiple threads access thread-unsafe data structure at the same time, resulting ...in unexpected behavior. Prior work proposed an approach TSVD to detect thread-safety violations. TSVD injects delays at API calls that read/write to specific thread-unsafe data structures, tracking whether multiple threads can overlap in their accesses to the same data structure through the delays, showing potential thread-safety violations. We additionally enhance the TSVD approach to also consider read/write operations to object fields. We implement the TSVD approach in Java in our tool TSVD4J. TSVD4J can be integrated as a Maven plugin that can be included in any Maven-based application. Our evaluation on 12 applications shows that TSVD4J can detect 55 pairs of code locations accessing the same shared data structure across multiple threads, representing potential thread-safety violations. We find that the addition of tracking field accesses contributed the most to detecting these pairs. TSVD4J also detects more such pairs than existing tool RV-Predict. The demo video for TSVD4J is available at https://www.youtube.com/watch?v=-wSMzlj5cMY.
Regression test selection (RTS) reduces regression testing costs by re-running only tests that can change behavior due to code changes. Researchers and large software organizations recently developed ...and adopted several RTS tools to deal with the rapidly growing costs of regression testing. As RTS tools gain adoption, it becomes critical to check that they are correct and efficient. Unfortunately, checking RTS tools currently relies solely on limited tests that RTS tool developers manually write. We present RTSCheck, the first framework for checking RTS tools. RTSCheck feeds evolving programs (i.e., sequences of program revisions) to an RTS tool and checks the output against rules inspired by existing RTS test suites. Violations of these rules are likely due to deviations from expected RTS tool behavior, and indicative of bugs in the tool. RTSCheck uses three components to obtain evolving programs: (1) AutoEP automatically generates evolving programs and corresponding tests, (2) DefectsEP uses buggy and fixed program revisions from bug databases, and (3) EvoEP uses sequences of program revisions from actual open-source projects' histories. We used RTSCheck to check three recently developed RTS tools for Java: Clover, Ekstazi, and STARTS. RTSCheck discovered 27 bugs in these three tools.
Mutation testing is widely used in research (even if not in practice). Mutation testing tools usually target only one programming language and rely on parsing a program to generate mutants, or ...operate not at the source level but on compiled bytecode. Unfortunately, developing a robust mutation testing tool for a new language in this paradigm is a difficult and time-consuming undertaking. Moreover, bytecode/intermediate language mutants are difficult for programmers to read and understand. This paper presents a simple tool, called universalmutator, based on regular-expression-defined transformations of source code. The primary drawback of such an approach is that our tool can generate invalid mutants that do not compile, and sometimes fails to generate mutants that a parser-based tool would have produced. Additionally, it is incompatible with some approaches to improving the efficiency of mutation testing. However, the regexp-based approach provides multiple compensating advantages. First, our tool is easy to adapt to new languages; e.g., we present here the first mutation tool for Apple's Swift programming language. Second, the method makes handling multi-language programs and systems simple, because the same tool can support every language. Finally, our approach makes it easy for users to add custom, project-specific mutations.
Flaky tests are tests that pass or fail nondeterministically on the same version of code. These tests can mislead developers concerning the quality of their code changes during regression testing. A ...common kind of flaky tests are order-dependent tests, whose pass/fail outcomes depend on the test order in which they are run. Such tests have different outcomes because other tests running before them pollute shared state. Prior work has proposed repairing order-dependent tests by searching for existing tests, known as "cleaners", that reset the shared state, allowing the order-dependent test to pass when run after a polluted shared state. The code within a cleaner represents a patch to repair the order-dependent test. However, this technique requires cleaners to already exist in the test suite.
We propose ODRepair, an automated technique to repair order-dependent tests even without existing cleaners. The idea is to first determine the exact polluted shared state that results in the order-dependent test to fail and then generate code that can modify and reset the shared state so that the order-dependent test can pass. We focus on shared state through internal heap memory, in particular shared state reachable from static fields. Once we know which static field leads to the pollution, we search for reset-methods in the codebase that can potentially access and modify state reachable from that static field. We then apply an automatic test-generation tool to generate method-call sequences, targeting these reset-methods. Our evaluation on 327 order-dependent tests from a publicly available dataset shows that ODRepair automatically identifies the polluted static field for 181 tests, and it can generate patches for 141 of these tests. Compared against state-of-the-art iFixFlakies, ODRepair can generate patches for 24 tests that iFixFlakies cannot.
Library developers can provide classes and methods with underdetermined specifications that allow flexibility in future implementations. Library users may write code that relies on a specific ...implementation rather than on the specification, e.g., assuming mistakenly that the order of elements cannot change in the future. Prior work proposed the NonDex approach that detects such wrong assumptions.
We present a novel approach, called DexFix, to repair wrong assumptions on underdetermined specifications in an automated way. We run the NonDex tool on 200 open-source Java projects and detect 275 tests that fail due to wrong assumptions. The majority of failures are from iterating over HashMap/HashSet collections and the getDeclaredFields method. We provide several new repair strategies that can fix these violations in both the test code and the main code. DexFix proposes fixes for 119 tests from the detected 275 tests. We have already reported fixes for 102 tests as GitHub pull requests: 74 have been merged, with only 5 rejected, and the remaining pending.
JAttack: Java JIT Testing Using Template Programs Zang, Zhiqiang; Yu, Fu-Yao; Wiatrek, Nathan ...
2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion),
05/2023
Conference Proceeding
We present JAttack, a framework that enables compiler testing using templates. JAttack allows compiler developers to write a template program that describes a set of concrete programs to be used to ...test compilers. Such a template-based approach leverages developers' intuition on testing compilers, by allowing developers to write a template program in the host programming language (Java), which contains a basic program structure while provides an opportunity to express variants of specific language constructs in holes. Each hole, written in a domain-specific language embedded in the host language, is used to construct an extended abstract syntax tree (eAST), which defines the search space of a language construct, e.g., a set of numbers, expressions, statements, etc. JAttack executes the template program to fill every hole by randomly choosing a number, expression, or statement within the search space defined by the hole, and it generates concrete programs with all holes filled. We used JAttack to test Java just-in-time (JIT) compilers, and we have found seven critical bugs in Oracle JDK JIT compiler. Oracle developers confirmed and fixed all seven bugs, five of which were previously unknown, including two CVEs (Common Vulnerabilities and Exposures). JAttack blends developers' intuition via templates with random testing to detect bugs in compilers. The demo video for JAttack can be found at https://www.youtube.com/watch?v=meCFPxucqk4.
Reflection-aware static regression test selection Shi, August; Hadzi-Tanovic, Milica; Zhang, Lingming ...
Proceedings of ACM on programming languages,
10/2019, Letnik:
3, Številka:
OOPSLA
Journal Article
Recenzirano
Odprti dostop
Regression test selection (RTS) aims to speed up regression testing by rerunning only tests that are affected by code changes. RTS can be performed using static or dynamic analysis techniques. Our ...prior study showed that static and dynamic RTS perform similarly for medium-sized Java projects. However, the results of that prior study also showed that static RTS can be unsafe, missing to select tests that dynamic RTS selects, and that reflection was the only cause of unsafety observed among the evaluated projects.
In this paper, we investigate five techniques—three purely static techniques and two hybrid static-dynamic techniques—that aim to make static RTS safe with respect to reflection. We implement these reflection-aware (RA) techniques by extending the reflection-unaware (RU) class-level static RTS technique in a tool called STARTS. To evaluate these RA techniques, we compare their end-to-end times with RU, and with RetestAll, which reruns all tests after every code change. We also compare safety and precision of the RA techniques with Ekstazi, a state-of-the-art dynamic RTS technique; precision is a measure of unaffected tests selected.
Our evaluation on 1173 versions of 24 open-source Java projects shows negative results. The RA techniques improve the safety of RU but at very high costs. The purely static techniques are safe in our experiments but decrease the precision of RU, with end-to-end time at best 85.8% of RetestAll time, versus 69.1% for RU. One hybrid static-dynamic technique improves the safety of RU but at high cost, with end-to-end time that is 91.2% of RetestAll. The other hybrid static-dynamic technique provides better precision, is safer than RU, and incurs lower end-to-end time—75.8% of RetestAll, but it can still be unsafe in the presence of test-order dependencies. Our study highlights the challenges involved in making static RTS safe with respect to reflection.