Systematic Literature Reviews (SLRs) have established themselves as a method in the field of software engineering. The aim of an SLR is to systematically analyze existing literature in order to answer a research question. In this paper, we present a tool to support an SLR process. The main focus of the SLR tool (https://www.slr-tool.com/) is to create and manage an SLR project, to import search results from search engines, and to manage search results by including or excluding each paper. A demo video of our SLR tool is available at https://youtu.be/Jan8JbwiE4k.
Empirical validation of software metrics suites to predict fault proneness in object-oriented (OO) components is essential to ensure their practical use in industrial settings. In this paper, we empirically validate three OO metrics suites for their ability to predict software quality in terms of fault-proneness: the Chidamber and Kemerer (CK) metrics, Abreu's Metrics for Object-Oriented Design (MOOD), and Bansiya and Davis' Quality Metrics for Object-Oriented Design (QMOOD). Some CK class metrics have previously been shown to be good predictors of initial OO software quality. However, the other two suites have not been heavily validated except by their original proposers. Here, we explore the ability of these three metrics suites to predict fault-prone classes using defect data for six versions of Rhino, an open-source implementation of JavaScript written in Java. We conclude that the CK and QMOOD suites contain similar components and produce statistical models that are effective in detecting error-prone classes. We also conclude that the class components in the MOOD metrics suite are not good class fault-proneness predictors. Analyzing multivariate binary logistic regression models across six Rhino versions indicates these models may be useful in assessing quality in OO classes produced using modern highly iterative or agile software development processes.
Modern-day software development and use is a product of decades of advancement and evolution. Over time, as new technologies and concepts emerged, so did new terminology to describe and discuss them. Most terminology used in computing is harmless; however, some terms are rooted in historically discriminatory, and potentially harmful, language. While the landscape of individuals who develop technology has diversified over the years, the terminology has become a normalized part of modern software development and computing jargon. Despite organizations such as the ACM raising awareness of the potential harm certain terms can do and companies like GitHub working to change the systemic use of harmful terms in computing, it is still not clear what the landscape of harmful terminology in computing really is and how we can support the widespread detection and correction of harmful terminology in computing artifacts. To this end, we conducted a review of existing work and efforts at curating, detecting, and removing harmful terminology in computing. Combining and building on these prior efforts, we produce an extensible database of what we define as harmful terminology in computing and describe an open source proof-of-concept tool for detecting and replacing harmful computing-related terminology.
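The detect-and-replace idea described above can be sketched as follows. This is a minimal illustration, not the tool or database from the paper: the two-entry `REPLACEMENTS` mapping and the function names `find_terms` and `replace_terms` are assumptions made for this example, and matching is limited to whole words.

```python
import re

# Hypothetical term -> suggested replacement mapping (illustrative only;
# this is NOT the database described in the abstract).
REPLACEMENTS = {
    "whitelist": "allowlist",
    "blacklist": "denylist",
}

# One case-insensitive pattern that matches any listed term as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b",
    re.IGNORECASE,
)


def find_terms(text):
    """Return (line_number, matched_term) pairs for each flagged term."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def replace_terms(text):
    """Replace each flagged term with its suggested alternative."""
    def _sub(match):
        word = match.group(0)
        new = REPLACEMENTS[word.lower()]
        # Keep a leading capital if the original word had one.
        return new.capitalize() if word[0].isupper() else new

    return _PATTERN.sub(_sub, text)
```

Because `\b` treats underscores as word characters, a term embedded in an identifier such as `ip_whitelist` is not matched by this sketch; a real tool would need a tokenization strategy suited to the artifact type.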
Runtime monitoring is a general approach to verifying system properties at runtime by comparing system events against a specification formalizing which event sequences are allowed. We present a runtime monitoring algorithm for a safety fragment of metric first-order temporal logic that overcomes the limitations of prior monitoring algorithms with respect to the expressiveness of their property specification languages. Our approach, based on automatic structures, allows the unrestricted use of negation, universal and existential quantification over infinite domains, and the arbitrary nesting of both past and bounded future operators. Furthermore, we show how to use and optimize our approach for the common case where structures consist of only finite relations, over possibly infinite domains. We also report on case studies from the domain of security and compliance in which we empirically evaluate the presented algorithms. Taken together, our results show that metric first-order temporal logic can serve as an effective specification language for expressing and monitoring a wide variety of practically relevant system properties.
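To make the idea of checking events against a metric temporal specification concrete, the sketch below hard-codes a monitor for a single past-time property: every `access` event by a user must be preceded by a `login` event by the same user at most `window` time units earlier. The property, the event names, and the function `monitor` are assumptions chosen for illustration; the paper's algorithm handles arbitrary formulas of its fragment using automatic structures, which this sketch does not attempt.

```python
def monitor(events, window):
    """Yield (timestamp, user) for each violation of the property:
    every ("access", user) at time t is preceded by a ("login", user)
    at some time t' with t - window <= t' <= t.

    events: iterable of (timestamp, name, user) with non-decreasing timestamps.
    """
    last_login = {}  # user -> timestamp of that user's most recent login
    for ts, name, user in events:
        if name == "login":
            last_login[user] = ts
        elif name == "access":
            seen = last_login.get(user)
            # Violation if the user never logged in, or logged in too long ago.
            if seen is None or ts - seen > window:
                yield (ts, user)
```

The monitor stores only the most recent login per user, which suffices for this one property because an older login can never satisfy the time bound when a newer one does not.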
Promises and async/await have become popular mechanisms for implementing asynchronous computations in JavaScript, but despite their popularity, programmers have difficulty using them. This paper identifies 8 anti-patterns in promise-based JavaScript code that are prevalent across popular JavaScript repositories. We present a light-weight static analysis for automatically detecting these anti-patterns. This analysis is embedded in an interactive visualization tool that additionally relies on dynamic analysis to visualize promise lifetimes and instances of anti-patterns executed at run time. By enabling the user to navigate between promises in the visualization and the source code fragments that they originate from, problems and optimization opportunities can be identified. We implemented this approach in a tool called DrAsync and found 2.6K static instances of anti-patterns in 20 popular JavaScript repositories. Upon examination of a subset of these, we found that the majority of problematic code reported by DrAsync could be eliminated through refactoring. Further investigation revealed that, in a few cases, the elimination of anti-patterns reduced the time needed to execute the refactored code fragments. Moreover, DrAsync's visualization of promise lifetimes and relationships provides additional insight into the execution behavior of asynchronous programs and helped identify further optimization opportunities.
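The abstract does not list the eight anti-patterns, so the sketch below shows only one commonly discussed inefficiency, assumed here for illustration: awaiting independent asynchronous operations one after another instead of starting them together. Python's asyncio stands in for JavaScript promises (`asyncio.gather` plays the role that `Promise.all` would play in JavaScript), and `fetch` is a hypothetical placeholder operation.

```python
import asyncio
import time


async def fetch(item, delay=0.05):
    """Hypothetical stand-in for an independent asynchronous operation."""
    await asyncio.sleep(delay)
    return item * 2


async def sequential(items):
    # Each await completes before the next operation is even started.
    results = []
    for item in items:
        results.append(await fetch(item))
    return results


async def concurrent(items):
    # Start all operations first, then wait for them together.
    return await asyncio.gather(*(fetch(item) for item in items))


def timed(coro_fn, items):
    """Run coro_fn(items) to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(items))
    return result, time.perf_counter() - start
```

Calling `timed(sequential, [1, 2, 3, 4])` and `timed(concurrent, [1, 2, 3, 4])` returns the same results, but the sequential version waits for the delays one after another while the concurrent version overlaps them. This refactoring is only valid when the operations really are independent of each other.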
Automatic Self-Validation for Code Coverage Profilers Yang, Yibiao; Jiang, Yanyan; Zuo, Zhiqiang ...
2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE),
2019-Nov.
Conference Proceeding
Code coverage, as the primitive dynamic program behavior information, is widely adopted to facilitate a rich spectrum of software engineering tasks, such as testing, fuzzing, debugging, fault detection, reverse engineering, and program understanding. Given these widespread applications, it is crucial to ensure the reliability of code coverage profilers. Unfortunately, due to the lack of research attention and the existence of the test oracle problem, coverage profilers are far from sufficiently tested. Bugs are still regularly seen in widely deployed profilers such as gcov and llvm-cov, which ship with gcc and llvm, respectively. This paper proposes Cod, an automated self-validator for effectively uncovering bugs in coverage profilers. Starting from a test program (either from a compiler's test suite or generated randomly), Cod detects profiler bugs with zero false positives using a metamorphic relation in which the coverage statistics of that program and a mutated variant are bridged. We evaluated Cod on two of the most well-known code coverage profilers, namely gcov and llvm-cov. Within a four-month testing period, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were found, among which 23 were confirmed by the developers.
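The abstract does not spell out the metamorphic relation, so the sketch below assumes one plausible form: the variant is obtained by deleting the lines the profiler reported as never executed, and every retained line should then be reported with the same execution count in both programs. The function names and the dictionary-based coverage format are hypothetical and are not taken from Cod.

```python
def prune_unexecuted(coverage):
    """Return the line numbers a variant would keep: those reported as executed.

    coverage: dict mapping line number -> reported execution count.
    """
    return {line for line, count in coverage.items() if count > 0}


def inconsistencies(original, variant):
    """Compare two coverage reports keyed by ORIGINAL line numbers.

    Returns sorted (line, original_count, variant_count) triples for every
    retained line whose count differs, i.e. where the assumed relation breaks.
    A line missing from the variant report is treated as having count 0.
    """
    kept = prune_unexecuted(original)
    return sorted(
        (line, original[line], variant.get(line, 0))
        for line in kept
        if variant.get(line, 0) != original[line]
    )
```

For example, with `original = {1: 1, 2: 5, 3: 0, 4: 1}` and `variant = {1: 1, 2: 4, 4: 1}`, line 3 is pruned and line 2 is flagged because its count changed from 5 to 4. Under the stated assumption, a flagged line points at the profiler rather than the test program, since removing unexecuted code should not change how often the remaining code runs.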
Code smells were defined as symptoms of poor design choices applied by programmers during the development of a software project [2]. They might hinder the comprehensibility and maintainability of software systems [5]. Similarly to some previous work [3, 4, 6, 7], in this paper we investigate the relationship between the presence of code smells and software change- and fault-proneness. Specifically, while previous work shows a significant correlation between smells and code change/fault-proneness, the empirical evidence provided so far is still limited because of:
Limited size of previous studies: The study by Khomh et al. [4] was conducted on four open source systems, while the study by D'Ambros et al. [1] was performed on seven systems. Furthermore, the studies by Li and Shatnawi [6], Olbrich et al. [7], and Gatrell and Counsell [3] were conducted considering the change history of only one software project.
Detected smells vs. manually validated smells: Previous work studying the impact of code smells on change- and fault-proneness relied on data obtained from automatic smell detectors, whose imprecision might have affected the results.
Lack of analysis of the magnitude of the observed phenomenon: Previous work indicated that some smells can be more harmful than others, but the analysis did not take into account the magnitude of the observed phenomenon. For example, even if a specific smell type may be considered harmful when analyzing its impact on maintainability, this may not be relevant if the number of occurrences of that smell type in software projects is limited.
Lack of analysis of the magnitude of the effect: Previous work indicated that classes affected by code smells are more likely to exhibit defects (or to undergo changes) than other classes. However, no study has observed the magnitude of such changes and defects, i.e., no study addressed the question: How many defects would a class affected by a code smell exhibit on average, compared with a class affected by a different kind of smell or not affected by any smell at all?
Lack of within-artifact analysis: A class might be intrinsically change- and/or fault-prone, e.g., because it plays a core role in the system. Hence, the class may be intrinsically "smelly". In contrast, there may be classes that become smelly during their lifetime because of maintenance activities, or classes from which the smell was removed, possibly because of refactoring activities. For such classes, it is of paramount importance to analyze the change- and fault-proneness of the class during its evolution, in order to better relate the cause (presence of smell) with the possible effect (change- or fault-proneness).
Lack of a temporal relation analysis: While previous work correlated the presence of code smells with high fault- and change-proneness, one may wonder whether the artifact was smelly when the fault was introduced, or whether the fault was introduced before the class became smelly.
To cope with the aforementioned issues, this paper aims at corroborating previous empirical research on the impact of code smells by analyzing their diffuseness and effect on change- and fault-proneness on a total of 395 releases of 30 open source systems, considering 13 different code smell types manually identified. Our results showed that classes affected by code smells tend to be significantly more change- and fault-prone than classes not affected by design problems; however, their removal might not always be beneficial for improving source code maintainability.
The record-and-replay approach for software testing is important and valuable for developers in designing mobile applications. However, the existing solutions for recording and replaying Android applications are far from perfect. When considering the richness of mobile phones' input capabilities including touch screen, sensors, GPS, etc., existing approaches either fall short of covering all these different input types, or require elevated privileges that are not easily attained and can be dangerous. In this paper, we present a novel system, called MobiPlay, which aims to improve record-and-replay testing. Through collaboration between a mobile phone and a server, we are the first to capture all possible inputs, and we do so at the application layer rather than at the Android framework layer or the Linux kernel layer, which would be infeasible without a server. MobiPlay runs the to-be-tested application on the server under exactly the same environment as the mobile phone, and displays the GUI of the application in real time on a thin client application installed on the mobile phone. From the perspective of the mobile phone user, the application appears to be local. We have implemented our system and evaluated it with tens of popular mobile applications, showing that MobiPlay is efficient, flexible, and comprehensive. It can record all input data, including all sensor data, all touchscreen gestures, and GPS. It is able to record and replay on both the mobile phone and the server. Furthermore, it is suitable for both white-box and black-box testing.
Natural Attack for Pre-trained Models of Code Yang, Zhou; Shi, Jieke; He, Junda ...
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
Conference Proceeding
Open access
Pre-trained models of code have achieved success in many important software engineering tasks. However, these powerful models are vulnerable to adversarial attacks that slightly perturb model inputs to make a victim model produce wrong outputs. Current work mainly attacks models of code with examples that preserve operational program semantics but ignores a fundamental requirement for adversarial example generation: perturbations should be natural to human judges, which we refer to as the naturalness requirement. In this paper, we propose ALERT (Naturalness Aware Attack), a black-box attack that adversarially transforms inputs to make victim models produce wrong outputs. Unlike prior work, this paper considers the natural semantics of generated examples while preserving the operational semantics of the original inputs. Our user study demonstrates that human developers consistently consider adversarial examples generated by ALERT to be more natural than those generated by the state-of-the-art work by Zhang et al., which ignores the naturalness requirement. On attacking CodeBERT, our approach achieves attack success rates of 53.62%, 27.79%, and 35.78% across three downstream tasks: vulnerability prediction, clone detection, and code authorship attribution. On GraphCodeBERT, our approach achieves average success rates of 76.95%, 7.96%, and 61.47% on the three tasks. On average, these results outperform the baseline by 14.07% and 18.56% on the two pre-trained models, respectively. Finally, we investigated the value of the generated adversarial examples for hardening victim models through an adversarial fine-tuning procedure and demonstrated that the accuracy of CodeBERT and GraphCodeBERT against ALERT-generated adversarial examples increased by 87.59% and 92.32%, respectively.
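The abstract does not say which transformation ALERT applies, so the sketch below assumes one simple perturbation that preserves operational semantics: consistently renaming a variable. It shows only the semantics-preserving half of the requirement; choosing substitute names that look natural to human readers, and querying a victim model, are not modeled. The class `RenameVariable` and the helpers `rename` and `run` are hypothetical names for this example, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (variable uses and parameters)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        # Variable reads and writes.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        # Function parameters.
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename(source, old, new):
    """Return source with identifier `old` renamed to `new`.

    The caller must make sure `new` does not clash with an existing name.
    """
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def run(source, func_name, *args):
    """Execute source in a fresh namespace and call one of its functions."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)
```

Renaming `acc` to `result` in a small summation function, for instance, changes the text a model sees while `run` returns the same value for the original and the renamed source on the same inputs.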
Automated program repair (APR) is one of the recent advances in automated software engineering that aims to reduce the burden of debugging by suggesting high-quality patches that either directly fix the bugs or help programmers in the course of manual debugging. We believe scalability, applicability, and accurate patch validation are the main design objectives for a practical APR technique. In this paper, we present PraPR, our implementation of a practical APR technique that operates at the level of JVM bytecode. We discuss design decisions made in the development of PraPR, and argue that the technique is a viable baseline toward attaining the aforementioned objectives. Our experimental results show that: (1) PraPR can fix more bugs than state-of-the-art APR techniques and can be over 10X faster, (2) state-of-the-art APR techniques suffer from dataset overfitting, while the simplistic template-based PraPR performs more consistently on different datasets, and (3) PraPR can fix bugs for other JVM languages, such as Kotlin. PraPR is publicly available at https://github.com/prapr/prapr.