Refactoring is commonly performed manually, supported by regression testing, which serves as a safety net that provides confidence in the edits performed. However, inadequate test suites may prevent developers from initiating or performing refactorings. We propose RefDistiller, a static analysis approach to support the inspection of manual refactorings. It combines two techniques. First, it applies predefined templates to identify potential missed edits during manual refactoring. Second, it leverages an automated refactoring engine to identify extra edits that might be incorrect. RefDistiller also helps determine the root cause of detected anomalies. In our evaluation, RefDistiller identifies 97 percent of seeded anomalies, of which 24 percent are not detected by generated test suites. Compared to running existing regression test suites, it detects 22 times more anomalies, with 94 percent precision on average. In a study with 15 professional developers, participants inspected problematic refactorings with RefDistiller versus testing only: with RefDistiller, participants located 90 percent of the seeded anomalies, while they located only 13 percent with testing alone. The results show that RefDistiller can help check the correctness of manual refactorings.
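The template-based check for missed edits can be illustrated with a small, hypothetical sketch: after a manual rename, any leftover whole-word reference to the old name is a candidate anomaly. The function and sample snippet below are invented for illustration and are not RefDistiller's actual implementation:

```python
import re

def find_missed_renames(source: str, old_name: str, new_name: str):
    """Toy stand-in for a template-based missed-edit check: after a
    claimed rename of old_name to new_name, any remaining whole-word
    occurrence of old_name is flagged as a potential missed edit."""
    pattern = re.compile(r"\b%s\b" % re.escape(old_name))
    return [(i + 1, line.strip())
            for i, line in enumerate(source.splitlines())
            if pattern.search(line)]

# Hypothetical code after a partial rename of computeTotal -> compute_total.
code = """
def compute_total(xs):
    return sum(xs)

def report(xs):
    print(computeTotal(xs))   # stale call site: the rename was missed here
"""
missed = find_missed_renames(code, "computeTotal", "compute_total")
```

A static check like this flags the stale call site even when no test exercises `report`, which is the gap an inadequate test suite leaves open.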
Interactive Code Review for Systematic Changes Tianyi Zhang; Myoungkyu Song; Joseph Pinedo ...
2015 IEEE/ACM 37th IEEE International Conference on Software Engineering
1
Conference Proceeding
Developers often inspect a diff patch during peer code reviews. Diff patches show low-level program differences per file without summarizing systematic changes: similar, related changes made in multiple contexts. We present Critics, an interactive approach for inspecting systematic changes. When a developer specifies a code change within a diff patch, Critics allows the developer to customize a change template by iteratively generalizing the change content and context. By matching the generalized template against the codebase, it summarizes similar changes and detects potential mistakes. We evaluated Critics using two methods. First, we conducted a user study at Salesforce.com, where professional engineers used Critics to investigate diff patches authored by their own team. After using Critics, all six participants indicated that they would like Critics to be integrated into their current code review environment, suggesting that Critics scales to an industry-scale project and can be readily adopted by professional engineers. Second, we conducted a user study in which twelve participants reviewed diff patches using Critics and Eclipse diff. The results show that participants using Critics answered questions about systematic changes 47.3% more correctly and spent 31.9% less time during code review tasks, compared to the baseline use of Eclipse diff. These results show that Critics can improve developer productivity in inspecting systematic changes during peer code reviews.
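The generalize-then-match idea can be sketched very roughly. The snippet below is invented for illustration (the real tool works on AST-level change templates, not regexes): it abstracts identifiers in a concrete edit into wildcards while preserving keywords, then matches the template against other lines of a codebase:

```python
import re

# Keywords/literals kept concrete when generalizing (an assumed, toy set).
KEYWORDS = {"if", "else", "for", "while", "return", "null"}

def generalize(line: str) -> str:
    """Turn a concrete edited line into a regex template: identifiers
    become wildcards, keywords stay fixed."""
    def repl(m):
        tok = m.group(0)
        return tok if tok in KEYWORDS else r"[A-Za-z_]\w*"
    return re.sub(r"[A-Za-z_]\w*", repl, re.escape(line))

# One concrete change the reviewer marked in the diff patch.
template = re.compile(generalize("if (buf == null) return;"))

# Matching the generalized template against other code finds siblings.
codebase = [
    "if (name == null) return;",
    "if (size > 0) flush();",
    "if (handle == null) return;",
]
matches = [line for line in codebase if template.fullmatch(line)]
```

Here the two null-check lines match the template while the unrelated condition does not, which is the kind of summary a reviewer would inspect at once.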
The diversity within microbiome communities that drive biogeochemical processes influences many different phenotypes. Analyses of these communities and their diversity by countless microbiome projects have revealed an important role for metagenomics in understanding the complex relationship between microbes and their environments. This relationship can be understood in the context of the microbiome composition of specific known environments, and these compositions can then be used as a template for predicting the status of similar environments. Machine learning has been applied as a key component of this predictive task, and several analysis tools utilizing machine learning methods for metagenomic analysis have already been published. Despite the previously proposed machine learning models, the performance of deep neural networks is still under-researched. Given the nature of metagenomic data, deep neural networks could provide a strong boost to prediction accuracy in metagenomic analysis applications. To meet this demand, we present a deep-learning-based tool that utilizes a deep neural network implementation for phenotypic prediction of unknown metagenomic samples. First, our tool takes as input taxonomic profiles from 16S or WGS sequencing data. Second, given the samples, it builds a model based on a deep neural network by computing multi-level classification. Lastly, given the model, it classifies an unknown sample from its unlabeled taxonomic profile. In benchmark experiments, we found that an analysis method employing a deep neural network, such as our tool, shows promising improvements in prediction accuracy on several samples compared to other machine learning models.
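As a toy stand-in for the three steps above, the sketch below trains a one-hidden-layer network with plain full-batch gradient descent on synthetic relative-abundance profiles over eight hypothetical taxa. It is a minimal illustration of the profile-in, phenotype-out workflow, not the tool's actual architecture or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic taxonomic profiles: each row sums to 1, like relative abundances.
X = rng.dirichlet(np.ones(8), size=200)
# Toy phenotype: label 1 when taxon 0 outweighs taxon 1 (invented signal).
y = (X[:, 0] > X[:, 1]).astype(float)

# One hidden layer (tanh) feeding a sigmoid output.
W1 = rng.normal(0.0, 0.5, (8, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return h, p.ravel()

lr = 1.0
for _ in range(1000):
    h, p = forward(X)
    g = ((p - y) / len(X))[:, None]       # dLoss/dlogits for cross-entropy
    gh = (g @ W2.T) * (1.0 - h ** 2)      # backprop through tanh
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum(0)
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(0)

_, p = forward(X)
accuracy = float(((p > 0.5) == y).mean())   # training accuracy on toy data
```

The trained network classifies a new, unlabeled profile by a single `forward` pass, mirroring step (3) of the workflow.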
Inspecting and testing code changes typically requires a significant amount of developer effort. As a system evolves, developers often create composite changes by mixing multiple development issues, as opposed to addressing one independent issue (an atomic change). Inspecting composite changes often becomes time-consuming and error-prone, and rerunning all regression tests to exercise unrelated edits may require excessive time. To address this problem, we present ChgCutter, an interactive technique for change decomposition that supports code reviews and regression test selection. When a developer specifies a code change within a diff patch, ChgCutter partitions the composite change into a set of related atomic changes, each of which is more cohesive and self-contained with respect to the issue being addressed. For composite change inspection, it uses program dependence relationships to generate an intermediate program version that includes only a related change subset. To reduce cost during regression testing, it safely selects only the affected tests responsible for changes to an intermediate version. In our evaluation, we applied ChgCutter to 28 composite changes in four open source projects. ChgCutter partitions these changes with 95.7% accuracy while selecting affected tests with 89.0% accuracy. We also conducted a user study with professional software engineers at PayPal and found that ChgCutter is helpful in understanding and validating composite changes, and that it scales to industry projects.
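The dependence-based partitioning at the heart of change decomposition can be sketched as a connected-components computation over edited lines. The representation below (edits as line numbers with defined/used variable sets, grouped when they touch the same variable) is a simplification invented for illustration, not the tool's actual analysis:

```python
from collections import defaultdict

def partition_edits(edits):
    """Group edited lines that touch the same variable into one atomic
    change via union-find; connected components approximate the
    dependence-related change subsets.  edits: (line_no, defs, uses)."""
    parent = {e[0]: e[0] for e in edits}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    by_var = defaultdict(list)
    for line, defs, uses in edits:
        for v in defs | uses:
            by_var[v].append(line)
    for lines in by_var.values():
        for other in lines[1:]:
            union(lines[0], other)

    groups = defaultdict(set)
    for line, _, _ in edits:
        groups[find(line)].add(line)
    return sorted(sorted(g) for g in groups.values())

# A composite change mixing a pricing fix (lines 10, 14) with a logging
# tweak (line 12); all names are hypothetical.
edits = [
    (10, {"total"}, {"price"}),
    (12, {"msg"}, {"user"}),
    (14, set(), {"total"}),
]
atomic = partition_edits(edits)
```

The two pricing edits share the variable `total` and land in one atomic change; the logging edit stands alone, so it could be reviewed and tested separately.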
Machine learning (ML) techniques discover knowledge from large amounts of data, and ML modeling is becoming essential to software systems in practice. ML research communities have focused on the accuracy and efficiency of ML models, while less attention has been paid to validating the quality of those models. Validating ML applications is a challenging and time-consuming process for developers, since prediction accuracy heavily relies on the generated models. ML applications are written in a relatively data-driven programming style on top of black-box ML frameworks, and the datasets and the ML application must each be investigated individually, so ML validation tasks take considerable time and effort. To address this limitation, we present MLVal, a novel quality validation technique that increases the reliability of ML models and applications. Our approach helps developers inspect the training data and the features generated for the ML model. Data validation is important and beneficial to software quality, since the quality of the input data affects the speed and accuracy of training and inference. Inspired by software debugging and validation for reproducing reported bugs, MLVal takes as input an ML application and its training datasets to build the ML models, helping ML application developers easily reproduce and understand anomalies in the ML application. We have implemented MLVal as an Eclipse plugin that allows developers to validate the prediction behavior of their ML applications, the ML model, and the training data in the Eclipse IDE. In our evaluation, we used 23,500 documents from the bioengineering research domain. We assessed the ability of the MLVal validation technique to effectively help ML application developers: (1) investigate the connection between the produced features and the labels in the training model, and (2) detect errors early to secure the quality of models through better data. Our approach reduces the cost of the engineering effort required to validate problems, improving data-centric workflows of ML application development.
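One concrete example of the kind of training-data check such inspection can surface is flagging identical feature vectors that carry conflicting labels, a common cause of noisy models. The sketch below is a toy stand-in, not MLVal's actual API:

```python
def conflicting_labels(samples):
    """Flag training examples whose feature vectors are identical but
    whose labels disagree -- a simple data-quality check of the kind a
    training-data inspector supports.  samples: (features, label)."""
    seen = {}
    conflicts = []
    for features, label in samples:
        key = tuple(features)
        if key in seen and seen[key] != label:
            conflicts.append(key)
        seen.setdefault(key, label)
    return conflicts

# Hypothetical document features with labels; the third row contradicts
# the first, so it should be flagged before training.
samples = [
    ([0.1, 0.9], "relevant"),
    ([0.4, 0.2], "irrelevant"),
    ([0.1, 0.9], "irrelevant"),
]
bad_rows = conflicting_labels(samples)
```

Catching such rows before training is an instance of "detecting errors early to secure the quality of models from better data."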
De novo genome assembly is a technique that builds the genome of a specimen using overlaps of genomic fragments, without additional work against a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on the overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the computation is challenging to parallelize.
To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges along enormous graph paths by adapting the scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames.
We share the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA . We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed mid-to-small-size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling as the number of computing instances increased.
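A core string-graph reduction being scaled out here is transitive edge reduction: an overlap edge a->c is redundant when edges a->b and b->c already exist. A single-machine sketch of that operation (plain Python, standing in for the distributed Spark/GraphX implementation) looks like this:

```python
def transitive_reduction(edges):
    """Drop overlap edges implied transitively: keep a->b only if no
    neighbor m of a also reaches b.  A toy, single-machine version of
    the string-graph reduction a distributed implementation scales."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
    reduced = set()
    for a, b in edges:
        if not any(b in adj.get(m, ()) for m in adj[a] if m != b):
            reduced.add((a, b))
    return reduced

# Reads with overlaps r1->r2->r3 plus the implied shortcut r1->r3.
edges = {("r1", "r2"), ("r2", "r3"), ("r1", "r3")}
kept = transitive_reduction(edges)
```

The shortcut edge r1->r3 is removed because the path through r2 already carries the same overlap information, shrinking the graph before path compaction.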
As the prevailing programming model of enterprise applications becomes more declarative, programmers are spending an increasing amount of their time and effort writing and maintaining metadata, such as XML or annotations. Although metadata is a cornerstone of modern software, automatic bug finding tools cannot ensure that metadata remains correct during refactoring and enhancement. To address this shortcoming, this paper presents metadata invariants, a new abstraction that codifies various naming and typing relationships between metadata and the main source code of a program. We reify this abstraction as a domain-specific language. We also introduce algorithms to infer likely metadata invariants and to apply them to check metadata correctness in the presence of program evolution. We demonstrate how metadata invariant checking can help ensure that metadata remains consistent and correct during program evolution; it finds metadata-related inconsistencies and recommends how they should be corrected. Similar to static bug finding tools, a metadata invariant checker identifies metadata-related bugs as a program is being refactored and enhanced. Because metadata is omnipresent in modern software applications, our approach can help ensure the overall consistency and correctness of software as it evolves.
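A minimal sketch of one naming invariant of this kind, assuming a hypothetical convention that every property declared in metadata must have a matching getter in the source code (names and convention invented for illustration, not the paper's DSL):

```python
def check_invariant(metadata, identifiers):
    """Check a toy metadata invariant: every property named in the
    metadata must have a matching getter `getX` among the program's
    identifiers; return the violations found."""
    violations = []
    for prop in metadata:
        getter = "get" + prop[0].upper() + prop[1:]
        if getter not in identifiers:
            violations.append((prop, getter))
    return violations

# XML-style metadata declares persisted properties; the code defines getters.
metadata = ["userName", "email", "createdAt"]
identifiers = {"getUserName", "getEmail", "setEmail"}
bad = check_invariant(metadata, identifiers)
```

A rename of `getCreatedAt` during refactoring would surface exactly this way: the metadata still names `createdAt`, but no matching getter exists, so the checker reports the broken correspondence.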
Detecting metadata bugs on the fly Myoungkyu Song; Eli Tilevich
2012 34th International Conference on Software Engineering (ICSE)
2012-June
Conference Proceeding
Programmers are spending a large and increasing amount of their time writing and modifying metadata, such as Java annotations and XML deployment descriptors. And yet, automatic bug finding tools cannot find metadata-related bugs introduced during program refactoring and enhancement. To address this shortcoming, we have created metadata invariants, a new programming abstraction that expresses naming and typing relationships between metadata and the main source code of a program. A paper in the main technical program of ICSE 2012 describes the idea, concept, and prototype of metadata invariants [4]. The goal of this demo is to supplement that paper with a demonstration of our Eclipse plugin, Metadata Bug Finder (MBF). MBF takes as input a script written in our domain-specific language that describes a set of metadata coding conventions the programmer wishes to enforce. After each file save operation, MBF checks the edited codebase for violations of the given metadata programming conventions, which are immediately reported to the programmer as potential metadata-related bugs. By making the programmer aware of these potential bugs, MBF prevents them from seeping into production, thereby improving the overall correctness of the edited codebase.
In modern web-based applications, an increasing amount of source code is generated dynamically at runtime. Web applications commonly execute dynamically generated code (DGC) emitted by third-party, black-box generators running at remote sites. Web developers often need to adapt DGC before it can be executed: embedded HTML can be vulnerable to cross-site scripting attacks; an API may be incompatible with some browsers; and the program state created by DGC may not persist. Lacking any systematic approach for adapting DGC, web developers resort to ad hoc techniques that are unsafe and error-prone. This study presents an approach for adapting DGC systematically that follows the program-transformation-by-example paradigm. The approach provides predefined, domain-specific before/after examples that capture the variability of commonly used adaptations. By approving or rejecting these examples, web developers determine the required adaptation transformations, which are encoded in an adaptation script operating on the generated code's abstract syntax tree. The approach comprises a suite of practical JavaScript program adaptations and their corresponding before/after examples. The authors have successfully applied the approach to real web applications to adapt third-party generated JavaScript code for security, browser compatibility, and persistence.
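The paper's adaptations target generated JavaScript; the Python sketch below shows the same transformation-by-example shape on an abstract syntax tree, rewriting every call to a `before` name into an `after` name (all names are invented for illustration):

```python
import ast

class RenameCall(ast.NodeTransformer):
    """Apply a before/after adaptation at the AST level: every call to
    the `before` function becomes a call to `after`."""
    def __init__(self, before, after):
        self.before, self.after = before, after

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.before:
            node.func = ast.copy_location(
                ast.Name(self.after, ast.Load()), node.func)
        return node

# Hypothetical generated code; the adaptation routes it through a
# sanitizing wrapper, as a before/after example might prescribe.
generated = "result = render(template, data)"
tree = ast.parse(generated)
tree = RenameCall("render", "safe_render").visit(tree)
adapted = ast.unparse(tree)
```

Because the rewrite runs on the syntax tree rather than on text, it survives whitespace and formatting differences across generator outputs, which is what makes script-encoded adaptations safer than ad hoc string edits.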
Among the well-known means to increase programmer productivity and decrease development effort is systematic software reuse. Although large scale reuse remains an elusive goal, programmers have been successfully reusing individual software artifacts, including components, libraries, and specifications. One software artifact that is not amenable to reuse is metadata, which has become an essential part of modern software development. Specifically, mainstream metadata formats, including XML and Java 5 annotations, are not amenable to systematic reuse. As a result, software that uses metadata cannot fully reap the benefits traditionally associated with systematic reuse. To address this lack of metadata reusability, this article presents Pattern-Based Structural Expressions (PBSE), a new metadata format that is not only reusable, but also offers conciseness and maintainability advantages. PBSE can be reused both across individual program components and across entire applications. In addition, PBSE makes it possible to reuse metadata-expressed functionality across languages. In particular, we show how implementations of non-functional concerns (commonly expressed through metadata) in existing languages can be reused in emerging languages via automated metadata translation. Because metadata constitutes an intrinsic part of modern software applications, the ability to systematically reuse metadata is essential for improving programmer productivity and decreasing development effort.
• We discuss the advantages and shortcomings of mainstream metadata formats.
• We present a metadata format that is reusable across applications and languages.
• We present an approach to reuse non-functional concerns during cross-compilation.
• We present a DSL for declaratively expressing differences between metadata formats.
• We enable the X10 language to reuse persistence and unit testing in Java and C++.
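The gist of a pattern-based structural expression is attaching one metadata declaration to every program element that matches a structural pattern, instead of annotating each element individually. The sketch below approximates this idea with a plain regex over field names; the real PBSE notation is richer, and all names here are invented:

```python
import re

def apply_pbse(pattern, annotation, fields):
    """Attach one metadata declaration to every field whose name
    matches a structural pattern -- a toy approximation of
    Pattern-Based Structural Expressions using a regex."""
    return {f: annotation for f in fields if re.fullmatch(pattern, f)}

# Hypothetical entity fields; one pattern marks all non-cache fields
# as persistent, replacing three per-field annotations with one rule.
fields = ["firstName", "lastName", "cachedHash"]
persisted = apply_pbse(r"(?!cached).*", "@Persistent", fields)
```

Because the rule refers to structure rather than to concrete names, the same declaration can be reused unchanged in another class, or another application, whose fields follow the same convention.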