API-Evolution Support with Diff-CatchUp Zhenchang Xing; Stroulia, E.
IEEE Transactions on Software Engineering, 12/2007, Volume: 33, Issue: 12
Journal Article
Peer reviewed
Applications built on reusable component frameworks are subject to two independent, and potentially conflicting, evolution processes. The application evolves in response to the specific requirements and desired qualities of the application's stakeholders. On the other hand, the evolution of the component framework is driven by the need to improve the framework functionality and quality while maintaining its generality. Thus, changes to the component framework frequently change its API on which its client applications rely and, as a result, these applications break. To date, there has been some work aimed at supporting the migration of client applications to newer versions of their underlying frameworks, but it usually requires that the framework developers do additional work for that purpose or that the application developers use the same tools as the framework developers. In this paper, we discuss our approach to tackle the API-evolution problem in the context of reuse-based software development, which automatically recognizes the API changes of the reused framework and proposes plausible replacements to the "obsolete" API based on working examples of the framework code base. This approach has been implemented in the Diff-CatchUp tool. We report on two case studies that we have conducted to evaluate the effectiveness of our approach with its Diff-CatchUp prototype.
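The core idea in the abstract above (recognize which API members disappeared, then propose plausible replacements) can be illustrated with a minimal sketch. This is not the Diff-CatchUp method: the abstract says replacements are derived from working examples in the framework code base, whereas this stand-in simply ranks newly added members by name similarity. The function name `propose_replacements` and the sample API names are hypothetical.

```python
import difflib


def propose_replacements(old_api, new_api, max_candidates=3, cutoff=0.6):
    """Map each removed API member to similarly named members added in the new version.

    Assumption: name similarity is used as a crude stand-in for the
    example-based evidence described in the abstract.
    """
    removed = sorted(set(old_api) - set(new_api))
    added = sorted(set(new_api) - set(old_api))
    return {
        name: difflib.get_close_matches(name, added, n=max_candidates, cutoff=cutoff)
        for name in removed
    }


# Hypothetical API surfaces of two framework versions.
old_api = ["Canvas.repaint", "Canvas.draw"]
new_api = ["Canvas.repaintAll", "Canvas.draw"]
proposals = propose_replacements(old_api, new_api)
```

Members present in both versions are ignored, so only the broken references of a client application would need attention.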
Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site; in this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A site, forming a large knowledge network. In Stack Overflow, why do developers share URLs? How does the community respond to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use an open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.
Security vulnerabilities have been continually disclosed and documented. For the effective understanding, management, and mitigation of the fast-growing number of vulnerabilities, an important practice in documenting vulnerabilities is to describe the key vulnerability aspects, such as vulnerability type, root cause, affected product, impact, attacker type, and attack vector. In this article, we first investigate 133,639 vulnerability reports in the Common Vulnerabilities and Exposures (CVE) database over the past 20 years. We find that 56%, 85%, 38%, and 28% of CVEs miss vulnerability type, root cause, attack vector, and attacker type, respectively. By comparing the differences of the latest updated CVE reports across different databases, we observe that 1,476 missing key aspects in 1,320 CVE descriptions were augmented manually in the National Vulnerability Database (NVD), which indicates that the vulnerability database maintainers try to complete the vulnerability descriptions in practice to mitigate such a problem. To help complete the missing information of key vulnerability aspects and reduce human efforts, we propose a neural-network-based approach called PMA to predict the missing key aspects of a vulnerability based on its known aspects. We systematically explore the design space of the neural network models and empirically identify the most effective model design in the scenario. Our ablation study reveals the prominent correlations among vulnerability aspects when predicting. Trained with historical CVEs, our model achieves 88%, 71%, 61%, and 81% in F1 for predicting the missing vulnerability type, root cause, attacker type, and attack vector of 8,623 “future” CVEs across 3 years, respectively. Furthermore, we validate the predicting performance of key aspect augmentation of CVEs based on the manually augmented CVE data collected from NVD, which confirms the practicality of our approach. We finally highlight that PMA has the ability to reduce human efforts by recommending and augmenting missing key aspects for vulnerability databases, and to facilitate other research works such as severity level prediction of CVEs based on the vulnerability descriptions.
In the daily development process, developers often need assistance in finding a sequence of APIs to accomplish their development tasks. Existing deep learning models, which have recently been developed for recommending one single API, can be adapted by using encoder-decoder models together with beam search to generate API sequence recommendations. However, the generated API sequence recommendations heavily rely on the probabilities of API suggestions at each decoding step, which do not take into account other domain-specific factors (e.g., whether an API suggestion satisfies the program syntax and how diverse the API sequence recommendations are). Moreover, it is difficult for developers to find similar API sequence recommendations, distinguish different API sequence recommendations, and make a selection when the API sequence recommendations are ordered by probabilities. Thus, what we need is more than deep learning. In this paper, we propose an approach, named Cook, to combine deep learning models with post-processing strategies for API sequence recommendation. Specifically, we enhance beam search with code-specific heuristics to improve the quality of API sequence recommendations. We develop a clustering algorithm to cluster API sequence recommendations so as to make it easier for developers to find similar API sequence recommendations and distinguish different API sequence recommendations. We also propose a method to generate a summary for each cluster to help developers understand the API sequence recommendations. Our evaluation results have shown that (1) three deep learning models with our heuristic-enhanced beam search achieved better performance than with the original beam search in terms of CIDEr-1, CIDEr-5 and CIDEr-10 scores, with an average improvement of 1.8, 2.3 and 2.3, respectively; and (2) our clustering algorithm achieved high performance on six metrics and outperformed two variant clustering algorithms. Moreover, our user study with 24 participants shows that Cook can help developers accomplish programming tasks faster and pass more test cases, and the participants confirm that clusters and summaries indeed help them understand and select the correct API sequence recommendations.
Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising; however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) most of the test diffs from which NMT can generate high-quality messages are similar to one or more training diffs at the token level; (2) about 16% of the commit messages in Jiang et al.’s dataset are noisy, either because they were automatically generated or because they describe repetitive trivial changes; and (3) the performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we propose a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers.
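The nearest-neighbor idea described in the abstract above can be sketched roughly as follows: given a new diff, find the most similar training diff at the token level and reuse its commit message. The abstract does not specify the similarity measure, so whitespace tokenization and cosine similarity over token counts are assumptions here, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_message(new_diff, training_pairs):
    """Return the message of the training diff most similar to new_diff.

    training_pairs is a list of (diff_text, commit_message) tuples.
    Returns None when there is no training data.
    """
    query = Counter(new_diff.split())  # assumption: whitespace tokenization
    best_message, best_score = None, -1.0
    for diff, message in training_pairs:
        score = cosine(query, Counter(diff.split()))
        if score > best_score:
            best_message, best_score = message, score
    return best_message


# Hypothetical training data: (diff, commit message) pairs.
training_pairs = [
    ("- int x = 1 ; + int x = 2 ;", "Update default value of x"),
    ("+ import java.util.List ;", "Add missing import"),
    ("- README typo teh + README typo the", "Fix typo in README"),
]
suggested = nearest_neighbor_message("+ import java.util.Map ;", training_pairs)
```

No model is trained in this sketch; each query is a single pass over the stored diffs, which is consistent with the abstract's point that a retrieval approach can be much simpler and faster than neural translation.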
Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly-used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
What help do developers seek, when and how? Hongwei Li; Zhenchang Xing; Xin Peng ...
2013 20th Working Conference on Reverse Engineering (WCRE), 2013-Oct.
Conference Proceeding
Software development often requires knowledge beyond what developers already possess. In such cases, developers have to seek help from different sources of information. As a metacognitive skill, help seeking influences software developers' efficiency and success in many situations. However, there has been little research to provide a systematic investigation of the general process of help seeking activities in software engineering and human and system factors affecting help seeking. This paper reports our empirical study aiming to fill this gap. Our study includes two human experiments, involving 24 developers and two typical software development tasks. Our study gathers empirical data that allows us to provide an in-depth analysis of help-seeking task structures, task strategies, information sources, process model, and developers' information needs and behaviors in seeking and using help information and in managing information during help seeking. Our study provides a detailed understanding of help seeking activities in software engineering, the challenges that software developers face, and the limitations of existing tool support. This can lead to the design and development of more efficient and usable help seeking support that helps developers become better help seekers.
► We propose ICFL, an iterative context-aware approach for feature location.
► ICFL considers structural similarity between features and program elements.
► We evaluate ICFL using a small industry system and a large open-source system.
► ICFL is more robust and can increase recall with only minor decrease of precision.
► Structural similarity can complement lexical similarity for feature location.
Locating program element(s) relevant to a particular feature is an important step in efficient maintenance of a software system. The existing feature location techniques analyse each feature independently and perform a one-time analysis after being provided an initial input. As a result, these techniques are sensitive to the quality of the input. In this paper, we propose to address the above issues in feature location using an iterative context-aware approach. The underlying intuition is that features are not independent of each other, and the structure of source code resembles the structure of features. The distinguishing characteristics of the proposed approach are: (1) it takes into account the structural similarity between a feature and a program element to determine feature-element relevance and (2) it employs an iterative process to propagate the relevance of the established mappings between a feature and a program element to the neighbouring features and program elements. We evaluate our approach using two different systems, DirectBank, a small-scale industry financial system, and Linux kernel, a large-scale open-source operating system. Our evaluation suggests that the proposed approach is more robust and can significantly increase the recall of feature location with only a minor decrease of precision.
Companies often develop and maintain a collection of product variants that share some common features but also support different, customer-specific features. To reengineer such legacy product variants for systematic reuse, one must identify features and their implementing code units (e.g. functions, files) in different product variants. Information retrieval (IR) techniques may be applied for that purpose. In this paper, we discuss problems that hinder direct application of IR techniques to a collection of product variants. To counter these problems, we present an approach to support effective feature location in product variants. The novelty of our approach is that we exploit commonalities and differences of product variants by software differencing and FCA techniques so that IR techniques can achieve satisfactory results for feature location in product variants. We have implemented our approach and conducted an evaluation with a collection of nine Linux kernel product variants. Our evaluation shows that our approach always significantly outperforms a direct application of IR techniques in the subject product variants.