A software system's design determines many of its properties, such as maintainability and performance. An understanding of design is needed to maintain system properties as changes to the system occur. Unfortunately, many systems do not have up-to-date design documentation, and the approaches developed to recover design tend to focus on how a system works, extracting structural and behavioural information rather than information about desired design properties such as robustness or performance. In this paper, we explore whether it is possible to automatically locate where design is discussed in on-line developer discussions. We investigate and introduce a classifier that locates paragraphs in pull request discussions that pertain to design, with an average AUC score of 0.87. We show that this classifier, when applied to projects on which it was not trained, agrees with the identification of design points by humans with an average AUC score of 0.79. We describe how this classifier could serve as the basis of tools to improve such tasks as reviewing code and implementing new features.
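The AUC scores reported above summarize ranking quality: the probability that a randomly chosen design-related paragraph is scored higher than a randomly chosen non-design paragraph. A minimal stdlib sketch of that metric, on made-up labels and classifier scores (not the paper's data or features):

```python
# Rank-based AUC computation, illustrating the evaluation metric used
# for the design-discussion classifier. Labels and scores are toy data.

def auc(labels, scores):
    """Probability that a random positive example outscores a random
    negative one; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Six paragraphs: 1 = design-related, 0 = not; scores from a classifier.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(auc(labels, scores))
```

A perfect ranking yields 1.0 and a random one about 0.5, which is why the 0.87/0.79 figures above indicate substantially better-than-chance agreement.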
When refactoring high-level models, measuring the differences between the original and the refactored model helps designers understand how the original model was modified and whether the transformation added complexity and/or improved the model. M2K is a methodology that parses legacy C code, maps it to a high-level model representing the domain concepts, and proposes a refactored model to improve the mapped design. Based on both models, we propose a distance to indicate, from the domain viewpoint, whether the original identified concept keeps the same structure or, conversely, whether the refactorings modify the concepts represented in the original model. Our approach is based on models generated through the M2K methodology and does not take syntactical variations between models into account. To show the applicability and validation of our approach, we first apply it to a trivial case study. Then, we show the results of applying our proposal to thirteen case studies (small-scale real projects implemented in C) that were also used to validate the M2K methodology.
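One plausible shape for such a structural distance, sketched here with hypothetical concept names and member sets (the paper's actual metric may differ): compare each concept's members before and after refactoring with a Jaccard distance and average over all concepts appearing in either model.

```python
# Illustrative structural distance between an original and a refactored
# model. Concepts are mapped to their member sets; the metric and the
# example models are assumptions, not the paper's definition.

def concept_distance(members_a, members_b):
    """Jaccard distance between two concepts' member sets
    (0 = identical structure, 1 = nothing in common)."""
    a, b = set(members_a), set(members_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def model_distance(original, refactored):
    """Average per-concept distance over concepts present in either
    model; a concept missing from one side counts as fully changed."""
    names = set(original) | set(refactored)
    return sum(concept_distance(original.get(n, ()), refactored.get(n, ()))
               for n in names) / len(names)

original   = {"Sensor": {"id", "read", "calibrate"},
              "Logger": {"write", "flush"}}
refactored = {"Sensor": {"id", "read"},          # calibrate moved out
              "Logger": {"write", "flush"},
              "Calibrator": {"calibrate"}}       # new concept introduced
print(model_distance(original, refactored))
```

A distance near 0 indicates the refactoring preserved the original concepts' structure; larger values flag concepts that were split, merged, or reshaped.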
• We propose a complementary mechanism to improve the efficiency of software clustering.
• We performed two simulations to test the accuracy of the proposed technique.
• We found that accuracy decreases when utility classes are involved.
• We found that multiple cutting points are feasible under certain circumstances.
Software clustering is a key technique used in reverse engineering to recover a high-level abstraction of software when resources are limited. Very little research has explicitly discussed the problem of finding the optimum set of clusters in the design and how to penalize the formation of singleton clusters during clustering.
This paper attempts to enhance existing agglomerative clustering algorithms by introducing a complementary mechanism. To solve the architecture recovery problem, the proposed approach focuses on minimizing redundant effort and penalizing the formation of singleton clusters during clustering while maintaining the integrity of the results.
An automated solution for cutting a dendrogram, based on least-squares regression, is presented to find the best cut level. A dendrogram is a tree diagram that shows the taxonomic relationships of clusters of software entities. This paper also introduces a factor that penalizes clusters that would form singletons. Simulations were performed on two open-source projects. The proposed approach was compared against the exhaustive and highest-gap dendrogram cutting methods, as well as two well-known cluster validity indices, namely Dunn's index and the Davies-Bouldin index.
When comparing our clustering results against the original package diagram, our approach achieved an average accuracy rate of 90.07% across two simulations after the utility classes were removed. Utility classes in the source code affect the accuracy of software clustering owing to their omnipresent behavior. The proposed approach also successfully penalized the formation of singleton clusters during clustering.
The evaluation indicates that the proposed approach can enhance the quality of the clustering results by guiding software maintainers through the cutting point selection process. The proposed approach can be used as a complementary mechanism to improve the effectiveness of existing clustering algorithms.
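The general idea of combining a least-squares fit over merge heights with a singleton penalty can be sketched as follows. The concrete formulation in the paper may differ; the merge heights, singleton counts, and penalty weight below are all made up for illustration.

```python
# Sketch: pick the dendrogram cut whose merge height deviates most from
# a least-squares trend line, discounted by the number of singleton
# clusters that cut would leave. All inputs here are hypothetical.

def fit_line(ys):
    """Least-squares line y = a*x + b over points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def best_cut(heights, singletons_at, penalty=0.5):
    """Score each merge step by its residual above the fitted trend,
    minus a penalty per singleton the cut would produce."""
    a, b = fit_line(heights)
    best, best_score = None, float("-inf")
    for i, h in enumerate(heights):
        score = (h - (a * i + b)) - penalty * singletons_at[i]
        if score > best_score:
            best, best_score = i, score
    return best

# Merge heights from a hypothetical agglomerative run, and the number of
# singleton clusters a cut just below each merge would leave behind.
heights = [0.1, 0.15, 0.2, 0.9, 1.0]
singletons_at = [4, 3, 1, 0, 0]
print(best_cut(heights, singletons_at))
```

Here the large jump between the third and fourth merges, combined with the singleton penalty on earlier cuts, steers the selection toward the later cut level.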
The process of understanding and reusing software is often time-consuming, especially for legacy code and open-source libraries. While some core code of open-source libraries may be well documented, open-source libraries frequently lack informative API documentation and reliable design information. As a result, the source code itself is often the sole reliable source of information for program understanding activities. In this article, we propose a reverse-engineering approach that assists in understanding software through the automatic recovery of hidden design patterns in software libraries. Specifically, we use an ontology formalism to represent the conceptual knowledge of the source code and semantic rules to capture the structures and behaviors of the design patterns in the libraries. Several software libraries were examined with this approach, and the evaluation results show that effective and flexible detection of design patterns can be achieved without hard-coded heuristics.
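A toy illustration of rule-based detection over a conceptual model of the code, in the spirit of the ontology/semantic-rule approach above. The fact schema and the example class are hypothetical; real detectors reason over much richer models of structure and behavior.

```python
# A structural rule for the Singleton pattern, evaluated over a simple
# dictionary-based model of a class. Schema and example are assumptions.

def is_singleton(cls):
    """Singleton rule: a private constructor, a static field of the
    class's own type, and a static accessor returning that type."""
    private_ctor = any(m["name"] == cls["name"] and m["visibility"] == "private"
                       for m in cls["methods"])
    static_self_field = any(f["static"] and f["type"] == cls["name"]
                            for f in cls["fields"])
    static_accessor = any(m["static"] and m["returns"] == cls["name"]
                          for m in cls["methods"])
    return private_ctor and static_self_field and static_accessor

registry = {
    "name": "Registry",
    "fields": [{"name": "instance", "type": "Registry", "static": True}],
    "methods": [
        {"name": "Registry", "visibility": "private", "static": False,
         "returns": None},
        {"name": "getInstance", "visibility": "public", "static": True,
         "returns": "Registry"},
    ],
}
print(is_singleton(registry))
```

Expressing such rules declaratively, rather than hard-coding them per pattern, is what gives the approach its flexibility across pattern variants.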
Summary
When analyzing legacy code, generating a high-level model of an application during the reverse engineering process helps developers understand how the application is structured and how the dependencies relate the different software entities. Within the context of procedural programming languages (such as C), the existing approaches to obtain a model of the code require documentation and/or implicit knowledge that stakeholders acquire while building the software. These approaches use the code itself to build a syntactic model showing the different software artifacts, such as variables, functions, and modules. However, there is no supporting methodology to detect and analyze relationships/dependencies between those artifacts, such as which variable in a module is declared using a data type described in another one, or which functions use parameters typed with a data type; nor any design decision taken by the original developers, such as how the developer implemented functions across different modules. On the other hand, current developers use the object-oriented (OO) paradigm to implement not only business applications but also useful methodologies/tools that allow semiautomatic analysis of any application. We must remark that legacy procedural code still has worth and is running in several industries, and, as with any evolving code, developers must be able to perform maintenance tasks while minimizing the limitations imposed by the language. Based on useful properties that the OO paradigm (and its supporting analysis tools) provides, such as UML models, we propose M2K as a methodology to generate a high-level model from legacy procedural code, mainly written in ANSI C. Understanding how C-based applications were implemented is not a new problem in software reengineering.
However, our contribution is based on building an OO model and suggesting different refactorings that help the developer improve it and eventually guide a new implementation of the target application. Specifically, the methodology builds cohesive software entities mapped from procedural code and makes the coupling between C entities explicit in the high-level model. The result of our methodology is a set of refactored class candidates: a structure that groups a set of variables and a set of functions obtained from the C applications. Based on the class candidate model, we propose refactorings based on OO design principles to improve the design of the application. The most relevant design improvements were obtained with algorithm abstraction by applying the strategy pattern, attribute/method relocation, variable type generalization, and removing/renaming methods/attributes. Besides the methodology and the supporting tool, we provide 14 case studies based on real projects implemented in C and show how the results validate our proposal.
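The kind of struct-to-class mapping described above can be sketched with a simple heuristic: group each C struct with the functions that take it as their first parameter. The grouping rule, names, and data shapes here are illustrative assumptions, not M2K's actual algorithm.

```python
# Hypothetical sketch of deriving class candidates from C code:
# a struct's fields become attributes, and functions whose first
# parameter type is that struct become method candidates.

def class_candidates(structs, functions):
    """Map each struct to a class candidate: its fields plus the
    functions that operate on it (by first-parameter type)."""
    candidates = {}
    for s in structs:
        methods = [f["name"] for f in functions
                   if f["params"] and f["params"][0] == s["name"]]
        candidates[s["name"]] = {"attributes": s["fields"], "methods": methods}
    return candidates

structs = [{"name": "Stack", "fields": ["items", "top"]}]
functions = [
    {"name": "stack_push", "params": ["Stack", "int"]},
    {"name": "stack_pop",  "params": ["Stack"]},
    {"name": "log_error",  "params": ["char*"]},   # stays outside the candidate
]
print(class_candidates(structs, functions))
```

Functions left ungrouped (like `log_error` above) are the cases where refactorings such as relocation or the strategy pattern become relevant to improve the candidate model.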
This article presents a method of reverse engineering applied to the particular case of a cam, in order to recover the form and dimensions of the design of the original piece, taking into account: design intent, general knowledge of the problem, different geometric and dimensional restrictions, and the digitized point cloud. Rather than employing complex mathematical algorithms, a fit is achieved by drawing a parametric outline that complies with the design intent and by adjusting the different parameters through successive approximations using commercial CAD software commands.
Accurate recognition of design patterns from source code supports development-related tasks such as program comprehension, maintenance, reverse engineering, and re-engineering. Researchers have focused on this problem for many years, and a variety of recognition approaches have been proposed. Though much progress has been made, we still identify a lack of flexibility and accuracy in the pattern recognition process. This paper evaluates different design pattern recovery approaches and examines their detection accuracy. We found that the major impediment to the accurate recovery of design patterns is the large number of variations for implementing the same pattern. Furthermore, we realized that a combination of multiple searching techniques is required to improve the accuracy of pattern detection. Based on these observations, we propose variable pattern definitions, which can be customized and improved towards a pattern catalog that detects patterns in all their variations. The customizable pattern definitions are created from reusable feature types. Each feature type can use one or more searching techniques for efficient detection. The proposed approach supports detection of patterns from multiple programming languages. A prototype implementation of the approach was tested on seven different open-source software projects. For each software project, a baseline was determined and the trustworthiness of each pattern–project combination was rated. The extracted results were compared with the established baselines and with the results of previous techniques.
The artifacts constituting a software system are sometimes unnecessarily coupled with one another or may drift apart over time. As a result, support for software partitioning, recovery, and restructuring is often necessary. This paper presents studies on applying the numerical taxonomy clustering technique to software applications. The objective is to facilitate the activities just mentioned and to improve design, evaluation, and evolution. Numerical taxonomy is mathematically simple, yet it is a useful mechanism for component clustering and software partitioning. The technique can be applied at various levels of abstraction or to different software life-cycle phases. We have applied the technique to: (1) software partitioning at the software architecture design phase; (2) grouping of components based on the source code to recover the software architecture in the reverse engineering process; (3) restructuring of software to support evolution in the maintenance stage; and (4) improving cohesion and reducing coupling in source code. In this paper, we provide an introduction to numerical taxonomy, discuss our experiences in applying the approach to various areas, and relate the technique to the context of similar work.
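The core of a numerical-taxonomy-style clustering can be sketched briefly: describe each component by a binary feature vector, score pairs with a simple matching coefficient, and agglomerate greedily while similarity stays above a threshold. The components, features, and threshold below are illustrative assumptions, not the paper's data.

```python
# Minimal numerical-taxonomy-style clustering sketch: components are
# binary feature vectors (e.g. which shared resources they touch),
# merged greedily by single-linkage similarity. Toy data throughout.

def similarity(a, b):
    """Simple matching coefficient: fraction of features on which two
    components agree (both present or both absent)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(items, threshold=0.7):
    """Greedy agglomeration: repeatedly merge the most similar pair of
    clusters (single linkage) until no pair exceeds the threshold."""
    clusters = [[name] for name in items]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(similarity(items[a], items[b])
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters

# Feature vectors: which of four shared resources each component uses.
items = {"parser": (1, 1, 0, 0), "lexer": (1, 1, 0, 1),
         "ui":     (0, 0, 1, 0), "view":  (0, 0, 1, 1)}
print(cluster(items))
```

The same machinery serves the four uses listed above: the feature vectors change (design elements, source-level dependencies, coupling indicators) while the similarity-and-merge core stays the same.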