Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language-model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, there is still uncertainty about the reliability of these models, particularly their ability to consistently transform code sequences in realistic settings. This raises a question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and to assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that contribute most to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, causing overoptimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhance the reliability and explainability of automated program generation moving forward. Our findings provide important guidelines for this goal.
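Since the overoptimism finding hinges on train/test duplication, a concrete leakage check helps make the point. Below is a minimal Python sketch, assuming a token-level Jaccard similarity measure and an illustrative 0.8 threshold; the `tokens` and `find_duplicates` helpers are hypothetical and are not the paper's actual deduplication procedure.

```python
# Minimal sketch: flag near-duplicate train/test code pairs via
# token-level Jaccard similarity. Tokenizer and threshold are illustrative.
import re

def tokens(code: str) -> set[str]:
    """Crude lexer: identifiers, numbers, and single punctuation marks."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def find_duplicates(train, test, threshold=0.8):
    """Return (test_idx, train_idx, similarity) for suspiciously similar pairs."""
    train_tok = [tokens(c) for c in train]
    hits = []
    for i, sample in enumerate(test):
        t = tokens(sample)
        for j, tr in enumerate(train_tok):
            sim = jaccard(t, tr)
            if sim >= threshold:
                hits.append((i, j, round(sim, 2)))
    return hits

train = ["def add(a, b): return a + b"]
test = ["def add(a, b): return a + b",       # verbatim clone: flagged
        "def multiply(a, b): return a * b"]  # different enough: not flagged
print(find_duplicates(train, test))  # [(0, 0, 1.0)]
```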
Syntax-aware on-the-fly code completion. Takerngsaksiri, Wannita; Tantithamthavorn, Chakkrit; Li, Yuan-Fang
Information and Software Technology, January 2024, Volume 165
Journal Article
Peer reviewed
Open access
Code completion aims to improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information into model training, ensuring that code completion is aware of the syntax of the programming language. However, existing syntax-aware code completion approaches are not on-the-fly: we found that for two-thirds of the characters that developers type, an AST cannot be extracted, because AST extraction requires syntactically correct source code, limiting its practicality in real-world scenarios. On the other hand, existing on-the-fly code completion approaches do not yet consider syntactic information.
In this paper, we propose PyCoder to leverage token types, a kind of lightweight syntactic information, which is readily available and aligns with the natural order of source code.
Our PyCoder is trained in a multi-task manner: by learning the supporting task of predicting token types during the training phase, the model achieves better performance on predicting tokens and lines of code, without needing token types in the inference phase.
Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%–24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%–84.73% more accurate than baselines.
These results lead us to conclude that token type information (an alternative to syntactic information), which has rarely been used in the past, can greatly improve the performance of code completion approaches, without requiring syntactically correct source code as AST-based approaches do. Our PyCoder is publicly available on HuggingFace and GitHub.
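To make the multi-task setup concrete, here is a minimal PyTorch sketch of the general idea: a shared encoder feeding a main next-token head and an auxiliary token-type head that is only consulted during training. The `MultiTaskCompleter` class, vocabulary sizes, dimensions, and the 0.5 auxiliary loss weight are illustrative assumptions; PyCoder's actual architecture and training configuration differ.

```python
# Sketch of multi-task code completion: one shared encoder, two heads.
# Causal masking is omitted for brevity.
import torch
import torch.nn as nn

class MultiTaskCompleter(nn.Module):
    def __init__(self, vocab=1000, n_types=20, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.token_head = nn.Linear(d, vocab)   # main task: next token
        self.type_head = nn.Linear(d, n_types)  # auxiliary task: token type

    def forward(self, ids):
        h = self.encoder(self.embed(ids))
        return self.token_head(h), self.type_head(h)

model = MultiTaskCompleter()
ids = torch.randint(0, 1000, (8, 32))    # batch of token-id sequences
tok_y = torch.randint(0, 1000, (8, 32))  # next-token targets
typ_y = torch.randint(0, 20, (8, 32))    # token-type targets
tok_logits, typ_logits = model(ids)
ce = nn.CrossEntropyLoss()
# Joint loss; at inference only token_head is used, so no types are needed.
loss = ce(tok_logits.transpose(1, 2), tok_y) + 0.5 * ce(typ_logits.transpose(1, 2), typ_y)
loss.backward()
```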
Just-In-Time (JIT) defect prediction has been proposed to help teams prioritize their limited resources on the most risky commits (or pull requests), yet it remains largely a black box whose predictions are neither explainable nor actionable to practitioners. Thus, prior studies have applied various model-agnostic techniques to explain the predictions of JIT models. Yet, explanations generated by existing model-agnostic techniques are still not formally sound, robust, or actionable. In this paper, we propose FoX, a Formal eXplainer for JIT defect prediction, which builds on formal reasoning about the behaviour of JIT defect prediction models and is hence able to provide provably correct explanations that are additionally guaranteed to be minimal. Our experimental results show that FoX can efficiently generate provably correct, robust, and actionable explanations while existing model-agnostic techniques cannot. Our survey of 54 software practitioners provides valuable insights into the usefulness and trustworthiness of FoX: 86% of participants agreed that our approach is useful, and 74% found it trustworthy. This paper thus serves as an important stepping stone towards trustworthy explanations for JIT models, helping domain experts and practitioners better understand why a commit is predicted as defective and what to do to mitigate the risk.
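For intuition about what provably correct, minimal explanations involve, the toy sketch below verifies sufficiency exhaustively over small feature domains: a feature is dropped from the explanation only if no assignment to the remaining free features can flip the prediction. The `model`, feature names, and domains are invented for illustration; FoX's actual formal reasoning engine is far more general and efficient.

```python
# Deletion-based search for a minimal sufficient explanation over
# toy feature domains. Everything here is a hypothetical stand-in.
from itertools import product

def model(x):
    """Toy JIT model: 'defective' for large, widely-spread commits."""
    return x["lines_added"] > 100 and x["files_touched"] > 5

def is_sufficient(fixed, instance, domains):
    """True if every completion of the free features keeps the prediction."""
    free = [f for f in domains if f not in fixed]
    target = model(instance)
    for values in product(*(domains[f] for f in free)):
        candidate = dict(instance, **dict(zip(free, values)))
        if model(candidate) != target:
            return False
    return True

def minimal_explanation(instance, domains):
    """Drop each feature whose fixing is unnecessary for the prediction."""
    fixed = set(domains)
    for f in list(fixed):
        if is_sufficient(fixed - {f}, instance, domains):
            fixed.remove(f)
    return {f: instance[f] for f in fixed}

domains = {"lines_added": [10, 150], "files_touched": [1, 8], "author_exp": [1, 50]}
x = {"lines_added": 150, "files_touched": 8, "author_exp": 50}
print(minimal_explanation(x, domains))  # author_exp is dropped: it never matters
```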
Software Quality Assurance (SQA) planning aims to define proactive plans, such as defining a maximum file size, to prevent the occurrence of software defects in future releases. To aid this, defect prediction models have been proposed to generate insights into the most important factors that are associated with software quality. However, the insights derived from traditional defect models are far from actionable: practitioners still do not know what they should do or avoid to decrease the risk of having defects, nor what the risk threshold for each metric is. A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. In this paper, we investigate practitioners' perceptions of current SQA planning activities and their challenges, and propose four types of guidance to support SQA planning. We then propose and evaluate our AI-driven SQAPlanner approach, a novel approach for generating the four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate a visualization for our SQAPlanner approach. Through a qualitative survey and an empirical evaluation, our results lead us to conclude that SQAPlanner is needed, effective, stable, and practically applicable. We also find that 80 percent of our survey respondents perceived our visualization as more actionable. Thus, SQAPlanner paves the way for novel research in actionable software analytics, i.e., generating actionable guidance on what practitioners should and should not do to decrease the risk of having defects, to support SQA planning.
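As a rough illustration of threshold-style guidance (not SQAPlanner's actual rule-mining pipeline), the sketch below fits a depth-1 decision tree to synthetic file-size data and reads off the split as an actionable risk threshold; the data and metric name are made up.

```python
# Turning a defect signal into "keep metric below t" guidance via a
# shallow tree. Synthetic data; illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
loc = rng.integers(10, 1000, 500)                                # file size in LOC
defective = ((loc > 400) ^ (rng.random(500) < 0.1)).astype(int)  # noisy label

tree = DecisionTreeClassifier(max_depth=1).fit(loc.reshape(-1, 1), defective)
threshold = tree.tree_.threshold[0]  # the root split learned from the data
print(f"Guidance: keep files below ~{threshold:.0f} LOC to lower defect risk")
```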
The term ethics is widely used, explored, and debated in the context of developing Artificial Intelligence (AI) based software systems. In recent years, numerous incidents have raised the profile of ethical issues in AI development and led to public concerns about the proliferation of AI technology in our everyday lives. But what do we know about the views and experiences of those who develop these systems, the AI practitioners? We conducted a grounded theory literature review (GTLR) of 38 primary empirical studies that included AI practitioners' views on ethics in AI and analysed them to derive five categories: practitioner awareness, perception, need, challenge, and approach. These are underpinned by multiple codes and concepts that we explain with evidence from the included studies. We present a taxonomy of ethics in AI from practitioners' viewpoints to assist AI practitioners in identifying and understanding the different aspects of AI ethics. The taxonomy provides a landscape view of the key aspects that concern AI practitioners when it comes to ethics in AI. We also share an agenda for future research studies and recommendations for practitioners, managers, and organisations to help in their efforts to better consider and implement ethics in AI.
The interpretation of defect models heavily relies on the software metrics that are used to construct them. However, such software metrics are often correlated in defect models. Prior work often uses feature selection techniques to remove correlated metrics in order to improve the performance of defect models. Yet, the interpretation of defect models may be misleading if feature selection techniques produce subsets of inconsistent and correlated metrics. In this paper, we investigate the consistency and correlation of the subsets of metrics produced by nine commonly used feature selection techniques. Through a case study of 13 publicly available defect datasets, we find that feature selection techniques produce inconsistent subsets of metrics and do not mitigate correlated metrics, suggesting that feature selection techniques should not be used and that correlation analyses must be applied when the goal is model interpretation. Since correlation analyses often involve manual selection of metrics by a domain expert, we introduce AutoSpearman, an automated metric selection approach based on correlation analyses. Our evaluation indicates that AutoSpearman yields the highest consistency of metric subsets among training samples and mitigates correlated metrics, while impacting model performance by only 1-2 percentage points. Thus, to automatically mitigate correlated metrics when interpreting defect models, we recommend that future studies use AutoSpearman in lieu of commonly used feature selection techniques.
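A simplified sketch of the correlation-analysis step is shown below: repeatedly drop one metric from the most strongly correlated pair (|rho| >= 0.7) until no such pair remains. The 0.7 cutoff and the tie-breaking rule are common conventions rather than necessarily the paper's exact procedure, and the real AutoSpearman additionally applies a VIF-based second step.

```python
# Spearman-correlation-based metric pruning, in the spirit of
# AutoSpearman's first step. Cutoff and tie-breaking are assumptions.
import pandas as pd

def prune_correlated(df: pd.DataFrame, cutoff: float = 0.7) -> list[str]:
    metrics = list(df.columns)
    while True:
        rho = df[metrics].corr(method="spearman").abs()
        for m in metrics:
            rho.loc[m, m] = 0.0          # ignore self-correlation
        a, b = rho.stack().idxmax()      # strongest remaining pair
        if rho.loc[a, b] < cutoff:
            return metrics
        # drop whichever of the pair is more correlated with everything else
        drop = a if rho.loc[a].mean() >= rho.loc[b].mean() else b
        metrics.remove(drop)

df = pd.DataFrame({"loc": [10, 50, 90, 120], "sloc": [9, 48, 85, 118],
                   "churn": [3, 1, 7, 2]})
print(prune_correlated(df))  # keeps one of loc/sloc plus churn
```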
Just-In-Time (JIT) defect prediction (i.e., an AI/ML model that predicts defect-introducing commits) has been proposed to help developers prioritize their limited Software Quality Assurance (SQA) resources on the most risky commits. However, the explainability of JIT defect models remains largely unexplored (i.e., practitioners still do not know why a commit is predicted as defect-introducing). Recently, LIME has been used to generate explanations for arbitrary AI/ML models. However, the random perturbation approach that LIME uses to generate synthetic neighbors is suboptimal: it may generate synthetic neighbors that are not similar to the instance being explained, producing low-accuracy local models and, in turn, inaccurate explanations for JIT defect models. In this paper, we propose PyExplainer, a local rule-based model-agnostic technique for generating explanations (i.e., why a commit is predicted as defective) for JIT defect models. Through a case study of two open-source software projects, we find that PyExplainer produces (1) synthetic neighbors that are 41%-45% more similar to the instance being explained; (2) 18%-38% more accurate local models; and (3) explanations that are 69%-98% more unique and 17%-54% more consistent with the actual characteristics of future defect-introducing commits than LIME (a state-of-the-art model-agnostic technique). This could help practitioners focus on the most important aspects of their commits to mitigate the risk of introducing defects. The contributions of this paper thus build an important step towards Explainable AI for Software Engineering, making software analytics more explainable and actionable. Finally, we publish PyExplainer as a Python package to support practitioners and researchers (https://github.com/awsm-research/PyExplainer).
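The core local-surrogate idea can be sketched in a few lines: generate synthetic neighbors around the instance (rather than sampling globally as LIME does), label them with the black-box model, and fit an interpretable model to read off a rule. The perturbation scale, data, and decision-tree surrogate below are illustrative assumptions; the real PyExplainer generates neighbors via crossover and mutation and fits a rule-based local model.

```python
# Local surrogate sketch: perturb near the instance, label with the
# black box, fit an interpretable tree. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(random_state=1).fit(X, y)

x = X[0]                                              # the commit to explain
neighbors = x + rng.normal(scale=0.3, size=(200, 3))  # stay local to x
labels = black_box.predict(neighbors)

surrogate = DecisionTreeClassifier(max_depth=2).fit(neighbors, labels)
print(export_text(surrogate, feature_names=["churn", "size", "experience"]))
```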
Jürgen Cito is interviewing Vijayaraghavan Murali (VM), a software engineer at Meta, and Eddie Aftandilian (EA), a principal researcher at GitHub Next.
LineVul. Fu, Michael; Tantithamthavorn, Chakkrit
2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), May 2022
Conference Proceeding
Open access
Software vulnerabilities are prevalent in software systems, causing a variety of problems including deadlock, information loss, and system failures. Thus, early prediction of software vulnerabilities is critically important in safety-critical software systems. Various ML/DL-based approaches have been proposed to predict vulnerabilities at the file, function, or method level. Recently, IVDetect (a graph-based neural network) was proposed to predict vulnerabilities at the function level. Yet, the IVDetect approach is still inaccurate and coarse-grained. In this paper, we propose LineVul, a Transformer-based line-level vulnerability prediction approach, to address several limitations of the state-of-the-art IVDetect approach. Through an empirical evaluation on a large-scale real-world dataset with 188k+ C/C++ functions, we show that LineVul achieves (1) 160%-379% higher F1-measure for function-level predictions; (2) 12%-25% higher Top-10 Accuracy for line-level predictions; and (3) 29%-53% less Effort@20%Recall than the baseline approaches, highlighting LineVul's significant advancement towards more accurate and more cost-effective line-level vulnerability prediction. Our additional analysis shows that LineVul is also very accurate (75%-100%) at predicting vulnerable functions affected by the Top-25 most dangerous CWEs, highlighting its potential impact in real-world usage scenarios.
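To illustrate the line-level ranking idea, the sketch below aggregates per-token attention scores by source line and ranks the lines. The code snippet and attention values are made up; LineVul derives token scores from the self-attention layers of its Transformer.

```python
# Rank source lines by summed per-token attention. Toy numbers only.
import numpy as np

code = ["char buf[8];",
        "strcpy(buf, user_input);",  # the risky line
        "return 0;"]
# (line index, attention score) per token; hypothetical values
token_scores = [(0, 0.05), (0, 0.03),
                (1, 0.40), (1, 0.30), (1, 0.12),
                (2, 0.06), (2, 0.04)]

line_scores = np.zeros(len(code))
for line_idx, score in token_scores:
    line_scores[line_idx] += score

for rank, i in enumerate(np.argsort(-line_scores), start=1):
    print(f"#{rank} (score {line_scores[i]:.2f}): {code[i]}")
```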
Reviewer recommendation approaches have been proposed to provide automated support in finding suitable reviewers for a given patch. However, they have mainly focused on reviewer experience and have not taken into account the review workload, another important factor in a reviewer's decision to accept a review invitation. We set out to empirically investigate the feasibility of automatically recommending reviewers while considering the review workload amongst other factors. We develop a novel approach that leverages a multi-objective meta-heuristic algorithm to search for reviewers guided by two objectives: (1) maximizing the chance of participating in a review, and (2) minimizing the skewness of the review workload distribution among reviewers. Through an empirical study of 230,090 patches with 7,431 reviewers across four open-source projects, we find that our approach can recommend reviewers who are potentially suitable for a newly submitted patch with 19%-260% higher F-measure than the five benchmarks. Our empirical results demonstrate that review workload and other important information should be taken into consideration when finding suitable reviewers for a newly submitted patch. In addition, the results show the effectiveness of realizing this approach with multi-objective search.
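A minimal sketch of the two-objective trade-off is shown below: each candidate reviewer is scored by (1) an estimated participation probability and (2) the workload spread their assignment would create, and only non-dominated (Pareto-optimal) candidates are kept. The reviewer data is synthetic, and the brute-force Pareto filter stands in for the paper's multi-objective meta-heuristic search.

```python
# Two-objective reviewer scoring with a brute-force Pareto filter.
# Names, probabilities, and workloads are made-up illustrations.
import statistics

reviewers = {
    # name: (participation probability, current open reviews)
    "alice": (0.9, 12),
    "bob":   (0.7, 3),
    "carol": (0.6, 8),
    "dave":  (0.4, 1),
}

def objectives(name):
    """(participation prob to maximize, workload spread to minimize)."""
    prob = reviewers[name][0]
    new_loads = [load + (1 if n == name else 0)
                 for n, (_, load) in reviewers.items()]
    return prob, statistics.pstdev(new_loads)

scores = {n: objectives(n) for n in reviewers}
pareto = [n for n in scores
          if not any(other[0] >= scores[n][0] and other[1] <= scores[n][1]
                     and other != scores[n]
                     for m, other in scores.items() if m != n)]
print(pareto)  # carol is dominated by bob (lower prob, higher spread)
```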