Information leaks via side channels remain a challenging threat to confidentiality. Static analysis is a prevalent approach for detecting side channels. However, side channels pose challenges to static techniques, since they arise from non-functional aspects of systems and require an analysis of multiple traces. In addition, the outcome of static analysis is usually restricted to binary answers. In practice, real-world applications may need to disclose some aspects of the confidential information to ensure desired functionality; therefore, quantification techniques are necessary to evaluate the resulting threats. In this paper, we propose a dynamic analysis technique to detect and quantify side channels. Our novel approach is to split the problem into two tasks. First, we learn a timing model of the program as a neural network. While the program implements the functionality, the neural network models the non-functional property, which does not exist in the syntax or semantics of the program. Second, we analyze the neural network to quantify information leaks. As demonstrated in our experiments, both of these tasks are feasible in practice, making the approach a significant improvement over state-of-the-art side-channel detectors and quantifiers. Thus, our key technical contributions are (a) a binarized neural network architecture that enables side-channel discovery and (b) a novel MILP-based counting algorithm to estimate the side-channel strength. On a set of benchmarks, we show that neural networks model the timing of programs with thousands of methods precisely. We also show that neural networks with thousands of neurons can be efficiently analyzed to quantify information leaks via timing side channels.
Deep feedforward neural networks (DNNs) are increasingly deployed in socioeconomically critical decision-support software systems. DNNs are exceptionally good at finding minimal, sufficient statistical patterns within their training data. Consequently, DNNs may learn to encode decisions (amplifying existing biases or introducing new ones) that disadvantage protected individuals/groups and may violate legal protections. While existing search-based software testing approaches have been effective in discovering fairness defects, they do not supplement these defects with debugging aids, such as severity and causal explanations, that are crucial to help developers triage and decide on the next course of action. Can we measure the severity of fairness defects in DNNs? Are these defects symptomatic of improper training, or do they merely reflect biases present in the training data? To answer such questions, we present Dice: an information-theoretic testing and debugging framework to discover and localize fairness defects in DNNs. The key goal of Dice is to assist software developers in triaging fairness defects by ordering them by their severity. Towards this goal, we quantify fairness in terms of the protected information (in bits) used in decision making. A quantitative view of fairness defects not only helps in ordering these defects; our empirical evaluation shows that it also improves search efficiency due to the resulting smoothness of the search space. Guided by the quantitative fairness measure, we present a causal debugging framework to localize inadequately trained layers and neurons responsible for fairness defects. Our experiments over ten DNNs, developed for socially critical tasks, show that Dice efficiently characterizes the amounts of discrimination, effectively generates discriminatory instances (vis-a-vis state-of-the-art techniques), and localizes layers/neurons with significant biases.
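To make the "protected information in bits" view concrete, the hedged sketch below estimates the empirical mutual information between a protected attribute and a model's decisions; the arrays and their values are purely illustrative, and Dice's actual estimator may differ.

```python
import numpy as np

def protected_information_bits(attr, decisions):
    """Empirical mutual information I(A; Yhat), in bits, between a
    protected attribute and a model's decisions (both integer-coded)."""
    joint = np.zeros((attr.max() + 1, decisions.max() + 1))
    for a, y in zip(attr, decisions):
        joint[a, y] += 1
    joint /= joint.sum()
    pa, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (pa[i] * py[j]))
    return mi

# 0 bits: decisions reveal nothing about the attribute; larger values
# indicate more severe defects, giving a natural ordering for triage.
attr = np.array([0, 0, 1, 1, 0, 1, 1, 0])        # illustrative data
decisions = np.array([1, 0, 1, 1, 0, 1, 0, 0])
print(f"protected bits used: {protected_information_bits(attr, decisions):.3f}")
```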
Despite all the progress achieved in recent years in detecting anomalies in network systems, detecting unseen anomalies such as zero-day attacks remains a challenging task. Traditional signature-based Network Intrusion Detection Systems (NIDS) cannot detect such anomalies because no known signature exists for them. Moreover, Machine Learning-based (ML-based) NIDS trained with a vanilla supervised learning method cannot detect them, as they come from a different distribution than the one the model was trained on. Domain adaptation techniques help transfer the knowledge gained from a labeled source domain to an unlabeled target domain. Such techniques have the potential to enable a model trained on a dataset containing a few network attacks to detect new types of anomalies that might happen in the future. However, recent domain adaptation methods have been designed mostly for images and provide very limited benefits when applied to network traffic. In this paper, we introduce Proportional Progressive Pseudo-Labeling (PPPL), an effective approach for building a more general domain adaptation technique that can be leveraged to detect unseen anomalies in network systems. During the training phase, PPPL progressively reduces the target-domain classification error by training the model directly with pseudo-labeled target-domain samples, while excluding samples whose pseudo-labels are more likely to be wrong from the training set and postponing training on them. Our evaluation, conducted on the CICIDS2017 dataset, shows that PPPL can significantly outperform other baselines in detecting unseen anomalies, with up to 58% improvement in average F1 score.
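A minimal sketch of our reading of this loop, with scikit-learn's LogisticRegression as a stand-in classifier (the paper targets network-traffic models): each round pseudo-labels the target set, admits a growing fraction of the most confident labels, and defers the rest. The round schedule and classifier are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pppl(X_src, y_src, X_tgt, rounds=5):
    """Each round: pseudo-label the target set, keep only a growing
    fraction of the most confident pseudo-labels, and retrain."""
    model = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    for r in range(1, rounds + 1):
        proba = model.predict_proba(X_tgt)
        pseudo = model.classes_[proba.argmax(axis=1)]
        conf = proba.max(axis=1)
        # proportional, progressive schedule: admit the top r/rounds
        # fraction by confidence; postpone likely-wrong pseudo-labels
        keep = np.argsort(-conf)[: int(np.ceil(len(X_tgt) * r / rounds))]
        model = LogisticRegression(max_iter=1000).fit(
            np.vstack([X_src, X_tgt[keep]]),
            np.concatenate([y_src, pseudo[keep]]))
    return model
```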
This paper investigates the parameter space of machine learning (ML) algorithms and its role in aggravating or mitigating fairness bugs. Data-driven software is increasingly applied in socially critical applications where ensuring fairness is of paramount importance. Existing approaches address fairness bugs by modifying either the input dataset or the learning algorithm. On the other hand, the selection of hyperparameters, which provide finer controls over ML algorithms, may enable a less intrusive approach to influence fairness. Can hyperparameters amplify or suppress discrimination present in the input dataset? How can we help programmers detect, understand, and exploit the role of hyperparameters to improve fairness? We design three search-based software testing algorithms to uncover the precision-fairness frontier of the hyperparameter space. We complement these algorithms with statistical debugging to explain the role of these parameters in improving fairness. We implement the proposed approaches in the tool Parfait-ML (PARameter FAIrness Testing for ML Libraries) and show its effectiveness and utility over five mature ML algorithms as used in six socially critical applications. In these applications, our approach successfully identified hyperparameters that significantly improve fairness (vis-a-vis state-of-the-art techniques) without sacrificing precision. Surprisingly, for some algorithms (e.g., random forest), our approach showed that certain configurations of hyperparameters (e.g., restricting the search space of attributes) can amplify biases across applications. Upon further investigation, we found intuitive explanations for these phenomena, and the results corroborate similar observations from the literature.
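The sketch below illustrates the general shape of such a search with a random-search stand-in (not the paper's three algorithms): sample hyperparameter configurations of a random forest and keep the Pareto frontier over accuracy and a demographic-parity gap. The parameter ranges and the fairness metric are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def parity_gap(model, X, group):
    """Demographic-parity gap between two protected groups (0 and 1)."""
    yhat = model.predict(X)
    return abs(yhat[group == 0].mean() - yhat[group == 1].mean())

def precision_fairness_frontier(X, y, group, trials=50, seed=0):
    rng = np.random.default_rng(seed)
    frontier = []  # Pareto-optimal (accuracy, gap, config) triples
    for _ in range(trials):
        cfg = {"n_estimators": int(rng.integers(10, 200)),
               "max_depth": int(rng.integers(2, 20)),
               "max_features": ["sqrt", "log2"][int(rng.integers(0, 2))]}
        m = RandomForestClassifier(**cfg, random_state=0).fit(X, y)
        acc, gap = m.score(X, y), parity_gap(m, X, group)
        if any(a >= acc and g <= gap for a, g, _ in frontier):
            continue  # dominated by an already-found configuration
        frontier = [(a, g, c) for a, g, c in frontier
                    if not (acc >= a and gap <= g)] + [(acc, gap, cfg)]
    return frontier
```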
Metamorphic Testing and Debugging of Tax Preparation Software
Tizpaz-Niari, Saeid; Monjezi, Verya; Wagner, Morgan ...
2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS)
This paper presents a data-driven debugging framework to improve the trustworthiness of US tax preparation software systems. Given the legal implications of bugs in such software for its users, ensuring the compliance and trustworthiness of tax preparation software is of paramount importance. The key barriers in developing debugging aids for tax preparation systems are the unavailability of explicit specifications and the difficulty of obtaining oracles. We posit that, since US tax law adheres to the legal doctrine of precedent, the specifications about the outcome of tax preparation software for an individual taxpayer must be viewed in comparison with individuals who are deemed similar. Consequently, these specifications are naturally available as properties requiring that similar inputs to the software produce similar outputs. Inspired by the metamorphic testing paradigm, we dub these relations metamorphic relations, as they relate structurally modified inputs. In collaboration with legal and tax experts, we explicated metamorphic relations for a set of challenging properties from various US Internal Revenue Service (IRS) publications, including Form 1040 (U.S. Individual Income Tax Return), Publication 596 (Earned Income Tax Credit), Schedule 8812 (Qualifying Children and Other Dependents), and Form 8863 (Education Credits). While we focus on an open-source tax preparation software package for our case study, the proposed framework can be readily extended to other, commercial software. We develop a randomized test-case generation strategy, guided by the metamorphic relations, to systematically validate the correctness of tax preparation software. We further aid this test-case generation by visually explaining the behavior of the software on suspicious instances using easy-to-interpret decision-tree models. Our tool uncovered several accountability bugs of varying severity, ranging from non-robust behavior in corner cases (unreliable behavior when tax returns are close to zero) to missing eligibility conditions in updated versions of the software.
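As an illustration of the testing strategy, the sketch below checks one plausible metamorphic relation of this kind: monotonicity of tax owed in wages, everything else held fixed. Here compute_tax_owed is a hypothetical stand-in for the system under test, and the input fields are illustrative, not drawn from the paper.

```python
import random

def check_monotonic_in_wages(compute_tax_owed, trials=1000, seed=42):
    """Randomized metamorphic test: a return that differs from another
    only by strictly higher wages should never owe less tax."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        base = {"wages": rng.uniform(0, 200_000),
                "filing_status": rng.choice(["single", "married_joint"]),
                "dependents": rng.randint(0, 3)}
        twin = dict(base, wages=base["wages"] + rng.uniform(1, 500))
        # the metamorphic relation under test
        if compute_tax_owed(twin) < compute_tax_owed(base):
            failures.append((base, twin))  # suspicious instance pair
    return failures
```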
Differential performance bugs manifest when the runtime behavior of a system, such as its execution time, is unexpectedly different for two or more similar inputs. These bugs are notoriously difficult to detect, explain, and fix, since they arise from the performance differential between multiple executions of the system. Differential performance bugs give rise to both performance and security issues: they may lead to poor user experiences, waste crucial hardware resources, and leak confidential information via these differentials. This thesis introduces differential performance debugging, a framework to discover, localize, and mitigate differential performance bugs, with a focus on timing side channels and performance defects in web applications and machine learning libraries. We formalize and study the core computational problem in differential performance debugging as the discriminant learning problem. After establishing its intractability, we propose two data-driven approaches to infer discriminants efficiently. The first approach views the data as independent points and employs clustering algorithms with off-the-shelf decision trees in a novel approach called discriminant regression trees. The second approach views the data as functional points and generalizes the clustering and decision-tree inferences with functional data models. We choose decision trees as the core component of the debugging framework mostly because of their interpretability, both for the (human) users involved in the process and for automatic mitigation. These techniques have been implemented in a set of tools to debug and mitigate differential performance defects. On a set of micro-benchmarks, the thesis shows that differential performance debugging outperforms state-of-the-art techniques in detecting bugs, localizing their root causes, and mitigating them. On a set of larger case studies, the tools have found and explained multiple performance bugs in popular machine learning frameworks such as scikit-learn and side-channel vulnerabilities in critical Java libraries such as OpenJDK and Jetty. The developers have since used the results to fix multiple bugs in their implementations.
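A toy sketch of the discriminant-regression-tree idea as described: cluster observed execution times, then fit an interpretable decision tree that discriminates the clusters in terms of input features. The data and the scikit-learn components are illustrative stand-ins for the thesis's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def discriminant_tree(X_features, times, n_classes=2):
    """Cluster execution times, then learn a small tree that explains
    the clusters in terms of input features."""
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
    labels = km.fit_predict(np.asarray(times).reshape(-1, 1))
    return DecisionTreeClassifier(max_depth=3).fit(X_features, labels)

# illustrative usage: bimodal timings driven by the first feature
X = np.array([[3, 0], [4, 1], [5, 0], [11, 1], [12, 0], [13, 1]])
t = np.array([1.1, 1.2, 1.1, 9.8, 10.1, 10.0])
print(export_text(discriminant_tree(X, t), feature_names=["secret_len", "flag"]))
```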
This paper leverages the statistics of extreme values to predict the worst-case convergence times of machine learning algorithms. Timing is a critical non-functional property of ML systems, and providing worst-case convergence times is essential to guarantee the availability of ML and its services. However, timing properties such as worst-case convergence times (WCCT) are difficult to verify since (1) they are not encoded in the syntax or semantics of the underlying programming languages of AI, (2) their evaluation depends on both algorithmic implementations and underlying systems, and (3) their measurement involves uncertainty and noise. Therefore, prevalent formal methods and statistical models fail to provide rich information on the magnitude and likelihood of WCCT. Our key observation is that the timing information we seek represents the extreme tail of execution times. Therefore, extreme value theory (EVT), a statistical discipline that focuses on understanding and predicting the distribution of extreme values in the tail of outcomes, provides an ideal framework to model and analyze WCCT in the training and inference phases of the ML paradigm. Building upon the mathematical tools of EVT, we propose a practical framework to predict the worst-case timing properties of ML. Over a set of linear ML training algorithms, we show that EVT achieves better accuracy for predicting WCCTs than relevant statistical methods such as the Bayes factor. On a set of larger machine learning training algorithms and deep neural network inference tasks, we show the feasibility and usefulness of EVT models in accurately predicting WCCTs, their expected return periods, and their likelihood.
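The block-maxima workflow the abstract describes can be sketched as follows; the generalized extreme value (GEV) fit uses scipy.stats.genextreme, and the block size, return period, and synthetic timings are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genextreme

def wcct_return_level(times, block=50, period=1000):
    """Fit a GEV to block maxima and return the level expected to be
    exceeded once every `period` blocks."""
    times = np.asarray(times)
    n_blocks = len(times) // block
    maxima = times[: n_blocks * block].reshape(n_blocks, block).max(axis=1)
    shape, loc, scale = genextreme.fit(maxima)  # maximum-likelihood GEV fit
    return genextreme.ppf(1 - 1 / period, shape, loc=loc, scale=scale)

# illustrative usage with synthetic heavy-tailed timings
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
print(f"predicted WCCT (one-in-1000-block event): {wcct_return_level(samples):.2f}")
```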