In the past few years, malware classification techniques have shifted from shallow traditional machine learning models to deeper neural network architectures. The main benefit of the latter is their ability to work with raw data, thanks to automatic feature extraction, so less technical expertise and less initial pre-processing effort are needed to build the models. Nevertheless, this advantage comes with a drawback: deep learning models require huge quantities of data to generalize well, and the amount of data needed to train a deep network without overfitting is often unobtainable for malware analysts. We take inspiration from image-based data augmentation techniques and apply a sequence of semantics-preserving syntactic code transformations (obfuscations) to a small dataset of programs to generate a larger dataset. We then design two learning models, a convolutional neural network and a bi-directional long short-term memory, and train them on images extracted from compiled binaries of the newly generated dataset. Through transfer learning we then take the features learned from the obfuscated binaries and train the models on two state-of-the-art malware datasets, each containing around 10 000 samples. Our models achieve up to 98.5% accuracy on the test set, which is on par with or better than current state-of-the-art approaches, thus validating the approach.
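A minimal sketch, under assumptions made for this example (fixed row width, zero padding, one grayscale pixel per byte), of how a compiled binary can be rendered as an image of the kind such CNN or Bi-LSTM models could consume; this is not the paper's exact pipeline.

    # Turn a compiled binary into a 2D grayscale image, one pixel per byte (assumed layout).
    import numpy as np

    def binary_to_image(path, width=256):
        """Read a binary file and reshape its bytes into a width-column grayscale image."""
        data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
        rows = -(-len(data) // width)                     # ceiling division
        padded = np.zeros(rows * width, dtype=np.uint8)   # pad the last row with zeros
        padded[:len(data)] = data
        return padded.reshape(rows, width)                # pixel values 0-255, one per byte

    # Example: img = binary_to_image("sample.exe"); img.shape -> (rows, 256)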
Learning metamorphic malware signatures from samples
Campion, Marco; Dalla Preda, Mila; Giacobazzi, Roberto
Journal of Computer Virology and Hacking Techniques, 09/2021, Volume 17, Issue 3
Journal Article
Peer reviewed
Open access
Metamorphic malware are self-modifying programs which apply semantics-preserving transformations to their own code in order to foil detection systems based on signature matching. Metamorphism impacts both software security and code protection technologies: it is used by malware writers to evade detection systems based on pattern matching and by software developers to prevent malicious host attacks through software diversification. In this paper, we consider the problem of automatically extracting metamorphic signatures from the analysis of metamorphic malware variants. We define a metamorphic signature as an abstract program representation that ideally captures all the possible code variants that might be generated during the execution of a metamorphic program. For this purpose, we developed MetaSign: a tool that takes as input a collection of metamorphic code variants and produces, as output, a set of transformation rules that could have been used to generate the considered metamorphic variants. MetaSign starts from a control flow graph representation of the input variants and agglomerates them into an automaton which over-approximates the considered code variants. The approximation process is based on the concept of widening automata, while the semantics-preserving transformation rules used by the metamorphic program can be viewed as rewriting rules and modeled as grammar productions. In this setting, the grammar recognizes the language of code variants, while the production rules model the metamorphic transformations. In particular, we formalize the language of code variants in terms of pure context-free grammars, which are similar to context-free grammars but have no terminal symbols. After the widening process, we create a positive set of samples from which we extract the productions of the grammar by applying a grammar learning technique. This allows us to learn the transformation rules used by the metamorphic engine to generate the considered code variants. We validate the results of MetaSign on some case studies.
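To make the grammar view concrete, here is a small sketch of semantics-preserving rewriting rules treated as productions mapping an instruction sequence to equivalent variants; the rule set and instruction syntax are invented for illustration and are not the rules MetaSign learns.

    # Hypothetical rewriting rules: each left-hand side can be replaced by any right-hand side.
    RULES = {
        ("mov eax, 0",): [("xor eax, eax",), ("sub eax, eax",)],      # constant-zeroing variants
        ("push eax",):   [("sub esp, 4", "mov [esp], eax")],          # push expansion
    }

    def variants(code, depth=1):
        """Enumerate code variants reachable with at most `depth` rounds of rule applications."""
        results = {tuple(code)}
        for _ in range(depth):
            new = set()
            for prog in results:
                for i in range(len(prog)):
                    for lhs, rhss in RULES.items():
                        if prog[i:i + len(lhs)] == lhs:
                            for rhs in rhss:
                                new.add(prog[:i] + rhs + prog[i + len(lhs):])
            results |= new
        return results

    # variants(("mov eax, 0", "push eax")) yields the original plus three rewritten forms.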
We introduce the notion of local completeness in abstract interpretation and define a logic for proving both the correctness and incorrectness of some program specification. Abstract interpretation is extensively used to design sound-by-construction program analyses that over-approximate program behaviours. Completeness of an abstract interpretation A for all possible programs and inputs would be an ideal situation for verifying correctness specifications, because the analysis can be done compositionally and no false alert will arise. Our first result shows that the class of programs whose abstract analysis on A is complete for all inputs has a severely limited expressiveness. A novel notion of local completeness weakens the above requirements by considering only some specific, rather than all, program inputs and thus finds wider applicability. In fact, our main contribution is the design of a proof system, parameterized by an abstraction A, that, for the first time, combines over- and under-approximations of program behaviours. Thanks to local completeness, in a provable triple ⊢_A P c Q, the assertion Q is an under-approximation of the strongest post-condition post_c(P) such that the abstractions in A of Q and post_c(P) coincide. This means that Q is never too coarse: under mild assumptions, the abstract interpretation of c does not yield false alerts for the input P iff Q has no alert. Thus, ⊢_A P c Q not only ensures that all the alerts raised in Q are true ones, but also that if Q does not raise alerts then c is correct.
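The following toy sketch spells out the local-completeness condition behind such a triple on an interval abstraction over finite integer sets; the domain, the command x := x + 1, and the finite enumeration are all assumptions made for this illustration.

    def alpha(s):                         # interval hull of a finite, non-empty set of integers
        return (min(s), max(s))

    def gamma_points(iv):                 # integer concretization of an interval
        return set(range(iv[0], iv[1] + 1))

    def post(c, s):                       # concrete strongest post-condition on a finite set
        return {c(x) for x in s}

    def abstract_post(c, iv):             # best abstract transformer: alpha . post . gamma
        return alpha(post(c, gamma_points(iv)))

    c = lambda x: x + 1                   # assumed example command: x := x + 1
    P = {1, 3}
    locally_complete = alpha(post(c, P)) == abstract_post(c, alpha(P))   # True: both sides are (2, 4)
    # Q = {2, 4} is then an under-approximation of post_c(P) whose abstraction coincides with it,
    # as required in a provable triple |-_A P c Q.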
History of Abstract Interpretation
Giacobazzi, Roberto; Ranzato, Francesco
IEEE Annals of the History of Computing, 04/2022, Volume 44, Issue 2
Journal Article
Peer reviewed
Open access
We trace the roots of abstract interpretation and its role as a foundational principle to understand and design static program analysis and verification methods. Starting from the historical roots of formal methods and static program analysis, we show how abstract interpretation evolved and influenced the way we reason about program correctness in different programming languages and how this method shaped the literature and the practice in program analysis in the last 45 years.
Abstract interpretation is a well-known and extensively used method to extract over-approximate program invariants by a sound program analysis algorithm. Soundness means that no program errors are lost and it is, in principle, guaranteed by construction. Completeness means that the abstract interpreter reports no false alarms for all possible inputs, but this is extremely rare because it needs a very precise analysis. We introduce a weaker notion of completeness, called local completeness, which requires that no false alarms are produced only relatively to some fixed program inputs. Based on this idea, we introduce a program logic, called Local Completeness Logic for an abstract domain A, for proving both the correctness and incorrectness of program specifications. Our proof system, which is parameterized by an abstract domain A, combines over- and under-approximating reasoning. In a provable triple ⊢_A p c q, c is a program and q is an under-approximation of the strongest post-condition of c on input p such that their abstractions in A coincide. This means that q is never too coarse: under some mild assumptions, the abstract interpretation of c does not yield false alarms for the input p iff q has no alarm. Therefore, proving ⊢_A p c q not only ensures that all the alarms raised in q are true ones, but also that if q does not raise alarms, then c is correct. We also prove that if A is the straightforward abstraction making all program properties equivalent, then our program logic coincides with O’Hearn’s incorrectness logic, while for any other abstraction, contrary to the case of incorrectness logic, our logic can also establish program correctness.
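A companion sketch (the same toy interval helpers as above, with an assumed squaring command) of the failure case the logic is designed to expose: when local completeness does not hold on an input, the analysis reports values that no under-approximation q of the concrete result can account for, which is exactly where false alarms may hide.

    def alpha(s): return (min(s), max(s))                  # interval hull of a finite set
    def gamma(iv): return set(range(iv[0], iv[1] + 1))     # integer concretization
    def post(c, s): return {c(x) for x in s}               # concrete strongest post-condition

    square = lambda x: x * x                               # assumed example command: x := x * x
    p = {-2, 2}
    print(alpha(post(square, p)))                 # (4, 4): the concrete result is exactly {4}
    print(alpha(post(square, gamma(alpha(p)))))   # (0, 4): the interval analysis also admits 0..3
    # No under-approximation q of {4} has abstraction (0, 4), so local completeness fails on p
    # and the extra values 0..3 reported by the analysis are potential false alarms.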
Robustness is a key and desirable property of any classifying system, in particular to avoid the ever-rising threat of adversarial attacks. Informally, a classification system is robust when the result is not affected by perturbations of the input. This notion has been extensively studied, but little attention has been dedicated to how the perturbation affects the classification. The interference between perturbation and classification can manifest in many different ways, and its understanding is the main contribution of the present article. Starting from a rigorous definition of a standard notion of robustness, we build a formal method for accommodating the required degree of robustness, depending on the amount of error the analyst may accept in the classification result. Our idea is to precisely model this error as an abstraction. This leads us to define weakened forms of robustness also in the context of programming languages, particularly in language-based security, e.g., information-flow policies, and in program verification. The latter is possible by moving from a quantitative (standard) model of perturbation to a novel qualitative model, given by means of the notion of abstraction. As in language-based security, we show that it is possible to confine adversities, that is, to characterize the degree of perturbation (and/or the degree of class generalization) for which the classifier may be deemed adequately robust. We conclude with an experimental evaluation of our ideas, showing how weakened forms of robustness apply to state-of-the-art image classifiers.
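A hedged sketch of the two notions the article contrasts, for a hypothetical classifier predict and a sampled L-infinity neighbourhood (the sampling, the clipping range, and the label abstraction are all assumptions of this example, and sampling gives no formal guarantee): standard robustness asks that the exact label never change, while a weakened form only asks that the label stay within an abstract equivalence class.

    import numpy as np

    def is_robust(predict, x, epsilon, abstract=lambda label: label, samples=1000, seed=0):
        """Empirically check that abstract(predict(.)) is constant on sampled perturbations of x."""
        rng = np.random.default_rng(seed)
        target = abstract(predict(x))
        for _ in range(samples):
            delta = rng.uniform(-epsilon, epsilon, size=x.shape)      # a point in the L-infinity ball
            if abstract(predict(np.clip(x + delta, 0.0, 1.0))) != target:
                return False                                          # the perturbation is visible to this observer
        return True

    # is_robust(model, image, 0.01)                          -> standard robustness: the exact label must not change
    # is_robust(model, image, 0.01, abstract=superclass_of)  -> weakened robustness up to a (hypothetical) label abstraction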
Abstract Non-Interference
Giacobazzi, Roberto; Mastroeni, Isabella
ACM Transactions on Privacy and Security, 05/2018, Volume 21, Issue 2
Journal Article
Peer reviewed
Non-interference happens when some elements of a dynamic system do not interfere with, i.e., do not affect, other elements in the same system. Originally introduced in language-based security, non-interference means that the manipulation of private information has no effect on public observations of data. In this article, we introduce abstract non-interference as a weakening of non-interference by abstract interpretation. Abstract non-interference is parametric on which private information we want to protect and on the observational capabilities of the external observer, i.e., what the attacker can observe of a computation and of the data manipulated during the computation. This allows us to model a variety of situations in information-flow security, where the security of a system can be mastered by controlling the degree of precision of the strongest harmless attacker and the properties that are potentially leaked in case of a successful attack.
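A toy rendering of the idea, with an assumed two-input program and a parity observer (both invented for this illustration): the exact value of the public output depends on the private input, yet an attacker who can only observe parity learns nothing, so the system is secure relative to that weaker observer.

    def program(l, h):                    # assumed example: public input l, private input h
        return l + 2 * h                  # public output

    parity = lambda v: v % 2              # the observer's abstraction of public values

    def ani_holds(program, observe, public_inputs, private_inputs):
        """Check that what the observer sees of the output is independent of the private input."""
        for l in public_inputs:
            seen = {observe(program(l, h)) for h in private_inputs}
            if len(seen) > 1:
                return False              # varying the secret is visible through the abstraction
        return True

    # ani_holds(program, parity, range(4), range(4))       -> True: a parity observer learns nothing about h
    # ani_holds(program, lambda v: v, range(4), range(4))  -> False: an exact observer sees the flow from h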
Imprecision is inherent in any decidable (sound) approximation of undecidable program properties. In abstract interpretation this corresponds to the release of false alarms, e.g., when it is used for program analysis and program verification. As with all alarming systems, a program analysis tool is credible when few false alarms are reported. As a consequence, we have to live with false alarms, but we also need methods to control them. As with all approximation methods, in abstract interpretation we need to estimate the imprecision accumulated during program analysis. In this paper we introduce a theory for estimating the error propagation in abstract interpretation, and hence in program analysis. We enrich abstract domains with a weakening of a metric distance. This enriched structure keeps coherence between the standard partial order, which relates approximated objects by their relative precision, and the effective error made in this approximation. An abstract interpretation is precise when it is complete. We introduce the notion of partial completeness as a weakening of precision. In partial completeness the abstract interpreter may produce a bounded number of false alarms. We prove the key recursive properties of the class of programs for which an abstract interpreter is partially complete with a given bound of imprecision. Then, we introduce a proof system for estimating an upper bound on the error accumulated by the abstract interpreter during program analysis. Our framework is general enough to be instantiated to most known metrics for abstract domains.
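A toy sketch of the kind of imprecision estimate such a theory bounds, over an interval domain on finite integer sets, with a counting measure standing in for the distance (the domain, the measure, and the squaring command are all assumptions of this example):

    def alpha(s): return (min(s), max(s))                  # interval hull of a finite set
    def gamma(iv): return set(range(iv[0], iv[1] + 1))     # integer concretization
    def post(c, s): return {c(x) for x in s}               # concrete strongest post-condition

    def imprecision(c, P):
        exact  = alpha(post(c, P))                         # abstraction of the true result
        coarse = alpha(post(c, gamma(alpha(P))))           # result of the interval analysis
        return len(gamma(coarse) - gamma(exact))           # number of spurious values (potential false alarms)

    square = lambda x: x * x                               # assumed example command: x := x * x
    print(imprecision(square, {-3, 3}))                    # 9: values 0..8 are reported but unreachable
    # Roughly, the analysis is partially complete on this input with bound k whenever this error stays below k.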
In abstract interpretation-based static analysis, approximation is encoded by abstract domains. They provide systematic guidelines for designing abstract semantic functions that approximate some concrete system behaviors under analysis. It may happen that an abstract domain contains redundant information for the specific purpose of approximating a given concrete semantic function. This paper introduces the notion of correctness kernel of an abstract interpretation, a methodology for simplifying abstract domains, i.e., removing abstract values from them, in a maximal way while retaining exactly the same approximate behavior of the system under analysis. We show that in abstract model checking correctness kernels provide a simplification paradigm of the abstract state space that is guided by examples, meaning that this simplification preserves the spuriousness of examples (i.e., abstract paths). In particular, we show how correctness kernels can be integrated with the well-known CEGAR (CounterExample-Guided Abstraction Refinement) methodology.
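A brute-force toy of the specification above, on a tiny chain of sets with an assumed filtering function; this is not the paper's construction, it merely spells out "remove as many abstract values as possible while the induced best correct approximation of f stays the same".

    from itertools import chain, combinations

    U = frozenset({0, 1, 2, 3})
    A = [frozenset(s) for s in [set(), {0}, {0, 1}, {0, 1, 2}, {0, 1, 2, 3}]]   # a chain of abstract values
    f = lambda S: S & {0}                                                       # assumed concrete function: keep only element 0

    def alpha(D, S):                                   # least element of the (chain) domain D containing S
        return min((d for d in D if S <= d), key=len)

    def best(D, S):                                    # gamma . f# . alpha applied to the concrete set S
        return alpha(D, f(alpha(D, S)))

    powerset = lambda xs: chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))
    concrete = [frozenset(s) for s in powerset(U)]     # every concrete property over the universe

    kernel = min((list(D) for D in powerset(A)
                  if U in D and all(best(D, S) == best(A, S) for S in concrete)),
                 key=len)
    print([set(d) for d in kernel])                    # [set(), {0}, {0, 1, 2, 3}]: three values suffice for this f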