Software (De-)Obfuscation: How Good Is It?

Alexander Pretschner, Technische Universität München, Germany

joint work with Sebastian Banescu


Software obfuscation aims at hiding data, code, or logic. Examples of data worth obfuscating include license keys and cryptographic keys. Code is obfuscated to prevent the easy detection and subsequent disabling of license checks or of runtime integrity checkers that verify whether the code has been tampered with. Finally, algorithms or logic often constitute intellectual property (IP) that the owner does not want to divulge publicly. There is a plethora of software obfuscation techniques, which are described in detail in a forthcoming article [BP17].
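As a small illustration of what hiding a license check can look like, the following sketch contrasts a plain check with a version that encodes its constant and adds an opaque predicate. The function names, the key "ABCD", and the XOR mask are hypothetical and chosen only for this example; it is not taken from [BP17] and is far weaker than any real obfuscator.

```python
def check_plain(key: str) -> bool:
    # The comparison against a literal is easy to find and to patch out.
    return key == "ABCD"


def check_obfuscated(key: str) -> bool:
    # Encoded constant: each character of the hypothetical key "ABCD"
    # XORed with 0x57, so the literal no longer appears in the code.
    encoded = [0x16, 0x15, 0x14, 0x13]
    n = len(key)
    # Opaque predicate: n * (n + 1) is always even, so this branch is dead.
    # It only exists to mislead a reader or a naive analysis.
    if (n * (n + 1)) % 2 != 0:
        return True
    if n != len(encoded):
        return False
    return all((ord(c) ^ 0x57) == e for c, e in zip(key, encoded))


for candidate in ["ABCD", "ABCE", "", "abcd"]:
    # Both versions must agree: obfuscation preserves functionality.
    assert check_plain(candidate) == check_obfuscated(candidate)
```

Both functions compute the same result, but the second one no longer contains the searchable string and includes a branch that can never be taken.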

At first glance, one may be tempted to argue that the problem can simply be solved by encrypting the relevant pieces of (machine) code and data. This, however, does not solve the problem because code and data need to be decrypted in order to be executed or used. They must therefore exist as plaintext at some moment in time, and at that moment an attacker can read them. Moreover, this leads to the problem of protecting the key itself, which requires a root of trust. Such roots of trust can be implemented in software (e.g., white-box cryptography) or in hardware (e.g., Intel SGX or a TPM), both of which suffer from various disadvantages that we cannot discuss here.
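The limitation of plain encryption can be made concrete with a minimal sketch. The toy XOR cipher, the secret, and the wrapping key below are assumptions made purely for illustration; a real product would use a proper cipher, but the structural problem would be the same.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the repeating key (toy cipher, not secure)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# Hypothetical secret and wrapping key, chosen only for this illustration.
SECRET = b"LICENSE-1234"
WRAP_KEY = b"\x5a\xc3"

# What would ship with the program: the secret is not stored as plaintext.
stored = xor_bytes(SECRET, WRAP_KEY)


def use_secret() -> bytes:
    # To use the secret, the program must first recover the plaintext.
    plaintext = xor_bytes(stored, WRAP_KEY)
    # At this point an attacker with a debugger can read `plaintext`,
    # and WRAP_KEY itself must also be reachable by the program.
    return plaintext
```

The sketch shows both issues named above: the plaintext necessarily appears in memory when it is used, and the key that protects it needs protection in turn.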

The goal of software obfuscation is to defend against so-called Man-At-The-End (MATE) attackers. These are attackers that have access to code in binary or source code format and, in principle, can make use of unlimited resources in order to de-obfuscate a piece of software. In practice, of course, resources are not unlimited, and the effort that attackers invest will depend on the anticipated gains of doing so.

The “effort” to de-obfuscate a piece of software is hard to characterize. Intuitively, one would expect it to correlate strongly with the “quality” of the obfuscation strategy that was applied, which means that the quality of obfuscation and the quality of de-obfuscation are dual concepts. Collberg, Thomborson and Low [CTL97] suggest considering two aspects of the quality of obfuscation: potency and resilience. Potency is the ability of an obfuscated piece of software to resist de-obfuscation by a human attacker. Resilience, in contrast, characterizes the resistance against automated attackers. Collberg et al. recognized that resistance is relative to the resources (that is, cost) that an attacker is willing to spend on a de-obfuscation attack.

Unfortunately, it is not clear how to characterize the power of an obfuscation transformation in a practical way. There are several essentially probabilistic, rarely complexity-theoretic, characterizations that give rise to the beginning of a theory of obfuscation. However, these rather theoretical characterizations tend to be of limited value to a practitioner because they do not state, for a given obfuscator, precisely how potent or resilient it is.

Interestingly, the situation is similar for the strength of cryptographic encryption. There are a few studies that estimate how long it will take to break a key of a specified length, but we are, again from a practitioner’s rather than a theoretician’s perspective, not aware of hard lower bounds on the effort needed by a clever attacker to break a specific cipher. (One may indeed see cryptography as a special case of obfuscation. Both take (1) an original artifact, which is code or data in the case of obfuscation and plaintext data in the case of cryptography, and (2) a set of parameters, which is a set of transformations, their ordering, and transformation-specific parameters in the case of obfuscation and the cryptographic key in the case of cryptography. Obfuscation and encryption transformations then yield obfuscated and encrypted artifacts, respectively. De-obfuscation and decryption are the respective inverse transformations.)
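The analogy between obfuscation and encryption can be sketched as one shared interface: an artifact plus parameters goes in, a transformed artifact comes out, and the inverse undoes the steps in reverse order. All names and the toy byte-level steps below are hypothetical; real obfuscation works on code rather than on byte strings, so this only illustrates the structure of the argument.

```python
from typing import Callable, List, Tuple

# A step is a (forward, inverse) pair of byte-level transformations.
Step = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]


def xor_step(key: bytes) -> Step:
    f = lambda d: bytes(b ^ key[i % len(key)] for i, b in enumerate(d))
    return (f, f)  # XOR with the same key is its own inverse


def add_step(offset: int) -> Step:
    return (lambda d: bytes((b + offset) % 256 for b in d),
            lambda d: bytes((b - offset) % 256 for b in d))


def reverse_step() -> Step:
    f = lambda d: d[::-1]
    return (f, f)


def transform(artifact: bytes, params: List[Step]) -> bytes:
    # Apply the forward transformations in the given order.
    for forward, _ in params:
        artifact = forward(artifact)
    return artifact


def invert(artifact: bytes, params: List[Step]) -> bytes:
    # Apply the inverse transformations in reverse order.
    for _, inverse in reversed(params):
        artifact = inverse(artifact)
    return artifact


original = b"secret data"
# "Encryption": the parameters consist of a single key.
encryption_params = [xor_step(b"k3y")]
# "Obfuscation": the parameters are a set of transformations, their
# ordering, and transformation-specific values.
obfuscation_params = [reverse_step(), add_step(7), xor_step(b"\x2a")]
```

In both cases, `invert(transform(original, params), params)` returns the original artifact; the two settings differ only in what the parameters consist of.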

In terms of potency, it seems naturally hard to make solid statements about the quality of the respective obfuscation transformation because human ingenuity is hard to predict. When considering fixed automated attacker models, the situation improves slightly. It is then possible to apply a specific automated attacker to a sample set of de-obfuscation problems and use machine learning to build corresponding prediction models. In recent work [BCP17], we have shown that it is possible, in the context of the study presented in that paper, to predict the time for attacks based on symbolic execution [BCG+16] with an accuracy of >90% for >80% of the considered programs. This study clearly suffers from several threats to external validity, most notably the dependence on the specific attack technology (symbolic execution) that we use. However, this seems to be the nature of the beast, and we do see hope that there may be only a limited number of realistic automated attacks that could, and should, be studied in a way similar to our work.
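The general idea of learning a prediction model for attack time can be sketched as follows. This is not the method or the data of [BCP17]: it swaps in a plain one-variable least-squares fit, uses a single hypothetical program feature, and runs on synthetic numbers invented for illustration. It also assumes one possible reading of "accuracy", namely a relative prediction error below a tolerance.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fraction_within(predicted: List[float], actual: List[float],
                    tolerance: float) -> float:
    """Share of programs whose relative prediction error is <= tolerance."""
    hits = sum(1 for p, a in zip(predicted, actual)
               if abs(p - a) <= tolerance * a)
    return hits / len(actual)


# SYNTHETIC, made-up numbers for illustration only (not from [BCP17]):
# a hypothetical per-program feature and the measured attack time in seconds.
train_x = [10.0, 20.0, 30.0, 40.0, 50.0]
train_t = [33.0, 67.0, 92.0, 128.0, 151.0]
test_x = [15.0, 25.0, 35.0, 45.0]
test_t = [52.0, 78.0, 113.0, 160.0]

slope, intercept = fit_line(train_x, train_t)
predicted = [slope * x + intercept for x in test_x]
share = fraction_within(predicted, test_t, tolerance=0.10)
print(f"share of test programs predicted within 10%: {share:.2f}")
```

A statement of the form "accuracy of >90% for >80% of the programs" would, under this reading, correspond to `share` exceeding 0.8 at a tolerance of 0.10.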

In the future, we consider it of utmost importance to characterize the de-obfuscation problem technically, probably as a search problem, and to use this characterization as a basis for a systematic study, and understanding, of the “quality” of obfuscation. We believe that one of the most pressing questions in software obfuscation is one for which there are only very partial answers today: from a practitioner’s perspective, how good is obfuscation? Can we provide estimates of the cost (the resources needed by the attacker) to de-obfuscate, similar to the way physical safes are assessed by the time that an attacker with a standard set of tools needs to break them [UL11]?
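One very simple way to read "de-obfuscation as a search problem" is sketched below: the attacker treats a check as a black box, enumerates candidate inputs, and the number of candidates tried serves as a crude cost measure. The check, the alphabet, and the secret are hypothetical, and exhaustive enumeration is only one of many possible search strategies.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def make_check(secret: str) -> Callable[[str], bool]:
    # Hypothetical black-box check; the attacker can only call it.
    return lambda candidate: candidate == secret


def search(check: Callable[[str], bool], alphabet: str,
           length: int) -> Tuple[Optional[str], int]:
    """Enumerate all strings of the given length; return (hit, attempts)."""
    attempts = 0
    for letters in product(alphabet, repeat=length):
        attempts += 1
        candidate = "".join(letters)
        if check(candidate):
            return candidate, attempts
    return None, attempts


found, cost = search(make_check("cab"), alphabet="abc", length=3)
print(f"found {found!r} after {cost} attempts "
      f"(search space: {len('abc') ** 3} candidates)")
```

Under such a characterization, the quality of a protection could be expressed as the expected number of attempts a given search strategy needs, in analogy to the time ratings used for physical safes.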

Finally, the title of this brief statement deliberately plays with the meaning of “good.” So far, we have considered obfuscation technology to be “good” if it raises the bar for the attacker, and have equated quality with the effort an automated attack needs to de-obfuscate. A second relevant meaning of “good” pertains to the moral, or ethical, dimension. Hiding code or data may not be considered “good” by the proponents of open source software or by opponents of digital rights management technology, and it certainly is not good if the technology is used by malware to evade detection.

References

[BCG+16] Banescu, Collberg, Ganesh, Newsham, Pretschner: Code Obfuscation Against Symbolic Execution Attacks. Proc. ACSAC, pp. 189-200, 2016

[BCP17] Banescu, Collberg, Pretschner: Predicting the Resilience of Obfuscated Code Against Symbolic Execution Attacks via Machine Learning. Proc. USENIX Security Symposium, to appear, 2017

[BP17] Banescu, Pretschner: A Tutorial on Software Obfuscation. Advances in Computers, to appear, 2017

[CTL97] Collberg, Thomborson, Low: A Taxonomy of Obfuscation Transformations. Technical Report #148, The University of Auckland, 1997

[UL11] UL 687 Standard for Burglary-Resistant Safes. Underwriters Laboratories, 2011. https://standardscatalog.ul.com/standards/en/standard_687