10 Aug. 2012: Detecting Android Sandboxes
Detecting Android Sandboxes
Felix Matenaar and Patrick Schulz
team@dexlabs.org
0. Outline
1. Introduction
2. Motivation
3. Qemu BT Architecture
4. Proof of Concept
5. Conclusion
6. References
1. Introduction
In february Google started protecting the Android market using
the Google bouncer. This tool is used to detect malicious behavior of new
applications prior to publishing them in the Android market. It is mainly based
on dynamic analysis. This means it executes the analysis target in an
instrumented emulator environment which is based on the virtualization software
Qemu. As Jon Oberheide already pointed out, the Google bouncer is detectable and
can therefor be circumvented by malware [1]. There are also other dynamic
analysis frameworks using Qemu like Andrubis, Taintdroid and DroidBox. The main
reason for the decision to use Qemu might be that it is used in the Android
Emulator and is therefor a well suited environment to build new dynamic analysis
tools. In this blog post we will show why using Qemu as a basis for runtime
behavior analysis has significant disadvantages and why malware can circumvent
this kind of sandboxing technique regardless of the actual system environment.
This means that in addition to the work of Jon Oberheide, our detection method
currently works with all binary translation based Qemu setups.
2. Motivation
Detection of environments used for dynamic analysis has been a
topic since several years. Especially for sandboxes built by the antivirus
industry it is a problem, when malware detects such environments and as a consequence
pretending legitimate behavior. Detection methods have also been used for code
protection purposes, e.g when preventing automation assisted reverse engineering
using dynamic analysis. In order to push malware into the Android market, preventing
the Google bouncer from detecting the malicious behavior is useful.
As a consequence the antivirus industry should be aware of
methods used by malware to circumvent dynamic analysis in order to adapt
their products. Basically one can differentiate between two approaches which
are used when dynamic analysis detection is done. The first approach is located
on the system environment layer. As an example Jon Oberheide has shown how to
detect the Google bouncer. One of his methods included testing of the
presence/absence of the file "/sys/qemu_trace". The existence of this file is
one hint that the application is executed in the bouncer. This is considered as
fingerprinting the system environment. The advantage lies in the accuracy.
However one disadvantage is that this method is specific to the Android bouncer
and does not solve the detection problem for other dynamic analysis environments
like andrubis [6]. In addition fingerprinting approaches can often be mitigated by adjusting
the system environment to not match the fingerprint anymore.
Another approach relies on detecting the virtualization/emulation technology used for
dynamic analysis. Joanna Rutkowska [7] has shown a method to detect x86 hardware
virtualization some years ago. Using this method, malware could detect hardware
virtualization based analysis environments. Equivalent techniques are known for
products like VMware.
The approach that is explained here can be categorized in the second group and
is capable of detecting Qemu fully-emulated environments based on binary
translation.
3. The Qemu Binary Translation Architecture
Wikipedia [2] offers the following explanation for Binary Translation (BT):
In computing, binary translation is the emulation of one instruction set by another
through translation of code. Sequences of instructions are translated from the source
to the target instruction set.
Qemu can use this technique to translate the native
code running in the guest system, e.g. ARM, on the host system with the same or a
different instruction set like x86 in our example use case. Comparing BT with instruction
set emulation like implemented in Bochs [3], in practice it shows to be a lot faster since
already translated code blocks can be cached and directly executed natively
afterwards. This results in a significant performance gain. On the other hand BT
preserves the capability for debugging and code instrumentation and can even be
faster than hardware assisted virtualization in cases where instruction or
memory tracing is implemented in a granular manner. Using BT the huge amount of
context switches that would be necessary can be reduced using the static
analyzer of the translation process. For hardware virtualization such a static
analyzer would have to be implemented in the first place. Otherwise single
stepping would be necessary which produces a lot of overhead. As long as the
sandboxing host systems don't use the ARM architecture, they are forced to either use
instruction set emulation (which is very slow) or binary translation to analyze
ARM based environments. Supporting ARM is currently necessary since there are
Android applications shipped with native ARM code only. Chad Kersey [4] gave a
presentation about Qemu internals for the linux users group at Georgia Tech.
They are a good starting point for further information. Basically guest code
execution boils down to the following chart.
The already translated code is executed until the next branch instruction is reached.
Since the branch target address is available at the end of the basic block
execution, a cache lookup is done. In case of a hit, the basic block starting at
the branch target has already been translated and the code is executed.
Otherwise the basic block at the branch target has to be translated prior to its
execution. The block translation function is then called which iterates over the
instructions and translates them to an equivalent set of instructions in the
target instruction set. This block translation ends after the first branch
instruction. Now the branch target code is pushed into the cache and is
available for execution.
Now we are going to explain the core aspect of our detection method: A
physical CPU increases the program counter after each instruction such that the
program counter is always up to date. Since the registers in the translated code are emulated
in order to keep the program counter up to date after each instruction, the
translator would have to increase the virtual program counter. This would results in
at least one additional program counter increase for each source code instruction. However
since the translated target code is executed natively, a correct virtual program counter
is only necessary in cases where an instruction from the source code accesses
it. Qemu handles these cases but otherwise does not update the program counter
in the virtual CPU register as part of an optimization. As a consequence the
virtual program counter often just points to the start of a basic block since it
is updated after every branch.
Now we ask you: What does this mean to multitasking of the operating system
in our emulated environment which is nowadays built upon timer interrupts that can occur
after the execution of any instruction? Since the virtual program counter does
not need to be correct as shown in the previous section, the virtual CPU must
not be interrupted during the execution of a basic block since it would be impossible
to reconstruct the program counter. As a consequence the operating system scheduling in
a Qemu fully-emulated environment using binary translation does not occur during a basic
blocks. Our argumentation is revisited in the following picture.
4. Proof of Concept
As a summary, the key idea is about doing a scheduling behavior measurement of
the operating system in the virtualized environment. In our PoC this is
done by creating a histogram over the addresses where a thread is scheduled.
In a non-emulated environment the histogram will show a nearly equally
distribution over the addresses where scheduling occurs. This means our
thread is scheduled at any address by an equal probability. However in the
Qemu setup we described this is not the case. We observe that scheduling only
occurs after execution of a complete basic block. Therefor our distribution
vastly differs from an equally one. Our PoC basically detects possible
scheduling points. The measurement is achieved using two threads.
The first thread loops over a single basic block that subsequently increments a
global variable during execution. At the beginning of each loop iteration, this
variable is set to a constant value 1. The other thread is also executed in a
loop where in each iteration the global variable to which the first thread writes
to is read. A histogram is generated over the occurrences of each value of this
variable that has been read during an iteration. The following picture
illustrates how the histogram is generated.
If the first thread is scheduled at some point during execution of the basic
block and the second thread is continued, the global variable will store a value
representing the progress which the first thread made in the basic block
subsequently increasing it. Therefor our global variable represents the
addresses where the first thread is scheduled. This measurement step is iterated
such that a histogram is generated representing the distribution of scheduling
points. If the variance between these scheduling points is too high an emulated
environment is detected. Otherwise we have a negative. Both cases are shown in
the following screenshots.
We evaluated our PoC against a commonly used sandbox named andrubis [6].
Unfortunately the screenshot during their dynamic analysis was taken at the beginning
of the application run which prevented us from showing the results in it.
However in the following picture we can see the values read by the second thread
which clearly shows that the scheduling event only appears after execution of the whole
basic block - always 0x20. Also the variance underlined at the bottom of the picture
points to the fact that the application is running in an Qemu BT emulated environment.
5. Conclusion
We have shown how Qemu binary translation can be detected as long as
used in full system emulation mode. The advantage of our method relies to the
fact that we gain information leaked by the core design of the Qemu binary
translator. The difference to prior detection methods relying on a specific
system environment information leak is based on the fact that it is more
generic, meaning Qemu specific and not $product specific, and harder to fix. As
a limitation our method will not detect dynamic analysis which is built based on
hardware assisted virtualization such as implemented on x86 and the upcoming
hardware assisted ARM virtualization technology [5].
6. References
[1] https://blog.duosecurity.com/2012/06/dissecting-androids-bouncer/
[2] http://en.wikipedia.org/wiki/Binary_translation
[3] http://bochs.sourceforge.net
[4] http://lugatgt.org/content/qemu_internals/downloads/slides.pdf
[5] http://www.arm.com/products/processors/technologies/virtualization-extensions.php
[6] http://blog.iseclab.org/2012/06/04/andrubis-a-tool-for-analyzing-unknown-android-applications-2/
[7] http://theinvisiblethings.blogspot.de/
detect_bt_x86.tar.gz