10 Aug. 2012: Detecting Android Sandboxes
				Detecting Android Sandboxes
			     Felix Matenaar and Patrick Schulz
				    team@dexlabs.org

0. Outline
		
	1.		Introduction
	2.		Motivation 
	3.		Qemu BT Architecture 
	4.		Proof of Concept
	5.		Conclusion
        6.              References
		
		
1. Introduction 

	In february Google started protecting the Android market using
	the Google bouncer. This tool is used to detect malicious behavior of new
	applications prior to publishing them in the Android market. It is mainly based
	on dynamic analysis. This means it executes the analysis target in an
	instrumented emulator environment which is based on the virtualization software
	Qemu. As Jon Oberheide already pointed out, the Google bouncer is detectable and
	can therefor be circumvented by malware [1].  There are also other dynamic
	analysis frameworks using Qemu like Andrubis, Taintdroid and DroidBox. The main
	reason for the decision to use Qemu might be that it is used in the Android
	Emulator and is therefor a well suited environment to build new dynamic analysis
	tools. In this blog post we will show why using Qemu as a basis for runtime
	behavior analysis has significant disadvantages and why malware can circumvent
	this kind of sandboxing technique regardless of the actual system environment.
	This means that in addition to the work of Jon Oberheide, our detection method
	currently works with all binary translation based Qemu setups.
		
		
2. Motivation

	Detection of environments used for dynamic analysis has been a
	topic since several years. Especially for sandboxes built by the antivirus
	industry it is a problem, when malware detects such environments and as a consequence 
	pretending legitimate behavior. Detection methods have also been used for code
	protection purposes, e.g when preventing automation assisted reverse engineering
	using dynamic analysis. In order to push malware into the Android market, preventing
        the Google bouncer from detecting the malicious behavior is useful.
	As a consequence the antivirus industry should be aware of
	methods used by malware to circumvent dynamic analysis in order to adapt
	their products.  Basically one can differentiate between two approaches which
	are used when dynamic analysis detection is done. The first approach is located
	on the system environment layer. As an example Jon Oberheide has shown how to
	detect the Google bouncer. One of his methods included testing of the
	presence/absence of the file "/sys/qemu_trace". The existence of this file is
	one hint that the application is executed in the bouncer. This is considered as
	fingerprinting the system environment. The advantage lies in the accuracy.
	However one disadvantage is that this method is specific to the Android bouncer
	and does not solve the detection problem for other dynamic analysis environments
	like andrubis [6]. In addition fingerprinting approaches can often be mitigated by adjusting
        the system environment to not match the fingerprint anymore.
	Another approach relies on detecting the virtualization/emulation technology used for 
	dynamic analysis. Joanna Rutkowska [7] has shown a method to detect x86 hardware 
	virtualization some years ago. Using this method, malware could detect hardware 
	virtualization based analysis environments. Equivalent techniques are known for 
	products like VMware.

	The approach that is explained here can be categorized in the second group and
	is capable of detecting Qemu fully-emulated environments based on binary
	translation.
		
		
3. The Qemu Binary Translation Architecture 

	Wikipedia [2] offers the following explanation for Binary Translation (BT):
In computing, binary translation is the emulation of one instruction set by another 
through translation of code. Sequences of instructions are translated from the source 
to the target instruction set. 
Qemu can use this technique to translate the native code running in the guest system, e.g. ARM, on the host system with the same or a different instruction set like x86 in our example use case. Comparing BT with instruction set emulation like implemented in Bochs [3], in practice it shows to be a lot faster since already translated code blocks can be cached and directly executed natively afterwards. This results in a significant performance gain. On the other hand BT preserves the capability for debugging and code instrumentation and can even be faster than hardware assisted virtualization in cases where instruction or memory tracing is implemented in a granular manner. Using BT the huge amount of context switches that would be necessary can be reduced using the static analyzer of the translation process. For hardware virtualization such a static analyzer would have to be implemented in the first place. Otherwise single stepping would be necessary which produces a lot of overhead. As long as the sandboxing host systems don't use the ARM architecture, they are forced to either use instruction set emulation (which is very slow) or binary translation to analyze ARM based environments. Supporting ARM is currently necessary since there are Android applications shipped with native ARM code only. Chad Kersey [4] gave a presentation about Qemu internals for the linux users group at Georgia Tech. They are a good starting point for further information. Basically guest code execution boils down to the following chart. The already translated code is executed until the next branch instruction is reached. Since the branch target address is available at the end of the basic block execution, a cache lookup is done. In case of a hit, the basic block starting at the branch target has already been translated and the code is executed. Otherwise the basic block at the branch target has to be translated prior to its execution. The block translation function is then called which iterates over the instructions and translates them to an equivalent set of instructions in the target instruction set. This block translation ends after the first branch instruction. Now the branch target code is pushed into the cache and is available for execution. Now we are going to explain the core aspect of our detection method: A physical CPU increases the program counter after each instruction such that the program counter is always up to date. Since the registers in the translated code are emulated in order to keep the program counter up to date after each instruction, the translator would have to increase the virtual program counter. This would results in at least one additional program counter increase for each source code instruction. However since the translated target code is executed natively, a correct virtual program counter is only necessary in cases where an instruction from the source code accesses it. Qemu handles these cases but otherwise does not update the program counter in the virtual CPU register as part of an optimization. As a consequence the virtual program counter often just points to the start of a basic block since it is updated after every branch. Now we ask you: What does this mean to multitasking of the operating system in our emulated environment which is nowadays built upon timer interrupts that can occur after the execution of any instruction? Since the virtual program counter does not need to be correct as shown in the previous section, the virtual CPU must not be interrupted during the execution of a basic block since it would be impossible to reconstruct the program counter. As a consequence the operating system scheduling in a Qemu fully-emulated environment using binary translation does not occur during a basic blocks. Our argumentation is revisited in the following picture. 4. Proof of Concept As a summary, the key idea is about doing a scheduling behavior measurement of the operating system in the virtualized environment. In our PoC this is done by creating a histogram over the addresses where a thread is scheduled. In a non-emulated environment the histogram will show a nearly equally distribution over the addresses where scheduling occurs. This means our thread is scheduled at any address by an equal probability. However in the Qemu setup we described this is not the case. We observe that scheduling only occurs after execution of a complete basic block. Therefor our distribution vastly differs from an equally one. Our PoC basically detects possible scheduling points. The measurement is achieved using two threads. The first thread loops over a single basic block that subsequently increments a global variable during execution. At the beginning of each loop iteration, this variable is set to a constant value 1. The other thread is also executed in a loop where in each iteration the global variable to which the first thread writes to is read. A histogram is generated over the occurrences of each value of this variable that has been read during an iteration. The following picture illustrates how the histogram is generated. If the first thread is scheduled at some point during execution of the basic block and the second thread is continued, the global variable will store a value representing the progress which the first thread made in the basic block subsequently increasing it. Therefor our global variable represents the addresses where the first thread is scheduled. This measurement step is iterated such that a histogram is generated representing the distribution of scheduling points. If the variance between these scheduling points is too high an emulated environment is detected. Otherwise we have a negative. Both cases are shown in the following screenshots. We evaluated our PoC against a commonly used sandbox named andrubis [6]. Unfortunately the screenshot during their dynamic analysis was taken at the beginning of the application run which prevented us from showing the results in it. However in the following picture we can see the values read by the second thread which clearly shows that the scheduling event only appears after execution of the whole basic block - always 0x20. Also the variance underlined at the bottom of the picture points to the fact that the application is running in an Qemu BT emulated environment. 5. Conclusion We have shown how Qemu binary translation can be detected as long as used in full system emulation mode. The advantage of our method relies to the fact that we gain information leaked by the core design of the Qemu binary translator. The difference to prior detection methods relying on a specific system environment information leak is based on the fact that it is more generic, meaning Qemu specific and not $product specific, and harder to fix. As a limitation our method will not detect dynamic analysis which is built based on hardware assisted virtualization such as implemented on x86 and the upcoming hardware assisted ARM virtualization technology [5]. 6. References [1] https://blog.duosecurity.com/2012/06/dissecting-androids-bouncer/ [2] http://en.wikipedia.org/wiki/Binary_translation [3] http://bochs.sourceforge.net [4] http://lugatgt.org/content/qemu_internals/downloads/slides.pdf [5] http://www.arm.com/products/processors/technologies/virtualization-extensions.php [6] http://blog.iseclab.org/2012/06/04/andrubis-a-tool-for-analyzing-unknown-android-applications-2/ [7] http://theinvisiblethings.blogspot.de/
< Blog List