Ghidra Angr Integration Tool - Symbolic Execution for serpentine monsters!

23 Aug 2024 - FoundryZero

We’ve created AngrIntegration, a Ghidra plugin that integrates the popular symbolic execution suite angr. You can symbolically execute any architecture that Ghidra can load, using the same pcode representation that Ghidra uses internally! Using the plugin, most aspects of angr can be controlled using the GUI.

Get it here: github.com/foundryzero/ghidra-angr-integration-tool

Ghidra Angr Integration Tool logo

Symbolic execution is an incredibly powerful binary analysis technique, where the program (or parts of it) are simulated under a specific environment that provides a deep understanding of how inputs to the program affect both the behavior and any outputs. Unfortunately, most of the tools for applying it are complex, and take a while to correctly setup and get working on a given project.

To help fix this, we created AngrIntegration, a Ghidra plugin that integrates the popular symbolic execution suite angr. You can symbolically execute any architecture that Ghidra can load, using the same pcode representation that Ghidra uses internally! Using the plugin, most aspects of angr can be controlled using the GUI, and if any more advanced functionality is needed, there are python hooks that allow you to customise aspects of how symbolic execution occurs.

Symbolic Execution

The core principle of symbolic execution is relatively simple: rather than just operating on concrete values such as 3 or 0x001021d3, memory and registers in a program being symbolically executed can also take symbolic expressions that can contain variables.

For example, suppose in a very simple case you had some user-supplied input in the variable num, and the C code

int incredibly_complex_function(int num) {
    return num + 1;
}

is used to apply some unreasonably complex operation. Static analysis could be used to understand it, but that’s a long and manual process - symbolic execution can help here! When running normally, to get an output from this function we would need to choose a concrete value of num, and somehow figure out the behavior of the function based on observing it’s output. When using angr, we could instead make num its own symbolic variable, and angr would execute the function symbolically, giving a return value of <num> + 1.

If that value is then used in a branch, the simulation will split in two, executing both the path where a branch would be taken, and the path where it is not. Over the course of a whole program, this results in many states executing at once, with potentially thousands of symbolic variables for user input and the environment.

This simulation is paired with a constraint solver, in angr’s case claripy, that can take a set of symbolic variables and provide a solution (if possible) assigning concrete values to variables, satisfying a set of constraints built up over execution. Constraints are added on conditonal branches, such that in the state where the branch is taken, the variables must have values that would have led to that branch being taken.

Custom constraints can also be added! For example, suppose you’ve found a situation in which the instruction pointer gets set to some symbolic variable. At this point, angr will stop simulating those states, since a symbolic instruction pointer could cause a potential branch to every possible memory location, which is infeasible to simulate. However, if you have some interesting memory address you’d like to jump to, you can add a new constraint restricting the instruction pointer to that address, and the constraint solver will tell you a sample set of values such that that state will be reached:

Symbolic execution is incredibly powerful but there are a number of caveats: perfectly simulating the whole environment is often impossible, so some concessions have to be made, particularly surrounding syscalls and hardware features. angr implements it’s own equivalents for many common functions, for example libc calls, implementing them in a way more compatible with symbolic execution. For example, the libc function rand would normally return a random integer, but angr’s replacement returns an unconstrained symbolic variable.

Aside from properly simulating the environment, the major other concern is state explosion, where for some reason or another huge numbers of states are created, slowing execution to a crawl. angr has a number of options to mitigate this, including dumping states to disk, dropping ones that hit certain addresses, or just limiting the total number of states that can be created.

angr in practice

angr by default uses a simulation engine using the VEX IR (from Valgrind!) to run, but this doesn’t support as many architectures as Ghidra does. It alternatively has a simulation engine based on Ghidra’s pcode representation, but this is significantly less fully featured than the VEX engine. (angr also has engines for executing Java bytecode, and for accelerating execution using unicorn, a emulation solution based on qemu.) Regardless of the engine you’re using, angr needs to know basic information such as the architecture and execution environment.

To do this, the plugin provides a python ArchitectureInterface that can be implemented to override many parts of it’s behavior. For example, the support for the microcorruption environment for the MSP430 is implemented as an architecture module, providing support for the syscall-like interrupts it uses for I/O, as well as informing angr of the way it halts (since the MSP430 does not have a simple hlt instruction):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
class MSP430ArchitectureIf(ArchitectureInterface):
    def get_calling_convention(self) -> type[angr.calling_conventions.SimCC]:
        return SimCCMSP430  # tells angr how arguments are passed and returned

    def get_main_obj_args(self) -> dict[str, Any]:
        return {
            "backend": "blob",
            "entry_point": 0x4400,
            "base_addr": 0x0,
            "segments": [(0x0, 0x0, 0x4FFF), (0x6000, 0x6000, 0xFFFF - 0x6000)],
        }  # tells angr how to load the binary blobs

    def get_arch(self) -> archinfo.Arch:
        return MSP430Arch  # basic information such as the machine word size and registers

    def get_extra_loader_args(self) -> dict[str, Any]:
        # the MSP430 is a 16 bit machine so tweak the loader to treat that in a more sane way
        return {"rebase_granularity": 0x100}  

    def get_engine(self) -> angr.engines.SimEngine:
        return angr.engines.UberEnginePcode  # use the pcode engine since VEX doesn't support the MSP430

    def get_techniques(self) -> list[angr.ExplorationTechnique]:
        return [MSP430HaltTechnique()]  # tells angr when the machine has halted

    def get_hooks(self) -> dict[int, angr.SimProcedure]:
        # python handler that will be triggered when the instruction pointer reaches 0x0010
        return {0x0010: MSP430MicrocorruptionInterruptHook()}  

The plugin

The plugin can be included in a tool, and will automatically set itself up to run the currently open program. You can select the architecture file to use, where to start and end simulation, and define your own symbolic variables, constraints and hooks all without leaving Ghidra!

At any point when running a simulation, the break button in the lower right will immediately pause simulation and open a Python 3 REPL at that point, with access to the current state of the simulation. Despite using Python 3, this plugin doesn’t require any of the various python 3 layers for Ghidra.

Worked Example: half-twins

To demonstrate the plugin, this section will give a walkthrough of a usage of the plugin on the half-twins example binary. This is a simple crackme that just takes two inputs as arguments, performs some unknown algorithm on them, then decided if they are accepted or not.

Initially importing the program to Ghidra, we can immediately try just running angr with the default settings, and see what we get:

1
2
3
4
5
6
7
8
9
angr [angr_main]> ===== SYMBOLIC EXECUTION COMPLETE =====
angr [angr_main]> Found 1 states.
angr [angr_main]> 
angr [angr_main]> =================== STATE deadended[0] ===================
angr [angr_main]> sample stdin: b''
angr [angr_main]> stdout: b'hmm... i\'m not sure you know what the word "twins" mean :/\ngood luck next time\n'
angr [angr_main]> ip: <BV64 0x301058>
angr [angr_main]> 
angr> angr stopped.

Testing the binary, we can see that this error message is printed when there aren’t the correct number of arguments - by default, angr will not pass anything through argc/argv. The simplest way to specify that is by using the Call State entry point, simulating main with arguments of our choice. Selecting the address of main, 0x101149, the plugin will automatically fill in what Ghidra thinks is the signature of the function:

This is not correct - main should have the usual two parameters! We could either fix this in Ghidra, updating the function definition, and reselecting the address, or we could just manually correct the signature. Either way, the signature should be set to int main(int argc, char** argv), and two purple boxes will appear for the two arguments. Purple boxes can contain Python expressions that define the value of the symbolic variable, and have access to some useful helper functions to avoid the need to know angr in detail for simple tasks.

In this case, we know the program takes two arguments, so setting argc to 3, we can again try running the symbolic execution.

[!WARNING]
Due to a Ghidra limitation, array syntax like char *argv[] in function signatures is not currently supported. Use char ** argv instead.

This will take some time, and angr will spew warnings at you: we’ve left argv undefined, so it will be set to a symbolic variable, and create a lot of states as every possible option for the memory at argv is tested. This is highly inefficient, since we know quite a lot about the structure of argv: we know it should be a 3-valued array with a constant string, and two symbolic arguments. To encode that, put the line

1
[alloc("half-twins"), alloc(sym_str(10, name="arg1")), alloc(sym_str(10, name="arg2"))]

in the argv purple box. The alloc helper function will place the data in it’s argument into memory somewhere, returning a pointer to that memory. We use this along with sym_str, which constructs a string of symbolic variables, to create our array of strings that will get passed into main()! Running this again, we can see that it takes dramatically less time to run, and only produces 20 states rather than over four hundred. Furthermore, state 17’s stdout is showing a correct answer!

1
2
3
4
angr [angr_main]> =================== STATE deadended[17] ===================
angr [angr_main]> sample stdin: b''
angr [angr_main]> stdout: b'Abby & Gabby: yaayy!! nice job! :D\n'
angr [angr_main]> ip: <BV64 0x501038>

At this point, we could enter a REPL and inspect the values of arg1 and arg2 that got the program to this state, but checking the “Automatically show a solution for all variables when done” box will automatically find a solution to all variables in the state:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
angr [angr_main]> SAMPLE SOLUTION: 
angr [angr_main]>   callstate_main_argc : 3 = 0x3
angr [angr_main]>   arg1 = b'\x01\x04 \x02\x02\x80\x02 \x02\x02'
angr [angr_main]>   arg2 = b'\x01\x04 \x02\x02\x01\x01\x01\x01\x01'
angr [angr_main]>     ---
angr [angr_main]>   ('reg', '0x38', '0x1'): <BV64 reg_rbp_21_64{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('reg', '0x28', '0x1'): <BV64 reg_rbx_22_64{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('reg', '0x4c', '0x1'): <BV32 reg_4c_23_32{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('mem', '0x540', '0x1'): <BV664 mem_540_24_664{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('api', 'strlen', '0x1'): <BV64 strlen_25_64> = 0xa
angr [angr_main]>   ('mem', '0x593', '0x1'): <BV88 mem_593_26_88{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('api', 'strlen', '0x2'): <BV64 strlen_27_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x3'): <BV64 strlen_28_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x4'): <BV64 strlen_29_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x5'): <BV64 strlen_30_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x6'): <BV64 strlen_31_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x7'): <BV64 strlen_33_64> = 0xa

These values do solve the binary, but the program is clearly expecting numerical input, or at the very least printable input. Under the constraints tab, add two constraints:

1
2
[Or(And(char >= 0x30, char <= 0x39), char == 0x0) for char in arg1]
[Or(And(char >= 0x30, char <= 0x39), char == 0x0) for char in arg2]

These ensure that values of each character in our two arguments will be an ASCII numeral (or the null terminator). In constraints, any symbolic variables you’ve defined can be treated as if they were locals in python, as bitvectors for single variables or lists for symbolic strings. Running the program again, we finally get some useful output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
angr [angr_main]> =================== STATE deadended[17] ===================
angr [angr_main]> sample stdin: b''
angr [angr_main]> stdout: b'Abby & Gabby: yaayy!! nice job! :D\n'
angr [angr_main]> ip: <BV64 0x501038>
angr [angr_main]> SAMPLE SOLUTION: 
angr [angr_main]>   callstate_main_argc : 3 = 0x3
angr [angr_main]>   arg1 = b'0400011111'
angr [angr_main]>   arg2 = b'0400000000'
angr [angr_main]>     ---
angr [angr_main]>   ('reg', '0x38', '0x1'): <BV64 reg_rbp_21_64{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('reg', '0x28', '0x1'): <BV64 reg_rbx_22_64{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('reg', '0x4c', '0x1'): <BV32 reg_4c_23_32{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('mem', '0x540', '0x1'): <BV664 mem_540_24_664{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('api', 'strlen', '0x1'): <BV64 strlen_25_64> = 0xa
angr [angr_main]>   ('mem', '0x593', '0x1'): <BV88 mem_593_26_88{UNINITIALIZED}> = 0x0
angr [angr_main]>   ('api', 'strlen', '0x2'): <BV64 strlen_27_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x3'): <BV64 strlen_28_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x4'): <BV64 strlen_29_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x5'): <BV64 strlen_30_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x6'): <BV64 strlen_31_64> = 0xa
angr [angr_main]>   ('api', 'strlen', '0x7'): <BV64 strlen_33_64> = 0xa

This input does in fact work on the original binary:

The plugin has many more features not shown in this example, and aims to be versatile enough for most use-cases. See the README for more details!