typetapper/report.md

20 KiB

Introduction

Binary analysis, or program analysis of programs without source code available, is an important discipline for the modern world. Many programs which we would like to answer various questions about, such as about their behavior, their security properties, or how to modify them, do not have source code available and thus must be reverse engineered. Binary reverse engineering is difficult, for many reasons: compilers typically discard most kinds of metadata that make code readable by a human, assembly language is not very understandable, the compiler may apply optimizations which maintain the correctness of the program but change its behavior substantially, among other reasons.

Typically, a reverse engineer will use an interactive disassembler or decompiler, such as IDA Pro, Ghidra, Binary Ninja, or angr management, in a quest to transform their target code until it is easy to read. The part of this process we will focus on in this paper is the task of structure type recovery. Without types available, all pointer accesses will appear of the form of a generic pointer, such a a void*, being offset by some value, cast to a specific pointer type, such as a float*, and then dereferenced. This is very ugly and requires a lot of mental overhead to parse. Ideally, we would assign a real type to the pointer, and the decompiler would display the dereference as a simple field access, but determining the type to assign is highly nontrivial. Fully automated algorithms for binary type inference exist, but there is a lack of trust in them from reverse engineers, since they are opaque and do not justify their results.

Our goal is to recover data-flow information from a binary and visualize pertinent aspects of it with respect to structure recovery analysis. Ideally, this will empower reverse engineers to make their own inferences about the correct type of a structure pointer. We implemented this system, which we call TypeTapper, as a plugin for angr-management, a Qt-based Python GUI for reverse engineering, based on angr, the open source binary analysis platform.

Algorithms and Data Structures

How do we take our input, an arbitrary executable file, and transform it until it is meaningful to visualize? The first steps of this process are loading the file into a virtual address space, lifting it to an intermediate representation suitable for analysis, and performing control flow recovery to generate a control flow graph. All of these are provided by angr. From here, we can begin focusing on generating our first data structure, the data-flow graph.

Data-flow Graph

The data flow graph is a directed graph where the nodes are atoms and the edges are relationships between those atoms. "Atom" is a program analysis term referring to a particular data storage location at a particular point in the program. For example, the register rdx at program counter 0x400123 and the register rdx at program counter 0x400456 are two different atoms. Each atom is annotated with its properties, which are its attributes which are pertinent to the visualization. In particular, we would like to track how many times the atom or an offset thereof is used as an integer, pointer, floating point number, etc, and also its unifications, or a set of pairs of offsets of an atom which are substituted for each other at this particular location. The important aspect of these properties is that they are summable, meaning that we can summarize the properties of multiple nodes by taking their sum. The edges of the data-flow graph are annotated with operations, or a list of transformations that are applied in order to qualify the relationship between the pieces of data that are stored in two linked atoms, and control flow actions, or contextual transformations necessary to validate whether it is legal to follow this edge in a graph traversal. The important aspect of these attributes is that they can be concatenated: for any path through the graph, the edges in the path can be summarized by concatenating their control flow actions and operations, in order.

In order to generate the data-flow graph, we have written the core TypeTapper program analysis, a two-step algorithm. First, there is the intra-block live data flow phase, which computes in O(number of basic blocks) time. For each basic block in the program, it emulates the block's code over the data domain of "live data references", meaning that it tracks data loads as the creation of a reference and data stores as the creation of a link in the data flow graph. If the data is mutated in between load and store, the edge is annotated with these operations.

Second, there is the inter-block passive data flow algorithm. This is a fixed-point algorithm that attempts to connect all the isolated edges that were generated in the first phase. Every time an atom is created with no incoming edges, it is added to the analysis queue. For each item in the queue, the analysis follows the control flow graph backwards, seeking to find an atom which will eventually flow into the unsourced atom by virtue of having not been mutated between store and load and adding an edge representing this flow. These edges necessarily have no operations associated with them, but they may have control flow actions: if an edge crosses a function call or return, it is important to note this, otherwise graph traversals may follow data flow paths which do not correspond to any real program flow. Interestingly, though this phase has unbounded complexity due to being a fixed-point algorithm, in practice we observe it terminating in negligible time compared to the first phase.

Relative Atom Graph

Full-program data-flow graphs are very powerful, but also very complex, necessarily containing more information than an analyst needs for any one task. To remedy this, we introduce the concept of the relative atom graph, a view of the data-flow graph with respect to one point of interest (an atom). The nodes in this graph are also atoms, but their properties are transformed according to operations on the path from the origin atom. This is an important design consideration motivating our data structures from earlier: properties can be transformed by operations. Additionally, each node in the relative graph is also annotated with the control flow actions necessary to reach it. This means that one node from the data-flow graph may correspond to an arbitrary number of nodes in the relative atom graph, and furthermore that the relative atom graph is actually an infinite graph, of which we may only ever compute a subgraph.

The infinite nature of the relative atom graph is reflected in its core algorithm, which is to expand the subgraph structure to include all the previously unincluded nodes adjacent to a given frontier point. The relative atom graph starts as a single node, the atom of interest with an empty control flow context, and no edges. Then, the relative expansion algorithm is applied iteratively as requested by the user in order to expand the graph's frontier into progressively distant territory. This algorithm is not complex - for each edge from the frontier point in the data-flow graph corresponding to an edge not present in the relative atom subgraph, if the new control flow context computed by concatenating the frontier point's context with the edge's context actions is valid, then the edge and the corresponding adjacent atom are added to the subgraph. The atom's properties are transformed according to the concatenation of the operations on the path from the origin to the new atom.

Hierarchical Graph

The relative atom graph is substantially simpler than the data-flow graph, but it can still be overwhelming to an analyst. In particular, it does not take advantage of the fact that node properties are summable, meaning that groups of nodes can be visualized as a sum of their properties. Accordingly, we introduce the hierarchical graph, which presents as a graph where each node can be a graph itself, and nodes within the graphs within the nodes can have edges which point into the parent graph, recursively. This is implemented as an n-ary tree of nodes and a multigraph of those same nodes derived from the relative graph such that each edge from the relative graph corresponds to a path through the multigraph depicting the shortest path through the tree. Each node has cached in it the sum of the properties of all its children. The consequence of this is that the data structure can be efficiently updated and can have the question, "What are the nodes and edges that should be shown to represent any one level of hierarchy?" answered by inducing a simple subgraph on the children of a single node. This data structure is designed to be applicable to any sort of graph with properties composable in the sum-concat-transform paradigm outlined above.

The maintenance of this data structure is extremely nontrivial. Every time the relative graph is updated by the expansion algorithm, corresponding updates must be made to the hierarchical graph: when an edges edges is added to or removed from the relative graph, a path must be added to or removed from the hierarchical graph, and when a node is added to or removed from the relative graph, its properties must be added to or removed from its parent, recursively. When nodes are moved from group to group, the operation is the same as removing the node and all its edges from the relative graph, and then re-adding it with a different parent, but this is highly inefficient. Instead, algorithms were implemented for two basic operations: move a node to become a sibling of its parent, or move a node to become a child of one of its siblings. These involve a small number of incremental updates, though it is quite difficult to verify the correctness of this algorithm.

Visualization design

Now that we have transformed our input data all the way from an executable file to a hierarchical graph of atoms and properties relative to an origin atom, we can begin visualizing. The core idiom of our visualization is a property chart, which represents a single node of the hierarchical graph, visualizing the sum of the properties of its children, or if it is a leaf node, its own properties. At first glance, it is similar to a stacked bar chart, but it can express more nuanced information. The color of each box on the chart represents its category (whether the data usage being visualized is pointer-like, integer-like, float-like, etc), the y-position of each box represents the data offset, and the width of each box represents its quantity, but the height of each box also encodes information: the size of the data access being performed. Because there can be multiple data sizes in multiple categories accessed at a single offset, this necessitates a more complex layout algorithm to arrange the boxes in an intuitive and uncluttered manner.

The other aspect of a node's properties that needs to be visualized is its unifications, or the offsets which are substituted for each other at some point in the program. These are rendered as arcs drawn between the two offsets on the vertical axis. These arcs can become cluttered, so they are drawn with reduced opacity and become solid when hovered with the mouse.

The two main interactions at this level of abstraction are that a node can be renamed through a context menu entry or the N key, changing the text identifying it at its top. Additionally, the user can double click on a node to navigate to a related view. If the node represents an atom, this opens the program's disassembled code and highlights the atom. If it represents a group, this changes the current view to show the node's children. The navigation can also be done through a context menu entry or the D (down) key.

Next, let's discuss the idiom using the property chart as a glyph: the node-link diagram representing dataflows. The purpose of this idiom is to show links and paths among nodes. The graph layout algorithm being used is GraphViz's neato, but of course, no layout algorithm is perfect, and so we allow the user to re-layout the graph. They can do this by dragging nodes with the mouse or by holding down the Z key (the most natural key to press with a free left hand) to animate an iterative application of the neato algorithm. The entire scene can be panned by dragging its background, and multiple nodes can be selected for uniform dragging or taking actions on multiple nodes at once by clicking and dragging on the scene background with the shift key held to draw a rectangle.

These were interactions related to an arbitrary node-link diagram, but this graph has two additional capabilities. First, it is a subgraph of an infinite graph, so the user needs tools to control how to control the extent of the subgraph. Any node which has neighbors which are not presently part of the subgraph has a blue plus icon in its corner. Clicking the icon will temporarily add its neighbors outside the subgraph to the scene, and clicking on one of them will add the node to the subgraph, committing it to the scene. Clicking anywhere else removes the temporary nodes. Pressing the delete key while a node is selected removes it from the subgraph and marks its neighbors as able to be expanded again. Since the action of expanding the subgraph is so critical to the workflow, there is a shortcut to accelerate it: by holding down the X (expand) key, mousing over a node is equivalent to clicking on it and clicking its plus icon, committing it to the graph and beginning the expansion process. Because most laptop touchpads do not allow moving the mouse at the same time as pressing letter keys, this "rapid fire expand" mode can be toggled on and off by tapping ctrl-X.

The second capability is the hierarchical nature of the graph, to view groupings of nodes as nodes themselves. This depiction deliberately collapses the representation of a group and a leaf via property summation. The main difference between a leaf node and a group node is that double-clicking on a leaf opens the disassembly view, while double clicking on a group transforms the scene to view the children of that group. This can be un-done by double-clicking on the background to navigate to the parent group, or by pressing the U (up) key. The other main capability that the hierarchical graph adds is that nodes may have edges which extend outside the current grouping. By including the group's parent node in the induced subgraph, all these edges appear as edges to the parent node, but this ruins the graph layout. Instead, we add a unique "Out" node for each edge to connect to. Differentiating between the out nodes and the regular nodes uses a new channel, shape: the normal nodes are boxes and the out nodes are circles. Double clicking on an out node navigates to the parent group, centered on the target of the out node's edge.

There are also tools to form and change these groupings. Dragging a node into a group node will move the dragged node into the target node's group. If the target node is a leaf and not a group, a new group will be formed and both of the nodes will be moved into it. You can also create a new group from an arbitrary number of nodes by selecting them and then pressing the C (create) key or right-clicking and selecting the corresponding menu entry. You can also move nodes up the hierarchy - the O (out) key will move all selected nodes out of the current group and into the parent group, while the B (break) key will break the selected group and move all its children to the current group. Both of these actions also have corresponding context menu entries.

The sum of these interactions is designed to let the analyst manipulate the hierarchical graph until it is intuitive for them to read and navigate, until it matches their mental model of the program, which may include an arbitrary number of nodes and levels of hierarchy.

Case Studies

We included in our submission several binaries which were interesting to view with this visualization. We dive into two of them, corewars-vm and cryptochall, here.

In corewars-vm, by starting the visualization for v7 = calloc(0x1c04a) and expanding once, you can see a node with a stride of seven. This demonstrates an interesting behavior of this program (seven-strides are very uncommon) but also highlights a limitation of the visualization: that the exact value of seven is hard to see without zooming in substantially.

If you look at the visualization for v5 = calloc(0x2fffd0), you see a pointer at offset 0. A limitation of the visualization is that in order to follow this link to see the structure of this pointer, you must click into the function's group, locate the atom with the appropriate mark, click into it to navigate to the disassembly, press f5 to navigate to the corresponding decompilation, right click on the appropriate variable, and start another TypeTapper session. However, once you do this, you can see that the structure is very similar to the structure of v1 = calloc(0x2003000). Similar structures suggest that one pointer type is the same as another pointer type. This is a useful analysis insight and ought to be made more accessible.

The cryptochall binary is compiled from C++, which means that it is using a large number of nontrivial data structures. For example, if you start typetapper from the argument variable of sub_413d5a, you can see the structure of a C++ std::string. If you expand the visualization further, you can see the entire stack frame of the main function, a huge structure in which the string is embedded. It is difficult to follow references to this structure further than main because function calls adjacent to main may show other objects also embedded in the stack frame, i.e. false positive references, and also that true positive references may show as blank because their interesting properties are several function calls deep and the visualization only shows one level at a time.

Discussion

The first, and somewhat obvious, insight gleaned in the process of developing this tool is that designing a visualization is an excellent way to validate the creation of a dataset. There were multiple instances of work going back and forth between developing a new visualization feature, using it to determine there was a mismatch between what was being visualized and the ground truth, determining that this mismatch existed at the level of the analyzed data, and then fixing the analysis.

In terms of the usefulness of the tool in its current state, the analysis and dataset generation is the weakest link by far. There are definitely instances of data which simply goes missing at some point between program writing and hierarchical graph visualization. However, once these problems and the other future work in this section is addressed, it will be prudent to evaluate the effectiveness of our tool in assisting reverse engineering tasks.

In order to address the limitations found in the case studies: the following changes will be made as future work: displaying the exact length of an arc as text, adding an interaction to follow pointer references into new visualizations, providing the ability to mark nodes as false positives in a more permanent way then simply excluding them from the subgraph (a "hidden" designation), and picking better heuristics for the increment at which to expand the graph so that true positives will show up as blank nodes less often.

Other future visualization work includes experiments in order to determine the most effective techniques for reducing noise and providing more contextual links for a given glyph. We would like to attempt to offer an option to hide all visualized nodes which represent atoms which have no interesting properties, effectively removing the nodes from the graph but adding edges such that paths through the removed nodes are preserved. Additionally, we would like to experiment with right-clicking a particular mark, i.e. a colored box or arc, to show the part of the program which prompted showing it without having to click and scan down to the leaf node containing the relevant glyph. Finally, we would like to introduce additional tools for searching and filtering a given view - perhaps a search box which highlights or centers nodes which match a given query.

Conclusion

In this paper, we presented TypeTapper, a system for analyzing a program to produce a hierarchical data-flow graph for a binary program with respect to a point of interest, then visualize and manipulate it in the context of a reverse engineering session. We believe that with additional work and evaluation, TypeTapper can represent an advancement in the program analysis community's ability to reverse engineer binary code.