Using KLEE to derive call-graph(s) from test case(s) - llvm

Each generated test amounts to a path, and I'm interested in obtaining information on the functions called along each path, and from there building a call graph as the union of the paths over all tests.
This should render a subset of the complete call-graph, taking into account symbolic variables, assumptions etc.
Has anyone looked into this? (Maybe this is already possible, without altering the KLEE source code.)
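To make the idea concrete, the kind of per-test call recording I have in mind could look roughly like this (a sketch only, not an existing KLEE feature; it assumes the target can be rebuilt natively with -finstrument-functions and that each generated .ktest file is replayed with klee-replay):

#include <cstdio>

// Linked into the target, which is rebuilt with -finstrument-functions;
// the compiler then calls these hooks on every function entry/exit.
extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* callee, void* call_site) {
  // Log one caller -> callee edge per call. Addresses can be resolved to
  // function names afterwards (e.g. with addr2line), and the edges from
  // all replayed tests unioned into the path-derived call graph.
  std::fprintf(stderr, "edge %p -> %p\n", call_site, callee);
}

extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_exit(void*, void*) {}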
Best regards
Per Lindgren

Related

How to use flat buffers when the schema is not fixed?

My C++ application currently works as follows:
1. It launches another process and uses Windows shared memory to communicate between the two processes.
2. The data is serialized in one process and deserialized in the other. The data type can vary based on user input, so the type is serialized as well, allowing the deserializer to interpret the data correctly.
Now I intend to use FlatBuffers to serialize and deserialize the data (because of its obvious advantages: random access and backward compatibility).
However, to do that I need clarity in some areas and am hoping for some help with them.
Based on the data type, I can programmatically generate a schema and feed it to flatc.exe to generate files. However, instead of using flatc.exe, I am thinking of building flatc.dll (from the open-source code) and using that to keep the interaction simpler. Does that sound wise?
Secondly, what I am more unsure about is the following. I will create a schema and invoke the FlatBuffers compiler while the application is running, and it will generate some C++ files. As I understand it, I would then need to build those files somehow, and the resulting binary would have to be plugged into both the serializer and the deserializer to handle the actual data, all while the application is running. How do I achieve this? The whole problem stems from the fact that my application does not have any fixed schema. What is the general approach to using FlatBuffers when the schema is variable?
I hope I am clear about what I am asking; if not, please let me know and I will be happy to provide more details. Thanks in advance for your answers.
The answer is that you do not want this. While it is feasible, generating C++ at runtime, compiling it into a DLL, and then loading that back into your process is an extremely clumsy way of doing it.
The data structures of your program must be known at compile time (if it is written in C++), so why can't you define a schema for them just once and compile it ahead of time? Does your program allow the user to "design" data structures at runtime?
For extremely dynamic use cases, such as where the user can create arbitrary objects, I'd recommend FlexBuffers (https://google.github.io/flatbuffers/flexbuffers.html). They can be used inside a FlatBuffer to store "unknown" data, or even as their own serialization format. With these, you can serialize objects whose structure is only known at runtime; they have most of the same efficiency properties as FlatBuffers, and you won't need to bundle a C++ compiler with your program :)
Best is a combination of the two, where all compile time known data is stored in FlatBuffers, and the remainder in FlexBuffers.
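For illustration, a minimal FlexBuffers sketch for data whose layout is only decided at runtime might look like this (the field names and values are invented; the builder/reader calls are from the flexbuffers.h API):

#include "flatbuffers/flexbuffers.h"
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint8_t> Serialize() {
  flexbuffers::Builder fbb;
  fbb.Map([&]() {                       // layout decided at runtime
    fbb.String("type", "temperature");  // store the type next to the data
    fbb.Double("value", 21.5);
    fbb.Vector("samples", [&]() {
      fbb.Double(21.4);
      fbb.Double(21.6);
    });
  });
  fbb.Finish();
  return fbb.GetBuffer();
}

void Deserialize(const std::vector<uint8_t>& buf) {
  auto map = flexbuffers::GetRoot(buf).AsMap();
  auto type = map["type"].AsString().str();   // inspect the type first,
  double value = map["value"].AsDouble();     // then read the payload
  auto samples = map["samples"].AsVector();
  double sum = 0;
  for (size_t i = 0; i < samples.size(); ++i) sum += samples[i].AsDouble();
  std::printf("%s = %f (sum of samples %f)\n", type.c_str(), value, sum);
}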

How to extract the active code path from a complex algorithm

I have been puzzled lately by an intriguing idea.
I wonder if there is a (known) method to extract the executed source code from a large, complex algorithm. I will try to elaborate on this question:
Scenario:
There is a complex algorithm that a large number of people have worked on for many years. The algorithm creates measurement descriptions for a complex measurement device.
The input for the algorithm is a large set of input parameters; let's call this the recipe.
Based on this recipe, the algorithm is executed, and the recipe determines which functions, loops and if-then-else constructions are followed within the algorithm. When the algorithm is finished, a set of calculated measurement parameters forms the output, and with these output measurement parameters the device can perform its measurement.
Now, there is a problem. Since the algorithm has become so complex and large over time, it is very difficult to find your way around it when you want to add new functionality for the recipes. Basically a person wants to modify only the functions and code blocks that are affected by his or her recipe, but he or she has to dig through the whole algorithm and analyze the code to see which parts are relevant for that recipe, and only after that can the new functionality be added in the right place. Even for simple additions, people tend to get lost in the huge amount of complex code.
Solution: Extract the active code path?
I have been brainstorming on this problem, and I think it would be great if there was a way to process the algorithm with the input parameters (the recipe) and to extract only the active functions and code blocks into a new set of source files or code structure. I'm actually talking about extracting real source code here.
When the active code is extracted and isolated, the result is a subset of source code that is only a fraction of the original source code structure, and it will be much easier for a person to analyze the code, understand it, and make his or her modifications. Eventually the changes could be merged back into the original source code of the algorithm, or maybe the modified extracted source code could even be executed on its own, as if it were a 'lite' version of the original algorithm.
Extra information:
We are talking about an algorithm with C and C++ code, about 200 files, and maybe 100K lines of code. The code is compiled and built with a custom Visual Studio-based build environment.
So...:
I really don't know if this idea is just naive and stupid, or if it is feasible with the right amount of software engineering. I can imagine that there have been more similar situations in the world of software engineering, but I just don't know.
I have quite some experience with software engineering, but definitely not on the level of designing large and complex systems.
I would appreciate any kind of answer, suggestion or comment.
Thanks in advance!
Others say you can't do this. I disagree.
A standard static analysis is to determine control and data flow paths through code. Sometimes such a tool must make assumptions about what might happen, so such analyses tend to be "conservative" and can include more code than the true minimum. But any elimination of irrelevant code sounds like it will help you.
Furthermore, you could extract the control and data flow paths for a particular program input. Then where the extraction algorithm is unsure about what might happen, it can check what the particular input would have caused to happen. This gives a more precise result at the price of having to provide valid inputs to the tool.
Finally, using a test coverage tool, you can relatively easily determine the code exercised by a particular input of interest, and the code exercised by another input for a case that is not so interesting, and compute the set difference. This gives the code exercised by the interesting case that is not in common with the uninteresting case.
My company builds program analysis tools (see my bio). We do static analysis to extract control and data flow paths from C++ source code, and could fairly easily light up the code involved. We also make C++ test coverage tools that can collect the interesting and uninteresting sets and show you the difference superimposed over the source code.
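To illustrate just the set-difference step (a toy sketch; it assumes each run's coverage has already been dumped as plain file:line entries, one per line, and the two file names here are invented):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

int main() {
  // Lines covered by the interesting recipe and by a baseline recipe.
  std::ifstream a("covered_interesting.txt"), b("covered_baseline.txt");
  std::set<std::string> interesting{std::istream_iterator<std::string>{a},
                                    std::istream_iterator<std::string>{}};
  std::set<std::string> baseline{std::istream_iterator<std::string>{b},
                                 std::istream_iterator<std::string>{}};

  // Code exercised only by the interesting recipe.
  std::set_difference(interesting.begin(), interesting.end(),
                      baseline.begin(), baseline.end(),
                      std::ostream_iterator<std::string>(std::cout, "\n"));
}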
I'm afraid what you are trying to do is mathematically impossible. The problem is that this:
When the algorithm is finished, a set of calculated measurement parameters will form the output.
is impossible to determine by static code analysis.
What you're running into is essentially a variant of the Halting Problem, for which it has been proven that there cannot be an algorithm/program that can determine whether an algorithm passed into it will yield a result in finite time.

How should I document a Lua API/object model written in C++ code?

I am working on documenting a new and expanded Lua API for the game Bitfighter (http://bitfighter.org). Our Lua object model is a subset of the C++ object model, and the methods exposed to Lua that I need to document are a subset of the methods available in C++. I want to document only the items relevant to Lua, and ignore the rest.
For example, the object BfObject is the root of all the Lua objects, but is itself in the middle of the C++ object tree. BfObject has about 40 C++ methods, of which about 10 are relevant to Lua scripters. I wish to have our documentation show BfObject as the root object, and show only those 10 relevant methods. We would also need to show its children objects in a way that made the inheritance of methods clear.
For the moment we can assume that all the code is written in C++.
One idea would be to somehow mark the objects we want to document in a way that a system such as doxygen would know what to look at and ignore the rest. Another would be to preprocess the C++ code in such a way as to delete all the non-relevant bits, and document what remains with something like doxygen. (I actually got pretty far with this approach using luadoc, but could not find a way to make luadoc show the object hierarchy.)
One thing that might prove helpful is that every Lua object class is registered in a consistent manner, along with its parent class.
There are a growing number of games out there that use Lua for scripting, and many of them have decent documentation. Does anyone have a good suggestion on how to produce it?
PS To clarify, I'm happy to use any tool that will do the job -- doxygen and luadoc are just examples that I am somewhat familiar with.
I have found a solution, which, while not ideal, works pretty well. I cobbled together a Perl script which rips through all the Bitfighter source code and produces a second set of "fake" source that contains only the elements I want. I can then run this secondary source through Doxygen and get a result that is 95% of what I'm looking for.
I'm declaring victory.
One advantage of this approach is that I can document the code in a "natural" way, and don't need to worry about marking what's in and what's out. The script is smart enough to figure it out from the code structure.
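For illustration, the generated "fake" source is essentially a pruned class containing only the Lua-visible methods and their comments, something like this (the method names and the Point type are invented, not the real Bitfighter API):

class Point;   // placeholder for the real type

/**
 * Root of all Lua objects.
 */
class BfObject
{
public:
   /**
    * Returns the object's current position.
    */
   Point getPos();

   /**
    * Moves the object to the specified position.
    */
   void setPos(Point pos);
};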
If anyone is interested, the Perl script is available in the Bitfighter source archive at https://code.google.com/p/bitfighter/source/browse/luadoc.pl. It is only about 80% complete, and is missing a few very important items (such as properly displaying function args), but the structure is there, and I am satisfied the process will work. The script will improve with time.
The (very preliminary) results of the process can be seen at http://bitfighter.org/luadocs/index.html. The templates have hardly been modified, so it has a very "stock" look, but it shows that things more-or-less work.
Since some commenters have suggested that it is impossible to generate good documentation with Doxygen, I should note that almost none of our inline docs have been added yet. To get a sense of what they will look like, see the Teleporter class. It's not super good, but I think it does refute the notion that Doxygen always produces useless docs.
My major regret at this point is that my solution is really a one-off and does not address what I think is a growing need in the community. Perhaps at some point we'll standardize on a way of merging C++ and Lua and the task of creating a generalized documentation tool will be more manageable.
PS You can see what the markup in the original source files looks like... see https://code.google.com/p/bitfighter/source/browse/zap/teleporter.cpp, and search for #luaclass
You can exclude by namespace (it could be a class as well) in your C++ code, but not the Lua code, with
EXCLUDE_SYMBOLS = myhier_cpp::*
in the Doxygen config file, or cherry-pick what to exclude by using
/// @cond
class aaa {
...
...
};
/// @endcond
in your C++ code.
I personally think that separating by namespace is better, since it reflects the separation in both code and documentation, and leads to a namespace-based scheme for keeping pure C++ apart from the Lua bindings.
Separating via exclusion is probably the most targeted approach, but it would involve an extra tool to parse the code, mark up the relevant Lua parts, and add the exclusions to the code. (You could also render special information, such as graphs, separately from this markup and add it to your documentation as an image; at least that is easy to do with Doxygen.) Since there has to be some indication of Lua code already, the markup is probably not too difficult to derive.
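A small sketch of the namespace-based split (all names invented): keep the Lua-facing classes in their own namespace and hide everything else from Doxygen with EXCLUDE_SYMBOLS.

namespace bf_core {                 // pure C++; excluded from the docs
  class BfObjectImpl {};
}

namespace bf_lua {                  // Lua bindings; this is what gets documented
  /** Root of all Lua objects. */
  class BfObject {
  public:
    /** Returns the object's id. */
    int getId() const { return 0; }
  private:
    bf_core::BfObjectImpl mImpl;
  };
}

// Doxyfile:  EXCLUDE_SYMBOLS = bf_core::*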
Another solution is to use LDoc. It also allows you to write C++ comments, which will be parsed by LDoc and included in the documentation.
An advantage is that you can use the same tool to document your Lua code as well. A drawback is that the project seems to be unmaintained. It may also not be possible to document complex object hierarchies like the ones the questioner mentioned.
I forked it myself to make some small adjustments regarding C++. Have a look here.
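As a rough idea of what such comments might look like in a C++ file (the tags below are the LuaDoc-style ones LDoc understands, and the class/method names are invented; I am writing the C++ comment handling from memory, so check it against the LDoc manual):

/***
 * Root of all Lua objects.
 * @classmod BfObject
 */

/***
 * Returns the object's current position.
 * @function getPos
 * @return the position
 */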

I need help developing a polymorphism engine - instruction dependency trees

I am currently trying to write a polymorphism engine in C++ to toy around with a neat anti-hacking stay-alive checking idea I have. However, writing the polymorphism engine is proving rather difficult - I haven't even established how I should go about doing it. The idea is to stream executable code to the user (that is, the application I am protecting) and occasionally send them some code that runs some checksums on the memory image and returns the result to the server. The problem is that I don't want someone to simply hijack or programmatically crack the stay-alive check; instead, each one would be generated on the server from a simple stub of code run through a polymorphism engine. Each stay-alive check would return a value dependent on the checksum of the data and a random algorithm sneaked inside the stay-alive check. If the stub returns an incorrect value, I know that the stay-alive check has been tampered with.
What I have to work with:
*The executable image's PDB file.
*An assembler & disassembler engine, between which I have implemented an interface that allows me to relocate code, etc.
Here are the steps I am thinking of taking and how I might do them. I am working with the x86 instruction set in a Windows PE executable.
Steps I plan on taking (my problem is with step 2):
Expand instructions
Find simple instructions like mov or push and replace them with a handful of instructions that achieve the same end, just using more instructions. In this step I'll also add loads of junk code.
I plan on doing this just by using a series of translation tables in a database. This shouldn't be very difficult to do.
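As a toy illustration of such a translation table (the patterns are invented; real entries must preserve flag and memory side effects, and the push/pop trick assumes the register is not esp):

#include <map>
#include <string>
#include <vector>

// Each simple instruction pattern maps to an equivalent, longer sequence.
// %r, %r1, %r2 stand for registers and %imm for an immediate operand.
const std::map<std::string, std::vector<std::string>> kExpansions = {
  // push/pop do not touch the arithmetic flags, so behaviour is preserved
  { "mov %r, %imm", { "push %imm", "pop %r" } },
  { "mov %r1, %r2", { "push %r2",  "pop %r1" } },
};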
Shuffling
This is the part I have the most trouble with. I need to isolate the code into functions, then establish a series of instruction dependency trees, and then relocate instructions based on which ones depend on which others. I can find the functions by parsing the PDB files, but creating the instruction dependency trees is the tricky part I am totally lost on.
Compress instructions
Compress instructions and introduce a series of uncommon & obscure instructions in the process. And, like the first step, do this using a database of code signatures.
To clarify again, I need help performing step number 2 and am unsure how I should even begin. I have tried making some diagrams, but they quickly become very confusing to follow.
Oh, and obviously the protected code is not going to be very optimal - but this is just a security project I wanted to play with for school.
I think what you are after for "instruction dependency trees" is data flow analysis. This is classic compiler technology that determines, for each code element (primitive operations in a programming language), what information is delivered to it from other code elements. When you are done, you end up with what amounts to a graph whose nodes are code elements (in your case, individual instructions) and whose directed arcs show what information has to flow so that later elements can execute on results produced by "earlier" elements in the graph.
You can see some examples of such flow analysis at my website (focused on a tool that does program analysis; this tool is probably not appropriate for binary analysis, but the examples should be helpful).
There's a ton of literature in the compiler books on doing data flow analysis. See any compiler text book.
There are a number of issues you need to handle:
Parsing the code to extract the code elements. You sound like you have access to all the instructions already.
Determining the operands required by, and the values produced by, each code element. This is pretty simple for "ADD register,register", but you may find it daunting for a production x86 CPU, which has an astonishingly big and crazy instruction set. You have to collect this for every instruction the CPU might execute, and that means pretty much all of them. Nonetheless, this is just sweat and a lot of time spent looking at the instruction reference manuals.
Loops. Values can flow from an instruction, through other instructions, back to that same instruction, so the dataflows can form cycles (lots of cycles for complex loops). The dataflow literature will tell you how to handle this in terms of computing the dataflow arcs in the graph. What these mean for your protection scheme I don't know.
Conservative Analysis: You can't get the ideal data flow, because in fact you are analyzing arbitrary algorithms (e.g., a Turing machine); pointers aggravate this problem pretty severely and machine code is full of pointers. So what the data flow analysis engines often do when unable to decide if "x feeds y", is to simply assume "x (may) feed y". The dataflow graph turns conceptually from "x (must) feed y" into the pragmatic "x (may) feed y" type arcs; the literature in fact is full of "must" and "may" algorithms because of this.
Again, the literature tells you many ways to do [conservative] flow analysis (mostly with different degrees of conservatism; in fact the most conservative data flow analysis simply says "every x feeds every y"!). What this means in practice for your scheme, I don't know.
There are a lot of people interested in binary code analysis (e.g., the NSA), and they do data flow analysis on machines instructions complete with pointer analysis. You might find this presentation interesting: http://research.cs.wisc.edu/wisa/presentations/2002/0114/gogul/gogul.1.14.02.pdf
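As a toy illustration of the def-use bookkeeping behind such a graph (registers only; memory, flags, and control flow are ignored, and the three instructions are invented):

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Each instruction records the registers it reads and writes.
struct Insn {
  std::string text;
  std::vector<std::string> reads, writes;
};

int main() {
  std::vector<Insn> code = {
    { "mov eax, [esi]", { "esi" },        { "eax" } },
    { "add eax, ebx",   { "eax", "ebx" }, { "eax" } },
    { "mov [edi], eax", { "eax", "edi" }, {} },
  };

  // Instruction i depends on the closest earlier instruction j that writes
  // a register i reads; those dependencies are the arcs of the graph.
  for (size_t i = 0; i < code.size(); ++i)
    for (const auto& reg : code[i].reads)
      for (size_t j = i; j-- > 0; )
        if (std::find(code[j].writes.begin(), code[j].writes.end(), reg) !=
            code[j].writes.end()) {
          std::printf("\"%s\" depends on \"%s\" via %s\n",
                      code[i].text.c_str(), code[j].text.c_str(), reg.c_str());
          break;  // only the closest earlier writer matters for this register
        }
}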
I'm not sure that what you are trying to do helps to prevent tampering with a process. If someone attaches a debugger (process) and breaks on the send/receive functions, the checksum of the memory image stays intact, all the shuffling stays as it is, and the client will be seen as valid even if it isn't. This debugger or injected code can also mislead you when you ask which pages are used by your process (so you won't see the injected code, since it wouldn't report the pages in which it resides).
To your actual question:
Couldn't the shuffling be implemented by relinking the executable? The linker keeps track of all the symbols that a .o file exports and imports; when all the .o files are read, the real addresses of the functions are put into the imported placeholders. If you put every function in a separate .cpp file and compile each to its own .o file, then by reordering the .o files in the linker call, all the functions will end up at different addresses and the executable will still run fine.
I tested this with gcc on Windows, and it works: by reordering the .o files when linking, all functions are placed at different addresses.
I can find functions by parsing the PDB files, but creating instruction dependency trees is the tricky part I am totally lost on.
Impossible. Welcome to the Halting Problem.

Tool to find all instances of a class

I'm working on a C++ project where modules are meant to be combined in small groups to serve a specific purpose (in some sort of processing pipeline).
Sometimes it's hard to know the impact of any change, because we intuitively don't even know all the places where one of our modules is being used.
I know I can do Search in Files to find all instances of a class, but is there a tool that can analyze my source code and give me a list of how many instances of each class are used?
I might not be understanding your question right, but I believe Doxygen can do that: http://www.doxygen.nl/
You will be able to see how everything is used and what calls what. It will show you which classes call which other classes, and the whole hierarchy of your code.
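A minimal Doxyfile sketch that turns on the cross-references and caller graphs (the graph options need Graphviz dot installed):

# Document everything, even entities without doc comments
EXTRACT_ALL            = YES
# Record what references / is referenced by each class and function
REFERENCED_BY_RELATION = YES
REFERENCES_RELATION    = YES
SOURCE_BROWSER         = YES
# Caller and call graphs, rendered with dot
HAVE_DOT               = YES
CALL_GRAPH             = YES
CALLER_GRAPH           = YES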
If all the paths through the code are known (very unlikely in reality), then putting a printf/cout in the class constructor would do the job nicely (a minimal sketch is shown below).
Otherwise, I would deploy a find-and-grep solution.
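A minimal version of the constructor-logging idea (the class name is invented):

#include <iostream>

class PipelineModule {              // hypothetical module class
public:
  PipelineModule() {
    ++instances;
    std::clog << "PipelineModule instance #" << instances << " constructed\n";
  }
  static int instances;
};

int PipelineModule::instances = 0;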