Reverse engineering C++ - best tools and approach [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
I am sorry - C++ source code can be seen as implementation of a design, and with reverse-engineering I mean getting the design back. It seems most of you have read it as getting C++ source from binaries. I have posted a more precise question at Understanding a C++ codebase by generating UML - tools&methology
I think there are many tools that can reverse-engineer C++ (source-code), but usually it is not so easy to make sense of what you get out.
Has anybody found a good methodology?
I think one of the things I might want to see, for example, is the GUI layer and how it is separated (or not) from the rest. I think the tools should somehow detect packages and then let me organize them manually.

To my knowledge, there are no reliable tools that can reverse-engineer compiled C++.
Moreover, I think it should be near impossible to construct such a device. A compiled C++ program becomes nothing more than machine language instructions. In order to know how those map back to C++ constructs, you need to know the compiler, the compiler settings, the libraries included, and so on ad infinitum.
Why do you want such a thing? Depending on what you want it for, there may be other ways to accomplish what you're really after.

While it isn't a complete solution, you should look into IDA Pro and Hex-Rays.
It is more for "reverse engineering" in the traditional sense of the phrase: it will give you a good enough idea of what the code would look like in a C-like language, but will not (cannot) provide fully functioning source code.
What it is good for is getting a good understanding of how a particular segment (usually a function) works. It is "user assisted", meaning that it will often show a lot of dereferences of raw offsets where there is really a struct or class. At that point, you can supply the decompiler with a struct definition (classes are really just structs with extra things like v-tables and such) and it will reanalyze the code with the new type information.
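To make the "offsets versus struct" point concrete, here is a minimal sketch in plain C++. It is not actual decompiler output; the `Player` type and both functions are hypothetical names invented for the illustration, and the offsets assume the usual layout of two 32-bit integers followed by a float.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical type, invented for this illustration.
struct Player {
    std::int32_t id;      // expected at offset 0
    std::int32_t health;  // expected at offset 4
    float        speed;   // expected at offset 8
};

// Make the layout assumption explicit instead of silently relying on it.
static_assert(offsetof(Player, health) == 4, "sketch assumes health at offset 4");

// Roughly the shape of the code before the tool knows the type:
// an opaque pointer plus a hard-coded byte offset.
std::int32_t health_untyped(const void* p) {
    std::int32_t value;
    std::memcpy(&value, static_cast<const unsigned char*>(p) + 4, sizeof value);
    return value;
}

// The same logic once a struct definition has been supplied.
std::int32_t health_typed(const Player* p) {
    return p->health;
}
```

Both functions read the same four bytes; only the second one tells a human reader what those bytes mean, which is the information you have to give back to the tool by hand.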
Like I said, it isn't perfect, but if you want to do "reverse engineering" it is the best solution I am aware of. If you want full "decompilation" then you are pretty much out of luck.

You can pull control flow out of a disassembly, but you will never get data types back...
There are only integers (and maybe some shorts) in assembly. Think about objects, arrays, structs, strings, and pointer arithmetic all being the same type!
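A small sketch of why the types cannot be recovered: the same four bytes are equally valid as an integer, a float, or a piece of text, and nothing in the bytes themselves says which one the author meant. The function names are hypothetical, and the sketch assumes a 4-byte `float`.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// Interpret four raw bytes as an unsigned integer (result depends on endianness).
std::uint32_t as_integer(const unsigned char* bytes) {
    std::uint32_t v;
    std::memcpy(&v, bytes, sizeof v);
    return v;
}

// Interpret the same four bytes as a float.
float as_float(const unsigned char* bytes) {
    float f;
    std::memcpy(&f, bytes, sizeof f);
    return f;
}

// Interpret the same four bytes as characters.
std::string as_text(const unsigned char* bytes) {
    return std::string(reinterpret_cast<const char*>(bytes), 4);
}
```

All three functions are "correct" for any input; which one matches the original source is exactly the information that compilation throws away.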

The OovAide project at http://sourceforge.net/projects/oovaide/ or on GitHub has a few features that may help. It uses the Clang compiler for retrieving accurate information from the source code. It scans the directories looking for source code, and collects the information into a smaller dataset that contains the information needed for analysis.
One concept is called Zone Diagrams. It shows relationships between classes at a very high level, since each class is shown as a dot on the diagram and relationship lines are shown connecting them. This allows the diagrams to show hundreds or thousands of classes.
The OovAide program zone diagram display has an option called "Show Child Zones", which groups the classes that are within directories closer to each other. There are also directory filters, which allow reducing the number of classes shown on a diagram for very large projects.
An example of zone diagrams and how they work is shown here: http://oovaide.sourceforge.net/articles/ZoneDiagrams.html
If the directories are assigned component types in the build settings, then the component diagram will show the dependencies between components. This even shows which components are dependent on external components such as GTK, or other external libraries.
The next level down shows something like UML class diagrams, but shows all relations instead of just aggregation and inheritance. It can show classes that are used within methods, or classes that are passed as parameters to methods. Any class can be chosen as a starting point; then, before a class is added to the diagram, a list is displayed that allows viewing which classes will be displayed by a relationship type.
The lowest level shows sequence diagrams. This allows navigating up or down the call tree while showing the classes that contain the methods.

Related

How to deal with large projects in C++? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
Now that I know some of the basics of C++, I must admit that I still find it very hard to deal with code that others have written in C++. This may inherently be so, as C++ allows for complex object hierarchies that are, at least to me, very hard to grasp if one is just supplied with a C++ project without any further comments or instructions.
So my question is more a question to the more experienced C++ programmers among you: how can someone understand a large C++ project written by others?
I easily lose my way and can be lost for weeks if I try to understand how a large project of, for example, 10,000 lines of code is written. Functions of classes are pointers to functions of different classes that may or may not be overloaded and may or may not be inherited by other classes, et cetera, without end.
Are there any practical tips that may speed up my ability to read and understand large C++ projects? Is there perhaps a tutorial with such tips? Please, elaborate! :)

I've been programming professionally for some time now, and as such I have repeatedly been handed down codebases written by others before me. Understanding is never easy, especially when the code is inconsistent.
The first thing to realize, though, is that learning your way around a new codebase is not so different from re-discovering a codebase you have not touched for a while. Thus, whether it was written by your old self or by others does not matter much; and since you probably manage to cope with re-discovering codebases you had worked on before, you should be able to discover new codebases as well. Don't lose hope.
The second thing to realize is that understanding is a vague term, and there are certainly different degrees. Often times, nobody asks you to understand the ins and outs completely; more likely you will be asked to understand a portion of the codebase in which either there is a bug or some new functionality should be developed. Therefore, as time passes, you will gradually gain an understanding of various portions, and you will inevitably have a deeper knowledge of the portions you worked the most whilst others can be relatively abstract or even completely obscure. It's okay, it's been a long time since human beings stopped trying to learn everything there was to learn.
With that said, there are several axes of understanding you can try:
you should look for architecture: a good start is to trace the library dependencies (the Makefile/Project should help here); this will give you the coarse technical blocks out of which the application is built. Executables are normally leaves of the dependency trees.
you should look for data-flow: what is the trigger of the application (called directly or as a callback)? What are the steps followed by this data (roughly, just a sketch)? Do not hesitate to focus on a specific narrow use case and use the debugger to trace things, and do not try to dig too deep at first; just get a feel for things.
There are also other axes that may help you gain some understanding of the domain the application has been written for. An understanding of the domain is useful because it gives you key insight into what should happen, and it also helps you decipher the comments and function names.
user documentation: what is this used for? If you can arrange for a demo it is generally very helpful; otherwise maybe you can try playing with it yourself (in a test environment)
tests: what is tested? What is exposed to the user?
persistent data: what is serialized? What is saved in a database? Persistent data is accessed at some point, so it helps if you understand when it is read/written.

If it is a working product (that runs) and you can "debug" it, start by looking at just one particular feature.
Learn how it is working from the user's point of view (UI, behaviour, inputs, outputs, ...).
Once you know the feature from the outside, just look for the code for that feature (only that feature); the starting point might be a handler for a menu, or from a dialog or a mouse/pointer event.
From there, manually trace the code for one action or sub-feature; skip deep internal libraries (treat them as black boxes for now) and learn how it works.
Once you know that section of code, dig deeper into the library APIs that were called from the upper-level code.
Take your time.
Do not try to understand everything at once.
Draw up a schematic (pen and paper) of the dependencies (stay high level, no class dependencies at the beginning).
Good luck.

The problem that you mention does not have a clear and simple answer. Nevertheless, here are some tips:
At the beginning, try to remember everything indiscriminately: names of directories, classes, template parameters, etc. As much as you can. This sounds pointless but still makes sense.
While working with the code, always ask yourself "Have I looked at this function/param/etc. before?" If the answer is yes, spend more time on this piece of code. If not, just get a basic grasp and move on.
As time goes on, you will find that more and more becomes clear and easier to grasp.
It is impossible to give any exact values because size and complexity of projects vary greatly. Do not expect simple and immediate results.
Other points:
You definitely need a source code browser. Spend time learning how to use it. A good example is http://sourceinsight.com/. This is not my site!!! I do have my own site. I will not mention it here.
If you see a function that is called 500 times, knowledge about it is 500 times more likely to be useful than knowledge about a function that is called only once.
The best thing is to grasp the architecture of the project. While trying to do this, remember that the project may have no architecture at all.
While studying the code, keep your task in mind. A typical situation: you need to modify something or fix a bug. If so, look for the right part of the code and focus your effort on it.

tool for generating an outline/map of a C++ code - is there such thing? [duplicate]

This question already has answers here:
C/C++ source code visualization? [closed]
(7 answers)
Closed 9 years ago.
I need to get into and make some modifications to a software component written in C++. I am fantasizing about generating some map of the code, that would show relationships between classes and walk me through the flow / call graph of methods. Is there a tool for this?
Years ago I worked with the Rational Rose modeling tool, which had a feature for reverse-engineering code and building a class diagram from it. However, what's also important for this kind of project exploration is dynamic information such as a sequence diagram (ideally) or a call graph. Not to mention that Rose is too big for such a one-off task, and I actually don't know whether it still exists at all.

I personally use Doxygen https://github.com/doxygen/doxygen and it's truly among the easiest programs to configure in a way that produces output like what you describe.
To generate call graphs you also need dot, which you can get from Graphviz http://www.graphviz.org/. There might be some other dependencies, but in those cases the configuration file, which by the way is rather well commented, should say so.
The configuration file of Doxygen might seem extensive at first, but the end result is worth it.
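As a starting point, these are the options in the generated configuration file that are most relevant to call graphs. This is only a sketch: the `INPUT` path is an assumption about where your sources live, and large projects may need further tuning.

```
# Doxyfile fragment (generate a full template with: doxygen -g)
INPUT                  = src     # assumed source directory, adjust to your project
RECURSIVE              = YES     # descend into subdirectories
EXTRACT_ALL            = YES     # include entities that have no doc comments
HAVE_DOT               = YES     # use Graphviz dot for the diagrams
CALL_GRAPH             = YES     # per function: what it calls
CALLER_GRAPH           = YES     # per function: what calls it
CLASS_GRAPH            = YES     # inheritance diagrams
INCLUDE_GRAPH          = YES     # header include dependencies
```

`EXTRACT_ALL` matters for undocumented legacy code, since without it Doxygen skips most entities that lack comments.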

Warning: Douml was made from an old free version of BoUML (unfortunately not the last of them). When porting it to Qt4 the team introduced a lot of bugs, and at least because of that the result is unusable. Furthermore, the team didn't work on the plug-out mechanism, so you aren't able to define your own plug-outs, etc. So it is better to get BoUML; it is not free, but the price is very low compared to other UML tools. Zeks, BoUML has automatic layout in class diagrams. My two cents.

Take a look at BOUML, I think that's exactly what you're looking for:
http://www.bouml.fr/screenshots.html

If doxygen is not enough, I'd look into Enterprise Architect for the task. It's not free but it will generate your diagrams and code model. Although, tbh, I think doxygen is exactly what you need, and it's free to boot.
Btw, if you do decide to go the Bouml way (generate code model, then make diagrams by hand), consider picking Douml from SourceForge. Unlike Bouml, it's still free.

C++ function dependency graph [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
C++ code dependency / call-graph “viewer”?
I am working on a huge C++ code base and currently I am stuck with the problem of modularizing my code. I have to divide my code into separate independent modules.
One approach I can think of is to generate the dependency graph and then do a higher-level categorization. Another approach is to start with an entry point (some function abc()) and generate a function call tree in which each node contains the name of the file in which that function resides. Thereafter I can use some script to extract those functions into separate modules.
My question is: is there any tool that can help me do this task? Has anybody faced such a problem earlier? Or can you suggest any approach to achieve the same?
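The second approach from the question (walk the call tree from an entry point and record the file of every function reached) can be sketched as follows. This assumes the call graph has already been extracted by some tool; `FunctionInfo`, `CallGraph`, and both functions are hypothetical names used only for this illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one function in an already-extracted call graph.
struct FunctionInfo {
    std::string file;                  // file in which the function is defined
    std::vector<std::string> callees;  // functions it calls directly
};

using CallGraph = std::map<std::string, FunctionInfo>;

// Depth-first walk collecting every function reachable from `entry`.
void collect_reachable(const CallGraph& graph, const std::string& entry,
                       std::set<std::string>& visited) {
    if (!visited.insert(entry).second) return;  // already seen; also stops on call cycles
    auto it = graph.find(entry);
    if (it == graph.end()) return;              // external function, no definition known
    for (const auto& callee : it->second.callees)
        collect_reachable(graph, callee, visited);
}

// Set of files that a module rooted at `entry` would need.
std::set<std::string> files_for_entry(const CallGraph& graph, const std::string& entry) {
    std::set<std::string> visited;
    collect_reachable(graph, entry, visited);

    std::set<std::string> files;
    for (const auto& name : visited) {
        auto it = graph.find(name);
        if (it != graph.end()) files.insert(it->second.file);
    }
    return files;
}
```

Running `files_for_entry` for several entry points and comparing the resulting sets gives a rough first picture of which files are shared and which belong to a single feature. It says nothing about data dependencies or virtual dispatch, so treat it as a starting point rather than a module boundary.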

The first level of modularization - and I hope you already have done that - is structuring your code in classes. If your code is merely a collection of functions - i.e., C plus some syntactic sugar - then it's high time to rewrite your code, because no amount of dependency-graph-building will save you from maintenance hell.
If you have classes, modularizing should be a matter of finding classes that work closely together (like Customer, Order and Invoice) and separating them from classes that are only loosely coupled with them (like Employer or Facility). Then take it from there.
Modularizing code is something that requires, first and foremost, thought. No automatic procedure is a replacement for that. Actually, from what little you wrote, I would fear that any automated process would make things worse, because apparently there has been too little thought invested in this project already. I.e., you wrote 1 million lines of code without thinking about modularization, and now you want modularization to happen while still not actually thinking about it. You are heading for epic fail.

To get an overview, doxygen might help. But you have to play around a little with the Doxyfile settings to generate dependency graphs, and if your code base is huge you should disable dynamic stuff from the generated methods.
Doxygen can create include, inheritance, call and caller graphs using graphviz.
Here are simple examples but it also works for bigger ones.
But doxygen will only give you an overview and no refactoring capabilities.

I regularly use "Understand for C/C++" to investigate these kinds of dependencies.
If the code base is really huge and you start your modularization from scratch, you might want to look at some other tools, like:
Cytoscape (which can take the output of "Understand for C/C++" to visualize the dependencies)
Lattix

It sounds like you are looking for a refactoring tool. Try taking a look at the answers on this question: Is there a working C++ refactoring tool?

One method will take a while, but what you can do is remove a method and compile to find its dependencies, and then group the dependencies into one component. This does not resolve your issue fully, but it is an approach to start off with.

How are clojure/lisp programs modeled as a diagram? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I've tried wedging my Clojure diagrams into what's available in UML, using class-blocks as the file-level namespaces and dependency links to show relationships, but it's awkward and tends to discourage functional patterns. I've also tried developing ad-hoc solutions, but I can't discover a solution that works as well as UML with, say, Java (simple directed graphs seem to work in a vague manner, but the results aren't detailed enough). Furthermore, I'm not finding anything on the web about this.
Just to be clear, I'm not trying to do anything fancy like code generation; I'm just talking about pen-and-paper diagrams mostly for my own benefit. I'm assuming I'm not the first person to have ever considered this for a lisp language.
What solutions have been proposed? Are there any commonly-used standards? What do you recommend? What tools do you use?

It depends on what you want to describe in your program.
Dependencies
Use class diagrams to model the dependencies between namespaces; in this case, it's more clear if you use packages instead of classes in a diagram.
You can also use class diagrams to model dependencies between actors
Data flow
You can also use Communication Diagrams to model the flow of data in your program. In this case, depict each namespace as an entity and each function as a method of that entity.
Or, in the case of actors, depict each actor as an entity and each message as a method.
In any case, it's not useful to try and describe the algorithm of your program in UML. In my experience, they are better described in comments in the source file.

I think it's less about the language and more about your conceptual model. If you are taking a "stream processing" approach then a data-flow network diagram might be the right approach, as in some of the Scheme diagrams in SICP. If you are taking a more object-oriented approach (which is well supported in Lisp) then UML activity diagrams might make more sense.

My personal thought is to model the flow of the data and not the structure of the code, because from what I've seen of large (not really that large) Clojure projects, the code layout tends to be really boring: a huge pile of composable utilities and one class that threads them together with map, reduce, and STM transactions.
Clojure is very flexible in the model you choose, so you may want to go the other way around: make the diagram first, then choose the parts and patterns of the language that cleanly express the model you built.

Well, UML is deeply rooted in OO design (with C++!), so it will be very difficult to map a functional approach with UML. I don't know Clojure that well but you may be able to represent the things that resemble Java classes and interfaces (protocols?), for all the others it will be really hard.
FP is more like a series of transformations from input to output, there's no clear UML diagram for that (maybe activity diagrams?). The most common diagrams are for the static structure and the interaction between objects, but they aren't really useful for the FP paradigm.
Depending on your goal the component and deployment diagrams can be applicable.

I don't think something like UML would be a good fit for Clojure - UML is rather focused on the object oriented paradigm which is usually discouraged in Clojure.
When I'm doing functional programming I tend to think much more in terms of data and functions:
What data structures do I need? In Clojure this usually boils down to defining a map structure for each important entity I am dealing with. A simple list of fields is often enough in simple cases. In more complex cases with many different entities you will probably want to draw a tree showing the structure of your data (where each node in the tree represents a map or record type)
How do these data structures flow through different transformation functions to get the right result? Ideally these are pure functions that take an immutable value as input and produce an immutable value as output. Typically I sketch these as a pipeline / flowchart.
If you've thought through the above well enough, then converting to Clojure code is pretty easy.
Define one or more constructor functions for your data structures, and write a couple of tests to prove they are working
Write the transformation functions bottom up (i.e. get the most basic operations working and tested first, then compose these together to define the larger functions). Write tests for every function.
If you need utility functions for GUI or IO etc., write them on demand as they are needed.
Glue it all together, testing at the REPL to make sure everything is working.
Note that your source files will typically also be structured in the sequence listed above, with more elementary functions at the top and the higher-level composed functions towards the bottom. You shouldn't need any circular dependencies (that's a bad design smell in Clojure). Tests are critical - IMHO much more important in a dynamic language like Clojure than in a statically typed OOP language.
The overall logic of my code is usually the last few lines of my main source code file.

I have been wrestling with this as well. I find flow charts work great for basic functions and data. It's easy to show the data and data flow that way. Conditionals and recursion are straightforward. UML sequence/collaboration diagrams can capture some of the same info pretty well.
However, once you start using HOF, this does not work well at all.
Normal UML diagrams for packages work ok for namespaces, not that that does much.

C/C++ Header file documentation [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
What do you think is best practice when creating public header files in C++?
Should header files contain no, brief or massive documentation? I've seen everything from almost no documentation (relying on some external documentation) to large specifications of invariants, valid parameters, return values etc. I'm not sure exactly what I prefer, large documentation is nice since you've always access to it from your editor, on the other hand a header file with very brief documentation can often show a complete interface on one or two pages of text giving a much better overview of what's possible to do with a class.
Let's say I go with something like brief or massive documentation. I want something similar to Javadoc where I document return values, parameters etc. What's the best convention for that in C++? As far as I can remember, Doxygen does good stuff with Javadoc-style documentation, but are there any other conventions and tools for this I should be aware of before going for Javadoc-style documentation?

Usually I put documentation for the interface (parameters, return value, what the function does) in the interface file (.h), and the documentation for the implementation (how the function does) in the implementation file (.c, .cpp, .m).
I write an overview of the class just before its declaration, so the reader has immediate basic information.
The tool I use is Doxygen.
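As an illustration of that layout, here is a small header-style sketch using Doxygen's Javadoc-like comment blocks. `BoundedStack` is a hypothetical class invented for the example, and the member functions are defined inline only to keep the sketch self-contained; in a real project the bodies (and any implementation notes) would live in the .cpp file.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

/**
 * @brief Fixed-capacity stack of integers.
 *
 * Class overview placed right before the declaration, so a reader
 * gets the basic picture before looking at individual members.
 */
class BoundedStack {
public:
    /**
     * @brief Creates an empty stack.
     * @param capacity Maximum number of elements the stack may hold.
     */
    explicit BoundedStack(std::size_t capacity) : capacity_(capacity) {}

    /**
     * @brief Pushes a value onto the stack.
     * @param value Element to store.
     * @return true if the value was stored, false if the stack was full.
     */
    bool push(int value) {
        if (items_.size() >= capacity_) return false;
        items_.push_back(value);
        return true;
    }

    /**
     * @brief Removes and returns the top element.
     * @return The most recently pushed value.
     * @throws std::out_of_range if the stack is empty.
     */
    int pop() {
        if (items_.empty()) throw std::out_of_range("pop on empty BoundedStack");
        int top = items_.back();
        items_.pop_back();
        return top;
    }

    /// @brief Number of elements currently stored.
    std::size_t size() const { return items_.size(); }

private:
    std::size_t capacity_;    ///< Upper bound on the number of stored elements.
    std::vector<int> items_;  ///< Elements, with the top of the stack at the back.
};
```

The comments describe the contract (parameters, return values, failure behaviour) and say nothing about how the class works internally, which keeps the header readable as an interface summary.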

I would definitely have some documentation in the header files themselves. It greatly improves debugging to have the information next to the code, and not in separate documents. As a rule of thumb, I would document the API (return values, arguments, state changes, etc.) next to the code, and high-level architectural overviews in separate documents (to give a broader view of how everything is put together; it's hard to place this together with the code, since it usually references several classes at once).
Doxygen is fine from my experience.

I believe Doxygen is the most common platform for generating documentation, and as far as I know, it's more or less able to cover JavaDoc-notation (not limited to of course). I've used doxygen for C, with OK results, I do think it's more suitable for C++ though. You might want to look into robodoc as well, although I think Occam's Razor will tell you to rather go for Doxygen.
Regarding how much documentation, I've never been a documentation-fan myself, but whether I like it or not, having more documentation always beats having no documentation. I'd put the API-documentation in the header file, and the implementation documentation in the implementation (stands to reason, doesn't it?). :) That way, IDEs have the chance to pick it up and show it during autocompletion (Eclipse does this for JavaDocs, for example, perhaps also for doxygen?), which shouldn't be underestimated. JavaDoc has this additional quirk that it uses the first sentence (i.e. up to the first full stop) as a brief description, don't know if Doxygen does this though, but if so, it should be taken into consideration when writing the documentation.
Having a lot of documentation runs the risk of it being out of date. However, keeping the documentation close to the code will give people a chance to keep it up to date, so you should definitely keep it in the source/header files. What shouldn't be forgotten, though, is the production of documentation. Yes, some people will use the documentation directly (through the IDE or whatever, or just reading the header file), but some people prefer other ways, so you should probably consider putting your (regularly updated) API documentation online, all nice and browsable, as well as perhaps producing man-files if you're targeting *nix-based developers.
That's my two cents.

Put enough into the code that it can stand alone. On nearly every project I've been in where the documentation was separate, it got out of date or wasn't done, partly because a separate document becomes a separate task, and partly because management didn't allow for it as a task in budgeting. Documenting inline as part of the flow works much better in my experience.
Write the documentation in a form which most editors recognise as a block; for C++ this seems to be /* rather than //. That way you can fold it and just see the declarations.

Maybe you would be interested in gtk-doc. It can be "a bit awkward to setup and use" but you can get a nice API documentation from source code, looking like this:
String Utility Functions
