Reverse offsetof / Get name of element by offset - C++

Given the offset of an element in a C++ struct, how can I find its name/type without manual counting? This would be especially useful when decoding ASM code where such offsets are regularly used. Ideally the tool would parse a C(++) header file and then give the answer from that. Thanks for any pointers :)

One such tool might be the compiler itself (using the same ABI-relevant flags as were used to generate the code). Create a small program which includes the header file, then prints the result of offsetof applied to each struct's members. You'll then have a suitable look-up table which you can refer to manually, or use as input to another tool you might write.
It may be possible (depending on the complexity of the headers) to auto-generate the program above; you'll probably want to run the header through the C preprocessor first, to expand macros and select the correct branches of conditionals.
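A minimal sketch of such a program, assuming a hypothetical header my_header.h that defines a struct Packet (substitute your real header, and list every struct member you care about):

// offsets.cpp - build an offset look-up table for reading disassembly.
// Compile with the same ABI-relevant flags as the code being decoded.
#include <cstddef>
#include <cstdio>
#include "my_header.h"  // hypothetical header under investigation

#define PRINT_OFFSET(type, member) \
    std::printf("%-16s %-16s %4zu\n", #type, #member, offsetof(type, member))

int main() {
    std::printf("%-16s %-16s %s\n", "struct", "member", "offset");
    PRINT_OFFSET(Packet, header);   // placeholder members: repeat for
    PRINT_OFFSET(Packet, length);   // each member of each struct that
    PRINT_OFFSET(Packet, payload);  // appears in the disassembly
    return 0;
}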

Related

Use doxygen to document and compare includes and their purpose

I want to use doxygen to document my code in high detail. One thing I want to document is the "purpose" of individual includes of other .c/.h files, and, if at all possible, to find out whether the documented "usage" of an include matches reality.
Let's say I have an include #include <some_lib.h> which provides a big set of functions, but I only want (or am only allowed) to use one or two of them; I want to document which ones I actually use. I would like to write something like:
/**
* Used elements:
* MY_MACRO()
* myFunction()
* myGlobal_Var()
*/
#include <some_lib.h>
Then I want Doxygen to produce some sort of documentation for that file using this information, preferably a list containing every included file and the promised items that are used from it.
It would be even better if there were a way to get Doxygen to produce a list of the elements actually used, but there are surely tools out there to do that job.
A reviewer, or a script comparing those two results, could then easily check whether items other than the promised ones are used in the implementation (and whether the used ones are allowed). At the same time, Doxygen would deliver the documentation of the items.
Do you know of a way to achieve this? And/or have I overlooked a built-in Doxygen feature that would bring me close to that goal?
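One built-in mechanism that comes close to the "promised items" list is Doxygen's \xrefitem command, which collects tagged entries onto a generated cross-reference page (the same machinery behind \todo). A hedged sketch, where some_lib_usage is an arbitrary key of my own choosing rather than a predefined name:
/**
 * \file driver.c
 * \xrefitem some_lib_usage "Used element" "Promised usage of some_lib.h" MY_MACRO()
 * \xrefitem some_lib_usage "Used element" "Promised usage of some_lib.h" myFunction()
 */
#include <some_lib.h>
Doxygen then emits one page per key listing every tagged entry. In practice you would wrap the verbose \xrefitem invocation in an ALIASES entry in the Doxyfile; comparing the promised list against actual usage would still need an external tool.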

How do C/C++ compilers know which line an error is on

There is probably a very obvious answer to this, but I was wondering how the compiler knows which line of code my error is on. In some cases it even knows the column.
The only way I can think to do this is to tokenize the input string into a 2D array. This would store [lines][tokens].
C/C++ could be tokenized into one long 1D array, which would probably be more efficient. I am wondering what the usual parsing method is that keeps line information.
Actually, most of this is covered in the Dragon Book.
Compilers do lexing/parsing, i.e. transforming the source code into a tree representation.
When doing so, each keyword, variable, etc. is associated with a line and column number.
However, during parsing the exact origin of a failure might get lost, and the reported information might be off.
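As a toy illustration (not how any production compiler is structured), a tokenizer can simply carry the line and column along with each token it produces; Token and tokenize here are names of my own invention:

#include <cctype>
#include <string>
#include <vector>

// Each token remembers where it came from, so later phases can still
// report "error at line L, column C" after the raw text is gone.
struct Token {
    std::string text;
    int line;
    int column;
};

// Minimal whitespace-separated tokenizer that tracks source positions.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::string cur;
    int line = 1, col = 1;        // current position in the source
    int tokLine = 1, tokCol = 1;  // position where the current token began
    for (char c : src) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (!cur.empty()) {
                tokens.push_back({cur, tokLine, tokCol});
                cur.clear();
            }
            if (c == '\n') { ++line; col = 1; } else { ++col; }
        } else {
            if (cur.empty()) { tokLine = line; tokCol = col; }
            cur += c;
            ++col;
        }
    }
    if (!cur.empty()) tokens.push_back({cur, tokLine, tokCol});
    return tokens;
}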
This is the first step on the long, complicated path towards "Engineering a Compiler" or compiler theory.
The short answer to that is: there's a module called "front-end" that usually takes care of many phases:
Scanning
Parsing
IR generator
IR optimizer ...
The structure isn't fixed, so each compiler will have its own set of modules, but more or less the steps involved in front-end processing are:
Scanning - maps the character stream into words or tokens (also discarding whitespace and comments)
Parsing - this is where syntax and (some) semantic analysis take place and where syntax errors are reported
To sum this up for you: the compiler knows the location of your error because, when something doesn't fit into the structure called the "abstract syntax tree" (i.e. it cannot be constructed), or doesn't follow any of the syntax-directed translation rules, there's something wrong, and the compiler reports the location where the match failed. If the grammar error is on just a single word/token, even a precise column location can be returned, since nothing matched a terminal symbol: a basic token like the if keyword in the C/C++ language.
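Continuing the toy model above, the parser only has to read the stored position back out of the offending token; expect is an illustrative name, not taken from any real front end:

#include <cstdio>
#include <cstdlib>
#include <string>

struct Token { std::string text; int line; int column; };

// Consume-or-complain: if the token isn't the expected terminal,
// report the position it carries, much like a compiler diagnostic.
void expect(const Token& tok, const std::string& wanted) {
    if (tok.text != wanted) {
        std::fprintf(stderr, "error: expected '%s' before '%s' at %d:%d\n",
                     wanted.c_str(), tok.text.c_str(), tok.line, tok.column);
        std::exit(1);
    }
}

int main() {
    Token tok{"while", 3, 5};
    expect(tok, "if");  // prints: error: expected 'if' before 'while' at 3:5
}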
If you want to know more about this topic my suggestion is to start with the classic academic approach of the "Compiler Book" or "Dragon Book" and then, later on, possibly study an open-source front-end like Clang

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new as of version 1.7.8. The foundation of the implementation in v1.7.9.4 is this regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank the requirements. It would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc (-ffunction-sections) to compile so that each function is placed in an 'independent chunk' (its own section) within the generated object code file. The linker can then pull each 'chunk' into a program.
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
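A rough sketch of that normalisation step, assuming AT&T-syntax objdump -d output; normalise and the placeholder tokens %R and 0xADDR are my own choices:

#include <iostream>
#include <regex>
#include <string>

// Rewrite one line of disassembly so register allocation and literal
// addresses don't show up as differences between builds.
std::string normalise(const std::string& line) {
    static const std::regex reg("%[a-z0-9]+");       // AT&T register names
    static const std::regex addr("0x[0-9a-fA-F]+");  // literal addresses
    return std::regex_replace(std::regex_replace(line, reg, "%R"),
                              addr, "0xADDR");
}

int main() {
    std::string line;
    while (std::getline(std::cin, line))
        std::cout << normalise(line) << '\n';
}

Pipe the objdump output through this before sorting or diffing.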
So it might be interesting to see if two alternative implementations of the same function (i.e. from different change sets) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines) and ignore the contents of text strings, treating program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function, to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then the parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
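A deliberately dumb sketch of that idea: track brace and parenthesis nesting, and print the identifier that precedes a (...) { pair at file scope. It ignores comments, strings, and macros entirely, which is acceptable here since some misses are tolerable:

#include <cctype>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::string src((std::istreambuf_iterator<char>(std::cin)),
                    std::istreambuf_iterator<char>());
    int braces = 0, parens = 0;  // '{' and '(' nesting levels
    std::string candidate;       // last identifier seen at file scope
    bool sawParens = false;      // did a "(...)" group follow it?
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (braces == 0 && parens == 0 &&
            (std::isalpha(static_cast<unsigned char>(c)) || c == '_')) {
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) ||
                    src[j] == '_'))
                ++j;
            candidate = src.substr(i, j - i);
            sawParens = false;
            i = j - 1;
        } else if (c == '(') {
            if (braces == 0 && parens == 0 && !candidate.empty())
                sawParens = true;
            ++parens;
        } else if (c == ')') {
            if (parens > 0) --parens;
        } else if (c == '{') {
            if (braces == 0 && parens == 0 && sawParens)
                std::cout << candidate << '\n';  // likely a function definition
            ++braces;
        } else if (c == '}') {
            if (braces > 0) --braces;
        } else if (braces == 0 && parens == 0 && c == ';') {
            candidate.clear();  // a declaration, not a definition
            sawParens = false;
        }
    }
}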
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for a parser generator.
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be published student projects which have made a reasonable stab at a parser.

C++ Win32 Replace Strings in an Executable

I'm looking for a good way to replace several strings inside a native win32 compiled exe. For example, I have the following in my code:
const char *updateSite = "http://www.place.com";
const char *updateURL = "/software/release/updater.php";
I need to modify these strings with other arbitrary length strings within the exe. I realize I could store this type of configuration elsewhere, but keeping it in the exe meets the portability requirements for my app. I would appreciate any help and/or advice on the best way to do this.
Thanks!
Update: I found some code in the Metasploit project that seems to do this:
MSF:Util:Exe
I would not mess around with the EXE itself; if you really need one file, then do the old zip-append trick and put your config in there.
Could look like this:
> BINARY DATA
> ZIP FILE DATA
> 32-bit unsigned int whose value is the size of the appended zip file
Pros:
easy to extend / maintain
you don't mess with the exe itself
you can put lots of stuff in there
Cons:
You need to link some compression lib
If you don't want to zip it, then just roll your own simple uncompressed archive format.
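A minimal sketch of reading the appended data back at run time, assuming exactly the layout above ([EXE][payload][4-byte little-endian size]); readAppendedPayload is a name of my own choosing:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Read the blob appended after the linker-produced image.
std::vector<char> readAppendedPayload(const std::string& exePath) {
    std::ifstream f(exePath, std::ios::binary);
    f.seekg(-4, std::ios::end);                // trailing size field
    std::uint32_t size = 0;
    f.read(reinterpret_cast<char*>(&size), 4); // assumes little-endian host
    f.seekg(-static_cast<std::streamoff>(size) - 4, std::ios::end);
    std::vector<char> payload(size);
    f.read(payload.data(), size);
    return payload;
}

Call it with the running executable's own path (e.g. obtained via GetModuleFileName) and hand the bytes to your unzip library.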
In a PE file there is a global relocation table: a list of addresses (for example, of global variables or constants that must be fixed up at load time, like, say, strings) that are altered by the PE loader. If you knew which entry this particular variable was, you could get its address and then alter it manually. However, this would be a total bitch, and you'd need an in-depth knowledge of your favourite compiler and the PE format. Easier just to use XML or Lua or something else that's totally portable - they were invented for exactly this kind of purpose.
Edit:
Why not just use a const char**? Is there something wrong with this being a normal runtime variable?
IMO the best place to store those strings is in a string table resource. It's incorporated into your .EXE file, so portability will not be compromised.
Use the Visual Studio resource editor to alter the values.
Use the LoadString WinAPI function or, better, the CString::LoadString method in your code to load the values.
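A minimal sketch of the plain-WinAPI route; IDS_UPDATE_SITE is a placeholder resource ID that would normally come from resource.h, backed by a STRINGTABLE entry in the .rc file:

#include <windows.h>
#include <iostream>

#define IDS_UPDATE_SITE 101  // placeholder; normally defined in resource.h

int main() {
    char site[256] = {};
    // Fetch the string from the running module's string table resource.
    if (LoadStringA(GetModuleHandleA(nullptr), IDS_UPDATE_SITE,
                    site, sizeof(site)) > 0) {
        std::cout << "update site: " << site << '\n';
    }
    return 0;
}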
There's also third-party software that allows you to modify the strings in the compiled .EXE without recompilation.

Methods for parsing C++ defines?

I'm building a driver in C++ which relies heavily on another dll (surprise!). This dll comes with a header which contains a huge list of defines. Each define represents different return, message and error codes.
Each of these sets of defines has a prefix which differentiates it from the others. Something like:
MSG_GOOD... MSG_BAD... MSG_INDIFFERENT...
RETURN_OK... RETURN_OMG.. RETURN_WHAT...
But there are hundreds of them, which makes things rather hard to debug.
Ideally, I'd like to find a way to get some kind of reasonable log output with plain text. I was thinking it might be possible to parse the H file at run time, but a less complex compile time option would be much better.
I think you'll have to parse the .h file at one point or another, because once compiled, the #defines won't be anywhere in the code any more. Parsing .h files just for the #defines isn't too hard, though: read in full lines (minding continuation backslashes at line ends), ltrim them, and if a line begins with "#define " then split the remainder at the first whitespace, trim both parts, stuff them in a key/value container, and you're essentially done. Shouldn't be too hard even in C++.
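A rough sketch of that run-time approach, limited to object-like #defines with numeric values and a given prefix; loadDefines and the prefix filtering are my own framing:

#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Scan a header for "#define PREFIX_NAME value" lines and build a
// value -> name table, so numeric codes can be logged by name.
// Function-like macros and continuation lines are skipped for simplicity.
std::map<long, std::string> loadDefines(const std::string& headerPath,
                                        const std::string& prefix) {
    std::map<long, std::string> table;
    std::ifstream in(headerPath);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string hash, name, value;
        if (!(ss >> hash >> name >> value) || hash != "#define")
            continue;
        if (name.compare(0, prefix.size(), prefix) != 0)
            continue;
        if (name.find('(') != std::string::npos)  // function-like macro
            continue;
        try {
            table[std::stol(value, nullptr, 0)] = name;  // base 0: 0x... too
        } catch (...) { /* value wasn't a plain number; skip it */ }
    }
    return table;
}

Then loadDefines("vendor.h", "MSG_") - with vendor.h standing in for the DLL's real header - gives a table where table[code] yields the readable name when writing the log.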