Sometimes I need to know the size of a struct that is not in scope (not even on the stack, so frame-related commands won't help). This happens, for example, when debugging client/server communication: restarting the apps just to break somewhere in the context of the struct, only to find its size, is inconvenient and time-consuming.
How do I find the size of a struct defined in a header, regardless of my current context?
For C, gdb's "expression language" is just ordinary C expressions, with a few handy extensions for debugging. This is less true for C++, primarily because C++ is much more difficult to parse, so there the expression language tends to be a subset of C++ plus some gdb extensions.
So, the short answer is you can just type:
(gdb) print sizeof(mystruct)
However, there are caveats.
First, gdb's current language matters. You can find this with show language. In the case of a struct type, in C++ there is an automatic typedef, but in C there is not. So if you are using the auto language (and you usually should), and are stopped in a C frame, you will need to use the keyword:
(gdb) print sizeof(struct mystruct)
Now, this still may not work. The usual reason at this point is that the structure isn't used in your program, and so doesn't show up in the debug info. The debug info can be omitted even if you think it ought to have been available, because what gets emitted is up to the compiler. For example, if a struct is only used in sizeof expressions (and no variable of that type is ever defined), then I think (hard to remember for sure) that GCC won't emit DWARF for it.
You can check to see if the type is available using readelf or dwgrep, like:
$ readelf -wi myexecutableorlibrary | grep mystruct
(Though in real life I usually use less and then examine the DWARF DIEs carefully. You will need to know a little DWARF to make sense of this.)
Sometimes in gdb it's handy to use the "filename" extension to specify exactly which entity you mean. Like:
(gdb) print 'myfile.c'::variable
Not sure if that works for types, and anyway it shouldn't usually be necessary for them.
In C/C++, you have the sizeof operator, which will give you the size of any type (including a struct) or variable.
I'm not sure if you can apply this while debugging but you could simply have a test program with the same headers (type definitions) tell you what the size of your types is.
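A minimal sketch of such a test program is shown here. The struct mystruct is a hypothetical stand-in; in practice you would include the real shared header instead, and build with the same compiler, flags and target as the program being debugged, since padding can differ between them. The helper name report_size is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for a type from the shared header; replace this
// definition with the real include (e.g. "protocol.h", an assumed name).
struct mystruct {
    std::uint32_t id;
    std::uint16_t flags;
    char          name[10];
};

// Prints "sizeof(<label>) = <n>" and returns n so callers can reuse it.
template <typename T>
std::size_t report_size(const char* label) {
    assert(label != nullptr);  // a label is required for readable output
    std::cout << "sizeof(" << label << ") = " << sizeof(T) << '\n';
    return sizeof(T);
}
```

Calling report_size<mystruct>("mystruct") from main prints the size the compiler actually chose, padding included.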
I was debugging a program written in Rust using GDB (arm-none-eabi-gdb). At one point, I wanted to write to a memory address as follows:
(gdb) set *((int *) 0x24040000) = 0x0000CAFE
syntax error in expression, near `) 0x24040000) = 0x0000CAFE'.
After multiple attempts, I found out that I was casting C-style and had to cast Rust-style instead, as follows:
set *(0x24040000 as *mut i32) = 0x0000CAFE
My question is how GDB interprets the different commands and why I get this error. Is it because the symbol (int) is not recognized? If so, how does gdb load the symbols? Does gdb need to compile the expression in the language of the binary running on the target?
Yes, it depends on the language, and the language is deduced from the filename of the loaded source file.
Quoting the manual:
print and many other GDB commands accept an expression and compute its value. Any kind of constant, variable or operator defined by the programming language you are using is valid in an expression in GDB. This includes conditional expressions, function calls, casts, and string constants.
And:
If you are not interested in seeing the value of the assignment, use the set command instead of the print command. set is really the same as print except that the expression’s value is not printed and is not put in the value history (see Value History). The expression is evaluated only for its effects.
And:
Language-specific information is built into GDB for some languages, allowing you to express operations like the above in your program’s native language, and allowing GDB to output values in a manner consistent with the syntax of your program’s native language. The language you use to build expressions is called the working language.
And:
There are two ways to control the working language—either have GDB set it automatically, or select it manually yourself. You can use the set language command for either purpose. On startup, GDB defaults to setting the language automatically.
[..] most of the time GDB infers the language from the name of the file.
For a pretty-printer that I'm writing, I would like to know the alignment of the type which is used in a container. Unfortunately, using alignof() or any similar "standard" operator doesn't work (https://sourceware.org/bugzilla/show_bug.cgi?id=17095). The "typical" macro tricks that work directly in source code don't work either:
p ((char *)(&((struct { char c; double _h; } *)0)->_h) - (char *)0)
A syntax error in expression, near `{ char c; double _h; } *)0)->_h) - (char *)0)'.
Is that possible at all, or maybe the only way is to have that supported by GDB internally?
There's no way to get this information, because currently gdb does not have it.
Before DWARF version 5, there was no standard way to express alignment in the debug info. DWARF 5 added DW_AT_alignment, but gdb still simply ignores this attribute; to expose it via the Python API would require reading it and storing it in gdb's internal struct type. I don't know offhand whether compilers emit this attribute yet.
If you were very desperate you could do this either using the gdb compile feature or by running the compiler yourself, and having it emit the alignment in a way that can be extracted.
However, normally alignment is not too difficult to compute from the relevant type sizes, and if your target architectures are relatively limited then it's probably simpler to just roll your own alignment computer.
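One way to have the compiler itself report the alignment is to write the macro trick from the question as ordinary C++ and build it into the program or a helper. The sketch below is only an illustration: the names alignment_probe, probed_alignment and check_probe are invented here, and it assumes the compiler inserts the minimum padding before the second member, which common compilers do.

```cpp
#include <cassert>
#include <cstddef>

// Classic probe: a char followed by a T. The compiler pads after `c`
// so that `h` is suitably aligned; the offset of `h` is then the
// alignment the compiler applies to a T member inside a struct.
template <typename T>
struct alignment_probe {
    char c;
    T    h;
};

// Alignment of T as observed through the probe layout.
template <typename T>
constexpr std::size_t probed_alignment() {
    return offsetof(alignment_probe<T>, h);
}

// A char needs no padding, so the probe must report 1.
static_assert(probed_alignment<char>() == 1, "char is byte-aligned");

// Runtime cross-check against the standard operator for one type.
inline void check_probe() {
    assert(probed_alignment<int>() == alignof(int));
}
```

A pretty-printer could read such a value from the inferior, or the same computation could be mirrored in the printer from the member sizes for the few targets that matter.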
I've spent a good deal of time studying TOC and compiler design. I'm not done yet, but I feel comfortable with the concepts. On the other hand, I have only a very shallow knowledge of assembly and machine code, and I have always had the desire/need to connect the two sides (the HLL and LLL representations of the code), as I'm learning C++ while paying great attention to performance and optimization discussions.
C++ is a statically typed language:
My question is: when our variables are written as expressions in the statements of the code, do all these variables (and other entities with identifiers) become, at runtime, mere instructions addressing positions in virtual memory (for statics and globals) or addresses relative to the stack (for local variables)?
I mean, after a successful compilation including semantic and syntactic verification, isn't it wise to deal with data at runtime as guaranteed entities of target memory bytes, without any thought of identifiers or any checking, with the symbol table no longer needed?
If my question appears to be the kind that comes from a lack of learning effort (which I hope it doesn't), please just tell me so, and tell me where to read. If that is the case, it's honestly because I'm concentrating on C++ nowadays and haven't yet had the chance to gain a sound knowledge of low-level languages; I apologize for that in advance.
You're spot on. Once compiled to machine code, there is no longer any notion of a variable identifier (or variable type, for that matter). It's just bytes at a certain location. Which location that is was determined by the compiler (when compiling) based on the variable name, or by the linker (when linking) in the case of global variables.
Of course, it can be useful to retain information such as identifiers, for debugging purposes. This is precisely what "compilation with debug information" means: when you do that, the compiler will somehow embed the (redundant) identifiers into the generated code such that a debugger can access them. Or put them in a separate file alongside; the details of that depend on the format of the debugging information.
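To make the "just bytes at a certain location" point concrete, here is a small sketch. The names counter, load_i32 and demo_bytes are invented for illustration. The helper reads a 32-bit value given nothing but an address, which is roughly all that generated code works with.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

std::int32_t counter = 0xCAFE;  // a global: its address is fixed at link/load time

// Reads a 32-bit value knowing only an address; no identifier is involved.
inline std::int32_t load_i32(const void* address) {
    std::int32_t value;
    std::memcpy(&value, address, sizeof value);  // copy the raw bytes
    return value;
}

inline void demo_bytes() {
    std::int32_t local = 42;               // a local: lives somewhere in the stack frame
    assert(load_i32(&local) == 42);        // same helper works for stack storage
    assert(load_i32(&counter) == 0xCAFE);  // and for static storage
}
```

The helper never learns whether it was handed a global or a local; the distinction exists only in the source and in any debug information.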
Yes, mostly. There are a few details that will make identifiers remain more than just addresses or stack offsets.
First, we have RTTI in C++, which means that at runtime the names of types, at least, may still be available. For example:
const std::type_info &info = typeid(*ptr_interface);
std::cout << info.name() << std::endl;
would print an implementation-defined name for whatever type *ptr_interface has at runtime.
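A self-contained version of that snippet might look like the following sketch. Interface, Implementation, dynamic_type_name and demo_rtti are hypothetical names chosen for this example. Note that the string returned by name() is implementation-defined; GCC and Clang, for instance, return a mangled name.

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

// typeid only reports the dynamic type for polymorphic classes,
// so the base needs at least one virtual function.
struct Interface { virtual ~Interface() = default; };
struct Implementation : Interface {};

// Returns the implementation-defined name of the object's dynamic type.
inline std::string dynamic_type_name(const Interface& obj) {
    return typeid(obj).name();
}

inline void demo_rtti() {
    Implementation impl;
    const Interface& ref = impl;                    // static type: Interface
    assert(typeid(ref) == typeid(Implementation));  // dynamic type is still known
}
```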
Second, due to the way a program is linked, the symbols from the object files may still be present in the executing image. For example, the Linux kernel makes use of this: it can produce a stack backtrace that includes function names. It also uses knowledge of function names in order to load and link modules. Similar functionality exists in the GNU C library, which, when the program is linked for it, is able to retrieve function names in stack traces.
In normal cases, though, the code will not be affected by the original names of the variables (but the compiler will of course emit code suitable for the type each variable has).
Is it possible (that is, could be easily accomplished with commodity tools) to reconstruct a C++ class declaration from a .so file (non-debug, non-x86) — to the point that member functions could be called successfully as if the original .h file was available for that class?
For example, by trial and error I found that this code works when 64K are allocated for instance-local storage. Allocating 1K or 8K leads to SEGFAULT, although I never saw offsets higher than 0x0650 in disassembly and it is highly unlikely that this class actually stores much data by itself.
class TS57 {
private:
    char abLocalStorage[65536]; // Large enough to fit all possible instance data.
public:
    TS57(int iSomething);
    ~TS57(void);
    int SaveAsMap(long (*)(long, long, long, long, long), char const*, char const*, char const*);
};
And what if I needed more complex classes and usage scenarios? What if allocating 64K per instance would be too wasteful? Are there simple tools (like nm or objdump) that may give insight on a .so library's type information? I found some papers on SecondWrite — an “executable analysis and rewriting framework” which does “recovery of object oriented features from C++ binaries”, — but, despite several references in newsgroups, could not even conclude whether this is a production-ready software, or just a proof of concept, or a private tool not for general use.
FYI. I am asking this totally out of curiosity. I already found a C-style wrapper for that function that not only instantiates the object and calls its method, but also conveniently does additional housekeeping. Moreover, I am not enthusiastic to write for G++ 2.95 — as this was the version that library was compiled by, and I could not find a switch to get the same name mangling scheme in my usual G++ 3.3. Therefore, I am not interested in workarounds: I wonder whether a direct solution exists — in case it is needed in the future.
The short answer is "no, a machine can't do that".
The longer answer is still roughly the same, but I'll try to explain why:
Some information about functions would be available from nm libmystuff.so|c++filt, which will demangle the names. But that will only show functions that have a public name, and it will most likely still be a bit ambiguous as to what the data-types actually mean.
During compilation, a lot of "semantic information"[1] is lost. Functions are inlined, loops are transformed (loops made with for, do-while and while, even goto in some cases, are made to look almost identical), conditions are compiled out or rearranged, variable names are lost, and much of the type information and the enum names are completely lost. Whether class fields are private or public is lost too.
The compiler will also do "clever" transformations on the code to replace complex instructions with less complex ones (int x; ... x = x * 5 may become lea eax, [eax*4 + eax] or similar). This one is pretty simple: try figuring out "backwards" how the compiler solved a population count (the number of bits set in a binary number) or a cosine when it has been inlined...
A human who knows what the code is MEANT to do, and who has good knowledge of the machine code of the target processor, MAY be able to reverse engineer the code and break out functions that have been inlined. But it's still hard to tell the difference between:
void Foo::func()
{
    this->x++;
}
and
void func(Foo* p)
{
    p->x++;
}
These two functions should become exactly identical machine-code, and if the function does not have a name in the symbol table, there is no way to tell which it is.
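The two forms can be put side by side in a compilable sketch. The struct layout and demo_equivalence are assumptions made for illustration. On common ABIs, where this is passed like an ordinary first pointer argument, both functions typically compile to the same instructions; that can be checked by comparing the disassembly of the two symbols.

```cpp
#include <cassert>

struct Foo {
    int x = 0;
    void func();  // member function: receives `this` implicitly
};

void Foo::func()
{
    this->x++;
}

void func(Foo* p)  // free function: the pointer is passed explicitly
{
    p->x++;
}

inline void demo_equivalence() {
    Foo a, b;
    a.func();    // member call syntax
    func(&b);    // free-function call syntax
    assert(a.x == 1 && b.x == 1);  // same observable effect
}
```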
[1] Information about the "meaning" of the code.
In the C++ tag wiki, it is mentioned that
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language.
Can someone please explain the terms "statically typed" and "free-form"?
Thanks.
A statically-typed language is a language where every variable has a type assigned to it at compile-time. In C++, this means that you must tell the compiler the type of each variable - that is, whether it's an int, or a double, or a string, etc. This contrasts with dynamically-typed languages like JavaScript or PHP, where each variable can hold any type, and that type can change at runtime.
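As a small illustration of types being fixed at compile time, consider the sketch below. Widget and use_widget are hypothetical names. The static_assert queries the type of an expression before the program ever runs, and the commented-out call shows the kind of mistake the compiler rejects.

```cpp
#include <cassert>
#include <type_traits>

struct Widget {
    int bar() const { return 42; }
};

// The type of every expression is known to the compiler and can be queried.
static_assert(std::is_same<decltype(Widget{}.bar()), int>::value,
              "bar() is known to return int at compile time");

inline int use_widget() {
    Widget w;
    // w.foo();  // uncommenting this fails to compile: Widget has no member foo
    int result = w.bar();
    assert(result == 42);  // runtime sanity check on the value, not the type
    return result;
}
```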
A free-form language is one where there are no requirements about where various symbols have to go with regard to one another. You can add as much whitespace as you'd like (or leave out any whitespace that you don't like). You don't need to start statements on a new line, and can put the braces around code blocks anywhere you'd like. This has led to a few holy wars about The Right Way To Write C++, but I actually like the freedom it gives you.
Hope this helps!
"Statically typed" means that the types are checked at compile-time, not run-time. For example, if you write a class that does not have a foo() method, then you'll get a compile-time error if you try to call foo() on an object of that class. In dynamically-typed languages (e.g. Ruby), you would still get an error, but only at run-time.
"Free-form" means that you can use whitespace however you want (i.e. write the whole program on one line, use uneven indenting, put lots of blank lines, etc.). This is in contrast to languages like Python where whitespace is semantically significant.
Statically typed: the compiler knows what the types of all variables are. In contrast to languages like Python and Common Lisp, where the types of variables can change at runtime.
Free-form: no specific whitespace requirements. This is in contrast to old-style FORTRAN and COBOL, so I'm not sure how useful this designation is anymore.