Binary file and compatibility standard information - C++ / Java

I am reading the Wikipedia article on the differences between Java and C++. One difference is that C++ offers 'multiple binary compatibility standards'. Could you explain what this means, or point me at a good reference? My guess is that it means a binary 'written with' C++ is very portable and can be used on any OS or environment. I would like confirmation and more precision. What is it all about?
How are binaries generated? What makes them not portable?
Thanks and regards.

What does int mean, exactly? When calling a function with two parameters, do you push the first one onto the stack first, or last? Or do you put a structure on the heap and pass a pointer to it? Do you allow an unknown number of arguments to be passed into a function? How do you treat strings and arrays? Do you allocate on the stack, on the heap, or in a private memory block? Do you mangle function names (to allow for overloads) or use them exactly as they appear in the source? Do you align members in a structure on an 8-bit, 16-bit, or 32-bit boundary?
All those questions (and many more) make a great deal of difference to how one binary calls another, and the answers are rarely simple.
Java doesn't offer much in terms of how exactly the binary layout is done (it runs on a VM, after all), while C++ offers great flexibility to accommodate almost any imaginable requirement out there; thus it 'offers multiple binary compatibility standards', unlike Java (in your example).
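To make one of those choices concrete, here is a small C++ sketch (the struct and its name are made up for illustration) showing how the ABI's alignment rules leak into a binary's layout:

#include <cstddef>
#include <cstdio>

// One of the questions above, made concrete: where does 'i' land?
struct Sample {
    char c;   // 1 byte
    int  i;   // alignment decided by the ABI
};

int main() {
    // A typical ABI aligns int on a 4-byte boundary, so three padding
    // bytes follow 'c' and this prints 4 and 8; an ABI packing members
    // on 1-byte boundaries would print 1 and 5 instead.
    std::printf("offsetof(Sample, i) = %zu, sizeof(Sample) = %zu\n",
                offsetof(Sample, i), sizeof(Sample));
    return 0;
}

Two binaries built with different answers to just this one question cannot safely pass a Sample to each other.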


Converting endianness of struct data

What I have is:
a hex file with the bytes of a C struct in it, ordered in big-endian
the struct definition as a *.h file
the struct information as DWARF2 debug info
My application has to be written in C / C++. Intermediate scripts using, for example, Python would be OK.
What I have to do is read the bytes of the hex file and cast them into the struct type on a system that is little-endian.
During this process, I will have to reverse the bytes of each struct member.
The obvious solution would be to write a conversion function that does byte swapping for each struct member, but since the struct has multiple layers and ~1200 members that change faster than I can update my conversion function, writing it by hand is not an option.
So I could generate the conversion function automatically by:
finding and parsing the types inside multiple *.h files
iterating over the members of all struct types and generating swaps for them (without some sort of reflection API, not that easy)
loading the struct via the conversion function.
Since this solution seems like quite a bit of work, I was wondering if there is an easier way, like telling the compiler to swap the bytes or using the debug info somehow.
Does anybody know a trick that might help in this case?
Thanks and greetings!
Remark:
Changing any of the processes leading to this, changing the input conditions, or delegating responsibilities to the other developers involved is not possible.
Changing anything about the hex file as an input is not possible. This file comes out of another system that will not be changed to fix this problem here.
Padding, datatype sizes, etc. are identical; this is ensured by other measures too. So endianness is definitely the only problem. This is also why I see no reason against using the DWARF2 info to identify the bytes of every struct member.
I agree that the layout of the struct is very bad, but there are reasons why it is that way, and in short, I cannot (and am not allowed to) change it because of process reasons and backwards compatibility.
To give some more scope:
The software that all of this is used in is deployed to multiple different embedded devices (of multiple types). The hex file contains the calibration information of the software and is thus stored in a specific system that can only output this hex file.
I am now porting the software to a little-endian device, and I have to use the hex file produced by the "main" branch of the software, which is big-endian, as input.
There is no way to tell a C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to implement automatic generation of the conversion code.
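To illustrate, here is a minimal sketch of what such generated code could look like, with a hypothetical two-member struct standing in for the real ~1200-member one:

#include <cstdint>

// Hypothetical stand-in for the real calibration struct.
struct Calib {
    std::uint32_t gain;
    std::uint16_t offset;
};

static std::uint16_t bswap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v >> 8) | (v << 8));
}

static std::uint32_t bswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// The shape of function a generator would emit: one swap per member,
// recursing into nested structs and looping over arrays.
void calib_to_host(Calib* c) {
    c->gain   = bswap32(c->gain);
    c->offset = bswap16(c->offset);
}

The generator's only job is then to keep the body of calib_to_host in sync with the struct definition.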
The problem, as far as I understand it, is tricky but tractable. Since the data extraction won't be running on an embedded device, it won't be resource constrained. I say: embrace the runtime inefficiency that desktop hardware allows, and go for easy to debug instead.
Instead of thinking of the source file as "almost what I need, modulo a couple of minor adjustments", think of it as a generic binary file with an open-ended, evolving schema. The schema description is the DWARF data.
What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Loop over the compile units (CUs). In each CU, loop over the top-level entries (DIEs). Look for a DW_TAG_structure_type DIE with the specific value of DW_AT_name (I hope the struct name is known in advance). Then go through its DW_TAG_member sub-DIEs. DW_AT_data_member_location will give you each member's offset, letting you work around the padding. Look at DW_AT_type to determine the member's type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-typed members as necessary.
From that, generate a format string for the struct.unpack function; it can read big-endian ints seamlessly. Then use struct.pack to write the values back out in whatever format the C++ consumer expects.
This depends on you being able to match the data file to the DWARF info of the generating executable, exactly the same build. I hope the processes of the organization allow for that.
Recent versions of GCC allow declaring the desired endianness, irrespective of the target platform, for a section of source code using the pragma scalar_storage_order, or for a specific type using an attribute of the same name. The main catch: g++ does not support this, only gcc. Also, it won't work in all cases; for example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you're OK with sticking to C for struct access (it all depends on your current codebase), this is not an option.
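A minimal sketch of the pragma, assuming GCC 6 or later and plain C (the struct and function names are made up):

/* Compiled with gcc, not g++, as noted above. */
#pragma scalar_storage_order big-endian
struct calib_be {
    unsigned int   gain;    /* stored big-endian in memory */
    unsigned short offset;
};
#pragma scalar_storage_order default

/* Plain member reads are byte-swapped transparently on a little-endian
   target; taking the address of such a member is an error. */
unsigned int read_gain(const struct calib_be *c) {
    return c->gain;
}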
The persistence layout is based on the original struct layout; so be it. However, a more explicit approach to serializing the structs should be preferred, for exactly the reason you bring this up. Besides the endianness issue, struct packing also affects compatibility and should be specified explicitly. For persistence, a packing of 1 would be optimal; for in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) on 64-bit Linux vs. Windows: LP64 vs. LLP64). So keeping the persistence layout separate from the in-memory data structures tends to have a long list of advantages, and therefore usually outweighs the disadvantage of having to maintain the serialization code separately, particularly if portability is a major concern.
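As a minimal sketch of such a separate persistence layout, assuming a compiler that supports pack pragmas (GCC, Clang, and MSVC do); the struct is hypothetical:

#include <cstdint>

#pragma pack(push, 1)
struct CalibRecord {
    std::uint32_t gain;    // always 4 bytes, LP64 or LLP64 alike
    std::uint16_t offset;  // always 2 bytes, no trailing padding
};
#pragma pack(pop)

static_assert(sizeof(CalibRecord) == 6, "persistence layout must not drift");

Serialization code then copies between CalibRecord and the in-memory struct, and only that code needs to know about endianness.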
You could take advantage of C/C++ reflection libraries or implement one yourself. In the case of C, this will definitely require macros (e.g. Metaresc). In the case of C++, you might actually get away with your original struct definitions (e.g. Boost.PFR, a.k.a. Precise and Flat Reflection).
If reflection is not an option, you could generate the serialization code by parsing either the headers or the debug symbols. Generally, parsing C/C++ is more complex. By moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. To make things easier still, you could process the output of gdb's ptype command, which works from the debug symbols. Or you could parse the debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).
Rather than generating the serialization code as part of every build, you could generate it once and require updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. Doing so would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input; it would only have to be good enough for one-time generation.

gcc - how to detect pointer-based memory access

I'm focusing on MicroPython, specifically the dynamic-native-modules branch.
This feature will, in the future, allow you to compile a C/C++ function into a native .obj and package it together with a .py interface for a huge speed boost.
Awesome! But the issue is that if you're using an RTOS, which doesn't have virtual memory, any executing native code can access any part of the address space, including peripherals, the RTOS's state, etc.
You don't want the user to be able to do something like this:
void user_func()
{
    /* point to arbitrary memory, potentially the reset registers, flash erase . . . you get the point */
    int *a = (int *)0x1234;
    *a = 0x10110000; // DESTROY!!!
}
Even the following should be disallowed:
void user_func()
{
    int a;
    *(int *)(&a - 1000) = 0x10010111; /* out-of-bounds write via pointer arithmetic */
}
SOLUTIONS?
Create my own version of gcc (one for each binary format)
Decompile the .obj files and detect the use of pointers (again, one for each binary format)
FEEDBACK TO COMMENTS
I get that it may be impossible to stop a malicious user, but that's not the #1 worry. We want to stop well-meaning but accidental code. If it's not possible to stop every single case, that's OK.
If we can prohibit/detect explicit pointer accesses and simply provide warnings regarding array use, that is still very valuable.
WARNING: YOU'RE USING AN ARRAY! MAKE SURE YOU DON'T GO OUT-OF-BOUNDS
Your best chance is a GCC plugin which looks at the frontend-generated GENERIC or GIMPLE IRs and implements the policies you want. Depending on the policies and the source code you want to accept, this could be a lot of work and very difficult.
If you want a purely syntax-based or type-based approach (simply rejecting all pointer arithmetic), Clang with its ASTs is easier to work with than GCC.
There could be a way of doing what you want, as long as you are protecting against honest mistakes, not deliberate attempts to break things.
First, you embrace the C++ Core Guidelines and use support from the tools: https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-tools
For your purposes, this will stop your code from using raw pointers and pointer arithmetic in non-library code. Period. The Core Guidelines do more than that, but this is the part related to your question.
Then you run a simple grep over the code to make sure it only uses the pointer facilities blessed by the Guidelines; that really is a more or less simple grep.
In practice, it is not reasonably possible.
You might consider changing the compilation (e.g. with a GCC plugin, as mentioned in Florian Weimer's answer) to check every access to arrays, but that makes the generated code significantly slower. Your hardware is slow enough already; you don't want to make it even slower.
And Python is not exactly the right source language; its dynamic typing alone would make it quite slow.
Perhaps you might consider a statically typed language like OCaml (combined with a JIT or AOT compilation library such as GCCJIT). The type system (and its inference) increases the safety (and the speed) of the generated code, and with a lot of hard work (several years, probably worth a PhD, since you'd need to do new research) you might improve the type inference to even infer the cases where an array index can never be out of bounds (so the boundary check is not needed at all).
In most cases, upgrading the hardware (to something Raspberry Pi-class, with an MMU and probably the ability to run a real OS with virtual memory) is probably the most pragmatic approach.
PS. Be aware of Rice's theorem: most static source analysis cannot work reliably and well (be sound and complete, in technical terms).

Is runtime interpreter really part of C program execution?

As we know, C is a compiled language. The Wikipedia article on the C language says:
It was designed to be compiled using a relatively straightforward compiler, to provide low-level access to memory, to provide language constructs that map efficiently to machine instructions, and to require minimal run-time support. It also says that, by design, C provides constructs that map efficiently to typical machine instructions, and therefore it has found lasting use in applications that had formerly been coded in assembly language, including operating systems, as well as various application software for computers ranging from supercomputers to embedded systems.
But then I read the following in Thinking in C++, Volume 2 by Bruce Eckel, Chapter 2, titled "Iostreams" (I've omitted some parts):
The big stumbling block is the runtime interpreter used for the variable-argument list functions. This is the code that parses through your format string at runtime and grabs and interprets arguments from the variable argument list. It's a problem for four reasons.
Because the interpretation happens at runtime there's a performance overhead you can't get rid of. It's frustrating because all the information is there in the format string at compile time, but it's not evaluated until runtime. However, if you could parse the arguments in the format string at compile time you could make hard function calls that have the potential to be much faster than a runtime interpreter (although the printf() family of functions is usually quite well optimized).
This link also says:
More type-safe: with <iostream>, the type of the object being I/O'd is known statically by the compiler. In contrast, cstdio uses "%" fields to figure out the types dynamically.
Before reading this I thought that no interpreter is involved in a compiled language like C. Is it really true that a runtime interpreter is at work during the execution of a C program? Was I wrong before reading this? Is the overhead of this runtime interpretation really that large compared to iostreams?
What?
This is not runtime interpretation of your code; it happens only inside functions that take format strings.
Of course those functions have to loop over the format string to learn about the arguments and the desired formatting, and that takes time.
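A small sketch of the contrast (standard C++, nothing version-specific):

#include <cstdio>
#include <iostream>

int main() {
    // printf() scans "%d %s" character by character at run time to
    // discover how many arguments follow and how to interpret them.
    std::printf("%d %s\n", 42, "apples");

    // Each operator<< below is a separate, statically typed call that
    // the compiler resolves at compile time; no format string is parsed.
    std::cout << 42 << " apples" << '\n';
    return 0;
}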

Suitability of D for writing a Tracing JIT Compiler?

I'd like to write an interpreter and tracing JIT for a programming language I'm designing. I already have many years of experience programming in C++, but I've been wondering if perhaps newer alternatives might be better. One of the things I found most frustrating, back in my C++ days, was having to use header files to deal with the clunky one-pass compiler model. The problem is that not all languages are equally suited for this purpose. For my tracing JIT, I need to be able to write executable code into memory and have the interpreter call to that code. I will also need the generated code to be able to call back into host functions.
I started looking at Go and saw that the language had pointers but no pointer arithmetic. This immediately struck me as a huge issue. I may well want to write my own allocator and garbage collector. I will need to closely control the way my language objects are laid out in memory and be able to get the address of specific fields and write to them. Unless there are ways to deal with this, Go seems to fail to be low-level enough for my purposes.
The D language seems promising. It has pointer arithmetic and a clear outline of the ABI needed to call in and out of D. I've heard lots of good things about it. It also has garbage collection which is nice for compiler writing, but I still have a few things I'm not sure about:
Does D have standard libs that will allow me to mark chunks of memory as executable?
If I allocate a big chunk of memory that I want to manage myself, with my own GC, and have a bunch of pointers going into there, will this pose problems with D's garbage collector?
How well does D interoperate with C code, in your experience? Is loading C dynamic libraries and calling into them fairly easy?
Finally, there's the whole support aspect. For those who have used D on linux here, how good is the toolchain? Any issues? Has anyone written a JIT compiler in D, and if so, how was the experience?
I believe so; see core.memory.GC, if I remember right.
No, it shouldn't. Just call malloc or whatever you need, and make sure the GC doesn't see it.
Yes, it's pretty easy to interoperate with C code.
Caveat: you probably don't want to rely on the GC either, since it's not 'precise' (i.e. it can and does leak memory if you're unlucky). But for small blocks of data it's usually fine.
Go does allow pointer arithmetic, but you must import the unsafe package to do so (or use a C function). Pointer arithmetic is a common source of bugs, and Go has other mechanisms, like slices, which provide safe ways to do some of the same things that require pointer arithmetic in C. With unsafe you can cast any pointer to a uintptr and back, and uintptr is an ordinary numeric type, which allows you to do arithmetic on it.
I started looking at Go and saw that the language had pointers but no pointer arithmetic. This immediately struck me as a huge issue.
You, obviously, haven't tried the language. It works pretty well without any "pointer arithmetic". If you really need to bend the rules, there is always the unsafe package, which will allow you to do anything.
I may well want to write my own allocator and garbage collector. I will need to closely control the way my language objects are laid out in memory and be able to get the address of specific fields and write to them.
I haven't written an allocator or garbage collector myself, but you can take the address of a field of a structure. All Go data structures are simple and easy to control and reason about. See http://research.swtch.com/godata for a short introduction. Also, size and alignment guarantees are part of the language: http://golang.org/ref/spec#Size_and_alignment_guarantees. If nothing else, you could always jump into C or assembly.
IMHO, you should try to implement some small task to see if Go fits your requirements. Feel free to ask questions at http://groups.google.com/group/golang-nuts.
Alex
There is already a JIT compiler, a very serious one, written in D. I highly recommend taking a look at http://lycus.org/ , more specifically the pages about the MCI project: http://github.com/lycus/mci . The MCI documentation will give you some more information. As you will see, MCI is more than just a JIT: it has its own IR (better than anything else I have seen), optimizer, verifier, etc.

Char type on 32 bit vs 64 bit

Here is the issue:
I am developing on a 32-bit machine and want my code to be portable to a 64-bit machine. Here is the scenario:
My functions internally use a lot of std::strings. Now, if I want to provide an API, can I ask callers to send a char *, which I then use internally? Or should I ask them to send me an __int64, which I convert to a string?
Another reason to use char * in my API was that, at least in one Unix implementation (a different version of the tool), it picks up data via argv, which is a char *.
In the Windows version I am not sure what to do. I could just ask for an __int64 and then convert it into a string and make it work that way, or just use char * as well.
If you're providing a C++ implementation, then your public interface should just use std::string.
If, however, for compatibility reasons (which it sounds like you may have) you need to provide a C-style interface, then using char* is precisely the way to do it. In your 32-bit library it will be a 32-bit pointer, and in the 64-bit version of the library it will be 64 bits. This will agree with client users' expectations regarding the API. You should absolutely convert to a std::string inside your library at the earliest possible point, however.
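A minimal sketch of such a boundary (the function name and parameters are made up):

#include <cstddef>
#include <string>

// C-style entry point: the pointer crosses the API boundary, and a
// std::string is built from it at the earliest possible point.
extern "C" void mylib_process(const char* data, std::size_t len) {
    std::string s(data, len);  // copy into the C++ world immediately
    // ... all internal work uses s, never the raw pointer ...
}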
You seem somewhat confused. If the code you are writing is used only within the target machine, recompiling will take care of most of the problems. Just don't rely on a specific memory layout and you are fine. Using strings (as opposed to wstrings) probably means that the character encoding is UTF-8 (if not, reconsider), and thus a limited form of data exchange (e.g. files) between platforms is also fine.
In this case, your interface decision comes down to selecting between (const) std::string(&) and (const) char* plus an integer size (don't rely on the null terminator, please). The deciding factor is whether you anticipate the need to support other compilers or programming languages.
Now, if you intend to make the interface callable from other machines (i.e. a network interface), you have a much tougher job. In that case, specify the size of everything explicitly.
char is always one byte in size, on both 32-bit and 64-bit systems. And using the standard library is not the worst choice. ;) It should cope with different platforms, as it is platform independent for the most part...
Converting to/from char* doesn't really help if you can't represent the number on your architecture. If you are converting a 64-bit integer from its decimal (or hexadecimal) textual representation into a value, you still need 64 bits to store it.
You would do well to convert to a string at the earliest opportunity; it is the recommended/standard approach in C++, and will help do away with all your char* problems.
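For the opposite direction, a minimal sketch using std::stoll (C++11) to parse the text into a fixed 64-bit type, so the result does not depend on the platform's long width (the function name is made up):

#include <cstdint>
#include <string>

std::int64_t parse_value(const std::string& text) {
    // std::stoll returns long long (at least 64 bits) and throws
    // std::out_of_range if the text doesn't fit.
    return static_cast<std::int64_t>(std::stoll(text));
}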
There are a few scenarios you can follow to write portable code; see this question:
How to do portable 64 bit arithmetic, without compiler warnings
You would have problems achieving binary portability between different architectures; C++ provides for source-level portability.