Correct way to serialize binary data in C++

After reading two related Q&As, and after using the technique discussed below for many years on x86 architectures with GCC and MSVC without seeing any problems, I'm now very confused about what is supposed to be the correct, and just as importantly the most efficient, way to serialize and then deserialize binary data in C++.
Given the following "wrong" code:
#include <fstream>

int main()
{
   std::ifstream strm("file.bin", std::ios::binary);
   char buffer[sizeof(int)] = {0};
   strm.read(buffer, sizeof(int));
   int i = 0;
   // Experts seem to think doing the following is bad and
   // could crash entirely when run on ARM processors:
   i = *reinterpret_cast<int*>(buffer);
   return 0;
}
Now as I understand things, the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question - with the only overhead being the extra reads and shifts when the CPU detects the address it is trying to execute alignment oriented instructions is actually not aligned.
That said the answers provided above seem to indicate as far as C++ is concerned that this is all undefined behavior.
Assuming that the alignment of the location in buffer from which cast will occur is not conforming, then is it true that the only solution to this problem is to copy the bytes 1 by 1? Is there perhaps a more efficient technique?
Furthermore I've seen over the years many situations where a struct made up entirely of pods (using compiler specific pragmas to remove padding) is cast to a char* and subsequently written to a file or socket, then later on read back into a buffer and the buffer cast back to a pointer of the original struct, (ignoring potential endian and float/double format issues between machines), is this kind of code also considered undefined behaviour?
The following is a more complex example:
int main()
{
   std::ifstream strm("file.bin", std::ios::binary);
   char storage[1000] = {0};
   char* buffer = storage;   // an array name cannot be advanced, so walk a pointer
   const std::size_t size = sizeof(int) + sizeof(short) + sizeof(float) + sizeof(double);
   const std::size_t weird_offset = 3;
   buffer += weird_offset;
   strm.read(buffer, size);
   int i = 0;
   short s = 0;
   float f = 0.0f;
   double d = 0.0;
   // Experts seem to think doing the following is bad and
   // could crash entirely when run on ARM processors:
   i = *reinterpret_cast<int*>(buffer);
   buffer += sizeof(int);
   s = *reinterpret_cast<short*>(buffer);
   buffer += sizeof(short);
   f = *reinterpret_cast<float*>(buffer);
   buffer += sizeof(float);
   d = *reinterpret_cast<double*>(buffer);
   buffer += sizeof(double);
   return 0;
}

First, you can correctly, portably, and efficiently solve the alignment problem using, e.g., std::aligned_storage<sizeof(int), std::alignment_of<int>::value>::type instead of char[sizeof(int)] (or, if you don't have C++11, there may be similar compiler-specific functionality).
Even if you're dealing with a complex POD, aligned_storage and alignment_of will give you a buffer that you can memcpy the POD into and out of, construct it into, etc.
In some more complex cases, you need to write more complex code, potentially using compile-time arithmetic and template-based static switches and so on, but so far as I know, nobody came up with a case during the C++11 deliberations that wasn't possible to handle with the new features.
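To make the aligned-buffer idea concrete, here is a minimal sketch. It uses C++11 alignas on a byte array rather than std::aligned_storage (both give you raw storage with the right alignment; std::aligned_storage was deprecated in C++23). The Sample type and both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical POD used only for this illustration.
struct Sample
{
    int id;
    double weight;
};

static_assert(std::is_trivially_copyable<Sample>::value,
              "memcpy is only valid for trivially copyable types");

// Copies a Sample into suitably aligned raw storage and back out again.
inline Sample copy_through_aligned_storage(const Sample& in)
{
    alignas(Sample) unsigned char storage[sizeof(Sample)];
    assert(reinterpret_cast<std::uintptr_t>(storage) % alignof(Sample) == 0);

    std::memcpy(storage, &in, sizeof(Sample));   // POD into the buffer
    Sample out;
    std::memcpy(&out, storage, sizeof(Sample));  // POD out of the buffer
    return out;
}

// Constructs a Sample directly inside aligned raw storage with placement new.
inline double construct_in_aligned_storage(int id, double weight)
{
    alignas(Sample) unsigned char storage[sizeof(Sample)];
    Sample* s = new (storage) Sample{id, weight};
    const double product = s->id * s->weight;
    // Sample is trivially destructible, so no explicit destructor call is needed.
    return product;
}
```

The point of alignas here is that the buffer is guaranteed to be a valid home for a Sample, which a plain char array is not.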
However, just using reinterpret_cast on a random char-aligned buffer is not enough. Let's look at why:
the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer
Yes, but you're also indicating that it can assume that the buffer is aligned properly for an integer. If you're lying about that, it's free to generate broken code.
and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question
Yes, it's free to issue instructions that either require those alignments, or that assume they're already taken care of.
with the only overhead being the extra reads and shifts when the CPU detects the address it is trying to execute alignment oriented instructions is actually not aligned.
Yes, it may issue instructions with the extra reads and shifts. But it may also issue instructions that don't do them, because you've told it that it doesn't have to. So, it could issue a "read aligned word" instruction which raises an interrupt when used on non-aligned addresses.
Some processors don't have a "read aligned word" instruction, and just "read word" faster with alignment than without. Others can be configured to suppress the trap and instead fall back to a slower "read word". But others—like ARM—will just fail.
Assuming that the alignment of the location in buffer from which cast will occur is not conforming, then is it true that the only solution to this problem is to copy the bytes 1 by 1? Is there perhaps a more efficient technique?
You don't need to copy the bytes 1 by 1. You could, for example, memcpy each variable one by one into properly-aligned storage. (That would only be copying bytes 1 by 1 if all of your variables were 1-byte long, in which case you wouldn't be worried about alignment in the first place…)
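As a rough sketch of that memcpy approach, assuming the four fields sit back to back in the buffer as in the question's second example (read_value, Record and decode_record are names invented here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copies sizeof(T) bytes from `src` (any alignment)
// into a properly aligned T and advances `src` past them.
template <typename T>
T read_value(const char*& src)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    T value;
    std::memcpy(&value, src, sizeof(T));
    src += sizeof(T);
    return value;
}

// Hypothetical record matching the variables in the question.
struct Record
{
    int i;
    short s;
    float f;
    double d;
};

// Decodes the four fields laid out back to back, starting at any offset.
inline Record decode_record(const char* src, std::size_t available)
{
    assert(available >= sizeof(int) + sizeof(short) + sizeof(float) + sizeof(double));
    Record r;
    r.i = read_value<int>(src);
    r.s = read_value<short>(src);
    r.f = read_value<float>(src);
    r.d = read_value<double>(src);
    return r;
}
```

Optimizing compilers typically turn a fixed-size memcpy like this into a plain load where the target allows unaligned access, so the portable version usually costs nothing extra on x86.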
As for casting a POD to char* and back using compiler-specific pragmas… well, any code that relies on compiler-specific pragmas for correctness (rather than for, say, efficiency) is obviously not correct, portable C++. Sometimes "correct with g++ 3.4 or later on any 64-bit little-endian platform with IEEE 64-bit doubles" is good enough for your use cases, but that's not the same thing as actually being valid C++. And you certainly can't expect it to work with, say, Sun cc on a 32-bit big-endian platform with 80-bit doubles and then complain that it doesn't.
For the example you added later:
// Experts seem to think doing the following is bad and
// could crash entirely when run on ARM processors:
buffer += weird_offset;
i = *reinterpret_cast<int*>(buffer);
buffer += sizeof(int);
Experts are right. Here's a simple example of the same thing:
int i[2];
char *c = reinterpret_cast<char *>(i) + 1;
int *j = reinterpret_cast<int *>(c);
int k = *j;
The variable i will be aligned at some address divisible by 4, say, 0x01000000. So, j will be at 0x01000001. So the line int k = *j will issue an instruction to read a 4-byte-aligned 4-byte value from 0x01000001. On, say, PPC64, that will just take about 8x as long as int k = *i, but on, say, ARM, it will crash.
So, if you have this:
int i = 0;
short s = 0;
float f = 0.0f;
double d = 0.0;
And you want to write it to a stream, how do you do it?
How do you read back from a stream?
Presumably whatever kind of stream you're using (whether ifstream, FILE*, whatever) has a buffer in it, so readFromStream(&f) is going to check whether there are sizeof(float) bytes available, read the next buffer if not, then copy the first sizeof(float) bytes from the buffer to the address of f. (In fact, it may even be smarter—it's allowed to, e.g., check whether you're just near the end of the buffer, and if so issue an asynchronous read-ahead, if the library implementer thought that would be a good idea.) The standard doesn't say how it has to do the copy. Standard libraries don't have to run anywhere but on the implementation they're part of, so your platform's ifstream could use memcpy, or *(float*), or a compiler intrinsic, or inline assembly—and it will probably use whatever's fastest on your platform.
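A minimal sketch of what such helpers could look like on top of the standard stream read and write members. The name readFromStream follows the text above; writeToStream, the signatures and round_trip_example are assumptions made for this illustration, and the in-memory stream stands in for a file opened with std::ios::binary.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

// Hypothetical helper: writes the object representation of `value`.
template <typename T>
void writeToStream(std::ostream& out, const T& value)
{
    static_assert(std::is_trivially_copyable<T>::value, "raw bytes only");
    out.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Hypothetical helper: reads sizeof(T) bytes straight into *value.
// Returns false if the stream ran out of data.
template <typename T>
bool readFromStream(std::istream& in, T* value)
{
    static_assert(std::is_trivially_copyable<T>::value, "raw bytes only");
    assert(value != nullptr);
    in.read(reinterpret_cast<char*>(value), sizeof(T));
    return in.gcount() == static_cast<std::streamsize>(sizeof(T));
}

// Usage sketch with an in-memory stream.
inline bool round_trip_example()
{
    std::stringstream stream(std::ios::in | std::ios::out | std::ios::binary);
    const int i = 42;
    const short s = -7;
    const float f = 1.5f;
    const double d = 2.25;
    writeToStream(stream, i);
    writeToStream(stream, s);
    writeToStream(stream, f);
    writeToStream(stream, d);

    int i2 = 0;
    short s2 = 0;
    float f2 = 0.0f;
    double d2 = 0.0;
    const bool ok = readFromStream(stream, &i2) && readFromStream(stream, &s2) &&
                    readFromStream(stream, &f2) && readFromStream(stream, &d2);
    return ok && i2 == i && s2 == s && f2 == f && d2 == d;
}
```

Note that the destination is always a real, properly aligned variable; the stream does the copying, so no unaligned pointer is ever dereferenced.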
So, how exactly would unaligned access help you optimize this or simplify it?
In nearly every case, picking the right kind of stream, and using its read and write methods, is the most efficient way of reading and writing. And, if you've picked a stream out of the standard library, it's guaranteed to be correct, too. So, you've got the best of both worlds.
If there's something peculiar about your application that makes something different more efficient—or if you're the guy writing the standard library—then of course you should go ahead and do that. As long as you (and any potential users of your code) are aware of where you're violating the standard and why (and you actually are optimizing things, rather than just doing something because it "seems like it should be faster"), this is perfectly reasonable.
You seem to think that it would help to be able to put them into some kind of "packed struct" and just write that, but the C++ standard does not have any such thing as a "packed struct". Some implementations have non-standard features that you can use for that. For example, both MSVC and gcc will let you pack the above into 18 bytes on i386, and you can take that packed struct and memcpy it, reinterpret_cast it to char * to send over the network, whatever. But it won't be compatible with the exact same code compiled by a different compiler that doesn't understand your compiler's special pragmas. It won't even be compatible with a related compiler, like gcc for ARM, which will pack the same thing into 20 bytes. When you use non-portable extensions to the standard, the result is not portable.


What is the difference between memcpy and an assignment statement for a single WORD of data?

While reading the source code of RocksDB's skiplist, I have found the following code:
int UnstashHeight() const {
  int rv;
  memcpy(&rv, &next_[0], sizeof(int));
  return rv;
}
Why does it use memcpy? What if it used a pointer cast like this:
int UnstashHeight() const {
  int rv;
  rv = *((int*)&next_[0]);
  return rv;
}
Does memcpy have better portability across different CPU targets?
Or is there no difference at all?
I would say:
memcpy is a function; however you look at it, it is not as simple as the single machine instruction that an assignment should resolve to (although optimizing compilers commonly inline a small fixed-size memcpy)
the assignment could be optimized by the compiler in some contexts, so that multiple variables declared in your code simply share the same value in memory (obviously only if that makes sense)
as it is typically mapped onto a single machine instruction, the assignment has the constraints of the platform it runs on. As stated in a comment, ARM processors require data to be aligned to 2, 4 or 8 bytes depending on the data being handled (4 bytes / 32 bits in this case). If the constraint is not satisfied, an interrupt is raised.
memcpy works great on data coming from the network or from codecs (big-endian and little-endian issues aside), where structures try to use as few bytes and bits as possible. Your WORD may therefore not be aligned to a 32-bit boundary, but memcpy will take care of this.
the assignment target is an instance of an int, so the assignment does not require a check on the destination: it's allocated and valid by definition (the compiler guarantees that). The memcpy destination is a pointer; if you use that function widely, you may need to start checking that the destination pointer is not null throughout your code.
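To illustrate the points above, here is a simplified sketch of the stash/unstash idea. This is not RocksDB's actual code: the Node layout, the void* slot type and StashHeight are assumptions made for the example. Besides alignment, the cast version also reads a pointer object through an int lvalue, which the aliasing rules do not allow; memcpy sidesteps both issues.

```cpp
#include <cassert>
#include <cstring>

// Simplified, hypothetical stand-in for a skip-list node: the storage for the
// first link doubles as scratch space for an int before the node is linked.
struct Node
{
    void* next_[1];

    void StashHeight(int height)
    {
        static_assert(sizeof(int) <= sizeof(void*),
                      "the int must fit inside one pointer slot");
        assert(height >= 0);
        std::memcpy(&next_[0], &height, sizeof(int));
    }

    int UnstashHeight() const
    {
        int rv;
        std::memcpy(&rv, &next_[0], sizeof(int));
        return rv;
    }
};
```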
That said, maybe I'm a little off your target, but there is not enough code to judge the form used for the assignment.
My 2 cents

How do I organize members in a struct to waste the least space on alignment?

[Not a duplicate of Structure padding and packing. That question is about how and when padding occurs. This one is about how to deal with it.]
I have just realized how much memory is wasted as a result of alignment in C++. Consider the following simple example:
#include <iostream>
using namespace std;

struct X
{
    int a;
    double b;
    int c;
};

int main()
{
    cout << "sizeof(int) = " << sizeof(int) << '\n';
    cout << "sizeof(double) = " << sizeof(double) << '\n';
    cout << "2 * sizeof(int) + sizeof(double) = " << 2 * sizeof(int) + sizeof(double) << '\n';
    cout << "but sizeof(X) = " << sizeof(X) << '\n';
}
When using g++ the program gives the following output:
sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 24
That's 50% memory overhead! In a 3-gigabyte array of 134'217'728 Xs 1 gigabyte would be pure padding.
Fortunately, the solution to the problem is very simple - we simply have to swap double b and int c around:
struct X
{
    int a;
    int c;
    double b;
};
Now the result is much more satisfying:
sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 16
There is however a problem: this isn't cross-compatible. Yes, under g++ an int is 4 bytes and a double is 8 bytes, but that's not necessarily always true (their alignment doesn't have to be the same either), so under a different environment this "fix" could not only be useless, but it could also potentially make things worse by increasing the amount of padding needed.
Is there a reliable cross-platform way to solve this problem (minimize the amount of needed padding without suffering from decreased performance caused by misalignment)? Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?
Due to misunderstanding and confusion, I'd like to emphasize that I don't want to "pack" my struct. That is, I don't want its members to be unaligned and thus slower to access. Instead, I still want all members to be self-aligned, but in a way that uses the least memory on padding. This could be solved by using, for example, manual rearrangement as described here and in The Lost Art of Packing by Eric Raymond. I am looking for an automated way to do this that is as cross-platform as possible, similar to what is described in proposal P1112 for the upcoming C++20 standard.
(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128). This applies to atomic and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)
Rule of thumb: largest to smallest alignof(). There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.
Most types have alignof(T) = sizeof(T), or alignof(T) capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.
Struct-packing rules in most ABIs give struct members their absolute alignof(T) alignment relative to the start of the struct, and the struct itself inherits the largest alignof() of any of its members.
Put always-64-bit members first (like double, long long, and int64_t). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary.
then pointers and pointer-width integers: size_t, intptr_t, and ptrdiff_t (which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.
Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.
then long (which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide as int.
then 32-bit int32_t, int, float, enum. (Optionally separate int32_t and float ahead of int if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).
ISO C++ allows int to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bit int just waste memory (and cache footprint) if int was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.
People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.
then short / int16_t
then char / int8_t / bool
(for multiple bool flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)
(For unsigned integer types, find the corresponding signed type in my list.)
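A small sketch of the rule of thumb applied to a made-up struct (the member names and both struct names are hypothetical). The size comparison is checked at run time rather than with static_assert because ISO C++ does not guarantee it; it is only what mainstream ABIs give you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Members in "as they came to mind" order.
struct Unsorted
{
    char tag;
    double value;
    short count;
    std::int64_t id;
    bool active;
    void* link;
};

// Same members, ordered from largest to smallest expected alignof().
struct Sorted
{
    double value;       // always-64-bit members first
    std::int64_t id;
    void* link;         // then pointers
    short count;        // then 16-bit
    char tag;           // then 8-bit
    bool active;
};

// Bytes saved per object by the reordering on the current platform.
inline std::size_t bytes_saved_by_sorting()
{
    // Expected on mainstream ABIs, not guaranteed by the standard.
    assert(sizeof(Sorted) <= sizeof(Unsorted));
    return sizeof(Unsorted) - sizeof(Sorted);
}
```

Printing sizeof for both structs on your own targets is the quickest way to see what the reordering actually buys you there.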
A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i + char buf[4] will fill an 8-byte aligned slot between two doubles. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.
Exotic types: x86-64 System V has alignof(long double) = 16, but i386 System V has only alignof(long double) = 4, sizeof(long double) = 12. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.
And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x).
Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T) is still only 4 for double and int64_t.
This is to optimize for the common case of separate allocation of 8-byte aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has the same alignof(T) = 4 for most primitive types (but malloc still gives you 8-byte aligned memory because alignof(maxalign_t) = 8). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.
Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.
Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.
Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.
He also makes another important point:
9. Readability and cache locality
While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.
In a large struct that can easily be split across a cache-line boundary, it makes sense to put two things nearby if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaligned) integer or SIMD load/store instead of separately loading smaller members.
Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).
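Tying the cache-line size back to the earlier false-sharing note: a minimal sketch of giving each thread's counter its own line with alignas. The 64-byte figure is an assumption that matches modern x86 (use 128 if you also want to separate the adjacent-line pairs mentioned above); the type and function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on modern x86.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kThreads = 4;

// Each counter is aligned (and therefore padded) to a full cache line, so
// writes by one thread do not invalidate the line holding another's counter.
struct alignas(kCacheLine) PaddedCounter
{
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "sizeof is rounded up to the alignment");

struct PerThreadCounters
{
    PaddedCounter slots[kThreads];
};

inline PaddedCounter& counter_for(PerThreadCounters& counters, std::size_t thread_index)
{
    assert(thread_index < kThreads);
    return counters.slots[thread_index];
}
```

This deliberately spends padding to buy separation, the opposite trade from the rest of this answer.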
Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.
(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)
gcc has the -Wpadded warning that warns when padding is added to a structure:
<source>:4:12: warning: padding struct to align 'X::b' [-Wpadded]
4 | double b;
| ^
<source>:1:8: warning: padding struct size to alignment boundary [-Wpadded]
1 | struct X
| ^
And you can manually rearrange members so that there is less or no padding. But this is not a cross-platform solution, as different types can have different sizes and alignments on different systems (most notably, pointers are 4 or 8 bytes on different architectures). The general rule of thumb is to go from largest to smallest alignment when declaring members, and if you're still worried, compile your code with -Wpadded once (but I wouldn't keep it on generally, because padding is sometimes necessary).
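If you'd rather have a check in the code than a warning flag, padding can also be computed by hand as the struct size minus the sum of the member sizes. A minimal sketch, assuming you list the member types yourself (total_size, padding_bytes and check_reordering are names invented here):

```cpp
#include <cassert>
#include <cstddef>

// Sum of sizeof over a list of types (C++11-compatible recursion).
template <typename... Ts>
struct total_size;

template <>
struct total_size<>
{
    static constexpr std::size_t value = 0;
};

template <typename T, typename... Rest>
struct total_size<T, Rest...>
{
    static constexpr std::size_t value = sizeof(T) + total_size<Rest...>::value;
};

// Padding in S, given the types of all of its members.
template <typename S, typename... Members>
constexpr std::size_t padding_bytes()
{
    return sizeof(S) - total_size<Members...>::value;
}

struct X          { int a; double b; int c; };
struct XReordered { int a; int c; double b; };

// Usage sketch: the reordered struct from the question should need no more
// padding than the original (true on mainstream ABIs, not guaranteed).
inline void check_reordering()
{
    assert((padding_bytes<XReordered, int, int, double>()
            <= padding_bytes<X, int, double, int>()));
}
```

The member list has to be kept in sync with the struct by hand, which is the price of not relying on a compiler-specific warning.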
The reason the compiler can't do it automatically is the standard ([class.mem]/19). It guarantees that, because this is a simple struct with only public members, &x.a < &x.c (for some X x;), so they can't be rearranged.
There really isn't a portable solution in the generic case. Barring the minimal requirements the standard imposes, types can be any size the implementation wants to make them.
To go along with that, the compiler is not allowed to reorder class members to make the layout more efficient. The standard mandates that the objects must be laid out in their declared order (by access modifier), so that's out as well.
You can use fixed width types like
struct foo
{
    int64_t a;
    int16_t b;
    int8_t c;
    int8_t d;
};
and this will be the same on all platforms, provided they supply those types, but it only works with integer types. There are no fixed-width floating point types and many standard objects/containers can be different sizes on different platforms.
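One caveat worth pinning down in code: fixed-width types fix the member sizes, but the total size can still differ through tail padding. For example, i386 System V aligns int64_t to 4 bytes, as mentioned elsewhere in this thread, so sizeof(foo) would be 12 there versus 16 on x86-64. A sketch that turns the layout assumptions into compile-time checks (tail_padding_of_foo is a hypothetical helper):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct foo
{
    std::int64_t a;
    std::int16_t b;
    std::int8_t c;
    std::int8_t d;
};

// Member offsets we rely on; compilation fails if a platform disagrees.
static_assert(offsetof(foo, a) == 0, "a must start the struct");
static_assert(offsetof(foo, b) == 8, "b must directly follow a");
static_assert(offsetof(foo, c) == 10, "c must directly follow b");
static_assert(offsetof(foo, d) == 11, "d must directly follow c");

// Bytes added after the last member to round sizeof(foo) up to alignof(foo).
inline std::size_t tail_padding_of_foo()
{
    const std::size_t used = offsetof(foo, d) + sizeof(std::int8_t);
    assert(sizeof(foo) >= used);
    return sizeof(foo) - used;
}
```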
Mate, if you have 3 GB of data, you should probably approach the issue some other way than by swapping data members.
Instead of using an 'array of structs', a 'struct of arrays' could be used.
So say
struct X
{
    int a;
    double b;
    int c;
};
constexpr size_t ArraySize = 1'000'000;
X my_data[ArraySize];
is going to become
constexpr size_t ArraySize = 1'000'000;
struct X
{
    int a[ArraySize];
    double b[ArraySize];
    int c[ArraySize];
};
X my_data;
Each element is still easily accessible: my_data.a[i] = 5; my_data.b[i] = 1.5; and so on.
There is no padding (except possibly a few bytes between arrays). The memory layout is cache friendly. The prefetcher handles reading sequential memory blocks from a few separate memory regions.
That's not as unorthodox as it might look at first glance. The approach is widely used in SIMD and GPU programming.
See also: Array of Structures (AoS), Structure of Arrays (SoA)
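For sizes only known at run time (or too large for static arrays), the same layout can be sketched with one std::vector per member. XColumns, set_row and sum_b are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Struct-of-arrays variant of X: one contiguous column per member.
struct XColumns
{
    std::vector<int> a;
    std::vector<double> b;
    std::vector<int> c;

    explicit XColumns(std::size_t n) : a(n), b(n), c(n) {}

    std::size_t size() const { return a.size(); }
};

// Writes one logical "row" across the three columns.
inline void set_row(XColumns& data, std::size_t i, int a, double b, int c)
{
    assert(i < data.size());
    data.a[i] = a;
    data.b[i] = b;
    data.c[i] = c;
}

// Touches only the b column, so no cache space is spent on a or c.
inline double sum_b(const XColumns& data)
{
    double total = 0.0;
    for (double v : data.b)
        total += v;
    return total;
}
```

A loop like sum_b is where this layout pays off: it streams through 8 bytes per element instead of a whole padded struct.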
This is a textbook memory-vs-speed problem. The padding is to trade memory for speed. You can't say:
I don't want to "pack" my struct.
because pragma pack is the tool invented exactly to make this trade the other way: speed for memory.
Is there a reliable cross-platform way
No, there can't be any. Alignment is a strictly platform-dependent issue. The size of different types is a platform-dependent issue. Avoiding padding by reorganizing is platform-dependent squared.
Speed, memory, and cross-platform - you can have only two.
Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?
Because the C++ specifications specifically guarantee that the compiler won't mess up your meticulously organized structs. Imagine you have four floats in a row. Sometimes you use them by name, and sometimes you pass them to a method that takes a float[3] parameter.
You're proposing that the compiler should shuffle them around, potentially breaking all the code written since the 1970s. And for what reason? Can you guarantee that every programmer ever will actually want to save your 8 bytes per struct? I, for one, am sure that if I have a 3 GB array, I have bigger problems than a GB more or less.
Although the Standard grants implementations broad discretion to insert arbitrary amounts of space between structure members, that's because the authors didn't want to try to guess all the situations where padding might be useful, and the principle "don't waste space for no reason" was considered self-evident.
In practice, almost every commonplace implementation for commonplace hardware will use primitive objects whose size is a power of two, and whose required alignment is a power of two that is no larger than the size. Further, almost every such implementation will place each member of a struct at the first available multiple of its alignment that completely follows the previous member.
Some pedants will squawk that code which exploits that behavior is "non-portable". To them I would reply
C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine specific code is one of the strengths of C.
As a slight extension to that principle, the ability of code which need only run on 90% of machines to exploit features common to that 90% of machines--even though such code wouldn't exactly be "machine-specific"--is one of the strengths of C. The notion that C programmers shouldn't be expected to bend over backward to accommodate limitations of architectures which for decades have only been used in museums should be self-evident, but apparently isn't.
You can use #pragma pack(1), but the very reason padding exists is that the compiler is optimizing for speed. Accessing a variable at its natural alignment, as a full register-sized load, is faster than accessing it at an arbitrary byte offset.
Specific packing is only useful for serialization and intercompiler compatibility, etc.
As NathanOliver correctly added, this might even fail on some platforms.
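For reference, a sketch of what packing looks like in the serialization use case. #pragma pack is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++, and the header layout here is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Packed: no padding between tag and length.
#pragma pack(push, 1)
struct PackedHeader
{
    char tag;
    int length;
};
#pragma pack(pop)

// Same members with natural alignment, for comparison.
struct NaturalHeader
{
    char tag;
    int length;
};

// Copies a header out of a received byte buffer. Members are then read by
// name; forming an int* to h.length and dereferencing it would be the hazard.
inline PackedHeader parse_header(const unsigned char* bytes, std::size_t n)
{
    assert(n >= sizeof(PackedHeader));
    PackedHeader h;
    std::memcpy(&h, bytes, sizeof h);
    return h;
}
```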

Why does using reinterpret_cast to convert from char* to a structure seem to work normally?

People say it's not good to trust reinterpret_cast to convert from raw data (like char*) to a structure. For example, for the structure
struct A
{
    unsigned int a;
    unsigned int b;
    unsigned char c;
    unsigned int d;
};
sizeof(A) = 16 and __alignof(A) = 4, exactly as expected.
Suppose I do this:
char *data = new char[sizeof(A) + 1];
A *ptr = reinterpret_cast<A*>(data + 1); // +1 ensures it doesn't point to 4-byte aligned data
Then copy some data to ptr:
memcpy_s(ptr, sizeof(A),
"\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00", sizeof(A));
Then ptr->a is 1, ptr->b is 2, ptr->c is 3 and ptr->d is 4.
Okay, seems to work. Exactly what I was expecting.
But the data pointed to by ptr is not 4-byte aligned like A should be. What problems may this cause on an x86 or x64 platform? Performance issues?
For one thing, your initialization string assumes that the underlying integers are stored in little-endian format. But another architecture might use big-endian, in which case your string will produce garbage (some huge numbers). The correct string for that architecture would be "\x00\x00\x00\x01\x00\x00\x00\x02\x03\x00\x00\x00\x00\x00\x00\x04".
Then, of course, there is the issue of alignment.
Certain architectures won't even allow you to assign the address of data + 1 to a non-character pointer; they will issue a memory alignment trap.
But even architectures which allow this (like x86) will perform miserably, having to perform two memory accesses for each integer in the structure.
Finally, I am not completely sure about this, but I think that C and C++ do not even guarantee to you that an array of characters will contain characters packed in bytes. (I hope someone who knows more might clarify this.) Conceivably, there can be architectures which are completely incapable of addressing non-word-aligned data, so in such architectures each character would have to occupy an entire word. This would mean that it would be valid to take the address of data + 1, because it would still be aligned, but your initialization string would be unsuitable for the intended job, as the first 4 characters in it would cover your entire structure, producing a=1, b=0, c=0 and d=0.
The problem is that you can not be sure if this code will run on another platform, with the next version of Visual Studio, etc. When running on another processor, it may cause a hardware exception.
There was a time when you could read out arbitrary memory locations, but all those programs crash with an "access violation" exception nowadays. Something similar could happen to this program in the future.
However, what you can do, and what any compiler that calls itself "C++ standard compliant" must compile correctly, is this:
You can reinterpret_cast a pointer to something else, and then back to the original type. The value of the type, when read before and after, must stay the same.
I don't know what exactly you want to do, but you might get away with, for example
allocating a struct A
reinterpret_casting it to chars
saving the memory content to a file
and restore everything later:
allocate a struct A
reinterpret_cast it to chars
load the content to memory
reinterpret_cast it back to a struct A
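A sketch of those steps, with a std::vector<char> standing in for the file. The key difference from the question's code is that the cast always starts from a real A object, which is correctly aligned by construction. to_bytes and from_bytes are names invented here.

```cpp
#include <algorithm>
#include <cassert>
#include <type_traits>
#include <vector>

struct A
{
    unsigned int a;
    unsigned int b;
    unsigned char c;
    unsigned int d;
};

static_assert(std::is_trivially_copyable<A>::value,
              "byte-wise copying requires a trivially copyable type");

// View an existing A as chars and save its memory content.
inline std::vector<char> to_bytes(const A& value)
{
    const char* first = reinterpret_cast<const char*>(&value);
    return std::vector<char>(first, first + sizeof(A));
}

// Allocate an A, view it as chars, and load the saved content into it.
inline A from_bytes(const std::vector<char>& bytes)
{
    assert(bytes.size() == sizeof(A));
    A value{};
    std::copy(bytes.begin(), bytes.end(), reinterpret_cast<char*>(&value));
    return value;
}
```

The saved bytes are only meaningful to a program built with the same layout and byte order, as the other answer points out.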

How to write convertible code, 32 bit/64 bit?

A C++-specific question. I read a question about what makes a program 32-bit or 64-bit, and the answer it got was something like this (sorry, I can't find the question; I looked at it some days ago and can't find it again :( ): As long as you don't make any "pointer assumptions", you only need to recompile it. So my question is: what are pointer assumptions? To my understanding there are 32-bit pointers and 64-bit pointers, so I figure it has something to do with that. Please show the difference in code between them. Any other good habits to keep in mind while writing code that make it easy to convert between the two are also welcome :) though please share examples with them.
Ps. I know there is this post:
How do you write code that is both 32 bit and 64 bit compatible?
but I thought it was kind of too general, with no good examples for new programmers like myself. Like what a 32-bit storage unit is, etc. Kind of hoping to break it down a bit more (no pun intended ^^). DS.
In general it means that your program behavior should never depend on the sizeof() of any types (that are not made to be of some exact size), neither explicitly nor implicitly (this includes possible struct alignments as well).
Pointers are just a subset of them, and it probably also means that you should not try to rely on being able to convert between unrelated pointer types and/or integers, unless they are specifically made for this (e.g. intptr_t).
In the same way you need to take care of things written to disk, where you should also never rely on the size of e.g. built in types, being the same everywhere.
Whenever you have to (because of e.g. external data formats) use explicitly sized types like uint32_t.
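As an illustration of an explicitly sized type in an external format, here is a sketch that stores a uint32_t as exactly four bytes, least significant first, whatever the host's word size or byte order. The little-endian choice and the function names are assumptions for the example; the format you have to match may specify otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Writes `value` as 4 bytes, least significant byte first.
inline void store_u32_le(unsigned char* out, std::uint32_t value)
{
    assert(out != nullptr);
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFFu);
}

// Reads 4 bytes written by store_u32_le back into a uint32_t.
inline std::uint32_t load_u32_le(const unsigned char* in)
{
    assert(in != nullptr);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(in[i]) << (8 * i);
    return value;
}
```

Because the bytes are produced with shifts rather than by copying the object's memory, a 32-bit and a 64-bit build (or a big-endian one) write identical files.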
For a well-formed program (that is, a program written according to syntax and semantic rules of C++ with no undefined behaviour), the C++ standard guarantees that your program will have one of a set of observable behaviours. The observable behaviours vary due to unspecified behaviour (including implementation-defined behaviour) within your program. If you avoid unspecified behaviour or resolve it, your program will be guaranteed to have a specific and certain output. If you write your program in this way, you will witness no differences between your program on a 32-bit or 64-bit machine.
A simple (forced) example of a program that will have different possible outputs is as follows:
#include <iostream>
int main()
{
  std::cout << sizeof(void*) << std::endl;
  return 0;
}
This program will likely have different output on 32- and 64-bit machines (but not necessarily). The result of sizeof(void*) is implementation-defined. However, it is certainly possible to have a program that contains implementation-defined behaviour but is resolved to be well-defined:
#include <iostream>
int main()
{
  int size = sizeof(void*);
  if (size != 4) {
    size = 4;
  }
  std::cout << size << std::endl;
  return 0;
}
This program will always print out 4, despite the fact it uses implementation-defined behaviour. This is a silly example because we could have just done int size = 4;, but there are cases when this does appear in writing platform-independent code.
So the rule for writing portable code is: aim to avoid or resolve unspecified behaviour.
Here are some tips for avoiding unspecified behaviour:
Do not assume anything about the size of the fundamental types beyond that which the C++ standard specifies. That is, a char is at least 8 bits, both short and int are at least 16 bits, and so on.
Don't try to do pointer magic (casting between pointer types or storing pointers in integral types).
Don't use an unsigned char* to read the value representation of a non-char object (for serialisation or related tasks).
Avoid reinterpret_cast.
Be careful when performing operations that may over or underflow. Think carefully when doing bit-shift operations.
Be careful when doing arithmetic on pointer types.
Don't use void*.
There are many more occurrences of unspecified or undefined behaviour in the standard. It's well worth looking them up. There are some great articles online that cover some of the more common differences that you'll experience between 32- and 64-bit platforms.
"Pointer assumptions" is when you write code that relies on pointers fitting in other data types, e.g. int copy_of_pointer = ptr; - if int is a 32-bit type, then this code will break on 64-bit machines, because only part of the pointer will be stored.
So long as pointers are only stored in pointer types, it should be no problem at all.
Typically, pointers are the size of the "machine word", so on a 32-bit architecture pointers are 32 bits, and on a 64-bit architecture all pointers are 64 bits. However, there are SOME architectures where this is not true. I have never worked on such machines myself [other than x86 with its "far" and "near" pointers - but let's ignore that for now].
Most compilers will tell you when you convert pointers to integers that the pointer doesn't fit into, so if you enable warnings, MOST of the problems will become apparent - fix the warnings, and chances are pretty decent that your code will work straight away.
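If a pointer really does have to pass through an integer, a minimal sketch of the safer variant is to use std::uintptr_t rather than int. The helper names to_integer and to_pointer are invented for this example, and std::uintptr_t is formally an optional type, although mainstream 32- and 64-bit compilers provide it.

```cpp
#include <cstdint>

// Fails at compile time if the integer type could not hold a pointer.
static_assert(sizeof(std::uintptr_t) >= sizeof(void*),
              "uintptr_t must be wide enough to hold a pointer");

// Hypothetical helper: pointer -> integer wide enough on this platform.
std::uintptr_t to_integer(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Hypothetical helper: integer obtained from to_integer -> original pointer.
const void* to_pointer(std::uintptr_t v) {
    return reinterpret_cast<const void*>(v);
}
```

With int in place of std::uintptr_t, the same round trip would drop the upper half of the address on a typical 64-bit target, which is exactly the breakage described above.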
There will be no difference between 32-bit code and 64-bit code; the goal of C/C++ and other high-level languages is portability, unlike assembly language.
The only difference is the distribution you compile your code on. All the work is done automatically by your compiler/linker, so you don't need to think about it.
But: if you are programming on a 64-bit distribution and you need to use an external library, for example SDL, the external library must also be compiled for 64-bit if you want your code to build.
One thing to know is that your ELF file will be bigger on a 64-bit distribution than on a 32-bit one, which is only logical.
What about pointers? When you increment or change a pointer, the compiler advances it by the size of the pointed-to type.
The size of the pointed-to type is determined by your processor's register size and the distribution you're working on.
But you don't have to care about this; the compiler will do everything for you.
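The pointer-increment behaviour mentioned above can be checked with a small sketch. The template name bytes_per_step is made up for this example; it simply measures how many bytes separate p and p + 1 for a given element type.

```cpp
#include <cstddef>

// Hypothetical helper: how many bytes does adding 1 to a T* move the address?
template <typename T>
std::size_t bytes_per_step() {
    T items[2] = {};
    T* p = items;
    T* q = p + 1;  // pointer arithmetic is scaled by sizeof(T)
    return static_cast<std::size_t>(reinterpret_cast<const char*>(q) -
                                    reinterpret_cast<const char*>(p));
}
```

On any conforming compiler bytes_per_step<T>() equals sizeof(T), whatever that happens to be on the target; the scaling is done by the compiler, not by the programmer.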
In sum: that's why you can't execute a 64-bit ELF file on a 32-bit distribution.
Typical pitfalls for 32bit/64bit porting are:
The implicit assumption by the programmer that sizeof(void*) == 4 * sizeof(char).
If you're making this assumption and e.g. allocate arrays that way ("I need 20 pointers so I allocate 80 bytes"), your code breaks on 64bit because it'll cause buffer overruns.
The "kitten-killer", int x = (int)&something; (and the reverse, void* ptr = (void*)some_int). Again an assumption of sizeof(int) == sizeof(void*). This doesn't cause overflows but loses data - the upper 32 bits of the pointer, namely.
Both of these issues are of a class called type aliasing (assuming identity / interchangeability / equivalence on a binary representation level between two types), and such assumptions are common; like on UN*X, assuming time_t, size_t, off_t being int, or on Windows, HANDLE, void* and long being interchangeable, etc...
Assumptions about data structure / stack space usage (See 5. below as well). In C/C++ code, local variables are allocated on the stack, and the space used there is different between 32bit and 64bit mode due to the point below, and due to the different rules for passing arguments (32bit x86 usually on the stack, 64bit x86 in part in registers). Code that just about gets away with the default stacksize on 32bit might cause stack overflow crashes on 64bit.
This is relatively easy to spot as a cause of the crash but depending on the configurability of the application possibly hard to fix.
Timing differences between 32bit and 64bit code (due to different code sizes / cache footprints, or different memory access characteristics / patterns, or different calling conventions ) might break "calibrations". Say, for (int i = 0; i < 1000000; ++i) sleep(0); is likely going to have different timings for 32bit and 64bit ...
Finally, the ABI (Application Binary Interface). There's usually bigger differences between 64bit and 32bit environments than the size of pointers...
Currently, two main "branches" of 64bit environments exist, IL32P64 (what Win64 uses - int and long are int32_t, only uintptr_t/void* is uint64_t, talking in terms of the sized integers from <stdint.h>) and LP64 (what UN*X uses - int is int32_t, long is int64_t and uintptr_t/void* is uint64_t), but there's the "subdivisions" of different alignment rules as well - some environments assume long, float or double align at their respective sizes, while others assume they align at multiples of four bytes. In 32bit Linux, they align all at four bytes, while in 64bit Linux, float aligns at four, long and double at eight-byte multiples.
The consequence of these rules is that in many cases, both sizeof(struct { ...}) and the offset of structure/class members are different between 32bit and 64bit environments even if the data type declaration is completely identical.
Beyond impacting array/vector allocations, these issues also affect data in/output e.g. through files - if a 32bit app writes e.g. struct { char a; int b; char c; long d; double e; } to a file that the same app recompiled for 64bit reads in, the result will not be quite what's hoped for.
The examples just given are only about language primitives (char, int, long etc.) but of course affect all sorts of platform-dependent / runtime library data types, whether size_t, off_t, time_t, HANDLE, essentially any nontrivial struct/union/class ... - so the space for error here is large.
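To see the effect on a particular platform, a small sketch like the following prints the size and member offsets of the example struct. The names Record and print_record_layout are invented for illustration. The numbers depend on the compiler and ABI, so no expected output is listed.

```cpp
#include <cstddef>
#include <cstdio>

// Same members as the struct in the example above.
struct Record {
    char a;
    int b;
    char c;
    long d;
    double e;
};

// Hypothetical helper: print the layout chosen by the current compiler/ABI.
void print_record_layout() {
    std::printf("sizeof(Record) = %zu\n", sizeof(Record));
    std::printf("offset of a    = %zu\n", offsetof(Record, a));
    std::printf("offset of b    = %zu\n", offsetof(Record, b));
    std::printf("offset of c    = %zu\n", offsetof(Record, c));
    std::printf("offset of d    = %zu\n", offsetof(Record, d));
    std::printf("offset of e    = %zu\n", offsetof(Record, e));
}
```

Building this once as a 32bit and once as a 64bit binary and comparing the output shows why writing the raw struct to a file is fragile; writing each member separately with a fixed size sidesteps the padding and the varying size of long.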
And then there's the lower-level differences, which come into play e.g. for hand-optimized assembly (SSE/SSE2/...); 32bit and 64bit have different (numbers of) registers, different argument passing rules; all of this strongly affects how such optimizations perform and it's very likely that e.g. SSE2 code which gives best performance in 32bit mode will need to be rewritten / enhanced to give best performance in 64bit mode.
There's also code design constraints which are very different for 32bit and 64bit, particularly around memory allocation / management; an application that's been carefully coded to "maximize the hell out of the mem it can get in 32bit" will have complex logic on how / when to allocate/free memory, memory-mapped file usage, internal caching, etc - much of which will be detrimental in 64bit where you could "simply" take advantage of the huge available address space. Such an app might recompile for 64bit just fine, but perform worse there than some "ancient simple deprecated version" which didn't have all the maximize-32bit peephole optimizations.
So, ultimately, it's also about enhancements / gains, and that's where more work, partly in programming, partly in design/requirements comes in. Even if your app cleanly recompiles both on 32bit and 64bit environments and is verified on both, is it actually benefitting from 64bit ? Are there changes that can/should be done to the code logic to make it do more / run faster in 64bit ? Can you do those changes without breaking 32bit backward compatibility ? Without negative impacts on the 32bit target ? Where will the enhancements be, and how much can you gain ?
For a large commercial project, answers to these questions are often important markers on the roadmap because your starting point is some existing "money maker"...

Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }
    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte-by-byte comparison.
My question is in two parts:
1. Is there any reason not to alter this to an int-by-int comparison until count < sizeof(int), then do a byte-by-byte comparison for what remains?
2. If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.
You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.
Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
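Putting the suggestions above together, a word-at-a-time comparison might be sketched as follows. This is not the CRT routine; memcmp_wordwise and compare_bytes are invented names. The sketch assumes that converting a pointer to std::uintptr_t yields its address (true on mainstream platforms), and it uses std::memcpy to load each word, which compilers typically reduce to a single load. In an environment with no CRT at all, that call would need a replacement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-wise comparison; also used to find the first differing byte in a word.
static int compare_bytes(const unsigned char* a, const unsigned char* b,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return static_cast<int>(a[i]) - static_cast<int>(b[i]);
        }
    }
    return 0;
}

// Hypothetical word-at-a-time memcmp with a byte-wise fallback.
int memcmp_wordwise(const void* buf1, const void* buf2, std::size_t count) {
    const unsigned char* a = static_cast<const unsigned char*>(buf1);
    const unsigned char* b = static_cast<const unsigned char*>(buf2);
    using word = std::uintptr_t;
    const std::size_t W = sizeof(word);

    const std::size_t mis_a = reinterpret_cast<std::uintptr_t>(a) % W;
    const std::size_t mis_b = reinterpret_cast<std::uintptr_t>(b) % W;

    // Fast path only when both buffers are misaligned by the same amount.
    if (mis_a == mis_b) {
        // Head: bytes up to the next word boundary (clamped to count).
        std::size_t head = (W - mis_a) % W;
        if (head > count) head = count;
        if (int r = compare_bytes(a, b, head)) return r;
        a += head; b += head; count -= head;

        // Body: compare one word at a time.
        while (count >= W) {
            word wa, wb;
            std::memcpy(&wa, a, W);
            std::memcpy(&wb, b, W);
            // On mismatch, rescan the word to get the first differing byte.
            if (wa != wb) return compare_bytes(a, b, W);
            a += W; b += W; count -= W;
        }
    }
    // Tail bytes, or the whole range if the alignments differ.
    return compare_bytes(a, b, count);
}
```

Rescanning the mismatching word byte by byte keeps the return value independent of the machine's byte order. Whether this beats the simple loop should be measured on the target, as noted above.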
If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).
Is that really their implementation? I have other issues besides not doing it int-wise:
Casting away constness.
Does that return statement work? unsigned char - unsigned char = signed int?
Int-at-a-time only works if the pointers are aligned, or if you can read a few bytes from the front of each and both are then aligned. So if both are 1 byte before an alignment boundary, you can read one char from each and then go int-at-a-time; but if they are aligned differently, e.g. one is aligned and one is not, there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when the buffers actually compare equal (it has to go to the end) and the data is long.
I would not write my own but if you are going to be comparing large portions of data you could do things like ensure alignment and even pad the ends, then do word-at-a-time, if you want.
Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Pseudocode:
while bytes remaining > (cache size) / 2 do // Half the cache for source, other half for destination.
    fetch source bytes
    fetch destination bytes
    perform comparison using fetched bytes
end while
perform byte-by-byte comparison for the remainder.
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-Thumb mode). The processor fetches the instructions but will only execute them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but lose in the portability area.
The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minimum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
If performance is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly listing that they generate for a source file. You could take that code and adapt it to your needs.
Many processors implement this as a single instruction. If you can guarantee which processor you're running on, it can be implemented with a single line of inline assembler.