Are signed binary representations portable between architectures? - c++

I would like to send signed integers across a network in a character stream, and/or save them to disk in a portable binary representation.
Is the ordinary binary representation that gets stored in memory when I assign a variable (I'm using long long, which is 64 bits, signed), considered to be portable between different machine architectures and operating systems?
For example, is the binary representation of a signed negative long long integer the same on an ARM machine as on an x86 machine, and if so, is it considered good/acceptable practice to take advantage of the fact?
Edit: We're already addressing the issue of byte order by using integer operations to deconstruct the integer into chars from the LSB side. My question is whether the 2's complement signed representation is consistent across architectures.

In short: NO, you cannot rely on binary compatibility across architectures. Different host machines use different byte orders (that's called endianness).
To transport numbers over a network, the general convention is to use big-endian format (aka network byte order). If both machines are big-endian, the conversion functions reduce to no-ops; on little-endian hosts (which includes x86 and most ARM configurations) the byte order has to be converted.
To convert from network to host byte order and vice versa, you can use the ntohl(), ntohs(), htonl(), htons() function family from <arpa/inet.h>.
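For a 32-bit value, a minimal sketch of how this looks on a POSIX system (the to_wire/from_wire names are just illustrative):

#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>

/* Sender: convert host byte order to network (big-endian) order. */
uint32_t to_wire(uint32_t host_value) {
    return htonl(host_value);   /* a no-op on big-endian hosts */
}

/* Receiver: convert network order back to host byte order. */
uint32_t from_wire(uint32_t wire_value) {
    return ntohl(wire_value);
}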

Integers are not "portable by design" in C/C++; you might encounter endianness issues. This is unrelated to them being signed or unsigned.
If you are writing this software, make sure to send your integers in a known byte order. The receiving end needs to read them in the same order, performing an endianness conversion when necessary.
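Since the question already deconstructs the value from the LSB side, here is a minimal sketch of that approach (hypothetical helper names; it assumes two's complement, which virtually all current hardware uses and which C++20 mandates). Because only shifts and masks are used, the host's own byte order never enters the picture:

#include <stdint.h>

/* Write v into buf[0..7], least significant byte first (the chosen wire order). */
void put_i64_le(unsigned char *buf, int64_t v) {
    uint64_t u = (uint64_t)v;                   /* signed-to-unsigned conversion is well-defined */
    for (int i = 0; i < 8; ++i)
        buf[i] = (unsigned char)(u >> (8 * i));
}

/* Read the same wire format back on any host. */
int64_t get_i64_le(const unsigned char *buf) {
    uint64_t u = 0;
    for (int i = 0; i < 8; ++i)
        u |= (uint64_t)buf[i] << (8 * i);
    return (int64_t)u;  /* implementation-defined for negative values before C++20,
                           but does the expected thing on two's-complement machines */
}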

Related

Floating point differences when porting MIPS code to x86_64

I am currently working on porting a piece of code written and compiled for SGI using MIPSPro to RHEL 6.7 with gcc 4.4.7. My target architecture is x86_64. I was able to generate an executable for this code and now I am trying to run it.
I am trying to read binary data from a file; this file was generated on the SGI system by basically casting an object's pointer to a char* and saving that to a file. The piece of binary data that I am trying to read has more or less this format:
[ Header, Object A , Object B, ..., Object N ]
Where each object is an instantiation of different classes.
The way the code currently processes the file is by reading it all into memory, taking the pointer to where the object starts, and using reinterpret_cast<ClassA*>(pointer) on it. Something tells me that the people who originally designed this were not concerned about portability.
So far I was able to deal with the endianness of the Header object by just swapping the bytes. Unfortunately, Objects A, B, .., N all contain fields of type double and trying to do a byte-swap for 8 bytes does not seem to work.
My question then is: are doubles in SGI/MIPSPro structured differently than on Linux? I know that sizeof(double) on the SGI machine returns 8, so I think they are the same size.
According to the MIPSPro ABI:
the MIPS processors conform to the IEEE 754 floating point standard
Your target platform, x86_64, shares this quality.
As such, double means IEEE-754 double-precision float on both platforms.
When it comes to endianness, x86_64 processors are little-endian; but, according to the MIPSpro assembly programmer's guide, MIPS processors can be big-endian:
For R4000 and earlier systems, byte ordering is configurable into either big-endian or little-endian byte ordering (configuration occurs during hardware reset). When configured as a big-endian system, byte 0 is always the most-significant (leftmost) byte. When configured as a little-endian system, byte 0 is always the least-significant (rightmost) byte.
The R8000 CPU, at present, supports big-endian only.
So, you will have to check the datasheet for the original platform and see whether any byte swapping is needed.
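If swapping does turn out to be necessary, a safe way to do it for a double is to go through a byte buffer with memcpy rather than casting pointers at the double itself. A minimal sketch (byteswap_double is a made-up name):

#include <stddef.h>
#include <string.h>

/* Reverse the byte order of an 8-byte IEEE-754 double read from a foreign-endian file. */
double byteswap_double(double in) {
    unsigned char bytes[sizeof(double)];
    memcpy(bytes, &in, sizeof bytes);              /* avoids aliasing/punning problems */
    for (size_t i = 0; i < sizeof bytes / 2; ++i) {
        unsigned char tmp = bytes[i];
        bytes[i] = bytes[sizeof bytes - 1 - i];
        bytes[sizeof bytes - 1 - i] = tmp;
    }
    double out;
    memcpy(&out, bytes, sizeof out);
    return out;
}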

Who decides the sizeof any datatype or structure (depending on 32 bit or 64 bit)?

Who decides the sizeof any datatype or structure (depending on 32 bit or 64 bit)? The compiler or the processor? For example, sizeof(int) is 4 bytes for a 32 bit system whereas it's 8 bytes for a 64 bit system.
I also read that sizeof(int) is 4 bytes when compiled with both a 32-bit and a 64-bit compiler.
Suppose my CPU can run both 32-bit as well as 64-bit applications; who will play the main role in deciding the size of data, the compiler or the processor?
It's ultimately the compiler. The compiler implementors can decide to emulate whatever integer size they see fit, regardless of what the CPU handles most efficiently. That said, the C (and C++) standard is written such that the compiler implementor is free to choose the fastest and most efficient way. For many compilers, the implementers chose to keep int as 32 bits, although the CPU natively handles 64-bit ints very efficiently.
I think this was done in part to increase portability towards programs written when 32-bit machines were the most common, and which expected an int to be 32 bits and no wider. (It could also be, as user3386109 points out, that 32-bit data was preferred because it takes less space and therefore can be accessed faster.)
So if you want to make sure you get 64 bit ints, you use int64_t instead of int to declare your variable. If you know your value will fit inside of 32 bits or you don't care about size, you use int to let the compiler pick the most efficient representation.
As for the other datatypes such as struct, they are composed from the base types such as int.
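Going back to the int64_t advice above, a small sketch of making the width assumptions explicit (the variable names and static_asserts are purely illustrative):

#include <stdint.h>

int64_t file_offset;   /* exactly 64 bits, signed, on any platform that provides int64_t   */
int     loop_counter;  /* whatever the compiler finds natural; only >= 16 bits is guaranteed */

static_assert(sizeof(int64_t) == 8, "int64_t is exactly 64 bits by definition");
static_assert(sizeof(int) >= 2, "the standard only guarantees int at least 16 bits");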
It's not the CPU, nor the compiler, nor the operating system. It's all three at the same time.
The compiler can't just make things up. It has to adhere to the ABI[1] that the operating system provides. If structs and system calls provided by the operating system have types with certain sizes and alignment requirements, the compiler isn't really free to make up its own reality, unless the compiler developers want to reimplement wrapper functions for everything the operating system provides. Then the ABI of the operating system can't just be completely made up either; it has to do what can reasonably be done on the CPU. And very often the ABI of one operating system will be very similar to ABIs of other operating systems on the same CPU, because it's easier to just reuse the work they did (on compilers among other things).
In the case of computers that support both 32-bit and 64-bit code, there still needs to be work done by the operating system to support running programs in both modes (because the system has to provide two different ABIs). Some operating systems don't do it, and on those you don't have a choice.
[1] ABI stands for Application Binary Interface. It's a set of rules for how a program interacts with the operating system. It defines how a program is stored on disk to be runnable by the operating system, how to do system calls, how to link with libraries, etc. But to be able to link to libraries, for example, your program and the library have to agree on how to make function calls between your program and the library (and vice versa), and to be able to make function calls both the program and the library have to have the same idea of stack layout, register usage, function call conventions, etc. And for function calls you need to agree on what the parameters mean, and that includes sizes, alignment and signedness of types.
It is strictly, 100%, entirely the compiler that decides the value of sizeof(int). It is not a combination of the system and the compiler. It is just the compiler (and the C/C++ language specifications).
If you develop iPad or iPhone apps, the compiler runs on your Mac. The Mac and the iPhone/iPad use different processors. Nothing about your Mac tells the compiler what size should be used for int on the iPad.
The processor designer determines what registers and instructions are available, what the alignment rules for efficient access are, how big memory addresses are and so-on.
The C standard sets minimum requirements for the built-in types. "char" must be at least 8 bits, "short" and "int" must be at least 16 bits, "long" must be at least 32 bits and "long long" must be at least 64 bits. It also says that "char" must be equivalent to the smallest unit of memory the program can address, and that the size ordering of the standard types must be maintained.
Other standards may also have an impact. For example, version 2 of the Single UNIX Specification says that int must be at least 32 bits.
Finally, existing code has an impact. Porting is hard enough already; no one wants to make it any harder than they have to.
When porting an OS and compiler to a new CPU, someone has to define what is known as a "C ABI". This defines how pieces of binary code talk to each other, including:
The size and alignment requirements of the built-in types.
The packing rules for structures (and hence what their size will be).
How parameters are passed and returned
How the stack is managed
In general, once an ABI is defined for a combination of CPU family and OS, it doesn't change much (sometimes the size of more obscure types like "long double" changes). Changing it brings a bunch of breakage for relatively little gain.
Similarly those porting an OS to a platform with similar characteristics to an existing one will usually choose the same sizes as on previous platforms that the OS was ported to.
In practice OS/compiler vendors typically settle on one of a few combinations of sizes for the basic integer types (a quick way to check which one you have is sketched after this list):
"LP32": char is 8 bits. short and int are 16 bits, long and pointer are 32-bits. Commonly used on 8 bit and 16 bit platforms.
"ILP32": char is 8 bits, short is 16 bits. int, long and pointer are all 32 bits. If long long exists it is 64 bit. Commonly used on 32 bit platforms.
"LLP64": char is 8 bits. short is 16 bits. int and long are 32 bits. long long and pointer are 64 bits. Used on 64 bit windows.
"LP64": char is 8 bits. short is 16 bits. int is 32 bits. long, long long and pointer are 64 bits. Used on most 64-bit unix-like systems.
"ILP64": char is 8 bits, short is 16 bits, int, long and pointer and long long are all 64 bits. Apparently used on some early 64-bit operating systems but rarely seen nowadays.
64 bit processors can typically run both 32-bit and 64-bit binaries. Generally this is handled by having a compatibility layer in your OS. So your 32-bit binary uses the same data types it would use when running on a 32-bit system, then the compatibility layer translates the system calls so that the 64-bit OS can handle them.
The compiler decides how large the basic types are, and what the layout of structures is. If a library declares any types, it will decide how those are defined and therefore what size they are.
However, it is often the case that compatibility with an existing standard, and the need to link against existing libraries produced by other compilers, forces a given implementation to make certain choices. For example, the language standard says that a wchar_t has to be wider than 16 bits, and on Linux it is 32 bits wide, but it's always been 16 bits on Windows, so compilers for Windows all choose to be compatible with the Windows API instead of the language standard. A lot of legacy code for both Linux and Windows assumes that a long is exactly 32 bits wide, while other code assumes it is wide enough to hold a timestamp in seconds, or an IPv4 address, or a file offset, or the bits of a pointer. And (after one compiler defined int as 64 bits wide and long as 32 bits wide) the language standard made a new rule that int cannot be wider than long.
As a result, mainstream compilers from this century choose to define int as 32 bits wide, but historically some have defined it as 16 bits, 18 bits, 32 bits, 64 bits and other sizes. Some compilers let you choose whether long will be exactly 32 bits wide, as some legacy code assumes, or as wide as a pointer, as other legacy code assumes.
This demonstrates how assumptions you make today, like some type always being 32 bits wide, might come back to bite you in the future. This has already happened to C codebases twice, in the transitions to 32-bit and 64-bit code.
But what should you actually use?
The int type is rarely useful these days. There's usually some other type you can use that makes a stronger guarantee of what you'll get. (It does have one advantage: types that aren't as wide as an int get automatically widened to int, which can cause a few really weird bugs when you mix signed and unsigned types, and int is the smallest type guaranteed not to be promoted in that way.)
If you’re using a particular API, you’ll generally want to use the same type it does. There are numerous types in the standard library for specific purposes, such as clock_t for clock ticks and time_t for time in seconds.
If you want the fastest type that's at least 16 bits wide, that's int_fast16_t, and there are other similar types. (Unless otherwise specified, all these types are defined in <stdint.h>.) If you want the smallest type that's at least 32 bits wide, to pack the most data into your arrays, that's int_least32_t. If you want the widest possible type, that's intmax_t. If you know you want exactly 32 bits, and your compiler has a type like that, it's int32_t. If you want something that's 32 bits wide on a 32-bit machine and 64 bits wide on a 64-bit machine, and always the right size to store a pointer, that's intptr_t. If you want a good type for doing array indexing and pointer math, that's ptrdiff_t from <stddef.h>. (This one's in a different header because it's from C89, not C99.)
Use the type you really mean!
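A short sketch of what that looks like in practice (the variable names are just illustrative):

#include <stdint.h>
#include <stddef.h>
#include <time.h>

uint32_t     checksum;       /* exactly 32 bits: matches an on-disk or wire format   */
int_fast16_t small_counter;  /* at least 16 bits, whatever is fastest on this target */
intptr_t     pointer_bits;   /* wide enough to hold the bits of a pointer            */
ptrdiff_t    index_delta;    /* the result type of subtracting two pointers          */
time_t       start_time;     /* time in seconds, in whatever width the platform uses */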
When you talk about the compiler, you must have a clear picture of build|host|target, i.e. the machine you are building on (build), the machine you are building for (host), and the machine GCC will produce code for (target), because "cross compiling" is very different from "native compiling".
As for the question "who decides the sizeof a datatype and structure": it depends on the target system you told the compiler to build the binary for. If the target is a typical 64-bit machine, the compiler will translate sizeof(long) to 8, and if the target is a 32-bit machine, it will translate sizeof(long) to 4. All of this has been predetermined by the headers you used to build your program. If you read your `$MAKETOP/usr/include/stdint.h', there are typedefs that define the sizes of your datatypes.
To avoid errors caused by such size differences, the Google C++ style guide (Integer Types) recommends using types like int16_t, uint32_t, int64_t, etc. Those are defined in <stdint.h>.
The above covers only plain old data such as int. If you talk about a structure, that's another story, because the size of a structure depends on packing and alignment, i.e. the alignment boundary of each field in the structure, which affects the overall size of the structure.
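For example, padding inserted to satisfy field alignment can make a struct larger than the sum of its members, and reordering fields can change the result. The sizes in the comments assume a typical ABI where uint64_t is 8-byte aligned; other ABIs may differ:

#include <stdint.h>
#include <stdio.h>

struct Padded    { uint8_t a; uint64_t b; uint8_t c; };  /* commonly 24 bytes */
struct Reordered { uint64_t b; uint8_t a; uint8_t c; };  /* commonly 16 bytes */

int main(void) {
    printf("%zu %zu\n", sizeof(struct Padded), sizeof(struct Reordered));
    return 0;
}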
It's the compiler, and more precisely its code generator component.
Of course, the compiler is architecture-aware and makes choices that fit with it.
In some cases, the work is performed in two passes: one at compile time by an intermediate code generator, then a second at run time by a just-in-time compiler. But this is still a compiler.

Network byte order and endianness issues

I read on the internet that the standard byte order for networks is big-endian, also known as network byte order. Before transferring data over the network, the data is first converted to network byte order (big-endian).
But can anyone please let me know who takes care of this conversion?
Does the code developer really have to worry about this endianness? If yes, can you please give examples (in C or C++) of where we need to take care of it?
The first place where the network vs native byte order matters is in creating sockets and specifying the IP address and port number. Those must be in the correct order or you will not end up talking to the correct computer, or you'll end up talking to the incorrect port on the correct computer if you mapped the IP address but not the port number.
The onus is on the programmer to get the addresses in the correct order. There are functions like htonl() that convert from host (h) to network (n) order; l indicates 'long' meaning '4 bytes'; s indicates 'short' meaning '2 bytes' (the names date from an era before 64-bit systems).
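For example, filling in an IPv4 socket address on a POSIX system looks roughly like this (a sketch; 8080 is an arbitrary port and INADDR_LOOPBACK is 127.0.0.1):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>

struct sockaddr_in make_addr(void) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(8080);            /* 2-byte port: host -> network order    */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 4-byte address: host -> network order */
    return addr;
}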
The other time it matters is if you are transferring binary data between two computers, either via a network connection correctly set up over a socket, or via a file. With single-byte code sets (SBCS), or UTF-8, you don't have problems with textual data. With multi-byte code sets (MBCS), or UTF-16LE vs UTF-16BE, or UTF-32, you have to worry about the byte order within characters, but the characters will appear one after the other. If you ship a 32-bit integer as 32 bits of data, the receiving end needs to know whether the first byte is the MSB (most significant byte, for big-endian) or the LSB (least significant byte, for little-endian) of the 32-bit quantity. Similarly with 16-bit integers, or 64-bit integers. With floating point, you could run into the additional problem that different computers could use different formats for the floating point, independently of the endianness issue. This is less of a problem than it used to be thanks to IEEE 754.
Note that IBM mainframes use EBCDIC instead of ASCII or ISO 8859-x character sets (at least by default), and the floating point format is not IEEE 754 (it pre-dates that standard by a decade or more). These issues, therefore, are crucial to deal with when communicating with the mainframe. The programs at the two ends have to agree on how each end will understand the other. Some protocols define a byte order (e.g. network byte order); others define 'sender makes right' or 'receiver makes right' or 'client makes right' or 'server makes right', placing the conversion workload on different parts of the system.
One advantage of text protocols (especially those using an SBCS) is that they evade the problems of endianness — at the cost of converting text to value and back, but computation is cheap compared to even gigabit networking speeds.
In C and C++, you will have to worry about endianness in low level network code. Typically the serialization and deserialization code will call a function or macro that adjusts the endianness - reversing it on little endian machines, doing nothing on big endian machines - when working with multibyte data types.
Just send stuff in a well-defined order that the receiver can understand, i.e. use ntohl() and its ilk: http://www.manpagez.com/man/3/ntohl/

Should I worry about Big Endianness or is it only a trivial aspect?

Are there many computers which use Big Endian? I just tested on 5 different computers, each purchased in a different year, and of different models. Each uses Little Endian. Is Big Endian still used nowadays, or was it for older processors such as the Motorola 6800?
Edit:
Thank you TreyA, intel.com/design/intarch/papers/endian.pdf is a very nice and handy article. It covers all the answers below, and also expands upon them.
There are many processors in use today that are big-endian, or that allow switching between big-endian and little-endian modes (e.g. SPARC, PowerPC, ARM, Itanium...).
It depends on what you mean by "care about endian". You usually don't need to care that much about endianness if you just program to the data you need. Endianness matters when you need to communicate with the outside world, such as reading/writing a file, or sending data over a network, and you do that by reading/writing integers larger than 1 byte directly to/from memory.
When you do need to deal with external data, you need to know its format. Part of its format is, for example, how an integer is encoded in that data. If the format specifies that the first byte of a 4-byte integer is the most significant byte of said integer, you read that byte and place it at the most significant byte of the integer in your program, and you can accomplish that fine with code that runs on both little- and big-endian machines.
So it's not so much about the processor endianness specifically, but about the data you need to deal with. That data might have integers stored in either "endian"; you need to know which, and various data formats will use various endianness depending on some specification, or depending on the whim of the guy that came up with the format.
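For instance, if a format says "unsigned 32-bit integer, most significant byte first", the following decodes it correctly on any host without ever asking what the host's own endianness is (a sketch; the function name is made up):

#include <stdint.h>

/* Decode a big-endian 32-bit unsigned integer from a byte buffer. */
uint32_t read_u32_be(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |
           ((uint32_t)p[3]);
}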
Big endian is still by far the most used, in terms of different architectures. In fact, outside of Intel and the old DEC computers, I don't know of a little-endian one: SPARC, PowerPC (IBM Unix machines), HP's Unix platforms, IBM mainframes, etc. are all big-endian. But does it matter? About the only time I've had to consider endianness was when implementing some low-level system routines, like modf. Otherwise, int is an integer value in a certain range, and that's it.
The following common platforms use big-endian encoding:
Java
Network data in TCP/UDP packets (maybe even on the IP level, but I'm not sure about that)
The x86/x64 CPUs are little-endian. If you are going to interface with binary data between the two, you should definitely be aware of this.
This qualifies more as a comment than an answer, but I can't comment, and I think it's such a great article to read that it's worthwhile posting here.
This is a classic on endianness by Danny Cohen, dating from 1980:
ON HOLY WARS AND A PLEA FOR PEACE
There is not enough context to the question. In general, you should simply be aware of it at all times, but you do not need to stress over it in everyday coding. If you plan on messing with the internal bytes of your integers, start worrying about endianness. If you plan on doing standard math on your integers, don't worry about it.
The two big places where endianness pops up are in networking (big-endian is the standard) and binary records (you have to research whether integers are stored big-endian or little-endian).

What makes a system little-endian or big-endian?

I'm confused with the byte order of a system/cpu/program.
So I must ask some questions to make my mind clear.
Question 1
If I only use type char in my C++ program:
void main()
{
char c = 'A';
char* s = "XYZ";
}
Then compile this program to an executable binary file called a.out.
Can a.out run on both little-endian and big-endian systems?
Question 2
If my Windows XP system is little-endian, can I install a big-endian Linux system in VMWare/VirtualBox?
What makes a system little-endian or big-endian?
Question 3
If I want to write a byte-order-independent C++ program, what do I need to take into account?
Can a.out run on both little-endian and big-endian systems?
No, because pretty much any two CPUs that are so different as to have different endianness will not run the same instruction set. C++ isn't Java; you don't compile to a portable bytecode that gets recompiled or interpreted later. You compile to the machine code for a specific CPU, and endianness is part of the CPU.
But that's outside of endian issues. You can compile that program for different CPUs and those executables will work fine on their respective CPUs.
What makes a system little-endian or big-endian?
As far as C or C++ is concerned, the CPU. Different processing units in a computer can actually have different endianness (the GPU could be big-endian while the CPU is little-endian), but that's somewhat uncommon.
If I want to write a byte-order independent C++ program, what do I need to take into account?
As long as you play by the rules of C or C++, you don't have to care about endian issues.
Of course, you also won't be able to load files directly into POD structs. Or read a series of bytes, pretend it is a series of unsigned shorts, and then process it as a UTF-16-encoded string. All of those things step into the realm of implementation-defined behavior.
There's a difference between "undefined" and "implementation-defined" behavior. When the C and C++ specs say something is "undefined", it basically means all manner of brokenness can ensue. If you keep doing it (and your program doesn't crash), you could get inconsistent results. When they say that something is defined by the implementation, you will get consistent results for that implementation.
If you compile for x86 in VC2010, what happens when you pretend a byte array is an unsigned short array (ie: unsigned char *byteArray = ...; unsigned short *usArray = (unsigned short*)byteArray) is defined by the implementation. When compiling for big-endian CPUs, you'll get a different answer than when compiling for little-endian CPUs.
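If you do need to know the host's byte order at run time, inspecting the object representation through memcpy (or an unsigned char pointer) is well-defined, unlike the cast above. A small sketch:

#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void) {
    uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);  /* reads the byte stored at the lowest address */
    return first_byte == 1;
}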
In general, endian issues are things you can localize to input/output systems. Networking, file reading, etc. They should be taken care of in the extremities of your codebase.
Question 1:
Can a.out run on both little-endian and big-endian systems?
No, because a.out is already compiled for whatever architecture it is targeting. It will not run on another architecture that it is incompatible with.
However, the source code for that simple program has nothing that could possibly break on different endian machines.
So yes, it (the source) will work properly (well... aside from void main(); you should be using int main() instead).
Question 2:
If my Windows XP system is little-endian, can I install a big-endian Linux system in VMWare/VirtualBox?
Endianness is determined by the hardware, not the OS. So whatever (native) VM you install on it will have the same endianness as the host (since x86 is all little-endian).
What makes a system little-endian or big-endian?
Here's an example of something that will behave differently on little vs. big-endian:
uint64_t a = 0x0123456789abcdefull;
uint32_t b = *(uint32_t*)&a;
printf("b is %x",b)
Note that this violates strict aliasing and is only for demonstration purposes.
Little Endian : b is 89abcdef
Big Endian : b is 1234567
On little-endian, the lower bits of a are stored at the lowest address. So when you access a as a 32-bit integer, you will read the lower 32 bits of it. On big-endian, you will read the upper 32 bits.
Question 3:
If I want to write a byte-order independent C++ program, what do I
need to take into account?
Just follow the standard C++ rules and don't do anything ugly like the example I've shown above. Avoid undefined behavior, avoid type-punning...
Little-endian / big-endian is a property of hardware. In general, binary code compiled for one piece of hardware cannot run on another, except in virtualization environments that interpret machine code and emulate the target hardware. There are bi-endian CPUs (e.g. ARM, IA-64) that feature a switch to change endianness.
As far as byte-order-independent programming goes, the only case when you really need to do it is when dealing with networking. There are functions such as ntohl and htonl to help you convert between your hardware's byte order and the network byte order.
The first thing to clarify is that endianness is a hardware attribute, not a software/OS attribute, so WinXP and Linux are not big-endian or little-endian; rather, the hardware on which they run is either big-endian or little-endian.
Endianness is a description of the order in which the bytes of a data type are stored. A big-endian system stores the most significant byte (the byte contributing the biggest value) first, and a little-endian system stores the least significant byte first. It is not mandatory for every data type to follow the same order as the others on a system, so you can have mixed-endian systems.
A program compiled for a little-endian system would not run on a big-endian system, but that has more to do with the instruction set available than with the endianness of the system on which it was compiled.
If you want to write a byte-order independent program you simply need to not depend on the byte order of your data.
1: The output of the compiler will depend on the options you give it and on whether you use a cross-compiler. By default, it should run on the operating system you are compiling it on and not on others (perhaps not even on others of the same type; not all Linux binaries run on all Linux installs, for example). In large projects, this will be the least of your concerns, as libraries, etc., will need to be built and linked differently on each system. Using a proper build system (like make) will take care of most of this without you needing to worry.
2: Virtual machines abstract the hardware in such a way as to allow essentially anything to run within anything else. How the operating systems manage their memory is unimportant as long as they both run on the same hardware and support whatever virtualization model is in use. Endianness means the byte order: whether it is read left-to-right or right-to-left (or some other arrangement). Some hardware supports both, and virtualization allows both to coexist in that case (although I am not aware of how this would be useful except that it is possible in theory). However, Linux works on many different architectures (and Windows on some other than Ixxx), so the situation is more complicated.
3: If you monkey with raw memory, such as with binary operators, you might put yourself in a position of depending on endianness. However, most modern programming is at a higher level than this. As such, you are likely to notice if you get into something which may impose endianness-based limitations. If such is ever required, you can always implement options for both endiannesses using the preprocessor.
The endianness of a system determines how the bytes of a multi-byte value are interpreted, i.e. which byte is considered the "first" and which the "last".
You need to care about it only when loading from or saving to sources external to your program, like disk or the network.