What makes a system little-endian or big-endian? - c++

I'm confused with the byte order of a system/cpu/program.
So I must ask some questions to make my mind clear.
Question 1
If I only use type char in my C++ program:
void main()
{
char c = 'A';
char* s = "XYZ";
}
Then I compile this program to an executable binary file called a.out.
Can a.out both run on little-endian and big-endian systems?
Question 2
If my Windows XP system is little-endian, can I install a big-endian Linux system in VMWare/VirtualBox?
What makes a system little-endian or big-endian?
Question 3
If I want to write a byte-order-independent C++ program, what do I need to take into account?

Can a.out both run on little-endian and big-endian system?
No, because pretty much any two CPUs that are so different as to have different endianness will not run the same instruction set. C++ isn't Java; you don't compile to an intermediate form that later gets compiled or interpreted. You compile to the assembly for a specific CPU. And endianness is part of the CPU.
But that's outside of endian issues. You can compile that program for different CPUs and those executables will work fine on their respective CPUs.
What makes a system little-endian or big-endian?
As far as C or C++ is concerned, the CPU. Different processing units in a computer can actually have different endianness (the GPU could be big-endian while the CPU is little-endian), but that's somewhat uncommon.
If I want to write a byte-order independent C++ program, what do I need to take into account?
As long as you play by the rules of C or C++, you don't have to care about endian issues.
Of course, you also won't be able to load files directly into POD structs. Or read a series of bytes, pretend it is a series of unsigned shorts, and then process it as a UTF-16-encoded string. All of those things step into the realm of implementation-defined behavior.
There's a difference between "undefined" and "implementation-defined" behavior. When the C and C++ specifications say something is "undefined", it basically means all manner of brokenness can ensue. If you keep doing it (and your program doesn't crash), you could get inconsistent results. When they say that something is defined by the implementation, you will get consistent results for that implementation.
If you compile for x86 in VC2010, what happens when you pretend a byte array is an unsigned short array (ie: unsigned char *byteArray = ...; unsigned short *usArray = (unsigned short*)byteArray) is defined by the implementation. When compiling for big-endian CPUs, you'll get a different answer than when compiling for little-endian CPUs.
In general, endian issues are things you can localize to input/output systems. Networking, file reading, etc. They should be taken care of in the extremities of your codebase.
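As one hedged illustration of keeping this at the I/O boundary, here is a minimal decoder for a 16-bit field that an external format (assumed here) stores little-endian; the function name is made up, and the shifts give the same result on any host:

#include <cstdint>

// Decode a 16-bit unsigned value that the external format stores little-endian.
// The arithmetic says "the byte at offset 0 is the low byte", so the result
// does not depend on the host CPU's own endianness.
uint16_t read_u16_le(const unsigned char* p)
{
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}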

Question 1:
Can a.out both run on little-endian and big-endian system?
No. Because a.out is already compiled for whatever architecture it is targeting. It will not run on another architecture that it is incompatible with.
However, the source code for that simple program has nothing that could possibly break on different endian machines.
So yes, it (the source) will work properly (well... aside from void main(); you should be using int main() instead).
Question 2:
If my WindowsXP system is little-endian, can I install a big-endian
Linux system in VMWare/VirtualBox?
Endianness is determined by the hardware, not the OS. So whatever (native) VM you install on it will have the same endianness as the host (since x86 is all little-endian).
What makes a system little-endian or big-endian?
Here's an example of something that will behave differently on little vs. big-endian:
uint64_t a = 0x0123456789abcdefull;
uint32_t b = *(uint32_t*)&a;   // reads whichever four bytes of a sit at the lowest address
printf("b is %x\n", b);
Note that this violates strict aliasing and is only for demonstration purposes.
Little Endian : b is 89abcdef
Big Endian : b is 1234567
On little-endian, the lower bits of a are stored at the lowest address. So when you access a as a 32-bit integer, you will read the lower 32 bits of it. On big-endian, you will read the upper 32 bits.
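For comparison, a byte-order-independent way to get the same "low 32 bits" is to ask for them by value instead of by memory layout; this minimal sketch prints 89abcdef on either kind of machine:

#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t a = 0x0123456789abcdefull;
    uint32_t b = static_cast<uint32_t>(a);   // low 32 bits of the value, independent of storage order
    printf("b is %x\n", b);                  // 89abcdef on both little- and big-endian hosts
    return 0;
}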
Question 3:
If I want to write a byte-order independent C++ program, what do I
need to take into account?
Just follow the standard C++ rules and don't do anything ugly like the example I've shown above. Avoid undefined behavior, avoid type-punning...

Little-endian / big-endian is a property of the hardware. In general, binary code compiled for one piece of hardware cannot run on another, except in virtualization environments that interpret machine code and emulate the target hardware. There are bi-endian CPUs (e.g. ARM, IA-64) that feature a switch to change endianness.
As far as byte-order-independent programming goes, the only case where you really need to do it is when dealing with networking. There are functions such as ntohl and htonl to help you convert your hardware's byte order to the network's byte order.
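A small sketch of those functions in use (on POSIX they live in <arpa/inet.h>; on Windows, in <winsock2.h>):

#include <arpa/inet.h>   // htonl / ntohl on POSIX systems
#include <cstdint>

uint32_t to_network(uint32_t host_value)   { return htonl(host_value); }   // host order -> big-endian wire order
uint32_t from_network(uint32_t wire_value) { return ntohl(wire_value); }   // big-endian wire order -> host order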

The first thing to clarify is that endianness is a hardware attribute, not a software/OS attribute, so WinXP and Linux are not big-endian or little endian, but rather the hardware on which they run is either big-endian or little endian.
Endianness is a description of the order in which the bytes of a data type are stored. A big-endian system stores the most significant byte (the one with the greatest place value) first, and a little-endian system stores the least significant byte first. It is not mandatory for every data type on a system to use the same order, so you can have mixed-endian systems.
A program compiled for a little-endian system would not run on a big-endian system, but that has more to do with the instruction set available than with the endianness of the system on which it was compiled.
If you want to write a byte-order independent program you simply need to not depend on the byte order of your data.

1: The output of the compiler will depend on the options you give it and on whether you use a cross-compiler. By default, it should run on the operating system you are compiling it on and not on others (perhaps not even on others of the same type; not all Linux binaries run on all Linux installs, for example). In large projects, this will be the least of your concerns, as libraries, etc., will need to be built and linked differently on each system. Using a proper build system (like make) will take care of most of this without you needing to worry.
2: Virtual machines abstract the hardware in such a way as to allow essentially anything to run within anything else. How the operating systems manage their memory is unimportant as long as they both run on the same hardware and support whatever virtualization model is in use. Endianness means the byte order: whether values are read left-to-right or right-to-left (or in some other arrangement). Some hardware supports both, and virtualization allows both to coexist in that case (although I am not aware of how this would be useful, except that it is possible in theory). However, Linux works on many different architectures (and Windows on some other than Ixxx), so the situation is more complicated.
3: If you monkey with raw memory, such as with binary operators, you might put yourself in a position of depending on endianness. However, most modern programming is at a higher level than this. As such, you are likely to notice if you get into something which may impose endianness-based limitations. If such is ever required, you can always implement options for both endiannesses using the preprocessor.
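For instance, GCC and Clang predefine byte-order macros that make such a preprocessor switch easy; other toolchains need their own test, so treat this as a sketch rather than a universal recipe:

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    // big-endian build: multi-byte values are stored most significant byte first
#else
    // little-endian (or unknown) build: add further checks for other toolchains if needed
#endif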

The endianness of a system determines how the bytes of a value are interpreted, i.e. which byte is considered the "first" and which is considered the "last".
You need to care about it only when loading from or saving to sources external to your program, like disks or networks.
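If you do need to know the host byte order at run time, a minimal check (a sketch; the function name is made up) is to look at which byte of a known value sits at the lowest address:

#include <cstdint>
#include <cstring>

bool host_is_little_endian()
{
    const uint32_t probe = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);   // copy the byte stored at the lowest address
    return first_byte == 1;                // 1 on little-endian hosts, 0 on big-endian hosts
}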

Related

Making a program portable between machines that have different number of bits in a "machine byte"

We are all fans of portable C/C++ programs.
We know that sizeof(char) or sizeof(unsigned char) is always 1 "byte". But that 1 "byte" doesn't mean a byte with 8 bits. It just means a "machine byte", and the number of bits in it can differ from machine to machine. See this question.
Suppose you write out the ASCII letter 'A' into a file foo.txt. On any normal machine these days, which has an 8-bit machine byte, these bits would get written out:
01000001
But if you were to run the same code on a machine with a 9-bit machine byte, I suppose these bits would get written out:
001000001
More to the point, the latter machine could write out these 9 bits as one machine byte:
100000000
But if we were to read this data on the former machine, we wouldn't be able to do it properly, since there isn't enough room. Somehow, we would have to first read one machine byte (8 bits), and then somehow transform the final 1 bit into 8 bits (a machine byte).
How can programmers properly reconcile these things?
The reason I ask is that I have a program that writes and reads files, and I want to make sure that it doesn't break 5, 10, 50 years from now.
How can programmers properly reconcile these things?
By doing nothing. You've presented a filesystem problem.
Imagine that dreadful day when the first of many 9-bit machines is booted up, ready to recompile your code and process that ASCII letter A that you wrote to a file last year.
To ensure that a C/C++ compiler can reasonably exist for this machine, this new computer's OS follows the same standards that C and C++ assume, where files have a size measured in bytes.
...There's already a little problem with your 8-bit source code. There's only about a 1-in-9 chance each source file is a size that can even exist on this system (its total number of bits must be an exact multiple of the new 9-bit byte).
Or maybe not. As is often the case for me, Johannes Schaub - litb has pre-emptively cited the standard regarding valid formats for C++ source code.
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
"In an implementation-defined manner." That's good news...as long as some method exists to convert your source code to any 1:1 format that can be represented on this machine, you can compile it and run your program.
So here's where your real problem lies. If the creators of this computer were kind enough to provide a utility to bit-extend 8-bit ASCII files so they may be actually stored on this new machine, there's already no problem with the ASCII letter A you wrote long ago. And if there is no such utility, then your program already needs maintenance and there's nothing you could have done to prevent it.
Edit: The shorter answer (addressing comments that have since been deleted)
The question asks how to deal with a specific 9-bit computer...
With hardware that has no backwards-compatible 8-bit instructions
With an operating system that doesn't use "8-bit files".
With a C/C++ compiler that breaks how C/C++ programs have historically written text files.
Damian Conway has an often-repeated quote comparing C++ to C:
"C++ tries to guard against Murphy, not Machiavelli."
He was describing other software engineers, not hardware engineers, but the intention is still sound because the reasoning is the same.
Both C and C++ are standardized in a way that requires you to presume that other engineers want to play nice. Your Machiavellian computer is not a threat to your program because it's a threat to C/C++ entirely.
Returning to your question:
How can programmers properly reconcile these things?
You really have two options.
Accept that the computer you describe would not be appropriate in the world of C/C++
Accept that C/C++ would not be appropriate for a program that might run on the computer you describe
The only way to be sure is to store data in text files: numbers as strings of digit characters, not as some number of bits. XML using UTF-8 and base 10 should be a pretty good overall choice for portability and readability, as it is well defined. If you want to be paranoid, keep the XML simple enough that, in a pinch, it can be parsed with a simple custom parser, in case a real XML parser is not readily available for your hypothetical computer.
When parsing a number that is bigger than what fits in your numeric data type, well, that's an error situation you need to handle as you see fit in the context. Or use a "big int" library, which can then handle arbitrarily large numbers (with an order-of-magnitude performance hit compared to "native" numeric data types, of course).
If you need to store bit fields, then store bit fields: that is, the number of bits, and then the bit values in whatever format.
If you have a specific numeric range, then store the range, so you can explicitly check if they fit in available numeric data types.
The byte is a pretty fundamental data unit, so you cannot really transfer binary data between storage with different numbers of bits per byte. You have to convert, and to convert you need to know how the data is formatted; otherwise you simply cannot convert multi-byte values correctly.
Adding actual answer:
In your C code, do not handle byte buffers except in isolated functions, which you will then modify as appropriate for the CPU architecture. For example, JPEG-handling functions would take either a struct wrapping the image data in an unspecified way, or a file name to read the image from, but never a raw char* byte buffer.
Wrap strings in a container which does not assume encoding (presumably it will use UTF-8 or UTF-16 on 8-bit byte machine, possibly currently non-standard UTF-9 or UTF-18 on 9-bit byte machine, etc).
Wrap all reads from external sources (network, disk files, etc.) into functions which return native data (a minimal sketch follows this list).
Create code where no integer overflows happen, and do not rely on overflow behavior in any algorithm.
Define all-ones bitmasks using ~0 (instead of 0xFFFFFFFF or something)
Prefer IEEE floating point numbers for most numeric storage, where integer is not required, as those are independent of CPU architecture.
Do not store persistent data in binary files, which you may have to convert. Instead use XML in UTF-8 (which can be converted to UTF-X without breaking anything, for native handling), and store numbers as text in the XML.
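A minimal sketch of the "wrap external reads" point above; read_u32_from_file and the four-octet big-endian record field are made-up examples, and the assumption of 8-bit octets in the file format is exactly the part you would revisit inside this one isolated function on a machine with a different byte size:

#include <cstdint>
#include <cstdio>

// Hypothetical helper: the file format defines this field as four 8-bit
// octets, most significant first. Callers only ever see a native uint32_t.
bool read_u32_from_file(std::FILE* f, uint32_t& out)
{
    unsigned char octets[4];
    if (std::fread(octets, 1, sizeof octets, f) != sizeof octets)
        return false;
    out = (static_cast<uint32_t>(octets[0]) << 24) |
          (static_cast<uint32_t>(octets[1]) << 16) |
          (static_cast<uint32_t>(octets[2]) << 8)  |
           static_cast<uint32_t>(octets[3]);
    return true;
}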
As with different byte orders, except much more so, the only way to be sure is to port your program to an actual machine with a different number of bits per byte and run comprehensive tests. If this is really important, you may have to first implement such a virtual machine and port a C compiler and the needed libraries to it, if you can't find one otherwise. Even a careful (= expensive) code review will only take you part of the way.
If you're planning to write programs for quantum computers (which will be available in the near future for us to buy), then start learning quantum physics and take a class on programming them.
Unless you're planning for Boolean computer logic in the near future, then... my question is: how will you make sure that the filesystem available today will still be the same tomorrow? Or how will a file stored as 8-bit binary remain portable in the filesystems of tomorrow?
If you want to keep your programs running through generations, my suggestion is to create your own computing machine, with your own filesystem and your own operating system, and change the interface as the needs of tomorrow change.
My problem is that the computer system I programmed a few years ago (the Motorola 68000) doesn't exist anymore for the normal public, and the program heavily relied on the machine's byte order and assembly language. Not portable anymore :-(
If you're talking about writing and reading binary data, don't bother. There is no portability guarantee today, other than that data you write from your program can be read by the same program compiled with the same compiler (including command-line settings). If you're talking about writing and reading textual data, don't worry. It works.
First: The original practical goal of portability is to reduce work; therefore if portability requires more effort than non-portability to achieve the same end result, then writing portable code in such case is no longer advantageous. Do not target 'portability' simply out of principle. In your case, a non-portable version with well-documented notes regarding the disk format is a more efficient means of future-proofing. Trying to write code that somehow caters to any possible generic underlying storage format will probably render your code nearly incomprehensible, or so annoying to maintain that it will fall out of favor for that reason (no need to worry about future-proofing if no one wants to use it anyway 20 yrs from now).
Second: I don't think you have to worry about this, because the only realistic solution to running 8-bit programs on a 9-bit machine (or similar) is via Virtual Machines.
It is extremely likely that anyone in the near or distant future using some 9+ bit machine will be able to start up a legacy x86/ARM virtual machine and run your program that way. Hardware 25-50 years from now should have no problem whatsoever running entire virtual machines just for the sake of executing a single program; and that program will probably still load, execute, and shut down faster than it does today on current native 8-bit hardware. (Some cloud services today, in fact, already trend toward starting entire VMs just to service individual tasks.)
I strongly suspect this is the only means by which any 8-bit program would be run on 9/other-bit machines, due to the points made in other answers regarding the fundamental challenges inherent to simply loading and parsing 8-bit source code or 8-bit binary executables.
It may not remotely resemble "efficient", but it would work. This also assumes, of course, that the VM will have some mechanism by which 8-bit text files can be imported and exported from the virtual disk onto the host disk.
As you can see, though, this is a huge problem that extends well beyond your source code. The bottom line is that, most likely, it will be much cheaper and easier to update/modify or even re-implement-from-scratch your program on the new hardware, rather than to bother trying to account for such obscure portability issues up-front. The act of accounting for it almost certainly requires more effort than just converting the disk formats.
8-bit bytes will remain until the end of time, so don't sweat it. There will be new types, but this basic type will never ever change.
I think the likelihood of non-8-bit bytes in future computers is low. It would require rewriting so much, and for so little benefit. But if it happens...
You'll save yourself a lot of trouble by doing all calculations in native data types and just rewriting inputs. I'm picturing something like:
// Sketch only: smallestTypeWithAtLeast is a hypothetical trait supplying
// the smallest unsigned type with at least OUTPUTBITS bits.
template<int OUTPUTBITS, typename CALLABLE>
class converter {
public:
    converter(int inputbits, CALLABLE datasource);
    smallestTypeWithAtLeast<OUTPUTBITS> get();
};
Note that this can be written in the future when such a machine exists, so you need do nothing now. Or if you're really paranoid, make sure get just calls datasource when OUTPUTBITS == inputbits.
Kind of late but I can't resist this one. Predicting the future is tough. Predicting the future of computers can be more hazardous to your code than premature optimization.
Short Answer
While I end this post with how 9-bit systems handled portability with 8-bit bytes, this experience also makes me believe 9-bit-byte systems will never arise again in general-purpose computers.
My expectation is that future portability issues will be with hardware having a minimum of 16 or 32 bit access making CHAR_BIT at least 16.
Careful design here may help with any unexpected 9-bit bytes.
QUESTION to /. readers: is anyone out there aware of general purpose CPUs in production today using 9-bit bytes or one's complement arithmetic? I can see where embedded controllers may exist, but not much else.
Long Answer
Back in the 1990s, the globalization of computers and Unicode made me expect UTF-16, or larger, to drive an expansion of bits-per-character (CHAR_BIT in C). But as legacy outlives everything, I also expect 8-bit bytes to remain an industry standard and to survive at least as long as computers use binary.
BYTE_BIT: bits-per-byte (popular, but not a standard I know of)
BYTE_CHAR: bytes-per-character
The C standard does not address a char consuming multiple bytes. It allows for it, but does not address it.
3.6 byte: (final draft C11 standard ISO/IEC 9899:201x)
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment.
NOTE 1: It is possible to express the address of each individual byte of an object uniquely.
NOTE 2: A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit.
Until the C standard defines how to handle BYTE_CHAR values greater than one, and I'm not talking about "wide characters", this is the primary factor portable code must address, not larger bytes. Existing environments where CHAR_BIT is 16 or 32 are what to study. ARM processors are one example. I see two basic modes for reading external byte streams that developers need to choose from:
Unpacked: one BYTE_BIT character into a local character. Beware of sign extensions.
Packed: read BYTE_CHAR bytes into a local character.
Portable programs may need an API layer that addresses the byte issue. To create one on the fly, here is an idea I reserve the right to attack in the future:
#define BYTE_BIT  8                     // bits-per-byte
#define BYTE_CHAR (CHAR_BIT/BYTE_BIT)   // bytes-per-char

size_t byread(void *ptr,
              size_t size,    // number of BYTE_BIT bytes
              int packing,    // bytes to read per char
                              // (negative for sign extension)
              FILE *stream);

size_t bywrite(void *ptr,
               size_t size,
               int packing,
               FILE *stream);
size: the number of BYTE_BIT bytes to transfer.
packing: the number of bytes to transfer per char character. While typically 1 or BYTE_CHAR, it could indicate the BYTE_CHAR of the external system, which can be smaller or larger than that of the current system.
Never forget endianness clashes.
Good Riddance To 9-Bit Systems:
My prior experience with writing programs for 9-bit environments leads me to believe we will not see such systems again, unless you happen to need a program to run on a real old legacy system somewhere, likely in a 9-bit VM on a 32/64-bit system. Since the year 2000 I have sometimes made a quick search for, but have not seen, references to current descendants of the old 9-bit systems.
Any future general-purpose 9-bit computers, highly unexpected in my view, would likely either have an 8-bit mode or an 8-bit VM (#jstine) to run programs under. The only exception would be special-purpose embedded processors, which general-purpose code would not be likely to run on anyway.
In days of yore, one 9-bit machine was the PDP/15. A decade of wrestling with a clone of this beast makes me never expect to see 9-bit systems arise again. My top picks on why follow:
The extra data bit came from robbing the parity bit in core memory. Old 8-bit core carried a hidden parity bit with it. Every manufacturer did it. Once core got reliable enough, some system designers switched the already existing parity to a data bit in a quick ploy to gain a little more numeric power and more memory addresses during the times of weak, non-MMU machines. Current memory technology does not have such parity bits, machines are not so weak, and 64-bit memory is so big. All of which should make the design changes less cost-effective than the changes were back then.
Transferring data between 8-bit and 9-bit architectures, including off-the-shelf local I/O devices, and not just other systems, was a continuous pain. Different controllers on the same system used incompatible techniques:
Use the low-order 16 bits of 18-bit words.
Use the low-order 8 bits of 9-bit bytes, where the extra high-order bit might be set to the parity from bytes read from parity-sensitive devices.
Combine the low-order 6 bits of three 8-bit bytes to make 18-bit binary words.
Some controllers allowed selecting between 18-bit and 16-bit data transfers at run time. What future hardware, and supporting system calls, your programs would find just can't be predicted in advance.
Connecting to the 8-bit Internet will be horrid enough by itself to kill any 9-bit dreams someone has. They got away with it back then as machines were less interconnected in those times.
Having something other than a power-of-two number of bits in byte-addressed storage brings up all sorts of trouble. Example: if you want an array of thousands of bits in 8-bit bytes, you can write unsigned char bits[1024] = { 0 }; bits[n>>3] |= 1 << (n&7); (see the sketch at the end of this answer). To fully pack 9-bit bytes you must do actual divisions, which bring horrid performance penalties. This also applies to bytes-per-word.
Any code not actually tested on 9-bit-byte hardware may well fail on its first actual venture into the land of unexpected 9-bit bytes, unless the code is so simple that refactoring it in the future for 9 bits is only a minor issue. The prior byread()/bywrite() may help here, but it would likely need an additional CHAR_BIT mode setting to set the transfer mode, returning how the current controller arranges the requested bytes.
To be complete, anyone who wants to worry about 9-bit bytes for the educational experience may need to also worry about one's complement systems coming back; something else that seems to have died a well-deserved death (two zeros, +0 and -0, are a source of ongoing nightmares... trust me). Back then, 9-bit systems often seemed to be paired with one's complement operations.
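To make the bit-packing point from earlier in this answer concrete, here is the usual power-of-two idiom next to the divide-based version that a 9-bit byte would force on you (a sketch; the array size is arbitrary):

#include <climits>

unsigned char bits[1024] = { 0 };   // 1024 * CHAR_BIT individually addressable flags

void set_bit(unsigned n)
{
    bits[n >> 3] |= 1u << (n & 7);              // cheap: valid only because the byte is 8 bits
    // On a 9-bit-byte machine the same packing needs real division and modulo:
    // bits[n / CHAR_BIT] |= 1u << (n % CHAR_BIT);
}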
In a programming language, a byte is always 8 bits. So, if a byte representation has 9 bits on some machine, for whatever reason, it's up to the C compiler to reconcile that. As long as you write text using char (say, you write/read 'A' to a file), you would be writing/reading only 8 bits to the file. So, you should not have any problem.

Will Endianness be an issue for this type of binary IO operation?

For space efficiency, I have decided to code my save files using binary. Every byte represents an id for the type of tile. Would this cause an issue with different Endian computing?
Also, out of curiosity, is it the CPU or the operating system which sets the Endian type?
Additional information: I am using C++ and building an x-platform game. I do not want to use an additional API such as Boost.
Yes, it will cause issues if the saved file from a BE machine gets loaded on an LE machine or vice versa. This is why some Unicode encodings such as UTF-16 and UTF-32 have a so-called byte order mark.
If your code is usually compiled on BE you will still have to make sure that the LE code will swap the byte order before making use of the data.
The CPU sets the endianness, and some chips (e.g. some MIPS CPUs) allow it to be switched when bootstrapping the system.
We could use a little more info. Cross-platform is one thing, but which platforms? If you mean cross-platform like x86 Mac, x86 Linux and x86 Windows, then no, you won't need to worry about it (although struct packing may still be an issue if you try to just fwrite structs to disk and compile with different compilers on different platforms). Even if you have a couple of different OS/CPU combos, you can make a list of everything you want to support, and if they all have the same endianness, don't worry about it.
If you have no expectation that save data will be moved from platform to platform, you also don't need to worry about it. Endianness is only an issue when you want to create data on a big-endian machine and then read it on a little-endian machine or vice versa. If these are just local data files, no big deal, although it's probably safe to assume that if your users can copy their saves from one platform to another, they will, as they will do pretty much anything you don't want them to do and don't support.
Additionally, since you only mention bytes, if a byte array is as complex as your data is going to get, you actually don't need to worry about endianness. It's only an issue for multi-byte data types. So if you are just saving byte arrays, and the rest of your bookkeeping data also fits in bytes, there's nothing to worry about, but as soon as you save a short, int or float you'll have potential endian issues.
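If you do end up saving something wider than a byte (say, a tile count in the header), one hedged approach is to let the file format, not the CPU, fix the byte order and write the value byte by byte; write_u32_le is a made-up helper for illustration:

#include <cstdint>
#include <cstdio>

// Write a 32-bit value in a fixed little-endian on-disk layout.
// The choice of little-endian is arbitrary; what matters is that the
// format pins the order, so any host reads it back the same way.
void write_u32_le(std::FILE* f, uint32_t v)
{
    const unsigned char bytes[4] = {
        static_cast<unsigned char>(v & 0xFF),
        static_cast<unsigned char>((v >> 8) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 24) & 0xFF)
    };
    std::fwrite(bytes, 1, sizeof bytes, f);
}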
My personal opinion is to take endianness into account whenever you serialize, but I have an extremely multiplatform background (i.e. shipping the same product on 5 game systems). It's pretty easy, the swap macros are already there, and when you inevitably decide to move to another endianness you won't have to rewrite stuff. If the data is more complex or structured, maybe consider a library like Protocol Buffers or BSON.
Both the CPU and the operating system may be responsible for endianness. Historically it was baked into the CPU, and while x86 is still hardwired as little-endian, most modern RISC derivatives can operate in either mode, making it the choice of the hardware and OS developers.

Should I worry about Big Endianness or is it only a trivial aspect?

Are there many computers which use Big Endian? I just tested on 5 different computers, each purchased in a different year, and of different models. Each uses Little Endian. Is Big Endian still used nowadays, or was it for older processors such as the Motorola 6800?
Edit:
Thank you TreyA, intel.com/design/intarch/papers/endian.pdf is a very nice and handy article. It covers every answer below, and also expands upon them.
There are many processors in use today that are big-endian, or that allow the endian mode to be switched between big and little endian (e.g. SPARC, PowerPC, ARM, Itanium...).
It depends on what you mean by "care about endian". You usually don't need to care that much about endianness specifically if you just program to the data you need. Endianness matters when you need to communicate with the outside world, such as reading/writing a file or sending data over a network, and you do that by reading/writing integers larger than 1 byte directly to/from memory.
When you do need to deal with external data, you need to know its format. Part of its format is, for example, knowing how an integer is encoded in that data. If the format specifies that the first byte of a 4-byte integer is the most significant byte of said integer, you read that byte and place it at the most significant byte of the integer in your program, and you can accomplish that just fine with code that runs on both little- and big-endian machines.
So it's not so much about the processor endianness specifically as about the data you need to deal with. That data might have integers stored in either byte order; you need to know which, and various data formats will use various endianness depending on some specification, or on the whim of the guy who came up with the format.
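For example, if a format documents its 2-byte integers as most-significant-byte first, assembling the value by place rather than by memory copy gives the right answer on either kind of host (a sketch, assuming the bytes are already in a buffer):

#include <cstdint>

// The first byte of the external format is defined as the most significant one.
uint16_t decode_be_u16(const unsigned char* b)
{
    return static_cast<uint16_t>((b[0] << 8) | b[1]);
}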
Big endian is still by far the most used, in terms of different architectures. In fact, outside of Intel and the old DEC computers, I don't know of a little-endian one: SPARC, PowerPC (IBM Unix machines), HP's Unix platforms, IBM mainframes, etc. are all big-endian. But does it matter? About the only time I've had to consider endianness was when implementing some low-level system routines, like modf. Otherwise, int is an integer value in a certain range, and that's it.
The following common platforms use big-endian encoding:
Java
Network data in TCP/UDP packets (maybe even on the IP level, but I'm not sure about that)
The x86/x64 CPUs are little-endian. If you are going to interface with binary data between the two, you should definitely be aware of this.
This qualifies more as a comment than an answer, but I can't comment, and it's such a great article to read that I think it worthwhile.
This is a classic on endianness by Danny Cohen, dating from 1980:
ON HOLY WARS AND A PLEA FOR PEACE
There is not enough context to the question. In general, you should simply be aware of it at all times, but you do not need to stress over it in everyday coding. If you plan on messing with the internal bytes of your integers, start worrying about endianness. If you plan on doing standard math on your integers, don't worry about it.
The two big places where endianness pops up are networking (big-endian by standard) and binary records (where you have to research whether integers are stored big-endian or little-endian).

Does the size of an int depend on the compiler and/or processor?

Would the size of an integer depend upon the compiler, OS and processor?
The answer to this question depends on how far from practical considerations we are willing to get.
Ultimately, in theory, everything in C and C++ depends on the compiler and only on the compiler. Hardware/OS is of no importance at all. The compiler is free to implement a hardware abstraction layer of any thickness and emulate absolutely anything. There's nothing to prevent a C or C++ implementation from implementing the int type of any size and with any representation, as long as it is large enough to meet the minimum requirements specified in the language standard. Practical examples of such level of abstraction are readily available, e.g. programming languages based on "virtual machine" platform, like Java.
However, C and C++ are intended to be highly efficient languages. In order to achieve maximum efficiency a C or C++ implementation has to take into account certain considerations derived from the underlying hardware. For that reason it makes a lot of sense to make sure that each basic type is based on some representation directly (or almost directly) supported by the hardware. In that sense, the sizes of the basic types do depend on the hardware.
In other words, a specific C or C++ implementation for a 64-bit hardware/OS platform is absolutely free to implement int as a 71-bit 1's-complement signed integral type that occupies 128 bits of memory, using the other 57 bits as padding bits that are always required to store the birthdate of the compiler author's girlfriend. This implementation will even have certain practical value: it can be used to perform run-time tests of the portability of C/C++ programs. But that's where the practical usefulness of that implementation would end. Don't expect to see something like that in a "normal" C/C++ compiler.
Yes, it depends on both the processor (more specifically, the ISA, instruction set architecture, e.g., x86 and x86-64) and the compiler, including its programming model. For example, on 16-bit machines sizeof(int) was 2 bytes, while 32-bit machines have 4 bytes for int. int has traditionally been considered the native size of the processor, i.e., the size of a register. However, 32-bit computers were so popular, and so much software had been written for the 32-bit programming model, that it would be very confusing if 64-bit computers had 8 bytes for int. Both Linux and Windows keep int at 4 bytes, but they differ in the size of long.
Please take a look at the 64-bit programming model like LP64 for most *nix and LLP64 for Windows:
http://www.unix.org/version2/whatsnew/lp64_wp.html
http://en.wikipedia.org/wiki/64-bit#64-bit_data_models
Such differences are actually quite embarrassing when you write code that should work on both Windows and Linux. So I always use int32_t or int64_t, rather than long, via stdint.h.
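A small example of that habit; <cstdint> is the C++ spelling of stdint.h, and the field names here are just placeholders:

#include <cstdint>

// Widths pinned by the typedefs, so the intent survives the LP64 / LLP64
// difference in the size of long between Unix-like systems and Windows.
struct Record {
    int32_t id;       // exactly 32 bits wherever these typedefs exist
    int64_t offset;   // exactly 64 bits, unlike long
};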
Yes, it would. Did they mean "which would it depend on: the compiler or the processor"? In that case the answer is basically "both." Normally, int won't be bigger than a processor register (unless that's smaller than 16 bits), but it could be smaller (e.g. a 32-bit compiler running on a 64-bit processor). Generally, however, you'll need a 64-bit processor to run code with a 64-bit int.
Based on some recent research I have done studying up for firmware interviews:
The most significant impact of the processor's bit architecture (i.e., 8-bit, 16-bit, 32-bit, 64-bit) is how you need to store each byte of information most efficiently in order to compute variables in the minimum number of cycles.
The bit size of your processor tells you the natural word length the CPU is capable of handling in one cycle. A 32-bit machine needs 2 cycles to handle a 64-bit double, even if it is aligned properly in memory. Most personal computers were, and many still are, 32-bit, which is the most likely reason for the C compiler's typical affinity for 32-bit integers, with options for larger floating-point numbers and long long ints.
Clearly you can compute with larger variable sizes, so in that sense the CPU's bit architecture determines how it will have to store larger and smaller variables in order to achieve the best possible processing efficiency, but it is in no way a limiting factor in the definition of byte sizes for ints or chars; that is a matter for compilers and whatever is dictated by convention or standards.
I found this site very helpful, http://www.geeksforgeeks.org/archives/9705, for explaining how the CPU's natural word length affects how it will choose to store and handle larger and smaller variable types, especially with regard to bit packing into structs. You have to be very cognizant of how you choose to order your variables, because larger variables need to be aligned in memory so that they take the fewest number of cycles when divided by the CPU's word length. This will add a lot of potentially unnecessary buffer/empty space to things like structs if you order the declarations poorly.
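As a hedged illustration of that padding effect (exact numbers depend on the ABI; the sizes in the comments assume double has 8-byte alignment):

#include <cstdio>

struct PoorlyOrdered { char a; double b; char c; };   // commonly 24 bytes: padding after a and after c
struct WellOrdered   { double b; char a; char c; };   // commonly 16 bytes: padding only at the end

int main()
{
    std::printf("%zu %zu\n", sizeof(PoorlyOrdered), sizeof(WellOrdered));
    return 0;
}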
The simple and correct answer is that it depends on the compiler. That doesn't mean the architecture is irrelevant, but the compiler deals with it, not your application. You could say more accurately that it depends on the (target) architecture of the compiler, for example whether it is 32-bit or 64-bit.
Consider a Windows application that creates a file, writes an int plus other things to it, and reads them back. What happens if you run this on both 32-bit and 64-bit Windows? What happens if you copy a file created on the 32-bit system and open it on the 64-bit system?
You might think the size of int will be different in each file, but no, they will be the same, and this is the crux of the question: you pick the compiler settings that target the 32-bit or 64-bit architecture, and that dictates everything.
http://www.agner.org/optimize/calling_conventions.pdf
"3 Data representation" contains good overview of what compilers do with integral types.
Data type size depends on the processor, because the compiler wants to make the next byte easily accessible to the CPU. For example, if the processor is 32-bit, the compiler may not choose an int size of 2 bytes (it is supposed to choose 4 bytes), because accessing only 2 bytes of a 4-byte word can take an additional CPU cycle, which is wasteful. If the compiler chooses 4 bytes for int, the CPU can access the full 4 bytes in one shot, which speeds up your application.
The size of int is equal to the word length, which depends upon the underlying ISA. The processor is just the hardware implementation of the ISA, and the compiler is just the software-side implementation of the ISA. Everything revolves around the underlying ISA. The most popular ISA these days is Intel's IA-32; it has a word length of 32 bits, or 4 bytes, so 4 bytes could be the maximum size of 'int' (just plain int, not short or long) that compilers based on IA-32 could use.
The size of a data type basically depends on the compiler, and compilers are designed around processor architectures, so externally the data type can be considered compiler-dependent. For example, the size of int is 2 bytes in the 16-bit Turbo C compiler but 4 bytes in the gcc compiler, even though they are executed on the same processor.
Yes, I found that the size of int in Turbo C was 2 bytes, whereas in the MSVC compiler it was 4 bytes.
Basically the size of int is the size of the processor registers.