Filesize difference when cross compiling

Filesize difference when cross compiling - c++

I am writing a small application in c++ that runs on my host machine (linux x86) and on a a target machine(arm).
The problem I have is that on the host machine my binary is about 700kb of size but on the target machine it is about 7mb.
I am using the same compile switches for both platforms. My first though was that a library on the arget machine got linked statically but I checked both binaries with objdump and both use the same dynamically link libraries.
So can anyone give me hint how I can figure out why there is such a huge difference in size?

While different computer architectures can theoretically require completely different amounts of executable code for the same program, a factor of 10 is not really expected among modern architectures. ARM and x86 may be different, but they are still designed in the same universe where memory and bandwidth is not something to waste, leading CPU designers to try to keep the executable code as tight as possible.
I would, therefore, look at the following possibilities, in order of probability:
Symbol stripping: if one of the two binaries has been stripped from its symbols, then it would be significantly smaller, especially if compiled with debugging information. You might want to try to strip both binaries and see what happens.
Static linking: I have occasionally encountered build systems for embedded targets that would prefer static linking over using shared libraries. Examining the library dependencies of each binary would probably detect this.
Additional enabled code: The larger binary may have additional code enabled because e.g. the build system found an additional optional library or because the target platform requires specific handle.
Still, a factor of 10 is probably too much for this, unless the smaller binary is missing a lot of functionality or the larger one has linked in some optional library statically.
Different compiler configuration: You should not only look at the compiler options that you supply, but also at the defaults the compiler uses for each target. For example if the compiler has significantly higher inlining or loop unrolling limits in one architecture, the resulting executable could baloon-out noticeably.

first there is no reason to expect the same code compiled for different architectures to have any kind of relationship in size to each other. You can easily have A be larger than B then change one line of code and then B is larger than A.
Second the "binaries" you are talking about are I am guessing elf, which is a little bit of binary and some to a lot of overhead. The overhead can vary between architectures and other such things.
Bottom line if you are compiling the same code for two architectures/platforms or with different compilers or compile options for the same architecture there is no reason to expect the file sizes to have any relationship to each other.

Different architectures can have completely different ways to handle the same thing. For example loading immediate value on CISC (e.g. x86) architecture is usually one instruction, while on RISC (e.g. ppc, arm) it usually is more than one instruction, the actual number needed being dependent on the value. For example if the instruction set only allows 16bit immediate values, you may need up to 7 instructions to load a 64bit value (loading by 16bits and shifting in between the loads). Hence the code is inherently different.

One reason not mentioned so far, but relevant to ARM/x86 comparisons is Floating Point emulation. All x86 chips today come with native FP support (and x86-64 even with SIMD FP support via SSE), but ARM CPU's often lack a FP unit. That in turn means even a simple FP addition has to be turned into a long sequence of integer operations on exponents and mantissa's.

Related

What affects generated machine code at each step of the compilation process?

I am almost certain this question has been asked before, but I can not seem to find the right keywords to search for to get an answer. My apologies if this is a duplicate.
I am better trying to understand the compilation process of say a C++ file as it goes from the C++ syntax to the binary machine code. In addition I am trying to understand what influences the resulting machine code.
First, I am nearly certain that the following are the only factors (for most systems) that dictate the final machine code (please correct me if I am wrong here)
The tools used to compile, assemble, and link.
Things like gnu c compiler, clang, visual studio, nasm, ect.
The kernel of the system being used.
Whether its a specific version of the linux kernel, windows microkernel, or some other kernel like a mac os x one.
The operating system being used.
This one I am less clear about. I am unsure if machines running the same linux kernel, but different os, in this case let's say debian vs centos, will they produce different binaries.
Lastly the hardware architecture.
Different cpu architectures like arm 64, x86, power pc, ect. take different op codes so obviously the machine code should be different.
So with that being said here is my understanding of the compilation process and where each of these dependencies show up.
I write a C++ file and use code that my system can understand. A good example might be using <winsock.h> on windows and <sys/socket.h> on linux.
The preprocessor runs and executes any preprocessor macros.
Here I know that different preprocessors will define different macros but for now I will assume this is not too machine dependent. (This might be wrong to assume).
The compiler tools run to produce assembly file outputs.
Here the assembly produced depends on the compiler and what optimizations or choices it makes.
It also depends on the kernel because different kernels have different system calls and store files in different locations. This means the assembly might make changes such as different branching when calling functions specific to that kernel.
The operating system? Still unsure how the operating system fits in to this. If two machines have the same kernel, what does the operating system do to the binaries?
Finally the assembly code depends on the cpu architecture. I think that is a pretty obvious statement.
Once the compiler produces an assembly. We can then invoke the assembler to turn our assembly code into almost complete machine code. (I think machine code is identical to binary opcodes a cpu manual lists but this might be wrong).
The corresponding machine code files (often called object files I think) contain nearly all the instructions needed to run or reference other machine code files which will be linked in the next step.
This machine code usually has some format (I think ELF is a popular format for linux) and this format is dependent on the linker for sure.
I don't think the kernel, operating system, or hardware affect the layout/format of the object file but this is probably wrong. If they do please correct this.
The hardware will affect the actual machine code produced because again I think it is a 1 to 1 mapping of machine code instructions to opcodes for a cpu.
I am unsure if the kernel or operating system affect the linking process because I thought their changes were already incorporated in the compiling step.
Finally the linking step occurs.
I think this is as simple as the linker looking for all the referenced machine code and injecting it into one complete machine code file which can be executed.
I have no clue what affects this besides the linker tool itself.
So with all that, I need help identifying inaccuracies with the procedure I described above, and any dependencies I might have missed whether it be cpu, os, kernel, or tool ones.
Thank you and sorry for the long winded question. This probably should have been broken up into multiple questions but I am too far in. If this does not go well I may ask each part in individual questions.
EDIT:
Questions with more focus.
What components of a machine affect the machine code produced given a C++ file input?

Actually that is a lot of questions and usually you're question would be much too broad for SO (as you managed to recognize by yourself). But on the other hand you showed a deep interest (just by writing such a long and profound question) and also a lot of correct understanding of the process of compiling a program. The things you are missing or not understanding correctly (and you are probably the most interested in) are those things, that I myself found hard to learn. Thus I will provide you with some important points, that I think you are missing in the big picture.
Note that I am very much used to Linux, so I will mostly describe how things work on Linux. But I believe that most things also happen in a similar way on other operating systems.
Let's begin with the hardware. A modern computer has a CPU of some architecture. There are lots of different of CPU architectures. You mentioned some of them like arm, x86, etc. which are families of similar CPUs and can be divided into smaller groups by bit width and/or supported extensions. Ultimately your processor has a specified instruction set that defines which opcodes it supports and what those opcodes do. If a native (compiled) program runs, there are raw opcodes in the memory and the CPU directly executes them following its architecture specification.
Aside from the CPU there is a lot more hardware connected to your computer. Usually communicating with this hardware is complicated and not standardized. If a user program for example gets input keystrokes from the keyboard, in does not have to directly communicate with the keyboard, but rather does this via the operating system kernel. This works by a mechanism called syscall interrupt. The kernel installs an handler routine, that is called if a user program triggers such an interrupt with a special CPU instruction. You can think of it like a language agnostic function call from the program into the kernel. For example for Linux you can find a list of all syscalls at the syscall(2) man page. The syscalls form the kernel's Application Binary Interface (kernel ABI). Reading and writing from a terminal or using a filesystem are examples for syscall functionality.
As you can see, there are already very high level functions, that are implemented in the kernel. However the functionality is still quite limited for most typical applications. To encapsulate the syscalls and provide functions for memory management, utility functions, mathematical functions and many other things you probably use in your daily programs, there is usually another layer between the program and the kernel. This thing is called the C standard library, and it is a shared library (we will cover what exactly this is in a moment). On GNU/Linux it is the glibc which is the single most important library on a GNU/Linux system (and notably not part of the kernel 1). While it implements all the features that are required by the C standard (for example functions like malloc() or strcpy()), it also ships a lot of additional functions which are a superset of the ISO C standard library, the POSIX standard and some extensions. This interface is usually called the Application Programming Interface (API) of the operating system. While it is in principle possible to bypass the API and directly use the syscalls, almost all programs (even when written in other languages than C or C++) use the C library.
Now get yourself a coffee and a few minutes of rest. We now have enough background information to look at how a C++ program is transformed into a binary, and how exactly this binary is executed.
A C++ program consists of different compilation units (usually each different source file is a compilation unit). Each compilation unit undergoes the following steps
The preprocessor is run on the file. It includes header, expands macros and does some other stuff. As you wrote in your question this is rather platform independent. The preprocessor actions are standardized in the C++ standard.
The resulting code is compiled. That means C++ code is translated into assembly code. Because assembly code directly reflects the CPU instructions, this step is dependent on the target CPU architecture, that the compiler was configured for (usually the host CPU). The compiler is allowed to optimize and translate the program in any way it wants, as long as it follows the as-if rule. Thus this step is also higly dependent on the compiler you are using.
Note: Symbols (especially functions) that are not defined, are left undefined. If you say call the malloc() function, this will not be compiled, but left unevaluated until later. Thus this step is also not much dependent on the operating system.
Assembling takes place. This is very straightforward. The assembly code usually can be converted directly into binary CPU instructions. Local symbols (such as goto labels etc.) are resolved and replaced by their corresponding addresses. Unknown external symbols such as the mentioned malloc() call still are left unevaluated and are stored in the object file's symbol table. Because most of the syscalls are wrapped in library functions, the assembly code will usually not directly contain syscall code. Thus this step is depended on the CPU architecture. It is however dependent on the ABI2, which in term is dependent on the compiler and the OS.
Linking takes place. The different compilation units are combined into a single executable binary in an OS-dependent format (e.g. GNU/Linux uses ELF). Here yet more symbols are resolved. For example if one compilation calls a function in another compilation unit, this call is resolved and the symbol is replaced by the function address. If you link to a library statically, this is just treated like another compilation unit and included into the executable with its symbols resolved.
Shared libraries are checked for the needed symbols, but not linked yet. For example in case of the malloc() call, the linker checks, that there is a malloc symbol in the glibc, but the symbol in the executable still remains unresolved.
At this point you have a executable binary. As you might noticed, there might still be unresolved symbols in that binary. Thus you cannot just load that binary into RAM and let the CPU execute it. A final step called dynamic linking is needed. On Linux the program that performs this step is called the dynamic linker/loader. Its task is to load the executable ELF file into memory, look up all the needed dynamic libraries, load them into memory as well (a list is stored in the ELF file) and resolve the remaining symbols. This last step happens each time the program is executed. Now finally the malloc() symbol is resolved with the address in the glibc shared library.
You have pure CPU instructions in memory, the CPU's program counter register (the one that tracks the next instruction) is set to the entry point, and the program can begin to run. Every now and then it is interrupted either because it makes a syscall, or because it is interrupted by the kernel scheduler to let another program run on that CPU core.
I hope I could answer some of your questions and satisfy your curiosity. I think the most important part you were missing, was how dynamic linking happens. This is a very interesting topic which is related to concepts like position independent code. I wish you could luck learning.
1 this is also one reason why some people insist on calling Linux based systems GNU/Linux. The glibc library (together with many other GNU programs) defines much of the operating system structure, interacts with supplementary programs and configuration files etc. There are however Linux based systems without glibc. One of them is Android, using Googles bionic libc.
2 The ABI is related to the calling convention. This is a mixture of operating system, programming language and compiler specification. It is one of the reasons (besides name mangling, see the comment of PeterCordes below) you need those extern "C" {...} scopes in C++ header files, that declare C functions in shared libraries. It basically is a convention on how to pass parameters and return values between functions.

Neither operating system nor kernel are directly involved in any of this.
Their limited involvement is in that if you want to build Linux 64 bit binaries for x86 using gnu tools then you need to in some way (download and install or build yourself) build the gnu tools themselves for that target processor and that operating system. As system calls are specific to the operating system and target, and also the binaries supported by that operating system. Not strictly just the elf file format, that is just a container, but the linking and possibly bootstrap is also specific to the operating systems loader. (or if building something for the kernel that would have other rules). For example, does the application loader initialize .bss and .data for you from specific information in the .elf file, or like on an mcu does the bootstrap code itself have to do this?
The builder for gnu tools for a target like linux and ideally a pre-built binary for your os and target, would have paths setup in some way. The c library would have a default linker script and its intimate partner the bootstrap.
After that point, it is just a dumb toolchain. Include files be they at the system level, compiler level, or programmer level are just includes in the C language. The default paths and gcc knows where it was executed from so it knows where in a normal build the gcc and other libraries live.
gcc itself is not a compiler actually it calls other programs like the preprocessor, the compiler itself, the assembler and linker.
The preprocessor is going to do the search and replace for includes and defines and end up with one great big cpp file, then pass that to the compiler.
The compiler front end (C++ language for gcc for example) turns that into an internal language, allocate an int with this name, and another add the two and blah. A pseudo code if you will. This gets a lot of the optimization work done on it then eventually the back end (which for gnu could be x86, mips, arm, etc independent to some extent of the front and middle). The LLVM tools, are at least capable of exposing that middle, internal, language to external files (external to the memory used by the compiler to do the compilation) and you can combine and optimize those bytecode files and then convert them to assembly or direct to object in the llvm world. I think this is an exception not a rule, others just use internal tables.
While I think it is wise and sane to use an assembly language step. Not all compilers do and do not assume that all compilers do. Some output objects.
Yes that assembly is naturally partial, external functions (labels) and variables (labels) cannot be resolved at the object level. The linker has to do that.
So the target (x86, arm, etc) does affect the construction of the elf file as
there are certain items, magic numbers specific to the target. As mentioned the operating system and or kernel do affect the elf in that there are rules for construction of the binary for that kernel or operating system. Remember that elf is just a container like tar or zip or mkv etc. Do not assume that the operating system can handle every possible choice you want to make with the contents that the linker will allow (the tools are dumb, do what they are told).
So your source.
All the relevant sources that go with it including system includes, compiler includes and your includes.
gcc/g++ is a wrapper program that manages the steps.
calls the pre-processor expands includes and defines into one file (no magic here)
call the compiler to parse that one file into internal tables, think pseudo code and data
many, many possible optimizers that operate on these structures
backend, including peephole optimizer, turns the tables into assembly language (for gnu at least)
assembler is called to turn the asm into an object
If all the objects are specified and gcc is told to link, then...
Linker combines all the objects for the binary, including the bootstrap, including already built libraries, stubs, etc, and command line or more likely a linker script (linker script and bootstrap have an intimate relationship they are not assumed to be separable and not part of the compiler they are part of a C library, etc).
Kernel module loader or operating system application loader fed the file and per the rules of that loader loads and runs the program.

Choose a C binary according to the enviroment

I have compiled my code with specific flags (-Os, -O2, -march=native and their combinations) in order to produce a faster execution time.
But my problem is that I don't run always in the same machine (because in my lab there are several different machines). Sometimes I run within a MacOS, or within a Linux (in both cases with different OS versions).
I wonder if there is a way to determine which binary will be run depending on the environment where the binary will run (I mean cache size, cpu cores, and other properties about the specific machine)?. In other words, how to choose (when the program loads) the faster binary (previously compiled with different target binary sizes and instruction-set extensions) according to the specific machine used?
Thanks in advance.

What you're talking about is called a fat binary (not FAT, the acronym). From Wikipedia1:
A fat binary (or multiarchitecture binary) is a computer executable program which has been expanded (or "fattened") with code native to multiple instruction sets which can consequently be run on multiple processor types. This results in a file larger than a normal one-architecture binary file, thus the name.
At quick glance, there doesn't seem to be much support for it (see this question from the Programmer StackExchange for more information). Apple implemented this briefly when transitioning from PowerPC to Intel, but it doesn't seem to have been explored much since then.
Technically, fat binaries refer to a single binary that could run on multiple architectures...but I imagine the premise would hold for a single binary that runs on multiple OSes. And it comes back to the point Bizkit made in his/her/zir answer - generally, you compile your source code for the environment that you're in ahead of time.

You may prebuilt a bunch of executables and choose one according to environment variable or things like uname. A Better approach to the problem is choose a toolchain that is able to perform JIT, install-time optimization and/or runtime optimization, like llvm.

Is there a reason you can't just recompile your source code on each machine? Compilers are already written and optimized exactly for this kind of stuff. Simply recompile your source code on that machine architecture and you'll have a binary that runs just fine on that machine.

If you want your code tuned for the cache size of the machine you run on, check out the way Automatically Tuned Linear Algebra Software (ATLAS) does it. When you compile it, it runs some tests to find what size to use for cache-blocking its loops, and puts that in a header file.

Will statically linked c++ binary work on every system with same architecture?

I'm making a very simple program with c++ for linux usage, and I'd like to know if it is possible to make just one big binary containing all the dependencies that would work on any linux system.
If my understanding is correct, any compiler turns source code into machine instructions, but since there are often common parts of code that can be reused with different programs, most programs depend on another libraries.
However if I have the source code for all my dependencies, I should be able to compile a binary in a way that would not require anything from the system? Will I be able to run stuff compiled on 64bit system on a 32bit system?

In short: Maybe.
The longer answer is:
It depends. You can't, for example, run a 64-bit binary on a 32-bit system, that's just not even nearly possible. Yes, it's the same processor family, but there are twice as many registers in the 64-bit system, which also has twice as long registers. What's the 32-bit processor going to "give back" for the value of those bits and registers that doesn't exist in the hardware in the processor? It just plain won't work. Some of the instructions also completely change meaning, so the system really needs to be "right" for the compiled code, or it won't work - fortunately, Linux will check this and plain refuse if it's not right.
You can BUILD a 32-bit binary on a 64-bit system (assuming you have all the right libraries, etc, installed for both 64- and 32-bit, etc).
Similarly, if you try to run ARM code on an x86 processor, or MIPS code on an ARM processor, it simply has no chance of working, because the actual instructions are completely different (or they would be in breach of some patent/copyright or similar, because processor instruction sets contain portions that are "protected intellectual property" in nearly all cases - so designers have to make sure they do NOT do "the same as someone else's design"). Like for 32-bit and 64-bit, you simply won't get a chance to run the wrong binary here, it just won't work.
Sometimes, there are subtle differences, for example ARM code can be compiled with "hard" or "soft" floating point. If the code is compiled for hard float, and there isn't the right support in the OS, then it won't run the binary. Worse yet, if you compile on x86 for SSE instructions, and try to run on a non-SSE processor, the code will simply crash [unless you specifically build code to "check for SSE, and display error if not present"].
So, if you have a binary that passes the above criteria, the Linux system tends to change a tiny bit between releases, and different distributions have subtle "fixes" that change things. Most of the time, these are completely benign (they fix some obscure corner-case that someone found during testing, but the general, non-corner case behaviour is "normal"). However, if you go from Linux version 2.2 to Linux version 3.15, there will be some substantial differences between the two versions, and the binary from the old one may very well be incompatible with the newer (and almost certainly the other way around) - it's hard to know exactly which versions are and aren't compatible. Within releases that are close, then it should work OK as long as you are not specifically relying on some feature that is present in only one (after all, new things ARE added to the Linux kernel from time to time). Here the answer is "maybe".
Note that in the above is also your implementation of the C and C++ runtime, so if you have a "new" C or C++ runtime library that uses Linux kernel feature X, and try to run it on an older kernel, before feature X was implemented (or working correctly for the case the C or C++ runtime is trying to use it).
Static linking is indeed a good way to REDUCE the dependency of different releases. And a good way to make your binary huge, which may be preventing people from downloading it.
Making the code open source is a much better way to solve this problem, then you just distribute your source code and a list of "minimum requirements", and let other people deal with it needing to be recompiled.

In practice, it depends on "sufficiently simple". If you're using C++11, you'll quickly find that the C++11 libraries have dependencies on modern libc releases. In turn, those only ship with modern Linux distributions. I'm not aware of any "Long Term Support" Linux distribution which today (June 2014) ships with libc support for GCC 4.8

The short answer is no, at least without serious hack.
Different linux distribution may have different glue code between user-space and kernel. For instant, an hello world seemingly without dependency built from ubuntu cannot be executed under CentOS.
EDIT: Thanks for the comment. I re-verify this and the cause is im using 32-bit VM. Sorry for causing confusion. However, as noted above, the rule of thumb is that even same linux distribution may sometime breaks compatibility in order to deploy bugfix, so the conclusion stands.

Cross Platform Floating Point Consistency

I'm developing a cross-platform game which plays over a network using a lockstep model. As a brief overview, this means that only inputs are communicated, and all game logic is simulated on each client's computer. Therefore, consistency and determinism is very important.
I'm compiling the Windows version on MinGW32, which uses GCC 4.8.1, and on Linux I'm compiling using GCC 4.8.2.
What struck me recently was that, when my Linux version connected to my Windows version, the program would diverge, or de-sync, instantly, even though the same code was compiled on both machines! Turns out the problem was that the Linux build was being compiled via 64 bit, whereas the Windows version was 32 bit.
After compiling a Linux 32 bit version, I was thankfully relieved that the problem was resolved. However, it got me thinking and researching on floating point determinism.
This is what I've gathered:
A program will be generally consistent if it's:
ran on the same architecture
compiled using the same compiler
So if I assume, targeting a PC market, that everyone has a x86 processor, then that solves requirement one. However, the second requirement seems a little silly.
MinGW, GCC, and Clang (Windows, Linux, Mac, respectively) are all different compilers based/compatible with/on GCC. Does this mean it's impossible to achieve cross-platform determinism? or is it only applicable to Visual C++ vs GCC?
As well, do the optimization flags -O1 or -O2 affect this determinism? Would it be safer to leave them off?
In the end, I have three questions to ask:
1) Is cross-platform determinism possible when using MinGW, GCC, and Clang for compilers?
2) What flags should be set across these compilers to ensure the most consistency between operating systems / CPUs?
3) Floating point accuracy isn't that important for me -- what's important is that they are consistent. Is there any method to reducing floating point numbers to a lower precision (like 3-4 decimal places) to ensure that the little rounding errors across systems become non-existent? (Every implementation I've tried to write so far has failed)
Edit: I've done some cross-platform experiments.
Using floatation points for velocity and position, I kept a Linux Intel Laptop and a Windows AMD Desktop computer in sync for up to 15 decimal places of the float values. Both systems are, however, x86_64. The test was simple though -- it was just moving entities around over a network, trying to determine any visible error.
Would it make sense to assume that the same results would hold if a x86 computer were to connect to a x86_64 computer? (32 bit vs 64 bit Operating System)

Cross-platform and cross-compiler consistency is of course possible. Anything is possible given enough knowledge and time! But it might be very hard, or very time-consuming, or indeed impractical.
Here are the problems I can foresee, in no particular order:
Remember that even an extremely small error of plus-or-minus 1/10^15 can blow up to become significant (you multiply that number with that error margin with one billion, and now you have a plus-or-minus 0.000001 error which might be significant.) These errors can accumulate over time, over many frames, until you have a desynchronized simulation. Or they can manifest when you compare values (even naively using "epsilons" in floating-point comparisons might not help; only displace or delay the manifestation.)
The above problem is not unique to distributed deterministic simulations (like yours.) The touch on the issue of "numerical stability", which is a difficult and often neglected subject.
Different compiler optimization switches, and different floating-point behavior determination switches might lead to the compiler generate slightly different sequences of CPU instructions for the same statements. Obviously these must be the same across compilations, using the same exact compilers, or the generated code must be rigorously compared and verified.
32-bit and 64-bit programs (note: I'm saying programs and not CPUs) will probably exhibit slightly different floating-point behaviors. By default, 32-bit programs cannot rely on anything more advanced than x87 instruction set from the CPU (no SSE, SSE2, AVX, etc.) unless you specify this on the compiler command line (or use the intrinsics/inline assembly instructions in your code.) On the other hand, a 64-bit program is guaranteed to run on a CPU with SSE2 support, so the compiler will use those instructions by default (again, unless overridden by the user.) While x87 and SSE2 float datatypes and operations on them are similar, they are - AFAIK - not identical. Which will lead to inconsistencies in the simulation if one program uses one instruction set and another program uses another.
The x87 instruction set includes a "control word" register, which contain flags that control some aspects of floating-point operations (e.g. exact rounding behavior, etc.) This is a runtime thing, and your program can do one set of calculations, then change this register, and after that do the exact same calculations and get a different result. Obviously, this register must be checked and handled and kept identical on the different machines. It is possible for the compiler (or the libraries you use in your program) to generate code that changes these flags at runtime inconsistently across the programs.
Again, in case of the x87 instruction set, Intel and AMD have historically implemented things a little differently. For example, one vendor's CPU might internally do some calculations using more bits (and therefore arrive at a more accurate result) that the other, which means that if you happen to run on two different CPUs (both x86) from two different vendors, the results of simple calculations might not be the same. I don't know how and under what circumstances these higher accuracy calculations are enabled and whether they happen under normal operating conditions or you have to ask for them specifically, but I do know these discrepancies exist.
Random numbers and generating them consistently and deterministically across programs has nothing to do with floating-point consistency. It's important and source of many bugs, but in the end it's just a few more bits of state that you have to keep synched.
And here are a couple of techniques that might help:
Some projects use "fixed-point" numbers and fixed-point arithmetic to avoid rounding errors and general unpredictability of floating-point numbers. Read the Wikipedia article for more information and external links.
In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging, where instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyways) I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
This feature was implemented in that codebase from the beginning, and was used only during the development process to help with debugging (because it had performance and memory costs.)
Update (in answer to first comment below): As I said in point 1, and others have said in other answers, that doesn't guarantee anything. If you do that, you might decrease the probability and frequency of an inconsistency occurring, but the likelihood doesn't become zero. If you don't analyze what's happening in your code and the possible sources of problems carefully and systematically, it is still possible to run into errors no matter how much you "round off" your numbers.
For example, if you have two numbers (e.g. as results of two calculations that were supposed to produce identical results) that are 1.111499999 and 1.111500001 and you round them to three decimal places, they become 1.111 and 1.112 respectively. The original numbers' difference was only 2E-9, but it has now become 1E-3. In fact, you have increased your error 500'000 times. And still they are not equal even with the rounding. You've exacerbated the problem.
True, this doesn't happen much, and the examples I gave are two unlucky numbers to get in this situation, but it is still possible to find yourself with these kinds of numbers. And when you do, you're in trouble. The only sure-fire solution, even if you use fixed-point arithmetic or whatever, is to do rigorous and systematic mathematical analysis of all your possible problem areas and prove that they will remain consistent across programs.
Short of that, for us mere mortals, you need to have a water-tight way to monitor the situation and find exactly when and how the slightest discrepancies occur, to be able to solve the problem after the fact (instead of relying on your eyes to see problems in game animation or object movement or physical behavior.)

No, not in practice. For example, sin() might come from a library or from a compiler intrinsic, and differ in rounding. Sure, that's only one bit, but that's already out of sync. And that one bit error may add up over time, so even an imprecise comparison may not be sufficient.
N/A
You can't reduce FP precision for a given type, and I don't even see how it would help you. You'd turn the occasional 1E-6 difference into an occasional 1E-4 difference.

Next to your concerns on determinism, I have another remark: if you are worried about calculation consistency on a distributed system, you may have a design issue.
You could think about your application as a bunch of nodes, each responsible for their own calculations. If information about another node is needed, it should sent to you by that node.

1.)
In principle cross platform, OS, hardware compatibility is possible but in practice it's a pain.
In general your results will depend on which OS you use, which compiler, and which hardware you use. Change any one of those and your results might change. You have to test all changes.
I use Qt Creator and qmake (cmake is probably better but qmake works for me) and test my code in MSVC on Windows, GCC on Linux, and MinGW-w64 on Windows. I test both 32-bit and 64-bit. This has to be done whenever code changes.
2.) and 3.)
In terms of floating point some compilers will use x87 instead of SSE in 32-bit mode. See this as an example of the consequences of when that happens Why a number crunching program starts running much slower when diverges into NaNs? All 64-bit systems have SSE so I think most use SSE/AVX in 64-bit otherwise, e.g. in 32 bit mode, you might need to force SSE with something like -mfpmath=sse and -msse2.
But if you want a more compatible version of GCC on windows then I would used MingGW-w64 for 32-bit (aka MinGW-w32) or MinGW-w64 in 64bit . This is not the same thing as MinGW (aka mingw32). The projects have diverged. MinGW depends on MSVCRT (the MSVC C runtime library) and MinGW-w64 does not. The Qt project has a pretty good description of MinGW-w64 and installiation. http://qt-project.org/wiki/MinGW-64-bit
You might also want to consider writing a CPU dispatcher cpu dispatcher for visual studio for AVX and SSE.

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to chose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!

Yes it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that that loads and runs
the correct library at run-time, via the entry point, depending on config file or other information.

Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location there is no need to worry about WoW64 folder redirection
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means you 'main' executable is free to be totally separate depending on the target operating system (as detecting the cpu/OS capabilities is, by it's nature, very OS specific) and then do most of the rest of your code as shared objects/dlls.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper threading (also whether the OS knows how the schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize you choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on intel hardware but if you are targetting certain ARM cpu's some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
be careful with this - it is hard to beat decent optimizing compilers on this

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can liboil itself and not just its techniques.

Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
int main(int argc, char* argv[])
{
char* target_path[MAXPATH];
char* new_argv[];
char* specific_version = determine_name_of_specific_version();
strcpy(target_path, "/usr/lib/myapp/versions");
strcat(target_path, specific_version);
/* append NULL to argv */
new_argv = malloc(sizeof(char*)*(argc+1));
memcpy(new_argv, argv, argc*sizeof(char*));
new_argv[argc] = 0;
/* optionally set new_argv[0] to target_path */
execv(target_path, new_argv);
}
On the plus side, this approach allows to provide the user transparently with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at Gnu Atuotools (Automake, AutoConf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities need to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exists, you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you have to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the approopriate SSE functionality. Instead, they checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64 bits difference will require two executables. Both the ELF and PE format store this information in the exectuables header. It's not too hard to start the 32 bits version by default, check if you are on a 64 bit system, and then restart the 64 bit version. But it may be easier to create an appropriate symlink at installation time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js