What affects generated machine code at each step of the compilation process?

What affects generated machine code at each step of the compilation process? - c++

I am almost certain this question has been asked before, but I can not seem to find the right keywords to search for to get an answer. My apologies if this is a duplicate.
I am better trying to understand the compilation process of say a C++ file as it goes from the C++ syntax to the binary machine code. In addition I am trying to understand what influences the resulting machine code.
First, I am nearly certain that the following are the only factors (for most systems) that dictate the final machine code (please correct me if I am wrong here)
The tools used to compile, assemble, and link.
Things like gnu c compiler, clang, visual studio, nasm, ect.
The kernel of the system being used.
Whether its a specific version of the linux kernel, windows microkernel, or some other kernel like a mac os x one.
The operating system being used.
This one I am less clear about. I am unsure if machines running the same linux kernel, but different os, in this case let's say debian vs centos, will they produce different binaries.
Lastly the hardware architecture.
Different cpu architectures like arm 64, x86, power pc, ect. take different op codes so obviously the machine code should be different.
So with that being said here is my understanding of the compilation process and where each of these dependencies show up.
I write a C++ file and use code that my system can understand. A good example might be using <winsock.h> on windows and <sys/socket.h> on linux.
The preprocessor runs and executes any preprocessor macros.
Here I know that different preprocessors will define different macros but for now I will assume this is not too machine dependent. (This might be wrong to assume).
The compiler tools run to produce assembly file outputs.
Here the assembly produced depends on the compiler and what optimizations or choices it makes.
It also depends on the kernel because different kernels have different system calls and store files in different locations. This means the assembly might make changes such as different branching when calling functions specific to that kernel.
The operating system? Still unsure how the operating system fits in to this. If two machines have the same kernel, what does the operating system do to the binaries?
Finally the assembly code depends on the cpu architecture. I think that is a pretty obvious statement.
Once the compiler produces an assembly. We can then invoke the assembler to turn our assembly code into almost complete machine code. (I think machine code is identical to binary opcodes a cpu manual lists but this might be wrong).
The corresponding machine code files (often called object files I think) contain nearly all the instructions needed to run or reference other machine code files which will be linked in the next step.
This machine code usually has some format (I think ELF is a popular format for linux) and this format is dependent on the linker for sure.
I don't think the kernel, operating system, or hardware affect the layout/format of the object file but this is probably wrong. If they do please correct this.
The hardware will affect the actual machine code produced because again I think it is a 1 to 1 mapping of machine code instructions to opcodes for a cpu.
I am unsure if the kernel or operating system affect the linking process because I thought their changes were already incorporated in the compiling step.
Finally the linking step occurs.
I think this is as simple as the linker looking for all the referenced machine code and injecting it into one complete machine code file which can be executed.
I have no clue what affects this besides the linker tool itself.
So with all that, I need help identifying inaccuracies with the procedure I described above, and any dependencies I might have missed whether it be cpu, os, kernel, or tool ones.
Thank you and sorry for the long winded question. This probably should have been broken up into multiple questions but I am too far in. If this does not go well I may ask each part in individual questions.
EDIT:
Questions with more focus.
What components of a machine affect the machine code produced given a C++ file input?

Actually that is a lot of questions and usually you're question would be much too broad for SO (as you managed to recognize by yourself). But on the other hand you showed a deep interest (just by writing such a long and profound question) and also a lot of correct understanding of the process of compiling a program. The things you are missing or not understanding correctly (and you are probably the most interested in) are those things, that I myself found hard to learn. Thus I will provide you with some important points, that I think you are missing in the big picture.
Note that I am very much used to Linux, so I will mostly describe how things work on Linux. But I believe that most things also happen in a similar way on other operating systems.
Let's begin with the hardware. A modern computer has a CPU of some architecture. There are lots of different of CPU architectures. You mentioned some of them like arm, x86, etc. which are families of similar CPUs and can be divided into smaller groups by bit width and/or supported extensions. Ultimately your processor has a specified instruction set that defines which opcodes it supports and what those opcodes do. If a native (compiled) program runs, there are raw opcodes in the memory and the CPU directly executes them following its architecture specification.
Aside from the CPU there is a lot more hardware connected to your computer. Usually communicating with this hardware is complicated and not standardized. If a user program for example gets input keystrokes from the keyboard, in does not have to directly communicate with the keyboard, but rather does this via the operating system kernel. This works by a mechanism called syscall interrupt. The kernel installs an handler routine, that is called if a user program triggers such an interrupt with a special CPU instruction. You can think of it like a language agnostic function call from the program into the kernel. For example for Linux you can find a list of all syscalls at the syscall(2) man page. The syscalls form the kernel's Application Binary Interface (kernel ABI). Reading and writing from a terminal or using a filesystem are examples for syscall functionality.
As you can see, there are already very high level functions, that are implemented in the kernel. However the functionality is still quite limited for most typical applications. To encapsulate the syscalls and provide functions for memory management, utility functions, mathematical functions and many other things you probably use in your daily programs, there is usually another layer between the program and the kernel. This thing is called the C standard library, and it is a shared library (we will cover what exactly this is in a moment). On GNU/Linux it is the glibc which is the single most important library on a GNU/Linux system (and notably not part of the kernel 1). While it implements all the features that are required by the C standard (for example functions like malloc() or strcpy()), it also ships a lot of additional functions which are a superset of the ISO C standard library, the POSIX standard and some extensions. This interface is usually called the Application Programming Interface (API) of the operating system. While it is in principle possible to bypass the API and directly use the syscalls, almost all programs (even when written in other languages than C or C++) use the C library.
Now get yourself a coffee and a few minutes of rest. We now have enough background information to look at how a C++ program is transformed into a binary, and how exactly this binary is executed.
A C++ program consists of different compilation units (usually each different source file is a compilation unit). Each compilation unit undergoes the following steps
The preprocessor is run on the file. It includes header, expands macros and does some other stuff. As you wrote in your question this is rather platform independent. The preprocessor actions are standardized in the C++ standard.
The resulting code is compiled. That means C++ code is translated into assembly code. Because assembly code directly reflects the CPU instructions, this step is dependent on the target CPU architecture, that the compiler was configured for (usually the host CPU). The compiler is allowed to optimize and translate the program in any way it wants, as long as it follows the as-if rule. Thus this step is also higly dependent on the compiler you are using.
Note: Symbols (especially functions) that are not defined, are left undefined. If you say call the malloc() function, this will not be compiled, but left unevaluated until later. Thus this step is also not much dependent on the operating system.
Assembling takes place. This is very straightforward. The assembly code usually can be converted directly into binary CPU instructions. Local symbols (such as goto labels etc.) are resolved and replaced by their corresponding addresses. Unknown external symbols such as the mentioned malloc() call still are left unevaluated and are stored in the object file's symbol table. Because most of the syscalls are wrapped in library functions, the assembly code will usually not directly contain syscall code. Thus this step is depended on the CPU architecture. It is however dependent on the ABI2, which in term is dependent on the compiler and the OS.
Linking takes place. The different compilation units are combined into a single executable binary in an OS-dependent format (e.g. GNU/Linux uses ELF). Here yet more symbols are resolved. For example if one compilation calls a function in another compilation unit, this call is resolved and the symbol is replaced by the function address. If you link to a library statically, this is just treated like another compilation unit and included into the executable with its symbols resolved.
Shared libraries are checked for the needed symbols, but not linked yet. For example in case of the malloc() call, the linker checks, that there is a malloc symbol in the glibc, but the symbol in the executable still remains unresolved.
At this point you have a executable binary. As you might noticed, there might still be unresolved symbols in that binary. Thus you cannot just load that binary into RAM and let the CPU execute it. A final step called dynamic linking is needed. On Linux the program that performs this step is called the dynamic linker/loader. Its task is to load the executable ELF file into memory, look up all the needed dynamic libraries, load them into memory as well (a list is stored in the ELF file) and resolve the remaining symbols. This last step happens each time the program is executed. Now finally the malloc() symbol is resolved with the address in the glibc shared library.
You have pure CPU instructions in memory, the CPU's program counter register (the one that tracks the next instruction) is set to the entry point, and the program can begin to run. Every now and then it is interrupted either because it makes a syscall, or because it is interrupted by the kernel scheduler to let another program run on that CPU core.
I hope I could answer some of your questions and satisfy your curiosity. I think the most important part you were missing, was how dynamic linking happens. This is a very interesting topic which is related to concepts like position independent code. I wish you could luck learning.
1 this is also one reason why some people insist on calling Linux based systems GNU/Linux. The glibc library (together with many other GNU programs) defines much of the operating system structure, interacts with supplementary programs and configuration files etc. There are however Linux based systems without glibc. One of them is Android, using Googles bionic libc.
2 The ABI is related to the calling convention. This is a mixture of operating system, programming language and compiler specification. It is one of the reasons (besides name mangling, see the comment of PeterCordes below) you need those extern "C" {...} scopes in C++ header files, that declare C functions in shared libraries. It basically is a convention on how to pass parameters and return values between functions.

Neither operating system nor kernel are directly involved in any of this.
Their limited involvement is in that if you want to build Linux 64 bit binaries for x86 using gnu tools then you need to in some way (download and install or build yourself) build the gnu tools themselves for that target processor and that operating system. As system calls are specific to the operating system and target, and also the binaries supported by that operating system. Not strictly just the elf file format, that is just a container, but the linking and possibly bootstrap is also specific to the operating systems loader. (or if building something for the kernel that would have other rules). For example, does the application loader initialize .bss and .data for you from specific information in the .elf file, or like on an mcu does the bootstrap code itself have to do this?
The builder for gnu tools for a target like linux and ideally a pre-built binary for your os and target, would have paths setup in some way. The c library would have a default linker script and its intimate partner the bootstrap.
After that point, it is just a dumb toolchain. Include files be they at the system level, compiler level, or programmer level are just includes in the C language. The default paths and gcc knows where it was executed from so it knows where in a normal build the gcc and other libraries live.
gcc itself is not a compiler actually it calls other programs like the preprocessor, the compiler itself, the assembler and linker.
The preprocessor is going to do the search and replace for includes and defines and end up with one great big cpp file, then pass that to the compiler.
The compiler front end (C++ language for gcc for example) turns that into an internal language, allocate an int with this name, and another add the two and blah. A pseudo code if you will. This gets a lot of the optimization work done on it then eventually the back end (which for gnu could be x86, mips, arm, etc independent to some extent of the front and middle). The LLVM tools, are at least capable of exposing that middle, internal, language to external files (external to the memory used by the compiler to do the compilation) and you can combine and optimize those bytecode files and then convert them to assembly or direct to object in the llvm world. I think this is an exception not a rule, others just use internal tables.
While I think it is wise and sane to use an assembly language step. Not all compilers do and do not assume that all compilers do. Some output objects.
Yes that assembly is naturally partial, external functions (labels) and variables (labels) cannot be resolved at the object level. The linker has to do that.
So the target (x86, arm, etc) does affect the construction of the elf file as
there are certain items, magic numbers specific to the target. As mentioned the operating system and or kernel do affect the elf in that there are rules for construction of the binary for that kernel or operating system. Remember that elf is just a container like tar or zip or mkv etc. Do not assume that the operating system can handle every possible choice you want to make with the contents that the linker will allow (the tools are dumb, do what they are told).
So your source.
All the relevant sources that go with it including system includes, compiler includes and your includes.
gcc/g++ is a wrapper program that manages the steps.
calls the pre-processor expands includes and defines into one file (no magic here)
call the compiler to parse that one file into internal tables, think pseudo code and data
many, many possible optimizers that operate on these structures
backend, including peephole optimizer, turns the tables into assembly language (for gnu at least)
assembler is called to turn the asm into an object
If all the objects are specified and gcc is told to link, then...
Linker combines all the objects for the binary, including the bootstrap, including already built libraries, stubs, etc, and command line or more likely a linker script (linker script and bootstrap have an intimate relationship they are not assumed to be separable and not part of the compiler they are part of a C library, etc).
Kernel module loader or operating system application loader fed the file and per the rules of that loader loads and runs the program.

Related

How does the C++ compiler know which CPU architecture is being used

with reference to : http://www.cplusplus.com/articles/2v07M4Gy/
During the compilation phase,
This phase translates the program into a low level assembly level code. The compiler takes the preprocessed file ( without any directives) and generates an object file containing assembly level code. Now, the object file created is in the binary form. In the object file created, each line describes one low level machine level instruction.
Now, if I am correct then different CPU architectures works on different assembly languages/syntax.
My question is how does the compiler comes to know to which assembly language syntax the source code has to be changed? In other words, how does the C++ compiler know which CPU architecture is there in the machine it is working on ?
Is there any mapping used by assembler w.r.t the CPU architecture for generating assembly code for different CPU architectures?
N.S : I am beginner !!

Each compiler needs to be "ported" to the given system. For each system supported, a "compiler port" needs to be programmed by someone who knows the system in-depth.

WARNING : This is extremely simplified
In short, there are three main parts to a compiler :
"Front-end" : This part reads the language (in this case c++) and converts it to a sort of pseudo-code specific to the compiler. (An Abstract Syntactic Tree, or AST)
"Optimizer/Middle-end" : This part takes the AST and make a non-architecture-dependant optimized one.
"Back-end" : This part takes the AST, and converts it to binary executable code, specific to the architecture you want to compile your language on.
When you download a c++ compiler for your platform, you, in fact, download the c++ frontend with the linux-amd64 backend, for example.
This coding architecture is extremely helpful, because it allows to port the compiler for another architecture without rewriting the whole parsing/optimizing thing. It also allows someone to create another optimizer, or even another frontend supporting a whole different language, and, as long as it outputs a correct AST, it will be compatible with every single backend ever written for this compiler.

Simply put, the knowledge of the target system is coded into the compiler.
So you might have a C compiler that generates SPARC binaries, and a C compiler that generates VAX binaries. They both accept the same input language (as defined in the C standard), but produce different programs from it.
Often we just refer to "the compiler", meaning the one that will generate binaries for our current environment.
In modern times, the distinction has become less obvious with compiler collections such as GCC. Now the "different compilers" are often the same compiler program, just set up with different configurations (these are the "target description files").

Just to complete the answers given here:
The target architecture is indeed coded into the specific compiler instance you're using. This is important also for a process called "cross-compiling" - The process of compiling on a certain system an executable that would operate on another system/architecture.
Consider working on an embedded system-on-chip that uses a completely different instruction set than your own - You're working on an x86/64 Linux system, but need to compile a mobile app running on an ARM micro-processor, or some other type of assembly architecture.
It would be unreasonable to compile your code on the target system, which might be so limited in CPU and memory that it can't feasibly run a compiler - and so you can use a GCC (or any other compiler) port for that target architecture on your favorite system.
It's also quite critical to remember that the entire tool-chain is often compatible to the target system, for instance when shared libraries such as libc are getting in play - as the target OS could be a different release of Linux and would have different versions of common functions - In which case it's common to use tool-chains that contain all the necessary libraries and use something like chroot or mock to compile in the "target environment" from within your system.

How to call an assembly function from C++ dynamically?

REQUIREMENT: For a certain project we have unique requirement. The application supports an expression language that allows the user to define their own complex expressions that can be evaluated at run time (many hundred times a second) and they need to be executed at machine level for performance.
WORKING: Our expression parser translates the script into corresponding assembly language routine perfectly. We checked it by statically linking the object files generated with our C test program and they produce correct result.
Since the client can change the script anytime, our program (at run time) detects the change, calls the parser which generates the corresponding assembly routine. We then call the assembler from back end to create the object code.
PROBLEM
How can we call this assembly routine dynamically from the C++ program
(Loader)?
We are not supposed to call the C++ compiler to link it with the loader because the loader already would have other subroutines running and we cannot take the loader off, recompile and then execute the new loader program.
I tried searching for a solution online but every time the results are littered with .NET assembly dynamic calling. Our app has nothing to do with .NET.

First, the "generated plugin" approach (on Linux; my answer focuses on Linux but could be adapted to Windows with some effort; you could use many-platform frameworks like Qt or POCO or Glib from GTK; then all wrap plugin loading abilities à la dlopen with a common API that you could use on Windows, on Linux, on MacOSX, on Android) :
generate C (or assembly) code in some file /tmp/generated01.c (you might even generate C++ code using standard C++ containers, but its compilation would be significantly slower; beware of name mangling so emit and use extern "C" functions; read the C++ dlopen mini HowTo). See this answer explaining why generating C is worthwhile (and could be better, and more portable, than generating assembler code).
run (using fork+execve+waitpid, or simply system) a compilation of that generated file into a shared object /tmp/genenerated01.so by running gcc -fPIC -Wall -O /tmp/generated01.c -shared -o /tmp/generated01.so command; you practically need to get position-independent code, hence the -fPIC flag. If using dlopen on your generated assembler code you'll need to improve your assembler generator to emit PIC code.
dlopen that new /tmp/generated01.so (so use the dynamic linker), see dlopen(3); you could even remove the now useless generated C file /tmp/generated01.c
dlsym the relevant symbols to get function pointers to the generated code, see dlsym(3); your application would simply call the generated code using these function pointers.
when you are sure that you don't need any functions from it and that no call frame uses it, you could dlclose that shared object library (but you might accept to leak some address space by not calling dlclose at all)
The above approach is worthwhile and can be used a big lot of times (my manydl.c demonstrates that you could dlopen a million different shared objects), and is practically even compatible (even when emitting C code!) with an interactive Read-Eval-Print-Loop -on most current desktops and laptops and servers-, since most of the time the generated /tmp/generated01.c would be quite small (e.g. a few hundred lines at most) to be very quickly generated and compiled (by gcc, etc...). I am even using this in MELT for its REPL mode. On Linux this plugin approach generally requires to link the main application with -rdynamic (so that dlopen-ed plugins can reference and call functions from the main application).
Then, other approaches could be to use some Just-In-Time compilation library, like
GNU lightning (which emits slow machine code very quickly - so very short JIT emission time, but the generated code is running slowly since it is very unoptimized)
asmjit; it is x86-64 specific, and enables you to generate individual x86-64 machine instructions
GNU libjit is available for several platforms, and offer an "interpreter" mode for other platforms
LLVM (part of Clang/LLVM compiler, usable as a JIT library)
GCCJIT (a new JIT library front-end to GCC)
Grossly speaking, the first elements of that list are able to emit JIT machine code fairly quickly, but that code won't run as fast as compiling with gcc -fPIC -O1 or -O2 the equivalent generated C code (but would run typically 2x to 5x slower!); the last two elements (LLVM & GCCJIT) are compiler based: so they are able to optimize and emit efficient code, at the expense of slower JIT code emission. All the JIT libraries are able (like dlsym does for plugins) to give function pointers to newly JIT-constructed functions.
Notice that there is a trade-off to be made: some techniques are able to generate quickly some machine code, if you accept that generated code to later run a bit slowly; other techniques (notably GCCJIT or LLVM) are spending time to optimize the generated machine code, so takes more time to emit the machine code, but that code would later run quickly. You should not expect both (small generation time, quick execution time), since there is no such thing as a free lunch.
I believe that generating manually some assembler code is practically not worthwhile. You won't be able to generate very optimized code (because optimization is a very difficult art, and both GCC and Clang have millions of source line code for optimization passes), unless you spend many years of work for that. Using some JIT library is easier, and "compiling" to C or C++ is also quite easy (you leave the burden of optimization to the C compiler you are calling).
You could also consider rewriting your application into some language with homoiconicity and metaprogramming abilities (e.g. multi-stage programming), such as Common Lisp (and many others, e.g. those providing eval). Its SBCL implementation is always emitting machine code...
You could also embed an interpreter like Lua -perhaps even LuaJit- or Guile in your application. The main advantage of embedding an existing language is that there are resources (books, modules, ...) and community of people knowing them (designing a good language is difficult!). Also, the embedded interpreter library is well designed and probably well debugged (since used a lot), and some of them are fast enough (since using bytecode techniques).

As the comments already suggest, LoadLibrary (Windows) and dlopen (Linux/POSIX) are by far the easiest solution. These are specifically intended to dynamically load code. Equally important, they both allow unloading as well, and there are functions to then get a function entry point by name.

You can dynamically do it. I will take linux case as an example. Since your parser working fine and generates machine code, you should be able to generate .so (for linux) or .dll for windows.
Next, load the library as
handle = dlopen(so_file_name, RTLD_LAZY);
Next get function pointer
func = dlsym(handle, "function_name");
Then you should be able to execute it as func()
One thing you need to experiment (in case you do not get desired result) is close and open the so file or dll file (you need to do only if required, else it may reduce performance)

It sounds like you can generate the proper byte code. So you could just ensure that you generate position independent code, write it into an executable piece of memory, and then call or create thread upon the code. The simplest way would just be to cast the pointer to the base of the memory you wrote the code into as a function pointer, and then call it.
If you write your bytecode to avoid referencing different sections, and instead reference offsets from its loaded base, 'loading' the code is as easy as writing it to executable memory. You could do a call/pop/jmp to find the base of the code once it begins executing.
Conversely, and probably the easiest solution, would be to just write the code out as function expecting arguments, that way you could pass the code's base and any other arguments to it, as you would with any other function, as long as you use the proper typedef for your function pointer, and the generated assembly handles the arguments properly. As long as you avoid creating absolute jumps or data references to absolute addresses, you shouldn't have any issue.

too late but I think it would help someone else.
in case you want to dynamically execute a piece of code, you can create an interpreter for this.
compile your expressions into some byte code then write the interpreter for executing this.
here is a tutorial about writing interpreters, but in python.
https://ruslanspivak.com/lsbasi-part1/
you can write it using c/c++

Why can't object (.obj) files be moved across platforms?

Why can't we move .obj files from c compilation across OS platforms and use it to build the executable file at the end?
If we can do so can we call C a platform independent language like Java?

The C language is platform independent.
The files generated by the compiler, the object and executable files, are platform dependent. This is due to the fact the ultimate goal of a compiler is to generate an executable file for the target architecture only, not for every known architecture.
Java class files are platform independent because Sun was the only designer of Java, it actually made all the rules (from bytecode to file format and VM behavior) and everyone else had to adapt.
This didn't happen with native binary formats, every OS made its format, compiler made its object format and every CPU has its own ISA.

There is absolutely nothing in any specification that says this CAN'T be the case. (Note that the languages C and C++ are both platform independent, but the OBJECT files produced by C and C++ are what is NOT platform independent)
However, because C and C++ are both languages designed for performance, most compilers produce machine code for the target system. And you may then say "but my Linux machine runs on the same processor as my Windows machine", but of course, that's not the ONLY difference between object files or executable files on different OS architectures. And whilst it may be possible to convert object files containing machine code for the same processor from one format to another, it's fraught with problematic things like "what do with inlined system calls" (in other words, someone called gettimeofday via the std::chrono interface, and the compiler inlined this call, which is a call directly to the OS - well, Windows has no idea what gettimeofday is, it's called GetSystemTime or some such, and the method of calling the OS is completely different...)
If you want an OS independent system, then all object files must be "pure" - and of course, both systems need to support the same object file format (or support conversion of them).
One could make a C or C++ compiler that does what Java (and C#, etc) does, where the compiler doesn't produce machine-code for the target system, but produces a "intermediate form" - but that would be a little contrary to the ideas of C and C++, which is that the language is designed to be VERY efficient, and not have a lot of overhead. If portability is more important than performance, maybe you want to use Java? Or some other portable language...

Different platforms use different object file formats (ELF for Linux, COFF/PE for Windows), so an object file built on one platform may not be usable on another.
Remember that an object file is (usually) native machine code, just not in an executable form.

C is cross-platform in source code level.
Once it is compiled, the binary is subject to many factors.
In architecture level, it is possible to generate intermediate object code like LLVM, and do JIT on target machine so that the code fits for target architecture.
However, unless you are doing free-standing development, the platform dependency can comes to play a party preventing you to directly run the code. These dependency may include linking parameters, difference in implementation of standard libraries, platform-specific features, etc.
There is still exception, if the OS provide binary-level compatibility (like BSD ), you may indeed run code compiled for other platform directly.

What do C and Assembler actually compile to? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...
Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.
If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?
edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.
Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:
Code and data appear in named "sections".
Relocatable object files may include definitions of labels, which refer to locations within the sections.
Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.
For example, if you compile and assemble (but don't link) this program
int main () { printf("Hello, world\n"); }
you are likely to wind up with a relocatable object file with
A text section containing the machine code for main
A label definition for main which points to the beginning of the text section
A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"
A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.
If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.
I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

Let's take a C program.
When you run gcc, clang, or 'cl' on the c program, it will go through these stages:
Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...) including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
Lexical analysis (producing tokens and lexical errors).
Syntactical analysis (producing a parse tree and syntactical errors).
Semantic analysis (producing a symbol table, scoping information and scoping/typing errors) Also data-flow, transforming the program logic into an "intermediate representation" that the optimizer can work with. (Often an SSA). clang/LLVM uses LLVM-IR, gcc uses GIMPLE then RTL.
Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, making it impossible / meaningless to "disable all optimizations"
Outputing into assembly source (or another intermediate format like .NET IL bytecode)
Assembling of the assembly into some binary object format.
Linking of the assembly into whatever static libraries are needed, as well as relocating it if needed.
Output of final executable in elf, PE/coff, MachO64, or whatever other format
In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping internal representation between optimization passes for open-source compilers like GCC. (-ftree-dump-...)
Note that there's a 'container' of elf or coff format around the actual executable binary, unless it's a DOS .com executable
You will find that a book on compilers(I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.
As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.
I've wiki'd this answer to let people tweak any errors/add information.

There are different phases in translating C++ into a binary executable. The language specification does not explicitly state the translation phases. However, I will describe the common translation phases.
Source C++ To Assembly or Itermediate Language
Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but helpful in debugging and optimizations.
Assembly To Object Code
The next common step is to translate Assembly language into an Object code. The object code contains assembly code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts in as much information into an object file as it can, everything else is unresolved.
Linking Object Code(s)
The linking phase combines one or more object codes, resolves references and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system and relative addresses.
Executing Binary Files
The Operating System loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (which may be stated in the Executable file).
Compiling Directly To Binary
Some compilers, such as the ones used in Embedded Systems, have the capability to compile from C++ directly to an executable binary code. This code will have physical addresses instead of relative address and not require an OS to load.
Advantages
One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to only compiler pieces in development and link in pieces that are already validated. In general, the translation from C++ to object is the time consuming part of the process. Also, a person doesn't want to wait for all the phases to complete when there is an error in the source code.
Keep an open mind and always expect the Third Alternative (Option).

To answer your questions, please note that this is subjective as there are different processors, different platforms, different assemblers and C compilers, in this case, I will talk about the Intel x86 platform.
Assemblers do not usually assemble to pure / flat binary (raw machine code), instead usually to a file defined with segments such as data, text and bss to name but a few; this is called an object file. The Linker steps in and adjusts the segments to make it executable, that is, ready to run. Incidentally, the default output when you assemble using GNU as foo.s is a.out, that is a shorthand for Assembler Output. (But the same filename is the gcc default for linker output, with the assembler output being only a temporary.)
Boot loaders have a special directive defined, back in the days of DOS, it would be common to find a directive such as .Org 100h, which defines the assembler code to be of the old .COM variety before .EXE took over in popularity. Also, you did not need to have a assembler to produce a .COM file, using the old debug.exe that came with MSDOS, did the trick for small simple programs, the .COM files did not need a linker and were straight ready-to-run binary format. Here's a simple session using DEBUG.
1:*a 0100
2:* mov AH,07
3:* int 21
4:* cmp AL,00
5:* jnz 010c
6:* mov AH,07
7:* int 21
8:* mov AH,4C
9:* int 21
10:*
11:*r CX
12:*10
13:*n respond.com
14:*w
15:*q
This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and not echo it to the screen. Notice, the beginning, the usage of 'a 100h' which shows that the Instruction pointer starts at 100h which is the feature of a .COM. This old script was mainly used in batch files waiting for a response and not echo it. The original script can be found here.
Again, in the case of boot loaders, they are converted to a binary format, there was a program that used to come with DOS, called EXE2BIN. That was the job of converting the raw object code into a format that can be copied on to a bootable disk for booting. Remember no linker is run against the assembled code, as the linker is for the runtime environment and sets up the code to make it runnable and executable.
The BIOS when booting, expects code to be at segment:offset, 0x7c00, if my memory serves me correct, the code (after being EXE2BIN'd), will start executing, then the bootloader relocates itself lower down in memory and continue loading by issuing int 0x13 to read from the disk, switch on the A20 gate, enable the DMA, switch onto protected mode as the BIOS is in 16bit mode, then the data read from the disk is loaded into memory, then the bootloader issues a far jump into the data code (likely to be written in C). That is in essence how the system boots.
Ok, the previous paragraph sounds abstracted and simple, I may have missed out something, but that is how it is in a nutshell.

To answer the assembly part of the question, assembly doesn't compile to binary as I understand it. Assembly === binary. It directly translates. Each assembly operation has a binary string that directly matches it. Each operation has a binary code, and each register variable has a binary address.
That is, unless Assembler != Assembly and I'm misunderstanding your question.

They compile to a file in a specific format (COFF for Windows, etc), composed of headers and segments, some of which have "plain binary" op codes. Assemblers and compilers (such as C) create the same sort of output. Some formats, such as the old *.COM files, had no headers, but still had certain assumptions (such as where in memory it would get loaded or how big it could be).
On Windows machines, the OS's boostrapper is in a disk sector loaded by the BIOS, where both of these are "plain". Once the OS has loaded its loader, it can read files that have headers and segments.
Does that help?

There are two things that you may mix here. Generally there are two topics:
Executable File Formats (see a list here), for example COFF, XCOFF, ELF
Intermediate Languages, like CIL or GIMPLE or bytecode
The latter may compile to the former in the process of assembly. Some intermediate formats are not assembled, but executed by a virtual machine. In case of C++ it may be compiled into CIL, which is assembled into a .NET assembly, hence there me be some confusion.
But in general C and C++ are usually compiled into binary, or in other words, into a executable file format.

You have a lot of answers to read through, but I think I can keep this succinct.
"Binary code" refers to the bits that feed through the microprocessor's circuits. The microprocessor loads each instruction from memory in sequence, doing whatever they say. Different processor families have different formats for instructions: x86, ARM, PowerPC, etc. You point the processor at the instruction you want by giving it the address of the instruction in memory, and then it chugs merrily along through the rest of the program.
When you want to load a program into the processor, you first have to make the binary code accessible in memory so it has an address in the first place. The C compiler outputs a file in the filesystem, which has to be loaded into a new virtual address space. Therefore, in addition to binary code, that file has to include the information that it has binary code, and what its address space should look like.
A bootloader has different requirements, so its file format might be different. But the idea is the same: binary code is always a payload in a larger file format, which includes at a minimum a sanity check to ensure that it's written in the correct instruction set.
C compilers and assemblers are typically configured to produce static library files. For embedded applications, you're more likely to find a compiler which produces something like a raw memory image with instructions beginning at address zero. Otherwise, you can write a linker which converts the output of the C compiler into whatever else you want.

As I understand it, a chipset (CPU, etc.) will have a set of registers for storing data, and understand a set of instructions for manipulating these registers. The instructions will be things like 'store this value to this register', 'move this value', or 'compare these two values'. These instructions are often expressed in short human-grokable alphabetic codes (assembly language, or assembler) which are mapped to the numbers that the chipset understands - those numbers are presented to the chip in binary (machine code.)
Those codes are the lowest level that the software gets down to. Going deeper than that gets into the architecture of the actual chip, which is something I haven't gotten involved in.

The executable files (PE format on windows) cannot be used to boot the computer because the PE loader is not in memory.
The way bootstrapping works is that the master boot record on the disk contains a blob of a few hundred bytes of code. The BIOS of the computer (in ROM on the motherboard) loads this blob into memory and sets the CPU instruction pointer to the beginning of this boot code.
The boot code then loads a "second stage" loader, on Windows called NTLDR (no extension) from the root directory. This is raw machine code that, like the MBR loader, is loaded into memory cold and executed.
NTLDR has the full capability to load PE files including DLLs and drivers.

С(++) (unmanaged) really compiles to plain binary. Some OS-related stuff - are BIOS and OS function calls, they're different for each OS, but still binary.
1. Assembler compiles to pure binary, but, as strange as it gets, it is less optimized than C(++)
2. OS kernel, as well as bootloader, also written in C, so no problems here.
Java, Managed C++, and other .NET stuff, compiles into some pseudocode (MSIL in .NET), which makes it cross-OS and cross-platform, but requires local interpreter or translator to run.

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to chose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!

Yes it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that that loads and runs
the correct library at run-time, via the entry point, depending on config file or other information.

Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location there is no need to worry about WoW64 folder redirection
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means you 'main' executable is free to be totally separate depending on the target operating system (as detecting the cpu/OS capabilities is, by it's nature, very OS specific) and then do most of the rest of your code as shared objects/dlls.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper threading (also whether the OS knows how the schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize you choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on intel hardware but if you are targetting certain ARM cpu's some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
be careful with this - it is hard to beat decent optimizing compilers on this

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can liboil itself and not just its techniques.

Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
int main(int argc, char* argv[])
{
char* target_path[MAXPATH];
char* new_argv[];
char* specific_version = determine_name_of_specific_version();
strcpy(target_path, "/usr/lib/myapp/versions");
strcat(target_path, specific_version);
/* append NULL to argv */
new_argv = malloc(sizeof(char*)*(argc+1));
memcpy(new_argv, argv, argc*sizeof(char*));
new_argv[argc] = 0;
/* optionally set new_argv[0] to target_path */
execv(target_path, new_argv);
}
On the plus side, this approach allows to provide the user transparently with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at Gnu Atuotools (Automake, AutoConf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities need to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exists, you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you have to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the approopriate SSE functionality. Instead, they checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64 bits difference will require two executables. Both the ELF and PE format store this information in the exectuables header. It's not too hard to start the 32 bits version by default, check if you are on a 64 bit system, and then restart the 64 bit version. But it may be easier to create an appropriate symlink at installation time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js