Are the preprocessor, assembler and linker part of the compiler? - c++

So I've been taught, as many of us have, that the compiler is a program that translates your human-readable code into machine-readable code. The more you look into it, however, the more you learn that the "compilation process" is actually broken up into four different parts: the preprocessor, compiler, assembler and linker. I think not understanding where all these parts fit into place has confused me a bit.
Are all the steps described in a typical compilation process part of the compiler program?
Or are things like the assembler and linker separate programs built into IDEs along with compilers to generate code?
Does it depend on the compiler or programming language?
If separate, is the compiler responsible for just the assembly code creation as well as optimizing the assembly code?

Are all the steps described in a typical compilation process part of the compiler program?
All of these steps are required by the translation process. The process includes preprocessing, compilation, assembly / machine-code generation, and producing an executable (i.e. linking).
A translator program, a.k.a. compiler, does not need to put all of these steps into one executable.
For example, a program may be composed of more than one translation unit, so the units can be compiled separately and the resulting pieces linked together afterwards. Separating compilation from linking is often beneficial.
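A minimal sketch of that split, with hypothetical file names foo.cpp and bar.cpp and GCC-style commands:
g++ -c foo.cpp -o foo.o      # compile one translation unit to an object file
g++ -c bar.cpp -o bar.o      # compile another one, possibly much later
g++ foo.o bar.o -o app       # link the pieces into an executable
Each g++ -c run is "compilation" in the narrow sense; the last command, given only object files, just invokes the linker.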
Or are things like the assembler and linker separate programs built into IDEs along with compilers to generate code?
Some IDEs, like Eclipse, do not have built-in compilers or linkers. The Eclipse IDE is designed to work with various compilers and linkers, and it needs to be configured as to which tools it will use when building a program.
Does it depend on the compiler or programming language?
IDEs are usually independent of compilers and languages. The NetBeans IDE can be used with Java or C++ (similarly with Eclipse).
Some IDEs may have features that work better with one language than another, such as keyword highlighting.
If separate, is the compiler responsible for just the assembly code creation as well as optimizing the assembly code?
Assembly language creation is not a required part of the process.
Typically, compilers have an option you can supply in order to print an assembly language listing.
Some compilers emit executable code without going through the generation of assembly language.
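For example, with GCC-style drivers (hypothetical file name main.cpp; other compilers have equivalent options):
g++ -S main.cpp -o main.s    # stop after code generation and write an assembly listing
g++ -c main.cpp -o main.o    # produce an object file directly; no assembly listing is kept
Whether an assembler is run internally as a separate step is an implementation detail of the toolchain.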

The meaning of the term “compiler” depends on the context.
For the beginner, the compiler is the tool you use to create an executable program from your source code.
Delving a little deeper, one learns that with practical toolchains there is at least a division into compiler and linker.
And while the above two views have been based solely on tool usage, when one learns more about C++ one appreciates the division into preprocessing and compilation “proper”, i.e. a preprocessor and a compiler, and a linker, where the preprocessor produces text, the compiler produces object code, and the linker produces executables or libraries.
Delving even deeper into things, one may start to differentiate between the different internal phases of the compiler (in the trio above). Some compilers utilize an assembler, some generate code directly from an abstract syntax tree, and some compilers go as far as using a whole C compiler at the end, just translating the language-X source code to C source code. E.g. Eiffel compilers used to do this, and probably still do. And C++ started out that way, as a front end to a C compiler.
And especially with the idea of just translating to C, one may call that part the real compiler, with the C compiler at the end as just one of the tools invoked by the compiler proper.
So, it depends very much on the context.

Related

What are some examples of non-determinism in the C++ compiler?

I'm looking for examples of code that triggers non-determinism in GCC or Clang's compilation process.
One prominent example is the usage of the __DATE__ macro.
GCC and Clang have a plethora of compiler flags to control the outcome of non-deterministic actions within the compiler, e.g. -frandom-seed and -fno-guess-branch-probability.
Are there any small examples that are affected by these flags?
To be more precise:
$ c++ main.cpp -o main && shasum main
aabbccddee
$ c++ main.cpp -o main && shasum main
eeddccbbaa
I'm looking for macro-free code examples where multiple runs of the compiler lead to different outputs, but which can be fixed by e.g. -frandom-seed.
EDIT:
related: from the gcc docs:
-fno-guess-branch-probability:
Sometimes gcc will opt to use a randomized model to guess branch probabilities,
when none are available from either profiling feedback (-fprofile-arcs)
or __builtin_expect.
This means that different runs of the compiler on the same program
may produce different object code.
The default is -fguess-branch-probability at levels -O, -O2, -O3, -Os.
While old, this question is interesting for reproducible builds.
As you've stated, there are multiple sources of non-determinism while compiling some C/C++ source.
Non-determinism in preprocessor
The preprocessor implements numerous predefined macros, some of which change between runs. There are the obvious __DATE__ and __TIME__, but also the less obvious __cplusplus, __STDC_VERSION__ or __GNUC_PATCHLEVEL__, which can change when the toolchain or OS is updated.
There is also __FILE__, which will contain a path from the build environment (different from machine to machine).
Please notice that for the date and time macros, GCC observes the environment variable SOURCE_DATE_EPOCH to override them. Other compilers might have some other behavior.
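As a small illustration (hypothetical file name when_built.cpp; GCC behavior as described above, other compilers may differ), this program embeds the build date and time, so rebuilding it at a different second produces a different binary unless SOURCE_DATE_EPOCH is pinned:
// when_built.cpp
#include <cstdio>
int main() { std::printf("built on %s at %s\n", __DATE__, __TIME__); }
g++ when_built.cpp -o when_built && shasum when_built                              # hash changes across rebuilds
SOURCE_DATE_EPOCH=1577836800 g++ when_built.cpp -o when_built && shasum when_built # hash is now stable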
Non-determinism in the compiler
The compiler might use different optimization strategies based on a non-deterministic approach. You've cited one in GCC, but others might exist.
For MSVC, you might be interested in the /Brepro compiler flag.
You'll have to RTFM for your compiler to know more.
Non-determinism in the linker
On some platforms, the linked object and/or library will contain a timestamp; macOS is one of them. So for the same set of .o files, you'll get a different resulting executable.
Also, if you use Link Time Optimization, many compilers will create intermediate versions of the .o files with randomly generated names. Again, for GCC you'll use -frandom-seed=31415 to "fix" this randomness, but YMMV.
Non-determinism in the build-process
Sometimes repositories contain additional operations that are performed outside of the compilation stage, like generating header files based on some configuration flags (or other steps).
In that case, these project-specific operations might not be deterministic either.
For a good overview of deterministic builds, please refer to this post.

C++ systems not using "source files"

In The C++ Programming Language (4th edition), §15.1, Stroustrup states:
A file is the traditional unit of storage (in a file system) and the traditional unit of compilation. There are systems that do not store, compile, and present C++ programs to the programmer as sets of files.
Sadly, he doesn't give further information. Do you know any example of such systems?
EDIT:
I mean: do you know of any actual C++ implementation (free, commercial, open source, or whatever) that doesn't deal with files as we are accustomed to?
And I was wondering: why do such systems exist? What's the point? What can be the advantages of such a design philosophy? What are the drawbacks?
IIRC, in the 1980s IBM VisualAge C++ stored the program source code (or perhaps a faithful representation of its AST) in some proprietary database. (It is rumored that header files also sat in some database at that time.)
And current C++ compilers are often able to get the source code from a generated file, or even from some pipe. For instance, on my Linux machine I could have a program mygenerator generating some C++ code on its stdout and invoke the GCC compiler as:
mygenerator | g++ -x c++ /dev/stdin -Wall -O -o myprogram
However, today, most compilers generally compile source files and header files from some file system. Notice that an optimizing compiler spends much more time in compilation proper than in disk I/O, and you could use some tmpfs file system, so file read & write time is negligible when compiling C++ code (even parsing is often quicker than optimization & code generation).
So I know of no C++ compiler used in 2015 that compiles and optimizes source code outside of source files.
Actually, generating C++ code is often a good idea (I'm doing it in MELT, which enables you to customize GCC), but usually you tweak your build procedure (e.g. your Makefile) to generate and then compile some temporary C++ files. With current computers, operating systems, and compilers (e.g. Linux & GCC) you could even generate some temporary C++ file, fork a compilation of it into a shared-object plugin, and dlopen(3) it.
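A minimal sketch of that idea (hypothetical names, assuming a POSIX system with g++ on the PATH; link the host program with -ldl on older glibc):
// gen_and_load.cpp
#include <cstdlib>
#include <dlfcn.h>
#include <fstream>
#include <iostream>
int main() {
    // 1. Generate some C++ source at run time.
    std::ofstream("generated.cpp") << "extern \"C\" int answer() { return 42; }\n";
    // 2. Compile it into a shared-object plugin in a child compiler process.
    if (std::system("g++ -shared -fPIC generated.cpp -o generated.so") != 0) return 1;
    // 3. Load the plugin and call into it.
    void* handle = dlopen("./generated.so", RTLD_NOW);
    if (!handle) { std::cerr << dlerror() << '\n'; return 1; }
    auto answer = reinterpret_cast<int (*)()>(dlsym(handle, "answer"));
    std::cout << answer() << '\n';   // prints 42
    dlclose(handle);
}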
A possible reason to store the source code in something better than a file (e.g. some database) would be to make an incremental compiler, which would recompile only one function if it was the only modification since the previous compilation. In practice, this is difficult to implement in existing compilers (it has been discussed, and sort-of prototyped, within the GCC community, but nothing stable came out of it). But C++ or C is not the best language for such an approach (Common Lisp is much better, and SBCL is able to compile and optimize in memory and incrementally), in particular because of its preprocessor.
BTW, tinycc is able to compile C code sitting inside some const char* string in memory, but the performance of the generated machine code is bad (since tcc does not do any kind of serious optimization, which current processors need so much).
Notice also that with link time optimizations (e.g. compile and link with g++ -flto -O2) compilers are keeping some form of the AST (actually the Gimple representation of GCC) in object files.
C++ source code can be stored in a database in various ways.

How is clang able to steer C/C++ code optimization?

I was told that clang is a driver that works like gcc to do preprocessing, compilation and linkage work. During the compilation and linkage, as far as I know, it's actually llvm that does the optimization ("-O1", "-O2", "-O3", "-Os", "-flto").
But I just cannot understand how llvm is involved.
It seems that compiling source code doesn't even need a static library such as libLLVMCore.a; instead, the Debian clang package depends on another package called libllvm-3.4 (the clang version is 3.4), which contains libLLVM-3.4.so(.1). Does clang use this shared library for optimization?
I've checked clang source code for a while and found that include/clang/Driver/Options.td contains the related options, but unfortunately I failed to find the source files that include that file, so I'm still not aware of the mechanism.
I hope someone might give me some hints.
(TL;DontWannaRead - skip to the end of this answer)
To answer your question properly you first need to understand the difference between a compiler's front-end and back-end (especially the first one).
Clang is a compiler front-end (http://en.wikipedia.org/wiki/Clang) for C, C++, Objective C and Objective C++ languages.
Clang's duty is the following: translating from C++ source code (or C, or Objective-C, etc.) to LLVM IR, a textual, lower-level representation of what that code should do. In order to do this, Clang employs a number of sub-modules whose descriptions you can find in any decent compiler-construction book: a lexer, a parser + a semantic analyzer (Sema), etc.
LLVM is a set of libraries whose primary task is the following: suppose we have the LLVM IR representation of the following C++ function
int double_this_number(int num) {
    int result = 0;
    result = num;
    result = result * 2;
    return result;
}
the core LLVM passes should optimize that LLVM IR code.
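You can watch this happen yourself (a sketch; the exact IR output varies by Clang/LLVM version and target). Assuming the function above is saved in a hypothetical double.cpp:
clang++ -O0 -S -emit-llvm double.cpp -o double_O0.ll   # naive IR: loads/stores for every statement
clang++ -O2 -S -emit-llvm double.cpp -o double_O2.ll   # optimized IR: typically just a shift (or multiply) and a return
In the optimized version the result temporary is folded away entirely.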
What to do with the optimized LLVM IR code is entirely up to you: you can translate it to x86_64 executable code or modify it and then spit it out as ARM executable code or GPU executable code. It depends on the goal of your project.
The term "back-end" is often confusing since there are many papers that would define the LLVM libraries a "middle end" in a compiler chain and define the "back end" as the final module which does the code generation (LLVM IR to executable code or something else which no longer needs processing by the compiler). Other sources refer to LLVM as a back end to Clang. Either way, their role is clear and they offer a powerful mechanism: whatever the language you're targeting (C++, C, Objective C, Python, etc..) if you have a front-end which translates it to LLVM IR, you can use the same set of LLVM libraries to optimize it and, as long as you have a back-end for your target architecture, you can generate optimized executable code.
Recalling that LLVM is a set of libraries (not just optimization passes but also data structures, utility modules, diagnostic modules, etc.), Clang also leverages many LLVM libraries during its front-end work. You can't really tear every LLVM module away from Clang, since the latter is built on the former set.
As for the reason why Clang is said to be a "compilation driver": Clang manages interpreting the command-line parameters (descriptions and many declarations are TableGen'd, and they might require a bit more than a simple grep to swim through in the sources), decides which Jobs and phases are to be executed, sets up the CodeGenOptions according to the desired/possible optimization and transformation levels, and invokes the appropriate modules (clangCodeGen in BackendUtil.cpp is the one that populates a module pass manager with the optimizations to apply) and tools (e.g. the Windows ld linker). It steers the compilation process from the very beginning to the end.
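You can see the driver's decisions without running anything (real Clang options; the hypothetical main.cpp is just a placeholder):
clang++ -### main.cpp -o main        # print the cc1 and linker command lines the driver would run
clang++ -ccc-print-phases main.cpp   # print the phases: preprocess, compile, backend, assemble, link
This makes it visible that "clang" the driver orchestrates several jobs, some handled by the clang binary itself (cc1) and some by external tools.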
Finally, I would suggest reading the Clang and LLVM documentation; it's pretty explanatory, and it should be the first place you look for an answer to most of your questions.
It's not exactly like GCC, so don't spend too much time trying to match the two precisely.
The LLVM tools form a compiler for one specific representation, LLVM IR. What Clang does is compile C++ code to LLVM IR, largely without optimization. Clang then invokes the LLVM libraries to optimize that IR and compile it down to assembly or object code.
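A rough way to see that split with the standalone tools (a sketch; in a normal clang++ invocation this all happens in-process, and exact flags vary by LLVM version):
clang++ -S -emit-llvm main.cpp -o main.ll                      # front end: C++ -> textual LLVM IR
opt -O2 -S main.ll -o main_opt.ll                              # middle end: optimize the IR (newer LLVM: -passes='default<O2>')
llc main_opt.ll -o main.s                                      # back end: IR -> assembly for the host target
llc -mtriple=armv7-linux-gnueabihf main_opt.ll -o main_arm.s   # same IR, retargeted to ARM (if that backend is built in)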

LLVM bitcode cross-platform

Just to be sure: is LLVM bitcode cross-platform? By which I mean, can the generated IR (".bc") file be distributed and interpreted/JITed on various platforms?
If so, how does Clang convert C++ into platform-independent code? After all, in the C++ language itself the preprocessor is used to determine the target platform before the code is actually compiled.
LLVM IR can be cross-platform, with the obvious exceptions others have listed. However, that does not mean Clang generates cross-platform code. As you note, the preprocessor is almost universally used to only pass parts of the code to the C/C++ compiler, depending on the platform. Even when this is not done in user code, many system headers include a bit or two that's platform-specific, such as typedefs. For example, if you compile C code using size_t to LLVM IR on a platform where size_t is 32 bit, the LLVM IR now uses i32 for that, and there's no way in hell you can reverse engineer that to fix it.
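A small illustration of the size_t point (hypothetical file; compile with clang++ -S -emit-llvm for different targets):
// count_bytes.cpp
#include <cstddef>
std::size_t count_bytes(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;   // emitted as i32 on a 32-bit target, i64 on a 64-bit one; size_t itself is gone from the IR
}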
Google's Portable Native Client project (thanks @willglynn for the link), if I understand it correctly, achieves portability by fixing the ABI for all target platforms. So in that sense, it doesn't solve the aforementioned issues: the LLVM IR is not portable to a platform with a different ABI. The only reason this is more portable is that the clients provide a layer which matches the PNaCl ABI to the actual ABI. In other words, PNaCl code isn't portable to many platforms; the "PNaCl VM" is.
So, bottom line: If you're very careful, you can use LLVM IR across multiple platforms, but not without doing significant additional work (which Clang doesn't do) to abstract over the ABI differences.
Given an IR file, can I be sure it could compile to my target?
You can not assume an arbitrary IR file will always be cross-platform, as there are things in a given file that might not be platform-independent. The most notable example is that the IR can contain actual assembler sequences (via module-level or inline assembly segments), but there are other examples - e.g. usage of target specific intrinsics or calling conventions that are only supported on some targets.
Can I generate an IR file that is guaranteed to compile on all targets?
I don't know, but I believe you can, especially if you avoid specifying things like inline assembly, calling conventions, required / preferred ABI for types, etc. It can affect the optimizations the compiler will perform, though.

Intermediate code from C++

I want to compile a C++ program to an intermediate code. Then, I want to compile the intermediate code for the current processor with all of its resources.
The first step is to compile the C++ program with optimizations (-O2), run the linker and do most of the compilation procedure. This step must be independent of operating system and architecture.
The second step is to compile the result of the first step, without the original source code, for the operating system and processor of the current computer, with optimizations and special instructions of the processor (-march=native). The second step should be fast and with minimal software requirements.
Can I do it? How to do it?
Edit:
I want to do this because I want to distribute a platform-independent program that can use all the resources of the processor, without shipping the original source code, instead of distributing a separate build for each platform and operating system. It would be good if the second step were fast and easy.
Processors of the same architecture may have different features. x86 processors may have SSE1, SSE2, or other extensions, and they can be 32- or 64-bit. If I compile for a generic x86, the program will lack SSE optimizations. After many years, processors will have new features, and the program will need to be compiled for the new processors.
Just a suggestion - google clang and LLVM.
How much do you know about compilers? You seem to treat "-O2" as some magical flag.
For instance, register assignment is a typical optimization. You definitely need to know how many registers are available. There's no point in assigning foo to register 16, only to discover in phase 2 that you're targeting x86.
And those architecture-dependent optimizations can be quite complex. Inlining depends critically on call cost, and that in turn depends on architecture.
Once you get to "processor-specific" optimizations, things get really tricky. It's really tough for a platform-specific compiler to be truly "generic" in its generation of object or "intermediate" code at an appropriate "level": Unless it's something like "IL" (intermediate language) code (like the C#-IL code, or Java bytecode), it's really tough for a given compiler to know "where to stop" (since optimizations occur all over the place at different levels of the compilation when target platform knowledge exists).
Another thought: What about compiling to "preprocessed" source code, typically with a "*.i" extension, and then compile in a distributed manner on different architectures?
For example, most (if not all) C and C++ compilers support something like:
cl /P MyFile.cpp
gcc -E MyFile.cpp -o MyFile.i
...each generates MyFile.i, which is the preprocessed file. Now that the file has ALL the headers included and the other #defines expanded, you can compile that *.i file to the target object file (or executable) after distributing it to other systems. (You might need to get clever if your preprocessor macros are specific to the target platform, but it should be quite straightforward with your build system, which should generate the command line to do this preprocessing.)
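A sketch of that split with GCC (hypothetical file names; GCC conventionally uses the .ii suffix for preprocessed C++):
g++ -E MyFile.cpp -o MyFile.ii    # on the fully configured machine: expand all headers and macros
g++ -c MyFile.ii -o MyFile.o      # on any machine with a compatible g++: compile the self-contained file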
This is the approach used by distcc to preprocess the file locally, so remote "build farms" need not have any headers or other packages installed. (You are guaranteed to get the same build product, no matter how the machines in the build farm are configured.)
Thus, it would similarly have the effect of centralizing the "configuration/pre-processing" for a single machine, but provide cross-compiling, platform-specific compiling, or build-farm support in a distributed manner.
FYI -- I really like the distcc concept, but the last update for that particular project was in 2008. So, I'd be interested in other similar tools/products if you find them. (In the meantime, I'm writing a similar tool.)