LLVM bitcode cross-platform - C++

Just to be sure: Is LLVM bitcode cross-platform? By which I mean, can the generated IR (".bc") file be distributed and interpreted/JITed across various platforms?
If so, how does Clang convert C++ into platform-independent code? In the C++ language itself, preprocessor directives are used to determine the target platform before the code is actually compiled.

LLVM IR can be cross-platform, with the obvious exceptions others have listed. However, that does not mean Clang generates cross-platform code. As you note, the preprocessor is almost universally used to pass only parts of the code to the C/C++ compiler, depending on the platform. Even when this is not done in user code, many system headers include a bit or two that's platform-specific, such as typedefs. For example, if you compile C code using size_t to LLVM IR on a platform where size_t is 32 bits, the LLVM IR now uses i32 for that, and there's no way in hell you can reverse engineer that to fix it.
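A minimal sketch of that size_t problem (the i32/i64 lowering described in the comments assumes typical 32- and 64-bit targets):

#include <cstddef>

// The same source bakes the platform's size_t width into the IR.
std::size_t add_sizes(std::size_t a, std::size_t b) {
    return a + b;   // emitted as 'add i32 ...' on a 32-bit target,
                    // 'add i64 ...' on a 64-bit one -- the choice is baked in
}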
Google's Portable Native Client project (thanks @willglynn for the link), if I understand it correctly, achieves portability by fixing the ABI for all target platforms. So in that sense, it doesn't solve the aforementioned issues: the LLVM IR is not portable to platforms with a different ABI. The only reason this is more portable is that the clients provide a layer which matches the PNaCl ABI to the actual ABI. In other words, PNaCl code isn't portable to many platforms; the "PNaCl VM" is.
So, bottom line: If you're very careful, you can use LLVM IR across multiple platforms, but not without doing significant additional work (which Clang doesn't do) to abstract over the ABI differences.

Given an IR file, can I be sure it could compile to my target?
You cannot assume an arbitrary IR file will always be cross-platform, as there are things in a given file that might not be platform-independent. The most notable example is that the IR can contain actual assembler sequences (via module-level or inline assembly segments), but there are other examples - e.g. usage of target-specific intrinsics or calling conventions that are only supported on some targets.
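For instance, a hedged sketch of ordinary-looking C++ whose IR is welded to one architecture (the function name is invented):

// The IR for this function embeds a literal x86 'rdtsc' assembler
// sequence, which no non-x86 backend can lower.
unsigned long long read_cycle_counter() {
    unsigned int lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<unsigned long long>(hi) << 32) | lo;
}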
Can I generate an IR file that is guaranteed to compile on all targets?
I don't know, but I believe you can, especially if you avoid specifying things like inline assembly, calling conventions, and a required/preferred ABI for types. Avoiding these can affect the optimizations the compiler performs, though.

Related

Is the preprocessor, assembler and linker a part of the compiler?

So I've been taught, as many of us have, that the compiler is a program that translates your human-readable code into machine-readable code. The more you look into it, however, you learn that the "compilation process" is actually broken up into 4 different parts: the preprocessor, compiler, assembler and linker. I think not understanding where all these parts fit into place has confused me a bit.
Are all the steps described in a typical compilation process part of the compiler program?
Or are things like the assembler and linker separate programs built into IDEs along with compilers to generate code?
Does it depend on the compiler or programming language?
If separate, is the compiler responsible for just the assembly code creation as well as optimizing the assembly code?
Are all the steps described in a typical compilation process part of the compiler program?
All the steps are required by the translation process. The process includes preprocessing, compilation, assembly/machine-code generation, and producing an executable (i.e. linking).
A translator program, a.k.a. a compiler, does not need to put all of these steps into one executable.
For example, a program may be composed of more than one translation unit; each can be compiled on its own, and then the pieces can be linked together, as sketched below. Separating compilation from linking is often beneficial.
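A minimal sketch of separate compilation and linking (the file names and the g++ driver are illustrative choices):

// util.cpp -- one translation unit
int twice(int x) { return 2 * x; }

// main.cpp -- another translation unit
int twice(int x);                 // declaration; the linker resolves it
int main() { return twice(21); }

// Build in separate steps:
//   g++ -c util.cpp -o util.o    # compile only
//   g++ -c main.cpp -o main.o    # compile only
//   g++ util.o main.o -o app     # link the pieces into an executable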
Or are things like the assembler and linker separate programs built into IDEs along with compilers to generate code?
Some IDEs, like Eclipse, do not have a built-in compiler or linker. The Eclipse IDE is designed to work with various compilers and linkers, and needs to be configured as to which tools it will use when building a program.
Does it depend on the compiler or programming language?
IDEs are usually independent of compilers and languages. The NetBeans IDE can be used with Java or C++ (similarly with Eclipse).
Some IDEs may have features that work better with one language than another, such as keyword highlighting.
If separate, is the compiler responsible for just the assembly code creation as well as optimizing the assembly code?
Assembly language creation is not a required part of the process.
Typically, compilers have an option you can supply in order to print an assembly language listing.
Some compilers emit executable code without going through the generation of assembly language.
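For example, a sketch using the common GCC/Clang driver flags (file names are made up):

// square.cpp
int square(int x) { return x * x; }

// Ask for an assembly listing explicitly:
//   g++ -S square.cpp -o square.s   # human-readable assembly
// Or go straight to an object file, with no .s file written:
//   g++ -c square.cpp -o square.o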
The meaning of the term “compiler” depends on the context.
For the beginner, the compiler is the tool you use to create an executable program from your source code.
Delving a little deeper, one learns that with practical toolchains there is at least a division into compiler and linker.
And while the above two views have been based solely on tool usage, when one learns more about C++ one appreciates the division into preprocessing and compilation “proper”, i.e. a preprocessor and a compiler, and a linker, where the preprocessor produces text, the compiler produces object code, and the linker produces executables or libraries.
Delving even deeper into things, one may start to differentiate between different internal phases of the compiler (in the trio above). Some compilers utilize an assembler, some generate code directly from an abstract syntax tree, and some compilers go as far as using a whole C compiler at the end, just translating the language-X source code to C source code. E.g. Eiffel compilers used to do this, and probably still do. And C++ started out that way, as a front end to a C compiler.
And especially with the idea of just translating to C, one may call that part the real compiler, with the C compiler at the end as just one of the tools invoked by the compiler proper.
So, it depends very much on the context.

How is clang able to steer C/C++ code optimization?

I was told that clang is a driver that works like gcc to do preprocessing, compilation and linkage work. During the compilation and linkage, as far as I know, it's actually llvm that does the optimization ("-O1", "-O2", "-O3", "-Os", "-flto").
But I just cannot understand how llvm is involved.
It seems that compiling source code doesn't even need a static library such as libLLVMCore.a; instead, the Debian clang package depends on another package called libllvm-3.4 (the clang version is 3.4), which contains libLLVM-3.4.so(.1). Does clang use this shared library for optimization?
I've checked the clang source code for a while and found that include/clang/Driver/Options.td contains the related options, but unfortunately I failed to find the source files that include that file, so I'm still not aware of the mechanism.
I hope someone might give me some hints.
(TL;DontWannaRead - skip to the end of this answer)
To answer your question properly you first need to understand the difference between a compiler's front-end and back-end (especially the first one).
Clang is a compiler front-end (http://en.wikipedia.org/wiki/Clang) for C, C++, Objective C and Objective C++ languages.
Clang's duty is translating from C++ source code (or C, or Objective-C, etc.) to LLVM IR, a textual, lower-level representation of what that code should do. In order to do this, Clang employs a number of sub-modules whose descriptions you could find in any decent compiler construction book: a lexer, a parser + a semantic analyzer (Sema), etc.
LLVM is a set of libraries whose primary task is the following: suppose we have the LLVM IR representation of the following C++ function
int double_this_number(int num) {
    int result = 0;
    result = num;
    result = result * 2;
    return result;
}
the core LLVM passes then optimize that LLVM IR code, stripping away the redundant temporaries.
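Roughly what comes out, shown here as equivalent C++ for readability (the actual output is LLVM IR, where the body collapses to a single shift or multiply instruction):

int double_this_number(int num) {
    return num * 2;   // the temporary and the redundant stores are gone
}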
What to do with the optimized LLVM IR code is entirely up to you: you can translate it to x86_64 executable code or modify it and then spit it out as ARM executable code or GPU executable code. It depends on the goal of your project.
The term "back-end" is often confusing since there are many papers that would define the LLVM libraries a "middle end" in a compiler chain and define the "back end" as the final module which does the code generation (LLVM IR to executable code or something else which no longer needs processing by the compiler). Other sources refer to LLVM as a back end to Clang. Either way, their role is clear and they offer a powerful mechanism: whatever the language you're targeting (C++, C, Objective C, Python, etc..) if you have a front-end which translates it to LLVM IR, you can use the same set of LLVM libraries to optimize it and, as long as you have a back-end for your target architecture, you can generate optimized executable code.
Recalling that LLVM is a set of libraries (not just optimization passes but also data structures, utility modules, diagnostic modules, etc..), Clang also leverages many LLVM libraries during its front-ending process. You can't really tear every LLVM module away from Clang since the latter is built on the former set.
As for the reason why Clang is said to be a "compilation driver": Clang manages interpreting the command-line parameters (descriptions and many declarations are TableGen'd, and they might require a bit more than a simple grep to swim through the sources), decides which jobs and phases are to be executed, sets up the CodeGenOptions according to the desired/possible optimization and transformation levels, and invokes the appropriate modules (clangCodeGen in BackendUtil.cpp is the one that populates a module pass manager with the optimizations to apply) and tools (e.g. the Windows ld linker). It steers the compilation process from the very beginning to the end.
Finally, I would suggest reading the Clang and LLVM documentation; it's pretty thorough, and you should look for answers to most of your questions there in the first place.
It's not exactly like GCC, so don't spend too much time trying to match the two precisely.
The LLVM compiler is a compiler for one specific language: LLVM IR. What Clang does is compile C++ code to LLVM IR, without optimizations. Clang can then invoke the LLVM compiler to compile that IR to optimized assembly.
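You can watch that division of labour by driving the standalone LLVM tools yourself; a hedged sketch (the file names are made up, the flags are the commonly documented ones):

// double.cpp
int double_this_number(int num) { return num * 2; }

// Front end: C++ -> unoptimized LLVM IR
//   clang++ -O0 -S -emit-llvm double.cpp -o double.ll
// Middle end: run the standalone optimizer over the IR
//   opt -O2 -S double.ll -o double_opt.ll
// Back end: lower the optimized IR to target assembly
//   llc double_opt.ll -o double.s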

Are g++ and clang++ 100% binary compatible? [duplicate]

If I build a static library with llvm-gcc, then link it with a program compiled using mingw gcc, will the result work?
The same for other combinations of llvm-gcc, clang and normal gcc. I'm interested in how this works out on Linux (using normal non-mingw gcc, of course) and other platforms as well, but the emphasis is on Windows.
I'm also interested in all languages, but with a strong emphasis on C and C++ - obviously clang doesn't support Fortran etc, but I believe llvm-gcc does.
I assume they all use the ELF file format, but what about calling conventions, virtual table layouts, etc.?
Yes, for C code Clang and GCC are compatible (they both use the GNU toolchain for linking, in fact). You just have to make sure that you tell clang to create compiled objects and not intermediate bitcode objects. The C ABI is well-defined, so the only issue is storage format.
C++ is not portable between compilers in the slightest; different compilers use different virtual table layouts, constructor/destructor conventions, name mangling, template implementations, etc. As a rule you should assume objects from one C++ compiler will not work with another.
However yes, at the time of writing Clang++ is able to use GCC/C++ compiled libraries as well; I recently set up a rig to compile C++ programs with clang using G++'s standard runtime library, and it compiles and links just fine.
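For reference, a sketch of that kind of setup (the -stdlib flag is Clang's documented way to pick a C++ standard library; the file name is made up):

// hello.cpp -- compiled by clang++, linked against G++'s libstdc++
#include <iostream>
int main() { std::cout << "clang++ on top of libstdc++\n"; }

// Select GCC's runtime explicitly instead of LLVM's libc++:
//   clang++ -stdlib=libstdc++ hello.cpp -o hello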
I don't know the answer, but slide 10 in this presentation seems to imply that the ".o" files produced by llvm-gcc contain LLVM bitcode (.bc) instead of the usual target-specific object code, so that link-time optimization is possible. However, the LLVM linker should be able to link LLVM code with code produced by "normal" GCC, as the next slide says "link in native .o files and libraries here".
LLVM is a Linux tool, I have sometimes found that Linux compilers don't work quite right on Windows. I would be curious whether you get it to work or not.
I use -m i386pep when linking clang's .o files with ld. LLVM's devotion to integrating with GCC is on open display at http://dragonegg.llvm.org/ so it's very intuitive to guess that the LLVM family will be largely cross-compatible with the GCC toolchain.
Sorry - I was coming back to LLVM after a break, and have never done much more than the tutorial. The first time around, I kind of burned out after the struggle of getting LLVM 2.6 to build on MinGW GCC - thankfully that's not a problem with LLVM 2.7.
Going through the tutorial again today I noticed in Chapter 5 of the tutorial not only a clear statement that LLVM uses the ABI (Application Binary Interface) of the platform, but also that the tutorial compiler depends on this to allow access to external functions such as sin and cos.
I still don't know whether the compatible ABI extends to C++, though. That's not an issue of calling conventions so much as of name mangling, struct layout and vtable layout.
Being able to make C function calls is enough for most things, but there are still a few issues where I care about C++.
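A sketch of that seam: keeping the boundary in plain C, which both compilers' ABIs agree on (the names here are invented for illustration):

// exported.h -- a C-ABI boundary around C++ code
extern "C" double compute_area(double radius);   // no C++ name mangling

// exported.cpp -- C++ inside, plain C calling convention at the boundary
extern "C" double compute_area(double radius) {
    const double pi = 3.141592653589793;
    return pi * radius * radius;
}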
Hopefully they've fixed it, but I avoid llvm-gcc because I (also) use LLVM as a cross compiler, and when you use llvm-gcc -m32 on a 64-bit machine the -m32 is ignored and you get 64-bit ints, which have to be faked on your 32-bit target machine. Clang does not have that bug, nor does gcc. Also, the more I use clang the more I like it. As to your direct question: I don't know; in theory, targets these days have well-known calling conventions, and you would hope both gcc and LLVM conform to the same ones, but you never know. The simplest way to find out is to write a couple of simple functions, compile and disassemble them with both toolsets, and see how they pass operands to the functions.
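A minimal sketch of that experiment (file names and flags are just illustrative):

// abi_probe.cpp -- functions whose argument passing is easy to spot
long add3(long a, long b, long c) { return a + b + c; }
double mix(double x, int n) { return x * static_cast<double>(n); }

// Compile the same file with both toolchains, then compare disassembly:
//   g++     -O1 -c abi_probe.cpp -o abi_gcc.o   && objdump -d abi_gcc.o
//   clang++ -O1 -c abi_probe.cpp -o abi_clang.o && objdump -d abi_clang.o
// If the same registers carry a, b, c, x and n in both listings, the
// calling conventions agree for these signatures.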

Clang vs. LLVMC -- what's the difference?

What's the difference between llvmc.exe and clang.exe? Which one do I use for compiling C or C++ code?
llvmc is a front end for various programs in the LLVM toolchain, in particular the llvm-* ones; i.e. by default it will try to use llvm-gcc and llvm-g++ to compile C and C++ files.
You can pass -clang to llvmc if that's what you want to use, and it's probably possible to configure llvmc so clang will be used by default, but I have no idea how to do that.
I'd recommend just using clang and clang++ directly; they can be used as drop-in replacements for gcc and g++.
llvmc was an experimental driver that was intended to support multiple different source languages. Clang and Clang++ have always been the preferred way to drive the (C / C++ / Objective-C) compiler. In fact, llvmc has been removed from mainline.
In short, you should definitely use "clang" and never "llvmc".
LLVM originally stands for Low-Level Virtual Machine, and is today mostly used either:
as a backend optimizer/compiler
as a JIT compiler
On the other hand, Clang is a collection of libraries for dealing with the C language family that notably contains a compiler (clang) which acts as a front-end for C, C++, Objective-C and Objective-C++ on top of the LLVM libraries.
So, in your case, you will want to use clang and clang++ to compile C and C++ respectively, and don't worry about the fact that LLVM is used behind the scenes to optimize your code and deal with generation of machine instructions adapted to your architecture.
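Concretely, a sketch (the file name is made up; the flags mirror gcc/g++ usage):

// hello.cpp
#include <cstdio>
int main() { std::puts("hello"); }

// Build it exactly as you would with g++ (or gcc for C sources):
//   clang++ -O2 hello.cpp -o hello
//   clang   -O2 hello.c   -o hello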

Intermediate code from C++

I want to compile a C++ program to an intermediate code. Then, I want to compile the intermediate code for the current processor with all of its resources.
The first step is to compile the C++ program with optimizations (-O2), run the linker and do most of the compilation procedure. This step must be independent of operating system and architecture.
The second step is to compile the result of the first step, without the original source code, for the operating system and processor of the current computer, with optimizations and special instructions of the processor (-march=native). The second step should be fast and with minimal software requirements.
Can I do it? How to do it?
Edit:
I want to do it because I want to distribute a platform-independent program that can use all the resources of the processor, without the original source code, instead of distributing a compilation for each platform and operating system. It would be good if the second step were fast and easy.
Processors of the same architecture may have different features. X86 processors may have SSE1, SSE2 or others, and they can be 32- or 64-bit. If I compile for a generic X86, it will lack the SSE optimizations. After many years, processors will have new features, and the program will need to be compiled for new processors.
Just a suggestion - google clang and LLVM.
How much do you know about compilers? You seem to treat "-O2" as some magical flag.
For instance, register assignment is a typical optimization. You definitely need to know how many registers are available. There's no point in assigning foo to register 16, only to discover in phase 2 that you're targeting an x86.
And those architecture-dependent optimizations can be quite complex. Inlining depends critically on call cost, and that in turn depends on architecture.
Once you get to "processor-specific" optimizations, things get really tricky. It's really tough for a platform-specific compiler to be truly "generic" in its generation of object or "intermediate" code at an appropriate "level": unless it's something like "IL" (intermediate language) code (like C#'s IL code, or Java bytecode), it's really tough for a given compiler to know "where to stop", since optimizations occur all over the place, at different levels of the compilation, when target-platform knowledge exists.
Another thought: What about compiling to "preprocessed" source code, typically with a "*.i" extension, and then compiling in a distributed manner on different architectures?
For example, most (if not all) C and C++ compilers support something like:
cc /P MyFile.cpp
gcc -E MyFile.cpp
...each generates MyFile.i, which is the preprocessed file. Now that the file has included ALL the headers and other #defines, you can compile that *.i file to the target object file (or executable) after distributing it to other systems. (You might need to get clever if your preprocessor macros are specific to the target platform, but it should be quite straightforward with your build system, which should generate the command line to do this pre-processing.)
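A sketch of the round trip, using GCC's conventional suffixes (.ii is the preprocessed-C++ extension; file names are illustrative):

g++ -E MyFile.cpp -o MyFile.ii
g++ -c MyFile.ii -o MyFile.o

The first command runs on the fully configured machine; the second can run on a build machine with no headers installed, since everything is already expanded into the .ii file.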
This is the approach used by distcc to preprocess the file locally, so remote "build farms" need not have any headers or other packages installed. (You are guaranteed to get the same build product, no matter how the machines in the build farm are configured.)
Thus, it would similarly have the effect of centralizing the "configuration/pre-processing" for a single machine, but provide cross-compiling, platform-specific compiling, or build-farm support in a distributed manner.
FYI -- I really like the distcc concept, but the last update for that particular project was in 2008. So, I'd be interested in other similar tools/products if you find them. (In the meantime, I'm writing a similar tool.)