Please note: this question is not about LLVM IR, but LLVM's MIR, an internal intermediate representation lower than the former one.
This documentation on LLVM machine code description classes says (highlighting mine):
At the high-level, LLVM code is translated to a machine specific representation formed out of MachineFunction, MachineBasicBlock, and MachineInstr instances...
However, the same paragraph goes on and says:
This representation is completely target agnostic, representing instructions in their most abstract form...
My question is, how to understand this paragraph?
I have a hard time reconciling the claim that this intermediate representation is machine specific and target agnostic at the same time. I thought "machine" and "target", in LLVM's context, mean the same thing - the instruction set architecture (e.g. x86_64, MIPS) used by the compiled executable.
Examples are welcome.
There are different ways to be platform specific. For instance, you could have a differently-named opcode for add, or perhaps with different overflow semantics, or you could use the same add for all, with the operands/flags specified by the same arguments for all target platforms, with the same default values.
And there are many target-specific details such as the size or alignment of pointers that affect your code even if they don't affect any single instruction.
Machine IR represents instructions in their most abstract form. It doesn't try to hide that on this target, pointers have 32 bits.
I want to force LLVM to generate CMPx, TEST and similar instructions on x86-64 that are at most 8 bits wide, splitting e.g. a 32-bit int comparison into four separate cmp+branch pairs. This obviously requires some bit-masking and an increased instruction count.
Can I achieve this by simply "disabling" certain instructions for x86-64 so LLVM auto-generates the required glue code? Do I have to write a pass and work on the IR myself?
No, there is no way of disabling certain instructions like this from a vanilla build of LLVM. Anything you do to achieve this will require modifying LLVM.
You have several options for modifying LLVM:
You can add an x86-specific pass to the LLVM backend (does not work on the IR) which directly expands the cmp and test instructions into chains of instructions on sub-registers. You would have to do this after instruction selection to preclude some target-independent pass from undoing the transformation. This is called an "MI" pass in LLVM parlance. As an example you can look in X86FixupSetCC.cpp (mirror here). This has a huge advantage in that you can put it behind a flag and otherwise control whether it occurs once you add the core functionality.
You can modify LLVM's instruction tables in the X86 .td files to only define these instructions for 8-bit registers, and then add the def Pat<...>; patterns to the .td files that allow programs with wider comparisons to still have their instructions selected (much as Colin suggested above). This has the disadvantage that not only have you modified your LLVM, but you also can't easily turn those modifications on and off behind some flag.
You can't do anything to LLVM's IR that will really help here because the code generator will just optimize things back into instruction patterns you're trying to avoid.
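To make the requested expansion concrete, here is a rough C sketch of what "four 8-bit compares instead of one 32-bit compare" means at the source level. The function names are made up for illustration, and this only shows the semantics: as noted above, writing it like this in source or IR does not stop the code generator from merging it back into a single wide compare.

```c
#include <stdint.h>

/* Equality of two 32-bit values using only byte-sized comparisons.
   Each iteration masks out one byte and compares it on its own. */
int eq32_bytewise(uint32_t a, uint32_t b)
{
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t x = (uint8_t)(a >> shift);
        uint8_t y = (uint8_t)(b >> shift);
        if (x != y)
            return 0;
    }
    return 1;
}

/* Unsigned less-than: walk from the most significant byte down and
   decide at the first byte that differs. */
int ult32_bytewise(uint32_t a, uint32_t b)
{
    for (int shift = 24; shift >= 0; shift -= 8) {
        uint8_t x = (uint8_t)(a >> shift);
        uint8_t y = (uint8_t)(b >> shift);
        if (x != y)
            return x < y;
    }
    return 0;
}
```

An MI pass would have to produce the machine-level equivalent of these loops (fully unrolled) on sub-registers.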
Hope this helps!
What you're probably looking to do is redefine the lowering pattern in the x86 .td files. There's code that looks like "def Pat<...>;" that defines a translation from one graph of instructions to another. There should be a pattern for going from IR comparison instructions to the x86 32bit compare instructions. You'll want to edit this pattern and instead output your sequence of comparisons.
I have compiled my code with specific flags (-Os, -O2, -march=native and their combinations) in order to produce a faster execution time.
But my problem is that I don't always run on the same machine (my lab has several different machines). Sometimes I run on macOS, sometimes on Linux (in both cases with different OS versions).
I wonder if there is a way to determine which binary will be run depending on the environment where it will run (I mean cache size, CPU cores, and other properties of the specific machine). In other words, how do I choose, when the program loads, the fastest binary (previously compiled with different target binary sizes and instruction-set extensions) for the specific machine used?
Thanks in advance.
What you're talking about is called a fat binary (not FAT, the acronym). From Wikipedia:
A fat binary (or multiarchitecture binary) is a computer executable program which has been expanded (or "fattened") with code native to multiple instruction sets which can consequently be run on multiple processor types. This results in a file larger than a normal one-architecture binary file, thus the name.
At quick glance, there doesn't seem to be much support for it (see this question from the Programmer StackExchange for more information). Apple implemented this briefly when transitioning from PowerPC to Intel, but it doesn't seem to have been explored much since then.
Technically, fat binaries refer to a single binary that could run on multiple architectures...but I imagine the premise would hold for a single binary that runs on multiple OSes. And it comes back to the point Bizkit made in his/her/zir answer - generally, you compile your source code for the environment that you're in ahead of time.
You may prebuild a bunch of executables and choose one according to an environment variable or things like uname. A better approach to the problem is to choose a toolchain that is able to perform JIT, install-time optimization and/or runtime optimization, like LLVM.
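A minimal sketch of the uname idea in C, assuming the prebuilt executables are named after the OS and machine type. The "myapp-<sysname>-<machine>" naming scheme is an assumption made up for this example, not an established convention.

```c
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Builds "myapp-<sysname>-<machine>" into out.
   Returns 0 on success, -1 if the result did not fit. */
int variant_name(const char *sysname, const char *machine,
                 char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "myapp-%s-%s", sysname, machine);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* Same, but takes the two fields from uname() for the running host,
   e.g. "Linux"/"x86_64" or "Darwin"/"arm64". */
int host_variant_name(char *out, size_t out_size)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return variant_name(u.sysname, u.machine, out, out_size);
}
```

A launcher could then exec the file with that name. Note that uname only distinguishes OS and architecture, not cache sizes or instruction-set extensions.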
Is there a reason you can't just recompile your source code on each machine? Compilers are already written and optimized exactly for this kind of stuff. Simply recompile your source code on that machine architecture and you'll have a binary that runs just fine on that machine.
If you want your code tuned for the cache size of the machine you run on, check out the way Automatically Tuned Linear Algebra Software (ATLAS) does it. When you compile it, it runs some tests to find what size to use for cache-blocking its loops, and puts that in a header file.
In the doxygen reference manual for llvm you can create a target data instance from a Module object or execution engine.
How do I get the target data for the current/native platform?
Well... usually the information to be added to TargetData can be extracted from the platform ABI document. This is where all natural sizes, alignments, etc. are specified. Sometimes if you have a compiler for your platform you can try to match all entries with your compiler.
I believe in the latter case it is possible to write a program that generates the TargetData info for you, but no one has done this before, IIRC.
Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I automate this, so that the processor is detected at runtime and the right executable is executed without the user having to choose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!
Yes, it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that loads and runs the correct library at run-time, via the entry point, depending on a config file or other information.
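On Linux and other POSIX systems the stub could look roughly like this, using dlopen and dlsym. The entry point signature is an assumption for this sketch, and the library path and symbol name are supplied by the caller; on Windows the equivalent calls would be LoadLibrary and GetProcAddress.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical common entry point that every variant library exports. */
typedef int (*entry_fn)(int argc, char **argv);

/* Loads library_path and looks up symbol in it.
   Returns the entry point, or NULL if either step fails. */
entry_fn load_entry(const char *library_path, const char *symbol)
{
    void *handle = dlopen(library_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return NULL;
    }

    entry_fn fn = NULL;
    /* POSIX idiom for converting dlsym's void* to a function pointer. */
    *(void **)(&fn) = dlsym(handle, symbol);
    if (fn == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlsym failed: %s\n", err ? err : "symbol is NULL");
        dlclose(handle);
    }
    return fn;
}
```

The stub's main would pick a path such as a library built with -msse4.2 or a generic one, call load_entry, and forward argc and argv to the result. Older glibc versions need -ldl at link time.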
Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in DLLs or shared objects, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.
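If you would rather not depend on a script engine, the /proc/cpuinfo check can also be done from a small C launcher. This is only a sketch and assumes the x86 Linux layout, where a line starting with "flags" lists the CPU features as space-separated words (ARM kernels use a "Features" line instead). The function names are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if flag appears as a whole word in line, else 0.
   Whole-word matching keeps "sse" from matching inside "sse2". */
int line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    if (len == 0)
        return 0;
    for (const char *p = line; (p = strstr(p, flag)) != NULL; ++p) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[len];
        int ends = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}

/* Scans /proc/cpuinfo for the first "flags" line.
   Returns 1 if flag is listed, 0 if not, -1 if the file cannot be read. */
int cpuinfo_has_flag(const char *flag)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return -1;
    char line[8192];
    int found = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = line_has_flag(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

For example, cpuinfo_has_flag("sse4_2") or cpuinfo_has_flag("avx2") could decide which executable or shared object to use.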
If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location, there is no need to worry about WoW64 folder redirection).
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means your 'main' executable is free to be totally separate depending on the target operating system (as detecting the CPU/OS capabilities is, by its nature, very OS specific) and then do most of the rest of your code as shared objects/DLLs.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
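One simple way to allow forcing a choice is an environment variable that overrides the detected variant. This is a sketch only: the variable name MYAPP_FORCE_VARIANT and the variant names are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names of the builds that ship with the application. */
static const char *const known_variants[] = { "generic", "sse2", "avx2" };

/* Returns forced if it names a known variant, otherwise detected. */
const char *choose_variant(const char *forced, const char *detected)
{
    if (forced != NULL) {
        size_t count = sizeof known_variants / sizeof known_variants[0];
        for (size_t i = 0; i < count; ++i) {
            if (strcmp(forced, known_variants[i]) == 0)
                return known_variants[i];
        }
    }
    return detected;
}

/* Reads the override from the (made-up) MYAPP_FORCE_VARIANT variable. */
const char *choose_variant_from_env(const char *detected)
{
    return choose_variant(getenv("MYAPP_FORCE_VARIANT"), detected);
}
```

Running with MYAPP_FORCE_VARIANT=generic on a capable machine would then exercise the 'lesser' build. Unknown values are silently ignored here; a real launcher might prefer to report them.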
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper-threading (also whether the OS knows how to schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
Optimizing your choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount of level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on Intel hardware, but if you are targeting certain ARM CPUs, some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
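For the basic feature bits, GCC and Clang ship a <cpuid.h> header whose __get_cpuid helper wraps the cpuid instruction. The sketch below reads leaf 1 and checks the bits for the extensions listed above (MMX is EDX bit 23, SSE3 is ECX bit 0, SSE4.1 is ECX bit 19, SSE4.2 is ECX bit 20). The struct and function names are made up for this example, and on non-x86 builds everything is reported as absent.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Illustrative result type; each field is 0 or 1. */
struct cpu_features {
    int mmx;
    int sse3;
    int sse41;
    int sse42;
};

struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = { 0, 0, 0, 0 };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.mmx   = (int)((edx >> 23) & 1u);
        f.sse3  = (int)((ecx >> 0) & 1u);
        f.sse41 = (int)((ecx >> 19) & 1u);
        f.sse42 = (int)((ecx >> 20) & 1u);
    }
#endif
    return f;
}
```

This only covers the CPU side. Wider extensions such as AVX also need the OS to save the extra register state, which is one of the 'nasty little issues' a ready-made library handles for you.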
(1) Be careful with this: it is hard to beat decent optimizing compilers on this.
Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can use liboil itself and not just its techniques.
Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).
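A small sketch of the function pointer approach. The function names are invented for illustration, and both variants sit in one file here to keep the example short; in a real project each would live in its own .c file compiled with its own flags, as described above.

```c
#include <stddef.h>

typedef double (*sum_fn)(const double *values, size_t count);

/* Baseline variant: plain loop. */
double sum_generic(const double *values, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

/* "Optimized" variant: four independent accumulators, standing in for
   a version built with e.g. different instruction-set flags. */
double sum_unrolled4(const double *values, size_t count)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        s0 += values[i];
        s1 += values[i + 1];
        s2 += values[i + 2];
        s3 += values[i + 3];
    }
    double total = (s0 + s1) + (s2 + s3);
    for (; i < count; ++i)
        total += values[i];
    return total;
}

/* Called once at startup with the result of whatever CPU check you use. */
sum_fn select_sum(int cpu_is_capable)
{
    return cpu_is_capable ? sum_unrolled4 : sum_generic;
}
```

The rest of the program stores the result of select_sum in a global or a struct of function pointers and calls through it, so the selection cost is paid only once.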
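Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is a sketch for Linux (determine_name_of_specific_version is left for you to implement):
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
/* Returns the file name of the variant to run; not implemented here. */
const char *determine_name_of_specific_version(void);
int main(int argc, char *argv[])
{
    char target_path[PATH_MAX];
    char **new_argv;
    const char *specific_version = determine_name_of_specific_version();
    snprintf(target_path, sizeof target_path, "/usr/lib/myapp/versions/%s", specific_version);
    /* copy argv and append the terminating NULL */
    new_argv = malloc(sizeof(char *) * (argc + 1));
    if (new_argv == NULL)
        return EXIT_FAILURE;
    memcpy(new_argv, argv, argc * sizeof(char *));
    new_argv[argc] = NULL;
    /* optionally set new_argv[0] to target_path */
    execv(target_path, new_argv);
    /* execv only returns if it failed */
    perror("execv");
    return EXIT_FAILURE;
}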
On the plus side, this approach lets you transparently provide the user with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in Cygwin); on Windows, you have to create a new process, rather than re-execing the current one.
Let's break the problem down into its two constituent parts: 1) creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at GNU Autotools (Automake, Autoconf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities needed to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exist; you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you having to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.
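Leaving the Autotools side out, the effect on the C code can be sketched. A configure check would normally define some macro (by convention something like HAVE_SSE2, which is only an assumed name) and the source selects an implementation on it. To keep the sketch self-contained it uses the compiler's own predefined __SSE2__ macro as a stand-in, together with the SSE2 intrinsics from <emmintrin.h>. The function name is made up.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). Uses SSE2 two doubles at a time
   when the build enables it, and a plain loop for the rest (or for all
   elements when SSE2 is not enabled at compile time). */
void add_arrays(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
#endif
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Note that this selects the implementation at build time for the build machine, which is what configure is designed for; it does not by itself give one binary that adapts at run time.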
You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the appropriate SSE functionality. Instead, it checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64-bit difference will require two executables. Both the ELF and PE formats store this information in the executable's header. It's not too hard to start the 32-bit version by default, check if you are on a 64-bit system, and then restart the 64-bit version. But it may be easier to create an appropriate symlink at installation time.