Why is the LLVM execution engine faster than compiled code?

Why is the LLVM execution engine faster than compiled code? - llvm

I have a compiler which targets LLVM, and I provide two ways to run the code:
Run it automatically. This mode compiles the code to LLVM and uses the ExecutionEngine JIT to compile it into machine code on-the-fly and run it without ever generating an output file.
Compile it and run separately. This mode outputs an LLVM .bc file, which I manually optimise (with opt), compile to native assembly (with llc) compile to machine code and link (with gcc), and run.
I was expecting approach #2 to be faster than approach #1, or at least the same speed, but running a few speed tests, I am surprised to find that #2 consistently runs about twice as slow. That is a huge speed difference.
Both cases are running the same LLVM source code. With approach #1, I haven't yet bothered to run any LLVM optimisation passes (which is why I was expecting it to be slower). With approach #2, I am running opt with -std-compile-opts and llc with -O3, to maximise optimisation, yet it isn't getting anywhere near #1. Here is an example run of the same program:
#1 without optimisation: 11.833s
#2 without optimisation: 22.262s
#2 with optimisation (-std-compile-opts and -O3): 18.823s
Is the ExecutionEngine doing something special that I don't know about? Is there any way for me to optimise the compiled code to achieve the same performance as the ExecutionEngine JIT?

It is normal for a VM with JIT to run some applications faster than than a compiled application. That's because a VM with JIT is like a simulator that simulates a virtual computer, and also runs a compiler in realtime. Because both tasks are built into the VM with JIT, the machine simulator can feed information to the compiler so that the code can be recompiled to run more efficiently. The information that it provides is not available to statically compiled code.
This effect has also been noted with Java VMs and with Python's PyPy VM, among others.

Another issue is aligning code and other optimizations. Nowadays cpu's are so complex that it's hard to predict which techniques will result in faster execution of final binary.
As an real-life example, let's consider Google's Native Client - I mean original nacl compilation approach, not involing LLVM (cause, as far as I know, currently there is direction on supporting both "nativeclient" and "LLVM bitcode"(modyfied) code).
As you can see on presentations (check out youtube.com) or in papers, like this Native Client: A Sandbox for Portable, Untrusted x86 Native Code, even their aligning technique makes code size bigger, in some cases such aligning of instructions (for example with noops) gives better cache hitting.
Aligning instructions with noops and instruction reordering it known in parallel computing, and here it shows it's impact as well.
I hope this answer gives an idea how much circumstances might influence on code speed execution, and that are many possible reasons for different pieces of code, and each of them needs investigation. Nevermore, it's interesting topic, so If you find some more details, don't hestitate to reedit your answer and let us know in "Post-Scriptorium", what have you found more :). (Maybe link to whitepaper/devblog with new findings :) ). Benchmarks are always welcome - take a look : http://llvm.org/OpenProjects.html#benchmark .

Related

Optimize for a specific machine / processor architecture

In this highly voted answer to a question on the performance differences between C++ and Java I learn that the JIT compiler is sometimes able to optimize better because it can determine the exact specifics of the machine (processor, cache sizes, etc.):
Generally, C# and Java can be just as fast or faster because the JIT
compiler -- a compiler that compiles your IL the first time it's
executed -- can make optimizations that a C++ compiled program cannot
because it can query the machine. It can determine if the machine is
Intel or AMD; Pentium 4, Core Solo, or Core Duo; or if supports SSE4,
etc.
A C++ program has to be compiled beforehand usually with mixed
optimizations so that it runs decently well on all machines, but is
not optimized as much as it could be for a single configuration (i.e.
processor, instruction set, other hardware).
Question: Is there a way to tell the compiler to optimize specifically for my current machine? Is there a compiler which is able to do this?

For GCC, you can use the flag -march=native. Be aware that the generated code may not run on other CPUs because
GCC uses this name to determine what kind of instructions it can emit
when generating assembly code.
So CPU specific assembly can be generated.
If you want your code to run on other CPU types, but tune it for better performance on your CPU, then you should use -mtune=native:
Specify the name of the processor to tune the performance for. The
code will be tuned as if the target processor were of the type
specified in this option, but still using instructions compatible with
the target processor specified by a -mcpu= option.

Certainly a compiler could be instructed to optimize for a specific architecture. This is true of gcc, if you look at the multitude of architecture flags that you can pass in. The same is true to a lesser extent on Visual Studio, as it has the -MACHINE option and /arch options.
However, unlike in Java, this likely means that the generated code is only (safe) to run on that hardware that is being targeted. The assertion that Java can be just as fast or faster only likely holds in the case of generically compiled C++ code. Given the target architecture, C++ code compiled for that specific architecture will likely be as fast or faster than equivalent Java code. Of course, it's much more work to support multiple architectures in this way.

What is the difference between -fprofile-use and -fauto-profile?

What is the difference between -fprofile-use and -fauto-profile?
Here's what the docs say:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
-fprofile-use
-fprofile-use=path
Enable profile feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
If path is specified, GCC looks at the path to find the profile feedback data files. See -fprofile-dir.
and underneath that
-fauto-profile
-fauto-profile=path
Enable sampling-based feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
path is the name of a file containing AutoFDO profile information. If omitted, it defaults to fbdata.afdo in the current directory.
(The list of optimizations in the [...] for -fauto-profile is longer.)

I stumbled into this thread by a path I can't even remember and am learning this stuff as I go along. But I don't like seeing an unanswered question if I could learn something from it! So I got reading.
Feedback-Directed Optimisation
As GCC say, both of these are modes of applying Feedback-Directed Optimisation. By running the program and profiling what it does, how it does it, how long it spends in which functions, etc. - we may facilitate extra, directed optimisations from the resulting data. Results from the profiler are 'fed forward' to the optimiser. Next, presumably, you can take your profile-optimised binary and profile that, then compile another FDO'd version, and so on... hence the feedback part of the name.
The real answer, the difference between these two switches, isn't very clearly documented, but it's available if we just need to look a little further.
-fprofile-use
Firstly, your quote for -fprofile-use only really states that it requires -fprofile-generate, an option that isn't very well documented: the reference from -use just tells you to read the page you're already on, where in all cases, -generate is only mentioned but never defined. Useful! But! We can refer to the answers to this question: How to use profile guided optimizations in g++?
As that answer states, and the piece of GCC's documentation in question here gently indicates... -fprofile-generate causes instrumentation to be added to the output binary. As that page summarises, an instrumented executable has stuff added to facilitate extra checks or insights during its runtime.
(The other form of instrumentation I know - and the one I've used - is the compiler add-on library UBSan, which I use via GCC's -fsanitize=undefined option. This catches bits of Undefined Behaviour at runtime. GCC with this on has revealed UB I might've otherwise taken ages to find - and made me wonder how my programs ran at all! Clang can use this library too, and maybe other compilers.)
-fauto-profile
In contrast, -fauto-profile is different. The key distinction is hinted, if not clearly, in the synopsis you quoted for it:
path is the name of a file containing AutoFDO profile information.
This mode handles profiling and subsequent optimisations using AutoFDO. To Google we go: AutoFDO The first few lines don't explain this as succinctly as possible, and I think the best summary is buried rather far down the page:
The major difference between AutoFDO [-fauto-profile] and FDO [-fprofile-use] is that AutoFDO profiles on optimized binary instead of instrumented binary. This makes it very different in handling cloned functions.
How does it do this? -fauto-profile requires you to provide profiling files written out by the Linux kernel's profiler, Perf, converted to the AutoFDO format. Perf, rather than adding instrumentation, uses hardware features of the CPU and kernel-level features of the OS to profile various statistics about a program while it's running:
perf is powerful: it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. [...] Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.
So, that lets it profile an optimised program, rather than an instrumented one. We might reasonably presume this is more representative of how your program would react in the real world - and so can facilitate gathering more useful profiling data and applying more effective optimisations as a result.
An example of how to do the legwork of tying all this together and getting -fauto-profile to do something with your program is summarised here: Feedback directed optimization with GCC and Perf
(Maybe now that I learned all this, I'll try these options out some day!)

underscore_d gives an in-depth insight into the differences.
Here is my take on it.
Performing internal profiling by compiling initially with -fprofile-generate, which integrates the profiler into the binary for the performance data collection run. Execute the binary, for 10 minutes or whatever time you think covers enough activity for the profiler to record. Recompile again instead with -fprofile-use along with -fprofile-correction if it is a multi-threaded application. Internal profiler runs causes a significant performance hit (25% in my case) which does not reflect the real world non-profiler included binary behavior, so could result in less accurate profiling, but if all activity when running the profiler enabled binary scales with the performance penalty, I guess it should not matter.
Alternatively you can use the perf tool (more error prone and effort) which is specific to your kernel (may also need kernel built to support profiling, tracing etc) to create the profiling data. This could be considered, external profiling and has negligible impact on the application performance while being profiled. You run this on the binary that you compile normally. I cannot find any studies comparing the two.
perf record -e br_inst_retired:near_taken -b -o perf.data *your_program.unstripped -program -parameters*
then without stripping the binary, convert the profiling data into something GCC understands...
create_gcov --binary=your_program.unstripped --profile=perf.data --gcov=profile.afdo
Then recompile the application using -fauto-profile. Perf and AutoFDO/create_gcov version specific issues are known to exist. I referred to https://blog.wnohang.net/index.php/2015/04/29/feedback-directed-optimization-with-gcc-and-perf/ for detailed information on this alternative profiling method.
-fprofile-use and -fauto-profile both enable many optimization options by default, in my case the unwanted -funroll-loops which I knew had negative impact on performance in my application. If your the pedantic type, you can test option combinations by including the disabling counterpart in the compile flags, in my case -fno-unroll-loops.
Using internal profiling with my program after stripping the binary, it reduced the size by 25% (compared to original non-profiler stripped binary) however I only observed sub-percentile performance gains and the previous work output fluctuations that are reported by the program log (it's a crypto currency miner) were more erratic, instead of a gradual rising and lowering between peaks and troughs in hash rates like originally.
Overall, a stab in the dark.

What happens to sse commands on a chip that doesn't support them?

Lets say I write a function using _mm_fmadd_ss, as far as I know, most(if not all) AMD chips don't support that or have their own version.
What happens to the software when ran on one of these chips? Could I compile for multiple chips? How would the program choose between them?
At first I thought "oh I'll just do preprocessor #ifdef whatever" but then I realized that only the stuff that passes those conditions during preprocessing of the code makes it to the output.

A program will usually terminate or won't execute at all, if it contains invalid instructions. One way to write portable SIMD code is to use gcc's vector extensions, but you still need to set a valid target architecture. On the other hand, if you run your code in a virtual machine, that supports the instructions, the code may run just fine, even if the host CPU does not support the instructions. I regularly test NEON code on a x86 computer under QEMU, for example. Also, people have ported BOCHS onto android and run x86 code (+SSE) on ARM-powered mobile phones.

Get optimized source code from GCC

I have a task to create optimized C++ source code and give it to friend for compilation. It means, that I do not control the final compilation, I just write the source code of C++ program.
I know, that a can make optimization during compilation with -O1 (and -O2 and others) options of GCC. But how can I get this optimized source code instead of compiled program? I am not able to configure parameters of my friend's compiler, that is why I need to make a good source on my side.

The optimizations performed by GCC are low level, that means you won't get C++ code again but assembly code in best case. But you won't be able to convert it or something.
In sum: Optimize the source code on code level, not on object level.

You could ask GCC to dump its internal (Gimple, ...) representations, at various "stages". The middle-end of GCC is made of hundreds of passes, and you could ask GCC to dump them, with arguments like -fdump-tree-all or -fdump-gimple-all; beware that you can get hundreds of dump files for a single compilation!
However, GCC internal representations are quite low level, and you should not expect to understand them without reading a lot of material.
The dump options I am mentionning are mostly useful to those working inside GCC, or extending it thru plugins coded in C or extensions coded in MELT (a high-level domain specific language to extend GCC). I am not sure they will be very useful to your friend. However, they can be useful to make you understand that optimization passes do a lot of complex processing.
And don't forget that premature optimization is evil : you should first make your program run correctly, then benchmark and profile it, at last optimize the few parts worth of your efforts. You probably won't be able to write correct & efficient programs without testing and running them yourself, before giving them to your friend.

Easy - choose the best algorithm possible, let the rest be handled by the optimizer.
Optimizing the source code is different than optimizing the binary. You optimize the source code, the compiler will optimize the binary.
For anything more than algorithm choice, you'll need to do some profiling. Sure, there are practices that can speed up code speed, but some make the code less readable. Only optimize when you have to, and after you measure.

LLVM what is it and how can i use it to cross platform compilations

I was reading here and there about llvm that can be used to ease the pain of cross platform compilations in c++ , i was trying to read the documents but i didn't understand how can i
use it in real life development problems can someone please explain me in simple words how can i use it ?

The key concept of LLVM is a low-level "intermediate" representation (IR) of your program.
This IR is at about the level of assembler code, but it contains more information to facilitate optimization.
The power of LLVM comes from its ability to defer compilation of this intermediate representation to a specific target machine until just before the code needs to run. A just-in-time (JIT) compilation approach can be used for an application to produce the code it needs just before it needs it.
In many cases, you have more information at the time the program is running that you do back at head office, so the program can be much optimized.
To get started, you could compile a C++ program to a single intermediate representation, then compile it to multiple platforms from that IR.
You can also try the Kaleidoscope demo, which walks you through creating a new language without having to actually write a compiler, just write the IR.
In performance-critical applications, the application can essentially write its own code that it needs to run, just before it needs to run it.

Why don't you go to the LLVM website and check out all the documentation there. They explain in great detail what LLVM is and how to use it. For example they have a Getting Started page.

LLVM is, as its name says a low level virtual machine which have code generator. If you want to compile to it, you can use either gcc front end or clang, which is c/c++ compiler for LLVM which is still work in progress.

It's important to note that a bunch of information about the target comes from the system header files that you use when compiling. LLVM does not defer resolving things like "size of pointer" or "byte layout" so if you compile with 64-bit headers for a little-endian platform, you cannot use that LLVM source code to target a 32-bit big-endian assembly output pater.

There is a good chapter in a book explaining everything nicely here: www.aosabook.org/en/llvm.html

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js