Is there a backend optimizer in LLVM?

I can see the optimization levels in the output of llc -help:
-O=<char> - Optimization level. [-O0, -O1, -O2, or -O3] (default = '-O2')
I want to know what the optimization does exactly.
So I'm searching for the source code of the backend optimizer.
I googled "llvm backend optimizer", but there is no information about it, only some target-independent pass source code.
In particular, I want to know what the optimization does for div-rem pairs.
It can combine two LLVM IR instructions into one assembly instruction.
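For example, for code like the following (illustrative C; the exact lowering depends on the target and optimization level):

    // A division and a remainder of the same operands. On x86-64, at -O1 and
    // above, the backend typically selects a single idiv, which yields the
    // quotient in EAX and the remainder in EDX.
    int div_and_rem(int a, int b, int *rem) {
      int q = a / b; // quotient
      *rem = a % b;  // remainder of the same a and b
      return q;
    }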

Apparently there are backend optimizer options in LLVM. They are, however, not well documented [1,2]. The TargetMachine [3] class has getOptLevel and setOptLevel functions to get and set an optimization level from 0 to 3 for a specific target machine, so starting from there you can try to track down where it is used.
[1] https://llvm.org/docs/CodeGenerator.html#ssa-based-machine-code-optimizations
[2] https://llvm.org/docs/CodeGenerator.html#late-machine-code-optimizations
[3] https://llvm.org/doxygen/classllvm_1_1TargetMachine.html
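For example, this is roughly how the level is supplied when you create a TargetMachine yourself. Treat it as a sketch: header locations and the spelling of CodeGenOpt::Level (renamed CodeGenOptLevel in recent releases) vary between LLVM versions.

    #include "llvm/MC/TargetRegistry.h"   // llvm/Support/TargetRegistry.h in older trees
    #include "llvm/Support/Host.h"        // llvm/TargetParser/Host.h in newer trees
    #include "llvm/Support/TargetSelect.h"
    #include "llvm/Target/TargetMachine.h"
    #include "llvm/Target/TargetOptions.h"
    #include <memory>
    #include <string>

    int main() {
      llvm::InitializeNativeTarget();
      llvm::InitializeNativeTargetAsmPrinter();

      std::string Error;
      std::string TripleStr = llvm::sys::getDefaultTargetTriple();
      const llvm::Target *T = llvm::TargetRegistry::lookupTarget(TripleStr, Error);
      if (!T)
        return 1;

      // The last argument is the level llc's -O flag maps to:
      // None/Less/Default/Aggressive correspond to -O0/-O1/-O2/-O3.
      std::unique_ptr<llvm::TargetMachine> TM(T->createTargetMachine(
          TripleStr, "generic", "", llvm::TargetOptions(), llvm::Reloc::PIC_, {},
          llvm::CodeGenOpt::Default));

      // getOptLevel()/setOptLevel() are the accessors mentioned above; the
      // backend consults this value when it builds its pass pipeline.
      llvm::CodeGenOpt::Level L = TM->getOptLevel();
      TM->setOptLevel(llvm::CodeGenOpt::Aggressive);
      return L == llvm::CodeGenOpt::Default ? 0 : 1;
    }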

Related

How to get the optimization level (O0, O1, O2, etc.) from an LLVM pass?

I am writing an LLVM pass. I want to bail out of my pass if the optimization level is O0 or O1. I am not able to find the right API to query the optimization level from a function pass. I tried to search for this option in the codebase and docs, but could not locate one.
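One common workaround with the new pass manager (recent LLVM only): the PassBuilder extension-point callbacks receive an llvm::OptimizationLevel, so you can capture the level when the pipeline is built and hand it to your pass. MyPass and registerMyPass below are hypothetical names, not existing LLVM API; this is a sketch, not a drop-in solution.

    #include "llvm/IR/PassManager.h"
    #include "llvm/Passes/OptimizationLevel.h"
    #include "llvm/Passes/PassBuilder.h"

    namespace {
    // Hypothetical pass that remembers the optimization level it was built with.
    struct MyPass : llvm::PassInfoMixin<MyPass> {
      llvm::OptimizationLevel Level;
      explicit MyPass(llvm::OptimizationLevel L) : Level(L) {}

      llvm::PreservedAnalyses run(llvm::Module &M, llvm::ModuleAnalysisManager &) {
        // Bail out early at -O0/-O1, as the question asks.
        if (Level.getSpeedupLevel() <= 1)
          return llvm::PreservedAnalyses::all();
        // ... real work at -O2 and above ...
        return llvm::PreservedAnalyses::all();
      }
    };
    } // namespace

    // Wherever you build the pipeline (e.g. in a pass plugin's registration
    // hook), the callback hands you the level, so forward it to the pass.
    void registerMyPass(llvm::PassBuilder &PB) {
      PB.registerPipelineStartEPCallback(
          [](llvm::ModulePassManager &MPM, llvm::OptimizationLevel OL) {
            MPM.addPass(MyPass(OL));
          });
    }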

Relationship between clang, opt, llc and llvm-linker

I looked at the source code of clang, llc, and opt a little while ago, to see how each one of them adds optimizations to the pipeline. My understanding was that clang adds the same optimizations that opt and llc have in their pipelines, by calling the same methods that opt and llc call. Also clang does not separately call opt and/or llc.
This is almost fine, except that there is a risk that at some point opt may end up with different optimizations in its pipeline (compared with clang) due to source changes made in one but not the other. The same is true for the comparison of llc and clang. Is this perception correct?
Also, I have seen charts that show the following workflow: clang, opt, llvm-linker, opt again (for IPA?), then llc. I cannot connect this workflow to what I have seen in clang. Even my understanding of LTO is that the linker (say, gold) invokes the optimizations. I cannot understand the role of llvm-linker here.
Any insight is highly appreciated.
opt, llc and llvm-linker are developer-side tools that can be used to run some of the functionality implemented in the LLVM libraries. An end user normally should never need to invoke them.
The "charts" are probably just someone's custom-built, quick-and-dirty LTO pipeline.

Force reduced width of comparison instructions

I want to force LLVM to generate CMP, TEST, and similar instructions on x86-64 at most 8 bits wide, forcing e.g. 32-bit integer comparisons into four separate cmp+branch pairs. This obviously requires some bit masking and an increased instruction count.
Can I achieve this by simply "disabling" certain instructions for x86-64 so LLVM auto-generates the required glue code? Do I have to write a pass and work on the IR myself?
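For concreteness, the decomposition I'm after looks roughly like this at the source level (purely illustrative):

    // A 32-bit equality test split into four 8-bit compares; roughly the shape
    // of code the modified backend would have to emit.
    bool eq32_via_bytes(unsigned a, unsigned b) {
      if ((unsigned char)a != (unsigned char)b)                 return false;
      if ((unsigned char)(a >> 8) != (unsigned char)(b >> 8))   return false;
      if ((unsigned char)(a >> 16) != (unsigned char)(b >> 16)) return false;
      return (unsigned char)(a >> 24) == (unsigned char)(b >> 24);
    }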
No, there is no way of disabling certain instructions like this from a vanilla build of LLVM. Anything you do to achieve this will require modifying LLVM.
You have several options for modifying LLVM:
You can add an x86-specific pass to the LLVM backend (one that does not work on the IR) which directly expands the cmp and test instructions into chains of instructions on sub-registers. You would have to do this after instruction selection to prevent some target-independent pass from undoing the transformation. This is called an "MI" pass in LLVM parlance; as an example, look at X86FixupSetCC.cpp, and a minimal skeleton is sketched below, after this answer. This approach has a huge advantage in that you can put it behind a flag and otherwise control whether it occurs once you add the core functionality.
You can modify LLVM's instruction tables in the X86 .td files to define these instructions only for 8-bit registers, and then add def Pat<...>; patterns to the .td files so that programs with wider comparisons still have their instructions selected (much as the other answer suggests). The disadvantage is that, beyond having modified your LLVM, you can't easily turn those modifications on and off behind a flag.
You can't do anything to LLVM's IR that will really help here, because the code generator will just optimize things back into the instruction patterns you're trying to avoid.
Hope this helps!
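Here is a hedged skeleton of option 1, an x86-specific MI pass in the spirit of X86FixupSetCC.cpp. The name X86NarrowCmpPass is hypothetical, the actual expansion logic is left as a TODO, and the pass would still need to be wired into X86TargetMachine's pass configuration (e.g. in addPreEmitPass).

    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineFunction.h"
    #include "llvm/CodeGen/MachineFunctionPass.h"
    #include "llvm/CodeGen/MachineInstr.h"

    using namespace llvm;

    namespace {
    class X86NarrowCmpPass : public MachineFunctionPass {
    public:
      static char ID;
      X86NarrowCmpPass() : MachineFunctionPass(ID) {}

      StringRef getPassName() const override {
        return "X86 narrow-compare expansion (sketch)";
      }

      bool runOnMachineFunction(MachineFunction &MF) override {
        bool Changed = false;
        for (MachineBasicBlock &MBB : MF)
          for (MachineInstr &MI : MBB) {
            // TODO: match CMP32rr/TEST32rr and friends, then rewrite them into
            // chains of 8-bit compares and branches on sub-registers.
            (void)MI;
          }
        return Changed;
      }
    };
    } // namespace

    char X86NarrowCmpPass::ID = 0;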
What you're probably looking to do is redefine the lowering patterns in the x86 .td files. There's code that looks like "def Pat<...>;" that defines a translation from one graph of instructions to another. There should be a pattern for going from IR comparison instructions to the x86 32-bit compare instructions. You'll want to edit this pattern so that it instead outputs your sequence of comparisons.

What is LLVM CodeGen optimization?

The ExecutionEngine class in the LLVM library has an option to set the CodeGen optimization level (CodeGenOpt::Level). Do I understand correctly that CodeGen optimizations are applied during machine code generation and are not related to the IR? If I want to optimize the IR, do I need to do it with other tools?
The optimizations that happen in the JIT when CodeGenOpt is set are a) which instruction selector is chosen (FastISel vs. SelectionDAG), and b) whether any optimizations are run during the MC-level passes.
If you want optimization on the IR level you'll need to create your own PassManager and add the passes you want to run.
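A hedged sketch of both knobs, written against the older MCJIT-era API (newer releases use ORC and the new pass manager, so names will differ): setOptLevel only affects code generation, while IR passes have to be scheduled yourself.

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/MCJIT.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/LegacyPassManager.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/TargetSelect.h"
    #include <memory>
    #include <string>

    int main() {
      llvm::InitializeNativeTarget();
      llvm::InitializeNativeTargetAsmPrinter();

      llvm::LLVMContext Ctx;
      auto M = std::make_unique<llvm::Module>("jit_module", Ctx);

      // 1) IR-level optimization: your own pass manager, independent of CodeGen.
      llvm::legacy::PassManager IRPasses;
      // Add whichever IR passes you want here, e.g. instcombine, simplifycfg,
      // GVN (their create*Pass() factories live under llvm/Transforms/).
      IRPasses.run(*M);

      // 2) CodeGen-level optimization: affects instruction selection and the
      //    machine-level passes only, never the IR.
      std::string Err;
      llvm::ExecutionEngine *EE = llvm::EngineBuilder(std::move(M))
                                      .setErrorStr(&Err)
                                      .setOptLevel(llvm::CodeGenOpt::Default)
                                      .create();
      return EE ? 0 : 1;
    }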

Why is the LLVM execution engine faster than compiled code?

I have a compiler which targets LLVM, and I provide two ways to run the code:
Run it automatically. This mode compiles the code to LLVM and uses the ExecutionEngine JIT to compile it into machine code on-the-fly and run it without ever generating an output file.
Compile it and run it separately. This mode outputs an LLVM .bc file, which I manually optimise (with opt), compile to native assembly (with llc), assemble and link (with gcc), and run.
I was expecting approach #2 to be faster than approach #1, or at least the same speed, but running a few speed tests, I am surprised to find that #2 consistently runs about twice as slow. That is a huge speed difference.
Both cases are running the same LLVM source code. With approach #1, I haven't yet bothered to run any LLVM optimisation passes (which is why I was expecting it to be slower). With approach #2, I am running opt with -std-compile-opts and llc with -O3, to maximise optimisation, yet it isn't getting anywhere near #1. Here is an example run of the same program:
#1 without optimisation: 11.833s
#2 without optimisation: 22.262s
#2 with optimisation (-std-compile-opts and -O3): 18.823s
Is the ExecutionEngine doing something special that I don't know about? Is there any way for me to optimise the compiled code to achieve the same performance as the ExecutionEngine JIT?
It is normal for a VM with a JIT to run some applications faster than a compiled application. That's because a VM with a JIT is like a simulator that simulates a virtual computer, and also runs a compiler in real time. Because both tasks are built into the VM, the machine simulator can feed information to the compiler so that the code can be recompiled to run more efficiently. The information it provides is not available to statically compiled code.
This effect has also been noted with Java VMs and with Python's PyPy VM, among others.
Another issue is code alignment and other optimizations. Nowadays CPUs are so complex that it's hard to predict which techniques will result in faster execution of the final binary.
As a real-life example, consider Google's Native Client - I mean the original NaCl compilation approach, not the one involving LLVM (because, as far as I know, there is currently a direction toward supporting both "native client" and (modified) "LLVM bitcode" code).
As you can see in presentations (check out youtube.com) or in papers such as Native Client: A Sandbox for Portable, Untrusted x86 Native Code, even though their alignment technique makes the code size bigger, in some cases such alignment of instructions (for example, with NOPs) gives better cache hit rates.
Aligning instructions with NOPs and reordering instructions are well known in parallel computing, and here they show their impact as well.
I hope this answer gives an idea of how many circumstances can influence code execution speed, and that there are many possible reasons for performance differences between pieces of code, each of which needs investigation. It's an interesting topic, so if you find more details, don't hesitate to edit your answer and let us know in a postscript what you found (maybe a link to a whitepaper or dev blog with new findings). Benchmarks are always welcome - take a look: http://llvm.org/OpenProjects.html#benchmark