I'm starting new research in the field of compiler optimization.
As a start, I've been reading several related papers and have come across a few different optimization techniques.
One main thing I'm currently looking at is the technique in which a compiler converts input source code into a graph (e.g. control-flow, data-flow, linked list, etc.), performs optimizations on the graph, and then produces machine code: Code-to-Graph-to-Code. Examples are the JIT compilers in JavaScript engines, e.g. V8, ChakraCore, etc.
Then I came across LLVM IR. Because of my earlier reading, my impression was that code optimization is done on a graph, as explained above. However, I don't believe that is the case for LLVM, though I'm not sure. I found that there are tools to generate a control-flow graph from LLVM IR, but that doesn't mean LLVM is optimizing the graph.
So, my question is "Is LLVM IR a graph?" If not, how does it optimize the code? Code-to-Code directly?
LLVM IR (and its backend form, Machine IR) is a traditional three-address code IR, so technically it is not a graph IR in the sense that, e.g., a sea-of-nodes IR is. But it contains several graph structures: a graph of basic blocks (the control-flow graph) and a graph of data dependencies (SSA def-use chains), both of which are used to simplify optimizations. In addition, during the instruction selection phase in the backend, the original LLVM IR is temporarily converted to a true graph IR: SelectionDAG.
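To make the "linear code that still contains graphs" point concrete, here is a minimal sketch in C++. It does not use LLVM's real classes: Instr, Block, addOperand and isDead are hypothetical names invented for illustration. Each block holds a plain instruction list, while the succs edges form the control-flow graph and the operands/users links form the def-use graph.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; these are NOT LLVM's classes, every name is hypothetical.
struct Instr {
    std::string op;               // e.g. "const", "add", "mul"
    std::vector<Instr*> operands; // values this instruction reads
    std::vector<Instr*> users;    // def-use chain: instructions reading this value
};

struct Block {
    std::string name;
    std::vector<Instr*> instrs;   // linear three-address code
    std::vector<Block*> succs;    // control-flow graph edges
};

// Record that `user` reads the value defined by `def`, keeping both
// directions of the def-use relation in sync.
void addOperand(Instr& user, Instr& def) {
    user.operands.push_back(&def);
    def.users.push_back(&user);
}

// A value with no users is dead. Side effects (stores, calls) are
// deliberately ignored in this toy.
bool isDead(const Instr& i) { return i.users.empty(); }
```

An optimization such as dead-code elimination never has to search the instruction text in this model; it only follows the users links, which is roughly the kind of simplification the answer refers to.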
Related
I'm just starting to learn about LLVM and am a bit confused about transformations and passes.
An LLVM pass is something that walks over LLVM IR, whether that IR was written by you or generated by a compiler front end. From the structure of that IR, we can do two things.
Analysis:
We derive some information about the program from the IR without changing it, for example for static analysis. Static-analysis tools such as the Clang Static Analyzer do this kind of work, although that particular tool operates on Clang's AST rather than on LLVM IR.
Transformation:
The other option is that we change the IR as we pass through it. Usually we do this to make the resulting executable better, that is, to optimize the code. This is what the LLVM documentation calls Transform Passes. Simply stated, a transformation is an operation performed by a transform pass that changes the IR into some other form when the pass runs.
More information about this can be found in the LLVM documentation on passes.
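The difference between the two kinds of pass can be sketched in a few lines of C++. This is a toy model, not the LLVM pass API: ToyInstr, ToyFunction, countFoldable and foldConstants are hypothetical names. The analysis takes the IR by const reference and only reports a fact; the transform mutates the IR and reports whether anything changed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy IR: "lhs op rhs" over integer constants only.
struct ToyInstr {
    std::string op;   // "add" or "mul"; becomes "const" once folded
    int lhs;
    int rhs;
    int value;        // meaningful only when op == "const"
};
using ToyFunction = std::vector<ToyInstr>;

// Analysis pass: reads the IR and returns information, changes nothing.
std::size_t countFoldable(const ToyFunction& f) {
    std::size_t n = 0;
    for (const ToyInstr& i : f)
        if (i.op == "add" || i.op == "mul") ++n;
    return n;
}

// Transform pass: rewrites the IR in place (constant folding here)
// and returns true if it modified anything.
bool foldConstants(ToyFunction& f) {
    bool changed = false;
    for (ToyInstr& i : f) {
        if (i.op == "add")      i.value = i.lhs + i.rhs;
        else if (i.op == "mul") i.value = i.lhs * i.rhs;
        else continue;          // already a constant, nothing to fold
        i.op = "const";
        changed = true;
    }
    return changed;
}
```

Running the transform a second time returns false, because the analysis would now find nothing left to fold.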
Object code can be disassembled into an assembly language. Is there a way to turn object code or an executable into LLVM IR?
I mean, yes, you can convert machine language to LLVM IR. The IR is Turing-complete, meaning it can compute whatever some other Turing-complete system can compute. At worst, you could have an LLVM IR representation of an x86 emulator, and just execute the machine code given as a string.
But your question specifically asked about converting "back" to IR, in the sense of the IR result being similar to the original IR. And the answer is, no, not really. The machine language code will be the result of various optimization passes, and there's no way to determine what the code looked like before that optimization. (arrowd mentioned McSema in a comment, which does its best, but in general the results will be very different from the original code.)
I am interested in analyzing a CFG of a C/C++ program where the CFG's nodes contain LLVM IR instructions. Is there any way to leverage LLVM to extract a persistent in-memory object of this CFG? I do not want to implement a pass in the compiler; I want the CFG to undergo analysis in my own program.
The LLVM IR in-memory representation is amenable to CFG analysis because all the basic blocks are already organized as a graph. Within a basic block, the instruction sequence is linear. Some interesting intra-function CFG-related code in LLVM can be found in lib/Analysis/CFG.cpp and lib/Analysis/CFGPrinter.cpp.
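As a rough illustration of what "CFG analysis in your own program" can look like once the blocks are available as a graph, here is a small reachability walk in C++. It is a sketch over a hypothetical ToyCFG type (block indices with successor lists), not LLVM's BasicBlock API, and reachableFrom is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy CFG: block i lists the indices of its successor blocks.
using ToyCFG = std::vector<std::vector<std::size_t>>;

// Returns a vector where element i is true when block i can be
// reached from `entry` by following successor edges.
std::vector<bool> reachableFrom(const ToyCFG& cfg, std::size_t entry) {
    std::vector<bool> seen(cfg.size(), false);
    if (entry >= cfg.size()) return seen;
    std::vector<std::size_t> worklist{entry};
    seen[entry] = true;
    while (!worklist.empty()) {
        std::size_t b = worklist.back();
        worklist.pop_back();
        for (std::size_t s : cfg[b]) {
            if (s < cfg.size() && !seen[s]) {  // skip malformed edges
                seen[s] = true;
                worklist.push_back(s);
            }
        }
    }
    return seen;
}
```

The instructions inside each block play no role in this traversal; only the edges between blocks matter, which is why the linear layout inside a block does not get in the way.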
As you might know, Pin is a dynamic binary instrumentation tool. Using Pin, for example, I can instrument every load and store in my application. I was wondering if there is a similar tool that injects code at compile time (using a higher level of information, and not requiring us to write an LLVM pass), rather than at runtime like Pin. I am especially interested in such a tool for LLVM.
You could write LLVM passes of your own and apply them to your code to "instrument" it at compile time. These work on LLVM IR and produce LLVM IR, so for some tasks this will be a very natural thing to do, and for other tasks it might be cumbersome or difficult (because of the differences between LLVM IR and the source language). It depends.
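The shape of such an instrumenting pass is "IR in, IR out with extra calls added". Here is a minimal sketch in C++ over a toy IR where each instruction is just an opcode string. ToyBlock, instrumentMemoryOps and the log_access hook are all hypothetical names, and a real LLVM pass would work on LLVM's instruction objects rather than strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy IR: one opcode string per instruction.
using ToyBlock = std::vector<std::string>;

// Returns a copy of `block` with a "call log_access" placed before every
// "load" and "store". The hook name log_access is made up for this sketch.
ToyBlock instrumentMemoryOps(const ToyBlock& block) {
    ToyBlock out;
    out.reserve(block.size() * 2);
    for (const std::string& op : block) {
        if (op == "load" || op == "store")
            out.push_back("call log_access");
        out.push_back(op);
    }
    return out;
}
```

Compared with Pin, the inserted calls become part of the compiled program itself, so there is nothing to inject at runtime.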
Is it theoretically and/or practically possible to compile native C++ to some sort of intermediate language which will then be compiled at run time?
Along the same lines, is "portable" the term used to denote this?
LLVM is a compiler infrastructure. Its front end parses C++ code and transforms it into an intermediate language called LLVM IR (IR stands for Intermediate Representation), which looks like a high-level assembly language. It is a machine-independent language. Generating IR is the first phase. In the next phase, the IR goes through various optimizers (called passes). The third phase then emits machine code (i.e. machine-dependent code).
It is a module-based design; the output of one phase (module) becomes the input of another. You could save the IR to disk, so that the remaining phases can resume later, maybe on an entirely different machine!
So you could generate IR and then do the rest at runtime? I've not done that myself, but LLVM seems really promising.
Here is the documentation of LLVM IR:
LLVM Language Reference Manual
This topic on Stack Overflow seems interesting, as it says:
LLVM advantages:
JIT - you can compile and run your code dynamically.
And these articles are a good read:
The Design of LLVM (on drdobs.com)
Create a working compiler with the LLVM framework, Part 1