Object code can be disassembled into assembly language. Is there a way to turn object code or an executable into LLVM IR?
I mean, yes, you can convert machine language to LLVM IR. The IR is Turing-complete, meaning it can compute whatever any other Turing-complete system can compute. At worst, you could have an LLVM IR implementation of an x86 emulator and just execute the machine code given as a string.
But your question specifically asked about converting "back" to IR, in the sense of the IR result being similar to the original IR. And the answer is, no, not really. The machine language code will be the result of various optimization passes, and there's no way to determine what the code looked like before that optimization. (arrowd mentioned McSema in a comment, which does its best, but in general the results will be very different from the original code.)
Related
I'm starting new research in the field of compiler optimization.
As a start, I'm looking into several related papers and have encountered a few different optimization techniques.
One main thing I'm currently looking at is the technique where a compiler converts the input source code into a graph (e.g. control-flow, data-flow, linked list, etc.), performs optimizations on that graph, and then produces machine code: code-to-graph-to-code. For example, the JIT compilers in JavaScript engines such as V8 and ChakraCore.
Then I came across LLVM IR. Because of my earlier reading, my impression was that code optimization is done on a graph, as explained above. However, I don't believe that is the case for LLVM, though I'm not sure. I found that there are tools to generate a control-flow graph from LLVM IR, but that doesn't mean LLVM optimizes the graph itself.
So, my question is "Is LLVM IR a graph?" If not, how does it optimize the code? Code-to-Code directly?
LLVM IR (and its back-end form, Machine IR) is a traditional three-address-code IR, so technically it is not a graph IR in the sense that, say, a sea-of-nodes IR is. But it contains several graph structures: a graph of basic blocks (the control-flow graph) and a graph of data dependencies (the SSA def-use chains), which are used to simplify optimizations. In addition, during the instruction selection phase in the backend, the LLVM IR is temporarily converted to a true graph IR, SelectionDAG.
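To make that concrete, here is a small hand-written IR fragment (a hypothetical abs-like function, purely for illustration). The br instructions define the edges of the control-flow graph, and the phi node records which predecessor block each incoming value comes from, i.e. the SSA def-use structure that passes walk:

define i32 @myabs(i32 %x) {
entry:
  %isneg = icmp slt i32 %x, 0
  br i1 %isneg, label %negate, label %done   ; CFG edges: entry -> negate, entry -> done
negate:
  %neg = sub i32 0, %x
  br label %done                             ; CFG edge: negate -> done
done:
  %r = phi i32 [ %neg, %negate ], [ %x, %entry ] ; SSA: the value depends on the path taken
  ret i32 %r
}

So the IR is stored as a list of instructions in a list of blocks, but the graph structure (block edges, def-use edges) is always available to the optimizer.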
I'm just starting to learn about LLVM and am a bit confused about transformations and passes.
An LLVM pass is something that walks over LLVM IR, whether that IR was generated by you or by a compiler front end. From the structure of that IR, we can do two things.
Analysis:
From the IR we derive some sort of information about the program, for example for static analysis. The Clang static analyzer is an example of such a tool.
Transformation:
The other option is that we change the IR as we pass through it: we make a transformation. Usually we do this to make the resulting executable better, i.e. we optimize the code. This is what the LLVM documentation calls Transform Passes. Simply stated, a transformation is an operation carried out by a transform pass that rewrites the IR into some other form as the pass executes.
More information about this can be found here: LLVM passes.
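As a small, hedged illustration (assuming a recent LLVM with the new pass manager syntax; the file names are placeholders), take this IR:

define i32 @f(i32 %x) {
entry:
  %y = add i32 %x, 0   ; redundant: x + 0 == x
  ret i32 %y
}

Running a transform pass over it:

opt -passes=instcombine example.ll -S -o example.opt.ll

folds the add away so the function simply returns %x; the IR has been transformed. An analysis pass, by contrast, would only compute information about this function (dominators, aliasing, instruction counts) and leave the IR unchanged.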
Is it theoretically and/or practically possible to compile native C++ to some sort of intermediate language which will then be compiled at run time?
Along the same lines, is "portable" the term used to denote this?
LLVM, which is a compiler infrastructure, parses C++ code and transforms it into an intermediate language called LLVM IR (IR stands for Intermediate Representation), which looks like a high-level assembly language. It is a machine-independent language. Generating IR is one phase. In the next phase, the IR goes through various optimizers (called passes). It then reaches the third phase, which emits machine code (i.e. machine-dependent code).
It is a modular design: the output of one phase becomes the input of the next. You could save the IR to disk, so that the remaining phases can resume later, maybe on an entirely different machine!
So yes, you could generate IR and then do the rest of the work at runtime. I've not done that myself, but LLVM seems really promising.
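As a rough sketch of those phases with the standard command-line tools (the file names are placeholders, and exact flags can differ between LLVM versions):

clang++ -S -emit-llvm foo.cpp -o foo.ll
opt -O2 foo.ll -S -o foo.opt.ll
llc foo.opt.ll -o foo.s

The first command emits the machine-independent IR, the second runs optimization passes over it, and the third lowers it to target assembly. The .ll file in the middle is exactly what you could save to disk and finish compiling later, possibly on another machine.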
Here is the documentation of LLVM IR:
LLVM Language Reference Manual
This topic on Stack Overflow seems interesting, as it says:
LLVM advantages:
JIT - you can compile and run your code dynamically.
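If you want to try that JIT behaviour yourself, LLVM ships a tool called lli that executes IR directly; a minimal sketch with a trivial C file (simple programs are the easiest to run this way):

clang -S -emit-llvm hello.c -o hello.ll
lli hello.ll

lli JIT-compiles the IR and runs its main function in-process, which is handy for quick experiments.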
And these articles are a good read:
The Design of LLVM (on drdobbs.com)
Create a working compiler with the LLVM framework, Part 1
Is there an easy way of going from LLVM IR to working source code?
Specifically, I'd like to start with some simple C++ code that merely modifies PODs (mainly arrays of ints, floats, etc.), convert it to LLVM IR, perform some simple analysis and translation on it, and then convert it back into C++ code.
I don't really mind about any of the names getting mangled; I'd just like to be able to hack about with the source before doing the machine-dependent optimisations.
There are a number of options, actually. The two that you'll probably be interested in are -march=c and -march=cpp, which are options to llc.
Run:
llc -march=c -o code.c code.ll
This will convert the LLVM IR in code.ll back to C and put it in code.c.
Also:
llc -march=cpp -o code.cpp code.ll
This is different from the C output engine. It will actually write out C++ code that can be run to reconstruct the IR. I use this personally to embed LLVM IR in a program without having to deal with parsing bitcode files or anything.
-march=cpp has more options you can see with llc --help, such as -cppgen= which controls how much of the IR the output C++ reconstructs.
CppBackend was removed. There has been no -march=cpp or -march=c option since 2016-05-05, r268631.
There is an issue here... it might not be possible to easily translate the IR back into the original language.
I mean, you'll probably be able to get some representation, but it might be less readable.
The issue is that the IR is not concerned with high-level semantics, and without them...
I'd rather advise you to learn to read the IR. I can read a bit of it without that much effort, and I am far from being an LLVM expert.
Otherwise, you can generate C code from the IR. It won't be much more similar to your C++ code, but you'll perhaps feel better without SSA form and phi nodes.
I was reading here and there that LLVM can be used to ease the pain of cross-platform compilation in C++. I was trying to read the documents, but I didn't understand how I can
use it for real-life development problems. Can someone please explain to me, in simple words, how I can use it?
The key concept of LLVM is a low-level "intermediate" representation (IR) of your program.
This IR is at about the level of assembler code, but it contains more information to facilitate optimization.
The power of LLVM comes from its ability to defer compilation of this intermediate representation to a specific target machine until just before the code needs to run. A just-in-time (JIT) compilation approach can be used for an application to produce the code it needs just before it needs it.
In many cases, you have more information at the time the program is running than you do back at head office, so the code can be optimized much more effectively.
To get started, you could compile a C++ program to a single intermediate representation, then compile it to multiple platforms from that IR.
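A hedged sketch of that workflow (the target triples shown are just common examples):

clang++ -S -emit-llvm app.cpp -o app.ll
llc -mtriple=x86_64-unknown-linux-gnu app.ll -o app.x86_64.s
llc -mtriple=aarch64-unknown-linux-gnu app.ll -o app.aarch64.s

In practice the IR a C++ front end emits already encodes some target details (pointer sizes, struct layout), so true write-once-retarget-anywhere needs care, but this is the basic mechanism.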
You can also try the Kaleidoscope tutorial, which walks you through creating a new language without having to write a full compiler back end; you just generate the IR.
In performance-critical applications, the application can essentially write its own code that it needs to run, just before it needs to run it.
Why don't you go to the LLVM website and check out all the documentation there? They explain in great detail what LLVM is and how to use it. For example, they have a Getting Started page.
LLVM is, as its name says, a low-level virtual machine, and it has a code generator. If you want to compile to it, you can use either the GCC front end or Clang, the C/C++ compiler for LLVM, which is still a work in progress.
It's important to note that a bunch of information about the target comes from the system header files that you use when compiling. LLVM does not defer resolving things like "size of pointer" or "byte layout", so if you compile with 64-bit headers for a little-endian platform, you cannot use that LLVM code to emit 32-bit big-endian assembly output later.
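You can see this baked-in information at the top of any IR file clang produces, for example (the exact strings vary by target and LLVM version):

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

Pointer widths, alignments, and endianness (the leading "e" means little-endian) are fixed here, and the sizes chosen by the headers are already burned into the IR's types, which is why the same IR cannot simply be re-lowered for a different platform.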
There is a good chapter in a book explaining everything nicely here: www.aosabook.org/en/llvm.html