how to convert native machine code to llvm bytecode - llvm

Is there a machanism to convert native machine code to llvm bytecode?
If not, please give some idea how to implement it.
Thanks!

Libcpu was designed to convert machine code of various architectures to LLVM IR to facilitate writing emulators.

Related

can I use llvm ir in another language project?

I read that llvm is backend having multiple frontend, c, c++, swift ...
So, can I convert ir code based something to ir code based another?
Or, can I just use ir code in different language project?

Can LLVM-based languages be used in OS development?

As far as I have understood, LLVM doesn't let you enforce using a specific processor register. Does that mean a language that uses LLVM under the hood, can't be used for developing an OS, a bootloader or such things that for example requires direct access to registers?
Are there any other reasons why LLVM IR can or cannot be used in OS development?
LLVM is an abstract machine. As such, it does not directly allow you to access certain hardware registers. However, you can still use inline assembly (through the call asm LLVM bitcode mnemonic) or program the few functions that need access to fixed hardware registers in assembly and call them from your LLVM code.

The perks and the pipeline for an LLVM based compilation

I see that more and more people are switching to LLVM, especially people with a background in C or C++, so there is a pattern in which kind of people are approaching this compiler, what surprises me is the highly heterogeneous set of technologies that LLVM can manage, and I don't get what is pipeline that this virtual machine follows and what are the resulting benefits.
I would like to stress the fact that I'm focusing on LLVM, not really on clang.
A 1 in a million example is this one ( Youtube Video ), where the pipeline is not really obvious for me, or this other one, but apparently there a lot of totally different solutions where, for example, LLVM is used in conjunction with a JIT solution.
In short I see different syntax and semantics, people using LLVM to produce GPU shaders or binary objects, but I can't see the common denominator.
What is the meaning of "LLVM based compilation", Considering LLVM as a black box, what is the kind of input, output and the business logic in the middle ?
I can't see the common denominator.
The common denominator is converting code in one language to code in another language. And that's exactly what compilers do. So if you want to convert a piece of code in a "source language" to one in a "target language", what you need to do is:
Write a "front-end" - a component that converts from your source language to what LLVM expects as input. That language is an LLVM-specific language called "LLVM Bitcode" or "LLVM IR".
Alternatively, reuse an existing front-end - for example Clang.
Write a "back-end" - a component that converts from what LLVM emits to your target language.
Or use an existing back-end, for example LLVM's x86 back-end.
That's it. Now you get to enjoy things like the optimizations LLVM performs on the code between its input and output, its common framework for "lowering" the code to something closer to machine code, etc.
GCC behaves the same, by the way, it's just that LLVM is considered by many to be superior in some aspects, particularly licensing and ease of modification.
LLVM's advantage over other source-available compilers is that it is designed as a set of reusable libraries. That means to some degree you can pick-and-choose what to include in your tool. Not every language tool needs optimization and not every language tool needs code generation. LLVM is a very flexible system for langauge processing.
Generally when people say, "LLVM based compilation," they mean using one or more of the LLVM libraries to implement their tool. They can leverage all of the work put into LLVM in understanding its IR and generating code for multiple targets.
The LLVM IR is the common representation used by most of the LLVM libraries. It is the interface you need to write to. For low-level stuff like machine code you will need to deal with some of the other LLVM representations (MachineInstr, MC, etc.).
As for writing a frontend to generate that LLVM IR, the tricky part is ensuring that the translation from your source language to the LLVM IR preserves the semantics of the source language. The LLVM IR has a well-defined but low-level set of semantics for each instruction. If your source language has higher-level semantics you will have to lower them into LLVM IR instruction sequences to implement it. For example, there is no LLVM instruction that handles C-style bitfield access so C language frontends must use a sequence of LLVM instructions to implement the functionality (generally shifts and bitwise operations).
As long as you implement the semantics of your source language in the LLVM IR correctly, the LLVM libraries will have no problem performing correct code transformations. If some desired transformation requires higher-level semantics information than LLVM IR can provide, you either have to do the transformation in some stage before converting to LLVM IR (and so you will have the high-level information available) or you can pass attribute information in the LLVM IR to convey the high-level semantics and write a custom LLVM pass to implement the transformation. It is usually far cleaner to do the former than the latter.

Why is llvm considered unsuitable for implementing a JIT?

Many dynamic languages implement (or want to implement) a JIT Compiler in order to speed up their execution times. Inevitably, someone from the peanut gallery asks why they don't use LLVM. The answer is often, "LLVM is unsuitable for building a JIT." (For Example, Armin Rigo's comment here.)
Why is LLVM Unsuitable for building a JIT?
Note: I know LLVM has its own JIT. If LLVM used to be unsuitable, but now is suitable, please say what changed. I'm not talking about running LLVM Bytecode on the LLVM JIT, I'm talking about using the LLVM libraries to implement a JIT for a dynamic language.
Why is LLVM Unsuitable for building a JIT?
I wrote HLVM, a high-level virtual machine with a rich static type system including value types, tail call elimination, generic printing, C FFI and POSIX threads with support for both static and JIT compilation. In particular, HLVM offers incredible performance for a high-level VM. I even implemented an ML-like interactive front-end with variant types and pattern matching using the JIT compiler, as seen in this computer algebra demonstration. All of my HLVM-related work combined totals just a few weeks work (and I am not a computer scientist, just a dabbler).
I think the results speak for themselves and demonstrate unequivocally that LLVM is perfectly suitable for JIT compilation.
There are some notes about LLVM in the Unladen Swallow post-mortem blog post:
http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html .
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
There is a presentation on using LLVM as a JIT backened where the address many of the concerns raised as to why its bad, most of its seems to boil down to people building a static compiler as a JIT instead of building an actual JIT.
It takes a long time to start up is the biggest complaint - however, this is not so much of an issue if you did what Java does and start up in interpreter mode, and use LLVM to compile the most used parts of the program.
Also while there are arguments like this scattered all over the internet, Mono has been using LLVM as a JIT compiler successfully for a while now (though it's worth noting that it defaults to their own faster but less efficient backend, and they also modified parts of LLVM).
For dynamic languages, LLVM might not be the right tool, just because it was designed for optimizing system programming languages like C and C++ which are strongly/statically typed and support very low level features. In general the optimizations performed on C don't really make dynamic languages fast, because you're just creating an efficient way of running a slow system. Modern dynamic language JITs do things like inlining functions that are only known at runtime, or optimizing based on what type a variable has most of the time, which LLVM is not designed for.
Update: as of 7/2014, LLVM has added a feature called "Patch Points", which are used to support Polymorphic Inline Caches in Safari's FTL JavaScript JIT. This covers exactly the use case complained about int Armin Rigo's comment in the original question.
For a more detailed rant about the LLVM IR see here: LLVM IR is a compiler IR.

Creating a VHDL backend for LLVM?

LLVM is very modular and allows you to fairly easily define new backends. However most of the documentation/tutorials on creating an LLVM backend focus on adding a new processor instruction set and registers. I'm wondering what it would take to create a VHDL backend for LLVM? Are there examples of using LLVM to go from one higher level language to another?
Just to clarify: are there examples of translating LLVM IR to a higher level language instead of to an assembly language? For example: you could read in C with Clang, use LLVM to do some optimization and then write out code in another language like Java or maybe Fortran.
Yes !
There are many LLVM back-end targeting VHDL/Verilog around :
(open source) Legup paper
(commercial) Xilinx HLS
(online) C-to-verilog
And I know there are many others...
The interesting thing about such low-level representations as LLVM or GIMPLE (also called RTL by the the way) is that they expose static-single assignments (SSA) forms : this can be translated to hardware quite directly, as SSA can be seen as a tree of multiplexers...
There's nothing really special about the LLVM IR. It's a standard DAG with variable arity. Decompiling LLVM IR is a lot like decompiling machine language.
You might be able to leverage some frontend optimizations such as constant folding, but that sounds pretty minor compared to the whole task.
My only experience with LLVM was writing a binary translator for a class project, from a toy CISC to a custom RISC.
I'd say, since it's the closest thing to a standard IR (well, GCC GIMPLE is a close second), see if it fits with your algorithms and style and evaluate it as one alternative.
Note that GCC also started out prioritizing portability above all, and has also accomplished a lot.
I'm not sure I follow how parts of your question relate one to another.
To target LLVM into a high-level language like C is very possible and you seem to have found one reference point.
VHDL is a whole other business however. Do you consider VHDL a high-level language? It may be, but but describing hardware/logic. Sure VHDL has some constructs that you can employ to actually program in it, but it's hardly a fruitful endeavor. VHDL describes hardware and thus makes translating LLVM IR into it a very hard problem, unless of course you design a CPU with a custom instruction set in VHDL and translate LLVM IR into your instructions.
This thread was one of the first things I found while looking for the same thing.
I found a project that's rather far along that cleanly builds under/with llvm 3.5. It's pretty darn cool. It spits out HDL and does various other cool FPGA related things. While it's designed to work with TTAs and generate images for FPGA (or simulate them), it can probably also be made to do some trivial HDL generation from c functions.
It was perfect for my purposes because I wanted to upload to an Altera FPGA, and the fpga_stdout example even spits out Quartus build scripts and project files.
TTA-Based Co-design Environment
I also tried the things listed in the accepted answer and a couple others and found that they weren't going to work for me or weren't very high quality (usually both). TCE is professional feeling, but purely academic I believe. Very nice all the way around.
It seems the question was partially answered, so Iā€™d like to give it a shot:
What it would take to create a VHDL backend for LLVM?
What it would take to translate LLVM IR to a higher level language (presumably with the intention of converting between high-level langs)?
I will give you some background on 2. And expand at a later date on 1.
If you want to convert LLVM IR to a high-level language such as C or Java:
You would have to take the LLVM instructions, and abstract that out into its equivalent C code. Then you need to take the remaining features that LLVM does not have an equivalent for (like classes and abstractions for C++) and write a routine that would find those patterns in the LLVM (like reused blocks) and write C. For the basic stuff, its pretty straightforward. But, just follow the train of thought and you quickly find yourself realizing the true difficultly of the problem, after all not everyone writes simple C. To compound the difficulty further, you may not get the same LLVM IR when compiling the generated C! (Consider the resulting feedback loop)
As for Java, you are in for an even harder battle going direct from LLVM IR, and in either case still have the problem you likely won't get the same code compiling to LLVM IR, if one even can do that. Rather, you would translate LLVM IR to JVM Bytecode. Then you could use a reverse compiler to get your Java.
A group of Chinese students was apparently able to do this, but they wondered why such little interest in their research. I would say its bc they don't fully understand just what the LLVM guys have done, and how it is better than the JVM. (In fact, LLVM arguably makes the JVM obsolete ;)
Even though this seems useful in that one can use LLVM as an intermediary between C and Java to convert bidirectionally, this solution is actually of little use because we are asking the wrong question. See, the entire reason you would want that for practical purposes is to have a common code base and increase performance.
But the real problem is that we need a language that has abstracted the common features of modern languages, and that gives you a central language that you can build from. http://julialang.org/ has answered the question šŸ˜‰
Looks like the best place to start is with the CBackend in the LLVM source:
llvm/lib/Target/CBackend/CBackend.cpp
tl,dr: I don't think LLVM is the right tool
What your are looking for is way to translate LLVM code to a higher language that's what emscripten do for Javascript.
But it looks like you miss a bit the point of LLVM as it's meant to generate static code in order to achieve that they use a specific intermediate language build for that purpose.
As you can see the way emscripten works is by implementing a stack, but without using javascript as a human would have done it.
They are several project that try to achieve what you original question was, like MyHDL that turns python to VHDL or Verilog.