Where can I find QEMU (TCG)'s / LLVM IR operational semantics?

Is anyone aware of any public description of QEMU TCG's operational semantics?
I found material about LLVM instead, but I'm not sure how comparable the two are, especially since TCG sometimes uses custom helper functions rather than micro-ops.
Any pointers would be appreciated.
Thanks.

tcg/README in the QEMU sources is the documentation for the various TCG ops. Note that the exact set and semantics of TCG ops are purely an internal implementation detail of QEMU -- they can and do change (mostly adding new ops, but occasionally clarifying or changing existing ones).
There's no relationship between LLVM IR and QEMU TCG, except that both are instances of the very common compiler design pattern of "use an architecture-agnostic intermediate representation".

Related

VM for scripted languages

I'm currently looking at different virtual machines to run a host of different scripted languages (in an embedded manner).
Two VMs that have caught my eye are:
LLVM: While I have seen posts that suggest not using LLVM as a VM, it does seem to have a lot going for it. It can do optimization, JIT, has a nice debugger already, etc. While there doesn't seem to be too much documentation on using LLVM in this manner, there is Cling, which is capable of running C++11 as an interpreted language (which is pretty impressive), as well as the command-line tool 'lli'.
libJIT: Technically this isn't a VM, but provides the necessary tools to create one.
So my questions are:
Does anyone have experience with either of these VMs, and can you share positive or negative experiences?
I've gone through a lot of the documentation for both LLVM and libJIT, but wanted to check if anyone had any recommendations for other resources (esp. for LLVM).
Are there other VMs out there that I should consider? I've done some fairly extensive searches, so this is not a question of what VMs are out there, but rather one of software that people have used and would recommend.
As for actual use of the VM, I'm intending to embed it within a C++ program to provide a scriptable user environment. I'm already using Lua for some of this, but for various reasons I want to be able to support other languages as well.
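For concreteness, the kind of embedding I mean looks roughly like this with Lua (a minimal sketch using the standard Lua C API; the host_print binding is just illustrative):

    #include <lua.hpp>   // Lua's C API with C++ linkage (Lua 5.2+ assumed)
    #include <cstdio>

    int main() {
        lua_State* L = luaL_newstate();   // one interpreter instance
        luaL_openlibs(L);                 // load the standard libraries

        // Expose a host function to scripts (a captureless lambda
        // converts implicitly to the lua_CFunction pointer type).
        lua_pushcfunction(L, [](lua_State* S) -> int {
            std::printf("host says: %s\n", luaL_checkstring(S, 1));
            return 0;                     // number of Lua return values
        });
        lua_setglobal(L, "host_print");

        // Run a snippet of user script code.
        if (luaL_dostring(L, "host_print('hello from the embedded VM')") != LUA_OK)
            std::fprintf(stderr, "%s\n", lua_tostring(L, -1));

        lua_close(L);
        return 0;
    }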
Finally, I've looked at Parrot, but I am a little hesitant to use it from some of the things I've read about it (maybe someone can convince me otherwise?).
Update: I came across http://vmkit.llvm.org, which looks like it uses LLVM to create a full-fledged VM.

Why is LLVM considered unsuitable for implementing a JIT?

Many dynamic languages implement (or want to implement) a JIT compiler in order to speed up their execution times. Inevitably, someone from the peanut gallery asks why they don't use LLVM. The answer is often, "LLVM is unsuitable for building a JIT." (For example, Armin Rigo's comment here.)
Why is LLVM Unsuitable for building a JIT?
Note: I know LLVM has its own JIT. If LLVM used to be unsuitable, but now is suitable, please say what changed. I'm not talking about running LLVM Bytecode on the LLVM JIT, I'm talking about using the LLVM libraries to implement a JIT for a dynamic language.
I wrote HLVM, a high-level virtual machine with a rich static type system including value types, tail call elimination, generic printing, C FFI and POSIX threads with support for both static and JIT compilation. In particular, HLVM offers incredible performance for a high-level VM. I even implemented an ML-like interactive front-end with variant types and pattern matching using the JIT compiler, as seen in this computer algebra demonstration. All of my HLVM-related work combined totals just a few weeks work (and I am not a computer scientist, just a dabbler).
I think the results speak for themselves and demonstrate unequivocally that LLVM is perfectly suitable for JIT compilation.
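To make "using the LLVM libraries to implement a JIT" concrete, here is a minimal sketch built on the ORC LLJIT API (the exact API has shifted between LLVM releases -- this reflects roughly LLVM 15+ -- and add1 is just an illustrative function):

    #include "llvm/ExecutionEngine/Orc/LLJIT.h"
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/TargetSelect.h"
    #include <cstdio>

    int main() {
        llvm::InitializeNativeTarget();
        llvm::InitializeNativeTargetAsmPrinter();

        auto ctx = std::make_unique<llvm::LLVMContext>();
        auto mod = std::make_unique<llvm::Module>("demo", *ctx);

        // Build IR for: int add1(int x) { return x + 1; }
        llvm::IRBuilder<> b(*ctx);
        auto* fnTy = llvm::FunctionType::get(b.getInt32Ty(), {b.getInt32Ty()}, false);
        auto* fn = llvm::Function::Create(fnTy, llvm::Function::ExternalLinkage,
                                          "add1", mod.get());
        b.SetInsertPoint(llvm::BasicBlock::Create(*ctx, "entry", fn));
        b.CreateRet(b.CreateAdd(fn->getArg(0), b.getInt32(1)));

        // JIT-compile the module and call the result like a C function.
        auto jit = llvm::cantFail(llvm::orc::LLJITBuilder().create());
        llvm::cantFail(jit->addIRModule(
            llvm::orc::ThreadSafeModule(std::move(mod), std::move(ctx))));
        auto addr = llvm::cantFail(jit->lookup("add1"));
        auto* add1 = addr.toPtr<int (*)(int)>();   // older LLVM: cast getAddress()
        std::printf("%d\n", add1(41));             // prints 42
        return 0;
    }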
There are some notes about LLVM in the Unladen Swallow post-mortem blog post:
http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
There is a presentation on using LLVM as a JIT backend where they address many of the concerns raised as to why it's bad; most of it seems to boil down to people building a static compiler as a JIT instead of building an actual JIT.
The biggest complaint is that it takes a long time to start up; however, this is not so much of an issue if you do what Java does: start in interpreter mode, and use LLVM to compile only the most-used parts of the program.
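A sketch of that tiered strategy (interpret and jitCompile are hypothetical stand-ins for a real bytecode interpreter and an LLVM-backed compiler):

    #include <cstdint>
    #include <unordered_map>

    struct Function { /* bytecode, constants, ... */ };

    // Hypothetical stand-ins for the two tiers.
    int64_t interpret(Function&) { return 0; }   // tier 0: instant startup
    using Compiled = int64_t (*)();
    Compiled jitCompile(Function&) {             // tier 1: expensive LLVM work
        return [] { return int64_t{0}; };
    }

    int64_t execute(Function& f) {
        static std::unordered_map<Function*, int>      heat;
        static std::unordered_map<Function*, Compiled> code;

        if (auto it = code.find(&f); it != code.end())
            return it->second();          // already compiled: run native code

        if (++heat[&f] > 1000)            // hot enough to pay LLVM's cost once
            return (code[&f] = jitCompile(f))();

        return interpret(f);              // cold: stay in the interpreter
    }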
Also, while there are arguments like this scattered all over the internet, Mono has been using LLVM as a JIT compiler successfully for a while now (though it's worth noting that it defaults to its own backend, which compiles faster but optimizes less, and that they also modified parts of LLVM).
For dynamic languages, LLVM might not be the right tool, just because it was designed for optimizing system programming languages like C and C++ which are strongly/statically typed and support very low level features. In general the optimizations performed on C don't really make dynamic languages fast, because you're just creating an efficient way of running a slow system. Modern dynamic language JITs do things like inlining functions that are only known at runtime, or optimizing based on what type a variable has most of the time, which LLVM is not designed for.
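For instance, "optimizing based on what type a variable has most of the time" is usually done with inline caches; a minimal monomorphic version looks roughly like this (all names hypothetical):

    #include <cstdint>

    struct TypeInfo { /* runtime type descriptor */ };
    struct Value    { const TypeInfo* type; int64_t payload; };

    // Hypothetical slow path: full dynamic lookup on every call.
    int64_t genericDispatch(const Value& v) { return v.payload; }

    using Handler = int64_t (*)(const Value&);

    struct InlineCache {
        const TypeInfo* expected = nullptr;  // type seen on earlier executions
        Handler         fast     = nullptr;  // handler specialized for that type
    };

    int64_t dispatch(InlineCache& ic, const Value& v) {
        if (ic.fast && v.type == ic.expected)  // cache hit: one compare + one call
            return ic.fast(v);
        // Miss: take the generic path. A real JIT would also re-specialize
        // the cache (or patch the machine code) for the newly seen type.
        return genericDispatch(v);
    }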
Update: as of 7/2014, LLVM has added a feature called "patch points", which are used to support polymorphic inline caches in Safari's FTL JavaScript JIT. This covers exactly the use case complained about in Armin Rigo's comment in the original question.
For a more detailed rant about the LLVM IR see here: LLVM IR is a compiler IR.

Writing a new JIT

I'm interested in starting my own JIT project in C++. I'm not that unfamiliar with assembly or compiler design, etc. But I am very unfamiliar with the resulting machine-code format - like, what does a mov instruction actually look like when all is said and done and it's time to call that function pointer? So, what are the best resources for creating such a thing?
Edit: Right now, I'm only interested in x86 on Windows, stretching a tiny bit to 64-bit Windows in the future.
You want to have a look at the processor manuals for the architecture you are interested in. Those manuals describe the opcode encoding. For x86 processors, the manuals can be downloaded from this page.
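To make the "what does a mov actually look like" part concrete: on Windows you allocate executable memory, copy the encoded bytes in, and call them through a function pointer. A minimal sketch (the two instructions are mov eax, 42 and ret, whose encodings work on both 32- and 64-bit Windows):

    #include <windows.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        // B8 id = "mov eax, imm32" (see the Intel manuals, volume 2)
        // C3    = "ret"
        unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        void* buf = VirtualAlloc(nullptr, sizeof(code),
                                 MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        std::memcpy(buf, code, sizeof(code));

        // Flip the page to executable (W^X) and flush the instruction cache.
        DWORD old;
        VirtualProtect(buf, sizeof(code), PAGE_EXECUTE_READ, &old);
        FlushInstructionCache(GetCurrentProcess(), buf, sizeof(code));

        auto fn = reinterpret_cast<int (*)()>(buf);
        std::printf("%d\n", fn());           // prints 42
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }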
Starting your project on top of LLVM might shield you from the platform details.
http://llvm.org/
LLVM is used by several dynamic language JIT compilers.
GNU lightning is a multi-architecture (x86, SPARC, PPC) library for generating code within another program. You'll need to understand general assembly-language concepts, but not at a very deep level, and you won't have to write anything architecture-specific at all. The downside to lightning (at least the last time I used it) is that the interface presented is the intersection of the features available on the supported targets: the small register set of x86, a RISC instruction set like SPARC's, and so on. The single-pass code generation is easy to use but has its own quirks; for example, you can't relocate your output buffer (because of address references), so if you run out of space you generally have to start over. The good thing is that you will probably get a working example going very quickly.
Older versions of NASM come with a fairly concise opcode reference that has x86 instruction encodings. (Looks like there's no 64-bit info, though.) I found this one using Google:
http://alien.dowling.edu/~rohit/nasmdocb.html
The official manuals say basically the same thing (and a lot more besides), but not quite so conveniently.

Creating a VHDL backend for LLVM?

LLVM is very modular and allows you to fairly easily define new backends. However, most of the documentation/tutorials on creating an LLVM backend focus on adding a new processor instruction set and registers. I'm wondering: what would it take to create a VHDL backend for LLVM? Are there examples of using LLVM to go from one higher-level language to another?
Just to clarify: are there examples of translating LLVM IR to a higher level language instead of to an assembly language? For example: you could read in C with Clang, use LLVM to do some optimization and then write out code in another language like Java or maybe Fortran.
Yes!
There are many LLVM back-ends targeting VHDL/Verilog around:
(open source) Legup paper
(commercial) Xilinx HLS
(online) C-to-verilog
And I know there are many others...
The interesting thing about low-level representations like LLVM IR or GCC's GIMPLE is that they expose static single assignment (SSA) form: this can be translated to hardware quite directly, as SSA can be seen as a tree of multiplexers...
There's nothing really special about the LLVM IR. It's a standard DAG with variable arity. Decompiling LLVM IR is a lot like decompiling machine language.
You might be able to leverage some frontend optimizations such as constant folding, but that sounds pretty minor compared to the whole task.
My only experience with LLVM was writing a binary translator for a class project, from a toy CISC to a custom RISC.
I'd say, since it's the closest thing to a standard IR (well, GCC GIMPLE is a close second), see if it fits with your algorithms and style and evaluate it as one alternative.
Note that GCC also started out prioritizing portability above all, and has also accomplished a lot.
I'm not sure I follow how parts of your question relate one to another.
Targeting LLVM IR to a high-level language like C is very possible, and you seem to have found one reference point.
VHDL is a whole other business, however. Do you consider VHDL a high-level language? It may be, but for describing hardware/logic. Sure, VHDL has some constructs you can employ to actually program in it, but it's hardly a fruitful endeavor. VHDL describes hardware, and that makes translating LLVM IR into it a very hard problem -- unless, of course, you design a CPU with a custom instruction set in VHDL and translate LLVM IR into your instructions.
This thread was one of the first things I found while looking for the same thing.
I found a project that's rather far along that cleanly builds under/with llvm 3.5. It's pretty darn cool. It spits out HDL and does various other cool FPGA related things. While it's designed to work with TTAs and generate images for FPGA (or simulate them), it can probably also be made to do some trivial HDL generation from c functions.
It was perfect for my purposes because I wanted to upload to an Altera FPGA, and the fpga_stdout example even spits out Quartus build scripts and project files.
TTA-Based Co-design Environment
I also tried the things listed in the accepted answer and a couple others and found that they weren't going to work for me or weren't very high quality (usually both). TCE is professional feeling, but purely academic I believe. Very nice all the way around.
It seems the question was partially answered, so I'd like to give it a shot:
What would it take to create a VHDL backend for LLVM?
What would it take to translate LLVM IR to a higher-level language (presumably with the intention of converting between high-level languages)?
I will give you some background on 2, and will expand on 1 at a later date.
If you want to convert LLVM IR to a high-level language such as C or Java:
You would have to take the LLVM instructions and abstract them out into equivalent C code. Then you would need to take the remaining features that LLVM does not have an equivalent for (like classes and abstractions for C++) and write a routine that finds those patterns in the LLVM IR (like reused blocks) and writes C. For the basic stuff, it's pretty straightforward. But just follow the train of thought and you quickly realize the true difficulty of the problem; after all, not everyone writes simple C. To compound the difficulty further, you may not get the same LLVM IR when compiling the generated C! (Consider the resulting feedback loop.)
As for Java, you are in for an even harder battle going directly from LLVM IR, and in either case you still have the problem that you likely won't get the same code compiling back to LLVM IR, if one can even do that. Rather, you would translate LLVM IR to JVM bytecode, then use a reverse compiler to get your Java.
A group of Chinese students was apparently able to do this, but they wondered why there was so little interest in their research. I would say it's because they don't fully understand just what the LLVM guys have done, and how it is better than the JVM. (In fact, LLVM arguably makes the JVM obsolete ;)
Even though this seems useful in that one can use LLVM as an intermediary between C and Java to convert bidirectionally, this solution is actually of little use because we are asking the wrong question. See, the entire reason you would want that for practical purposes is to have a common code base and increase performance.
But the real problem is that we need a language that has abstracted the common features of modern languages, and that gives you a central language that you can build from. http://julialang.org/ has answered the question ;)
Looks like the best place to start is with the CBackend in the LLVM source:
llvm/lib/Target/CBackend/CBackend.cpp
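For a feel of what such a backend does, here is a sketch of the instruction-by-instruction walk using LLVM's InstVisitor (only integer add and non-void ret are handled here; the real CBackend covers every instruction and type):

    #include "llvm/IR/InstVisitor.h"
    #include "llvm/Support/raw_ostream.h"

    // Run over a function F with: CPrinter().visit(F);
    struct CPrinter : llvm::InstVisitor<CPrinter> {
        void visitBinaryOperator(llvm::BinaryOperator& I) {
            if (I.getOpcode() == llvm::Instruction::Add)
                llvm::outs() << "int32_t " << I.getName() << " = "
                             << I.getOperand(0)->getName() << " + "
                             << I.getOperand(1)->getName() << ";\n";
        }
        void visitReturnInst(llvm::ReturnInst& I) {
            if (I.getReturnValue())   // skip "ret void"
                llvm::outs() << "return " << I.getReturnValue()->getName() << ";\n";
        }
    };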
tl;dr: I don't think LLVM is the right tool.
What you are looking for is a way to translate LLVM code to a higher-level language; that's what Emscripten does for JavaScript.
But it looks like you're missing the point of LLVM a bit: it's meant to generate static code, and to achieve that it uses a specific intermediate representation built for that purpose.
As you can see, the way Emscripten works is by implementing a stack, but without using JavaScript the way a human would have written it.
There are several projects that try to achieve what your original question asked for, like MyHDL, which turns Python into VHDL or Verilog.

Any ideas on how to integrate with nmap programmatically?

I'm just starting to look into how to integrate nmap, an open source security product, into some c++ code. If anyone's tried this, and has some ideas on the best approach, I'd certainly appreciate it.
Thanks for the responses. Specifically, I'd like to run a port scan (IPv6). I would definitely prefer non-GPL solutions such as a command-line or sockets interface. However, at this point I'm also looking for the fastest solution(s), as we're up against some stringent timelines, and we can backload implementing a non-GPL solution if necessary.
(note: I am not a lawyer, and none of this should be taken as legal advice)
You should probably note that Nmap considers a product that parses its output to be a derived work, according to the licensing chapter in the manual, and thus one that falls under the GPL licensing obligations. The GPLv2 does not define what a derived work is, instead letting that be up to the courts and to definitions in copyright law. The usual interpretation is that any form of linking, other than linking to system libraries included in the operating system, makes the linked work a derived work, while separate processes that talk over pipes or the network are not necessarily derived works, though as mentioned in the GPL FAQ, "if the semantics of the communication are intimate enough, exchanging complex internal data structures, that too could be a basis to consider the two parts as combined into a larger program." That seems to be the interpretation that the Nmap developers are taking.
Anyhow, assuming that you don't need to worry about the GPL, you probably want to look at the output options for Nmap; in particular, -oX for XML output and -oG for "greppable" output. If you need more control over what Nmap does, you should look into the Nmap Scripting Engine, a Lua scripting engine in Nmap that gives you all kinds of control.
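A minimal sketch of that separate-process approach from C++: run nmap with -oX - so the XML goes to stdout, and read it over a pipe (POSIX popen shown; sanitize the target string before splicing it into a shell command):

    #include <cstdio>
    #include <stdexcept>
    #include <string>

    std::string runNmapXml(const std::string& target) {
        // -6 for IPv6, -oX - for XML on stdout. `target` must be
        // sanitized/escaped by the caller to avoid shell injection.
        std::string cmd = "nmap -6 -oX - " + target;
        FILE* pipe = popen(cmd.c_str(), "r");
        if (!pipe) throw std::runtime_error("failed to launch nmap");

        std::string xml;
        char buf[4096];
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), pipe)) > 0)
            xml.append(buf, n);

        pclose(pipe);
        return xml;   // hand this to your XML parser of choice
    }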
Are you looking to use specific pieces of functionality? An easy way I've found of using nmap from other languages is to have it spit out XML using the -oX switch. There is a DTD (and numerous ways to convert it to an XSD for your favourite binding tool) that you can use to consume the data.
Given that you're writing in C++, though, it would probably be easy enough to link against it directly if you needed to.
You could always read its output through a pipe. By tradition, that is considered acceptable even when non-GPL code interacts with a GPL program.
"Integrate" doesn't really say enough. What kind of things do you want to do? Depending on the level of "integration" it might be enough to run nmap as a separate process and capture its output. The advantage there is that you can update the version of nmap without rebuilding your app. If you require tighter coupling, then it depends on what functionality you need and "library-izing" nmap, but be warned that it's GPL code and this kind of integration would require source distribution of your app..