Creating a VHDL backend for LLVM? - llvm

LLVM is very modular and allows you to fairly easily define new backends. However most of the documentation/tutorials on creating an LLVM backend focus on adding a new processor instruction set and registers. I'm wondering what it would take to create a VHDL backend for LLVM? Are there examples of using LLVM to go from one higher level language to another?
Just to clarify: are there examples of translating LLVM IR to a higher level language instead of to an assembly language? For example: you could read in C with Clang, use LLVM to do some optimization and then write out code in another language like Java or maybe Fortran.

Yes!
There are many LLVM back ends targeting VHDL/Verilog around:
(open source) Legup paper
(commercial) Xilinx HLS
(online) C-to-verilog
And I know there are many others...
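The interesting thing about low-level representations such as LLVM IR or GCC's GIMPLE (GCC also has a separate, lower-level IR called RTL, by the way) is that they expose static single assignment (SSA) form. This can be translated to hardware quite directly: each SSA value is assigned exactly once, so it behaves like a wire, and the points where alternative values merge can be seen as a tree of multiplexers...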
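To make the "SSA as multiplexers" idea concrete, here is a minimal sketch. The tuple-based mini-IR, the `select` op and the function name `ssa_to_vhdl` are all made up for illustration; this is not LLVM's data model and not how LegUp or any other real tool is implemented. It also assumes the condition is a `std_logic` signal and that the operand types support `+` and `-` (e.g. via `numeric_std`).

```python
# Hypothetical mini-IR: each instruction is (dest, op, operands).
# "select" stands in for the choice an SSA phi node makes once the
# branch condition is known; it is not LLVM's actual representation.

def ssa_to_vhdl(instrs):
    """Emit one concurrent VHDL signal assignment per SSA value."""
    binops = {"add": "+", "sub": "-", "and": "and", "or": "or"}
    lines = []
    for dest, op, args in instrs:
        if op in binops:
            lhs, rhs = args
            lines.append(f"{dest} <= {lhs} {binops[op]} {rhs};")
        elif op == "select":
            cond, if_true, if_false = args
            # A two-way choice becomes a 2:1 multiplexer.
            # Assumes `cond` is a std_logic signal.
            lines.append(f"{dest} <= {if_true} when {cond} = '1' else {if_false};")
        else:
            raise ValueError(f"unsupported op: {op}")
    return lines


program = [
    ("t1", "add", ("a", "b")),
    ("t2", "sub", ("a", "b")),
    ("r", "select", ("c", "t1", "t2")),
]
vhdl = ssa_to_vhdl(program)
print("\n".join(vhdl))
```

Because every SSA value has exactly one definition, every generated signal has exactly one driver, which is what makes the mapping so direct. Loops, memory and timing are where the real work of a hardware back end lies, and this sketch ignores all of them.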

There's nothing really special about the LLVM IR. It's a standard DAG with variable arity. Decompiling LLVM IR is a lot like decompiling machine language.
You might be able to leverage some frontend optimizations such as constant folding, but that sounds pretty minor compared to the whole task.
My only experience with LLVM was writing a binary translator for a class project, from a toy CISC to a custom RISC.
I'd say, since it's the closest thing to a standard IR (well, GCC GIMPLE is a close second), see if it fits with your algorithms and style and evaluate it as one alternative.
Note that GCC also started out prioritizing portability above all, and has also accomplished a lot.

I'm not sure I follow how parts of your question relate one to another.
To target LLVM into a high-level language like C is very possible and you seem to have found one reference point.
VHDL is a whole other business, however. Do you consider VHDL a high-level language? It may be, but it describes hardware/logic. Sure, VHDL has some constructs that you can employ to actually program in it, but it's hardly a fruitful endeavor. VHDL describes hardware, which makes translating LLVM IR into it a very hard problem, unless of course you design a CPU with a custom instruction set in VHDL and translate LLVM IR into your instructions.

This thread was one of the first things I found while looking for the same thing.
I found a project that's rather far along that cleanly builds under/with llvm 3.5. It's pretty darn cool. It spits out HDL and does various other cool FPGA related things. While it's designed to work with TTAs and generate images for FPGA (or simulate them), it can probably also be made to do some trivial HDL generation from c functions.
It was perfect for my purposes because I wanted to upload to an Altera FPGA, and the fpga_stdout example even spits out Quartus build scripts and project files.
TTA-Based Co-design Environment
I also tried the things listed in the accepted answer and a couple others and found that they weren't going to work for me or weren't very high quality (usually both). TCE is professional feeling, but purely academic I believe. Very nice all the way around.

It seems the question was partially answered, so I'd like to give it a shot:
1. What would it take to create a VHDL backend for LLVM?
2. What would it take to translate LLVM IR to a higher level language (presumably with the intention of converting between high-level languages)?
I will give some background on 2, and expand on 1 at a later date.
If you want to convert LLVM IR to a high-level language such as C or Java:
You would have to take the LLVM instructions and abstract them into their equivalent C code. Then you need to take the remaining features that LLVM does not have an equivalent for (like classes and abstractions for C++) and write a routine that would find those patterns in the LLVM IR (like reused blocks) and write C. For the basic stuff, it's pretty straightforward. But just follow the train of thought and you quickly find yourself realizing the true difficulty of the problem; after all, not everyone writes simple C. To compound the difficulty further, you may not get the same LLVM IR when compiling the generated C! (Consider the resulting feedback loop.)
As for Java, you are in for an even harder battle going direct from LLVM IR, and in either case still have the problem you likely won't get the same code compiling to LLVM IR, if one even can do that. Rather, you would translate LLVM IR to JVM Bytecode. Then you could use a reverse compiler to get your Java.
A group of Chinese students was apparently able to do this, but they wondered why there was so little interest in their research. I would say it's because they don't fully understand just what the LLVM guys have done, and how it is better than the JVM. (In fact, LLVM arguably makes the JVM obsolete ;)
Even though this seems useful in that one can use LLVM as an intermediary between C and Java to convert bidirectionally, this solution is actually of little use because we are asking the wrong question. See, the entire reason you would want that for practical purposes is to have a common code base and increase performance.
But the real problem is that we need a language that has abstracted the common features of modern languages, and that gives you a central language that you can build from. http://julialang.org/ has answered the question 😉

Looks like the best place to start is with the CBackend in the LLVM source:
llvm/lib/Target/CBackend/CBackend.cpp

tl;dr: I don't think LLVM is the right tool.
What you are looking for is a way to translate LLVM code to a higher-level language; that's what emscripten does for JavaScript.
But it looks like you are missing the point of LLVM a bit: it's meant to generate static code, and to achieve that it uses a specific intermediate language built for that purpose.
As you can see, the way emscripten works is by implementing a stack, not by using JavaScript the way a human would.
There are several projects that try to achieve what your original question asked for, like MyHDL, which turns Python into VHDL or Verilog.

Related

The perks and the pipeline for an LLVM based compilation

I see that more and more people are switching to LLVM, especially people with a background in C or C++, so there is a pattern in the kind of people who are approaching this compiler. What surprises me is the highly heterogeneous set of technologies that LLVM can manage, and I don't get what pipeline this virtual machine follows and what the resulting benefits are.
I would like to stress the fact that I'm focusing on LLVM, not really on clang.
A one-in-a-million example is this one (YouTube video), where the pipeline is not really obvious to me, or this other one, but apparently there are a lot of totally different solutions where, for example, LLVM is used in conjunction with a JIT.
In short I see different syntax and semantics, people using LLVM to produce GPU shaders or binary objects, but I can't see the common denominator.
What is the meaning of "LLVM based compilation"? Considering LLVM as a black box, what is the kind of input, the output, and the business logic in the middle?
I can't see the common denominator.
The common denominator is converting code in one language to code in another language. And that's exactly what compilers do. So if you want to convert a piece of code in a "source language" to one in a "target language", what you need to do is:
Write a "front-end" - a component that converts from your source language to what LLVM expects as input. That language is an LLVM-specific language called "LLVM Bitcode" or "LLVM IR".
Alternatively, reuse an existing front-end - for example Clang.
Write a "back-end" - a component that converts from what LLVM emits to your target language.
Or use an existing back-end, for example LLVM's x86 back-end.
That's it. Now you get to enjoy things like the optimizations LLVM performs on the code between its input and output, its common framework for "lowering" the code to something closer to machine code, etc.
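The front-end / optimizer / back-end split can be sketched in a few lines. Everything here is a toy I made up for illustration: the tuple IR, the function names and the imaginary stack machine have nothing to do with LLVM's actual IR or APIs, and Python's own `ast` module stands in for a real parser.

```python
import ast
import operator


# Front end: source text -> tiny IR (nested tuples).
def front_end(source):
    ops = {ast.Add: "add", ast.Mult: "mul"}

    def lower(node):
        if isinstance(node, ast.Constant):
            return ("const", node.value)
        if isinstance(node, ast.Name):
            return ("var", node.id)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return (ops[type(node.op)], lower(node.left), lower(node.right))
        raise ValueError("unsupported construct")

    return lower(ast.parse(source, mode="eval").body)


# Middle: an optimization that knows nothing about source or target.
def fold_constants(ir):
    if ir[0] in ("const", "var"):
        return ir
    op, lhs, rhs = ir[0], fold_constants(ir[1]), fold_constants(ir[2])
    if lhs[0] == "const" and rhs[0] == "const":
        fn = {"add": operator.add, "mul": operator.mul}[op]
        return ("const", fn(lhs[1], rhs[1]))
    return (op, lhs, rhs)


# Back end: IR -> instructions for an imaginary stack machine.
def back_end(ir):
    if ir[0] == "const":
        return [f"push {ir[1]}"]
    if ir[0] == "var":
        return [f"load {ir[1]}"]
    return back_end(ir[1]) + back_end(ir[2]) + [ir[0]]


code = back_end(fold_constants(front_end("x * (2 + 3)")))
print(code)  # ['load x', 'push 5', 'mul']
```

Swapping `front_end` for a different source language, or `back_end` for a different target, leaves `fold_constants` untouched. That reuse of the middle is the common denominator.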
GCC behaves the same, by the way, it's just that LLVM is considered by many to be superior in some aspects, particularly licensing and ease of modification.
LLVM's advantage over other source-available compilers is that it is designed as a set of reusable libraries. That means to some degree you can pick and choose what to include in your tool. Not every language tool needs optimization and not every language tool needs code generation. LLVM is a very flexible system for language processing.
Generally when people say, "LLVM based compilation," they mean using one or more of the LLVM libraries to implement their tool. They can leverage all of the work put into LLVM in understanding its IR and generating code for multiple targets.
The LLVM IR is the common representation used by most of the LLVM libraries. It is the interface you need to write to. For low-level stuff like machine code you will need to deal with some of the other LLVM representations (MachineInstr, MC, etc.).
As for writing a frontend to generate that LLVM IR, the tricky part is ensuring that the translation from your source language to the LLVM IR preserves the semantics of the source language. The LLVM IR has a well-defined but low-level set of semantics for each instruction. If your source language has higher-level semantics you will have to lower them into LLVM IR instruction sequences to implement it. For example, there is no LLVM instruction that handles C-style bitfield access so C language frontends must use a sequence of LLVM instructions to implement the functionality (generally shifts and bitwise operations).
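As a rough illustration of the bitfield example, this is the kind of shift-and-mask sequence a front end has to spell out. The layout (field `a` in the low 3 bits, field `b` in the next 5) is an assumption for the sketch; real layouts are implementation-defined, and the helper names are hypothetical.

```python
def bitfield_read(word, offset, width):
    """Extract `width` bits starting at bit `offset`."""
    mask = (1 << width) - 1
    return (word >> offset) & mask


def bitfield_write(word, offset, width, value):
    """Return `word` with the field replaced by `value` (truncated to fit)."""
    mask = ((1 << width) - 1) << offset
    return (word & ~mask) | ((value << offset) & mask)


# Assumed layout for something like: struct { unsigned a : 3; unsigned b : 5; }
A_OFFSET, A_WIDTH = 0, 3
B_OFFSET, B_WIDTH = 3, 5

word = 0
word = bitfield_write(word, A_OFFSET, A_WIDTH, 5)   # s.a = 5
word = bitfield_write(word, B_OFFSET, B_WIDTH, 18)  # s.b = 18
print(bitfield_read(word, A_OFFSET, A_WIDTH))  # 5
print(bitfield_read(word, B_OFFSET, B_WIDTH))  # 18
```

A single source-level `s.b = 18` thus becomes a load, an AND, a shift, an OR and a store once lowered, which is the sort of expansion the answer is describing.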
As long as you implement the semantics of your source language in the LLVM IR correctly, the LLVM libraries will have no problem performing correct code transformations. If some desired transformation requires higher-level semantics information than LLVM IR can provide, you either have to do the transformation in some stage before converting to LLVM IR (and so you will have the high-level information available) or you can pass attribute information in the LLVM IR to convey the high-level semantics and write a custom LLVM pass to implement the transformation. It is usually far cleaner to do the former than the latter.

Tokenization and AST

Have a rather abstract question for you all. I'm looking at getting involved in a static code analysis project. It uses C and C++ as the language to develop in so if any code could be in either of those languages in your response that would be great.
My question:
I need to understand some of the basic concepts/constructs used to process code for static analysis. I have heard people use things like AST and tokenization etc. I was just wondering if anything can clarify how these things are applied in creating a static analysis tool? Id like more of an explanation of tokenization as I dont really understand that so well. I understand it is a sort of way to process strings but I'm not confident in that answer. Furthermore, I know that the project I'm looking at passes the code through the preprocessor before it is analyzed. Can anyone explain this? Surely if it is static code analysis it needn't be preprocessed?
Hope someone can clear this up for me.
Cheers.
Tokenization is the act of breaking source text into language elements such as operators, variable names, numbers, etc. Parsing reads sequences of tokens and builds Abstract Syntax Trees, which are a particular program representation. Tokenization and parsing are necessary for static analysis, but hardly interesting, in the same way that ante-to-the-pot is necessary to playing poker but not the interesting part of the game in any way.
If you are building a static analyzer (you imply you expect to work on one implemented in C or C++), you will need fundamental knowledge of compiling (parsing not so much, unless you are building a parser for the language to be statically analyzed), and certainly of program representations (ASTs, triples, control and data flow graphs, ...), type and property inference, and limits on analysis accuracy (the cause of conservative analyses).
The program representations are fundamental because these are the data structures that most static analysers really process; it's simply too hard to wring useful facts directly from program text. These concepts can be used to implement static analysis capabilities in any programming language; there's nothing special about implementing them in C or C++.
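Since the question asks specifically about tokenization, here is a minimal sketch of a tokenizer for a tiny C-like fragment. The token categories and regular expressions are my own simplification; a real C or C++ lexer also has to handle comments, string literals, floating-point numbers, trigraphs and much more.

```python
import re

# Token categories for a tiny C-like subset (an illustrative assumption,
# not the real C grammar).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=<>!]=?|[(){};,]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Split source text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("x = y1 + 42;")
print(tokens)
```

The parser then consumes this flat list of tokens and builds the tree; the analyzer proper works on the tree and the graphs derived from it, never on the raw characters.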
Run, don't walk, to your nearest compiler class for the first part of this. If you don't have it, you won't be able to do anything effective in tool building. The second part you will more likely find in a graduate computer science class.
If you get past that basic knowledge issue, you will either decide to implement an analysis tool from scratch, or build on existing analysis tool infrastructure. Few people decide to build one from scratch; it takes a huge amount of work (years or decades) to build robust parsers, flow analyzers, etc. needed as foundations for any particular static analysis. Mostly people try to use some existing infrastructure.
There's a huge list of candidates at: http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
If you insist on processing C or C++ and building your own custom sophisticated analysis, you really, really need a tool that can handle real C and C++ code. There are IMHO a limited number of good candidates:
GCC (and various graft ons such as Starynkevitch's MELT, about which I know little)
Clang (pretty spectacular toolset)
DMS (and its C and C++ front ends) [my company's tool]
Open64 compiler infrastructure
Rose compiler infrastructure (based on EDG's industry-wide front end)
Each one of these is a big system and requires a big investment to understand and begin to use. Don't underrate the learning curve.
There are lots of other tools out there that sort of process C and C++, but "sort of" is pretty useless for static analysis purposes.
If you intend to simply use a static analysis tool, you can avoid learning most of the parsing and program representation questions; instead you'll need to learn as much as you can about the specific analysis tool you intend to use. You'll still be a lot better off with the compiler background above because you will understand what the tool does, why it does it, and why it produces the kinds of answers that it does (usually it produces answers that are unsatisfying in a lot of ways due to the conservative limits on analysis accuracy).
Lastly, you should be clear that you understand the difference between static analysis and dynamic analysis [using data collected at runtime to determine program properties]. Most end users couldn't care less how you get information about their code, and each analysis approach has its strength and weaknesses.
If you are interested in static analysis, you should not bother about syntax analysis (i.e. lexing, parsing, building the abstract syntax tree), because that won't teach you much.
So I suggest you use some existing parsing infrastructure. This is particularly true if you want to analyze C++ source code, because just coding a C++ parser is a difficult thing to do.
For example, you could leverage the GCC compiler or the LLVM/Clang compiler. Be aware of their open source licenses: GCC is under GPLv3, and Clang/LLVM has a BSD-like license. GCC is able to handle not only C & C++, but also Fortran, Go, Ada, Objective-C.
Because of the GPLv3 license of GCC, your development on top of it should also be free software under GPLv3. However, that also means that the GCC community is quite big. If you know OCaml and are interested in static analysis of C only, you could consider Frama-C (LGPL licensed).
Actually, I am working on GCC, by providing the MELT extension framework (MELT is also GPLv3 licensed). MELT is a high-level domain specific language to extend GCC and should be the ideal choice to work on static analysis using the powerful framework and internal representations (Gimple, Tree, ...) of GCC.
(Of course, LLVM folks would be delighted to have their work used too; and nobody knows well both LLVM and GCC, because learning such big software is already a challenge, so people have to choose on which they invest their time).
So my suggestion is to join an existing static analysis or compiler effort, don't reinvent the wheel by starting from scratch (because that means years of effort, just to have a good parser).
By so doing, you will take advantage of the existing infrastructure (so you won't bother about AST-s and parsing) and you will improve some existing project.
I would be delighted if you are interested in MELT. FWIW, there is a MELT tutorial in Paris on January 24th, 2012, at the HiPEAC conference.
If you are looking into static analysis for C & C++, your first stop should be libclang, along with its associated papers.
As for why the preprocessor is run: static analysis is meant to catch errors and possible unintended code functionality (very basically), thus it must analyse what the compiler will be compiling, not what the user sees.
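A small sketch of why that matters. The function below is a deliberately crude stand-in for the C preprocessor (object-like `#define` only, no function-like macros, no `#include`, no conditionals); the name `expand_object_macros` and the sample source are made up for illustration.

```python
import re


def expand_object_macros(source):
    """Crude stand-in for cpp: handles object-like #define lines only."""
    macros = {}
    output = []
    for line in source.splitlines():
        match = re.match(r"\s*#define\s+(\w+)\s+(.*)", line)
        if match:
            macros[match.group(1)] = match.group(2).strip()
            continue
        for name, body in macros.items():
            # Replace whole-word uses of the macro with its body text.
            line = re.sub(rf"\b{re.escape(name)}\b",
                          lambda _m, body=body: body, line)
        output.append(line)
    return "\n".join(output)


source = """#define LIMIT 10 + 1
int f(int x) { return x * LIMIT; }"""
expanded = expand_object_macros(source)
print(expanded)  # int f(int x) { return x * 10 + 1; }
```

Read as written, `x * LIMIT` looks like a multiplication by 11. After expansion the compiler sees `x * 10 + 1`, and only an analyzer working on the preprocessed text can notice the precedence problem.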
I'm pretty sure @Ira Baxter will pop up with a far more complete answer :)

Writing LLVM source files vs. using APIs

I am creating an LLVM backend for a compiler. I am wondering if there is any downside to having my backend write IR code in files instead of using the APIs. The APIs are complicated (especially if one is using a language other than C++, in my case Haskell) and hard to use. The IR is much easier to understand. I don't need JIT compilation, the output code will be compiled to machine code by the standard command line tools.
The IR format changes from version to version; the API changes much less frequently. There have been cases in the past when the IR format changed dramatically, so you'd need to invest a lot of time to keep up with such changes.
Using the API is the preferable method. If it is sometimes unclear which API calls you need, you can use the cpp backend as a source of inspiration :)
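For comparison, the textual route the question describes boils down to string generation like the sketch below. The helper name `emit_binary_function` and the fixed two-argument shape are assumptions for illustration; the emitted text follows the textual IR syntax for a simple integer function and is meant to be fed to tools such as `llc`.

```python
def emit_binary_function(name, opcode, int_type="i32"):
    """Return textual LLVM IR for `name(a, b) = a <opcode> b`."""
    return "\n".join([
        f"define {int_type} @{name}({int_type} %a, {int_type} %b) {{",
        "entry:",
        f"  %result = {opcode} {int_type} %a, %b",
        f"  ret {int_type} %result",
        "}",
    ])


ir_text = emit_binary_function("add_ints", "add")
print(ir_text)
```

This is easy to write and easy to debug, which is the appeal. The cost is the one described above: every format string hard-codes the syntax of one LLVM version, and nothing checks the IR until an external tool parses it.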
As Anton said, there's a definite advantage in using the API as opposed to spitting out textual IR. I just want to address the point you raise regarding the complexity of the API and its usage from Haskell.
Note that LLVM has a C API, which (apart from being more stable) is suitable for foreign language interfaces. Python bindings exist for LLVM using this API, as well as Haskell bindings (this is easily found by Google) and for other languages as well.

Why is llvm considered unsuitable for implementing a JIT?

Many dynamic languages implement (or want to implement) a JIT Compiler in order to speed up their execution times. Inevitably, someone from the peanut gallery asks why they don't use LLVM. The answer is often, "LLVM is unsuitable for building a JIT." (For Example, Armin Rigo's comment here.)
Why is LLVM Unsuitable for building a JIT?
Note: I know LLVM has its own JIT. If LLVM used to be unsuitable, but now is suitable, please say what changed. I'm not talking about running LLVM Bytecode on the LLVM JIT, I'm talking about using the LLVM libraries to implement a JIT for a dynamic language.
Why is LLVM Unsuitable for building a JIT?
I wrote HLVM, a high-level virtual machine with a rich static type system including value types, tail call elimination, generic printing, C FFI and POSIX threads with support for both static and JIT compilation. In particular, HLVM offers incredible performance for a high-level VM. I even implemented an ML-like interactive front-end with variant types and pattern matching using the JIT compiler, as seen in this computer algebra demonstration. All of my HLVM-related work combined totals just a few weeks work (and I am not a computer scientist, just a dabbler).
I think the results speak for themselves and demonstrate unequivocally that LLVM is perfectly suitable for JIT compilation.
There are some notes about LLVM in the Unladen Swallow post-mortem blog post:
http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html .
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
There is a presentation on using LLVM as a JIT backend where they address many of the concerns raised about why it's bad; most of it seems to boil down to people building a static compiler as a JIT instead of building an actual JIT.
The biggest complaint is that it takes a long time to start up. However, this is not so much of an issue if you do what Java does: start up in interpreter mode, and use LLVM to compile only the most-used parts of the program.
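The "interpret first, compile the hot parts" strategy can be sketched as a call counter with a threshold. The class `TieredFunction`, the threshold value and the fake compile step are all hypothetical; in a real system `compile_fn` would be the expensive call into LLVM, and here a plain Python lambda stands in for the generated code.

```python
class TieredFunction:
    """Run via the interpreter until the function proves to be hot."""

    def __init__(self, interpret, compile_fn, threshold=3):
        self.interpret = interpret      # slow path, no startup cost
        self.compile_fn = compile_fn    # expensive, returns a fast callable
        self.threshold = threshold
        self.calls = 0
        self.compiled = None

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)
        self.calls += 1
        if self.calls >= self.threshold:
            # Pay the compilation cost only once the function is hot.
            self.compiled = self.compile_fn()
        return self.interpret(*args)


compile_log = []


def interpret_square(x):
    return x * x  # stands in for walking bytecode in an interpreter


def compile_square():
    compile_log.append("compiled")  # stands in for invoking the JIT
    return lambda x: x * x


square = TieredFunction(interpret_square, compile_square, threshold=3)
results = [square(n) for n in range(5)]
print(results, compile_log)
```

Code that runs once or twice never pays for compilation at all, which is how this approach sidesteps the startup-time complaint.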
Also while there are arguments like this scattered all over the internet, Mono has been using LLVM as a JIT compiler successfully for a while now (though it's worth noting that it defaults to their own faster but less efficient backend, and they also modified parts of LLVM).
For dynamic languages, LLVM might not be the right tool, just because it was designed for optimizing system programming languages like C and C++ which are strongly/statically typed and support very low level features. In general the optimizations performed on C don't really make dynamic languages fast, because you're just creating an efficient way of running a slow system. Modern dynamic language JITs do things like inlining functions that are only known at runtime, or optimizing based on what type a variable has most of the time, which LLVM is not designed for.
Update: as of 7/2014, LLVM has added a feature called "Patch Points", which are used to support Polymorphic Inline Caches in Safari's FTL JavaScript JIT. This covers exactly the use case complained about in Armin Rigo's comment in the original question.
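For readers who haven't met inline caches, here is a conceptual sketch of the idea: each call site remembers which receiver types it has seen and what the method lookup resolved to. This is only a model in plain Python with made-up names; it says nothing about how FTL or LLVM's patch points are implemented, where the cache lives in patched machine code rather than in a dictionary.

```python
class InlineCache:
    """Per-call-site cache mapping receiver type -> resolved method."""

    def __init__(self, method_name, max_entries=4):
        self.method_name = method_name
        self.max_entries = max_entries
        self.entries = {}
        self.misses = 0

    def call(self, receiver, *args):
        cls = type(receiver)
        target = self.entries.get(cls)
        if target is None:
            self.misses += 1
            target = getattr(cls, self.method_name)  # slow path: full lookup
            if len(self.entries) < self.max_entries:
                self.entries[cls] = target
        return target(receiver, *args)


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # crude integer "pi" keeps output exact


class Square:
    def __init__(self, s):
        self.s = s

    def area(self):
        return self.s * self.s


site = InlineCache("area")
shapes = [Circle(1), Square(2), Circle(3), Square(4)]
areas = [site.call(shape) for shape in shapes]
print(areas, site.misses)
```

The slow lookup happens once per type seen at the site, not once per call. This is the kind of runtime-feedback optimization the previous paragraph says C-oriented optimizers were not designed around.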
For a more detailed rant about the LLVM IR see here: LLVM IR is a compiler IR.

Current state of the art for c++ parsers?

I understand that it's a very hard thing to do, what with #ifdef, #define, and templates, but what is the state of the art of c++ parsers (be it open source, or proprietary?).
I mean, for a university project I'm thinking of creating a tool for analysing c++ code bases, but it seems very hard to find a good parser for it.
Should I give up and settle for java parsers? Similarly, what's the state of the art for java parsers? What about c#?
Also, would ripping the parser part of g++ apart from it ever work for the purposes of code analysis, or is it too much effort trying to do so?
You're in luck! Clang just started being able to parse most C++ programs within the last few months: http://clang.llvm.org/ It's one of the few open source parsers actually able to parse most of C++. (Mostly just GCC and Clang; I hear Oink(?) can get pretty good sometimes.) And it's built to be used as a library by IDEs and the like; it even has architecture built to support code rewriting.
There are some proprietary parsers that get the job done, but none of them are really usable without source access.
Regarding ripping apart GCC: that's not very practical for code analysis. Depending on what you are trying to do, you could use the new plugin architecture to get some usable information out of it. However, at a very early level in parsing it does something called term folding, where the parser itself will optimize out things like "x = x" (a simplistic example), and other parts of the compiler expect this to happen, so it's not trivial to remove. This makes GCC nearly useless for anything resembling source rewriting.
For C++ you can use GCC with the -fdump-translation-unit option (and friends) to get an AST from it.
See: http://www.manpagez.com/man/1/g++/
If you can compile something with g++, then you can get a tree from it.
The industry standard C++ parser, widely used in compilers, is EDG's C++ front end. I have no experience with this, but I understand it handles a huge variety of C++ dialects. I understand you can get it free for research purposes.
The open source standard is the GCC compiler. I hear it is difficult to understand and modify.
There's CLANG as mentioned in other answers. I have no experience here. My understanding is that it is fairly sophisticated especially in terms of supporting analysis.
Our proprietary DMS Software Reengineering Toolkit has full C++ parser with full name and type resolution, preprocessor expansion (or retention, which the other tools will not do). The C++ front end handles several dialects of C++: ANSI, GCC, MS Visual Studio. As you might guess, I have a lot of experience with this one.
DMS/CppFrontEnd has been used to carry out program analyses as well as massive program source-to-source transformations on C++ code, enabled by DMS's pattern parser, which will parse any fragment of C++ code. I believe the other C++ front ends don't provide source-to-source transformations. With those you can likely hack at the ASTs procedurally, but this is pretty inconvenient because you have to know the precise AST structure and for C++ this is pretty complicated.
DMS also has full C, Java and COBOL front ends with name and type resolution as well as control and data flow analysis. It has parsers (but not name and type analysis) for many other languages, including C#.
AFAIK, the other "C++ parsers" can't do this, sort of by definition. One can apply source-to-source transformations on any of these, or any mixture of these.
Clang is worth looking into. It's fast, and they provide APIs to hook into their backend.
Xcode 4 uses clang for tasks such as parsing, error reporting/detection in some cases, auto-completion, and fix-its.