Can all LLVM frontends produce the same bytecode - llvm

LLVM is the backend for many languages such as C#, Ruby, Zig and so on.
My question is: can all LLVM frontend languages “in principle” produce the same LLVM bytecode?
Are there any minimum requirements for a frontend to be able to produce the same byte code as say, C++?

In principle, yes… except not quite. Because the C++ language requires that the bytecode access c++ vtbls and other constructs that aren't quite the same in other languages.
One language may require that a null pointer check be inserted in a particular spot, another language makes it possible to elide that check. The output from two languages will always have tiny differences like that.

Related

Is the AST, abstract syntax tree, defined by the language or by the frontend?

In the last few weeks I have been experimenting with ASTs and Clang, in particular clang-tidy.
Clang offers some classes and way to interact with the ASTs, but what I don't understand is if the clang::VarDecl I am using so often is something named and created by the creators of Clang, or by the creators of the language.
Who decided that was to be called VarDecl?
I mean, is the AST (and all its elements) something that came from the mind of the inventor of the language and the various frontends just creates classes named after a document written by him/her or every frontend potentially creates its AST of a given source code and so Clang's and GCC's are different?
Is the AST, abstract syntax tree, defined by the language
Not fully. Each definition in the C++ language standard comes with a short syntax notation and there is a informative annex with grammar summary. But the annex notes https://eel.is/c++draft/gram :
This summary of C++ grammar is intended to be an aid to comprehension.
It is not an exact statement of the language. In particular, the
grammar described here accepts a superset of valid C++ constructs. [...]
There is no VarDecl in that grammar from standard, variable declaration just one interpretation of simple-declaration.
or by the frontend?
Internals of the compiler, if it has a frontend, or it hasn't, it has 3 or 1000 stages, are part of the compiler implementation. From the point of the language, the compiler can be implemented in any way it wants, as long as it translates valid programs correctly. Let's say generally, language specifies what should happen when, not how.
So to answer the question, AST (if used at all in any form) is defined by the compiler.
Who decided that was to be called VarDecl?
I most probably suspect Chris Lattner at https://github.com/llvm/llvm-project/commit/a11999d83a8ed1a2661feb858f0af786f2b829ad .
that came from the mind of the inventor of the language and the various frontends just creates classes named after a document written by him/her or every frontend potentially creates its AST of a given source code and so Clang's and GCC's are different?
Surely they are influenced by what is in the standard, but every compiler has its own internals. Well, in short, Clang VarDecl and GCC VAR_DECL are different, and they are also different conceptually - let's say GCC uses switch(...) case VAR_DECL: and Clang uses classes clang::VarDecl.

The perks and the pipeline for an LLVM based compilation

I see that more and more people are switching to LLVM, especially people with a background in C or C++, so there is a pattern in which kind of people are approaching this compiler, what surprises me is the highly heterogeneous set of technologies that LLVM can manage, and I don't get what is pipeline that this virtual machine follows and what are the resulting benefits.
I would like to stress the fact that I'm focusing on LLVM, not really on clang.
A 1 in a million example is this one ( Youtube Video ), where the pipeline is not really obvious for me, or this other one, but apparently there a lot of totally different solutions where, for example, LLVM is used in conjunction with a JIT solution.
In short I see different syntax and semantics, people using LLVM to produce GPU shaders or binary objects, but I can't see the common denominator.
What is the meaning of "LLVM based compilation", Considering LLVM as a black box, what is the kind of input, output and the business logic in the middle ?
I can't see the common denominator.
The common denominator is converting code in one language to code in another language. And that's exactly what compilers do. So if you want to convert a piece of code in a "source language" to one in a "target language", what you need to do is:
Write a "front-end" - a component that converts from your source language to what LLVM expects as input. That language is an LLVM-specific language called "LLVM Bitcode" or "LLVM IR".
Alternatively, reuse an existing front-end - for example Clang.
Write a "back-end" - a component that converts from what LLVM emits to your target language.
Or use an existing back-end, for example LLVM's x86 back-end.
That's it. Now you get to enjoy things like the optimizations LLVM performs on the code between its input and output, its common framework for "lowering" the code to something closer to machine code, etc.
GCC behaves the same, by the way, it's just that LLVM is considered by many to be superior in some aspects, particularly licensing and ease of modification.
LLVM's advantage over other source-available compilers is that it is designed as a set of reusable libraries. That means to some degree you can pick-and-choose what to include in your tool. Not every language tool needs optimization and not every language tool needs code generation. LLVM is a very flexible system for langauge processing.
Generally when people say, "LLVM based compilation," they mean using one or more of the LLVM libraries to implement their tool. They can leverage all of the work put into LLVM in understanding its IR and generating code for multiple targets.
The LLVM IR is the common representation used by most of the LLVM libraries. It is the interface you need to write to. For low-level stuff like machine code you will need to deal with some of the other LLVM representations (MachineInstr, MC, etc.).
As for writing a frontend to generate that LLVM IR, the tricky part is ensuring that the translation from your source language to the LLVM IR preserves the semantics of the source language. The LLVM IR has a well-defined but low-level set of semantics for each instruction. If your source language has higher-level semantics you will have to lower them into LLVM IR instruction sequences to implement it. For example, there is no LLVM instruction that handles C-style bitfield access so C language frontends must use a sequence of LLVM instructions to implement the functionality (generally shifts and bitwise operations).
As long as you implement the semantics of your source language in the LLVM IR correctly, the LLVM libraries will have no problem performing correct code transformations. If some desired transformation requires higher-level semantics information than LLVM IR can provide, you either have to do the transformation in some stage before converting to LLVM IR (and so you will have the high-level information available) or you can pass attribute information in the LLVM IR to convey the high-level semantics and write a custom LLVM pass to implement the transformation. It is usually far cleaner to do the former than the latter.

Why is llvm considered unsuitable for implementing a JIT?

Many dynamic languages implement (or want to implement) a JIT Compiler in order to speed up their execution times. Inevitably, someone from the peanut gallery asks why they don't use LLVM. The answer is often, "LLVM is unsuitable for building a JIT." (For Example, Armin Rigo's comment here.)
Why is LLVM Unsuitable for building a JIT?
Note: I know LLVM has its own JIT. If LLVM used to be unsuitable, but now is suitable, please say what changed. I'm not talking about running LLVM Bytecode on the LLVM JIT, I'm talking about using the LLVM libraries to implement a JIT for a dynamic language.
Why is LLVM Unsuitable for building a JIT?
I wrote HLVM, a high-level virtual machine with a rich static type system including value types, tail call elimination, generic printing, C FFI and POSIX threads with support for both static and JIT compilation. In particular, HLVM offers incredible performance for a high-level VM. I even implemented an ML-like interactive front-end with variant types and pattern matching using the JIT compiler, as seen in this computer algebra demonstration. All of my HLVM-related work combined totals just a few weeks work (and I am not a computer scientist, just a dabbler).
I think the results speak for themselves and demonstrate unequivocally that LLVM is perfectly suitable for JIT compilation.
There are some notes about LLVM in the Unladen Swallow post-mortem blog post:
http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html .
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
There is a presentation on using LLVM as a JIT backened where the address many of the concerns raised as to why its bad, most of its seems to boil down to people building a static compiler as a JIT instead of building an actual JIT.
It takes a long time to start up is the biggest complaint - however, this is not so much of an issue if you did what Java does and start up in interpreter mode, and use LLVM to compile the most used parts of the program.
Also while there are arguments like this scattered all over the internet, Mono has been using LLVM as a JIT compiler successfully for a while now (though it's worth noting that it defaults to their own faster but less efficient backend, and they also modified parts of LLVM).
For dynamic languages, LLVM might not be the right tool, just because it was designed for optimizing system programming languages like C and C++ which are strongly/statically typed and support very low level features. In general the optimizations performed on C don't really make dynamic languages fast, because you're just creating an efficient way of running a slow system. Modern dynamic language JITs do things like inlining functions that are only known at runtime, or optimizing based on what type a variable has most of the time, which LLVM is not designed for.
Update: as of 7/2014, LLVM has added a feature called "Patch Points", which are used to support Polymorphic Inline Caches in Safari's FTL JavaScript JIT. This covers exactly the use case complained about int Armin Rigo's comment in the original question.
For a more detailed rant about the LLVM IR see here: LLVM IR is a compiler IR.

correct me if my understanding of c++ is wrong

correct me if any of my following current understanding of c++ is wrong:
C++ is an extended version of C. Therefore, C++ is just as efficient as C.
Moreover, any application written in C can be compiled using C++ compilers
C syntax is also valid C++ syntax
C++ is at the exact same language level hierarchy as C.
Language Level Hierarchy
eg. lowest-level: assembly language,
high-levels: Java, PHP, etc
so my interpretation is that
C++/C is at a lower level than Java,PHP etc since it's closer to hardware level (and therefore because of this,it's more efficient than Java, PHP, etc), yet it is not as extreme as assembly language....but C++/C is at the same level with each other and neither one is closer to hardware level
If you start with code that's legal as both C and C++, it will typically compile to the same result with both, or close enough that efficiency is only minimally affected.
It's possible to write C that isn't allowable as C++ (e.g., using a variable with a name that's the same as one of the key words added in C++, such as new). Most such cases, however, are trivial to convert so they're allowed in C++. Probably the most difficult case to convert is code that uses function declarations instead of prototypes (or uses functions without declarations at all, which was allowed in older versions of C).
See 2 -- some syntactical C won't work as C++. As noted, it's usually trivial to convert though.
No, not really. Although C++ does provide the same low-level operations as C, it also has higher-level operations that C lacks.
C++ is at the exact same language level hierarchy as C.
Language Level Hierarchy
eg. lowest-level: assembly language, high-levels: Java, PHP, etc
Programming languages are often categorised from 1st generation (machine code), 2nd generation (assembly language), 3rd generation (imperative languages), 4th generation (definition's a bit vague - domain-specific languages intended for high productivity, e.g. SQL), 5th generation (typical language of the problem expression, e.g. maths notation, logic, or a human language; Miranda, Prolog). See e.g. http://en.wikipedia.org/wiki/Fifth-generation_programming_language and its links.
In that sense, C and C++ are both 3rd generation languages. (As Jerry points out, so are PHP, Java, PERL, Ruby, C#...). Using that yardstick, these languages belong in the same general group... they're all languages in which you have to tell the computer how to solve the problem, but not at a CPU-specific level.
In another sense though, C++ has higher level programming concepts than C, such as Object Orientation, functors, and more polymorphic features including templates and overloading, even though they're all ways to organise and access the steps for solving the problem. Higher level languages (i.e. 5GL) don't need to be told that - rather, they just need a description of the problem and knowing how to solve the entire domain of problems they find a workable approach for your specific case.
C++/C is at a lower level than Java,PHP etc since it's closer to hardware level (and therefore because of this,it's more efficient than Java, PHP, etc), yet it is not as extreme as assembly language....but C++/C is at the same level with each other and neither one is closer to hardware level
This is confusing things a bit. Summarily:
C++ and C do span lower than Java/PHP, yes.
C++ and C do tend to be more efficient, yes. You can get a general impression of this at http://benchmarksgame.alioth.debian.org/u64q/which-programs-are-fastest.html - don't take it too literally, it depends a lot on your problem space.
C++ and C both go as low as each other, but C++ has some higher level programming support too (though it's still a 3GL like C).
Let's look at a few examples:
bit shifting: Java is designed to be more portable (sometimes at the expense of performance) than C or C++, so even with JIT certain operations might be a bit inefficient on some platforms, but it may be convenient that they operate predictably. If you're doing equivalent work, and care about the edge cases where CPU behaviours differ, you'll find C and C++ leave operator behaviour for the implementation to specify. You may need to write multiple versions of the code for the different deployment platforms, only to end up getting pretty much the same performance as Java (but programs often know they won't exercise edge cases, or don't care about the behavioural differences). In that respect, Java has abstracted away a low-level concern and could reasonably be considered higher level but pessimistic.
C++ provides some higher level facilities such as templates (and hence template metaprogramming), and multiple inheritance. Compilers commonly provide low level facilities such as inline assembly and the ability to call arbitrary functions from other objects/libraries as long as the function signatures are known at compile time (some libraries work around this limitation). Interpreted (e.g. PHP) and Virtual Machine based (e.g. Java) languages tend to be worse at this.
Java also provides some higher level facilities that C++ lacks - e.g. introspection, serialisation.
Generally, I tend to conceive of C++ spanning both lower and higher than Java. Put another way, Java overlaps a section in the middle of C++'s span. But, Java has a few stand-out high-level features too.
PHP is an interpreted language that again abstracts away some low level concerns, but generally fails to provide good facilities for more abstract or robust programming techniques too. Like most interpreters, it does allow run-time evaluation of arbitrary source code, as well as run-time modification of class metadata etc., which allows a high level, powerful but dangerously unstructured approach to programming. That kind of thing isn't possible in a compiled language unless the compiler is shipped in the deployment environment (and even then there are more limitations).
C++ is an extended version of C. Therefore, C++ is just as efficient as C.
Generally true.
Moreover, any application written in C can be compiled using C++ compilers
C syntax is also valid C++ syntax
There are some trivial differences, e.g.:
in C++, main() must have return type int and implicitly returns 0 on exit if not return statement's encountered, but C allows void or int and for the latter must explicitly return an int
C++ has additional keywords (e.g. mutable, virtual, class, explicit...) that are therefore not legal C++ identifiers, but are legal in C
Still, your conception is essentially true.
1/4 and 2/3 seem to be saying very similar things, but:
Yes (Depends on what you mean by "extended", but at a broad level, yes)
Not always
Not always
Yes
Moreover, any application written in C
can be compiled using C++ compilers
Not every C program can be compiled using a C++ compiler. There are some differences between C and C++ (keywords, for example), that prevent mixing C and C++ in some ways. Stroustrup adresses some important points in C and C++: Siblings.
C++ is an extended version of C.
Therefore, C++ is just as efficient as
C.
That depends on the language features you use. I heard that using OOP might bring more cache misses than using a more C-like approach. I can't tell wether this is true or not, as I didn't read more on that subject. But it might be something which should be considered. This is only one example were performance isn't easy comparable.
This isn't exactly true, beyond extra C++ language features that are slower, there are different optimizations that can be done that will change this. Due to the better C++ type system, these are actually normally in C++'s favor however.
No, a big case is that C++ doesn't support automatic cast from void* so for instance
char* c = malloc(10); // Is valid C, but not C++
char* c = (char*)malloc(10) //Is required in C++
Except for C99 and newer C features, I think this is nearly always the case. Keep in mind this is only taking into account syntax this doesn't mean that everything that can compile in C can also compile in C++.
Could you elaborate on what you mean by this, what do you mean by "language level hierarchy"?
Summary:
True.
Dangerously false.
False.
Subjective
Some examples for 2/3:
sizeof 'a' is 1 in C++ and sizeof(int) in C.
char *s = malloc(len+1); is correct C but invalid C++.
char s[2*strlen(name)+1]; is valid (albeit dangerous) C, but invalid C++.
sizeof (1?"hello":"goodbye") issizeof(char *)` in C but 6 in C++.
Attempting to compile existing C code as C++ is simply invalid and likely to produce dangerous bugs even if you hunt down and "fix" all the compile-time errors. And writing code that's valid in both languages is perhaps a reasonable entry for a polyglot competition, but not for any serious use. The intersection of C and C++ is actually a very ugly language that's the worst of both worlds.
Your understanding is wrong in some of your points:
1) your first point is right.C++ is an extension of c.
2) second point is right . C can be compiled using c++ compilers.
3) Some of C syntax varies from c++. In c++, using structure , c should specify structure name but c++ it is not mandatory to specify structure name.Also C++ have the concept of class that is not available in c. C++ also have higher security mechanisms.
4)C is procedural language but c++ is object oriented approach. so c++ is not at the exact same language level hierarchy as c.
C language is not a subset of C++ lanaguage. Check the C99 spec for example - it will not compile in C++ compiler easily. However most of C89 source code can be copied&paste to C++ source code.
C and C++ are languages that can be implemented with "zero overhead" comparing to bare iron.
No. But most of C++ compilers are C compilers too. It means that you can compile .C and .C++ files using the same toolchain.
No, The evolution of these languages differs. See answer to question 1.
C++ is multiparadigm language. Yes, it can be used in the same way as C. But it can be used as DSL too - it provides greater abstraction level.
That's a whole big question to answer.
Not in all cases!
not true because of 3
not true
They are not exactly the same
I don't think language level hierarchy matters too much for a thing. For example, C is a high-level one compared to assembly language while it's a low-level one compared with Java/C#.

Creating a VHDL backend for LLVM?

LLVM is very modular and allows you to fairly easily define new backends. However most of the documentation/tutorials on creating an LLVM backend focus on adding a new processor instruction set and registers. I'm wondering what it would take to create a VHDL backend for LLVM? Are there examples of using LLVM to go from one higher level language to another?
Just to clarify: are there examples of translating LLVM IR to a higher level language instead of to an assembly language? For example: you could read in C with Clang, use LLVM to do some optimization and then write out code in another language like Java or maybe Fortran.
Yes !
There are many LLVM back-end targeting VHDL/Verilog around :
(open source) Legup paper
(commercial) Xilinx HLS
(online) C-to-verilog
And I know there are many others...
The interesting thing about such low-level representations as LLVM or GIMPLE (also called RTL by the the way) is that they expose static-single assignments (SSA) forms : this can be translated to hardware quite directly, as SSA can be seen as a tree of multiplexers...
There's nothing really special about the LLVM IR. It's a standard DAG with variable arity. Decompiling LLVM IR is a lot like decompiling machine language.
You might be able to leverage some frontend optimizations such as constant folding, but that sounds pretty minor compared to the whole task.
My only experience with LLVM was writing a binary translator for a class project, from a toy CISC to a custom RISC.
I'd say, since it's the closest thing to a standard IR (well, GCC GIMPLE is a close second), see if it fits with your algorithms and style and evaluate it as one alternative.
Note that GCC also started out prioritizing portability above all, and has also accomplished a lot.
I'm not sure I follow how parts of your question relate one to another.
To target LLVM into a high-level language like C is very possible and you seem to have found one reference point.
VHDL is a whole other business however. Do you consider VHDL a high-level language? It may be, but but describing hardware/logic. Sure VHDL has some constructs that you can employ to actually program in it, but it's hardly a fruitful endeavor. VHDL describes hardware and thus makes translating LLVM IR into it a very hard problem, unless of course you design a CPU with a custom instruction set in VHDL and translate LLVM IR into your instructions.
This thread was one of the first things I found while looking for the same thing.
I found a project that's rather far along that cleanly builds under/with llvm 3.5. It's pretty darn cool. It spits out HDL and does various other cool FPGA related things. While it's designed to work with TTAs and generate images for FPGA (or simulate them), it can probably also be made to do some trivial HDL generation from c functions.
It was perfect for my purposes because I wanted to upload to an Altera FPGA, and the fpga_stdout example even spits out Quartus build scripts and project files.
TTA-Based Co-design Environment
I also tried the things listed in the accepted answer and a couple others and found that they weren't going to work for me or weren't very high quality (usually both). TCE is professional feeling, but purely academic I believe. Very nice all the way around.
It seems the question was partially answered, so I’d like to give it a shot:
What it would take to create a VHDL backend for LLVM?
What it would take to translate LLVM IR to a higher level language (presumably with the intention of converting between high-level langs)?
I will give you some background on 2. And expand at a later date on 1.
If you want to convert LLVM IR to a high-level language such as C or Java:
You would have to take the LLVM instructions, and abstract that out into its equivalent C code. Then you need to take the remaining features that LLVM does not have an equivalent for (like classes and abstractions for C++) and write a routine that would find those patterns in the LLVM (like reused blocks) and write C. For the basic stuff, its pretty straightforward. But, just follow the train of thought and you quickly find yourself realizing the true difficultly of the problem, after all not everyone writes simple C. To compound the difficulty further, you may not get the same LLVM IR when compiling the generated C! (Consider the resulting feedback loop)
As for Java, you are in for an even harder battle going direct from LLVM IR, and in either case still have the problem you likely won't get the same code compiling to LLVM IR, if one even can do that. Rather, you would translate LLVM IR to JVM Bytecode. Then you could use a reverse compiler to get your Java.
A group of Chinese students was apparently able to do this, but they wondered why such little interest in their research. I would say its bc they don't fully understand just what the LLVM guys have done, and how it is better than the JVM. (In fact, LLVM arguably makes the JVM obsolete ;)
Even though this seems useful in that one can use LLVM as an intermediary between C and Java to convert bidirectionally, this solution is actually of little use because we are asking the wrong question. See, the entire reason you would want that for practical purposes is to have a common code base and increase performance.
But the real problem is that we need a language that has abstracted the common features of modern languages, and that gives you a central language that you can build from. http://julialang.org/ has answered the question 😉
Looks like the best place to start is with the CBackend in the LLVM source:
llvm/lib/Target/CBackend/CBackend.cpp
tl,dr: I don't think LLVM is the right tool
What your are looking for is way to translate LLVM code to a higher language that's what emscripten do for Javascript.
But it looks like you miss a bit the point of LLVM as it's meant to generate static code in order to achieve that they use a specific intermediate language build for that purpose.
As you can see the way emscripten works is by implementing a stack, but without using javascript as a human would have done it.
They are several project that try to achieve what you original question was, like MyHDL that turns python to VHDL or Verilog.