Converting LLVM IR to Java Bytecode - llvm

I am a beginner and want to build a translator that can convert LLVM bitcode to Java bytecode.
Can somebody please tell me briefly, or list the major steps, how to go about it?

In our company (Altimesh), we did the same thing for CIL. For Java Bytecode, the task is likely very similar.
I can tell you it's quite a long task.
First thing: the LLVM libraries are written in C++.
That means you either have to learn C++ and find a way to generate Java bytecode from C++, or export the symbols you need from the LLVM libraries through JNI. I strongly recommend the second option, as you'll get a pure Java implementation (and you'll soon figure out that you don't need that many symbols from the LLVM API).
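To give an idea of what that looks like, here is a minimal sketch of a single JNI export. The Java class name com.example.LlvmBridge and the method name are made up for illustration; the LLVM side uses the ordinary IR-parsing API:

#include <jni.h>
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"

extern "C" JNIEXPORT jlong JNICALL
Java_com_example_LlvmBridge_parseModule(JNIEnv* env, jclass, jstring path)
{
    const char* filename = env->GetStringUTFChars(path, nullptr);
    static llvm::LLVMContext context;          // one context kept alive for the process (simplification)
    llvm::SMDiagnostic diag;
    llvm::Module* module = llvm::parseIRFile(filename, diag, context).release();
    env->ReleaseStringUTFChars(path, filename);
    return reinterpret_cast<jlong>(module);    // opaque handle handed back to Java
}

On the Java side you would declare the corresponding native method returning long and load the native library with System.loadLibrary.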
Once you've figured that out, you need to:
Parse modules from files
Here is a simple example (using the LLVM 3.9 API, which is quite old now):
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/SourceMgr.h"

llvm::Module* llvm__Module_buildFromFile (llvm::LLVMContext* context, const char* filename)
{
    // Read the file, then parse it as IR (handles both bitcode and textual IR).
    llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> buf = llvm::MemoryBuffer::getFile(filename);
    if (!buf) return nullptr; // could not open the file
    llvm::SMDiagnostic diag;
    return llvm::parseIR(buf->get()->getMemBufferRef(), diag, *context).release();
}
Parse debug info
#include "llvm/IR/DebugInfo.h"

void llvm__DebugInfoFinder__processModule(llvm::DebugInfoFinder* self, llvm::Module* M)
{
    // Walk the module once and collect compile units, subprograms, types, ...
    self->processModule(*M);
}
Debug info, or metadata in general, is quite a pain with LLVM, as it changes very frequently (compared to instructions). So you either have to stick to one LLVM version (probably a bad choice), or update your code as soon as a new LLVM release comes out.
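As an illustration of what you can pull out once the finder has run, here is a small sketch that lists the subprograms it found (written against the same 3.9-era API; accessor names move around between releases):

#include "llvm/IR/DebugInfo.h"
#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/Support/raw_ostream.h"

// Dump the functions (subprograms) the finder discovered, with their source locations.
void dumpSubprograms(llvm::DebugInfoFinder& finder)
{
    for (const llvm::DISubprogram* sp : finder.subprograms()) {
        llvm::outs() << sp->getName() << " defined in "
                     << sp->getFilename() << ":" << sp->getLine() << "\n";
    }
}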
Once you're there, most of the pain is behind you, and you enter the world of fun.
I strongly recommend starting with something very, very simple, such as a basic addition program.
Then always keep two windows open: Godbolt showing the input LLVM IR you need to parse, and a Java window showing the target (here is an example for MSIL).
Once you're able to transpile your first program (hurrah, I can add two integers :) ), you will soon want to transpile more stuff, and soon you will face two insanities:
getelementptr. This is how arrays, memory, structures and so on are accessed in LLVM. It is a pretty magic instruction (see the sketch after this list).
phi. A crucial instruction in the LLVM system, as it enables static single assignment (SSA), which is fairly important for the backend (register allocator and co). I don't know about Java bytecode, but this was obviously not available in MSIL.
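To make those two a bit more concrete, here is a rough sketch of the visitor loop you end up writing. The BytecodeEmitter type is hypothetical, standing in for whatever your Java-bytecode backend looks like; only the LLVM-side calls are real API. The phi handling shows the usual way out of SSA: give each phi a local slot and copy into it at the end of every predecessor block.

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/raw_ostream.h"

// Hypothetical stand-in for your Java bytecode emitter.
struct BytecodeEmitter {
    void emitAddressComputation(llvm::GetElementPtrInst* gep) {
        llvm::errs() << "GEP with " << gep->getNumIndices() << " indices\n";
    }
    void emitCopyToPhiSlot(llvm::Value* v, llvm::BasicBlock* pred, llvm::PHINode* phi) {
        llvm::errs() << "copy into slot of " << phi->getName()
                     << " at end of " << pred->getName() << "\n";
    }
};

void lowerFunction(llvm::Function& F, BytecodeEmitter& emit) {
    for (llvm::BasicBlock& BB : F) {
        for (llvm::Instruction& I : BB) {
            if (auto* gep = llvm::dyn_cast<llvm::GetElementPtrInst>(&I)) {
                // getelementptr only computes an address; the actual load/store is a separate instruction.
                emit.emitAddressComputation(gep);
            } else if (auto* phi = llvm::dyn_cast<llvm::PHINode>(&I)) {
                // For each (value, predecessor) pair, arrange for a copy into the phi's slot
                // when control leaves that predecessor.
                for (unsigned i = 0; i < phi->getNumIncomingValues(); ++i)
                    emit.emitCopyToPhiSlot(phi->getIncomingValue(i),
                                           phi->getIncomingBlock(i), phi);
            }
            // ... every other opcode you support goes here ...
        }
    }
}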
Once all of that is done, you enter the endless world of pain of special cases, weird C constructs you didn't know about, gcc extensions and so on...
Anyway good luck!

Related

Creating a modular language in LLVM?

I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has an llvm::Linker class that I can use during compilation to combine source files (llvm::Module); however, I want to guarantee library compatibility between separate compilation runs via metadata version numbers, or at least via the publicly exposed interface.
Much of the information available on metadata in LLVM suggests that it should only be used for extended information that would not break correctness when silently removed.
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a parseIRFile function in IRReader, so I can load previously built .bc files. I would be curious whether it would be reasonable practice to include size and CRC information for comparison when loading these files.
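To make that concrete, this is roughly the kind of thing I have in mind: stamping each library module with a version number via a module flag and checking it after loading (the flag name mylang.abi.version is just a placeholder):

#include <memory>
#include "llvm/IR/Constants.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"

static const uint32_t kAbiVersion = 3; // example value

// When emitting a library module, stamp it with the compiler's ABI version.
void stampVersion(llvm::Module& M)
{
    M.addModuleFlag(llvm::Module::Error, "mylang.abi.version", kAbiVersion);
}

// When importing a previously built .bc file, reject it if the stamp is missing or wrong.
std::unique_ptr<llvm::Module> loadLibrary(const char* path, llvm::LLVMContext& ctx)
{
    llvm::SMDiagnostic diag;
    std::unique_ptr<llvm::Module> M = llvm::parseIRFile(path, diag, ctx);
    if (!M)
        return nullptr;
    auto* version = llvm::mdconst::extract_or_null<llvm::ConstantInt>(
        M->getModuleFlag("mylang.abi.version"));
    if (!version || version->getZExtValue() != kAbiVersion)
        return nullptr; // incompatible or unstamped library
    return M;
}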
My language has concepts similar to C#'s, including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language-specific information in the interface without needing to encode it in the IR, as both the library and the calling code would be required to build against the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters so there is no predetermined function parameter order. This allows call sites to be more explicit, the compiler to catch erroneous parameter usage, and authors have more liberty in determining default parameters as they are not restricted to the last parameters to the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Is version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units", which are meant to be separately compiled objects, but currently what I do is simply drag in the source file and compile it into the main llvm::Module. That's neither efficient nor flexible (I can't use the linker to choose between the "Linux" and "Windows" version of some code, for example; not that I think there is a 5% chance that my compiler will work on Windows without modification anyway...).
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form; then:
you don't depend on LLVM versions changing the IR format;
you can add whatever metadata you like;
there won't be much difference between generating LLVM IR from this during your link phase and building the IR at compile time and then reading the IR back to figure out whether the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way.]
Like I started out, I'm not sure this is an answer, but it's my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...

Embed C++ compiler in application

Aren't shaders cool? You can toss in just a plain string, and as long as it is valid source, it will compile, link and execute. I was wondering if there is a way to embed GCC inside a user application so that it is "self sufficient", e.g. has the internal capability to compile native binaries compatible with itself.
So far I've been invoking standalone GCC as a process started from inside the application, but I was wondering if there is some API or something that would allow using it "directly" rather than as a standalone compiler. Also, in case it is possible, is it permitted?
EDIT: Although the original question was about GCC, I'd settle for information on how to embed LLVM/Clang too.
And now a special edit for people who cannot put 2 + 2 together: The question asks about how to embed GCC or Clang inside of an executable in a way that allows an internal API to be used from code rather than invoking compilation from a command prompt.
I'd add +1 to the suggestion to use Clang/LLVM instead of GCC. A few good reasons why:
it is more modular and flexible
compilation time can be substantially lower than with GCC
it supports the platforms you listed in the comments
it has an API that can be used internally; for example (note that this snippet was written against an older Clang/LLVM API):
// Compile "app.c" into an executable "app" by driving Clang programmatically.
string source = "app.c";
string target = "app";

// Locate the clang executable on the PATH.
llvm::sys::Path clangPath = llvm::sys::Program::FindProgramByName("clang");

// Command-line arguments for the driver.
vector<const char *> args;
args.push_back(clangPath.c_str());
args.push_back(source.c_str());
args.push_back("-l");
args.push_back("curl"); // link against libcurl

// Diagnostics engine so errors get printed to stderr.
clang::TextDiagnosticPrinter *DiagClient = new clang::TextDiagnosticPrinter(llvm::errs(), clang::DiagnosticOptions());
clang::IntrusiveRefCntPtr<clang::DiagnosticIDs> DiagID(new clang::DiagnosticIDs());
clang::DiagnosticsEngine Diags(DiagID, DiagClient);

// Build and execute the compilation jobs.
clang::driver::Driver TheDriver(args[0], llvm::sys::getDefaultTargetTriple(), target, true, Diags);
clang::OwningPtr<clang::driver::Compilation> c(TheDriver.BuildCompilation(args));
int res = 0;
const clang::driver::Command *FailingCommand = 0;
if (c) res = TheDriver.ExecuteCompilation(*c, FailingCommand);
if (res < 0) TheDriver.generateCompilationDiagnostics(*c, FailingCommand);
Yes, it is possible, for example, QEMU does it.
I don't have any personal experience in this field, but from what I've read, it seems that LLVM might be better suited for embedding and extending than GCC.
Some older list of C++ compilers and interpreters is available at http://www.thefreecountry.com/compilers/cpp.shtml.
The answer for a "self sufficient" application is usually a good language interpreter. There are many of them out there; many compile the code into byte-code files. The Lua language interpreter is very popular and easily embeddable. Even some strong players use it.
There was also an open source C++ interpreter with great language compatibility produced years ago starting with F.. Don't remember the rest of the name. There are also many other tools able to produce native binaries (e.g. Free Pascal).
The choice of the language and the target platform depends on your intentions. What would the "self sufficiency" be good for? Who will write those libraries? Once you have that clear, use Google; there is a whole wildlife out there. One of the latest beasts is the open-sourced C# compiler "Roslyn".
EDIT
If you need some C compiler (as you generate a C subset) that can be "embedded", you are probably looking for a "portable C compiler", in the sense that you can put it on a USB stick and carry it with you. Portable applications can easily be "embedded" into other applications and can easily be included in an installer.
The possibility to "embed" the compiler as statically linked code into the main application binary is probably not required.
Some reference to portable MinGW is described in this https://stackoverflow.com/questions/7617410/portable-c-compiler-ide SO question.
An open source C++ editor with integrated MinGW is here https://code.google.com/p/pocketcpp/.
I don't have anything more to say as I'd have to go and browse Google - so I will not win the bounty :)
Why not just call the compiler and linker from your application using fork()/exec() (for UNIX-like platforms)? Create a shared library that you can then load with dlopen().
This avoids possible licensing issues and gives you less of a maintenance burden.
This is e.g. what varnish does with its configuration files;
The VCL language is a small domain-specific language designed to be used to define request handling and document caching policies for Varnish Cache.
When a new configuration is loaded, the varnishd management process translates the VCL code to C and compiles it to a shared object which is then dynamically linked into the server process.

C++ add new code during runtime

I am using C++ (in Xcode and Code::Blocks); I don't know much.
I want to make something compilable during runtime.
for example:
char prog[] = "cout << \"helloworld\";";
It should compile the contents of prog.
I read a bit about quines, but it didn't help me.
It's sort of possible, but not portably, and not simply.
Basically, you have to write the code out to a file, then compile it to a DLL (invoking the compiler with system), and then load the DLL. The first is simple, the last isn't too difficult (but will require implementation-specific code), but the middle step can be challenging: obviously, it only works if the compiler is installed on the system, but you have to find where it is installed, verify that it is the same version (or at least a version which generates binary compatible code), invoke it with the same options that were used when your code was compiled, and process any errors.
C++ wasn't designed for this. (Compiled languages generally aren't.)
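For the Unix-like case, a bare-bones sketch of that write / compile / load cycle might look like the following (it assumes g++ is on the PATH and glosses over all the version checking and error handling just described):

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <dlfcn.h>

int main()
{
    // 1. Write the code out to a file.
    {
        std::ofstream out("generated.cpp");
        out << "extern \"C\" int answer() { return 42; }\n";
    }   // closed and flushed before we invoke the compiler

    // 2. Compile it to a shared library by invoking the installed compiler.
    if (std::system("g++ -shared -fPIC -o libgenerated.so generated.cpp") != 0)
        return 1;

    // 3. Load the library and look up the symbol.
    void* handle = dlopen("./libgenerated.so", RTLD_NOW);
    if (!handle)
        return 1;
    auto answer = reinterpret_cast<int (*)()>(dlsym(handle, "answer"));
    std::printf("%d\n", answer ? answer() : -1);   // prints 42
    dlclose(handle);
    return 0;
}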
The short answer is "no, you can't do that". C and C++ were never designed to do this.
That's pretty much also the long answer to the actual question, but I'll expand a bit on a few ideas.
The code, as compiled by the compiler, is pretty certainly not trivial to add things to. There are a few techniques that can be used to "add more code" to a program:
Add a dynamic shared library (DLL), which contains code that has been compiled separately to the existing code. You could of course also have code in your program to output some code, compile this code with the compiler, link it into a dynamic library, and load it in your code.
You could build your own little code-generator that generates machine code in a chunk of memory. Note that you probably need to call a "special" memory allocation function, as "normal" memory allocations are typically not allowed to be executed - you need to allocate "with execute permission" - VirtualAlloc in Windows does have such a flag, and mmap in Linux/Unix flavours does too (see the sketch after this list). And of course, you pretty much have to "be a compiler" to achieve this.
You could naturally also invent your own interpreted language, which would allow your program to load in for example a text-file with commands/instructions to be executed, or contain text inside the program for execution with this language.
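As a tiny illustration of that second technique, here is the classic "emit a few bytes and jump to them" trick on x86-64 Linux (the bytes encode mov eax, 42 followed by ret; on Windows you would use VirtualAlloc with an executable page instead):

#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main()
{
    // mov eax, 42 ; ret   -- x86-64 machine code for "return 42"
    const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    // Ask for a page that is writable *and* executable (normal heap memory is not).
    void* mem = mmap(nullptr, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;

    std::memcpy(mem, code, sizeof(code));
    auto fn = reinterpret_cast<int (*)()>(mem);
    std::printf("%d\n", fn());           // prints 42

    munmap(mem, sizeof(code));
    return 0;
}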
But like I said to start with, this is not what C and C++ (and most other compiled languages) were meant for, so it's not going to be as simple as "stick some C++ code in a string, and make it run".
It depends why you want to do this.
If it's for efficiency reasons - you know what a function does only at run time, but it has to be very efficient - then what was already suggested (writing to a file, compiling to a dll / so and dynamically loading it) is your best option.
BUT if the reason you want this is to allow for user-supplied behaviour, say a general function you read from a database (behaviour of a unit in-game? value of a field in a plot?), or more generally you just want to change / augment behaviour at runtime with little concern for efficiency, I recommend using an outside scripting language like Lua, which easily interacts with your compiled C++ code.
The C and C++ languages compile to binary machine code, unlike Java and C#, which generate instructions for a 'virtual machine', or interpreted scripting languages such as JavaScript. The compilation of C++ is performed by a separate executable, the compiler, which is not incorporated into the resulting executable.
So the language does not have any built in "eval" capability to translate further code once compilation is finished.
It's not uncommon for new C/C++ programmers to think they need to do this, but they typically don't. Perhaps you could expand further on what you're actually looking to do.
But if you do actually need to be able to do this, your options are:
Write code to compile a new executable with the new code and then run the resulting program.
Write a simple parser and "virtual machine" of your own,
Look at incorporating an embedded scripting/interpreted language such as Lua,
Try and wrap your head around integrating CINT,
See also: Scripting language for C++

Define C++ function at runtime

I'm trying to adjust some mathematical code I've written to allow for arbitrary functions, but I only seem to be able to do so by pre-defining them at compile time, which seems clunky. I'm currently using function pointers, but as far as I can see the same problem would arise with functors. To provide a simplistic example, for forward-difference differentiation the code used is:
#include <cmath>   // for exp()

double xsquared(double x) {
    return x * x;
}

double expx(double x) {
    return exp(x);
}

// Forward-difference approximation of the derivative of af at x.
double forward(double x, double h, double (*af)(double)) {
    double answer = (af(x + h) - af(x)) / h;
    return answer;
}
Where either of the first two functions can be passed as the third argument. What I would like to do, however, is pass user input (in valid C++) rather than having to set up the functions beforehand. Any help would be greatly appreciated!
Historically the kind of functionality you're asking for has not been available in C++. The usual workaround is to embed an interpreter for a language other than C++ (Lua and Python for example are specifically designed for being integrated into C/C++ apps to allow scripting of them), or to create a new language specific to your application with your own parser, compiler, etc. However, that's changing.
Clang is a newer open-source compiler, developed in large part by Apple, that leverages LLVM. Clang is designed from the ground up to be usable not only as a compiler but also as a C++ library that you can embed into your applications. I haven't tried it myself, but you should be able to do what you want with Clang -- you'd link it in as a library and ask it to compile the code your users input into the application.
You might try checking out how the ClamAV team already did this, so that new virus definitions can be written in C.
As for other compilers, I know that GCC recently added support for plugins. It may be possible to leverage that to bridge GCC and your app, but because GCC wasn't designed to be used as a library from the beginning, it might be more difficult. I'm not aware of any other compilers that have a similar ability.
As C++ is a fully compiled language, you cannot really transform user input into code unless you write your own compiler or interpreter. But in this example, it would be possible to build a simple interpreter for a domain-specific language, which here would be mathematical formulae. It all depends on what you want to do.
You could always take the user's input and run it through your compiler, then executing the resulting binary. This of course would have security risks as they could execute any arbitrary code.
Probably easier is to devise a minimalist language that lets users define simple functions, parsing them in C++ to execute the proper code.
The best solution is to use an embedded language like lua or python for this type of task. See e.g. Selecting An Embedded Language for suggestions.
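To show how little glue that takes, here is a minimal sketch that lets the user supply the body of f(x) as Lua source and then plugs it into the same forward-difference as above (it links against the standard Lua C library; the function name f is just a convention of this example):

#include <cstdio>
#include <string>
extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

// Evaluate the user-defined Lua function f at x.
static double callf(lua_State* L, double x)
{
    lua_getglobal(L, "f");
    lua_pushnumber(L, x);
    lua_call(L, 1, 1);                 // one argument, one result
    double r = lua_tonumber(L, -1);
    lua_pop(L, 1);
    return r;
}

int main()
{
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);

    // This string could come straight from user input.
    std::string userCode = "function f(x) return x * x end";
    luaL_dostring(L, userCode.c_str());

    double x = 2.0, h = 1e-6;
    std::printf("f'(%g) ~= %g\n", x, (callf(L, x + h) - callf(L, x)) / h);

    lua_close(L);
    return 0;
}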
You may use the Tiny C Compiler as a library (libtcc).
It allows you to compile arbitrary code at run time and load it, but it only works for C, not C++ (a short sketch follows the list below).
Generally, the only way is the following:
Pass the code to a compiler and create a shared object or DLL
Load this shared object or DLL
Use a function from this shared object.
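A minimal sketch of the libtcc route (C code compiled straight into memory, no temporary files; it assumes libtcc and its header are installed):

#include <cstdio>
#include <libtcc.h>

int main()
{
    TCCState* s = tcc_new();
    tcc_set_output_type(s, TCC_OUTPUT_MEMORY);   // compile straight into memory

    // The C source could be assembled at run time from user input.
    const char* src = "int add(int a, int b) { return a + b; }";
    if (tcc_compile_string(s, src) < 0)
        return 1;

    if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0)  // finalize the code in memory
        return 1;
    auto add = reinterpret_cast<int (*)(int, int)>(tcc_get_symbol(s, "add"));
    std::printf("%d\n", add(2, 3));              // prints 5

    tcc_delete(s);
    return 0;
}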
C++, unlike some other languages like Perl, isn't capable of doing runtime interpretation of itself.
Your only option here would be to allow the user to compile small shared libraries that could be dynamically-loaded by your application at runtime.
Well, there are two things you can do:
Take full advantage of Boost/C++0x lambdas to define functions at runtime.
If only mathematical formulas are needed, libraries like muParser are designed to turn a string into bytecode, which can be seen as defining a function at runtime.
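For the muParser option, the glue needed to turn a user-supplied string into something you can differentiate looks roughly like this (a sketch assuming a default, non-Unicode muParser build; the expression string would come from the user):

#include <cstdio>
#include "muParser.h"

int main()
{
    double x = 0.0;
    mu::Parser p;
    p.DefineVar("x", &x);          // bind the symbol "x" to our variable
    p.SetExpr("x*x");              // this string could come from the user

    // Forward-difference using the parsed expression instead of a compiled function.
    double h = 1e-6, x0 = 2.0;
    x = x0 + h; double fxh = p.Eval();
    x = x0;     double fx  = p.Eval();
    std::printf("f'(%g) ~= %g\n", x0, (fxh - fx) / h);
    return 0;
}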
While it seems like a blow-off, there are a lot of people out there who have written equation parsers and interpreters for C++ and C, many commercial, many flawed, and all as different as faces in a crowd. One place to start is the college guys writing infix-to-postfix translators. Some of these systems use parenthetical grouping followed by putting the items on a stack like you would find in the old HP STL library. I spent 30 seconds and found this one:
http://www.speqmath.com/tutorials/expression_parser_cpp/index.html
possible search string:"gcc 'equation parser' infix to postfix"

LLVM: what is it and how can I use it for cross-platform compilation

I was reading here and there about LLVM, which can be used to ease the pain of cross-platform compilation in C++. I was trying to read the documents, but I didn't understand how I can use it in real-life development problems. Can someone please explain to me in simple words how I can use it?
The key concept of LLVM is a low-level "intermediate" representation (IR) of your program.
This IR is at about the level of assembler code, but it contains more information to facilitate optimization.
The power of LLVM comes from its ability to defer compilation of this intermediate representation to a specific target machine until just before the code needs to run. A just-in-time (JIT) compilation approach can be used for an application to produce the code it needs just before it needs it.
In many cases, you have more information at the time the program is running than you do back at head office, so the program can be optimized much more.
To get started, you could compile a C++ program to a single intermediate representation, then compile it to multiple platforms from that IR.
You can also try the Kaleidoscope demo, which walks you through creating a new language without having to actually write a compiler, just write the IR.
In performance-critical applications, the application can essentially write its own code that it needs to run, just before it needs to run it.
Why don't you go to the LLVM website and check out all the documentation there? They explain in great detail what LLVM is and how to use it. For example, they have a Getting Started page.
LLVM is, as its name says, a low-level virtual machine, which has a code generator. If you want to compile to it, you can use either the GCC front end or Clang, which is the C/C++ compiler for LLVM and is still a work in progress.
It's important to note that a bunch of information about the target comes from the system header files that you use when compiling. LLVM does not defer resolving things like "size of pointer" or "byte layout", so if you compile with 64-bit headers for a little-endian platform, you cannot use that LLVM source code to target a 32-bit big-endian assembly output later.
There is a good chapter in a book explaining everything nicely here: www.aosabook.org/en/llvm.html