Call C/C++ functions from the ExecutionEngine - llvm

I am learning llvm and wanted to do a proof of concept of an idea I have.
Basically, I want to split my compiler and my runtime. The compiler would give a .bc and the runtime would load it via ParseBitcodeFile and use the ExecutionEngine to run it. This part is working.
Now, to make system calls easily, I want to be able to implement in my runtime C/C++ functions which do all the system calls (file io, stdout printing, etc). My question is, how could I call these functions from the code from my toy compiler, which is compiled in a different step by llvm, and allow it to be used when executed.

Good news: when using the JIT ExecutionEngine, this will just work. When the JIT-er finds an external symbol used by the IR which is not found in the IR itself, it looks in the JIT-ing process itself, so any symbols visible from your host program can be called.
This is explained directly in part 4 of the LLVM tutorial:
Whoa, how does the JIT know about sin and cos? The answer is
surprisingly simple: in this example, the JIT started execution of a
function and got to a function call. It realized that the function was
not yet JIT compiled and invoked the standard set of routines to
resolve the function. In this case, there is no body defined for the
function, so the JIT ended up calling “dlsym("sin")” on the
Kaleidoscope process itself. Since “sin” is defined within the JIT’s
address space, it simply patches up calls in the module to call the
libm version of sin directly.
For the gory details look at lib/ExecutionEngine/JIT/JIT.cpp - in particular its usage of DynamicLibrary.

Eli's answer is great and you should accept it. There's another alternative, though, which is to separately compile your runtime's source files to LLVM modules (e.g. with Clang) and use ExecutionEngine::addModule() to add them.
It's less convenient, and it means compiling the same files twice (once for your host program, another to get Modules from them), but the advantage is that it enables inlining and other cross-function optimizations from your JITted code.

Related

C++ Source-to-Source Transformation with Clang

I am working on a project for which I need to "combine" code distributed over multiple C++ files into one file. Due to the nature of the project, I only need one entry function (the function that will be defined as the top function in the Xilinx High-Level-Synthesis software -> see context below). The signature of this function needs to be preserved in the transformation. Whether other functions from other files simply get copied into the file and get called as a subroutine or are inlined does not matter. I think due to variable and function scopes simply concatenating the files will not work.
Since I did not write the C++ code myself and it is quite comprehensive, I am looking for a way to do the transformation automatically. The possibilities I am aware of to do this are the following:
Compile the code to LLVM IR with inlining flags and use a C++/C backend to turn the LLVM code into the source language again. This will result in bad source code and require either an old release of Clang and LLVM or another backend like JuliaComputing. 1
The other option would be developing a tool that relies on using the AST and a library like LibTooling to restructure the code. This option would probably result in better code and put everything into one file without the unnecessary inlining. 2 However, this options seems too complicated to put the all the code into one file.
Hence my question: Are you aware of a better or simply alternative approach to solve this problem?
Context: The project aims to run some of the code on a Xilinx FPGA and the Vitis High-Level-Synthesis tool requires all code that is to be made into a single IP block to be contained in a single file. That is why I need to realise this transformation.

How can one execute C++ code at runtime from a string representing a function?

I'm working on a project wherein I need to be able to save a function string to disk, so I am having the user pass a string of characters that is the actual code of the function and saving it to disk. The opposite is necessary as well; loading a string (from file) and executing as a function at runtime within C++. I need to load this function and return a function pointer to be used in my program. I'm looking at Clang right now, but some of it is a little over my head. So basically I have two questions;
Can Clang run code extracted from a string (loaded from disk)?
Can a compiled Clang function be represented with a function pointer pointing to it?
Any ideas?
The simple answer to your question is "yes", the slightly more complex answer is "not at all easily".
Doing it with C++ would require that you compile and link your function into a DLL/shared object, load it, then acquire the exported function. In addition, accepting such code from the user would be a terrible security risk
C++ is a very poor choice for such run-time execution, you would be far better off going with a language meant for that use, JavaScript or Python come to mind.
You can't easily do this in a compiled language.
For a compiled program to execute a C++ function that has been dynamically provided at runtime, that function would need to be compiled itself. You could make your program call the compiler at runtime to generate a callable library (e.g. one that implements an interface or abstract class and is callable via Dependency Injection), but this is complex and is a project in and of itself. This also means that your application must be packaged with the compiler or must only be installed on systems that contain a compatible compiler - somewhat realistic on Linux, not at all so on Windows.
A better solution would be to use an interpreter. JavaScript and Lisp both come with an eval() function that does exactly what you want - it takes a string (in the case of JavaScript) or a list (in the case of Lisp) and executes it as code.
A third possibility is to find a C++ interpreter that has an eval() function. I'm not sure if any exist. You could try to write one yourself.

Creating a modular language in LLVM?

I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has a llvm::Linker class that I can use during compilation to combine source files (llvm::Module), however I wanted to guarantee library compatibility via metadata version numbers or at least the publicly exposed interface between separate compilation runs.
Much of the information available on metadata in LLVM suggest that it should only be used for extended information that would not break correctness when silently removed.
llvm
blog
IntrinsicsMetadataAttributes
pdf
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a method in IRReader to parseIRFile so I can load some previously built bc files. I would be curious if it would be reasonable practice to include size and CRC information for comparison when loading these files.
My language has concepts similar to C# including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (Much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language specific information in the interface without needing to encode it in the IR as both the library and the calling code would be required to build with the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters so there is no predetermined function parameter order. This allows call sites to be more explicit, the compiler to catch erroneous parameter usage, and authors have more liberty in determining default parameters as they are not restricted to the last parameters to the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Is version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units" which is meant to be a separately compiled object, but currently what I do is simply drag in the source file and compile it into the main llvm::Module - that's neither efficient nor flexible (can't use the linker to choose between the "Linux" and "Windows" version of some code, for example - not that I think there is 5% chance that my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form - then
you don't depend on LLVM versions changing the IR format.
You can add whatever metadata you like. There won't be much
difference in generating LLVM-IR from this during your link phase or
building the IR at compile and then reading the IR to figure out if
the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way]
Like I started out, I'm not sure this is an answer, but it's my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...

How to call an assembly function from C++ dynamically?

REQUIREMENT: For a certain project we have unique requirement. The application supports an expression language that allows the user to define their own complex expressions that can be evaluated at run time (many hundred times a second) and they need to be executed at machine level for performance.
WORKING: Our expression parser translates the script into corresponding assembly language routine perfectly. We checked it by statically linking the object files generated with our C test program and they produce correct result.
Since the client can change the script anytime, our program (at run time) detects the change, calls the parser which generates the corresponding assembly routine. We then call the assembler from back end to create the object code.
PROBLEM
How can we call this assembly routine dynamically from the C++ program
(Loader)?
We are not supposed to call the C++ compiler to link it with the loader because the loader already would have other subroutines running and we cannot take the loader off, recompile and then execute the new loader program.
I tried searching for a solution online but every time the results are littered with .NET assembly dynamic calling. Our app has nothing to do with .NET.
First, the "generated plugin" approach (on Linux; my answer focuses on Linux but could be adapted to Windows with some effort; you could use many-platform frameworks like Qt or POCO or Glib from GTK; then all wrap plugin loading abilities à la dlopen with a common API that you could use on Windows, on Linux, on MacOSX, on Android) :
generate C (or assembly) code in some file /tmp/generated01.c (you might even generate C++ code using standard C++ containers, but its compilation would be significantly slower; beware of name mangling so emit and use extern "C" functions; read the C++ dlopen mini HowTo). See this answer explaining why generating C is worthwhile (and could be better, and more portable, than generating assembler code).
run (using fork+execve+waitpid, or simply system) a compilation of that generated file into a shared object /tmp/genenerated01.so by running gcc -fPIC -Wall -O /tmp/generated01.c -shared -o /tmp/generated01.so command; you practically need to get position-independent code, hence the -fPIC flag. If using dlopen on your generated assembler code you'll need to improve your assembler generator to emit PIC code.
dlopen that new /tmp/generated01.so (so use the dynamic linker), see dlopen(3); you could even remove the now useless generated C file /tmp/generated01.c
dlsym the relevant symbols to get function pointers to the generated code, see dlsym(3); your application would simply call the generated code using these function pointers.
when you are sure that you don't need any functions from it and that no call frame uses it, you could dlclose that shared object library (but you might accept to leak some address space by not calling dlclose at all)
The above approach is worthwhile and can be used a big lot of times (my manydl.c demonstrates that you could dlopen a million different shared objects), and is practically even compatible (even when emitting C code!) with an interactive Read-Eval-Print-Loop -on most current desktops and laptops and servers-, since most of the time the generated /tmp/generated01.c would be quite small (e.g. a few hundred lines at most) to be very quickly generated and compiled (by gcc, etc...). I am even using this in MELT for its REPL mode. On Linux this plugin approach generally requires to link the main application with -rdynamic (so that dlopen-ed plugins can reference and call functions from the main application).
Then, other approaches could be to use some Just-In-Time compilation library, like
GNU lightning (which emits slow machine code very quickly - so very short JIT emission time, but the generated code is running slowly since it is very unoptimized)
asmjit; it is x86-64 specific, and enables you to generate individual x86-64 machine instructions
GNU libjit is available for several platforms, and offer an "interpreter" mode for other platforms
LLVM (part of Clang/LLVM compiler, usable as a JIT library)
GCCJIT (a new JIT library front-end to GCC)
Grossly speaking, the first elements of that list are able to emit JIT machine code fairly quickly, but that code won't run as fast as compiling with gcc -fPIC -O1 or -O2 the equivalent generated C code (but would run typically 2x to 5x slower!); the last two elements (LLVM & GCCJIT) are compiler based: so they are able to optimize and emit efficient code, at the expense of slower JIT code emission. All the JIT libraries are able (like dlsym does for plugins) to give function pointers to newly JIT-constructed functions.
Notice that there is a trade-off to be made: some techniques are able to generate quickly some machine code, if you accept that generated code to later run a bit slowly; other techniques (notably GCCJIT or LLVM) are spending time to optimize the generated machine code, so takes more time to emit the machine code, but that code would later run quickly. You should not expect both (small generation time, quick execution time), since there is no such thing as a free lunch.
I believe that generating manually some assembler code is practically not worthwhile. You won't be able to generate very optimized code (because optimization is a very difficult art, and both GCC and Clang have millions of source line code for optimization passes), unless you spend many years of work for that. Using some JIT library is easier, and "compiling" to C or C++ is also quite easy (you leave the burden of optimization to the C compiler you are calling).
You could also consider rewriting your application into some language with homoiconicity and metaprogramming abilities (e.g. multi-stage programming), such as Common Lisp (and many others, e.g. those providing eval). Its SBCL implementation is always emitting machine code...
You could also embed an interpreter like Lua -perhaps even LuaJit- or Guile in your application. The main advantage of embedding an existing language is that there are resources (books, modules, ...) and community of people knowing them (designing a good language is difficult!). Also, the embedded interpreter library is well designed and probably well debugged (since used a lot), and some of them are fast enough (since using bytecode techniques).
As the comments already suggest, LoadLibrary (Windows) and dlopen (Linux/POSIX) are by far the easiest solution. These are specifically intended to dynamically load code. Equally important, they both allow unloading as well, and there are functions to then get a function entry point by name.
You can dynamically do it. I will take linux case as an example. Since your parser working fine and generates machine code, you should be able to generate .so (for linux) or .dll for windows.
Next, load the library as
handle = dlopen(so_file_name, RTLD_LAZY);
Next get function pointer
func = dlsym(handle, "function_name");
Then you should be able to execute it as func()
One thing you need to experiment (in case you do not get desired result) is close and open the so file or dll file (you need to do only if required, else it may reduce performance)
It sounds like you can generate the proper byte code. So you could just ensure that you generate position independent code, write it into an executable piece of memory, and then call or create thread upon the code. The simplest way would just be to cast the pointer to the base of the memory you wrote the code into as a function pointer, and then call it.
If you write your bytecode to avoid referencing different sections, and instead reference offsets from its loaded base, 'loading' the code is as easy as writing it to executable memory. You could do a call/pop/jmp to find the base of the code once it begins executing.
Conversely, and probably the easiest solution, would be to just write the code out as function expecting arguments, that way you could pass the code's base and any other arguments to it, as you would with any other function, as long as you use the proper typedef for your function pointer, and the generated assembly handles the arguments properly. As long as you avoid creating absolute jumps or data references to absolute addresses, you shouldn't have any issue.
too late but I think it would help someone else.
in case you want to dynamically execute a piece of code, you can create an interpreter for this.
compile your expressions into some byte code then write the interpreter for executing this.
here is a tutorial about writing interpreters, but in python.
https://ruslanspivak.com/lsbasi-part1/
you can write it using c/c++

Any tutorial for embedding Clang as script interpreter into C++ Code?

I have no experience with llvm or clang, yet. From what I read clang is said to be easily embeddable Wikipedia-Clang, however, I did not find any tutorials about how to achieve this. So is it possible to provide the user of a c++ application with scripting-powers by JIT compiling and executing user-defined code at runtime? Would it be possible to call the applications own classes and methods and share objects?
edit: I'd prefer a C-like syntax for the script-languge (or even C++ itself)
I don't know of any tutorial, but there is an example C interpreter in the Clang source that might be helpful. You can find it here: http://llvm.org/viewvc/llvm-project/cfe/trunk/examples/clang-interpreter/
You probably won't have much of a choice of syntax for your scripting language if you go this route. Clang only parses C, C++, and Objective C. If you want any variations, you may have your work cut out for you.
I think here's what exactly you described.
http://root.cern.ch/drupal/content/cling
You can use clang as a library to implement JIT compilation as stated by other answers.
Then, you have to load up the compiled module (say, an .so library).
In order to accomplish this, you can use standard dlopen (unix) or LoadLibrary (windows) to load it, then use dlsym (unix) to dynamically reference compiled functions, say a "script" main()-like function whose name is known. Note that for C++ you would have to use mangled symbols.
A portable alternative is e.g. GNU's libltdl.
As an alternative, the "script" may run automatically at load time by implementing module init functions or putting some static code (the constructor of a C++ globally defined object would be called immediately).
The loaded module can directly call anything in the main application. Of course symbols are known at compilation time by using the proper main app's header files.
If you want to easily add C++ "plugins" to your program, and know the component interface a priori (say your main application knows the name and interface of a loaded class from its .h before the module is loaded in memory), after you dynamically load the library the class is available to be used as if it was statically linked. Just be sure you do not try to instantiate a class' object before you dlopen() its module.
Using static code allows to implement nice automatic plugin registration mechanisms too.
I don't know about Clang but you might want to look at Ch:
http://www.softintegration.com/
This is described as an embeddable or stand-alone c/c++ interpreter. There is a Dr. Dobbs article with examples of embedding it here:
http://www.drdobbs.com/architecture-and-design/212201774
I haven't done more than play with it but it seems to be a stable and mature product. It's commercial, closed-source, but the "standard" version is described as free for both personal and commercial use. However, looking at the license it seems that "commercial" may only include internal company use, not embedding in a product that is then sold or distributed. (I'm not a lawyer, so clearly one should check with SoftIntegration to be certain of the license terms.)
I am not sure that embedding a C or C++ compiler like Clang is a good idea in your case. Because the "script", that is the (C or C++) code fed (at runtime!) can be arbitrary so be able to crash the entire application. You usually don't want faulty user input to be able to crash your application.
Be sure to read What every C programmer should know about undefined behavior because it is relevant and applies to C++ also (including any "C++ script" used by your application). Notice that, unfortunately, a lot of UB don't crash processes (for example a buffer overflow could corrupt some completely unrelated data).
If you want to embed an interpreter, choose something designed for that purpose, like Guile or Lua, and be careful that errors in the script don't crash the entire application. See this answer for a more detailed discussion of interpreter embedding.