LLVM API to assemble and link IR to executable - c++

I am building a toy language in C++ using LLVM 15.0.7, following the quintessential Kaleidoscope guide. However, my language does not use a JIT; it is compiled ahead of time. What I want is to turn my LLVM Module into an executable file (i.e. an a.out), ideally with control over which files it links against.
My current "solution" is extremely hacky, and involves the following:
llvm::InitializeAllTargetInfos();
llvm::InitializeAllTargets();
llvm::InitializeAllTargetMCs();
llvm::InitializeAllAsmParsers();
llvm::InitializeAllAsmPrinters();
auto TargetTriple = llvm::sys::getDefaultTargetTriple();
myLlvmModule->setTargetTriple(TargetTriple);
std::string Error;
auto Target = llvm::TargetRegistry::lookupTarget(TargetTriple, Error);
if (!Target)
    return 1;
auto CPU = "generic";
auto Features = "";
llvm::TargetOptions opt;
auto RM = llvm::Optional<llvm::Reloc::Model>();
auto TheTargetMachine =
    Target->createTargetMachine(TargetTriple, CPU, Features, opt, RM, llvm::None, llvm::CodeGenOpt::Aggressive);
myLlvmModule->setDataLayout(TheTargetMachine->createDataLayout());
auto Filename = "out.o";
std::error_code EC;
llvm::raw_fd_ostream dest(Filename, EC, llvm::sys::fs::OF_None);
if (EC)
    return 1;
llvm::legacy::PassManager pass;
auto FileType = llvm::CGFT_ObjectFile;
if (TheTargetMachine->addPassesToEmitFile(pass, dest, nullptr, FileType))
    return 1;
pass.run(*myLlvmModule);
dest.flush();
system("cc out.o");
remove("out.o");
Basically, I'm assembling the IR into an object file and then passing it to Clang or GCC with system(), which handles the linking and emits the executable.
Obviously, this is not a good solution, since I have no control over what version of Clang/GCC a user has, and calling to another process like this makes my compiler no longer "self contained". Most importantly, it defeats the purpose of learning to do the steps by hand.
I really don't want to call system in my project. There has to be a better way, right? Surely other LLVM-based compilers don't do this! Surely LLVM has an API tool that takes the in-memory IR representation and outputs an executable, or alternatively a tool that links object files... right?
I tried looking into the Object sub-directory in the LLVM source to see if maybe there's some tool that lets me manually do what that system call is doing, but I'm just not experienced enough with LLVM to know what to look for. I've also tried looking at other compilers for inspiration on how this is approached, but again, it's hard for me as someone new to LLVM to parse through these massive projects. There's a real lack of quality guides out there on this stuff.
Any help or guidance here would be greatly appreciated.

Related

C++ function instrumentation via clang++'s -finstrument-functions : how to ignore internal std library calls?

Let's say I have a function like:
template<typename It, typename Cmp>
void mysort( It begin, It end, Cmp cmp )
{
    std::sort( begin, end, cmp );
}
When I compile this using -finstrument-functions-after-inlining with clang++ --version:
clang version 11.0.0 (...)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ...
The instrumentation code blows up the execution time, because my entry and exit functions are called for every call of
void std::__introsort_loop<...>(...)
void std::__move_median_to_first<...>(...)
I'm sorting a really big array, so my program doesn't finish: without instrumentation it takes around 10 seconds, with instrumentation I've cancelled it at 10 minutes.
I've tried adding __attribute__((no_instrument_function)) to mysort (and the function that calls mysort), but this doesn't seem to have an effect as far as these standard library calls are concerned.
Does anyone know if it is possible to ignore function instrumentation for the internals of a standard library function like std::sort? Ideally, I would only have mysort instrumented, so a single entry and a single exit!
I see that clang++ sadly does not yet support anything like -finstrument-functions-exclude-function-list or -finstrument-functions-exclude-file-list, while g++ does not yet support -finstrument-functions-after-inlining, which I would ideally have, so I'm stuck!
EDIT: After playing more, it would appear the effect on execution-time is actually less than that described, so this isn't the end of the world. The problem still remains however, because most people who are doing function instrumentation in clang will only care about the application code, and not those functions linked from (for example) the standard library.
EDIT2: To further highlight the problem now that I've got it running in a reasonable time frame: the resulting trace that I produce from the instrumented code with those two standard library functions is 15GB. When I hard code my tracing to ignore the two function addresses, the resulting trace is 3.7MB!
I've run into the same problem. It looks like support for these flags was once proposed, but never merged into the main branch.
https://reviews.llvm.org/D37622
This is not a direct answer, since the tool doesn't support what you want to do, but I think I have a decent work-around. What I wound up doing was creating a "skip list" of sorts. In the instrumented functions (__cyg_profile_func_enter and __cyg_profile_func_exit), I would guess the part that is contributing most to your execution time is the printing. If you can come up with a way of short-circuiting the profile functions, that should help, even if it's not the most ideal. At the very least it will limit the size of the output file.
Something like
#include <stddef.h>
#include <stdint.h>
uintptr_t skipAddrs[] = {
    // assuming 64-bit addresses
    0x123456789abcdef, 0x2468ace2468ace24
};
size_t arrSize = 0;
int main(void)
{
    ...
    // number of elements in the skip list
    arrSize = sizeof(skipAddrs)/sizeof(skipAddrs[0]);
    // https://stackoverflow.com/a/37539/12940429
    ...
}
void __cyg_profile_func_enter (void *this_fn, void *call_site) {
    // return early for any function whose address is on the skip list
    for (size_t idx = 0; idx < arrSize; idx++) {
        if ((uintptr_t) this_fn == skipAddrs[idx]) {
            return;
        }
    }
}
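A variant of the skip-list idea, as a minimal C++ sketch: since the hook runs on every function entry, keeping the list sorted allows a binary search instead of a linear scan. The addresses, the isSkipped helper, and the g_tracedCalls counter are all hypothetical placeholders; the hook and its helper carry no_instrument_function so that they are not instrumented themselves if this file is built with -finstrument-functions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical function addresses, kept sorted in ascending order.
static const std::uintptr_t kSkipAddrs[] = {0x1000, 0x2000, 0x3000};
static const std::size_t kSkipCount = sizeof(kSkipAddrs) / sizeof(kSkipAddrs[0]);

// Stand-in for real tracing: counts the calls that were not skipped.
static unsigned long g_tracedCalls = 0;

// Hand-written binary search over the sorted skip list.
__attribute__((no_instrument_function))
bool isSkipped(std::uintptr_t addr) {
    std::size_t lo = 0, hi = kSkipCount;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (kSkipAddrs[mid] == addr) return true;
        if (kSkipAddrs[mid] < addr) lo = mid + 1;
        else hi = mid;
    }
    return false;
}

extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)call_site;  // unused in this sketch
    if (isSkipped(reinterpret_cast<std::uintptr_t>(this_fn))) return;
    ++g_tracedCalls;  // real tracing output would go here
}
```

A matching __cyg_profile_func_exit would apply the same check.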
I use something like objdump -t binaryFile to examine the symbol table and find what the addresses are for each function.
If you specifically want to ignore library calls, something that might work is examining the symbol table of your object file(s) before linking against libraries, then ignoring all the ones that appear new in the final binary.
All this should be possible with things like grep, awk, or python.
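The symbol-table comparison described above amounts to a set difference. A minimal C++ sketch, assuming the symbol names have already been extracted (for example from objdump -t output) into two sets; the function name symbolsAddedByLinking and the sample symbol names used to exercise it are illustrative only:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Returns the symbols that appear in the linked binary but not in the
// application's own object files, i.e. candidates for the skip list.
std::vector<std::string> symbolsAddedByLinking(
    const std::set<std::string> &objectSymbols,
    const std::set<std::string> &binarySymbols) {
    std::vector<std::string> added;
    // std::set iterates in sorted order, which std::set_difference requires.
    std::set_difference(binarySymbols.begin(), binarySymbols.end(),
                        objectSymbols.begin(), objectSymbols.end(),
                        std::back_inserter(added));
    return added;
}
```

For instance, with object symbols {main, mysort} and binary symbols {main, mysort, memcpy, qsort}, the result would be memcpy and qsort, whose addresses could then be looked up and added to the skip list.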
You have to add the attribute __attribute__((no_instrument_function)) to the functions that should not be instrumented. Unfortunately, it is not easy to make this work with C/C++ standard library functions, because it requires editing all of the C++ library functions.
There are some hacks you can do like #define existing macros from include/__config to add this attribute as well. e.g.,
-D_LIBCPP_INLINE_VISIBILITY=__attribute__((no_instrument_function,internal_linkage))
Make sure to append no_instrument_function to the existing macro definition, rather than replacing it, to avoid unexpected errors.

How to use the RandomNumberGenerator within llvm?

I was hoping someone can give me an example as to how to use the RandomNumberGenerator class within LLVM. All of the examples I am able to find seem to use outdated methods.
I would like to be able to create a RNG within a pass that can be overridden with the '-rng-seed' parameter.
How can this value be accessed if it was provided as a parameter, and how to create the value if it was not provided as a parameter?
Also, I understand that a single RNG is not meant to be shared between threads for a single module. If I am running multiple passes on a module, can they share the same generated RNG?
The RandomNumberGenerator class has a private constructor (check its doc and the source file under llvm/lib/Support/RandomNumberGenerator.cpp), so the only way (that I know of, at least) to get a hold of an instance is via Module's createRNG method.
So, assuming that you have an llvm::Function pass (and using C++11):
bool runOnFunction(llvm::Function &CurFunc) override {
    auto rng = CurFunc.getParent()->createRNG(this);
    llvm::errs() << (*rng)() << '\n';
    return false;
}
Now you can run this on a module like this (assuming you modified the hello world pass from the documentation):
opt -load ./libLLVMHelloPass.so -hello foo.bc -o bar.bc
Rerunning this, it will give you the same pseudo-random number.
The -rng-seed option becomes available to your pass once you include the header (and link against the LLVM support library, i.e. llvm-config --libfiles support). So, changing the above execution line to something like:
opt -load ./libLLVMHelloPass.so -hello -rng-seed 42 foo.bc -o bar.bc
should give a different sequence.
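To illustrate why reruns repeat and why a new seed changes the sequence, here is a minimal standard-library sketch. It is not LLVM's class: it assumes, from my reading of RandomNumberGenerator.cpp, that LLVM seeds a std::mt19937_64 from the -rng-seed value combined with a salt derived from the module and pass names. The function name makeSaltedRng and the salt string are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Builds a generator from a 64-bit seed plus a textual salt, so that two
// passes with different salts would get independent streams from one seed.
std::mt19937_64 makeSaltedRng(std::uint64_t seed, const std::string &salt) {
    std::vector<std::uint32_t> data(2 + salt.size());
    data[0] = static_cast<std::uint32_t>(seed);        // low 32 bits
    data[1] = static_cast<std::uint32_t>(seed >> 32);  // high 32 bits
    std::transform(salt.begin(), salt.end(), data.begin() + 2,
                   [](char c) {
                       return static_cast<std::uint32_t>(
                           static_cast<unsigned char>(c));
                   });
    std::seed_seq seq(data.begin(), data.end());
    std::mt19937_64 gen;
    gen.seed(seq);
    return gen;
}
```

Two generators built with the same seed and salt produce identical sequences, which matches the repeated output seen when rerunning opt without changing -rng-seed.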
Lastly, AFAIK, LLVM passes via opt are run sequentially in the context of a PassManager (certainly for the legacy one). I believe one should adhere to that advice when building a custom standalone LLVM tool using multi-threading (in other words, not intended to be run by opt). For relevant examples of standalone apps using the LLVM API have a look into the unit tests source subdir (one hint is to look for .cpp files that have a main(), although they are not always set up like that).

execute C++ from String variable

Is it possible in C++ to execute C++ code from a string variable?
Like in Javascript:
var theInstructions = "alert('Hello World'); var x = 100";
var F=new Function (theInstructions);
return(F());
I want something very similar to this JavaScript in C++. How can I do that?
No, C++ is a statically typed language that is compiled to a native binary.
That said, you could use LLVM's JIT to compile and link code without interrupting the running program. It should be doable, but it is not something C++ itself provides.
If you want a scripting engine under C++, you could use for example JS - it is by far the fastest dynamic solution out there. Lua, Python, Ruby are OK as well, but typically slower, which may not be a terrible thing considering you are just using it for scripting.
For example, in Qt you can do something like:
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    QScriptEngine engine;
    QScriptValue value = engine.evaluate("var a = 20; var b = 30; a + b");
    cout << value.toNumber();
    return a.exec();
}
And you will get 50 ;)
You will need to invoke a compiler to compile the code. In addition, you will need to generate some code to wrap the string in a function declaration. Finally, you'll then somehow need to load the compiled code.
If I were doing this (which I would not) I would:
Concatenate a standard wrapper function header around the code
Invoke a compiler via the command line (system()) to build a shared library (.dll on Windows or .so on Linux)
Load the shared library and map the function
Invoke the function
This is really not the way you want to write C code in most cases.
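The first two steps above can be sketched as follows. This is a minimal illustration only: the helper names, the extern "C" int wrapper signature, and the c++ -shared -fPIC command line (a GCC/Clang-style driver on Linux) are assumptions, not a recommended design.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Step 1: wrap the code string in a function with a known, unmangled name.
std::string wrapInFunction(const std::string &body, const std::string &name) {
    return "extern \"C\" int " + name + "() {\n" + body + "\n}\n";
}

// Step 2a: build the command line that produces a shared library.
std::string buildCompileCommand(const std::string &src, const std::string &lib) {
    return "c++ -shared -fPIC -o " + lib + " " + src;
}

// Step 2b: write the wrapped source to disk and invoke the compiler.
// Returns true if the file was written and the compiler exited with status 0.
bool compileSnippet(const std::string &body, const std::string &name,
                    const std::string &src, const std::string &lib) {
    std::ofstream out(src);
    if (!out) return false;
    out << wrapInFunction(body, name);
    out.close();
    return std::system(buildCompileCommand(src, lib).c_str()) == 0;
}
```

Loading the result and calling the function would then go through dlopen/dlsym on Linux or LoadLibrary/GetProcAddress on Windows, which is not shown here.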
Directly, no. But you can:
write that string to a file.
invoke the compiler and compile that file.
execute the resulting binary.
C++ is a compiled language. You compile C++ source into machine code, the executable. That is loaded and executed. The compiler knows about C++ (and has all the library headers available). The executable doesn't, and that is why it cannot turn a string into executable code. You can, indeed, execute the contents of a string if it happens to contain machine code instructions, but that is generally a very bad idea...
That doesn't mean that it wouldn't be possible to do this kind of run-time compilation. Very little (if, indeed, anything) is impossible in C++. But what you'd be doing would be implementing a C++ compiler object... look at other compiler projects before deciding you really want this.
Interpreted languages can do this with ease - they merely have to pass the string to the interpreter that is already running the program. They pay for this kind of flexibility in other regards.
You can use Cling as C++ interpreter.
I created a small CMake project for easier Cling integration: C++ as compile-time scripting language (https://github.com/derofim/cling-cmake)
Short answer is no. Hackers would have a field day. You can, however, use the Windows IActiveScriptSite interface to host JavaScript (JScript) or VBScript. Search for IActiveScriptSite; there are numerous examples on the web. Or you can do what I am currently doing and roll your own script engine.

llvm line number of an instruction

I want to get the line number of an instruction (and also of a variable declaration - alloca and global). The instruction is saved in an array of instructions. I have the function:
Constant* metadata::getLineNumber(Instruction* I){
    if (MDNode *N = I->getMetadata("dbg")) { // this if is never executed
        DILocation Loc(N);
        unsigned Line = Loc.getLineNumber();
        return ConstantInt::get(Type::getInt32Ty(I->getContext()), Line);
    }
    return NULL; // no debug metadata attached to this instruction
}
and in my main() I have :
errs()<<"\nLine number is "<<*metadata::getLineNumber(allocas[p]);
the result is NULL since I->getMetadata("dbg") is false.
Is there a possibility to enable dbg flags in LLVM without rebuilding the LLVM framework, like using a flag when compiling the target program or when running my pass (I used -debug) ?
Compiling a program with "-O3 -g" should give full debug information, but I still get the same result. I am aware of http://llvm.org/docs/SourceLevelDebugging.html , from which I can see that it is quite easy to take the source line number from a metadata field.
PS: for Allocas, it seems that I have to use findDbgDeclare method from DbgInfoPrinter.cpp.
Thank you in advance !
LLVM provides debugging information if you specify the -g flag to Clang. You don't need to rebuild LLVM to enable/disable it - any LLVM will do (including a pre-built one from binaries or binary packages).
The problem may be that you're trying to have debug information in highly optimized code (-O3). This is not necessarily possible, since LLVM simply optimizes some code away in such cases and there's not much meaning to debug information. LLVM tries to preserve debug info during optimizations, but it's not an easy task.
Start by generating unoptimized code with debug info (-O0 -g) and write your code/passes to work with that. Then graduate to optimized code, and try to examine what specifically gets lost. If you think that LLVM is being stupid, don't hesitate to open a bug.
Some random tips:
Generate IR from clang (-emit-llvm) and see the debug metadata nodes in it. Then you can run through opt with optimizations and see what remains.
The -debug option to llc and other LLVM tools is quite unrelated to debug info in the source.

Parsing and Modifying LLVM IR code

I want to read (parse) LLVM IR code (which is saved in a text file) and add some of my own code to it. I need an example of how this is done using the libraries LLVM provides for this purpose. So basically, what I want is to read the IR code from a text file into memory (perhaps the LLVM library represents it in AST form, I don't know), make modifications such as adding some more nodes to the AST, and then finally write the AST back to the IR text file.
Although I need to both read and modify the IR code, I would greatly appreciate if someone could provide or refer me to some example which just read (parses) it.
First, to fix an obvious misunderstanding: LLVM is a framework for manipulating code in IR format. There are no ASTs in sight (*) - you read IR, transform/manipulate/analyze it, and you write IR back.
Reading IR is really simple:
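#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>
using namespace llvm;
int main(int argc, char** argv)
{
    if (argc < 2) {
        errs() << "Expected an argument - IR file name\n";
        return 1;
    }
    // Older LLVM releases spelled these getGlobalContext() and ParseIRFile().
    LLVMContext Context;
    SMDiagnostic Err;
    std::unique_ptr<Module> Mod = parseIRFile(argv[1], Err, Context);
    if (!Mod) {
        Err.print(argv[0], errs());
        return 1;
    }
    [...]
}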
This code accepts a file name. This should be an LLVM IR file (textual). It then goes on to parse it into a Module, which represents a module of IR in LLVM's internal in-memory format. This can then be manipulated with the various passes LLVM has or you add on your own. Take a look at some examples in the LLVM code base (such as lib/Transforms/Hello/Hello.cpp) and read this - http://llvm.org/docs/WritingAnLLVMPass.html.
Spitting IR back into a file is even easier. The Module class just writes itself to a stream:
some_stream << *Mod;
That's it.
Now, if you have any specific questions about specific modifications you want to do to IR code, you should really ask something more focused. I hope this answer shows you how to parse IR and write it back.
(*) IR doesn't have an AST representation inside LLVM, because it's a simple assembly-like language. If you go one step up, to C or C++, you can use Clang to parse that into ASTs, and then do manipulations at the AST level. Clang then knows how to produce LLVM IR from its AST. However, you do have to start with C/C++ here, and not LLVM IR. If LLVM IR is all you care about, forget about ASTs.
This is usually done by implementing an LLVM pass/transform. This way you don't have to parse the IR at all because LLVM will do it for you and you will operate on a object-oriented in-memory representation of the IR.
This is the entry point for writing an LLVM pass. Then you can look at any of the already implemented standard passes that come bundled with LLVM (look into lib/Transforms).
The opt tool takes LLVM IR code, runs a pass on it, and then spits out transformed LLVM IR on the other side.
The easiest to start hacking is lib\Transforms\Hello\Hello.cpp. Hack it, run through opt with your source file as input, inspect output.
Apart from that, the docs for writing passes are really quite good.
The easiest way to do this is to look at one of the existing tools and steal code from it. In this case, you might want to look at the source for llc. It can take either a bitcode or .ll file as input. You can modify the input file any way you want and then write out the file using something similar to the code in llvm-dis if you want a text file.
As mentioned above, the best way is to write a pass. But if you simply want to iterate through the instructions and do something with them, LLVM provides an InstVisitor class. It implements the visitor pattern for instructions. It is very straightforward to use, so if you want to avoid learning how to implement a pass, you could resort to that.
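To show the shape of that pattern without pulling in LLVM, here is a toy sketch in plain C++. None of these types are LLVM's: Opcode, Inst, ToyInstVisitor, and LoadCounter are hypothetical stand-ins. LLVM's InstVisitor is likewise a template parameterised on the subclass, which overrides only the visit methods it cares about.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-ins for instruction kinds; not LLVM types.
enum class Opcode { Add, Load, Store };
struct Inst { Opcode op; };

// Base visitor: dispatches on the opcode and calls the subclass's handler.
// The default handlers do nothing, so subclasses override only what they need.
template <typename Derived>
struct ToyInstVisitor {
    void visitAdd(const Inst &) {}
    void visitLoad(const Inst &) {}
    void visitStore(const Inst &) {}

    void visit(const Inst &inst) {
        Derived &self = static_cast<Derived &>(*this);
        switch (inst.op) {
        case Opcode::Add:   self.visitAdd(inst);   break;
        case Opcode::Load:  self.visitLoad(inst);  break;
        case Opcode::Store: self.visitStore(inst); break;
        }
    }

    void visit(const std::vector<Inst> &insts) {
        for (const Inst &inst : insts) visit(inst);
    }
};

// Example subclass: counts load instructions and ignores everything else.
struct LoadCounter : ToyInstVisitor<LoadCounter> {
    std::size_t loads = 0;
    void visitLoad(const Inst &) { ++loads; }
};
```

Visiting a sequence with a LoadCounter leaves the number of loads in its loads member; with the real class, the equivalent would be overriding a handler such as visitLoadInst and calling visit on a function or module.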