Parsing and Modifying LLVM IR code - c++

I want to read (parse) LLVM IR code that is saved in a text file and add some of my own code to it. I need an example of how this is done using the libraries LLVM provides for this purpose. Basically, I want to read the IR code from a text file into memory (perhaps the LLVM library represents it in AST form, I don't know), make modifications such as adding more nodes to the AST, and finally write the AST back to the IR text file.
Although I need to both read and modify the IR code, I would greatly appreciate it if someone could provide or refer me to an example that just reads (parses) it.

First, to fix an obvious misunderstanding: LLVM is a framework for manipulating code in IR format. There are no ASTs in sight (*) - you read IR, transform/manipulate/analyze it, and you write IR back.
Reading IR is really simple:
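#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

using namespace llvm;

int main(int argc, char** argv)
{
    if (argc < 2) {
        errs() << "Expected an argument - IR file name\n";
        return 1;
    }

    // Older LLVM releases spelled these getGlobalContext() and ParseIRFile(),
    // with ParseIRFile() returning a raw Module*.
    LLVMContext Context;
    SMDiagnostic Err;
    std::unique_ptr<Module> Mod = parseIRFile(argv[1], Err, Context);
    if (!Mod) {
        // Report the parse error, prefixed with the program name.
        Err.print(argv[0], errs());
        return 1;
    }

    [...]
}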
This code accepts a file name, which should be a textual LLVM IR file. It parses the file into a Module, which represents a module of IR in LLVM's internal in-memory format. The Module can then be manipulated with the various passes LLVM ships or with passes you add yourself. Take a look at some examples in the LLVM code base (such as lib/Transforms/Hello/Hello.cpp) and read this - http://llvm.org/docs/WritingAnLLVMPass.html.
Spitting IR back into a file is even easier. The Module class just writes itself to a stream:
some_stream << *Mod;
That's it.
Now, if you have any specific questions about specific modifications you want to do to IR code, you should really ask something more focused. I hope this answer shows you how to parse IR and write it back.
(*) IR doesn't have an AST representation inside LLVM, because it's a simple assembly-like language. If you go one step up, to C or C++, you can use Clang to parse that into ASTs, and then do manipulations at the AST level. Clang then knows how to produce LLVM IR from its AST. However, you do have to start with C/C++ here, and not LLVM IR. If LLVM IR is all you care about, forget about ASTs.

This is usually done by implementing an LLVM pass/transform. This way you don't have to parse the IR at all, because LLVM will do it for you and you will operate on an object-oriented in-memory representation of the IR.
The documentation on writing an LLVM pass is the entry point. Then you can look at any of the already implemented standard passes that come bundled with LLVM (look into lib/Transforms).

The opt tool takes LLVM IR code, runs a pass on it, and then spits out transformed LLVM IR on the other side.
The easiest place to start hacking is lib\Transforms\Hello\Hello.cpp. Hack it, run it through opt with your source file as input, and inspect the output.
Apart from that, the docs for writing passes are really quite good.

The easiest way to do this is to look at one of the existing tools and steal code from it. In this case, you might want to look at the source for llc. It can take either a bitcode or .ll file as input. You can modify the input file any way you want and then write out the file using something similar to the code in llvm-dis if you want a text file.

As mentioned above, the best way is to write a pass. But if you simply want to iterate through the instructions and do something with them, LLVM provides an InstVisitor class. It implements the visitor pattern for instructions. It is very straightforward to use, so if you want to avoid learning how to implement a pass, you could resort to that.

Related

Compile a C++ function inside a C++ program

Consider the following problem,
A C++ program may emit source of a C++ function, for example, say it will create a string with contents as below:
std::vector<std::shared_ptr<C>> get_ptr_vec()
{
    std::vector<std::shared_ptr<C>> vec;
    vec.push_back(std::shared_ptr<C>(new C(val1)));
    vec.push_back(std::shared_ptr<C>(new C(val2)));
    vec.push_back(std::shared_ptr<C>(new C(val3)));
    vec.push_back(std::shared_ptr<C>(new C(val4)));
    return vec;
}
The values of val1 etc. will be determined at runtime, when the program creates the string containing the source above. This source will then be written to a file, say get_ptr_vec.cpp.
Then another C++ program will need to read this source file, and compile it, and call the get_ptr_vec function and get the object it returns. Kind of like a JIT compiler.
Is there any way I can do this? One workaround I can think of is a script that compiles the file and builds it into a shared library; the second program can then get the function through dlopen. However, is there any way to skip this and have the second program compile the file itself (without a call to system)? Note that the second program will not be able to see this source file at compile time. In fact, there will likely be thousands of such small source files emitted by the first program.
To give a little background, the first program builds a tree of expressions and serializes the tree by traversing it in postorder. Each node of the tree has a string representation written to the file. The second program reads the list of these serialized tree nodes and needs to be able to turn this list of strings into a list of C++ objects (from which I can later reconstruct the tree).
I think the LLVM framework may have something to offer here. Can someone give me some pointers on this? Not necessarily a full answer, just somewhere for me to start.
You can compile your generated code with clang and emit LLVM bitcode (the -emit-llvm flag). Then statically link your program with the parts of LLVM that read bitcode files and JIT-compile them. Finally, load the compiled bitcode and run the JIT on it, so that the functions become available in your program's address space.
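For the background case described in the question, where the files only carry a postorder list of tree nodes, it may be possible to rebuild the tree without compiling anything. The following is a minimal sketch under stated assumptions, not the asker's format: the Node type, the function name rebuild_from_postorder, and the record layout "<label> <child-count>" are all hypothetical. It shows the usual stack-based reconstruction, where each record pops as many finished subtrees as it has children.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical node type; the question's class C is not shown.
struct Node {
    std::string label;
    std::vector<std::unique_ptr<Node>> children;
};

// Assumed record format (not from the question): "<label> <child-count>",
// one record per node, listed in postorder.
std::unique_ptr<Node> rebuild_from_postorder(const std::vector<std::string>& records) {
    std::vector<std::unique_ptr<Node>> stack;
    for (const std::string& record : records) {
        std::istringstream in(record);
        auto node = std::make_unique<Node>();
        std::size_t arity = 0;
        in >> node->label >> arity;
        assert(arity <= stack.size() && "malformed postorder stream");

        // The last `arity` stack entries are this node's children, left to right.
        auto first = stack.end() - static_cast<std::ptrdiff_t>(arity);
        for (auto it = first; it != stack.end(); ++it)
            node->children.push_back(std::move(*it));
        stack.erase(first, stack.end());

        stack.push_back(std::move(node));
    }
    assert(stack.size() == 1 && "expected exactly one root");
    return std::move(stack.back());
}
```

Calling rebuild_from_postorder({"a 0", "b 0", "+ 2", "c 0", "* 2"}) yields a root labeled "*" whose children are the "+" subtree and the leaf "c". If the nodes really do need arbitrary C++ code, the bitcode-plus-JIT route above still applies.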

LLVM IR: Identifying Variables with Metadata Nodes

Currently I'm working on a tool which identifies load and store accesses on global and field variables on arbitrary programs. Furthermore, the accessed variables should be identified by their source level names/identifiers.
To accomplish this, I compile the source code of the program under diagnosis into LLVM IR with debug information. So far so good: the generated metadata nodes contain the desired source-level identifiers. However, I'm unable to connect some LLVM IR identifiers to the information in the metadata.
For example, consider a static member of a class:
class TestClass {
public:
    static int Number;
};
The corresponding LLVM IR looks like this:
@_ZN12TestClass6NumberE = external global i32, align 4
...
!15 = !DIDerivedType(tag: DW_TAG_member, name: "Number", scope: !"_ZTS12TestClass", file: !12, line: 5, baseType: !16, flags: DIFlagPublic | DIFlagStaticMember)
In this controlled example I know that "@_ZN12TestClass6NumberE" is an identifier for "Number". However, in general I fail to see how I can find out which IR identifiers correspond to which metadata.
Can somebody help me out?
Since no one seems to have a good solution to my problem, I will describe my own inconvenient approach. LLVM's generated metadata nodes contain information about the types and variables defined in the code. However, there is no information about which generated IR variables correspond to which source code variables. LLVM merely links the metadata of IR instructions with the corresponding source locations (lines and columns). This makes sense, since the main task of LLVM's metadata is not analysis but debugging.
Still, the contained information is not useless. My solution to this problem is to use the clang AST for the analysis of the source code. There we learn which variable is accessed at which source location. So, to get the source variable identities during LLVM IR instrumentation, we first map source locations to source variable identities during the clang AST analysis. As a second step, we perform the IR instrumentation using this previously gathered information. When we encounter a store or load instruction in the IR, we look up its source location in the instruction's metadata node. Since we have mapped source locations to source variable identities, we can now easily obtain the source variable identity of the IR instruction.
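The two-step mapping can be sketched as a small lookup table keyed by source location. This is only an illustration under assumptions: SourceLoc, VariableIndex, record, and lookup are hypothetical names, and the code that would fill the table from the clang AST or read locations from IR debug metadata is deliberately left out.

```cpp
#include <map>
#include <optional>
#include <string>
#include <tuple>

// Hypothetical key type: a source location as file, line, and column.
struct SourceLoc {
    std::string file;
    unsigned line;
    unsigned column;

    bool operator<(const SourceLoc& other) const {
        return std::tie(file, line, column) <
               std::tie(other.file, other.line, other.column);
    }
};

// Hypothetical index shared by the two steps described above.
class VariableIndex {
public:
    // Step 1: called while walking the clang AST, once per variable access.
    void record(const SourceLoc& loc, const std::string& qualified_name) {
        by_location_[loc] = qualified_name;
    }

    // Step 2: called during IR instrumentation with the location taken from
    // the debug metadata attached to a load or store instruction.
    std::optional<std::string> lookup(const SourceLoc& loc) const {
        auto it = by_location_.find(loc);
        if (it == by_location_.end())
            return std::nullopt;
        return it->second;
    }

private:
    std::map<SourceLoc, std::string> by_location_;
};
```

One limitation of keying on line and column alone is that two accesses sharing the same recorded location overwrite each other, so a real tool would probably need a finer key or a multimap.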
So, why do I not just use the clang AST to identify stores and loads on variables? Because distinguishing reads from writes in the AST is not a simple task. The AST can easily tell you that a variable is accessed, but whether the accessed variable is read or written depends on the operation. I would therefore have to consider every single operation/operator to determine whether the variable is written, read, or both. LLVM IR is much simpler and more low-level in this regard, and as such less error-prone. Furthermore, the actual instrumentation (that is, code insertion) is much more difficult in the AST than it is with LLVM. For these two reasons, I believe that a combination of clang AST analysis and LLVM IR instrumentation is the best solution for my problem.

How can I make the LLVM IR of a function available to my program?

I'm working on a library for which I'd like certain introspection features to be available. Let's say I'm compiling with clang, so I have access to libtooling or whatever.
What I'd like specifically is for someone to be able to view the LLVM IR of an already-compiled function as part of the program. I know that, when compiling, I can use -emit-llvm to get the IR. But that saves it to a file. What I'd like is for the LLVM IR to be embedded in and retrievable from the program itself -- e.g. my_function_object.llvm_ir()
Is such a thing possible? Thanks!
You're basically trying to add reflection to your program. Reflection requires the existence of metadata in your binary. This doesn't exist out of the box in LLVM, as far as I know.
To achieve an effect like this, you could create a global key-value dictionary in your program, exposed via an exported function - something like IRInstruction* retrieve_llvm_ir_stream(char* name).
This dictionary would map some kind of identifier (for example, the exported name) of a given function to an in-memory array that represents the IR stream of that function (each instruction represented as a custom IRInstruction struct, for example). The types and functions of the representation format (like the custom IRInstruction struct) will have to be included in your source.
At the step of the IR generation, this dictionary will be empty. Immediately after the IR generation step, you'll need to add a custom build step: open the IR file and populate the dictionary with the data - for each exported function of your program, inject its name as a key to the dictionary and its IR stream as a value. The IR stream would be generated from the definitions of your functions, as read by your custom build tool (which would leverage the LLVM API to read the generated IR and convert it to your format).
Then, proceed to the assembler and linker as before.
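A minimal sketch of the lookup side of such a dictionary follows. It makes two simplifying assumptions that differ from the description above: the IR is stored as plain text rather than as an array of custom IRInstruction structs, and the single table entry is hand-written placeholder IR rather than the output of a real build step. The names ir_table and retrieve_llvm_ir are hypothetical.

```cpp
#include <string_view>
#include <unordered_map>

// In the scheme above, this table would be generated by the custom build
// step. The entry below is hand-written placeholder IR for illustration.
static const std::unordered_map<std::string_view, std::string_view>& ir_table() {
    static const std::unordered_map<std::string_view, std::string_view> table = {
        {"add",
         "define i32 @add(i32 %a, i32 %b) {\n"
         "  %sum = add i32 %a, %b\n"
         "  ret i32 %sum\n"
         "}\n"},
    };
    return table;
}

// Returns the textual IR registered for `name`, or an empty view if the
// function is unknown.
std::string_view retrieve_llvm_ir(std::string_view name) {
    auto it = ir_table().find(name);
    return it == ir_table().end() ? std::string_view{} : it->second;
}
```

A caller would then write something like retrieve_llvm_ir("add") and print the result. Storing text keeps the sketch small; a structured format, as suggested above, would avoid re-parsing at runtime.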

Generate C++ style code using LLVM

I am looking into making a custom language for fun, mostly to learn how it works, but I am having a bit of trouble with concepts before I dig into code.
I have looked at Kaleidoscope example code and many other online resources, but I am confused on how to do a couple things:
My Goal
Translate my code into C++ code OR directly into a machine code with C++ style AST
Reason
Mostly to learn, but it would be great if I get this going well enough I may develop it further.
What is my Language?
My language is going to be specific to sql and database creation with emphasis on version control and caching strategies.
My problem
I am unsure of how to translate some of the information in my "language" to the C++ equivalent.
Example:
//An Integer type which is nullable and the default value of null.
int number = nullable;
Would translate to something like this...
public static sqlInt number = new sqlInt(true, null);
The problem I am having is, how would I generate the AST and LLVM code generation to recognize the "nullable" field as a new sqlInt without explicitly writing it out? And this would need to work for more complex types:
Example 2:
//Create a foreign key which is of type int, points to pkID
//forces reference and holds ZeroToMany records for a single pkID
//It is also nullable with a default value of 0.
FK<int>(tableName.pkID, true, ZTM) fk1 = nullable(0);
Would translate to something like this:
public static FK<sqlInt> fk1 = new FK<sqlInt>(tableName.pkID, true, ZTM, true, 0);
The question remains: would I have to build the AST in a special way? If so, what would I have to do to make this possible? Or would this be specific to LLVM?
I can't seem to find an example of a llvm language similar to this style.
I don't have any actual code as of now, I am simply gathering information and I can't seem to figure this part out from the code I have looked at.
Edit
I understand (mostly) how to make a parser and lexer to find the function and assign it to a variable, but I am unsure at what point I should derive the function that declares the sqlInt, how to find the correct params, etc. Does this happen during code generation, after the LLVM IR? Or should I account for it before the LLVM IR?
If you're using LLVM, you're going to want to translate from your language to LLVM IR rather than to the C++ AST.
The process of going from the text of your source language to IR is lexing, parsing, semantic analysis, and lowering.
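As a rough illustration of where the "nullable" keyword gets resolved, here is a toy sketch of those stages for the single declaration form from the question. Everything in it is an assumption made for illustration: the VarDecl node, the function names parse_decl and lower_to_ir, and especially the choice to represent a nullable int as a { i1, i32 } pair (null flag plus payload) in the emitted textual IR. A real front end would build the IR through LLVM's APIs rather than by concatenating strings.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical AST node for one declaration in the question's language.
struct VarDecl {
    std::string type;       // e.g. "int"
    std::string name;       // e.g. "number"
    bool nullable = false;  // set when the initializer is the keyword `nullable`
};

// Lexing and parsing for the single form "<type> <name> = nullable;".
std::optional<VarDecl> parse_decl(const std::string& source) {
    // Lexing: pad '=' and ';' with spaces so whitespace splitting yields tokens.
    std::string spaced;
    for (char c : source) {
        if (c == '=' || c == ';') {
            spaced += ' ';
            spaced += c;
            spaced += ' ';
        } else {
            spaced += c;
        }
    }

    // Parsing: expect exactly <type> <name> = nullable ;
    std::istringstream tokens(spaced);
    VarDecl decl;
    std::string eq, init, semi;
    if (!(tokens >> decl.type >> decl.name >> eq >> init >> semi))
        return std::nullopt;
    if (eq != "=" || init != "nullable" || semi != ";")
        return std::nullopt;
    decl.nullable = true;
    return decl;
}

// Semantic analysis and lowering: the AST only records that the variable is
// nullable; the concrete representation is chosen here.
// Assumed layout: { i1 is_null, i32 payload }.
std::optional<std::string> lower_to_ir(const VarDecl& decl) {
    if (decl.type != "int")
        return std::nullopt;  // only `int` is handled in this sketch
    return "@" + decl.name + " = global { i1, i32 } { i1 " +
           (decl.nullable ? "true" : "false") + ", i32 0 }";
}
```

The point of the sketch is the division of labor: the parser does not need to know about sqlInt-like wrappers, and the decision of how a nullable value is laid out is made before or while emitting IR, not after it.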

Generate runnable LLVM IR from Julia script?

I am wondering how to convert Julia code into runnable LLVM IR (the *.ll file).
There is a command named code_llvm which can compile a Julia function into LLVM IR. But its result contains something like %jl_value_t*, which seems to be a (hidden?) object type, and it doesn't look like pure LLVM IR.
Is there a way to generate runnable LLVM IR from Julia, so that I can run it with lli xx.ll (or do something else)?
The code_llvm function just shows the Function by default, but you can also have it print out a complete module:
open("file.ll", "w") do io
    code_llvm(io, +, (Int, Int); raw=true, dump_module=true, optimize=true)
end
This output (file.ll) is now valid to use with other llvm tools, such as llc and opt. However, since it's just the code for this one function, and assumes the existence of all the other code and data, it's not necessarily going to work with lli, so buyer beware.
If you want a complete system, you might be interested in the --output-bc flag to Julia, which will dump a complete object file in LLVM format. This is used extensively internally to build and bootstrap Julia. It's also wrapped into a utility tool at https://github.com/JuliaLang/PackageCompiler.jl to automate some of these steps.