LLVM IR: Identifying Variables with Metadata Nodes - c++

Currently I'm working on a tool which identifies load and store accesses on global and field variables on arbitrary programs. Furthermore, the accessed variables should be identified by their source level names/identifiers.
In order to accomplish this I compile the source code of the program under diagnosis into LLVM IR with debug information. So far so good, the generated Metadata Nodes contain the desired source level identifiers. However, I'm unable to draw connections to some LLVM IR identifiers and the information in the meta data.
For example, consider a satic member of a class:
class TestClass {
public:
static int Number;
};
The corresponding LLVM IR looks like this:
#_ZN12TestClass6NumberE = external global i32, align 4
...
!15 = !DIDerivedType(tag: DW_TAG_member, name: "Number", scope: !"_ZTS12TestClass", file: !12, line: 5, baseType: !16, flags: DIFlagPublic | DIFlagStaticMember)
In this controlled example I know that "#_ZN12TestClass6NumberE" is an identifier for "Number". However, in general I fail to see how I can find out which IR identifiers correspond to which meta data.
Can somebody help me out?

Since no one seems to have a good solution to my problem I will tell my own inconvient approach to handle this problem. LLVM's generated MetaData nodes contain information about the defined types and variables of the code. However, there is no information about which generated IR variables correspond to which source code variables. LLVM merely links metadata information of IR instructions with correspdoning source locations (lines and columns). This makes sense, since the main task of LLVMs metadata is not analysis but debugging.
Still, the contained information is not useless. My solution to this problems is to use the clang AST for the analysis of the source code. Here we gain information about which variable is accessed at which source location. So, in order to get information about the source variable identities during LLVM IR instrumentation, we just need to map the source locations to source variable identites during the clang AST analysis. As a second step we perform the IR instrumentation by using our previously gathered information. When we encounter a store or load instruction in the IR, we search in the metadata node of this instruction for its corresponding source location. Since we have mapped source locations to source variable identities, we can now easily access the source variable identity of the IR instruction.
So, why do I not just use clang AST for identifying stores and loads on variables? Because distinguishing reads and writes in the AST is not a simple task. The AST can easily tell you that a variable is accessed but it depends on the operation whether the accessed variable is read or written. So, I would have to consider every single operation/operator to determine whether the variable is written/read or both. LLVM is much simpler, more low-level, in this regard and as such less error-prone. Furthermore, actual instrumentation (speaking code insertion) is much more difficult in the AST as it is with LLVM. Because of these two reasons, I believe that a combination of clang AST and LLVM IR instrumentation is the best solution for my problem.

Related

How can I disassemble the result of LLVM MCJIT compilation?

I have a program I wrote which uses LLVM 3.5 as a JIT compiler, which I'm trying to update to use MCJIT in LLVM 3.7. I have it mostly working, but I'm struggling to reproduce one debug-only feature I implemented with LLVM 3.5.
I would like to be able to see the host machine code (e.g. x86, x64 or ARM, not LLVM IR) generated by the JIT process; in debug builds I log this out as my program is running. With LLVM 3.5 I was able to do this by invoking ExecutionEngine::runJITOnFunction() to fill in a llvm::MachineCodeInfo object, which gave me the start address and size of the generated code. I could then disassemble that code.
I can't seem to find any equivalent in MCJIT. I can get the start address of the function (e.g. via getPointerToFunction()) but not the size.
I have seen Disassemble Memory but apart from not having that much detail in the answers, it seems to be more about how to disassemble a sequence of bytes. I know how to do that, my question is: how can I get hold of the sequence of bytes in the first place?
If it would help to make this more concrete, please reinterpret this question as: "How can I extend the example Kaleidoscope JIT to show the machine code (x86, ARM, etc) it produces, not just the LLVM IR?"
Thanks.
You have at least two options here.
Supply your own memory manager. This must be well documented and is done in many projects using MCJIT. But for the sake of completeness here's the code:
class MCJITMemoryManager : public llvm::RTDyldMemoryManager {
public:
static std::unique_ptr<MCJITMemoryManager> Create();
MCJITMemoryManager();
virtual ~MCJITMemoryManager();
// Allocate a memory block of (at least) the given size suitable for
// executable code. The section_id is a unique identifier assigned by the
// MCJIT engine, and optionally recorded by the memory manager to access a
// loaded section.
byte* allocateCodeSection(uintptr_t size, unsigned alignment,
unsigned section_id,
llvm::StringRef section_name) override;
// Allocate a memory block of (at least) the given size suitable for data.
// The SectionID is a unique identifier assigned by the JIT engine, and
// optionally recorded by the memory manager to access a loaded section.
byte* allocateDataSection(uintptr_t size, unsigned alignment,
unsigned section_id, llvm::StringRef section_name,
bool is_readonly) override;
...
}
Pass a memory manager instance to EngineBuilder:
std::unique_ptr<MCJITMemoryManager> manager = MCJITMemoryManager::Create();
llvm::ExecutionEngine* raw = lvm::EngineBuilder(std::move(module))
.setMCJITMemoryManager(std::move(manager))
...
.create();
Now via these callbacks you have control over the memory where the code gets emitted. (And the size is passed directly to your method). Simply remember the address of the buffer you allocated for code section and, stop the program in gdb and disassemble the memory (or dump it somewhere or even use LLVM's disassembler).
Just use llc on your LLVM IR with appropriate options (optimization level, etc.). As I see it, MCJIT is called so for a reason and that reason is that it reuses the existing code generation modules (same as llc).
Include the following header llvm/Object/SymbolSize.h, and use the function llvm::object::computeSymbolSizes(ObjectFile&). You will need to get an instance of the ObjectFile somehow.
To get that instance, here is what you could do:
Declare a class that is called to convert a Module to an ObjectFile, something like:
class ModuleToObjectFileCompiler {
...
// Compile a Module to an ObjectFile.
llvm::object::OwningBinary<llvm::object::ObjectFile>
operator() (llvm::Module&);
};
To implement the operator() of ModuleToObjectFileCompiler, take a look at llvm/ExecutionEngine/Orc/CompileUtils.h where the class SimpleCompiler is defined.
Provide an instance of ModuleToObjectFileCompiler to an instance of llvm::orc::IRCompileLayer, for instance:
new llvm::orc::IRCompileLayer
<llvm::orc::ObjectLinkingLayer
<llvm::orc::DoNothingOnNotifyLoaded> >
(_object_layer, _module_to_object_file);
The operator() of ModuleToObjectFileCompiler receives the instance of ObjectFile which you can provide to computeSymbolSizes(). Then check the returned std::vector to find out the sizes in bytes of all symbols defined in that Module. Save the information for the symbols you are interested in. And that's all.

How can I make the LLVM IR of a function available to my program?

I'm working on a library which I'd like certain introspection features to be available. Let's say I'm compiling with clang, so I have access to libtooling or whatever.
What I'd like specifically is for someone to be able to view the LLVM IR of an already-compiled function as part of the program. I know that, when compiling, I can use -emit-llvm to get the IR. But that saves it to a file. What I'd like is for the LLVM IR to be embedded in and retrievable from the program itself -- e.g. my_function_object.llvm_ir()
Is such a thing possible? Thanks!
You're basically trying to have reflection to your program. Reflection requires the existence of metadata in your binary. This doesn't exist out of the box in LLVM, as far as I know.
To achieve an effect like this, you could create a global key-value dictionary in your program, exposed via an exported function - something like IRInstruction* retrieve_llvm_ir_stream(char* name).
This dictionary would map some kind of identifier (for example, the exported name) of a given function to an in-memory array that represents the IR stream of that function (each instruction represented as a custom IRInstruction struct, for example). The types and functions of the representation format (like the custom IRInstruction struct) will have to be included in your source.
At the step of the IR generation, this dictionary will be empty. Immediately after the IR generation step, you'll need to add a custom build step: open the IR file and populate the dictionary with the data - for each exported function of your program, inject its name as a key to the dictionary and its IR stream as a value. The IR stream would be generated from the definitions of your functions, as read by your custom build tool (which would leverage the LLVM API to read the generated IR and convert it to your format).
Then, proceed to the assembler and linker as before.

Generate C++ style code using LLVM

I am looking into making a custom language for fun, mostly to learn how it works, but I am having a bit of trouble with concepts before I dig into code.
I have looked at Kaleidoscope example code and many other online resources, but I am confused on how to do a couple things:
My Goal
Translate my code into C++ code OR directly into a machine code with C++ style AST
Reason
Mostly to learn, but it would be great if I get this going well enough I may develop it further.
What is my Language?
My language is going to be specific to sql and database creation with emphasis on version control and caching strategies.
My problem
I am unsure of how to translate some of the information in my "language" to the C++ equivalent.
Example:
//An Integer type which is nullable and the default value of null.
int number = nullable;
Would translate to something like this...
public static sqlInt number = new sqlInt(true, null);
The problem I am having is, how would I generate the AST and LLVM code generation to recognize the "nullable" field as a new sqlInt without explicitly writing it out? And this would need to work for more complex types:
Example 2:
//Create a foreign key which is of type int, points to pkID
//forces reference and holds ZeroToMany records for a single pkID
//It is also nullable with a default value of 0.
FK<int>(tableName.pkID, true, ZTM) fk1 = nullable(0);
Would translate to something like this:
public static FK<sqlInt> fk1 = new FK<sqlInt>(tableName.pkID, true,
ZTM, true, 0);
The question remains, would I have to build the AST special? if so what would I have to do to make this possible? Or would this be specific to LLVM?
I can't seem to find an example of a llvm language similar to this style.
I don't have any actual code as of now, I am simply gathering information and I can't seem to figure this part out from the code I have looked at.
Edit
I understand (mostly) how to make a parser and lexer to find the function and assign it to a variable, but I am unsure when I should derive the function to declare the sqlInt and how to find the correct params...etc. Is this during the code-generation after the LLVM IR? should I account for this before the LLVM IR?
If you're using LLVM you're going to want to translate from your language to LLVM IR rather than to the c++ ast.
The process of going from the text of your source language to IR is lexing, parsing, semantic analysis, and lowering.

Available Analysis and Transform passes for LLVM

Is there any document on the list of Analysis and Transform passes available for use in the AnalysisUsage::addRequired<> and Pass::geAnalysis<> functions?
I can get a list of passes in http://llvm.org/docs/Passes.html, but it only shows the command line names for the passes. How can I know the underlying pass classes?
Not really, no. Just look at the source. The header files in include/llvm/Analysis/ and include/llvm/Transforms/ will tell you everything you need to know.
Moreover, grepping over the source for getAnalysis< will tell you which passes are used as analyses inside the LLVM source code.

Parsing and Modifying LLVM IR code

I want to read (parse) LLVM IR code (which is saved in a text file) and add some of my own code to it. I need some example of doing this, that is, how this is done by using the libraries provided by LLVM for this purpose. So basically what I want is to read in the IR code from a text file into the memory (perhaps the LLVM library represents it in AST form, I dont know), make modifications, like adding some more nodes in the AST and then finally write back the AST in the IR text file.
Although I need to both read and modify the IR code, I would greatly appreciate if someone could provide or refer me to some example which just read (parses) it.
First, to fix an obvious misunderstanding: LLVM is a framework for manipulating code in IR format. There are no ASTs in sight (*) - you read IR, transform/manipulate/analyze it, and you write IR back.
Reading IR is really simple:
int main(int argc, char** argv)
{
if (argc < 2) {
errs() << "Expected an argument - IR file name\n";
exit(1);
}
LLVMContext &Context = getGlobalContext();
SMDiagnostic Err;
Module *Mod = ParseIRFile(argv[1], Err, Context);
if (!Mod) {
Err.print(argv[0], errs());
return 1;
}
[...]
}
This code accepts a file name. This should be an LLVM IR file (textual). It then goes on to parse it into a Module, which represents a module of IR in LLVM's internal in-memory format. This can then be manipulated with the various passes LLVM has or you add on your own. Take a look at some examples in the LLVM code base (such as lib/Transforms/Hello/Hello.cpp) and read this - http://llvm.org/docs/WritingAnLLVMPass.html.
Spitting IR back into a file is even easier. The Module class just writes itself to a stream:
some_stream << *Mod;
That's it.
Now, if you have any specific questions about specific modifications you want to do to IR code, you should really ask something more focused. I hope this answer shows you how to parse IR and write it back.
(*) IR doesn't have an AST representation inside LLVM, because it's a simple assembly-like language. If you go one step up, to C or C++, you can use Clang to parse that into ASTs, and then do manipulations at the AST level. Clang then knows how to produce LLVM IR from its AST. However, you do have to start with C/C++ here, and not LLVM IR. If LLVM IR is all you care about, forget about ASTs.
This is usually done by implementing an LLVM pass/transform. This way you don't have to parse the IR at all because LLVM will do it for you and you will operate on a object-oriented in-memory representation of the IR.
This is the entry point for writing an LLVM pass. Then you can look at any of the already implemented standard passes that come bundled with LLVM (look into lib/Transforms).
The Opt tool takes llvm IR code, runs a pass on it, and then spits out transformed llvm IR on the other side.
The easiest to start hacking is lib\Transforms\Hello\Hello.cpp. Hack it, run through opt with your source file as input, inspect output.
Apart from that, the docs for writing passes is really quite good.
The easiest way to do this is to look at one of the existing tools and steal code from it. In this case, you might want to look at the source for llc. It can take either a bitcode or .ll file as input. You can modify the input file any way you want and then write out the file using something similar to the code in llvm-dis if you want a text file.
As mentioned above the best way it to write a pass. But if you want to simply iterate through the instructions and do something with the LLVM provided an InstVisitor class. It is a class that implements the visitor pattern for instructions. It is very straight forward to user, so if you want to avoid learning how to implement a pass, you could resort to that.