I need to use abstract interpretation to do some analysis using LLVM.
Is this possible, or should I use some other analysis tool that makes this easier?
If I can do it with LLVM, which classes would help me model the statements of the original source code so that I can derive the relations between variables (and the possible range of values for each variable)?
You can have a look at KLEE, a symbolic execution engine for LLVM bitcode: https://github.com/klee
If you are using the interval domain for your analysis, you can use the ConstantRange class to represent intervals; it implements the arithmetic operations on ranges for you. With the debug metadata and some additional bookkeeping, you can get the relations between variables. See this answer.
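For a taste of the API, here is a minimal sketch of interval arithmetic with ConstantRange; the ranges themselves are made up:

#include "llvm/ADT/APInt.h"
#include "llvm/IR/ConstantRange.h"
using namespace llvm;

// Model two 32-bit variables as intervals; ConstantRange is half-open,
// so [1, 10) covers the concrete values 1..9.
ConstantRange A(APInt(32, 1), APInt(32, 10));
ConstantRange B(APInt(32, 5), APInt(32, 7));   // values 5..6

// The arithmetic on ranges is done for you: Sum over-approximates
// every concrete a + b with a taken from A and b from B.
ConstantRange Sum = A.add(B);                  // [6, 16)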
You can have a look at the Pagai static analyzer, which computes invariants on LLVM bitcode using state-of-the-art abstract interpretation techniques and can instrument a .bc file with the obtained invariants, to be used by your tool.
http://pagai.forge.imag.fr
I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has an llvm::Linker class that I can use during compilation to combine source files (llvm::Module); however, I want to guarantee library compatibility between separate compilation runs, via metadata version numbers or at least via the publicly exposed interface.
Much of the information available on metadata in LLVM suggests that it should only be used for extended information that would not break correctness when silently removed.
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know that IRReader has a parseIRFile function, so I can load previously built .bc files. I would be curious whether it would be reasonable practice to include size and CRC information for comparison when loading these files.
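For concreteness, here is a rough sketch of the scheme I have in mind; the metadata name and version string are made up:

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// When emitting a library, stamp it with a version string stored as
// named module-level metadata.
void stampVersion(Module &M, StringRef Version) {
  NamedMDNode *NMD = M.getOrInsertNamedMetadata("mylang.abi.version");
  NMD->addOperand(MDNode::get(M.getContext(),
                              MDString::get(M.getContext(), Version)));
}

// When importing, load the .bc file and reject incompatible versions.
std::unique_ptr<Module> loadLibrary(StringRef Path, StringRef Expected,
                                    LLVMContext &Ctx) {
  SMDiagnostic Err;
  std::unique_ptr<Module> M = parseIRFile(Path, Err, Ctx);
  if (!M) {
    Err.print("mylang", errs());
    return nullptr;
  }
  NamedMDNode *NMD = M->getNamedMetadata("mylang.abi.version");
  if (!NMD || NMD->getNumOperands() == 0 ||
      cast<MDString>(NMD->getOperand(0)->getOperand(0))->getString() !=
          Expected) {
    errs() << Path << ": library version mismatch\n";
    return nullptr;
  }
  return M;
}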
My language has concepts similar to C#, including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language-specific information in the interface without needing to encode it in the IR, as both the library and the calling code would be required to build against the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters, so there is no predetermined function parameter order. This allows call sites to be more explicit, lets the compiler catch erroneous parameter usage, and gives authors more liberty in choosing default parameters, as they are not restricted to the last parameters of the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Is version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units", which are meant to be separately compiled objects, but currently what I do is simply drag in the source file and compile it into the main llvm::Module. That's neither efficient nor flexible (I can't use the linker to choose between the "Linux" and "Windows" version of some code, for example; not that I think there is a 5% chance my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form; then:
- you don't depend on LLVM versions changing the IR format;
- you can add whatever metadata you like;
- there won't be much difference between generating LLVM IR from this during your link phase and building the IR at compile time and then reading the IR back to check whether the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way.]
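For example, a serialized unit could carry a small header along these lines, which would also cover the size/CRC idea from the question; every field here is illustrative:

#include <cstdint>

// Hypothetical header for a serialized-AST "unit" file; all fields
// are made up for illustration.
struct UnitHeader {
  uint32_t Magic;        // identifies the file format
  uint16_t FormatMajor;  // bump on incompatible serialization changes
  uint16_t FormatMinor;  // bump on backwards-compatible additions
  uint64_t PayloadSize;  // size of the serialized AST that follows
  uint32_t PayloadCRC;   // checksum of the payload, per the question
};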
As I said at the start, I'm not sure this is an answer, but these are my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...
Is it possible to add comments into a BasicBlock? I only want a few comments that help me when I print out the IR for debugging. That is, I fully expect them to be lost once I pass the IR to the optimizer.
No, it's not possible directly. Comments, by which you probably mean the lexical elements beginning with a semicolon (;) in the textual IR representation, have no representation in the in-memory IR (and binary bitcode). As you probably know, LLVM IR has three equivalent representations (in memory API level, textual "assembly" level, binary bitcode level). Once the LLVM assembly IR parser reads the code into memory, comments are lost.
What you could do, however, is use metadata for this purpose. You can create arbitrary metadata attached to any instruction, as well as global module-level metadata. This is a hack, for sure, but if you really think you need some sort of annotation, metadata is the way. LLVM uses metadata for a number of annotation needs, like debug info and alias analysis annotations.
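A minimal sketch of that hack, assuming all you want is a free-form string attached to an instruction (the metadata kind name is made up):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
using namespace llvm;

// Attach a free-form note to an instruction as custom metadata; it
// prints as !my.comment !N after the instruction in the textual IR.
void annotate(Instruction *I, StringRef Note) {
  LLVMContext &Ctx = I->getContext();
  I->setMetadata("my.comment",
                 MDNode::get(Ctx, MDString::get(Ctx, Note)));
}

Note that optimization passes are free to drop unknown metadata, which matches your expectation that the annotations are lost once you run the optimizer.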
I'm looking into generating a call graph for the Linux kernel that would include function pointers (see my previous question, Static call graph generation for the Linux kernel, for more information). I've been told LLVM should be suitable for this purpose; however, I was unable to find the relevant information on llvm.org.
Any help, including pointers to relevant documentation, would be appreciated.
First, you have to compile your kernel into LLVM IR (instead of native object files). Then, using llvm-ld (llvm-link in newer LLVM releases), combine all the IR object files into a single large module. It could be quite tricky (you'll have to modify the makefiles heavily), but I believe it is doable.
Now you can do your analysis. A simple call graph can be generated using the opt tool with the -dot-callgraph pass. It is unlikely to handle function pointers, so you may want to modify it; the sketch below shows a starting point.
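A minimal sketch, assuming you want to locate the indirect calls yourself with the C++ API (the function name is made up):

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Print direct call edges, and flag calls through function pointers,
// which a simple call-graph pass will not resolve.
void dumpCallEdges(Module &M) {
  for (Function &F : M)
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (auto *CI = dyn_cast<CallInst>(&I)) {
          if (Function *Callee = CI->getCalledFunction())
            errs() << F.getName() << " -> " << Callee->getName() << "\n";
          else
            errs() << F.getName() << " -> [indirect call]\n";
        }
}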
Tracking all the possible data-flow paths that could carry your function pointers is quite a challenge, and in the general case it is impossible (if there are any pointer-to-integer casts, if pointers are stored in complicated data structures, etc.). For the majority of specific cases, you can try to implement a global abstract interpretation to approximate all the possible data-flow paths for your pointers. It would not be precise, of course, but at least you'll get a conservative approximation.
I'm looking to get an AST for C++ that I can then parse with an external program. What programs are out there that are good for generating an AST for C++? I don't care what language it is implemented in or the output format (so long as it is readily parseable).
My overall goal is to transform a C++ unit test bed to its corresponding C# wrapper test bed.
You can use clang, and especially libclang, to parse C++ code. It's a very high-quality, hand-written library for lexing, parsing and compiling C++ code, and it can also give you the AST.
Clang also supports C, Objective-C and Objective-C++. Clang itself is written in C++.
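For example, a minimal libclang driver might look like this; it parses one file and prints every AST node's kind and spelling (error handling omitted):

#include <clang-c/Index.h>
#include <cstdio>

// Parse argv[1] and dump each cursor (AST node) as "kind: name".
int main(int argc, char **argv) {
  if (argc < 2)
    return 1;
  CXIndex Index = clang_createIndex(0, 0);
  CXTranslationUnit TU = clang_parseTranslationUnit(
      Index, argv[1], nullptr, 0, nullptr, 0, CXTranslationUnit_None);
  clang_visitChildren(
      clang_getTranslationUnitCursor(TU),
      [](CXCursor C, CXCursor, CXClientData) {
        CXString Kind = clang_getCursorKindSpelling(clang_getCursorKind(C));
        CXString Name = clang_getCursorSpelling(C);
        std::printf("%s: %s\n", clang_getCString(Kind),
                    clang_getCString(Name));
        clang_disposeString(Kind);
        clang_disposeString(Name);
        return CXChildVisit_Recurse;
      },
      nullptr);
  clang_disposeTranslationUnit(TU);
  clang_disposeIndex(Index);
  return 0;
}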
Actually, GCC will emit the AST at any stage in the pipeline that interests you, including the GENERIC and GIMPLE forms. Check out the (plethora of) command-line switches beginning with -fdump-, e.g. -fdump-tree-original-raw.
This is one of the easier (…) ways to work, as you can use it on arbitrary code; just pass the appropriate CFLAGS or CXXFLAGS into most Makefiles:
make CXXFLAGS=-fdump-tree-original-raw all
… and you get “the works.”
Updated: Saw this neat little graphing system based on GCC ASTs while checking my flag name :-) Google FTW.
http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast
Our C++ Front End, built on top of our DMS Software Reengineering Toolkit, can parse a variety of C++ dialects (including C++11 and Objective-C) and export the AST as an XML document with a command-line switch. See example ASTs produced by this front end.
As a practical matter, you will need more than the AST; you can't really do much with C++ (or any other modern language) without an understanding of the meaning and scope of each identifier. For C++, meaning/scope are particularly ugly. The DMS C++ front end handles all of that; it can build full symbol tables associating identifiers with explicit C++ types. That information isn't dumpable in XML with a command-line switch, but it is "technically easy" to code logic in DMS to walk the symbol table and spit out XML. (There is an option to dump this information, just not in XML format.)
I caution you against the idea of manipulating (or even just analyzing) the XML. First, XSLT isn't a particularly good way to understand the meaning of the ASTs, let alone transform them, because the ASTs represent context-sensitive language structures (that's why you want [nay, MUST HAVE] the symbol table). You can read the XML into a DOM-like tree if you like and write your own procedural code to manipulate it. But source-to-source transformations are an easier way; you can write your transformations using C++ notation rather than buckets of code goo climbing over a tree data structure.
You'll have another problem: how to generate valid C++ code from the transformed XML. If you don't mind spitting out raw text, you can solve this in purely ad hoc ways, at the price of having no guarantee, other than sweat, that the generated code is syntactically valid. If you want to generate a C++ representation of your final result as an AST and regenerate valid text from that, you'll need a prettyprinter, which is not technically hard but is still a lot of work to build, especially for a language as big as C++.
Finally, the reason that tools like DMS exist is to provide the vast amount of infrastructure it takes to process/manipulate complex structures such as C++ ASTs (parse, analyse, transform, prettyprint). You can try to replicate all this machinery yourself, but that is usually a poor time/cost/productivity tradeoff. The claim is that it is best to stay within the tool ecosystem rather than escape it and build bad versions of it yourself. If you haven't done this before, you'll find this out painfully.
FWIW, DMS has been used to carry out massive analysis and transformations on C++ source code. See Publications on DMS and check the papers by Akers on "Re-engineering C++ Component Models".
Clang is based on the same kind of philosophy; there's an ecosystem of tools.
YMMV, but I'd be surprised.
We want to do some fairly simple analysis of users' C++ code and then use that information to instrument their code (basically regenerate their code with a bit of instrumentation code added) so that the user can run a dynamic analysis of their code and get stats on things like the ranges of values of certain numeric types.
clang should be able to cope with enough C++ by now to handle the kind of code our users would be throwing at it, and since clang's C++ coverage is continuously improving, by the time we're done it'll be even better.
So how does one go about using clang like this as a standalone parser? We're thinking we could just generate an AST and then walk it looking for objects of the classes we're interested in tracking. Would be interested in hearing from others who are using clang without LLVM.
clang is designed to be modular. Quoting from its page:
A major design concept for clang is its use of a library-based architecture. In this design, various parts of the front-end can be cleanly divided into separate libraries which can then be mixed up for different needs and uses.
Look at clang libraries like libAST for your needs. Read more here.
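If you take that route, here is a rough sketch using clang's RecursiveASTVisitor to walk the AST and pick out the class declarations you want to track, assuming a reasonably recent clang with libTooling:

#include "clang/AST/ASTConsumer.h"
#include "clang/AST/ASTContext.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>
using namespace clang;

// Print the name of every class/struct defined in the translation unit.
class RecordVisitor : public RecursiveASTVisitor<RecordVisitor> {
public:
  bool VisitCXXRecordDecl(CXXRecordDecl *D) {
    if (D->isThisDeclarationADefinition())
      llvm::outs() << "record: " << D->getQualifiedNameAsString() << "\n";
    return true; // keep traversing
  }
};

class RecordConsumer : public ASTConsumer {
  RecordVisitor Visitor;
public:
  void HandleTranslationUnit(ASTContext &Ctx) override {
    Visitor.TraverseDecl(Ctx.getTranslationUnitDecl());
  }
};

class RecordAction : public ASTFrontendAction {
public:
  std::unique_ptr<ASTConsumer>
  CreateASTConsumer(CompilerInstance &, StringRef) override {
    return std::make_unique<RecordConsumer>();
  }
};

int main() {
  // Parse a snippet in memory; a real tool would run this over files.
  tooling::runToolOnCode(std::make_unique<RecordAction>(),
                         "struct Foo { int x; }; class Bar : Foo {};");
  return 0;
}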
What you didn't indicate is what kind of "analyses" you want to do. Most C++ analyses require accurate symbol table data, so that when you encounter a symbol foo you have some idea what it is. (You technically don't even know what + is without such a symbol table!) You also need general type information: if you have an expression "a*b", what is the type of the result? Having "name and type" information is key to almost anything you want to do for analysis.
If you insist on clang, then there are other answers here. I don't know if it provides name and type resolution.
If you need name and type resolution, then another solution would be the DMS Software Reengineering Toolkit. DMS provides generic compiler-like infrastructure for parsing, analyzing, transforming, and un-parsing (regenerating source code from the compiler data structures). DMS's industrial-strength C++ front end (it has many other language front ends, too) provides full name and type resolution according to the ANSI standard, as well as the GCC and MS VC++ dialects.
Code transformations can be implemented via an abstract-syntax tree interface provided by DMS, or by pattern-directed program transformation rules written in the surface syntax of your target language (in this case, C++). Here's a simple transformation using the rule language:
domain Cpp~GCC3; -- says we want patterns for C++ in the GCC3 dialect
rule optimize_to_increment(lhs:left_hand_side):expression -> expression
" \lhs = \lhs + 1 " -> " \lhs++" if no_side_effects(lhs).
This implicitly operates on the ASTs built by DMS to modify them. The conditional allows you to inquire about arbitrary properties of pattern variables (in this case, lhs), including name and type constraints if you wish.
DMS has been used many times for very sophisticated program analysis and transformation of C++ code. We build C++ test coverage tools by instrumenting C++ code in a rather obvious way using DMS. At the website, there's a bibliography with papers describing how DMS was used to restructure the architecture of a large product line of military aircraft mission software. This kind of activity literally pours C++ code in one architectural shape into another by applying large numbers of pattern-directed transforms such as the one above.
It is likely to be very easy to implement your instrumentation. And you don't have to wait for it to mature.