I am reading LLVM's compiler-writing tutorial:
https://llvm.org/docs/tutorial/LangImpl02.html
In that guide, they use a simple language called "Kaleidoscope" as an example. Before reading it, I was under the impression that a single AST is generated for every program (I assume that the program is written in a single file and hence no linking is necessary). But it seems that LLVM creates a separate AST for every line (or, to be more precise, for every construct). Hence, for a single program, LLVM can create hundreds of separate ASTs. Is this interpretation correct?
First of all, note that this chapter doesn't really have much to do with LLVM. It's just explaining how to write a parser and an AST for the language. It does not use any code from the LLVM library¹ and wouldn't look any different in a project that did not use LLVM at all². The LLVM-specific part only comes later, when you translate the AST to LLVM IR. So if anything, it's not that LLVM generates "multiple ASTs", it's that the code from the tutorial generates "multiple ASTs".
So is it accurate to say that code generates multiple ASTs? Kind of - it all depends on what exactly you mean by that.
Like any tree, an AST consists of multiple subtrees. Each subtree is itself a valid tree. So you could say that every non-trivial tree is in fact a collection of multiple trees, and this would apply to the AST in the tutorial as well.
However it's important to note that all of the subtrees are part of the larger tree. It is not true that the code creates multiple trees that aren't connected to each other if that's what you were thinking.
¹ Other than llvm::make_unique, but that could just as well be replaced with std::make_unique if your compiler supports C++14 or your own implementation if it doesn't.
² On a similar note, it is also perfectly possible to write an LLVM-based compiler by generating the LLVM IR directly in the parser and not create any ASTs at all. Whether and how you generate your ASTs is entirely independent from LLVM.
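To make the point about subtrees concrete, here is a minimal sketch of a Kaleidoscope-like AST. The node types are simplified stand-ins loosely modelled on the tutorial's expression classes, and ProgramAST is a hypothetical root added purely for illustration (it is not a class from the tutorial). Each child pointer owns a complete tree of its own, yet everything hangs off one root.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Simplified, hypothetical node types (not the tutorial's exact classes).
struct ExprAST {
    virtual ~ExprAST() = default;
    // Number of nodes in the subtree rooted at this node.
    virtual std::size_t size() const = 0;
};

struct NumberExprAST : ExprAST {
    double value;
    explicit NumberExprAST(double v) : value(v) {}
    std::size_t size() const override { return 1; }
};

struct BinaryExprAST : ExprAST {
    char op;
    std::unique_ptr<ExprAST> lhs, rhs;  // each child is itself a complete tree
    BinaryExprAST(char o, std::unique_ptr<ExprAST> l, std::unique_ptr<ExprAST> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs);  // a binary node always has two operands
    }
    std::size_t size() const override { return 1 + lhs->size() + rhs->size(); }
};

// Hypothetical root that owns every top-level construct of one program.
struct ProgramAST {
    std::vector<std::unique_ptr<ExprAST>> items;
    std::size_t size() const {
        std::size_t n = 1;  // count the root itself
        for (const auto& item : items) n += item->size();
        return n;
    }
};

// Builds the tree for two top-level expressions: "1 + 2" and "3".
ProgramAST buildExample() {
    ProgramAST program;
    program.items.push_back(std::make_unique<BinaryExprAST>(
        '+', std::make_unique<NumberExprAST>(1.0),
        std::make_unique<NumberExprAST>(2.0)));
    program.items.push_back(std::make_unique<NumberExprAST>(3.0));
    return program;
}
```

Calling buildExample() gives a root with two items. The first item is a three-node subtree that could be inspected on its own, but it is still owned by the one program tree.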
I am working on a project for which I need to "combine" code distributed over multiple C++ files into one file. Due to the nature of the project, I only need one entry function (the function that will be defined as the top function in the Xilinx High-Level-Synthesis software -> see context below). The signature of this function needs to be preserved in the transformation. Whether other functions from other files simply get copied into the file and get called as a subroutine or are inlined does not matter. I think due to variable and function scopes simply concatenating the files will not work.
Since I did not write the C++ code myself and it is quite comprehensive, I am looking for a way to do the transformation automatically. The possibilities I am aware of to do this are the following:
Compile the code to LLVM IR with inlining flags and use a C++/C backend to turn the LLVM code into the source language again. This will result in bad source code and require either an old release of Clang and LLVM or another backend like JuliaComputing. 1
The other option would be developing a tool that uses the AST and a library like LibTooling to restructure the code. This option would probably result in better code and put everything into one file without the unnecessary inlining. 2 However, this option seems too complicated just to put all the code into one file.
Hence my question: Are you aware of a better or simply alternative approach to solve this problem?
Context: The project aims to run some of the code on a Xilinx FPGA and the Vitis High-Level-Synthesis tool requires all code that is to be made into a single IP block to be contained in a single file. That is why I need to realise this transformation.
I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has an llvm::Linker class that I can use during compilation to combine source files (llvm::Module). However, I want to guarantee library compatibility between separate compilation runs, either via metadata version numbers or at least via the publicly exposed interface.
Much of the information available on metadata in LLVM suggests that it should only be used for extended information that would not break correctness when silently removed.
llvm
blog
IntrinsicsMetadataAttributes
pdf
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a method in IRReader to parseIRFile so I can load some previously built bc files. I would be curious if it would be reasonable practice to include size and CRC information for comparison when loading these files.
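As a rough illustration of the size-and-CRC idea, here is a minimal sketch that stamps a blob of bytes and later checks it. Nothing here is an LLVM facility: LibraryStamp, makeStamp and matchesStamp are hypothetical names, and the bytes are assumed to be the contents of a previously built file that has already been read into memory. The checksum is the common CRC-32 (the reflected 0xEDB88320 polynomial used by zlib and PNG), computed bit by bit for brevity rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
std::uint32_t crc32(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the dropped bit was 1.
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical record a compiler could store alongside a built library.
struct LibraryStamp {
    std::size_t size;
    std::uint32_t crc;
};

// Computes the stamp for the given file contents.
LibraryStamp makeStamp(const std::string& bytes) {
    return LibraryStamp{bytes.size(), crc32(bytes)};
}

// Returns true when the contents still match a previously recorded stamp.
bool matchesStamp(const std::string& bytes, const LibraryStamp& expected) {
    return bytes.size() == expected.size && crc32(bytes) == expected.crc;
}
```

A stamp like this only detects that the file changed or was corrupted. It says nothing about whether the exported interface is still compatible, so it would complement a version number rather than replace one.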
My language has concepts similar to C# including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (Much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language specific information in the interface without needing to encode it in the IR as both the library and the calling code would be required to build with the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters so there is no predetermined function parameter order. This allows call sites to be more explicit, the compiler to catch erroneous parameter usage, and authors have more liberty in determining default parameters as they are not restricted to the last parameters to the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
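A minimal sketch of what that compile-time mapping could look like, assuming the interface exports each parameter's name and optional default. ParamDecl and bindArguments are hypothetical names, and int stands in for whatever value or expression type the real compiler would use.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical description of one parameter as exported by an interface.
struct ParamDecl {
    std::string name;
    bool hasDefault;
    int defaultValue;  // only meaningful when hasDefault is true
};

// Maps call-site arguments (given by name, in any order) onto declaration
// order, filling in defaults. Throws on unknown or missing parameters.
std::vector<int> bindArguments(const std::vector<ParamDecl>& params,
                               const std::map<std::string, int>& args) {
    // Reject argument names that the declaration does not know about.
    for (const auto& arg : args) {
        bool known = false;
        for (const auto& p : params) {
            if (p.name == arg.first) known = true;
        }
        if (!known) throw std::invalid_argument("unknown parameter: " + arg.first);
    }

    std::vector<int> ordered;
    ordered.reserve(params.size());
    for (const auto& p : params) {
        auto it = args.find(p.name);
        if (it != args.end()) {
            ordered.push_back(it->second);     // explicitly supplied
        } else if (p.hasDefault) {
            ordered.push_back(p.defaultValue); // default may sit anywhere in the list
        } else {
            throw std::invalid_argument("missing argument: " + p.name);
        }
    }
    return ordered;
}
```

With parameters width, height (default 5) and depth, a call that names only depth and width would be lowered to the positional list width, 5, depth. The generated IR then needs no knowledge of names at all, which is why this information has to travel in the interface rather than in the IR.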
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Are version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units" which is meant to be a separately compiled object, but currently what I do is simply drag in the source file and compile it into the main llvm::Module - that's neither efficient nor flexible (can't use the linker to choose between the "Linux" and "Windows" version of some code, for example - not that I think there is 5% chance that my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form. Then you don't depend on LLVM versions changing the IR format, and you can add whatever metadata you like. There won't be much difference between generating LLVM IR from this during your link phase and building the IR at compile time and then reading the IR to figure out whether the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way]
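A very small sketch of that idea, under heavy assumptions: the AST has already been turned into some text payload (how is left open), and the container is just a made-up header line carrying a format version. The "UNIT" magic word, kUnitFormatVersion, writeUnit and readUnit are all hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical format version, bumped whenever the serialized AST layout changes.
constexpr int kUnitFormatVersion = 2;

// Prefixes the serialized AST with a one-line header: "UNIT <version>".
std::string writeUnit(const std::string& serializedAst) {
    std::ostringstream out;
    out << "UNIT " << kUnitFormatVersion << '\n' << serializedAst;
    return out.str();
}

// Validates the header and returns the serialized AST payload.
std::string readUnit(const std::string& fileContents) {
    std::istringstream in(fileContents);
    std::string magic;
    int version = 0;
    if (!(in >> magic >> version) || magic != "UNIT") {
        throw std::runtime_error("not a compiled unit");
    }
    if (version != kUnitFormatVersion) {
        throw std::runtime_error("unit was built with an incompatible format version");
    }
    in.get();  // consume the newline that ends the header
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Because the header is checked before any LLVM code runs, a stale unit can be rejected (or recompiled) without ever touching the IR reader.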
As I said at the start, I'm not sure this is an answer, but these are my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...
Examples found on the web for clang tools are always run on toy examples, which are usually all really trivial C programs.
I am building a tool which performs source-to-source transformations on C++ code, which is obviously a very, very challenging task, but clang is up to this task.
The issue I am facing now is that the AST that clang generates for any C++ code that utilizes the STL is enormous. For example I have some C++ code for which clang++ -ast-dump ... | wc -l is 67,018 lines of horrifying AST gobbledygook!
99% of this is standard library stuff, which I aim to ignore in my source-to-source metaprogramming task. So, to achieve this, I want to simply filter out files. Suppose I want to look at only the class definitions in the headers of the project that I'm analyzing (and ignore everything from the standard library headers): I will need to figure out which header each of my CXXRecordDecls came from!
Can this be done?
Edit: Hopefully this is a way to go about it. Trying this out now... The important bit is that it has to tell me the header that the decls came out of, not the cpp file corresponding to the translation unit.
In my experience so far, the "source" of some given AST node is best retrieved by using Locations. For example every node at least has a start location, and when you print this out it will contain the header file path.
Then it's possible to use this path to decide whether it is a system library or part of your application code that you still are interested in examining.
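The path check itself can be sketched without any Clang code. The helper below is hypothetical and assumes both paths are absolute, already normalised, and use '/' as the separator. Obtaining the path string from a node's location is not shown here. Clang's AST matchers also offer isExpansionInSystemHeader(), which may cover the standard-library case without any string handling.

```cpp
#include <cassert>
#include <string>

// Returns true when `path` lies inside the directory `projectRoot`.
// Hypothetical helper: assumes absolute, normalised paths using '/'.
bool isUnderRoot(const std::string& path, const std::string& projectRoot) {
    if (projectRoot.empty()) return false;
    std::string root = projectRoot;
    // Force a trailing separator so "/proj" does not match "/project2".
    if (root.back() != '/') root += '/';
    return path.compare(0, root.size(), root) == 0;
}
```

A declaration whose start location resolves to a file for which isUnderRoot(...) is false would then simply be skipped by the tool.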
One route I'm looking at is to narrow matches with things like hasName() (as found here). For example:
recordDecl(hasName("MyBaseClass")) // etc.
However, your comment above about using -ast-dump is something I tried as well, to get a lay of the land for my own Clang tool. I found this post to be extremely helpful. Armed with their suggestion, I used clang-check to filter to a specific class name and fed it my top-level CPP file. The output was a much more manageable few hundred lines representing the class declarations and definitions of interest.
Is it theoretically and/or practically possible to compile native c++ to some sort of intermediate language which will then be compiled at run time?
Along the same lines, is "portable" the term used to denote this?
LLVM is a compiler infrastructure. Its C++ front end, Clang, parses C++ code and transforms it into an intermediate language called LLVM IR (IR stands for Intermediate Representation), which looks like a high-level assembly language and is largely machine-independent. Generating IR is the first phase. In the next phase, the IR goes through various optimizers (called passes). The third phase then emits machine code (i.e. machine-dependent code).
It is a module-based design; the output of one phase (module) becomes the input of another. You could save the IR to disk so that the remaining phases can resume later, maybe on an entirely different machine!
So you could generate IR ahead of time and then do the rest at run time. I've not done that myself, but LLVM seems really promising.
Here is the documentation of LLVM IR:
LLVM Language Reference Manual
This topic on Stack Overflow seems interesting, as it says,
LLVM advantages:
JIT - you can compile and run your code dynamically.
And these articles are a good read:
The Design of LLVM (on drdobs.com)
Create a working compiler with the LLVM framework, Part 1
I'm looking to get an AST for C++ that I can then parse with an external program. What programs are out there that are good for generating an AST for C++? I don't care what language it is implemented in or the output format (so long as it is readily parseable).
My overall goal is to transform a C++ unit test bed to its corresponding C# wrapper test bed.
You can use clang and especially libclang to parse C++ code. It's a very high quality, hand written library for lexing, parsing and compiling C++ code but it can also generate an AST.
Clang also supports C, Objective-C and Objective-C++. Clang itself is written in C++.
Actually, GCC will emit the AST at any stage in the pipeline that interests you, including the GENERIC and GIMPLE forms. Check out the (plethora of) command-line switches beginning with -fdump-, e.g. -fdump-tree-original-raw
This is one of the easier (…) ways to work, as you can use it on arbitrary code; just pass the appropriate CFLAGS or CXXFLAGS into most Makefiles:
make CXXFLAGS=-fdump-tree-original-raw all
… and you get “the works.”
Updated: Saw this neat little graphing system based on GCC AST's while checking my flag name :-) Google FTW.
http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast
Our C++ Front End, built on top of our DMS Software Reengineering Toolkit can parse a variety of C++ dialects (including C++11 and ObjectiveC) and export that AST as an XML document with a command line switch. See example ASTs produced by this front end.
As a practical matter, you will need more than the AST; you can't really do much with C++ (or any other modern language) without an understanding of the meaning and scope of each identifier. For C++, meaning/scope are particularly ugly. The DMS C++ front end handles all of that; it can build full symbol tables associating identifiers to explicit C++ types. That information isn't dumpable in XML with a command line switch, but it is "technically easy" to code logic in DMS to walk the symbol table and spit out XML. (There is an option to dump this information, just not in XML format.)
I caution you against the idea of manipulating (or even just analyzing) the XML. First, XSLT isn't a particularly good way to understand the meaning of the ASTs, let alone transform the AST, because the ASTs represent context sensitive language structures (that's why you want [nee MUST HAVE] the symbol table). You can read the XML into a dom-like tree if you like and write your own procedural code to manipulate it. But source-to-source transformations are an easier way; you can write your transformations using C++ notation rather than buckets of code goo climbing over a tree data structure.
You'll have another problem: how to generate valid C++ code from the transformed XML. If you don't mind spitting out raw text, you can solve this problem in purely ad hoc ways, at the price of having no guarantee other than sweat that the generated code is syntactically valid. If you want to generate a C++ representation of your final result as an AST, and regenerate valid text from that, you'll need a prettyprinter, which is not technically hard but still a lot of work to build, especially for a language as big as C++.
Finally, the reason that tools like DMS exist is to provide the vast amount of infrastructure it takes to process/manipulate complex structure such as C++ ASTs. (parse, analyse, transform, prettyprint). You can try to replicate all this machinery yourself, but this is usually a poor time/cost/productivity tradeoff. The claim is it is best to stay within the tool ecosystem rather than escape it and build bad versions of it yourself. If you haven't done this before, you'll find this out painfully.
FWIW, DMS has been used to carry out massive analysis and transformations on C++ source code. See Publications on DMS and check the papers by Akers on "Re-engineering C++ Component Models".
Clang is based on the same kind of philosophy; there's an ecosystem of tools.
YMMV, but I'd be surprised.