I want to build a tool that can automatically transform the Clang AST into an equivalent AST in other languages (MLIR). I have some experiences with LLVM passes, but never directly played with Clang.
I wonder what is the best way to do that. One thing I can think of is to dump the Clang AST into a file and read back in a pass to construct another AST.
Can this be done within Clang itself and output an AST in the transformed language? If so, is there any tutorial I can start with?
Thank you very much.
Related
I am reading the llvm's compiler writing guide:
https://llvm.org/docs/tutorial/LangImpl02.html
In that guide, they are using a simple language called "kaleidoscope" as an example. Before reading that guide, I was under the impression that a single AST is generated for every program (I assume that the program is written on a single file and hence no linking is necessary). But it seems that llvm creates a separate AST for every line (or, to be more precise, for every construct). Hence, for a single program, llvm can create hundreds of separate ast's. Is this interpretation correct?
First of all, note that this chapter doesn't really have much to do with LLVM. It's just explaining how to write a parser and an AST for the language. It does not use any code from the LLVM library¹ and wouldn't look any differently in a project that did not use LLVM at all². The LLVM-specific part only comes later when you translate the AST to LLVM IR. So if anything, it's not that LLVM generates "multiple ASTs", it's that the code from the tutorial generates "multiple ASTs".
So is it accurate to say that code generates multiple ASTs? Kind of - it all depends on what exactly you mean by that.
Like any tree, an AST consists of multiple subtrees. Each subtree is itself a valid tree. So you could say that every non-trival tree is in fact a collection of multiple trees and this would apply to the AST in the tutorial as well.
However it's important to note that all of the subtrees are part of the larger tree. It is not true that the code creates multiple trees that aren't connected to each other if that's what you were thinking.
¹ Other than llvm::make_unique, but that could just as well be replaced with std::make_unique if your compiler supports C++14 or your own implementation if it doesn't.
² On a similar note, it is also perfectly possible to write an LLVM-based compiler by generating the LLVM IR directly in the parser and not create any ASTs at all. Whether and how you generate your ASTs is entirely independent from LLVM.
I have an AST that I have created by parsing source-code using the Clang library. I would like to clone the AST so that I can parse some other code which might be invalid. If it is invalid, I will return to my clone for parsing further source-code snippets. I don't want to have to parse the same source-code twice for performance reasons.
The Clang AST representation is very rich, and some nodes contain pointers to other nodes.
Does Clang provide a utility for easily cloning ASTs that handles this complexity?
I want to use ANTLR 3.5 in a C++ program of mine, but I'm running into trouble with how to actually make use of the parser and lexer that gets generated. Using a grammar similar to the one here, I can do something like SimpleCalcParser.expr(). However, if I want to do something more complex (e.g. parse a language that doesn't just result in a single value, but something more imperative or declarative), it seems quite difficult with the C++ target. As far as I can tell, there is no ability to output an AST or template. Without this, I'm not sure how you can do anything other than just determine if your input parsed correctly or not. Does anyone know how to do this with the C++ target, or would using the C target to generate an AST and using that in C++ be a better option?
some time ago I created some patches for C++ target(see github). There should be AST generation added(but not 100% complete) and also added some tests, which you can use as an example. With current ATLR 3.5 each rule has to return some complex class as a value. And you have to "build" tree manually by using rule actions.
I'm looking to get an AST for C++ that I can then parse with an external program. What programs are out there that are good for generating an AST for C++? I don't care what language it is implemented in or the output format (so long as it is readily parseable).
My overall goal is to transform a C++ unit test bed to its corresponding C# wrapper test bed.
You can use clang and especially libclang to parse C++ code. It's a very high quality, hand written library for lexing, parsing and compiling C++ code but it can also generate an AST.
Clang also supports C, Objective-C and Objective-C++. Clang itself is written in C++.
Actually, GCC will emit the AST at any stage in the pipeline that interests you, including the GENERIC and GIMPLE forms. Check out the (plethora of) command-line switches begining with -fdump- — e.g. -fdump-tree-original-raw
This is one of the easier (…) ways to work, as you can use it on arbitrary code; just pass the appropriate CFLAGS or CXXFLAGS into most Makefiles:
make CXXFLAGS=-fdump-tree-original-raw all
… and you get “the works.”
Updated: Saw this neat little graphing system based on GCC AST's while checking my flag name :-) Google FTW.
http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast
Our C++ Front End, built on top of our DMS Software Reengineering Toolkit can parse a variety of C++ dialects (including C++11 and ObjectiveC) and export that AST as an XML document with a command line switch. See example ASTs produced by this front end.
As a practical matter, you will need more than the AST; you can't really do much with C++ (or any other modern language) without an understanding of the meaning and scope of each identifier. For C++, meaning/scope are particularly ugly. The DMS C++ front end handles all of that; it can build full symbol tables associating identifers to explicit C++ types. That information isn't dumpable in XML with a command line switch, but it is "technically easy" to code logic in DMS to walk the symbol table and spit out XML. (there is an option to dump this information, just not in XML format).
I caution you against the idea of manipulating (or even just analyzing) the XML. First, XSLT isn't a particularly good way to understand the meaning of the ASTs, let alone transform the AST, because the ASTs represent context sensitive language structures (that's why you want [nee MUST HAVE] the symbol table). You can read the XML into a dom-like tree if you like and write your own procedural code to manipulate it. But source-to-source transformations are an easier way; you can write your transformations using C++ notation rather than buckets of code goo climbing over a tree data structure.
You'll have another problem: how to generate valid C++ code from the transformed XML. If you don't mind spitting out raw text, you can solve this problem in purely ad hoc ways, at the price of having no gaurantee other than sweat that generated code is syntactically valid. If you want to generate a C++ representation of your final result as an AST, and regenerate valid text from that, you'll need a prettyprinter, which are not technically hard but still a lot of work to build especially for a language as big as C++.
Finally, the reason that tools like DMS exist is to provide the vast amount of infrastructure it takes to process/manipulate complex structure such as C++ ASTs. (parse, analyse, transform, prettyprint). You can try to replicate all this machinery yourself, but this is usually a poor time/cost/productivity tradeoff. The claim is it is best to stay within the tool ecosystem rather than escape it and build bad versions of it yourself. If you haven't done this before, you'll find this out painfully.
FWIW, DMS has been used to carry out massive analysis and transformations on C++ source code. See Publications on DMS and check the papers by Akers on "Re-engineering C++ Component Models".
Clang is based on the same kind of philosophy; there's an ecosystem of tools.
YMMV, but I'd be surprised.
Is there an easy way of going from llvm ir to working source code?
Specifically, I'd like to start with some simple C++ code that merely modifies PODs (mainly arrays of ints, floats, etc), convert it to llvm ir, perform some simple analysis and translation on it and then convert it back into C++ code?
It don't really mind about any of the names getting mangled, I'd just like to be able to hack about with the source before doing the machine-dependent optimisations.
There are number of options actually. The 2 that you'll probably be interested in are -march=c and -march=cpp, which are options to llc.
Run:
llc -march=c -o code.c code.ll
This will convert the LLVM bitcode in code.ll back to C and put it in code.c.
Also:
llc -march=cpp -o code.cpp code.ll
This is different than the C output engine. It actually will write out C++ code that can be run to reconstruct the IR. I use this personal to embed LLVM IR in a program without having to deal with parsing bitcode files or anything.
-march=cpp has more options you can see with llc --help, such as -cppgen= which controls how much of the IR the output C++ reconstructs.
CppBackend was removed. We have no -march=cpp and -march=c option since 2016-05-05, r268631.
There is an issue here... it might not be possible to easily represent the IR back into the language.
I mean, you'll probably be able to get some representation, but it might be less readable.
The issue is that the IR is not concerned with high-level semantic, and without it...
I'd rather advise you to learn to read the IR. I can read a bit of it without that much effort, and I am far from being a llvm expert.
Otherwise, you can C code from the IR. It won't be much more similar to your C++ code, but you'll perhaps feel better without ssa and phi nodes.