Is it possible to clone an AST in Clang? - c++

I have an AST that I have created by parsing source-code using the Clang library. I would like to clone the AST so that I can parse some other code which might be invalid. If it is invalid, I will return to my clone for parsing further source-code snippets. I don't want to have to parse the same source-code twice for performance reasons.
The Clang AST representation is very rich, and some nodes contain pointers to other nodes.
Does Clang provide a utility for easily cloning ASTs that handles this complexity?

Related

Transforming the Clang AST into an AST in other languages

I want to build a tool that can automatically transform the Clang AST into an equivalent AST in another representation (MLIR). I have some experience with LLVM passes, but I have never worked directly with Clang.
I wonder what the best way to do that is. One thing I can think of is to dump the Clang AST into a file and read it back in a pass to construct another AST.
Can this be done within Clang itself and output an AST in the transformed language? If so, is there any tutorial I can start with?
Thank you very much.

How many ASTs does an LLVM program generate?

I am reading LLVM's compiler-writing guide:
https://llvm.org/docs/tutorial/LangImpl02.html
In that guide, they use a simple language called "Kaleidoscope" as an example. Before reading that guide, I was under the impression that a single AST is generated for every program (I assume that the program is written in a single file and hence no linking is necessary). But it seems that LLVM creates a separate AST for every line (or, to be more precise, for every construct). Hence, for a single program, LLVM can create hundreds of separate ASTs. Is this interpretation correct?
First of all, note that this chapter doesn't really have much to do with LLVM. It's just explaining how to write a parser and an AST for the language. It does not use any code from the LLVM library¹ and wouldn't look any different in a project that did not use LLVM at all². The LLVM-specific part only comes later when you translate the AST to LLVM IR. So if anything, it's not that LLVM generates "multiple ASTs", it's that the code from the tutorial generates "multiple ASTs".
So is it accurate to say that the code generates multiple ASTs? Kind of - it all depends on what exactly you mean by that.
Like any tree, an AST consists of multiple subtrees. Each subtree is itself a valid tree. So you could say that every non-trivial tree is in fact a collection of multiple trees, and this would apply to the AST in the tutorial as well.
However it's important to note that all of the subtrees are part of the larger tree. It is not true that the code creates multiple trees that aren't connected to each other if that's what you were thinking.
¹ Other than llvm::make_unique, but that could just as well be replaced with std::make_unique if your compiler supports C++14 or your own implementation if it doesn't.
² On a similar note, it is also perfectly possible to write an LLVM-based compiler by generating the LLVM IR directly in the parser and not create any ASTs at all. Whether and how you generate your ASTs is entirely independent from LLVM.
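To make the "every subtree is itself a tree" point concrete, here is a minimal sketch. The node types (Expr, NumberExpr, BinaryExpr) and the buildExample helper are made up for illustration and are only loosely modelled on the tutorial's expression classes; they are not the tutorial's code.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical node types, loosely modelled on the tutorial's expression AST.
struct Expr {
    virtual ~Expr() = default;
    // Number of nodes in the tree rooted at this node.
    virtual std::size_t size() const = 0;
};

struct NumberExpr : Expr {
    double value;
    explicit NumberExpr(double v) : value(v) {}
    std::size_t size() const override { return 1; }
};

struct BinaryExpr : Expr {
    char op;
    std::unique_ptr<Expr> lhs, rhs;
    BinaryExpr(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    std::size_t size() const override { return 1 + lhs->size() + rhs->size(); }
};

// Builds the tree for (1 + 2) * 3. The left operand "1 + 2" is a complete
// tree in its own right, and it is also a subtree of the larger tree.
std::unique_ptr<BinaryExpr> buildExample() {
    auto sum = std::make_unique<BinaryExpr>(
        '+', std::make_unique<NumberExpr>(1), std::make_unique<NumberExpr>(2));
    return std::make_unique<BinaryExpr>(
        '*', std::move(sum), std::make_unique<NumberExpr>(3));
}
```

Calling size() on the root counts every node of the whole expression, while calling it on root->lhs counts only the "1 + 2" subtree. Both calls go through the same interface, which is the sense in which one tree is also a collection of trees.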

Getting the Lua syntax tree with liblua

Is it possible to get the syntax tree with liblua?
I need the AST of Lua code, but I can't depend on ANTLR4, so I'm looking for a self-contained solution. Since my host app already embeds Lua, liblua would be perfect.
If not with liblua, what other options are there for parsing Lua in C++?
The built-in Lua parser is a one-pass compiler which directly generates Lua VM instructions. No AST is created, and it would be more complicated to decompile the VM code into an AST than to create a parser using whatever parser generator you feel comfortable with.
For example, Bison produces perfectly acceptable C++ code without requiring any run-time. It has a C++ API but it is also possible to use the C template and compile the result as a C++ program.
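To illustrate what "one-pass compiler, no AST" means, here is a toy sketch. It is not Lua's actual implementation: the OnePassCompiler class, the tiny arithmetic grammar and the PUSH/ADD/MUL instruction names are all invented for this example. The point is only that instructions are appended while parsing, so there is never a tree to hand back to the caller.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy one-pass compiler for a made-up stack machine.
// Grammar: expr   := term ('+' term)*
//          term   := factor ('*' factor)*
//          factor := digits | '(' expr ')'
class OnePassCompiler {
public:
    explicit OnePassCompiler(std::string src) : src_(std::move(src)) {}

    // Parses the whole input and returns the emitted instructions.
    std::vector<std::string> compile() {
        expr();
        skipSpaces();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected trailing input");
        return code_;
    }

private:
    void skipSpaces() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    // Consumes c if it is the next non-space character.
    bool accept(char c) {
        skipSpaces();
        if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    void expr() {
        term();
        // The operator is emitted right after its operands; nothing is stored.
        while (accept('+')) { term(); code_.push_back("ADD"); }
    }
    void term() {
        factor();
        while (accept('*')) { factor(); code_.push_back("MUL"); }
    }
    void factor() {
        if (accept('(')) {
            expr();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return;
        }
        skipSpaces();
        std::size_t start = pos_;
        while (pos_ < src_.size() &&
               std::isdigit(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        if (start == pos_) throw std::runtime_error("expected a number");
        code_.push_back("PUSH " + src_.substr(start, pos_ - start));
    }

    std::string src_;
    std::size_t pos_ = 0;
    std::vector<std::string> code_;
};
```

For the input "1 + 2 * 3" this emits PUSH 1, PUSH 2, PUSH 3, MUL, ADD. Once compile() returns, the structure of the expression survives only implicitly in the instruction order, which is why recovering an AST from such output is more work than writing a separate parser that builds one.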

Ignore missing headers with clang AST parser

I'm on Windows, using MSVC to compile my project, but I need Clang for its neat AST parser, which allows me to write a little code generator.
The problem is that Clang cannot parse the MSVC headers (a well-known and understandable problem).
I tried two options:
Including the MSVC header folder. Parsing the built-in headers included by my code eventually hits a fatal error, which prevents the parts I care about from being parsed correctly.
Not providing any built-in headers and forward-declaring the types I need, which is what I did before. It worked fine, but it no longer does with the latest Clang. I don't know whether the parser's policy on missing headers changed, but parsing now fails completely whenever something like <string> is included, and not much gets parsed.
I am using the Python bindings (libclang), but I would consider switching to the C/C++ API if there is a solution there.
Is there any way I can alter this behavior and make Clang continue parsing even when some headers are not found?
Use SetSuppressIncludeNotFoundError. Took me an hour to find! You can imagine how glad I was to find it!
https://clang.llvm.org/doxygen/classclang_1_1Preprocessor.html#ac7bafe67fc32e41460855b39d20ff6af
One way to ignore the errors due to missing headers is to call SetSuppressIncludeNotFoundError(true) on the preprocessor in your ASTFrontendAction subclass. An example is given below.
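#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Lex/Preprocessor.h"

// The class name is illustrative; CustomASTConsumer is your own ASTConsumer subclass.
class CustomFrontendAction : public clang::ASTFrontendAction
{
public:
    std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
        clang::CompilerInstance &Compiler, llvm::StringRef InFile) override
    {
        // Do not report an error when an #include file cannot be found.
        Compiler.getPreprocessor().SetSuppressIncludeNotFoundError(true);
        return std::unique_ptr<clang::ASTConsumer>(
            new CustomASTConsumer(&Compiler.getASTContext()));
    }
};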
For a complete example using ASTFrontendAction, see https://clang.llvm.org/docs/RAVFrontendAction.html
So you want to process C++ code that uses MS headers, and you want access to ASTs so that you can generate code. And Clang won't handle MS headers.
So Clang can't be the answer unless it gets a radical upgrade.
You asked for "any solution that can make this work".
Our DMS Software Reengineering Toolkit with its C++14 Front End can do this.
DMS provides general parsing, AST construction/inspection/transformation/generation, and inverse parsing (conversion of ASTs back into compilable code), parameterized by language definitions.
The C++ front end provides a full C++14 parser, preprocessor handling, AST construction, and full name and type resolution. It has been tested with GCC and MS VS 2013 header files; we're testing with 2015 header files now.
(It also handles MS VS 2013 syntax, too).
It handles the tough parsing cases completely, including C++'s famous "most vexing parse". You can see parse trees at "get human readable AST from c++ code".
DMS does not provide Python bindings, nor a direct C++ interface. Rather, it is a standalone tool designed to support the construction of custom tools (e.g., your "little code generator"). It has its own very extensive set of internal APIs, coded in metaprogramming language PARLANSE, which is LISP-like. Other aspects of DMS are managed by using DSLs for lexers, grammars, and transformations. See below.
A word of caution: any tool that can process C++ is guaranteed to be complex. DMS is correspondingly complex, and it takes a while to learn to use it, so you're not going to get instant answers. The good news here is that some things are easier to do. Your code generation problem is likely "read a skeleton file, and then replace key entries in it with problem-specific code". If that's the case, a DMS tool with the following code (simplified for presentation here) will likely do the trick:
...
(= myAST (Registry:ParseFile (. filename) (. `CppVisualStudio2013') ...)
(Registry:ApplyTransforms myAST (. `MyTransforms.rsl'))
(Registry:PrettyPrint myAST (concat filename `.modified'))
...
with a transforms file MyTransforms.rsl containing source-to-source surface-syntax (e.g., C++ syntax) transformation rules of the conceptual form
rule rulename if_you_see THIS then replace_by ("-->") THAT
An actual C++ rule might look like this (I'm making this up because I don't know your actual code generation goals):
rule replace_abstraction(s: STRING_LITERAL):
" abstraction_place_holder(\s) "
-> " my_DSL_library(\s,17); "
The ApplyTransforms call above will apply all the rules in this file until none apply any further.
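For readers who want to see the "apply rules until none apply" idea in ordinary code, here is a rough C++ sketch (it needs C++17 for the recursive Node type). It is not DMS or PARLANSE and says nothing about how DMS is implemented: the Node struct, the Rule type and the function names are all hypothetical, and the single rule is a hand-coded analogue of the made-up replace_abstraction rule above.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical minimal tree: a call node has a name and argument children;
// a leaf has a name and no children.
struct Node {
    std::string name;
    std::vector<Node> children;
};

// A rule tries to rewrite one node in place and reports whether it changed it.
using Rule = std::function<bool(Node&)>;

// Applies every rule once to every node (pre-order).
// Returns true if any rule changed anything.
bool applyOnce(Node& node, const std::vector<Rule>& rules) {
    bool changed = false;
    for (const Rule& rule : rules) changed = rule(node) || changed;
    for (Node& child : node.children) changed = applyOnce(child, rules) || changed;
    return changed;
}

// Repeats whole-tree passes until no rule applies any further.
// Returns the number of passes that changed something.
// Note: a rule set that never stops matching would loop forever.
int applyUntilFixpoint(Node& root, const std::vector<Rule>& rules) {
    int passes = 0;
    while (applyOnce(root, rules)) ++passes;
    return passes;
}

// Hand-coded analogue of: abstraction_place_holder(s) -> my_DSL_library(s, 17).
bool replaceAbstraction(Node& n) {
    if (n.name != "abstraction_place_holder" || n.children.size() != 1) return false;
    n.name = "my_DSL_library";
    n.children.push_back(Node{"17", {}});
    return true;
}
```

Even for this one trivial rule, the procedural version has to spell out the match condition and the tree surgery by hand, whereas the surface-syntax rule states the same thing in two lines of C++-looking text.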
Writing surface syntax transforms, where you can do it, is way easier than making calls on a procedure library (which, like Clang, DMS offers) that hack at the tree.
You can write more complex metaprograms using PARLANSE to apply some rules in one place, other rules someplace else, and you can mix source-to-source transforms with procedural transforms that hack directly at the tree if you want.
If you want more details on what transforms look like, ask and I'll provide a link.

Getting AST for C++?

I'm looking to get an AST for C++ that I can then parse with an external program. What programs are out there that are good for generating an AST for C++? I don't care what language it is implemented in or the output format (so long as it is readily parseable).
My overall goal is to transform a C++ unit test bed to its corresponding C# wrapper test bed.
You can use Clang, and especially libclang, to parse C++ code. It's a very high-quality, hand-written library for lexing, parsing and compiling C++ code, and it can also give you the AST.
Clang also supports C, Objective-C and Objective-C++. Clang itself is written in C++.
Actually, GCC will emit the AST at any stage in the pipeline that interests you, including the GENERIC and GIMPLE forms. Check out the (plethora of) command-line switches beginning with -fdump-, e.g. -fdump-tree-original-raw
This is one of the easier (…) ways to work, as you can use it on arbitrary code; just pass the appropriate CFLAGS or CXXFLAGS into most Makefiles:
make CXXFLAGS=-fdump-tree-original-raw all
… and you get “the works.”
Updated: Saw this neat little graphing system based on GCC ASTs while checking my flag name :-) Google FTW.
http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast
Our C++ Front End, built on top of our DMS Software Reengineering Toolkit can parse a variety of C++ dialects (including C++11 and ObjectiveC) and export that AST as an XML document with a command line switch. See example ASTs produced by this front end.
As a practical matter, you will need more than the AST; you can't really do much with C++ (or any other modern language) without an understanding of the meaning and scope of each identifier. For C++, meaning/scope are particularly ugly. The DMS C++ front end handles all of that; it can build full symbol tables associating identifiers with explicit C++ types. That information isn't dumpable in XML with a command line switch, but it is "technically easy" to code logic in DMS to walk the symbol table and spit out XML. (There is an option to dump this information, just not in XML format.)
I caution you against the idea of manipulating (or even just analyzing) the XML. First, XSLT isn't a particularly good way to understand the meaning of the ASTs, let alone transform the AST, because the ASTs represent context sensitive language structures (that's why you want [nee MUST HAVE] the symbol table). You can read the XML into a dom-like tree if you like and write your own procedural code to manipulate it. But source-to-source transformations are an easier way; you can write your transformations using C++ notation rather than buckets of code goo climbing over a tree data structure.
You'll have another problem: how to generate valid C++ code from the transformed XML. If you don't mind spitting out raw text, you can solve this problem in purely ad hoc ways, at the price of having no guarantee other than sweat that the generated code is syntactically valid. If you want to generate a C++ representation of your final result as an AST, and regenerate valid text from that, you'll need a prettyprinter, which is not technically hard but is still a lot of work to build, especially for a language as big as C++.
Finally, the reason that tools like DMS exist is to provide the vast amount of infrastructure it takes to process/manipulate complex structures such as C++ ASTs (parse, analyze, transform, prettyprint). You can try to replicate all this machinery yourself, but this is usually a poor time/cost/productivity tradeoff. The claim is that it is best to stay within the tool ecosystem rather than escape it and build bad versions of it yourself. If you haven't done this before, you'll find this out painfully.
FWIW, DMS has been used to carry out massive analysis and transformations on C++ source code. See Publications on DMS and check the papers by Akers on "Re-engineering C++ Component Models".
Clang is based on the same kind of philosophy; there's an ecosystem of tools.
YMMV, but I'd be surprised.