What's the easiest way to parse C++ for code generation? - c++

I would like to generate some wrapper code based on C++ types. I basically would like to parse some C++ headers, get the types, classes and their fields defined in the headers, and generate some code based on them.
What would be the easiest way to parse C++ and get type information? I thought about using the Clang C++ parser, but I couldn't make a working hello world in a couple of hours, so I gave up for the time being.
Could you advise any other way to parse C++, or if Clang is the easiest solution, could you point me to a simple getting started guide to be able to parse C++ types with it?
(basically any technology would be ok, C++, Java, C#, etc., this would be part of a command line tool)

Clang is definitely the easiest option. Consider using cindex python bindings, it's pretty straightforward. Alternatively, you could get an older version of clang which still features an xml backend.
EDIT: the link above seems to be down, so here is a link to the google cache of it.
Another link suggested in the comments: http://www.altdevblogaday.com/2014/03/05/implementing-a-code-generator-with-libclang/

Unless your object is to verify correctness, or the code involves advanced template stuff, consider using the XML output of DOxygen or GCC_XML. Alternatively, consider clang, even if that's what you found too complex. Note that for clang it might be best to work in *nix-land.

If your generation tool is in Java, consider using the parser from the Eclipse CDT.
my set of dependencies are:
com.ibm.icu_4.4.2.v20110823.jar
org.eclipse.cdt.core_5.3.2.201202111925.jar
org.eclipse.equinox.common_3.6.0.v20110523.jar
(these are from an old Eclipse version, because I have a dependency on old java class versions), but taking from the latest CDT wil do.
parsing involves:
FileContent reader;
reader = FileContent.createForExternalFileLocation(fullPath);
IScannerInfo info = new ScannerInfo(definedSymbols, includePaths);
return GPPLanguage.getDefault().getASTTranslationUnit(reader, info, FilesProvider.getInstance(), null, 0,log);
This returns an IASTTranslationUnit that can be accessed through a Visitor pattern (ASTVisitor).
I cannot comment on the accuracy of the parsing in corner scenarios, because so far I've been generating code based on simple C++ structure definitions.

Related

Ignore missing headers with clang AST parser

I'm on Windows, using MSVC to compile my project, but I need clang for its neat AST parser, which allow me to write a little code generator.
Problem is, clang cannot parse MSVC headers (a very-well known and understandable problem).
I tried two options :
I include MSVC header folder, parsing the built-in headers included in my code will end-up leading to a fatal error at some point, preventing me from parsing the parts I want correctly.
What I did before is simply not provide any built-in headers and forward declare the types I needed. It worked fine and somehow it doesn't anymore with latest Clang. I don't really know if the parser policy on missing header changed, but it is causing complete failure every time something like <string> is included and not much get parsed.
I am using the python bindings (libclang), but I would consider switching to C/C++ API if there would be a solution there.
Is there anyway I can alter this behavior and make clang continue parsing even when some headers are not found ?
Use SetSuppressIncludeNotFoundError. Took me an hour to find! You can imagine how glad I was to find it!
https://clang.llvm.org/doxygen/classclang_1_1Preprocessor.html#ac7bafe67fc32e41460855b39d20ff6af
One way to ignore the errors due to missing headers is to set SetSuppressIncludeNotFoundError to true in your definition of ASTFrontendAction. An example for the same is given below.
{
public:
virtual std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
clang::CompilerInstance &Compiler, llvm::StringRef InFile)
{
Compiler.getPreprocessor().SetSuppressIncludeNotFoundError(true);
return std::unique_ptr<clang::ASTConsumer>(
new CustomASTConsumer(&Compiler.getASTContext()));
}
};
For a complete example using ASTFrontendAction, please visit at https://clang.llvm.org/docs/RAVFrontendAction.html
So you want to process C++ code that uses MS headers, and you want access to ASTs so that you can generate code. And Clang won't handle MS headers.
So Clang can't be the answer unless it gets a radical upgrade.
You asked for "any solution that can make this work".
Our DMS Software Reengineering Tookit with its C++14 Front End can do this.
DMS provides general parsing, AST construction/inspection/transformation/generation, and inverse parsing (conversion of ASTs back into compilable code), parameterized by language definitions.
The C++ front end provides a full C++14 parser, preprocessor handling, AST construction, and full name and type resolution. It has been tested with GCC and MS VS 2013 header files; we're testing with 2015 header files now.
(It also handles MS VS 2013 syntax, too).
It handles the tough parsing cases completely, including the C++ famous "most vexing parse". You can see parse trees at get human readable AST from c++ code.
DMS does not provide Python bindings, nor a direct C++ interface. Rather, it is a standalone tool designed to support the construction of custom tools (e.g., your "little code generator"). It has its own very extensive set of internal APIs, coded in metaprogramming language PARLANSE, which is LISP-like. Other aspects of DMS are managed by using DSLs for lexers, grammars, and transformations. See below.
A word of caution: any tool that can process C++ is gauranteed to be complex. DMS is correspondingly complex, and it takes a while to learn to use it, so you're not going to get instant answers. The good news here
is that some things are easier to do. Your code generation problem
is likely "read a skeleton file, and then replace key entries in it with problem specific code". If that's the case, a DMS tool with the following code (simplified for presentation here) will likely do the trick:
...
(= myAST (Registry:ParseFile (. filename) (. `CppVisualStudio2013') ...)
(Registry:ApplyTransforms myAST (. `MyTransforms.rsl'))
(Registry:PrettyPrint myAST (concat filename `.modified'))
...
with a transforms file MyTransforms.rsl containing source-to-source surface-syntax (e.g, C++ syntax) transformation rules of the conceptual form
rule rulename if_you_see THIS then replace_by ("-->") THAT
An actual C++ rule might look like (making this up because I don't
know your actual code generation goals)
rule replace_abstraction(s: STRING_LITERAL):
" abstraction_place_holder(\s) "
-> " my_DSL_library(\s,17); "
The ApplyTransforms call above will apply all the rules in this file until none apply any further.
Writing surface syntax transforms, where you can do it, is way easier than making calls on a procedure library (which, like Clang, DMS offers) that hack at the tree.
You can write more complex metaprograms using PARLANSE to apply some rules in one place, other rules someplace else, and you can mix source-to-source transforms with procedural transforms that hack directly at the tree if you want.
If you want more details on what transforms look like, ask and I'll provide a link.

How can I generate metadata output with clang similar to that of gcc-xml?

I found that GCCXML is not being maintained anymore (I think the last version is from 2009 from their CVS repository). People usually suggest to check out clang, but I couldn't find a comprehensive documentation that described how to generate a similar output. Not necessarily XML, but the same information in a parsable (documented, if binary or obscure) format. If there is a way to get the same information from a recent gcc version, that is also fine.
This is for a hobby project for dynamic invocation of C++ code. I know about similar projects (pygccxml, xrtti, openc++), but the point is to make it, for fun.
There used to be a way to print an xml dump with Clang but it was more or less supported and has been removed. There are developers options to get a dump at various stages, but the format is for human consumption, and unstable.
The recommendation for Clang users has always been code integration:
Either directly using Clang from C++, and for example use a RecursiveASTVisitor implementation
Or use libclang from C or C++.
Unlike gcc, clang is conceived as a set of a libraries to be reused, so it does not make much sense to try and write a parser for some clang output: it is just much more error-prone than just consuming the information right at the source.
You could use the Clang based CastXML to generate XML file.
https://github.com/CastXML/CastXML
You could code a GCC plugin, or better yet, a MELT extension, to do that. You'll need a GCC 4.6 or later version (4.7 is coming out soon).
However, extending GCC, either thru a plugin coded in C or better yet with an extension coded in MELT (a domain specific language suited to extend GCC) takes some time, because you need to understand and handle most of main GCC internal representations (Gimple and Tree-s).
If you want to use MELT, I'll be delighted to help you.

Confusion in continuing the Project Related to C Syntax Analyser

My target is to make a program (using C++) which would take C source code as input and check for "SYNTAX ERRORS ONLY".
Now for this, do i need to know about Regular Expressions, Grammar generation and Parsers??
I would like to use tools like Yacc/Flex/Bison - but the problems i am facing are -
How to use these tools? I mean i am only scratching at the surface when i read about these tools - i feel clueless.
How can i use these tools in tandem with my C++ source code?
How "The Hell" do i Get Started with this?
Use somebody else's C parser. For example, the parser used by the clang project. http://clang.llvm.org/
Then you can focus on the other hard part of your problem: detecting errors.
To get started with Yacc and Lex (or the Gnu versions, Bison and Flex) I can recommend Tom Niemann's A Compact Guide to Lex & Yacc.
I also suggest that you have a look of other projects doing the same thing. The are often named with lint in their name, as http://www.splint.org/
It all depends on what kind of errors you want to check.
In any cases you certainly need to learn more about compiler architectures. This book is a reference http://www.cs.princeton.edu/~appel/modern/c/
If you want to work at the syntactic level,
you certainly want to work with lex and Yacc.
This link may help you to get started with a working grammar (though outdated): http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
Less powerfull syntax checking can be done using regular expression. You can do less with regular expression than with an actual parser (see http://en.wikipedia.org/wiki/Chomsky_hierarchy). But it is certainly far more practical.
if you want to perform high level checking. Like "Does this group of function alway take const parameters ?" etc ... You can probably use GCC ability to dump abstract syntax trees (see http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast). Checks other compilers or front-end as well. An abstract tree contains many information you can "check".
If you want to handle compilation errors: related to type checking etc... I can't help you, you probably want to look at other people projects before starting to write your own compiler.
see also:
http://decomp.ulb.ac.be/roelwuyts/playground/canalysistools/
http://wiki.altium.com/display/ADOH/Static+Code+Analysis+-+CERT+C+Secure+Code+Checking
Some people in my previous labs worked on C and C++ analysis and transformation
http://www.lrde.epita.fr/cgi-bin/twiki/view/Transformers/
The project is now in standby, and has proved to be a complex subject even for people used to compiler writting (especially in the case of C++ transformation).
Finally your needs are maybe far simpler than this. Did you think about
FILE *output = popen("gcc -Wall my_c_file.c", "r");
(and then just checking the output of gcc)
How do you define "SYNTAX ERRORS ONLY"? If you just want to know what are the errors, why don't you call external gcc to perform a compilation and report the errors?

Python code to parse and inspect c++

Is there a library for Python that will allow me to parse c++ code?
For example, let's say I want to parse some c++ code and find the names of all classes and their member functions/variables.
I can think of a few ways to hack it together using regular expressions, but if there is an existing library it would be more helpful.
In the past I've used for such purposes gccxml (a C++ parser that emits easily-parseable XML) -- I hacked up my own Python interfaces to it, but now there's a pygccxml which should package that up nicely for you.
Parsing C++ accurately is light-years from something you can do with a regular expression.
You need a full C++ parser, and they're pretty hard to build. I've been involved in building one over several years, and track who is doing it; I don't know of any being attempted in Python.
The one I work on is DMS C++ Front End.
It provides not only parsing, but full name and type resolution. After parsing, you can basically extract detailed information about the code at whatever level of detail you like, including arbittary details about function content.
You might consider using GCCXML, which does contain a parser, and will produce, I believe, the names of all classes, functions, and top-level variables. GCCXML won't give you any information about what's inside a function.
This is a little outside your question's scope perhaps... but depending on what you're trying to achieve, perhaps Exuberant Ctags is worth looking at.
Have not tried, but using the Python bindings from LLVM's Clang parser may work; see here.
How about pyparsing?

Programmatically parse and edit C++ Source Files

I want to programmatically parse and edit C++ source files. I need to change/add code in certain sections of code (i.e. in functions, class blocks, etc). I would also (preferably) be able to get comments as well.
Part of what I want to do can be explained by the following piece of code:
CPlusPlusSourceParser cp = new CPlusPlusSourceParser(“x.cpp”); // Create C++ Source Parser Object
CPlusPlusSourceFunction[] funcs = cp.getFunctions(); // Get all the functions
for (int i = 0; i &lt funcs.length; i++) { // Loop through all functions
funcs[i].append(/* … code I want to append …*/); // Append some code to function
}
cp.save(); // Save new source
cp.close(); // Close file
How can I do that?
I’d like to be able to do this preferably in Java, C++, Perl, Python or C#. However, I am open to other language API’s.
This is similar to AST from C code
If your comfortable with Java antlr can easily parser your code into an abstract syntax tree, and then apply transformation to that tree. A default AST transform is to simply print out the original source.
You can use any parser generator tool to generate a c++ parser for you, but first you have to get the CFG (context free grammar) for C++ , check Antlr
Edit:
Also Antlr supports a lot of target languages
You need a working grammar and parser for C++ which is, however, not too easy as this can't be constructed with most parser generators out there. But once you have a parser you can actually take the abstract syntax tree of the program and alter it in nearly any way you want.
The Mozilla project has a tool that does this.
The Clang static analyzer is now somewhat famous for doing a good job analyzing and rewriting C++. Stroustrup wrote a paper about a research project at Texas A&M, but I don't think it's been released.
A robust C++ parser is available with our DMS Software Reengineering Toolkit. It parses a variety of C++ dialects including ANSI, GNU 3/4, MSVS6 and MSVisual Studio 2005 and managaged C++.
It builds ASTs and symbol tables (the latter is way harder than you might think). You can navigate the ASTs, transform into different valid C++ programs, and regenerate code including comments.
In a C# -- or general .net -- approach, you might be able to get some use out of the C++/CLI CodeDOM provider -- having not used the C++ version of this type, I don't know how well it would handle code that is template heavy.
have a look at the doxygen project, its a open source project, to parse and document several programming languages, C++ included. I believe using this project's lexer will get you more than half the way