Python code to parse and inspect c++

Is there a library for Python that will allow me to parse c++ code?
For example, let's say I want to parse some c++ code and find the names of all classes and their member functions/variables.
I can think of a few ways to hack it together using regular expressions, but if there is an existing library it would be more helpful.

In the past I've used gccxml (a C++ parser that emits easily parseable XML) for such purposes. I hacked up my own Python interfaces to it, but now there's pygccxml, which should package that up nicely for you.

Parsing C++ accurately is light-years from something you can do with a regular expression.
You need a full C++ parser, and they're pretty hard to build. I've been involved in building one over several years, and track who is doing it; I don't know of any being attempted in Python.
The one I work on is DMS C++ Front End.
It provides not only parsing, but full name and type resolution. After parsing, you can basically extract detailed information about the code at whatever level of detail you like, including arbitrary details about function content.
You might consider using GCCXML, which does contain a parser, and will produce, I believe, the names of all classes, functions, and top-level variables. GCCXML won't give you any information about what's inside a function.

This is a little outside your question's scope perhaps... but depending on what you're trying to achieve, perhaps Exuberant Ctags is worth looking at.

Have not tried, but using the Python bindings from LLVM's Clang parser may work; see here.
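For what it's worth, here is a minimal sketch of what that might look like with the clang.cindex module from Clang's Python bindings. It assumes libclang is installed and discoverable on your machine; collect_classes and classes_in_file are names made up for this example, and the compiler flags are only a guess at a sensible default.

```python
def collect_classes(cursor, found=None):
    """Walk a libclang cursor tree and map each class/struct name to its members."""
    if found is None:
        found = {}
    for child in cursor.get_children():
        kind = child.kind.name  # e.g. "CLASS_DECL", "CXX_METHOD", "FIELD_DECL"
        if kind in ("CLASS_DECL", "STRUCT_DECL"):
            members = found.setdefault(child.spelling, {"methods": [], "fields": []})
            for member in child.get_children():
                if member.kind.name == "CXX_METHOD":
                    members["methods"].append(member.spelling)
                elif member.kind.name == "FIELD_DECL":
                    members["fields"].append(member.spelling)
        # Recurse so classes nested in namespaces or other classes are found too.
        collect_classes(child, found)
    return found


def classes_in_file(path, clang_args=("-x", "c++", "-std=c++11")):
    """Parse one source file with libclang and return its classes (hypothetical helper)."""
    # Imported lazily so collect_classes can be exercised without libclang installed.
    import clang.cindex

    index = clang.cindex.Index.create()
    translation_unit = index.parse(path, args=list(clang_args))
    return collect_classes(translation_unit.cursor)
```

Note that the walk covers the whole translation unit, so classes pulled in through #include show up as well; compare child.location.file against your own path if you only want the ones declared in the file itself. Constructors and destructors have their own cursor kinds (CONSTRUCTOR, DESTRUCTOR) and are not collected here.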

How about pyparsing?

Related

What's the easiest way to parse C++ for code generation?

I would like to generate some wrapper code based on C++ types. I basically would like to parse some C++ headers, get the types, classes and their fields defined in the headers, and generate some code based on them.
What would be the easiest way to parse C++ and get type information? I thought about using the Clang C++ parser, but I couldn't make a working hello world in a couple of hours, so I gave up for the time being.
Could you advise any other way to parse C++, or if Clang is the easiest solution, could you point me to a simple getting started guide to be able to parse C++ types with it?
(basically any technology would be ok, C++, Java, C#, etc., this would be part of a command line tool)
Clang is definitely the easiest option. Consider using cindex python bindings, it's pretty straightforward. Alternatively, you could get an older version of clang which still features an xml backend.
EDIT: the link above seems to be down, so here is a link to the google cache of it.
Another link suggested in the comments: http://www.altdevblogaday.com/2014/03/05/implementing-a-code-generator-with-libclang/
Unless your objective is to verify correctness, or the code involves advanced template stuff, consider using the XML output of Doxygen or GCC_XML. Alternatively, consider clang, even if that's what you found too complex. Note that for clang it might be best to work in *nix-land.
If your generation tool is in Java, consider using the parser from the Eclipse CDT.
my set of dependencies are:
com.ibm.icu_4.4.2.v20110823.jar
org.eclipse.cdt.core_5.3.2.201202111925.jar
org.eclipse.equinox.common_3.6.0.v20110523.jar
(these are from an old Eclipse version, because I have a dependency on old Java class versions), but taking them from the latest CDT will do.
parsing involves:
FileContent reader = FileContent.createForExternalFileLocation(fullPath);
IScannerInfo info = new ScannerInfo(definedSymbols, includePaths);
return GPPLanguage.getDefault().getASTTranslationUnit(reader, info, FilesProvider.getInstance(), null, 0, log);
This returns an IASTTranslationUnit that can be accessed through a Visitor pattern (ASTVisitor).
I cannot comment on the accuracy of the parsing in corner scenarios, because so far I've been generating code based on simple C++ structure definitions.

Extracting C functions signatures for Erlang

I was wondering what the most suitable tool is for extracting c (and eventually c++) function names, arguments and their types from source code. I would like to use a tool that can be automated as much as possible.
I want to extract the signature types then have the output read by another program written in Erlang. My goal is to use this information to construct the equivalent of the C types/signatures in Erlang (using ei). I would like this process to work for any C file so I can't use any hardcoded stuff.
I have found a lot of tools that look promising like CLang, CScope and ANTLR and so on but I don't know if any of them will work or if there is a better approach out there.
Thanks.
Surely there is something better, but if you don't find it:
gcc -aux-info output demo.c
sed '/include/d' output
This extracts the function declarations from the source code, skipping the standard functions.
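If the result is going to be fed to another program anyway, the -aux-info lines can be turned into structured records first. The sketch below is only my reading of the format described in the GCC manual (a leading comment with file, line, a letter I/N/O for implicit/prototyped/unprototyped, and a letter C/F for declaration/definition); the sample lines are hand-written, and parse_aux_info is a made-up name, so check it against what your GCC actually emits.

```python
import re

# Hand-written sample in the layout I assume gcc -aux-info produces.
SAMPLE_AUX_INFO = (
    "/* /usr/include/stdio.h:332:NC */ extern int printf (const char *, ...);\n"
    "/* demo.c:3:NC */ extern int add (int, int);\n"
    "/* demo.c:5:NF */ extern int add (int a, int b); /* (a, b) int a; int b; */\n"
)

AUX_LINE = re.compile(
    r"^/\*\s*(?P<file>.+?):(?P<line>\d+):(?P<style>[INO])(?P<origin>[CF])\s*\*/\s*"
    r"(?P<decl>[^;]*;)"
)


def parse_aux_info(text, exclude="include"):
    """Return one dict per declaration line, skipping lines that mention `exclude`."""
    records = []
    for raw in text.splitlines():
        if exclude and exclude in raw:
            continue  # same effect as: sed '/include/d'
        match = AUX_LINE.match(raw.strip())
        if match is None:
            continue  # not a declaration line in the assumed layout
        records.append({
            "file": match.group("file"),
            "line": int(match.group("line")),
            "prototyped": match.group("style") == "N",
            "is_definition": match.group("origin") == "F",
            "declaration": match.group("decl").strip(),
        })
    return records


if __name__ == "__main__":
    for record in parse_aux_info(SAMPLE_AUX_INFO):
        print(record)
```

From there it is a small step to dump the records as JSON or as Erlang terms for the consuming program; splitting the declaration string into a return type and argument types is left out on purpose, since that is where a real parser starts to matter.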

Parsing c++ function headers from a file using GNU toolchain

I need to parse function headers from a .i file used by SWIG which contains all sorts of garbage beside the function headers. (final output would be a list of function declarations)
The best option for me would be using the GNU toolchain (GCC, Binutils, etc.) to do so, but I might be missing an easy way of doing it with SWIG. If I am, please tell me!
Thanks :]
edit: I also don't know how to do that with GCC toolchain, if you have an idea it will be great.
I would try getting an XML dump of the abstract syntax tree either from clang or from gccxml. From there you can easily extract the function declarations you are interested in.
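As a rough illustration of the extraction step, here is a sketch that reads a gccxml-style dump with Python's xml.etree.ElementTree and prints function signatures. The sample XML is hand-written to resemble gccxml output (real dumps carry many more attributes and type kinds), and function_signatures is a name invented for this example; only FundamentalType and PointerType are resolved.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the shape of a gccxml dump.
SAMPLE_XML = """<GCC_XML>
  <Function id="_1" name="add" returns="_3">
    <Argument name="a" type="_3"/>
    <Argument name="b" type="_3"/>
  </Function>
  <Function id="_2" name="greet" returns="_4">
    <Argument name="who" type="_6"/>
  </Function>
  <FundamentalType id="_3" name="int"/>
  <FundamentalType id="_4" name="void"/>
  <FundamentalType id="_5" name="char"/>
  <PointerType id="_6" type="_5"/>
</GCC_XML>"""


def function_signatures(xml_text):
    """Return 'ret name(arg types)' strings for every Function element."""
    root = ET.fromstring(xml_text)
    # Types are referenced by id, so index every top-level element by its id.
    by_id = {el.get("id"): el for el in root if el.get("id")}

    def type_name(type_id):
        el = by_id.get(type_id)
        if el is None:
            return "?"  # unknown or unsupported reference
        if el.tag == "PointerType":
            return type_name(el.get("type")) + " *"
        return el.get("name", "?")

    signatures = []
    for fn in root.iter("Function"):
        args = [type_name(arg.get("type")) for arg in fn.findall("Argument")]
        signatures.append(
            "%s %s(%s)" % (type_name(fn.get("returns")), fn.get("name"), ", ".join(args))
        )
    return signatures


if __name__ == "__main__":
    for signature in function_signatures(SAMPLE_XML):
        print(signature)
```

Anything the sketch does not recognize comes out as "?", which at least makes the unsupported cases easy to spot when you run it on a real dump.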
Our DMS Software Reengineering Toolkit provides general purpose program parsing, analysis, and transformation capability. It has front ends for a wide variety of languages, including C++.
It has been used to analyze and transform very complex C++ programs and their header files.
You aren't clear as to what you will do after you "parse the function headers"; normally people want to extract some information or produce another artifact. DMS with its C++ front end can do the parsing; you can configure DMS to do the custom stuff.
As a practical matter, this isn't usually an afternoon's exercise; DMS is a complex beast, because it has to deal with complex beasts such as C++. And I'd expect you to face the same kind of complexity for any tool that can handle C++. The GCC toolchain can clearly handle C++, so you might be able to do it with that (at that same level of complexity), but GCC is designed to be a compiler, and IMHO you will find it a fight to get it to do what you want.
Your "output function declarations" goal isn't clear. You want just the function names? You want a function signature? You want all the type declarations on which the function depends? You want all the type declarations on which the function depends, if they are not already present in an existing include file you intend to use?
The best way to extract function decls from the garbage which is C header files is to substitute out what constitutes the most smelly garbage: macros. You can do that with:
cpp - The C Preprocessor

How to extract ALL typedefs and structs and unions from c++ source

I have inherited a Visual Studio project that contains hundreds of files.
I would like to extract all the typedefs, structs and unions from each .h/.cpp file and put the results in a file.
Each typedef/struct/union should be on one line in the results file. This would make sorting much easier.
typedef int myType;
struct myFirstStruct { char a; int b;...};
union Part_Number_Serial_Number_Part_2_Response_Message_Type {struct{Message_Response_Head_Type Head; Part_Num_Serial_Num_Part_2_Report_Array Part_2_Report; Message_Tail_Type Tail;} Data; BYTE byData[140];}myUnion;
struct { bool c; int d;...}mySecondStruct;
My problem is, I do not know what to look for (grammar of typedef/structs/unions) using a regular expression.
I cannot believe that nobody has done this before (I googled and have not found anything on this).
Does anyone know the regular expressions for these? (Note some are commented out using // others /* */)
Or a tool to accomplish this.
Edit:
I am toying with the idea of autogenerating source code and/or dialogs for modifying messages that use the underlying typedef/struct/union. I was going to use the output to generate an XML file that could be used for this reason.
The source for these are in C/C++ and used in almost all my projects. These projects are usually NOT in C/C++. By using the XML version I would only need to update/add the typedef/struct/union only in one place and all the projects would be able to autogen the source and/or dialogs.
I can't imagine a purpose for this except for some sort of documentation effort. If that is what you're looking for I would suggest doxygen.
To answer your question, I seriously doubt any amount of regular expressions will be sufficient. What you need to do is actually parse the code. I have heard of a library out there for building compilers and C++ tools that would provide the parsing aspect but I'm sorry to say I have forgotten the name. I know it's out there though so I'd start searching for that.
You will NOT be able to accomplish this with a regular expression. The only way to actually do this will be to get hold of a lexer and parser for the C++ grammar and write the code yourself to dump the interesting bits to a file or database upon encountering one of the structures you're interested in. And unfortunately, C++ parsing is rather hard.
Parsing C++ is ... difficult. Instead of killing yourself trying to parse it, there are a few options:
gcc-xml
cscope
ctags
global
Each of these will parse C++ code and grab the info you're after. If you want to dump it to a file in the format you requested, you'd be a lot better off parsing their data files than parsing raw C++.
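To give an idea of how little code that takes, here is a sketch that reads a ctags "tags" file in the extended format and keeps only typedefs, structs and unions. It assumes the usual tab-separated layout (name, file, search pattern, then ;" followed by a kind letter and optional key:value fields) and the C/C++ kind letters t, s and u; the sample lines are hand-written and parse_tags is a made-up name.

```python
# Hand-written sample in the extended tags format (fields are tab-separated).
SAMPLE_TAGS = "\n".join([
    "!_TAG_FILE_FORMAT\t2\t/extended format/",
    'MyType\tmsg.h\t/^typedef int MyType;$/;"\tt',
    'Head\tmsg.h\t/^    int Head;$/;"\tm\tstruct:Message',
    'Message\tmsg.h\t/^struct Message {$/;"\ts',
    'send\tmsg.cpp\t/^void send(Message m)$/;"\tf',
])


def parse_tags(text, kinds=("t", "s", "u")):
    """Return tag entries whose kind letter is in `kinds` (typedef/struct/union by default)."""
    entries = []
    for line in text.splitlines():
        if not line or line.startswith("!_TAG_"):
            continue  # blank line or tags-file metadata
        head, sep, tail = line.partition(';"\t')
        if not sep:
            continue  # plain tag line without the extension fields
        parts = head.split("\t", 2)
        if len(parts) != 3:
            continue  # malformed line
        name, path, address = parts
        fields = tail.split("\t")
        kind = fields[0]
        if kind.startswith("kind:"):
            kind = kind[len("kind:"):]
        if kind in kinds:
            extra = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
            entries.append({
                "name": name,
                "file": path,
                "address": address,
                "kind": kind,
                "fields": extra,
            })
    return entries


if __name__ == "__main__":
    for entry in sorted(parse_tags(SAMPLE_TAGS), key=lambda e: e["name"]):
        print(entry["kind"], entry["name"], entry["file"])
```

This only gives you the name and location of each definition, not its full body on one line; for that you would still have to go back to the source file at the recorded address, or use one of the XML-producing tools instead.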
I recommend you skip all of this and just use doxygen. It won't be in your preferred format but you'll be better off getting used to doxygen's layout.

Making a C++ app scriptable

I have several functions in my program that look like this:
void foo(int x, int y)
Now I want my program to take a string that looks like:
foo(3, 5)
And execute the corresponding function. What's the most straightforward way to implement this?
When I say straightforward, I mean reasonably extensible and elegant, but it shouldn't take too long to code up.
Edit:
While using a real scripting language would of course solve my problem, I'd still like to know if there is a quick way to implement this in pure C++.
You can embed Python fairly simply, and that would give you a really powerful, extensible way to script your program. You can use the following to easily (more or less) expose your C++ code to Python:
Boost Python
SWIG
I personally use Boost Python and I'm happy with it, but it is slow to compile and can be difficult to debug.
You could take a look at Lua.
I'd also go for the scripting language answer.
Using pure C++, I would probably use a parser generator, which takes the token and grammar rules and gives me C code that can parse exactly the given function-call language and provides me with a syntax tree of that call. flex can be used to tokenize the input, and bison can be used to parse the tokens and transform them into a syntax tree. Alternatively, Boost Spirit can be used to parse the function-call language too. I have never used any of these tools, but I have worked on programs that use them, so I have some idea of what I would use if I had to solve that problem.
For very simple cases, you could change your syntax to this:
func_name arg1, arg2
Then you can use:
std::istringstream str(line);
std::string fun_name; str >> fun_name;
map[fun_name](tokenize_args(str));
The map would be a
std::map<std::string, boost::function<void(std::vector<std::string>)> > map;
Which would be populated with the functions at the start of your program. tokenize_args would just separate the arguments and return a vector of them as strings. Of course, this is very primitive, but I think it's reasonable if all you want is some way to call a function (of course, if you want real scripting support, this approach won't suffice).
As Daniel said:
Script languages like Lua and Python would be the most used script languages for binding together c++ libraries.
You will have to add a script interface to your c++ application. The buildup of this interface obviously depends on what script language you chose.
CERN provides CINT, a C/C++ interpreter that can be embedded within your application to provide scripting capabilities.
If you only wish to call a function by literal name, you could use linker-specific functions.
On POSIX-compliant operating systems (like Linux), you can use dlopen() and dlsym(). You simply parse the input string and figure out the function name and arguments. Then you can ask the linker to find the function by name using dlsym().
On Windows however, these functions aren't available (unless there's some POSIX environment around, like Cygwin). But you can use the Windows API.
You can take a look here for details on these things: http://en.wikipedia.org/wiki/Dynamic_loading
C++ Reflection [2] by Fabio Lombardelli provides full reflection for C++ through template metaprogramming techniques. While it is fully compliant with the C++ standards, it requires the programmer to annotate the classes in order for them to be reflective.
http://cppreflect.sourceforge.net/
Otherwise you'd want a function pointer hash table, I think.
Does your system have to "take a string"? You could expose COM (or CORBA, or whatever) interfaces on your application and have whatever is generating these commands call into your application directly.