Extracting C function signatures for Erlang - c++

I was wondering what the most suitable tool is for extracting C (and eventually C++) function names, arguments and their types from source code. I would like to use a tool that can be automated as much as possible.
I want to extract the function signatures and then have the output read by another program written in Erlang. My goal is to use this information to construct the equivalent of the C types/signatures in Erlang (using ei). I would like this process to work for any C file, so I can't hardcode anything.
I have found a lot of tools that look promising, like Clang, Cscope and ANTLR, but I don't know if any of them will work or if there is a better approach out there.
Thanks.

Surely there is something better, but if you don't find it:
gcc -aux-info output demo.c
sed '/include/d' output
This extracts the function declarations from the source code, skipping the standard functions that come from included headers.
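As a rough sketch of what a consumer of that output could do before handing the data to Erlang, the following C++ helper splits one line of the filtered file into a return type, a name and a list of parameter types. It assumes lines of the form `/* demo.c:1:NF */ extern int add (int, int); /* (a, b) int a; int b; */`, which is how GCC documents the -aux-info output (origin in a comment, then the prototype, then K&R-style arguments in a comment for definitions). The names Prototype and parse_prototype are made up for this illustration, function-pointer parameters are not handled, and -aux-info itself only works for C, not C++.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One parsed prototype: return type, function name, parameter type strings.
// (Hypothetical type, introduced only for this sketch.)
struct Prototype {
    std::string return_type;
    std::string name;
    std::vector<std::string> params;
};

// Trim leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Remove every /* ... */ comment from a line.
static std::string strip_comments(std::string s) {
    for (std::size_t open = s.find("/*"); open != std::string::npos; open = s.find("/*")) {
        const std::size_t close = s.find("*/", open + 2);
        if (close == std::string::npos) { s.erase(open); break; }
        s.erase(open, close + 2 - open);
    }
    return s;
}

// Parse one line of (assumed) -aux-info output into `out`.
// Returns false when the line does not look like a simple prototype.
bool parse_prototype(const std::string& line, Prototype& out) {
    std::string s = trim(strip_comments(line));
    if (s.rfind("extern ", 0) == 0) s = trim(s.substr(7));
    const std::size_t open = s.find('(');
    const std::size_t close = s.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) return false;

    // The identifier immediately before '(' is the function name;
    // everything before it is the return type.
    const std::string head = trim(s.substr(0, open));
    const std::size_t sep = head.find_last_of(" \t*");
    if (sep == std::string::npos || sep + 1 == head.size()) return false;
    out.name = head.substr(sep + 1);
    out.return_type = trim(head.substr(0, sep + 1));

    // Split the parameter list on commas.
    out.params.clear();
    const std::string args = s.substr(open + 1, close - open - 1);
    std::size_t pos = 0;
    while (pos <= args.size()) {
        const std::size_t comma = args.find(',', pos);
        const std::size_t end = (comma == std::string::npos) ? args.size() : comma;
        const std::string a = trim(args.substr(pos, end - pos));
        if (!a.empty()) out.params.push_back(a);
        pos = end + 1;
    }
    return true;
}
```

Calling parse_prototype on each line of the sed output and printing the fields in a simple line-based format would give the Erlang side something easy to read. Lines that contain only a comment are rejected.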

Related

parser generator that generates stand-alone C++ code

Is there a LALR parser generator that produces stand-alone C++ code? I am hoping that it would generate two files named something like "Parser.cpp" and "Parser.hpp," and the generated parser is implemented in a single class (that I can wrap in whatever namespace) that I can use for my parsing needs.
I want to use it for fun (i.e. small personal projects), and I'd like the output to be stand-alone (without any headers) so that I know I can compile it wherever I have a C++ compiler.
The search so far:
I've looked at flex/bison, but AFAIK they both require special headers and libraries. I've also looked at ANTLR a little bit, but it is not obvious to me that it can generate stand-alone C++ code. If someone can confirm that it can, then I might look more into it.
GOLD Parser (Bart Kiers mentioned the list on Wikipedia) has support for C and C++ languages. It does not generate a completely self-contained C/C++ source code file. All it does is the generation of Lexer/Parser tables which can be consumed by the "parsing engine".
To accomplish your task (or something similar) I did the following:
Prepare your LALR grammar in Gold's format
Generate parsing tables (one binary file)
Use an old trick to convert the binary file into a header file like
unsigned char ParseTable[] = { ... };
Modify the loader from the "parsing engine" sources (or use the C version which supports in-memory loading, as I remember)
Combine the sources for the GPEngine (if it is a C++ version) into the .h/.cpp pair.
Append the ParseTable to .cpp
Sure, it's not that straightforward, but all the steps can in principle be done within a single "combine" script which can be used with a number of grammars.
I guess the major drawback is the fact that GOLD is closed-source and windows-only (it means that to produce the parsing tables you have to use Windows machine).
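The "convert the binary file into a header file" step above can be sketched in a few lines of C++. This is only an illustration: read_file and bytes_to_array are hypothetical helper names, the 12-bytes-per-row layout is arbitrary, and the input is assumed to be non-empty. Existing tools such as `xxd -i` produce similar output.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Read a whole file into memory as raw bytes.
std::vector<unsigned char> read_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Render raw bytes as a C/C++ array definition, for example
//   unsigned char ParseTable[] = { 0x47, 0x4f, ... };
std::string bytes_to_array(const std::vector<unsigned char>& bytes,
                           const std::string& name) {
    std::ostringstream out;
    out << "unsigned char " << name << "[] = {";
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) out << ',';
        out << (i % 12 == 0 ? "\n    " : " ");   // new row every 12 bytes
        out << "0x" << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<unsigned>(bytes[i]);
    }
    out << "\n};\n";
    return out.str();
}
```

A "combine" script could write bytes_to_array(read_file(...), "ParseTable") to the end of the generated .cpp file, after which the loader reads the table from memory instead of from disk.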
ANTLR can generate C++ code, although IMHO its C++ support is a bit weak; the output is more like C code. Still, it is a good environment to work with, and ANTLRWorks gives you a graphical representation of your syntax tree.
The output from flex+bison consists of two .c files and one .h file. These are completely stand-alone, in that they are all you need to compile into your application to make use of the parser. There are no additional libraries or headers needed (besides the standard C ones).
Unless I've misunderstood your requirements, you definitely can do what you want with flex+bison.

Parsing c++ function headers from a file using GNU toolchain

I need to parse function headers from a .i file used by SWIG, which contains all sorts of garbage besides the function headers. (The final output would be a list of function declarations.)
The best option for me would be using the GNU toolchain (GCC, Binutils, etc.) to do so, but I might be missing an easy way of doing it with SWIG. If I am, please tell me!
Thanks :]
edit: I also don't know how to do that with GCC toolchain, if you have an idea it will be great.
I would try getting an XML dump of the abstract syntax tree either from clang or from gccxml. From there you can easily extract the function declarations you are interested in.
Our DMS Software Reengineering Toolkit provides general purpose program parsing, analysis, and transformation capability. It has front ends for a wide variety of languages, including C++.
It has been used to analyze and transform very complex C++ programs and their header files.
You aren't clear as to what you will do after you "parse the function headers"; normally people want to extract some information or produce another artifact. DMS with its C++ front end can do the parsing; you can configure DMS to do the custom stuff.
As a practical matter, this isn't usually an afternoon's exercise; DMS is a complex beast, because it has to deal with complex beasts such as C++. And I'd expect you to face the same kind of complexity with any tool that can handle C++. The GCC toolchain can clearly handle C++, so you might be able to do it with that (at that same level of complexity), but GCC is designed to be a compiler, and IMHO you will find it a fight to get it to do what you want.
Your "output function declarations" goal isn't clear. You want just the function names? You want a function signature? You want all the type declarations on which the function depends? You want all the type declarations on which the function depends, if they are not already present in an existing include file you intend to use?
The best way to extract function decls from the garbage which is C header files is to substitute out what constitutes the most smelly garbage: macros. You can do that with:
cpp - The C Preprocessor

Python code to parse and inspect c++

Is there a library for Python that will allow me to parse c++ code?
For example, let's say I want to parse some c++ code and find the names of all classes and their member functions/variables.
I can think of a few ways to hack it together using regular expressions, but if there is an existing library it would be more helpful.
In the past I've used for such purposes gccxml (a C++ parser that emits easily-parseable XML) -- I hacked up my own Python interfaces to it, but now there's a pygccxml which should package that up nicely for you.
Parsing C++ accurately is light-years from something you can do with a regular expression.
You need a full C++ parser, and they're pretty hard to build. I've been involved in building one over several years, and track who is doing it; I don't know of any being attempted in Python.
The one I work on is DMS C++ Front End.
It provides not only parsing, but full name and type resolution. After parsing, you can basically extract detailed information about the code at whatever level of detail you like, including arbitrary details about function content.
You might consider using GCCXML, which does contain a parser, and will produce, I believe, the names of all classes, functions, and top-level variables. GCCXML won't give you any information about what's inside a function.
This is a little outside your question's scope perhaps... but depending on what you're trying to achieve, perhaps Exuberant Ctags is worth looking at.
Have not tried, but using the Python bindings from LLVM's Clang parser may work; see here.
How about pyparsing?

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source file parsing to an existing tool that generated output files from complex command line arguments.
The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it was a very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.
I used flex 2.5.4 for Windows to generate the tokenizer for this custom source file format, and it worked. But I hated the code: global variables, a weird naming convention, and the C++ code it generated was awful. The existing code generation backend was glued to the output of flex - I don't use yacc or bison.
I'm about to dive back into that code, and I'd like to use a better/more modern tool. Does anyone know of something that:
Runs in Windows command prompt (Visual studio integration is ok, but I use make files to build)
Generates a proper encapsulated C++ tokenizer. (No global variables)
Uses regular expressions for describing the tokenizing rules (compatible with lex syntax a plus)
Does not force me to use the c-runtime (or fake it) for file reading. (parse from memory)
Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
Gives me full control over variable and method names (so I can conform to my existing naming convention)
Allows me to link multiple parsers into a single .exe without name collisions
Can generate a UNICODE (16bit UCS-2) parser if I want it to
Is NOT an integrated tokenizer + parser-generator (I want a lex replacement, not a lex+yacc replacement)
I could probably live with a tool that just generated the tokenizing tables if that was the only thing available.
Ragel: http://www.complang.org/ragel/ It fits most of your requirements.
It runs on Windows
It doesn't declare the variables, so you can put them inside a class or inside a function as you like.
It has nice tools for analyzing regular expressions to see when they would backtrack. (I don't know about this very much, since I never use syntax in Ragel that would create a backtracking parser.)
Variable names can't be changed.
Table names are prefixed with the machine name, and they're declared "const static", so you can put more than one in the same file and have more than one with the same name in a single program (as long as they're in different files).
You can declare the variables as any integer type, including UChar (or whatever UTF-16 type you prefer). It doesn't automatically handle surrogate pairs, though. It doesn't have special character classes for Unicode either (I think).
It only does regular expressions... has no bison/yacc features.
The code it generates interferes very little with a program. The code is also incredibly fast, and the Ragel syntax is more flexible and readable than anything I've ever seen. It's a rock solid piece of software. It can generate a table-driven parser or a goto-driven parser.
Flex also has a C++ output option.
The result is a set of classes that do that parsing.
Just add the following to the head of your lex file:
%option c++
%option yyclass="Lexer"
Then in your source it is:
std::fstream file("config");
Lexer lexer(&file);
while(int token = lexer.yylex())
{
}
Boost.Spirit.Qi (parser-tokenizer) or Boost.Spirit.Lex (tokenizer only). I absolutely love Qi, and Lex is not bad either, but I just tend to take Qi for my parsing needs...
The only real drawback with Qi tends to be an increase in compile time, and it also runs slightly slower than hand-written parsing code. It is generally much faster than parsing with regex, though.
http://www.boost.org/doc/libs/1_41_0/libs/spirit/doc/html/index.html
Two tools come to mind, ANTLR and GOLD Parser, although you would need to find out for yourself which is suitable. Both have language bindings that let the generated parser be plugged into a C++ program.
Boost.Spirit and the YARD parser come to mind. Note that the approach of using lexer generators has been somewhat replaced by internal C++ DSLs (domain-specific languages) for specifying tokens, simply because the grammar is then part of your code: you follow a set of rules to specify it, without using an external utility.

Making a C++ app scriptable

I have several functions in my program that look like this:
void foo(int x, int y)
Now I want my program to take a string that looks like:
foo(3, 5)
And execute the corresponding function. What's the most straightforward way to implement this?
When I say straightforward, I mean reasonably extensible and elegant, but it shouldn't take too long to code up.
Edit:
While using a real scripting language would of course solve my problem, I'd still like to know if there is a quick way to implement this in pure C++.
You can embed Python fairly simply, and that would give you a really powerful, extensible way to script your program. You can use the following to easily (more or less) expose your C++ code to Python:
Boost Python
SWIG
I personally use Boost Python and I'm happy with it, but it is slow to compile and can be difficult to debug.
You could take a look at Lua.
I'd also go for the scripting language answer.
Using pure C++, I would probably use a parser generator, which takes the token and grammar rules and gives me C code that can parse exactly the given function call language and provides me with a syntax tree of that call. flex can be used to tokenize the input, and bison can be used to parse the tokens and transform them into a syntax tree. Alternatively, Boost Spirit can be used to parse the function call language too. I have never used any of these tools, but I have worked on programs that use them, so I have some idea of what I would use if I had to solve that problem.
For very simple cases, you could change your syntax to this:
func_name arg1, arg2
Then you can use:
std::istringstream str(line);
std::string fun_name; str >> fun_name;
map[fun_name](tokenize_args(str));
The map would be a
std::map<std::string, boost::function<void(std::vector<std::string>)> > map;
Which would be populated with the functions at the start of your program. tokenize_args would just separate the arguments and return a vector of them as strings. Of course, this is very primitive, but I think it's reasonable if all you want is some way to call a function (of course, if you want real scripting support, this approach won't suffice).
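A slightly fuller sketch of that idea, under a few assumptions: it uses std::function from C++11 in place of boost::function, passes the arguments by const reference, and wraps the lookup in a hypothetical dispatch helper that returns false for unknown commands. The Args and CommandMap type names are also made up for this illustration.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Args;
typedef std::map<std::string, std::function<void(const Args&)> > CommandMap;

// Split the rest of the line on commas and trim surrounding blanks.
Args tokenize_args(std::istream& in) {
    Args args;
    std::string item;
    while (std::getline(in, item, ',')) {
        const std::size_t b = item.find_first_not_of(" \t");
        if (b == std::string::npos) continue;          // skip empty pieces
        const std::size_t e = item.find_last_not_of(" \t");
        args.push_back(item.substr(b, e - b + 1));
    }
    return args;
}

// Look up the first word of the line and call its handler with the rest.
// Returns false if the line is empty or the command is not registered.
bool dispatch(const CommandMap& commands, const std::string& line) {
    std::istringstream str(line);
    std::string fun_name;
    if (!(str >> fun_name)) return false;
    const CommandMap::const_iterator it = commands.find(fun_name);
    if (it == commands.end()) return false;
    it->second(tokenize_args(str));
    return true;
}
```

Each registered handler converts the strings itself, for example with a lambda that calls foo(std::stoi(a.at(0)), std::stoi(a.at(1))). After that, dispatch(commands, "foo 3, 5") ends up calling foo(3, 5).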
As Daniel said:
Scripting languages such as Lua and Python are the ones most commonly used for binding together C++ libraries.
You will have to add a script interface to your C++ application. The design of this interface obviously depends on which scripting language you choose.
CERN provides CINT, a C/C++ interpreter that can be embedded within your application to provide scripting capabilities.
If you only wish to call a function by literal name, you could use linker-specific functions.
On POSIX-compliant operating systems (like Linux), you can use dlopen() and dlsym(). You simply parse the input string and figure out the function name and arguments. Then you can ask the linker to find the function by name using dlsym().
On Windows, however, these functions aren't available (unless there's some POSIX environment around, like Cygwin). But you can use the Windows API equivalents, LoadLibrary() and GetProcAddress().
You can take a look here for details on these things: http://en.wikipedia.org/wiki/Dynamic_loading
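For the dlopen()/dlsym() route described above, a minimal POSIX-only sketch might look like the following. The helper name find_int_function and the IntUnaryFn typedef are made up for this example. Note that dlsym() carries no type information, so the caller has to know (here: assume) that the symbol really has the signature int(int). Depending on the platform, you may need to link with -ldl, and functions defined in your own executable are usually only visible this way if they are declared extern "C" and the program is linked with something like -rdynamic.

```cpp
#include <dlfcn.h>
#include <string>

// Signature we assume for the looked-up function: int f(int).
typedef int (*IntUnaryFn)(int);

// Look up `name` among the symbols visible to the running program.
// Returns nullptr if the symbol cannot be found.
IntUnaryFn find_int_function(const std::string& name) {
    void* handle = dlopen(nullptr, RTLD_LAZY);   // handle for the main program
    if (handle == nullptr) return nullptr;
    dlerror();                                   // clear any stale error state
    void* sym = dlsym(handle, name.c_str());
    const bool failed = (dlerror() != nullptr);  // check before dlclose()
    dlclose(handle);
    if (failed) return nullptr;
    // Converting void* to a function pointer is what POSIX expects here.
    return reinterpret_cast<IntUnaryFn>(sym);
}
```

After parsing the input string into a name and arguments, the program would call find_int_function(name) and, if the result is not null, invoke it with the converted arguments. Supporting more than one signature needs a separate cast per signature, which is why the lookup-table approaches scale better.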
C++ Reflection [2] by Fabio Lombardelli provides full reflection for C++ through template metaprogramming techniques. While it is fully compliant with the C++ standards, it requires the programmer to annotate the classes in order for them to be reflective.
http://cppreflect.sourceforge.net/
Otherwise, I think you'd want a hash table of function pointers.
Does your system have to "take a string"? You could expose COM (or CORBA, or whatever) interfaces on your application and have whatever is generating these commands call into your application directly.