Conversion between C structs (C++ POD) and google protobufs? - c++

I have code that currently passes around a lot of (sometimes nested) C (or C++ Plain Old Data) structs and arrays.
I would like to convert these to/from google protobufs. I could manually write code that converts between these two formats, but it would be less error prone to auto-generate such code. What is the best way to do this? (This would be easy in a language with enough introspection to iterate over the names of member variables, but this is C++ code we're talking about)
One thing I'm considering is writing python code that parses the C structs and then spits out a .proto file, along with C code that copies from member to member (in either direction) for all of the types, but maybe there is a better way... or maybe there is another IDL that already can generate:
.h file containing all of nested types
.proto file containing equivalents
.c file with functions that copy either direction between the C++ structs that the .proto file generates and the structs defined in the .h file

I could not find a ready solution for this problem, if there is one, please let me know!
If you decide to roll your own in python, the python bindings for gdb might be useful. You could then read the symbol table, find all structs defined in specified file, and iterate all struct members.
Then use <gdbtype>.strip_typedefs() to get the primitive type of each member and translate it to appropriate protobuf type.
This is probably safer then a text parsers as it will handle types that depends on architecture, compiler flags, preprocessor macros, etc.
I guess the code to convert to and from protobuf also could be generated from the struct member to message field relation, but does not sound easy.

Protocol buffers can be built by parsing an ASCII representation using TextFormat. So one option would be to add a method dumpAsciiProtoBuf to each of your structs. The method would dump any simple fields (like strings, bools, etc) and call dumpAsciiProtoBuf recursively on nested structs fields. You would then have to make sure that the concatenated result is a valid ASCII protocol buffer which can be parsed using TextFormat.
Note though that this might have some performance implications (since parsing the ASCII representation could be expensive). However, this would save you the trouble of writing a converter in a different language, so it seems to be a convenient solution.

I would not parse the C source code myself, instead I would use the LibClang to parse C files into an AST and my own AST walker to generate the Protobuf and the transcoders as necessary. Googling for "libclang walk AST" should give something to start with, like ast-walker.cc and ast-dumper.cc from this github repository, for example.

The question brought up is the age old challenge with "C" (and C++) code - No easy (or standard) way to reflect on c "struct" (or classes). Just search stack overflow on C reflection, and you will see lot of unsuccessful attempts. My first advice will be NOT to try to build another solution (in python, etc.).
One simple approach: Consider using gdb ptype to get structured output for you structures, which you can use to create the .proto file. The advantage is that there is no need to handle the full syntax of the C language (#define, line breaks, ...). See How do I show what fields a struct has in GDB?
From the gdb ptype, it's a short trip to protobuf '.proto' file.
You can get similar result from libCLang (and I believe there is comparable gcc plugin, but I can not locate it). However, you will have to write some non-trivial "C" code.
Another approach - will be to use 'swig' (https://www.swig.org), and process the swig xml output (or the -xmlout option) to dump the parse tree into XML. While this approach will require a little bit of digging to locate the structure that are needed, the information in XML format is complete, easy to parse (using whatever XML parser you want - python, perl). If you are brave enough, you can use xslt to generate the output.

Related

Is it possible to create a user-defined datatype in a language like C/C++(or maybe any) from a string as user input or from file

Well this might be a very weird question but my curiosity has striken pretty hard on this. So here it goes...
NOTE: Lets take the language C into consideration here.
As programmers we usually define a user-defined datatype(say struct) in the source code with the appropriate name.
Suppose I have a program in which I have a structure defined as:
struct Animal {
char *name;
int lifeSpan;
};
And also I have started the execution of this program.
Now, my question here is;
What if I want to define a new structure called "Plant" just like "Animal" mentioned above in my program, without writing its definition in the source code itself(which is obviously impossible currently) but rather from a user input string(or a file input) during runtime.
Lets say my program takes input string from a text file named file1.txt whose content is:
struct Plant {
char *name;
int lifeSpan;
};
What I want now is to have a new structure named "Plant" in my program which is already in execution. The program should read the file content and create a structure as written in the file and attach it to itself on-the-go.
I have checked out a solution for C++ in the discussion Declaring a data type dynamically in C++ but it doesnt seem to have a very convincing solution.
The solution I am looking for is at the compiler-linker-loader level rather than from the language itself.I would be very pleased and thankful if anyone is looking forward to sharing their ideas on this.
What you're asking about is basically "can we implement C as a scripting language?", since this is the only way code can be executed after compilation.
I'm aware that people have been writing (mostly in the comments) that it's possible in other languages but isn't possible in C, since C is a compiled language (hence data types should be defined during compile time).
However, to the best of my knowledge it's actually possible (and might not be as hard as one would imagine).
There are many possible approaches (machine code emulation (VM), JIT compilation, etc').
One approach will use a C compiler to compile the C script as an external dynamic library (.dll on windows, .so on linux, etc') and than "load" the compiled library and execute the code (this is pretty much the JIT compilation approach, for lazy people).
EDIT:
As mentioned in the comments, by using this approach, the new type is loaded as part of an external library.
The original code won't know about this new type, only the new code (or library) will be "aware" of this new type and able to properly use it.
On the other hand, I'm not sure why you're insisting on the need to use static types and a compiler-linker-loader level solution.
The language itself (the C language) can manage this task dynamically (during execution time).
Consider Ruby MRI, for example. The Ruby language supports dynamic types that can be defined during runtime...
...However, this is implemented in C and it's possible to use the code from within C to define new modules and classes. These aren't static types that can be tested during compilation (type creation and identification is performed during runtime).
This is a perfect example showing that C (as a language) can dynamically define "types".
However, this is also a poor example because Ruby's approach is slow. A custom approved can be far faster since it would avoid the huge overhead related to functionality you might not need (such as inheritance).

How to modify C++ code from user-input

I am currently writing a program that sits on top of a C++ interpreter. The user inputs C++ commands at runtime, which are then passed into the interpreter. For certain patterns, I want to replace the command given with a modified form, so that I can provide additional functionality.
I want to replace anything of the form
A->Draw(B1, B2)
with
MyFunc(A, B1, B2).
My first thought was regular expressions, but that would be rather error-prone, as any of A, B1, or B2 could be arbitrary C++ expressions. As these expressions could themselves contain quoted strings or parentheses, it would be quite difficult to match all cases with a regular expression. In addition, there may be multiple, nested forms of this expression
My next thought was to call clang as a subprocess, use "-dump-ast" to get the abstract syntax tree, modify that, then rebuild it into a command to be passed to the C++ interpreter. However, this would require keeping track of any environment changes, such as include files and forward declarations, in order to give clang enough information to parse the expression. As the interpreter does not expose this information, this seems infeasible as well.
The third thought was to use the C++ interpreter's own internal parsing to convert to an abstract syntax tree, then build from there. However, this interpreter does not expose the ast in any way that I was able to find.
Are there any suggestions as to how to proceed, either along one of the stated routes, or along a different route entirely?
What you want is a Program Transformation System.
These are tools that generally let you express changes to source code, written in source level patterns that essentially say:
if you see *this*, replace it by *that*
but operating on Abstract Syntax Trees so the matching and replacement process is
far more trustworthy than what you get with string hacking.
Such tools have to have parsers for the source language of interest.
The source language being C++ makes this fairly difficult.
Clang sort of qualifies; after all it can parse C++. OP objects
it cannot do so without all the environment context. To the extent
that OP is typing (well-formed) program fragments (statements, etc,.)
into the interpreter, Clang may [I don't have much experience with it
myself] have trouble getting focused on what the fragment is (statement? expression? declaration? ...). Finally, Clang isn't really a PTS; its tree modification procedures are not source-to-source transforms. That matters for convenience but might not stop OP from using it; surface syntax rewrite rule are convenient but you can always substitute procedural tree hacking with more effort. When there are more than a few rules, this starts to matter a lot.
GCC with Melt sort of qualifies in the same way that Clang does.
I'm under the impression that Melt makes GCC at best a bit less
intolerable for this kind of work. YMMV.
Our DMS Software Reengineering Toolkit with its full C++14 [EDIT July 2018: C++17] front end absolutely qualifies. DMS has been used to carry out massive transformations
on large scale C++ code bases.
DMS can parse arbitrary (well-formed) fragments of C++ without being told in advance what the syntax category is, and return an AST of the proper grammar nonterminal type, using its pattern-parsing machinery. [You may end up with multiple parses, e.g. ambiguities, that you'll have decide how to resolve, see Why can't C++ be parsed with a LR(1) parser? for more discussion] It can do this without resorting to "the environment" if you are willing to live without macro expansion while parsing, and insist the preprocessor directives (they get parsed too) are nicely structured with respect to the code fragment (#if foo{#endif not allowed) but that's unlikely a real problem for interactively entered code fragments.
DMS then offers a complete procedural AST library for manipulating the parsed trees (search, inspect, modify, build, replace) and can then regenerate surface source code from the modified tree, giving OP text
to feed to the interpreter.
Where it shines in this case is OP can likely write most of his modifications directly as source-to-source syntax rules. For his
example, he can provide DMS with a rewrite rule (untested but pretty close to right):
rule replace_Draw(A:primary,B1:expression,B2:expression):
primary->primary
"\A->Draw(\B1, \B2)" -- pattern
rewrites to
"MyFunc(\A, \B1, \B2)"; -- replacement
and DMS will take any parsed AST containing the left hand side "...Draw..." pattern and replace that subtree with the right hand side, after substituting the matches for A, B1 and B2. The quote marks are metaquotes and are used to distinguish C++ text from rule-syntax text; the backslash is a metaescape used inside metaquotes to name metavariables. For more details of what you can say in the rule syntax, see DMS Rewrite Rules.
If OP provides a set of such rules, DMS can be asked to apply the entire set.
So I think this would work just fine for OP. It is a rather heavyweight mechanism to "add" to the package he wants to provide to a 3rd party; DMS and its C++ front end are hardly "small" programs. But then modern machines have lots of resources so I think its a question of how badly does OP need to do this.
Try modify the headers to supress the method, then compiling you'll find the errors and will be able to replace all core.
As far as you have a C++ interpreter (as CERN's Root) I guess you must use the compiler to intercept all the Draw, an easy and clean way to do that is declare in the headers the Draw method as private, using some defines
class ItemWithDrawMehtod
{
....
public:
#ifdef CATCHTHEMETHOD
private:
#endif
void Draw(A,B);
#ifdef CATCHTHEMETHOD
public:
#endif
....
};
Then compile as:
gcc -DCATCHTHEMETHOD=1 yourfilein.cpp
In case, user want to input complex algorithms to the application, what I suggest is to integrate a scripting language to the app. So that the user can write code [function/algorithm in defined way] so the app can execute it in the interpreter and get the final results. Ex: Python, Perl, JS, etc.
Since you need C++ in the interpreter http://chaiscript.com/ would be a suggestion.
What happens when someone gets ahold of the Draw member function (auto draw = &A::Draw;) and then starts using draw? Presumably you'd want the same improved Draw-functionality to be called in this case too. Thus I think we can conclude that what you really want is to replace the Draw member function with a function of your own.
Since it seems you are not in a position to modify the class containing Draw directly, a solution could be to derive your own class from A and override Draw in there. Then your problem reduces to having your users use your new improved class.
You may again consider the problem of automatically translating uses of class A to your new derived class, but this still seems pretty difficult without the help of a full C++ implementation. Perhaps there is a way to hide the old definition of A and present your replacement under that name instead, via clever use of header files, but I cannot determine whether that's the case from what you've told us.
Another possibility might be to use some dynamic linker hackery using LD_PRELOAD to replace the function Draw that gets called at runtime.
There may be a way to accomplish this mostly with regular expressions.
Since anything that appears after Draw( is already formatted correctly as parameters, you don't need to fully parse them for the purpose you have outlined.
Fundamentally, the part that matters is the "SYMBOL->Draw("
SYMBOL could be any expression that resolves to an object that overloads -> or to a pointer of a type that implements Draw(...). If you reduce this to two cases, you can short-cut the parsing.
For the first case, a simple regular expression that searches for any valid C++ symbol, something similar to "[A-Za-z_][A-Za-z0-9_\.]", along with the literal expression "->Draw(". This will give you the portion that must be rewritten, since the code following this part is already formatted as valid C++ parameters.
The second case is for complex expressions that return an overloaded object or pointer. This requires a bit more effort, but a short parsing routine to walk backward through just a complex expression can be written surprisingly easily, since you don't have to support blocks (blocks in C++ cannot return objects, since lambda definitions do not call the lambda themselves, and actual nested code blocks {...} can't return anything directly inline that would apply here). Note that if the expression doesn't end in ) then it has to be a valid symbol in this context, so if you find a ) just match nested ) with ( and extract the symbol preceding the nested SYMBOL(...(...)...)->Draw() pattern. This may be possible with regular expressions, but should be fairly easy in normal code as well.
As soon as you have the symbol or expression, the replacement is trivial, going from
SYMBOL->Draw(...
to
YourFunction(SYMBOL, ...
without having to deal with the additional parameters to Draw().
As an added benefit, chained function calls are parsed for free with this model, since you can recursively iterate over the code such as
A->Draw(B...)->Draw(C...)
The first iteration identifies the first A->Draw( and rewrites the whole statement as
YourFunction(A, B...)->Draw(C...)
which then identifies the second ->Draw with an expression "YourFunction(A, ...)->" preceding it, and rewrites it as
YourFunction(YourFunction(A, B...), C...)
where B... and C... are well-formed C++ parameters, including nested calls.
Without knowing the C++ version that your interpreter supports, or the kind of code you will be rewriting, I really can't provide any sample code that is likely to be worthwhile.
One way is to load user code as a DLL, (something like plugins,)
this way, you don't need to compile your actual application, just the user code will be compiled, and you application will load it dynamically.

How does to!string(enum.member) works?

How does std.conv.to!string(enum.member) work? How is it possible that a function takes an enum member and returns its name? Does it use a compiler extension or something similar? It's a bit usual to me since I came from C/C++ world.
What it does is use compile time reflection on the enum type to get a list of members (the names as strings) and their values. It constructs a switch statement out of this information for a fast lookup to get the name from a value. to!SomeEnum("a_string") uses the same principle, just in the other direction.
The compile time reflection info is accessed with __traits(allMembers, TheEnumType), which returns a list of strings that can be looped over to build the switch statement. Then __traits(getMember, TheEnumType, memberName) is used to fetch the body.
Traits can be seen more of here: http://dlang.org/traits.html#allMembers
That allMembers one works on many types, not just classes as seen in the example, but also structs, enums, and more, even modules.
The phobos source code has some examples like EnumMembers in std.traits: https://github.com/D-Programming-Language/phobos/blob/master/std/traits.d#L3360
though the phobos source is kinda hard to read, but on line 3399, at the bottom of that function, you can see it using __traits(allMembers) as its data source. std.conv.to is implemented in terms of many std.traits functions.
You can also check out the sample chapter tab to get the Reflection chapter out of my D cookbook which discusses this stuff too:
http://www.packtpub.com/discover-advantages-of-programming-in-d-cookbook/book
The final example in that chapter shows how to use several of the reflection capabilities to build a little function dispatcher based on strings. The following chapter (not available for free though) shows how to build a switch out of it for better efficiency too. It's actually pretty easy: just put the case statements inside a foreach over the compile time data and the D compiler will unroll then optimize the lookup table for you!

Quick'n'dirty way to prototype some serialization in C++?

I need to do a prototype that involves some serialization in C++. It is a quick'n'dirty prototype, so I don't need to solve the problem generally, provide good error checking, or anything like that. But at the same time, I do need to be able to serialize strings of arbitrary length and with arbitrary charcters.
Are there some best practices for how to whip up a quick data serialization in C++? Normally I'd just have output records into a text file with one record per line, but my strings may have new lines in them.
You could consider using JSON, notably thru JsonCpp. You could also use libs11n, a full fledged, template friendly, C++ serialization framework.
(If you want a C library for Json, consider jansson).
You might also consider using old XDR or ASN1 technology.
For a quick & dirty prototype, I do recommend the JsonCpp library mentioned above. And using JSON in that case is useful, since it is a textual, nearly-human-friendly, format.
Later you could even perhaps consider going to MongoDb which has a Json-like model.
Checkout serialization with boost:
http://www.boost.org/doc/libs/1_51_0/libs/serialization/doc/index.html
Not dirty at all but definitely quick.
If you do not mind binary data, for each string dump a length (cast to a char*) and then the value of the string to file. It is very easy to read back. POD structs can also be dumped directly by casting to a char*

Making a parser to extract function name, parameters, return type

I need to parse a C++ class file (.h) and extract the following informations:
Function names
Return types
List of parameter types of each function
Assume that there is a special tag using which I can recognize if I need to parse a function or not.
For eg.
#include <someHeader>
class Test
{
public:
Test();
void fun1();
// *Expose* //
void fun2();
};
So I need to parse only fun2().
I read the basic grammar here, but found it too complex to comprehend.
Q1. I can't make out how complex this task is. Can someone provide a simpler grammar for a function declaration to perform this parsing?
Q2. Is my approach right or should I consider using some library rather than reinventing?
Edit: Just to clarify, I don't have problem parsing, problem is more of understanding the grammar I need to parse.
A C++ header may include arbitrary C++ code. Hence, parsing the header might be as hard as parsing all kinds of C++ code.
Your task becomes easier, if you can make certain assumptions about your header file. For instance, if you always have an EXPOSE-tag in front of your function and the functions are always on a single line, you could first grep for those lines:
grep -A1 EXPOSE <files>
And then you could apply a regular expression to filter out the information you need.
Nevertheless, I'd recommend using existing tools. This seems to be a tutorial on how to do it with clang and Python.
GCC XML is an open source tool that emits the AST (Abstract Syntax Tree). See this other answer where I posted about the usage I made of it.
You should consider to use only if you are proficient (or akin to learn) with an XML analyzer for inspecting the AST. It's a fairly complex structure...
You will need anyway to 'grep' for the comments identifying your required snippets, as comments are lost in output XML.
IF you are doing this just for documentation doxygen could be a good bet.
Either way it may give you some pointers as to how to do this.