Library to abstract text parsing in C++

I'm about to parse a lot of files: the XML ones have a precise hierarchy and the others a precise syntax. I would like to abstract these files at "token level" to simplify my code and logic. I also need UTF-8 support.
Is there a library, perhaps a header-only one, that can do this in C++?
EDIT:
Suppose my file looks something like this:
COLOR=red Language=en
COLOR=blue Language=se
COLOR=green Language=fr
By "token level" I mean that I can access these values after parsing like this:
Object.getValue(color, 1)
and this should return "red".

Well, there is boost::spirit, which I think may be what you are looking for. You can use it to define rules for parsing input. I've never used it personally, but I've heard good things about it. Hope it helps.
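As a rough, untested sketch, a Spirit (Qi) grammar for key=value lines like yours could look something like this (the grammar details are illustrative and may need adjusting, and it ignores UTF-8 specifics):

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <map>
#include <string>

namespace qi = boost::spirit::qi;

int main() {
    std::string line = "COLOR=red Language=en";
    std::map<std::string, std::string> values;

    std::string::iterator first = line.begin();
    std::string::iterator last  = line.end();

    // One or more key=value pairs; keys stop at '=', values stop at a space.
    bool ok = qi::phrase_parse(
        first, last,
        +(qi::lexeme[+(qi::char_ - '=' - ' ')] >> '=' >> qi::lexeme[+(qi::char_ - ' ')]),
        qi::space,
        values);

    if (ok)
        std::cout << values["COLOR"] << "\n";   // prints "red"
    return 0;
}

You would still need a thin wrapper around the parsed map to get the Object.getValue(color, 1) style of access from your edit.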


What library to use to write XML file in a C++ program?
I've found two classes posted on CodeProject:
http://www.codeproject.com/KB/stl/simple_xmlwriter.aspx
http://www.codeproject.com/KB/XML/XML_writer.aspx
but I want to check whether there is a more standard option than these. I'm only concerned with writing XML, not parsing it.
I tried different libraries and finally settled on TinyXml. It's compact, fast, free (zlib license), and very easy to use.
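For illustration only, writing a small document with the classic TinyXml class names looks roughly like this (an untested sketch; the element names are made up, and TinyXML-2 renames these classes):

#include "tinyxml.h"

int main() {
    TiXmlDocument doc;
    doc.LinkEndChild(new TiXmlDeclaration("1.0", "UTF-8", ""));

    TiXmlElement* root = new TiXmlElement("company");
    doc.LinkEndChild(root);                      // the document owns the node now

    TiXmlElement* emp = new TiXmlElement("employee");
    emp->SetAttribute("name", "Alice");
    root->LinkEndChild(emp);

    doc.SaveFile("company.xml");
    return 0;
}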
Question: are you ever going to update an XML file? While that may sound like it's just more writing, with XML it still requires a parser.
While Xerces is large and bloated, it is fully standards compliant and it is DOM-based. Should you ever have to move to another platform or language, there will always be a DOM-based library for whatever language/platform you move to, so knowing how DOM-based parsing/writing works is a benefit. If you are going to use XML, you may as well use it correctly.
Avoiding XML altogether is of course the best option. But short of that, I'd go with xerces.
You can use Xerces-C++, a library maintained by the Apache foundation. It lets you read, write, and manipulate XML files.
Link: http://xerces.apache.org/xerces-c/
For my purposes, PugiXML worked out really nicely
http://pugixml.org/
The reason I thought it was so nice is that it's simply three files: a configuration header, a header, and the actual source.
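Writing a document with pugixml is about this much code (a rough sketch; the element and attribute names are invented):

#include "pugixml.hpp"

int main() {
    pugi::xml_document doc;
    pugi::xml_node company = doc.append_child("company");

    pugi::xml_node employee = company.append_child("employee");
    employee.append_attribute("name") = "Alice";
    employee.append_child(pugi::node_pcdata).set_value("Engineer");

    doc.save_file("company.xml");   // pugixml indents the output by default
    return 0;
}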
But as you stated, you aren't interested in parsing, so why bother with a special class to write out XML at all? Unless your classes are too complex for this, I found the easy thing to do is to use a std::ostream and just write out standards-compliant XML directly. For example, say I have a class representing a Company, which is a collection of Employee objects; just give each of the Company and Employee classes a method that looks something like the following pseudocode:
void Company::writeXML(std::ostream& out) {
    out << "<company>" << std::endl;
    BOOST_FOREACH(const Employee& e, employees) {
        e.writeXML(out);
    }
    out << "</company>" << std::endl;
}
To make it even easier, Employee's writeXML function can be declared virtual so that you can have specific output for, say, a CEO, President, Janitor, or whatever the subclasses should be.
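A minimal sketch of that idea (the class and tag names are made up for illustration):

#include <ostream>

struct Employee {
    virtual ~Employee() {}
    // Generic employees write a plain element; subclasses override this.
    virtual void writeXML(std::ostream& out) const {
        out << "  <employee/>" << std::endl;
    }
};

struct CEO : Employee {
    virtual void writeXML(std::ostream& out) const {
        out << "  <employee role=\"ceo\"/>" << std::endl;
    }
};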
I have been using the open-source libxml2 library for years; it works great for me.
I ran into the same problem and wound up rolling my own solution. It's implemented as a single header file that you can drop into your project: xml_writer.h
And it comes with a set of unit tests which also serve as documentation.
Roll your own
I've been in a similar situation. I had a program that needed to generate JSON. We did it two ways. First we tried jsoncpp, but in the end I just generated the JSON directly via a std::ofstream.
Afterward we ran the generated JSON through a validator to catch any syntax errors. There were a few but they were really easy to find and correct.
If I were to do it again I would definitely roll my own again. Somewhat unexpectedly, there was less code when using std::ofstream. Plus we didn't have to use/learn a new API. It was easier to write and is easier to maintain.
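A minimal sketch of that approach (the field names are invented, and string escaping is left out, which is exactly the kind of thing the validator pass is for):

#include <fstream>

int main() {
    std::ofstream out("report.json");
    out << "{\n"
        << "  \"status\": \"ok\",\n"
        << "  \"count\": " << 42 << "\n"
        << "}\n";
    return 0;
}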

Easy way to get function prototypes?

A friend and I were discussing imaginary and real languages, and a question came up: if one of us wanted to generate headers for another language (perhaps D, which already has such a tool), what would be an easy and reliable way to do this?
One of us suggested scanning C files and headers, ignoring function bodies and just counting the braces inside them to figure out where a function ends. The counterargument was typedefs, defines (which can also contain braces, though defines were considered a trivial problem) and templates plus specialization.
Another suggestion was to read the binaries produced: not the actual exe, but the object files the linker uses. The counterargument there was the format and its complexity; neither of us knew anything about any object format, so we couldn't estimate the effort (we were thinking of gcc and VS C++).
What do you guys think? Which is easier? Please back it up with reasonable logic and facts.
A link to a helpful project would be useful: one that parses C files/headers and outputs the prototypes, or an example project that reads in ELF data and displays the info. I tried googling, but I didn't know what it would be called. I found libelf, but at the moment I can't get it to compile; I might be able to soon.
You can use the clang libraries to parse C/C++ source code and extract any information you want, in particular function prototypes.
Thanks to its library-based architecture it is easy to reuse just the parts of clang that you need. In your case those are the frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach than a hand-written scanner, considering the difficulties you mentioned (typedefs, defines, etc.).
clang can also be used as a tool to parse source code and output the AST in XML form. For example, if you have the file test.cpp:
void foo() {}
int main()
{
    foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp, you'll get a file test.xml similar to the following (irrelevant parts skipped for brevity):
<?xml version="1.0"?>
<CLANG_XML>
  <TranslationUnit>
    <Function id="_1D" file="f2" line="1" col="6" context="_2"
              name="foo" type="_12" function_type="_1E" num_args="0">
    </Function>
    <Function id="_1F" file="f2" line="3" col="5" context="_2"
              name="main" type="_21" function_type="_22" num_args="0">
    </Function>
  </TranslationUnit>
  <ReferenceSection>
    <Types>
      <FunctionType result_type="_12" id="_1E"/>
      <FundamentalType kind="int" id="_21"/>
      <FundamentalType kind="void" id="_12"/>
      <FunctionType result_type="_21" id="_22"/>
      <PointerType type="_12" id="_10"/>
    </Types>
    <Files>
      <File id="f2" name="test.cpp"/>
    </Files>
  </ReferenceSection>
</CLANG_XML>
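If you would rather query the AST programmatically than post-process the XML, a rough sketch using the libclang C API looks like this (check the exact names against your clang version):

#include <clang-c/Index.h>
#include <cstdio>

// Print the name of every function declaration we encounter.
static CXChildVisitResult visit(CXCursor c, CXCursor /*parent*/, CXClientData /*data*/) {
    if (clang_getCursorKind(c) == CXCursor_FunctionDecl) {
        CXString name = clang_getCursorSpelling(c);
        std::printf("%s\n", clang_getCString(name));
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse;
}

int main() {
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, "test.cpp", NULL, 0, NULL, 0, CXTranslationUnit_None);
    if (!tu) return 1;

    clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, NULL);

    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(index);
    return 0;
}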
I don't think extracting this information from binaries is possible, at least for symbols with C linkage, because they have no name mangling to encode the types.
ctags' output is quite easy to read/parse
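For reference, the tags file that Exuberant Ctags produces is tab-separated with one line per symbol, roughly like this (the exact fields depend on your ctags options):

foo     test.cpp    /^void foo() {}$/;"    f
main    test.cpp    /^int main()$/;"       f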
If you simply want to generate a binding, try SWIG.
What you're talking about is compiling: the act of transforming code in one formal language into another. There's good, solid science behind this that, if followed carefully, will guarantee your program halts with correct analogous code.
Granted, you don't want to parse the whole of the C++ language (hooray for that!), so you just need to define the relevant grammar and define everything else as acceptable noise or comments.
Don't use regular expressions. These won't do because C++ is not a regular language.
One way to do this is to define your interfaces in an abstract language (an IDL), and generate headers for all languages that you're interested in. You can limit the scope of your IDL to those features that are possible in each target language.
Windows takes this approach in its MIDL language, for example.
Doxygen can help with this. It's an advanced topic, and somewhat documented.
In a perfect world...
Each compiler of any programming language would emit such information as part of its output. It could be an ELF extension, or a new generally accepted file format that contains an ELF/COFF/whatever section.
This would spare the thousands of man-hours (across the globe) spent generating "language bindings" for dozens of languages. Dynamic languages such as Common Lisp would not need FFI libraries: it would all happen under the hood, as a shared library in that new format could be inspected automatically on load and the functions in it made available on the fly, without any further ado.
And all that would actually have to be stored as extra information for each (exported) function is:
Generic name
Return type
Argument list
Calling convention (there are not that many in practice)
A reference to the corresponding symbol table entry
Why has it not been done during the past 30 years?
Because language designers still see the compiler output as their "private affair", whereas IMHO this should be part of the OS/runtime/loader.
Many workarounds exist, some listed in other answers here (IDL, binding generators such as SWIG, and a myriad of others, mostly ad hoc and of varying and typically insufficient quality).
The ELF guys are Unix guys and as such C guys, who still live in a bubble, thinking C is some sort of "golden standard". But even they would have no problem using the information to generate their beloved header files.
Rust might not have invented crates in that perfect world, and you would not even care whether some shared library had been written in C, C++, Rust, Zig, or Haskell: anyone could just use it.
This domain is still dominated by "computer scientists", not engineers; while in theory there could be an infinite number of calling conventions, in practice those in active use are a countable few (LLVM supports most of the ones that are actually relevant).
Another reason it has not been done yet is the danger that some committee would create a monster while trying to make it "open, flexible, ...", and in the end no one would dare use it. DCE (the Distributed Computing Environment) gives an idea of that.

How to extract ALL typedefs and structs and unions from c++ source

I have inherited a Visual Studio project that contains hundreds of files.
I would like to extract all the typedefs, structs and unions from each .h/.cpp file and put the results in a file.
Each typedef/struct/union should be on one line in the results file. This would make sorting much easier, for example:
typedef int myType;
struct myFirstStruct { char a; int b;...};
union Part_Number_Serial_Number_Part_2_Response_Message_Type {struct{Message_Response_Head_Type Head; Part_Num_Serial_Num_Part_2_Report_Array Part_2_Report; Message_Tail_Type Tail;} Data; BYTE byData[140];}myUnion;
struct { bool c; int d;...}mySecondStruct;
My problem is that I do not know what to look for (the grammar of typedefs/structs/unions) with a regular expression.
I cannot believe that nobody has done this before (I googled and have not found anything on it).
Does anyone know the regular expressions for these? (Note that some are commented out using //, others with /* */.)
Or a tool to accomplish this.
Edit:
I am toying with the idea of autogenerating source code and/or dialogs for modifying messages that use the underlying typedef/struct/union. I was going to use the output to generate an XML file that could be used for this purpose.
The source for these is in C/C++ and is used in almost all my projects. Those projects are usually NOT in C/C++. By using the XML version I would only need to update or add each typedef/struct/union in one place, and all the projects would be able to autogenerate the source and/or dialogs.
I can't imagine a purpose for this except some sort of documentation effort. If that is what you're after, I would suggest doxygen.
To answer your question, I seriously doubt any amount of regular expressions will be sufficient. What you need to do is actually parse the code. I have heard of a library out there for building compilers and C++ tools that would provide the parsing aspect but I'm sorry to say I have forgotten the name. I know it's out there though so I'd start searching for that.
You will NOT be able to accomplish this with a regular expression. The only way to actually do this will be to get hold of a lexer and parser for the C++ grammar and write the code yourself to dump the interesting bits to a file or database upon encountering one of the structures you're interested in. And unfortunately, C++ parsing is rather hard.
Parsing C++ is ... difficult. Instead of killing yourself trying to parse it yourself, there are a few options:
gcc-xml
cscope
ctags
global
Each of these will parse C++ code and grab the info you're after. If you want to dump it to a file in the format you requested, you'd be a lot better off parsing their data files than parsing raw C++.
I recommend you skip all of this and just use doxygen. It won't be in your preferred format but you'll be better off getting used to doxygen's layout.

Python code to parse and inspect c++

Is there a library for Python that will allow me to parse C++ code?
For example, let's say I want to parse some C++ code and find the names of all classes and their member functions/variables.
I can think of a few ways to hack it together using regular expressions, but if there is an existing library, that would be more helpful.
In the past I've used gccxml for such purposes (a C++ parser that emits easily parseable XML). I hacked up my own Python interfaces to it, but now there's pygccxml, which should package that up nicely for you.
Parsing C++ accurately is light-years from something you can do with a regular expression.
You need a full C++ parser, and they're pretty hard to build. I've been involved in building one over several years, and track who is doing it; I don't know of any being attempted in Python.
The one I work on is DMS C++ Front End.
It provides not only parsing, but full name and type resolution. After parsing, you can extract detailed information about the code at whatever level of detail you like, including arbitrary details about function content.
You might consider using GCCXML, which does contain a parser, and will produce, I believe, the names of all classes, functions, and top-level variables. GCCXML won't give you any information about what's inside a function.
This is perhaps a little outside your question's scope, but depending on what you're trying to achieve, Exuberant Ctags may be worth a look.
I haven't tried it, but using the Python bindings for LLVM's Clang parser may work; see here.
How about pyparsing?

How to generate empty definitions given a header file

I have a third-party library which, for various reasons, I don't wish to link against yet. I don't want to butcher my code to remove all references to its API, though, so I'd like to generate a dummy implementation of it.
Is there any tool I can use that spits out empty definitions of classes given their header files? It's fine to return null, false and 0 by default. I don't want to do anything on-the-fly or anything clever; the mock-object libraries I've looked at seem quite heavyweight. Ideally I want something to use like:
$ generate-definition my_header.h > dummy_implementation.cpp
I'm using Linux with GCC 4.1.
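For example, given a (made-up) header like the one below, the generated file would only need to repeat each signature with a default return:

// my_header.h
class Widget;
int     count_widgets();
bool    save_widget(const Widget& w);
Widget* find_widget(const char* name);

// dummy_implementation.cpp -- what the tool should emit
#include "my_header.h"
int     count_widgets()            { return 0; }
bool    save_widget(const Widget&) { return false; }
Widget* find_widget(const char*)   { return 0; }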
This is a harder problem than you might like, as parsing C++ can quickly become a difficult task. Your best bet would be to pick an existing parser with a nice interface.
A quick search found this thread which has many recommendations for parsers to do something similar.
At the very worst you might be able to use SWIG --> Python, and then use reflection on that to print a dummy implementation.
Sorry this is only a half-answer, but I don't think there is an existing tool to do this (other than a mocking framework, which is probably the same amount of work as using a parser).
Write a small test application that reads the header file and creates the source file. The application would have to parse the header to know the function names.