C++ Parsing code (Hand written) - c++

I need to parse a language that is similar to a minimal version of Java. Since efficiency is the most important factor, I chose a hand-written parser instead of LALR parser generators like GOLD, bison and yacc.
However, I can't find the theory behind good hand-written parsers. It seems like there are only tutorials on those generators and the mechanisms behind them.
Do I have to drop regular expressions? I imagine they are slow compared to hand-written tokenizers.
Does anybody know a good class or tutorial for hand written parsing?
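For reference, a hand-written tokenizer of the kind mentioned in the question can be sketched as follows. This is only a minimal sketch under assumptions: the token kinds, the `Token` struct and the `tokenize` function are hypothetical names, and the token rules (identifiers, unsigned integers, single-character punctuation) are a guess at what a small Java-like language might need.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Token kinds for a hypothetical Java-like mini language (illustrative only).
enum class TokKind { Identifier, Number, Punct, End };

struct Token {
    TokKind kind;
    std::string text;
};

// Scans the input once, character by character, without a regex engine.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            // Identifier or keyword: letter or underscore, then letters, digits, underscores.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            out.push_back({TokKind::Identifier, src.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            // Unsigned integer literal.
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({TokKind::Number, src.substr(start, i - start)});
        } else {
            // Anything else is treated as one-character punctuation.
            ++i;
            out.push_back({TokKind::Punct, std::string(1, src[start])});
        }
    }
    out.push_back({TokKind::End, ""});
    return out;
}
```

Calling `tokenize("int x = 42;")` yields the tokens `int`, `x`, `=`, `42`, `;` followed by an end marker. Whether this actually beats a regex-based scanner depends on the regex implementation and the input, so it is worth measuring.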

In case it helps, here is (not a class or a tutorial but) an example of a hand-written parser: https://github.com/tabatkins/css-parser (however, it's explicitly coded for correct/simple correspondence to the specification, and not optimized for high performance).
The bigger problem is, I expect, to develop the specification for the parsing. Examples of parser specifications include http://dev.w3.org/csswg/css3-syntax/ and a similar one for parsing HTML5.
The prerequisite to using a parser generator is that the language syntax has been defined by a grammar (where the grammar format is supported by the parser generator), instead of by a parsing algorithm.

Related

From proprietary format to C++ classes

Given an input like this (USER DEFINED FORMAT):
type dog<
int years
char[] name
>
How can I generate two or more different files like these:
file1.c
------------
struct dog{
int years
char name
}
file2.cpp
-------------
class dog{
int years
string name
%get and set methods
}
Is a parser generator like flex and bison the best way? Or is there a better way?
It all depends on the complexity of your language and the transformations you intend.
Option 1: hand-crafted ad-hoc processing
If your transformations are simple and mostly convert some tokens from your source language into the target language (e.g. "type" into "struct" and "<>" into "{}"), you could think of hand-crafted parsing using string operations or regexes with std::regex.
But this is not sufficient here: you must identify statements in your source language to generate semicolon-terminated C and C++ statements. Furthermore, you must relate the [] to the right element, and to generate correct getters and setters your code must understand the content of each statement.
So yes, here you'll need a lexer and a parser.
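To illustrate how far the token-substitution idea from Option 1 gets, here is a minimal sketch using std::regex. The function name `naiveConvert` and the two patterns are assumptions made for this example only.

```cpp
#include <regex>
#include <string>

// Naive token substitution: "type NAME<" becomes "struct NAME{" and ">" becomes "}".
// It does not look at the fields at all, which is exactly its weakness.
std::string naiveConvert(const std::string& src) {
    static const std::regex header(R"(type\s+(\w+)\s*<)");
    static const std::regex closer(R"(>)");
    std::string out = std::regex_replace(src, header, "struct $1{");
    return std::regex_replace(out, closer, "}");
}
```

For the `dog` input from the question, the result is `struct dog{`, then the two field lines unchanged, then `}`. The fields still lack semicolons and `char[] name` is still not valid C, which is why a real lexer and parser are needed.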
Option 2: hand-crafted parser
For "simple" languages, you could easily craft a recursive descent parser. What does "simple" mean? It means that the language can be described with an LL(1) grammar. Without going too deep into compiler theory, these are more or less the languages where one token of lookahead uniquely determines which grammar construct you have.
Note that there is a very simple calculator example designed this way in "The C++ Programming Language" by Bjarne Stroustrup.
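As a rough illustration of Option 2, here is a minimal recursive descent sketch for the `type dog< ... >` format from the question. The grammar in the comments is an assumption inferred from the single example given, and all names (`Field`, `TypeDef`, `Parser`, `emitCStruct`) are hypothetical. Mapping `char[]` to a `char *` member is likewise just one possible choice.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct Field { std::string type; bool isArray; std::string name; };
struct TypeDef { std::string name; std::vector<Field> fields; };

class Parser {
public:
    explicit Parser(const std::string& src) : src_(src), pos_(0) { advance(); }

    // typedef := "type" IDENT "<" field* ">"
    TypeDef parseTypeDef() {
        expect("type");
        TypeDef def;
        def.name = identifier();
        expect("<");
        while (tok_ != ">") def.fields.push_back(parseField());
        expect(">");
        return def;
    }

private:
    // field := IDENT [ "[" "]" ] IDENT
    Field parseField() {
        Field f;
        f.type = identifier();
        f.isArray = false;
        // One token of lookahead is enough to decide whether this is an array.
        if (tok_ == "[") { advance(); expect("]"); f.isArray = true; }
        f.name = identifier();
        return f;
    }

    std::string identifier() {
        if (tok_.empty() || !std::isalpha(static_cast<unsigned char>(tok_[0])))
            throw std::runtime_error("expected identifier, got '" + tok_ + "'");
        std::string id = tok_;
        advance();
        return id;
    }

    void expect(const std::string& s) {
        if (tok_ != s) throw std::runtime_error("expected '" + s + "', got '" + tok_ + "'");
        advance();
    }

    // Loads the next token into tok_ (the lookahead); tok_ is empty at end of input.
    void advance() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        std::size_t start = pos_;
        if (pos_ < src_.size() && std::isalnum(static_cast<unsigned char>(src_[pos_]))) {
            while (pos_ < src_.size() && std::isalnum(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        } else if (pos_ < src_.size()) {
            ++pos_;  // single-character punctuation such as '<', '>', '[' or ']'
        }
        tok_ = src_.substr(start, pos_ - start);
    }

    std::string src_;
    std::size_t pos_;
    std::string tok_;
};

// Emits a C struct; array fields become pointers in this sketch.
std::string emitCStruct(const TypeDef& def) {
    std::ostringstream out;
    out << "struct " << def.name << " {\n";
    for (const Field& f : def.fields)
        out << "    " << f.type << (f.isArray ? " *" : " ") << f.name << ";\n";
    out << "};\n";
    return out.str();
}
```

With the parsed `TypeDef` in hand, a second emitter for the C++ class (with `std::string` members and getters/setters) can walk the same structure, which is the main benefit of parsing first and generating afterwards.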
Option 3: use a compiler generator
While lexical scanners and recursive descent parsers are relatively easy to code, it's still somewhat reinventing the wheel. And for languages which are not LL(1) compliant, you have no other choice than using a compiler generator.
Flex/bison is then a very reasonable and widely supported choice. But you have to become familiar with its logic, grammar files, and code templates. It's not something that you could master in a couple of hours.
If you have to support a custom language professionally, it's really worth the investment. If it's a small project, the learning curve could play against you.
Option 4: lightweight boost alternative
Another, lighter approach could be to use Boost.Spirit, which is based on C++ and is easier to learn: you express the grammar rules directly in C++. There are two main components: Lex for lexical scanners and Qi for parsers.
It's ok for small to medium parsers (like a data definition language), but it has some serious limitations if you have a full language. And by the way, as far as I know, it's limited to LL(k) grammars.

Can a LL(*) parser (like antlr3) parse C++?

I need to create a parser for C++14. This parser must be written in C++ in order to enable reuse of legacy code generation. I am thinking of implementing this using ANTLR3 (because ANTLR4 doesn't target C++ code yet).
My doubt is whether ANTLR3 can parse C++, since it doesn't use the Adaptive LL(*) algorithm like ANTLR4.
Most classic parser generators cannot generate a parser that will parse a grammar for an arbitrary context free language. The restrictions on the grammars they can parse often give rise to the name of the class of parser generators: LL(k), LALR, ... ANTLR3 is essentially LL; ANTLR4 is better but still not context free.
Earley, GLR, and GLL parser generators can parse context free languages, sometimes with high costs. In practice, Earley tends to be pretty slow (but see the MARPA parser generator used with Perl6, which I understand to be an Earley variant that is claimed to be reasonably fast). GLR and GLL seem to produce working parsers with reasonable performance.
My company has built about 40 parsers for real languages using GLR, including all of C++14, so I have a lot of confidence in the utility of GLR.
When it comes to parsing C++, you're in a whole other world, mostly because C++ parsing seems to depend on collecting symbol table information at the same time. (It isn't really necessary to do that if you can parse context-free).
You can probably make ANTLR4 (and even ANTLR3) parse C++ if you are willing to fight it hard enough. Essentially what you do is build a parser which accepts too much [often due to limitations of the parser generator class], and then uses ad hoc methods to strip away the extra. This is essentially what the hand-written GCC and Clang parsers do; the symbol table information is used to force the parser down the right path.
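To make the symbol-table point concrete, here is a toy sketch (not how GCC or Clang are actually structured) of deciding how to read a statement shaped like `a * b;`. If `a` names a type, it declares a pointer `b`; otherwise it is a multiplication. The names `StmtKind` and `classifyStarStatement` are hypothetical.

```cpp
#include <set>
#include <string>

enum class StmtKind { PointerDeclaration, Multiplication };

// Decides how to read "lhs * rhs;" using the type names collected so far
// (a std::set stands in for a real symbol table in this sketch).
StmtKind classifyStarStatement(const std::string& lhs,
                               const std::set<std::string>& knownTypes) {
    return knownTypes.count(lhs) ? StmtKind::PointerDeclaration
                                 : StmtKind::Multiplication;
}
```

A parser that accepts both readings and defers this decision (as a GLR parser can) does not need the table during parsing, while a parser that must commit immediately has to consult it at this point.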
If you choose to go down this path of building your own parser, no matter which parser generator you choose, you will invest huge amounts of energy to get a working parser. [Been here; done this]. This isn't a good way to get on with whatever intended task motivates this parser.
I suggest you get one that already works. (I've already listed two; you can find out about our parser through my bio if you want).
That will presumably leave you with a working parser. Then you want to do something with the parse tree, and you'll discover that Life After Parsing requires a lot of machinery that the parsers don't provide. Google the phrase to find my essay on the topic or check my bio.

Unknown type of macro in C++

I noticed something when looking at the kind of language used by a piece of software (openFOAM) which is written in C++. It was something like,
field value;
For example,
temperature 25;
I wonder how this works. I mean how temperature is set to 25 without using equality sign. Any idea?
Because openFOAM has a parser that understands that format.
Such a parser is not hard to make, and C++ (with the help of a parser-generator tool like yacc or bison) is a popular choice for parsers because it can be very fast.
C++ is a Turing-complete language, therefore it can do anything that any other language can. Specifically, it can process data that doesn't look like C++ code.
Programming languages are typically context free. If the language is context free it can be parsed (I won't say it's trivial, but it's a problem that has been solved many times before, and if you take a compilers class you'll have to do it for simple languages).
The parser is simply looking for declarations of the format field value;. It appears there is no type check happening, so it is simply splitting on the semicolon and then on the space. Read about parsing context-free languages, pushdown automata, and context-free grammars if you're interested in learning about parsing source code.
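The "split on the semicolon, then on the space" idea can be sketched as below. This is not openFOAM's actual parser (real openFOAM dictionaries also have nested blocks, comments and lists); `parseEntries` is a hypothetical helper that only handles flat `key value;` entries.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parses flat entries of the form "key value;" into a map of strings.
std::map<std::string, std::string> parseEntries(const std::string& text) {
    std::map<std::string, std::string> entries;
    std::istringstream in(text);
    std::string statement;
    // First split the input on ';' ...
    while (std::getline(in, statement, ';')) {
        // ... then split each statement on whitespace into key and value.
        std::istringstream fields(statement);
        std::string key, value;
        if (fields >> key >> value) entries[key] = value;
    }
    return entries;
}
```

For the input `temperature 25;` the map contains the key `temperature` with the string value `25`, which the application can then convert with something like `std::stod`.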

Parsing mathematical functions of custom types

I'm about to start developing a sub-component of an application to evaluate math functions with operands of C++ objects. This will be accessed via a user interface to provide drag and drop, feedback of appropriate types followed by an execute button.
I'm quite interested in using flex and bison for this, having looked at equation parsing and the like, both here and further afield. What I'm unsure of is whether flex/bison is appropriate when you're trying to parse with custom C++ types. Obviously normal parsing is done with text and this is quite a departure from that, so I wanted to see what people thought, and whether I'm trying to put a square peg in a round hole.
What do you think?
Edit
There are some very good sources of information in the links people have provided below. One that looks promising but hasn't been mentioned yet is Boost.Spirit. I was taking a look through the examples earlier today and there are some informative calculator-based examples in the boost/libs/spirit/examples directory, should you have Boost downloaded and be interested. Their homepage is here.
Please check out muparser.
Flex and Bison are the right tools for parsing arithmetic expressions, equations and the like.
Here are a few examples:
Parsing arithmetic expressions - Bison and Flex. It uses C (and not C++), but you can readapt it.
Flex Bison C++ Template/Example. A "framework" to integrate Flex and Bison "into a modern C++ program". An arithmetic calculator is included as an example.
Certainly sounds like a square peg in a round hole to me (unless I grossly misunderstand the question):
Flex would create a state machine to tokenize a stream, in your case - the contents are already tokenized
Bison sounds a bit more relevant, since it can deal with operator precedence, but integrating with it would be too much of a pain for the relatively small benefit.
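If the input really is already tokenized, operator precedence can be handled by hand without Bison. The following is a minimal precedence-climbing sketch under assumptions: `Item` is a hypothetical pre-tokenized element, `double` stands in for the custom C++ operand type, and only the four left-associative binary operators are supported.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical pre-tokenized input: each element is an operand or an operator.
struct Item {
    bool isOperator;
    char op;       // '+', '-', '*' or '/' when isOperator is true
    double value;  // stand-in for a custom operand type
};

int precedence(char op) { return (op == '*' || op == '/') ? 2 : 1; }

double apply(char op, double a, double b) {
    switch (op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        case '/': return a / b;
    }
    throw std::invalid_argument("unknown operator");
}

// Precedence climbing: reads operand (operator operand)*, letting
// tighter-binding operators group first.
double evaluate(const std::vector<Item>& items, std::size_t& pos, int minPrec = 1) {
    if (pos >= items.size() || items[pos].isOperator)
        throw std::invalid_argument("expected operand");
    double lhs = items[pos++].value;
    while (pos < items.size() && items[pos].isOperator &&
           precedence(items[pos].op) >= minPrec) {
        char op = items[pos++].op;
        // Left-associative: the right side may only contain tighter-binding operators.
        double rhs = evaluate(items, pos, precedence(op) + 1);
        lhs = apply(op, lhs, rhs);
    }
    return lhs;
}
```

For the item sequence 2 + 3 * 4 - 1, starting with `pos` at 0, `evaluate` returns 13. Replacing `double` with a variant or a base-class pointer, and `apply` with type-checked dispatch, would be the next step for custom types.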

Search string parser in C/C++

I work on an open source project focused around Biblical texts. I would like to create a standard string format to build up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from scope of the search, to searching multiple texts, to wildcards, etc.
I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses lemony to achieve a similar goal. My question is, is using one (or more) of these tools the best way to accomplish this?
In addition to the question, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are either geared towards a programming language or something simple like a calculator rather than parsing a string format.
Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while the programming language parser generates a parse tree from which code is generated).
I assume your syntax will contain rules like the following:
expression : word
| expression AND expression
| expression OR expression
| NOT expression
| '(' expression ')'
all of which are easy to express in Yacc.
You can look at A Compact Guide to Lex & Yacc, which I've found very useful for learning Lex and Yacc.
If you're trying to build a parser in C++ have a look at
boost::spirit
It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but after that, using and modifying the samples was straightforward. I also recommend reading the following book:
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy
Keep syntax error diagnosis and messages uppermost in your mind: if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea, based on what it has scanned so far, of what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos. Genius programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.
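As a sketch of what such a handcrafted parser with friendly diagnostics might look like for the AND/OR/NOT grammar above: the version below assumes NOT binds tighter than AND, and AND tighter than OR (the original grammar does not say), and it returns the query in a prefix form purely for illustration. `QueryParser` and its members are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

class QueryParser {
public:
    explicit QueryParser(const std::string& text) : text_(text), pos_(0), tokStart_(0) { next(); }

    // Parses the whole query and returns it in prefix form, e.g. "(AND a b)".
    std::string parse() {
        std::string result = parseOr();
        if (!tok_.empty()) fail("unexpected '" + tok_ + "'");
        return result;
    }

private:
    std::string parseOr() {   // or := and (OR and)*
        std::string lhs = parseAnd();
        while (tok_ == "OR") { next(); std::string rhs = parseAnd(); lhs = "(OR " + lhs + " " + rhs + ")"; }
        return lhs;
    }
    std::string parseAnd() {  // and := not (AND not)*
        std::string lhs = parseNot();
        while (tok_ == "AND") { next(); std::string rhs = parseNot(); lhs = "(AND " + lhs + " " + rhs + ")"; }
        return lhs;
    }
    std::string parseNot() {  // not := NOT not | primary
        if (tok_ == "NOT") { next(); return "(NOT " + parseNot() + ")"; }
        return parsePrimary();
    }
    std::string parsePrimary() {  // primary := word | '(' or ')'
        if (tok_ == "(") {
            next();
            std::string inner = parseOr();
            if (tok_ != ")") fail("missing ')'");
            next();
            return inner;
        }
        // The parser knows what it expected here, so it can say so plainly.
        if (tok_.empty()) fail("query ended where a search word was expected");
        if (tok_ == ")" || tok_ == "AND" || tok_ == "OR")
            fail("expected a search word but found '" + tok_ + "'");
        std::string word = tok_;
        next();
        return word;
    }

    [[noreturn]] void fail(const std::string& what) const {
        throw std::runtime_error("column " + std::to_string(tokStart_ + 1) + ": " + what);
    }

    // Loads the next token; parentheses are tokens of their own, words end at whitespace.
    void next() {
        while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
        tokStart_ = pos_;
        if (pos_ < text_.size() && (text_[pos_] == '(' || text_[pos_] == ')')) {
            ++pos_;
        } else {
            while (pos_ < text_.size() && !std::isspace(static_cast<unsigned char>(text_[pos_])) &&
                   text_[pos_] != '(' && text_[pos_] != ')') ++pos_;
        }
        tok_ = text_.substr(tokStart_, pos_ - tokStart_);
    }

    std::string text_;
    std::size_t pos_;
    std::size_t tokStart_;
    std::string tok_;
};
```

Parsing `faith AND (hope OR NOT love)` gives `(AND faith (OR hope (NOT love)))`, while the incomplete query `faith AND` throws with the message `column 10: query ended where a search word was expected`. A real search front end would build a query object instead of a string, but the error-reporting pattern stays the same.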