I'm using boost::spirit to write a parser/lexer - C++

I'm using boost::spirit to write a parser and lexer.
Here is what I want to do: I want to put functions and classes into data structures together with the variables they use. So I want to know what would be the best way to do this,
and which parts of boost::spirit would be best to use for it.
The languages I want to use this on are C, C++, C#, Objective-C, and Objective-C++.
The language I'm writing it in is C++ only; I'm not too good with the other languages I know.

Spirit is a fine tool, but it is not the best tool for all parsing tasks. And for the task of parsing actual C++, it is pretty terrible.
Spirit excels at small-to-mid-level parsing tasks: languages that are fairly regular, with standardized grammars and so forth. C and C-derived languages are generally too complicated for Spirit to handle. It's not that you can't write Spirit parsing code for them; it's just that the result would be too difficult to build and maintain, given Spirit's general design.
I would suggest downloading Clang if you want a good C or C++ (or Objective variants thereof) parser. It is a compiler as well, but it's designed to be modular, so you can just link to the parsing part of it.

Related

From proprietary format to C++ classes

Given an input like this (USER DEFINED FORMAT):
type dog<
int years
char[] name
>
How can I generate 2 or more different files like these:
file1.c
------------
struct dog{
int years
char name
}
file2.cpp
-------------
class dog{
int years
string name
%get and set methods
}
Is a parser generator like flex and bison the best way? Or is there a better way?
It all depends on the complexity of your language and the transformations you intend.
Option 1: hand-crafted ad-hoc processing
If your transformations are simple and mostly convert some tokens from your source language into the target language (e.g. "type" into "struct" and "<>" into "{}"), you could consider hand-crafted processing using string operations or regexes with std::regex.
But this is not sufficient here: you must identify statements in your source language to generate semicolon-terminated C and C++ statements. Furthermore, you must relate the [] to the right element. And to generate correct getters/setters, your code must understand the content of each statement.
So yes, here you'll need a lexer and a parser.
Option 2: hand-crafted parser
For "simple" languages, you can easily craft a recursive descent parser. What does "simple" mean? It means a language that can be described with an LL(1) grammar. Without going too deep into compiler theory, these are more or less the languages where one token of lookahead uniquely determines which grammar construct you have.
Note that there is a very simple calculator example designed this way in "The C++ Programming Language" by Bjarne Stroustrup.
Option 3: use a compiler generator
While lexical scanners and recursive descent parsers are relatively easy to code, writing them is still somewhat reinventing the wheel. And for languages which are not LL(1), you have no other choice than to use a compiler generator.
Flex/bison is then a very reasonable and widely supported choice. But you have to become familiar with its logic, grammar files, and code templates. It's not something that you could master in a couple of hours.
If you have to support a custom language professionally, it's really worth the investment. If it's a small project, the learning curve could play against you.
Option 4: a lightweight Boost alternative
Another, lighter approach could be to use Boost.Spirit, which is based on C++ and easier to learn: you express the grammar rules directly in C++. There are two main components: Lex for lexical scanners and Qi for parsers.
It's OK for small to medium parsers (like a data definition language), but it has some serious limitations if you have a full language. And by the way, as far as I know, it's limited to LL(k) grammars.

Can an LL(*) parser (like ANTLR3) parse C++?

I need to create a parser for C++14. This parser must be written in C++ in order to enable reuse of legacy code generation. I am thinking of implementing this with ANTLR3 (because ANTLR4 doesn't target C++ code yet).
My question is whether ANTLR3 can parse C++, since it doesn't use the Adaptive LL(*) algorithm that ANTLR4 does.
Most classic parser generators cannot generate a parser that will parse a grammar for an arbitrary context free language. The restrictions of the grammars they can parse often gives rise to the name of the class of parser generators: LL(k), LALR, ... ANTLR3 is essentially LL; ANTLR4 is better but still not context free.
Earley, GLR, and GLL parser generators can parse context free languages, sometimes with high costs. In practice, Earley tends to be pretty slow (but see the MARPA parser generator used with Perl6, which I understand to be an Earley variant that is claimed to be reasonably fast). GLR and GLL seem to produce working parsers with reasonable performance.
My company has built about 40 parsers for real languages using GLR, including all of C++14, so I have a lot of confidence in the utility of GLR.
When it comes to parsing C++, you're in a whole other world, mostly because C++ parsing seems to depend on collecting symbol table information at the same time. (It isn't really necessary to do that if you can parse context-free).
You can probably make ANTLR4 (and even ANTLR3) parse C++ if you are willing to fight it hard enough. Essentially what you do is build a parser which accepts too much [often due to limitations of the parser generator class], and then uses ad hoc methods to strip away the extra. This is essentially what the hand-written GCC and Clang parsers do; the symbol table information is used to force the parser down the right path.
If you choose to go down this path of building your own parser, no matter which parser generator you choose, you will invest huge amounts of energy to get a working parser. [Been here; done this]. This isn't a good way to get on with whatever task motivated the parser in the first place.
I suggest you get one that already works. (I've already listed two; you can find out about our parser through my bio if you want).
That will presumably leave you with a working parser. Then you want to do something with the parse tree, and you'll discover that Life After Parsing requires a lot of machinery that the parsers don't provide. Google the phrase to find my essay on the topic or check my bio.

C++ Parsing code (Hand written)

I need to parse a language which is similar to a minimized version of Java. Since efficiency is the most important factor, I chose a hand-written parser instead of LALR parser generators like GOLD, bison, and yacc.
However, I can't find the theory behind good hand-written parsers. It seems like there are only tutorials on those generators and the mechanisms behind them.
Do I have to drop using regular expressions? I can imagine they are slow compared to hand-written tokenizers.
Does anybody know a good class or tutorial for hand-written parsing?
In case it helps, here is (not a class or a tutorial but) an example of a hand-written parser: https://github.com/tabatkins/css-parser (however, it's explicitly coded for a correct/simple correspondence to the specification, not optimized for high performance).
The bigger problem is, I expect, to develop the specification for the parsing. Examples of parser specifications include http://dev.w3.org/csswg/css3-syntax/ and a similar one for parsing HTML5.
The prerequisite to using a parser generator is that the language syntax has been defined by a grammar (where the grammar format is supported by the parser generator), instead of by a parsing algorithm.

Writing a tokenizer, where to begin?

I'm trying to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible, for each token, and in theory I know how I could put that in code.
I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.
I have no idea how to write this tokenizer, are there any "neat" ways of doing it, like something that closely resembles the syntax itself?
Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file to a general internal tokenized format, process some things and output again.
Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.
My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.
I'm an Old Fart® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early 80's and it has returned the effort to learn them many, many times over.
BTW, if you have anything approaching a BNF of the language, lex/yacc can be laughably easy to work with.
Boost.Spirit.Qi would be my first choice.
Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as Boost.Regex) or scanners (such as Boost.Tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately complex parser using these tools leads to code that is hard to understand and maintain.
The Qi tutorials even finish by implementing a parser for an XMLish language; writing a grammar for CSS should be considerably easier.

Best practices for writing a programming language parser

Are there any best practices that I should follow while writing a parser?
The received wisdom is to use parser generators + grammars and it seems like good advice, because you are using a rigorous tool and presumably reducing effort and potential for bugs in doing so.
To use a parser generator, the grammar has to be context free. If you are designing the language to be parsed, then you can control this. If you are not sure, it could cost you a lot of effort if you start down the grammar route. Even if it is context free, in practice, unless the grammar is enormous, it can be simpler to hand-code a recursive descent parser.
Being context free not only makes a parser generator possible, it also makes hand-coded parsers a lot simpler. What you end up with is one (or two) functions per phrase, which, if you organise and name the code cleanly, is not much harder to read than a grammar (and if your IDE can show you call hierarchies, you can pretty much see what the grammar is).
The advantages:
Simpler build
Better performance
Better control of output
Can cope with small deviations, e.g. work with a grammar that is not 100% context free
I am not saying grammars are always unsuitable, but often the benefits are minimal and are often outweighed by the costs and risks.
(I believe the arguments for them are speciously appealing and that there is a general bias for them as it is a way of signaling that one is more computer-science literate.)
Few pieces of advice:
Know your grammar - write it down in a suitable form
Choose the right tool. Do it from within C++ with Spirit2x, or choose external parser tools like antlr, yacc, or whatever suits you
Do you need a parser? Maybe a regexp will suffice? Or maybe hack up a Perl script to do the trick? Writing complex parsers takes time.
Don't overuse regular expressions - while they have their place, they simply don't have the power to handle any kind of real parsing. You can push them, but you're eventually going to hit a wall or end up with an unmaintainable mess. You're better off finding a parser generator that can handle a larger language set. If you really don't want to get into tools, you can look at recursive descent parsers - it's a really simple pattern for hand-writing a small parser. They aren't as flexible or as powerful as the big parser generators, but they have a much shorter learning curve.
Unless you have very tight performance requirements, try and keep your layers separate - the lexer reads in individual tokens, the parser arranges those into a tree, and then semantic analysis checks over everything and links up references, and then a final phase to output whatever is being produced. Keeping the different parts of logic separate will make things easier to maintain later.
Read most of the Dragon book first.
Parsers are not complicated if you know how to build them, but they are NOT the type of thing where, if you put in enough time, you'll eventually get there. It's way better to build on the existing knowledge base. (Otherwise expect to write it and throw it away a few dozen times.)
Yep. Try to generate it, not write it. Consider using yacc, ANTLR, Flex/Bison, Coco/R, the GOLD Parser generator, etc. Resort to manually writing a parser only if none of the existing parser generators fits your needs.
Choose the right kind of parser; sometimes a Recursive Descent parser will be enough, sometimes you should use an LR parser (and there are many types of LR parsers).
If you have a complex grammar, build an Abstract Syntax Tree.
Try to identify very well what goes into the lexer, what is part of the syntax and what is a matter of semantics.
Try to make the parser as loosely coupled to the lexer implementation as possible.
Provide a good interface to the user so that they are agnostic of the parser implementation.
First, don't try to apply the same techniques to parsing everything. There are numerous possible use cases, from something like IP addresses (a bit of ad hoc code) to C++ programs (which need an industrial-strength parser with feedback from the symbol table), and from user input (which needs to be processed very fast) to compilers (which normally can afford to spend a little time parsing). You might want to specify what you're doing if you want useful answers.
Second, have a grammar in mind to parse with. The more complicated it is, the more formal the specification needs to be. Try to err on the side of being too formal.
Third, well, that depends on what you're doing.