Advice on Boost Spirit - C++

I'm wondering how good the Boost Spirit library is.
I have begun reading the documentation, but it seems to be a very large framework that demands a lot of time to master. I really don't want to waste my time on a framework that turns out not to be as wonderful as it seems.
I would like to hear some opinions on this framework from users who know it very well.

My opinion is surely biased, so please take it with a grain of salt.
Spirit is a huge framework for creating very fast parsers and generators in C++. The created parsers and generators integrate nicely with your own data structures. Spirit requires some understanding of its underpinnings to be used efficiently. The documentation is fairly readable and explains things in easy terms. There are literally hundreds of examples available, which I suggest treating as part of the documentation; understanding those examples is an integral part of learning how to use Spirit. Simple tasks are easy to solve, while more complex tasks tend to require some thinking and experimentation to get right (but that's probably not specific to Spirit). Spirit has an active community, a responsive and helpful mailing list, and a cool website with tons of additional information. Generally, if you're stuck, it's straightforward to get help.
You pay for all this niceness with increased compile times and huge template error messages when you make a mistake. But once it compiles, it usually works on the spot.
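To give a flavour of that integration, here is a minimal sketch (my own illustration, not taken from the docs) that parses a comma-separated list of numbers directly into a std::vector:

```cpp
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main()
{
    std::string input = "1.5, 2.25, 10";
    std::vector<double> values;

    auto first = input.begin();
    // double_ % ',' parses a comma-separated list of doubles; its synthesized
    // attribute is pushed straight into the std::vector passed last.
    bool ok = qi::phrase_parse(first, input.end(),
                               qi::double_ % ',', qi::space, values);

    if (ok && first == input.end())
        for (double d : values)
            std::cout << d << '\n';
}
```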

Old question that I came across while searching for some information on Spirit, but I figured I'd put my 2 cents in here for future readers.
At first I was pretty well daunted by the learning curve of Spirit, to the point that I almost gave up. But I'm very glad I kept going. The performance of this parser is just sick. I was previously using PCRE to parse HTTP headers with some simple regular expressions. That's a fairly simple operation and PCRE is pretty fast, so I didn't expect to see much of an improvement - if any - in the speed.
Boy, was I surprised to see that it increased performance by over 2000%. With Spirit I can parse 100,000 sets of HTTP headers in a little more than a quarter of a second.
And the kicker is that the code is so clean and compact compared to the equivalent PCRE code. My original function, 80 lines of PCRE gobbledygook, is now a cute little 14-line Spirit grammar plus a Fusion adapter that pushes the output directly into my class structure.
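For the curious, a grammar along those lines might look roughly like the sketch below. This is my own reconstruction under assumptions; the poster's actual class isn't shown, so http_request and its fields are hypothetical names:

```cpp
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <map>
#include <string>

namespace qi = boost::spirit::qi;

typedef std::map<std::string, std::string> header_map;

// Hypothetical target structure; the poster's real class is not shown.
struct http_request
{
    std::string method, uri, version;
    header_map  headers;
};

// The Fusion adapter is what lets Qi fill the members in declaration order.
BOOST_FUSION_ADAPT_STRUCT(http_request,
    (std::string, method)
    (std::string, uri)
    (std::string, version)
    (header_map,  headers))

template <typename It>
struct request_grammar : qi::grammar<It, http_request()>
{
    request_grammar() : request_grammar::base_type(start)
    {
        using qi::char_;
        token  = +(char_ - ' ');
        name   = +(char_ - ':');
        value  = +(char_ - "\r\n");
        header = name >> ": " >> value >> "\r\n";       // one key/value pair
        start  = token >> ' ' >> token >> ' ' >> token  // request line
               >> "\r\n" >> *header >> "\r\n";          // headers, blank line
    }
    qi::rule<It, std::string()> token, name, value;
    qi::rule<It, std::pair<std::string, std::string>()> header;
    qi::rule<It, http_request()> start;
};

// Usage: http_request req;
//        qi::parse(begin, end, request_grammar<std::string::iterator>(), req);
```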
Difficult to learn, and the build times can get pretty bad, but the payoff is fantastic.


Writing a DSL in C++ with boost::proto

Apologies for asking such an open-ended question, but I want to emulate some synthetic assembly (not for a real processor) in C++ and I want to decouple the assembly from the implementation of the simulator it runs on.
Writing a DSL or similar seems like the obvious way and I have some experience of this, having done something like it (actually a mixture between a DSL and an interpreter) in Groovy.
boost::proto seems like the obvious choice, but I find the documentation utterly impenetrable, even though, as I say, I have some grasp of the basics.
Is there any alternative tutorial or similar out there that explains - in a way that focuses on the practicalities of writing a DSL rather than the theory of ASTs etc. - how to do this? Or is there an alternative? Right now I am stuck implementing the assembly instructions as methods of the classes that make up the simulator, which couples them very tightly and makes the code base extremely difficult to maintain.
I second the comments suggesting that you may have a badly matched XY problem here.
Meanwhile, the best introduction to applied Boost.Proto for embedded DSLs was on Dave Abrahams' cpp-next.com blog. Sadly, that has gone off the air.
Eric Niebler, author of Boost.Proto, has offered to send people the raw dumps of those pages on request:
The C++ community is suffering from the loss of the cpp-next.com website and all the great content that was once hosted there. In the past 2 months, I’ve gotten many questions both about the site and about the fate of my “Expressive C++” article series. In response, I will re-post my old articles on this blog. But I’m busy and it’ll take time. In the meanwhile, if you have a desperate need for a readable introduction to Boost.Proto and domain-specific embedded languages in C++, and you don’t mind reading raw markdown, email me. I’ll send you what I have.
http://ericniebler.com/2014/05/24/cpp-next-com-and-the-expressive-cxx-series-2/
In the meantime, the Wayback Machine has some of it, e.g.:
http://web.archive.org/web/20120906070131/http://cpp-next.com/archive/2011/01/expressive-c-expression-optimization/
http://web.archive.org/web/20120315111227/http://cpp-next.com/archive/2010/10/expressive-c-expression-extension-part-one/
http://web.archive.org/web/20120430163700/http://cpp-next.com/archive/2010/10/expressive-c-expression-extension-part-two/

HSP to C++: Language conversion of a large codebase

I have a large codebase written in HSP (see the Wikipedia article - think "BASIC", but Japanese).
By "large" I mean it has 151352 lines of code, 60 source files with total code size of 4.5 megabytes. Also, it has plenty of spaghetti code, no comments and badly needs refactoring. The good thing is that it has a lot of text messages, so not all of those lines represent actual program logic.
I'd like to convert this codebase to C++, while retaining my sanity. "I'd like" means that I'm not required to do it, but I'd strongly prefer to find a method to do it.
What's a good way to do it? Obviously, I can't just rewrite it all in C++ (well, I could in theory, but it would take up to two years, and I would introduce many bugs in the process), so (I think) a reasonable decision would be to implement a recompiler/preprocessor that would let me convert the source code into messy C++ (HSP is much simpler than C++, so it should be possible) and then start refactoring/documenting the result.
Unfortunately, I'm not entirely sure how to approach building the recompiler efficiently. While I know there are Lex/Yacc/Bison/Boost.Spirit, I haven't used them personally.
So can you recommend a good way to perform such a conversion?
Any free tool ("free" as in "free beer") that is available on the Windows platform is allowed, as long as it doesn't affect the license of the original source code.
Yacc is targeted at handling more complex tasks efficiently, and it's complex to learn; I think it's overkill.
Spirit should be a better choice; if you already know it, go with it. Personally, I would use Prolog for this task.
Prolog has built-in syntax analysis, the so-called DCG (definite clause grammars). For a language as simple as BASIC, I'm pretty sure there are no practical problems in the grammar, and modern Prologs (I'm thinking of SWI-Prolog, specifically) can handle complex character encodings in the source very well.
Also, in Prolog you could try to apply some naive transformations to unroll the spaghetti code. Doing that in general is a complex task, but it could be easy if you have just a small number of patterns, repeated many times.
Pattern matching is key in such problems...
Well, if you really want to go this way and ignore the advice in the comments, you should probably have a good look at the OpenHSP compiler, and mostly the codegen file:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/codegen.cpp
and also keep the token definitions at hand:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.h
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.cpp
It seems that HSP is not that complicated, so you can skip the AST step, though you could get good optimizations out of one. Don't forget also to prepare a C++ library to embed your generated code in, so you can manage HSP oddities (like globals and dynamic typing).
If you can hack something out of that, you'll also have to remove most of what this compiler does (creating executables, linkage and such). Don't forget, it's a really long and hard task that may not be faster or easier than a full rewrite. But if you're ready, you'll find out the hard way :)
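As a sketch of the kind of thing such a support library might contain (entirely hypothetical; real HSP semantics would need to be checked against the docs), a dynamically typed value class is what translated globals would likely be built on:

```cpp
#include <iostream>
#include <string>
#include <variant>

// Hypothetical runtime support type so that translated HSP variables keep
// their BASIC-like dynamic typing inside generated C++.
struct Value
{
    std::variant<double, std::string> v;

    Value(double d = 0) : v(d) {}
    Value(const char* s) : v(std::string(s)) {}

    // Numbers add, strings concatenate -- loosely what a BASIC dialect does.
    Value operator+(const Value& rhs) const
    {
        if (v.index() == 0 && rhs.v.index() == 0)
            return Value(std::get<double>(v) + std::get<double>(rhs.v));
        return Value((to_str() + rhs.to_str()).c_str());
    }

    std::string to_str() const  // numeric formatting deliberately simplified
    {
        return v.index() == 0 ? std::to_string(std::get<double>(v))
                              : std::get<std::string>(v);
    }
};

// The translator would emit code against globals like these:
Value score, name;

int main()
{
    score = Value(10) + Value(5);          // numeric add
    name  = Value("hi ") + Value("there"); // string concatenation
    std::cout << score.to_str() << ' ' << name.to_str() << '\n';
}
```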
According to the original owner of the codebase, HSP starting with version 3 includes an HSP-to-C code converter. This information is not verified due to lack of time, but this blog article documents a tool called hspcnv which is supposed to convert HSP code into C code. The article is in Japanese.

Writing a tokenizer, where to begin?

I'm trying to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible for each token, and in theory I know how I could put that into code.
I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.
I have no idea how to write this tokenizer. Are there any "neat" ways of doing it, like something that closely resembles the syntax itself?
Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file into a general internal tokenized format, process some things, and output it again.
Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.
My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.
I'm an Old Fart® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early '80s, and they have repaid the effort of learning many, many times over.
BTW, if you have anything approaching a BNF for the language, lex/yacc can be laughably easy to work with.
Boost.Spirit.Qi would be my first choice.
Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing with ad hoc hacks using primitive tools such as scanf. Even regular-expression libraries (such as Boost.Regex) or scanners (such as Boost.Tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately complex parser using these tools leads to code that is hard to understand and maintain.
The Qi tutorials even finish by implementing a parser for an XML-ish language; writing a grammar for CSS should be considerably easier.
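As a taste, here is a minimal sketch (mine, not from the tutorials) of what Qi-based tokenizing can look like. A real CSS tokenizer would need extra rules for strings, comments, url(), and escapes:

```cpp
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main()
{
    std::string css = "h1 { color: #fff; margin: 0 }";
    std::vector<std::string> tokens;

    // A toy token set: identifiers, numbers, hex colours, punctuation.
    qi::rule<std::string::iterator, std::string()> tok =
          (qi::char_("a-zA-Z") >> *qi::char_("a-zA-Z0-9-"))  // identifier
        | +qi::char_("0-9")                                  // number
        | (qi::char_('#') >> +qi::char_("0-9a-fA-F"))        // hex colour
        | qi::repeat(1)[qi::char_("{}():;,")];               // 1-char punctuation

    auto first = css.begin();
    qi::phrase_parse(first, css.end(), *tok, qi::space, tokens);

    for (const std::string& t : tokens)
        std::cout << '[' << t << "] ";   // [h1] [{] [color] [:] [#fff] ...
}
```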

state-of-the-art C++ projects

I like to go through existing software projects as a source of learning and new ideas.
Doing so, I discover things that I did not think were possible.
In your opinion, what is the top state-of-the-art C++ project that you have used/developed/extended? Can you state reasons why you consider it state of the art and what can be learned from it?
My latest craze is Boost.Phoenix, http://www.boost.org/doc/libs/1_43_0/libs/spirit/phoenix/doc/html/index.html, which is a very comprehensive functional programming library.
Despite its capabilities it is straightforward and easy to extend. After some tweaking I was able to write multithreaded parallel lambda loops and a mathematical domain-specific language, probably within two weeks.
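For a taste of what that looks like, here is a minimal Phoenix sketch (my own illustration, not the poster's parallel loops or DSL):

```cpp
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    using boost::phoenix::arg_names::arg1;
    using boost::phoenix::arg_names::arg2;

    std::vector<int> v = {4, 1, 3, 2};

    // Placeholders build unnamed function objects inline:
    std::sort(v.begin(), v.end(), arg1 > arg2);                   // descending
    std::for_each(v.begin(), v.end(), std::cout << arg1 << ' ');  // 4 3 2 1
}
```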
What is yours? (Please don't just say Boost, as it is a huge collection of projects.)
Personally, I like to look at the code in Qt. I use it every day, but it seems like every day I use it, I find something new. In terms of total code, it is probably as big as Boost. But it comes with excellent documentation, examples, and complete source code, and it is free under the LGPL and GPL versions.
For me, what I liked about Qt was that its concepts matched the way C# works, so it was a fairly easy transition back into C++ for me. And looking at their code has really given me many ways (although not as many as SO) to understand some of the complexity of C++.
From what I've seen, the codebases I have learned the most from have been fairly complex third-party software libraries. Havok is an excellent example, from which I've not only learned programming practices and solutions but also picked up quite a few mathematical and philosophical discussions. I've also seen some codebases which have not been open-sourced, from which I've learned how not to solve things.
Game engines for AAA titles in general tend to involve a lot of complex code that tries to push as much as possible through a piece of hardware. I guess the recommendation goes for all software that tries to achieve something similar, but game engines are the only such software I've dived into. AAA game engines often have good or bad solutions to study, and I would generally recommend looking into them. There are some that are open source... I think Source/Valve have released theirs in different stages.

How much time would it take to write a C++ compiler using flex/yacc?

How much time would it take to write a C++ compiler using lex/yacc?
Where can I get started with it?
There are many constructs that cannot be handled by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally, the interpretation of tokens sometimes requires feedback from the parser, particularly in C++0x. The handling of the character sequence >>, for example, depends crucially on the parsing context.
Those two tools are very poor choices for parsing C++, and you would have to put in a lot of special cases that escape the basic framework those tools rely on in order to parse C++ correctly. It would take you a long time, and even then your parser would likely have weird bugs.
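Two classic illustrations of that context sensitivity (my own examples, not the answer's):

```cpp
#include <vector>

std::vector<std::vector<int>> v;     // ">>" is two closing angle brackets here
                                     // (a syntax error before C++11)...
int shift(int x) { return x >> 2; }  // ...but one right-shift operator here.

struct T { T() {} T(int) {} };
void f(int a)
{
    T(a);  // looks like a function-style cast or call, but is parsed as a
           // declaration "T a;" that shadows the parameter.
}

int main() { f(0); }
```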
yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.
C++ templates are no good at handling strings, even constant ones (though this may be fixed in C++0x; I haven't researched it carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.
It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.
Either invent a trivial toy language of your own, or model something on a much smaller and simpler language. I saw a Lua parser whose grammar definition was about a page long. That would be a much more reasonable starting point.
It will probably take you years, and you'll probably switch to some other parser generator in the process.
Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).
Existing C++ Front Ends:
The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.
GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.
Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.
Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.
A long time, and lex and yacc won't help
If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK, yacc may actually take longer to use, as it's not really powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser by hand.
I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.
For that matter, the parser is possibly not even the major part of your job, depending on exactly what goals you have for the code generator.
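To make the "just write a recursive descent parser" option concrete, here is a minimal hand-written sketch for arithmetic expressions (an illustration of the technique only; a real C++ parser is incomparably harder):

```cpp
#include <cctype>
#include <iostream>
#include <stdexcept>
#include <string>

// Grammar: expr   := term (('+'|'-') term)*
//          term   := factor (('*'|'/') factor)*
//          factor := number | '(' expr ')'
// One function per rule; the call stack is the parse stack.
struct Parser
{
    std::string src;
    size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }
    char get()        { return src[pos++]; }

    double factor()
    {
        if (peek() == '(')
        {   // no error checking on the closing ')' in this sketch
            get(); double v = expr(); get(); return v;
        }
        std::string num;
        while (std::isdigit(static_cast<unsigned char>(peek())) || peek() == '.')
            num += get();
        if (num.empty()) throw std::runtime_error("number expected");
        return std::stod(num);
    }
    double term()
    {
        double v = factor();
        while (peek() == '*' || peek() == '/')
            v = (get() == '*') ? v * factor() : v / factor();
        return v;
    }
    double expr()
    {
        double v = term();
        while (peek() == '+' || peek() == '-')
            v = (get() == '+') ? v + term() : v - term();
        return v;
    }
};

int main()
{
    Parser p{"2*(3+4)"};
    std::cout << p.expr() << '\n';  // prints 14
}
```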
As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the original GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer.)
Some say recursive descent parsers are best, because Bjarne Stroustrup said so. Our experience is that GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including the Microsoft and GCC dialects) to carry out program analyses and massive source code transformations.
But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You also need to build symbol tables ("what does this identifier mean in this context?"), and to do that you need to encode essentially most of the several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
But then you have the rest of the compiler to consider:
Preprocessor
AST construction
Semantic analysis and type checking
Control flow, data flow, and pointer analysis
Basic code generation
Optimizations
Register allocation
Final code generation
Debugging support
I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.
Expect building a C++ compiler to take you years.
(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).
EDIT: May 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.
Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.
This is a non-trivial problem and would take quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by an LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification right is tricky.
You're not the first person to think that this is fun. Here's a nice blog-style article on the topic:
Parsing C++
Here's an important quote from the article:
"After lots of investigation, I
decided that writing a
parser/analysis-tool for C++ is
sufficiently difficult that it's
beyond what I want to do as a hobby."
The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
ANTLR Grammars (contain several grammars for C++)
A YACC-able C++ 2.1 Grammar and the resulting ambiguities
Parsing and Processing C++ Code (Wikipedia)
Lex and yacc will not be enough; you also need a linker, an assembler, and a C preprocessor.
It depends on how you do it.
How many pre-made components do you plan to use?
You need to get the description of the syntax and its tokens from somewhere.
For example, if you use LLVM, you can proceed faster. It already provides a lot of tools: assembler, linker, optimizer...
You can get a C preprocessor from the Boost project (Boost.Wave).
You need to create a test suite to test your compiler automatically.
It can take a year if you work on it each day, or much less if you have more talent and motivation.
Unless you have already written several other compilers, C++ is not a language you want to start writing a compiler from scratch for; the language has a lot of places where the meaning requires a lot of context before the situation can be disambiguated.
Even if you have lots of experience writing compilers, you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the back end to generate code is yet another specialized task (though you could steal the GCC backend).
If you Google for "C++ grammars", there are a couple around to get you started.
C++ LEX Tokens: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammar: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y
A few years - if you can get a research grant to rewrite lex/yacc anew :-)
People keep chasing their tails on this a lot - starting with Stroustrup, who always fancied being a language "designer" rather than an actual compiler writer (remember that his C++ was a mere code generator for ages, and would still be one if it weren't for gcc and other folks).
The core issue is that real research on parser generators pretty much ceased once CPUs became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does an exhaustive search until it nabs one "rule" that fires. Once you are content with that, you kind of lose interest in researching how to do it efficiently.
What you'd essentially need is a reasonable middle ground - like LALR(2) with fixed, limited backtracking (plus a static checker to yell if the "designer" splurges into a nondeterministic tree), and also limited, partitioned symbol table feedback (modern parsers need to be concurrency-friendly).
Sounds like a research grant proposal, doesn't it :-) Now if we could find someone to actually fund it, that would be something :-))
A C++ compiler is very complicated. Implementing enough of C++ to be compatible with most of the C++ code out there would take several developers a couple of years full time. clang is a compiler project funded by Apple to develop a new compiler for C, C++, and Objective-C; even with several full-time developers, its C++ support is still very far from complete after a couple of years of development.
Recursive descent is a good choice for parsing C++. GCC and clang use it.
The Elsa parser (and my ellcc compiler) use the Elkhound GLR parser generator.
In either case, writing a C++ compiler is a BIG job.
Well, what do you mean by "write a compiler"?
I doubt any one person has made a true C++ compiler that takes it all the way down to assembly code, but I have used lex and yacc to make a C compiler, and I have also done it without them.
Using both, you can make a compiler that leaves out the semantics in a couple of days, but figuring out how to use them can easily take weeks or months. Figuring out how to make a compiler at all will take weeks or months no matter what, but the figure I remember is that once you know how it works, it took a few days with lex and yacc and a few weeks without; the second approach had better results and fewer bugs, though, so it's really questionable whether they are worth using at all.
The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
With C++ the big issue is templates, but there are so many little issues and rules that I can't imagine someone ever wanting to do this. Even if you DO finish, the problem is that you won't necessarily have binary compatibility, i.e., be recognized as a runnable program by a linker or the OS, because there's more to it than just C++, and the standard is hard to pin down; there are also yet more standards to worry about, which are even less widely available.