Writing a tokenizer, where to begin?

Writing a tokenizer, where to begin? - c++

I'm trying to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible, for each token, and in theory I know how I could put that in code.
I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.
I have no idea how to write this tokenizer, are there any "neat" ways of doing it, like something that closely resembles the syntax itself?
Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file to a general internal tokenized format, process some things and output again.

Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.
My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.
I'm an Old Fart® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early 80's and it has returned the effort to learn them many, many times over.
BTW, if you have anything approaching a BNF of the language, lex/yacc can be laughably easy to work with.

Boost.Spirit.Qi would be my first choice.
Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as boost regex) or scanners (such as Boost tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately-complex parser using these tools leads to code that is hard to understand and maintain.
The Qi tutorials even finish by implementing a parser for an XMLish language; writing a grammar for CSS should be considerably easier.

Related

Drop-in, portable parsing

I see umpteen posts a day about "how to do X with regexen". And the best response to most of them seems like it would honestly be, "Why are you trying to drive a screw with a hammer?" But regexen are everywhere, and the syntax is mostly portable, particularly if you keep away from the fancy bits.
Is there anything equivalent to regexen but at the next level up in power and configurability? A "you can use it anywhere" parsing library of some variety, preferably with a gloriously concise DSL as its interface?
I've used Ragel somewhat, but because of the preprocessing step, I'd hesitate to recommend it to someone as "use this instead of some hairy regex". It's awkward to use from Obj-C, and I expect it will be terribly awkward from a language that doesn't have compile-link-run as part of its standard operating procedure.
What I'm looking for is something that will pass the "inline-online-universal" test.
(inline) You can write the notation inline with your other code, as you would with a regex..
(online) You can run the resulting parser just as you would your other code, which would mean right after input to a REPL in the case of something like Python.
(universal) You can move to a different language/platform and use virtually the same code for your parser, modulo dialect differences. In reality, I'd be happy with something that works from Python, Ruby, C, Java, and Haskell.
Most tools I know of fall down at "online". They preprocess a grammar offline and spit out code in the target language (C, Python, Java, C++…). They're standalone tools that aren't themselves integrated into the language environment.
I've had suggestions of PEG parsers and lex/yacc combos. Parser combinator libraries might also be a good fit. Whatever you might propose, I'd like to see demonstrated that it meets these tests. Your answer should demonstrate that the proposed solution meets the inline-online-universal requirements by providing a working demo parser in Python, C, and Haskell. The demo example is up to the author, but it should be something painful using just regexen but trivial using a proper parser.

https://github.com/leblancmeneses/NPEG
Implements PEG.
Meets all 3 ... let me explain.
It is inline only with C# and offline with all the others. C# has an offline version also.
I currently support offline versions: C/C++/Javascript (local right now)/Java pass all unit tests - to make it universal. To add another language takes 25.84 hrs (how long it took to create the offline Javascript version)
To make it online for every language would be to much maintenance(possible) but it took me a lot of work and time just to support the current offline versions. I can now focus my energy on building grammar optimizers and tooling to unit test grammar rules where all offline versions benefit.

Have a look at Lex/Yacc or their counterparts Flex/Bison (or Coco, or all the other "compiler" generators). The combination can be used to parse complex textual data with an (arguably) much more readable syntax than with regexen.
For simple problems though, where regexen are more than sufficient, by any means do use them.

Building a C++ compiler using erlang

Do you think its possible for a single person (average C++ experience) to build a non-commercial C++ compiler using Erlang, possibly concentrating on optimization?
I wasnt sure if this is completely unrealistic? Is there any advice people could give?
Is erlang the best language to use? I thought it would be good due to its pattern matching. Im not sure if it concurrency would help with writing a compiler??
EDIT: The reason for this is that I dont get to code C++ at work and I want to learn more about the language as I am interested in low latency work. I thought knowing the ins and outs via writing a compiler would be the best way?

A C++ compiler is a lot of work. No, really, a lot of work. C++ is one of the hardest (if not the hardest) production languages to parse. Even just the front-end. Just try reading the standard, it's more than one thousand pages of dense text.
What do you want to use it for? LLVM has the Clang C/C++ front end and an extremely friendly and well-documented intermediate representation. I suggest you use something like this (from Erlang, appropriately adapted or otherwise) and concentrate on the optimisation stage - leaving the parsing to someone else.
Pattern matching does make for a nice compiler though. So Erlang/F#/Scala/Ocaml/Haskell will shine here.

I'm using boost::spirit to write a parser lexer

I'm using boost::spirit to write a parser, lexer
here is what i want to do. i want to put functions and classes into data structures with the variables they use. so i want to know what would be the best way to do this.
and what parts of boost::spirit would be best to use for this?
the languages i want to use this on are C/C++/C#/Objective C/Objective C++.
also the language I'm writing it in is C++ only i'm not to good with the other languages i know

Spirit is a fine tool, but it is not the best tool for all parsing tasks. And for the task of parsing actual C++, it is pretty terrible.
Spirit excels at small-to-mid-level parsing tasks. Language that are fairly regular, with standardized grammars and so forth. C and C-derived languages are generally too complicated for Spirit to handle. It's not that you can't write Spirit parsing code for them. It's just that doing so would be too difficult to build and maintain, given Spirit's general design.
I would suggest downloading Clang if you want a good C or C++ (or Objective variants thereof) parser. It is a compiler as well, but it's designed to be modular, so you can just link to the parsing part of it.

How much time would it take to write a C++ compiler using flex/yacc?

How much time would it take to write a C++ compiler using lex/yacc?
Where can I get started with it?

There are many parsing rules that cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally sometimes the interpretation of tokens requires input from the parser, particularly in C++0x. The handling of the character sequence >> for example is crucially dependent on parsing context.
Those two tools are very poor choices for parsing C++ and you would have to put in a lot of special cases that escaped the basic framework those tools rely on in order to correctly parse C++. It would take you a long time, and even then your parser would likely have weird bugs.
yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.
C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.

It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.
Either invent a trivial toy language of your own, or do something modeled on something much smaller and simpler. I saw a lua parser where the grammar definition was about a page long. That'd be much more reasonable as a starting point.

It will probably take you years, and you'll probably switch to some other parser generator in the process.
Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).
Existing C++ Front Ends:
The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.
GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.
Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.
Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.

A long time, and lex and yacc won't help
If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.
I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.
For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.

As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the orginal GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer).
Some say recursive descent parsers are best, because Bjarne Stroustrop said so. Our experience is the GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including Microsoft and GCC dialects) to carry out program analyses and massive source code transformation.
But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
But then you have the rest of the compiler to consider:
Preprocessor
AST construction
Semantic analysis and type checking
Control, Data flow, and pointer analysis
Basic code generation
Optimizations
Register allocation
Final Code Generation
Debugging support
I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.
Expect building a C++ compiler to take you years.
(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).
EDIT: May, 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.

Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.

This is a non-trivial problem, and would quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by a LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification correct is tricky.
You're not the first person to think that this is fun. Here's a nice blog-style article on the topic:
Parsing C++
Here's an important quote from the article:
"After lots of investigation, I
decided that writing a
parser/analysis-tool for C++ is
sufficiently difficult that it's
beyond what I want to do as a hobby."
The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
ANTLR Grammars (contain several grammars for C++)
A YACC-able C++ 2.1 Grammar and the resulting ambiguities
Parsing and Processing C++ Code (Wikipedia)

Lex,yacc will not be enough. You need a linker, assembler too.., c preprocessor.
It depends on how you do it.
How much pre-made components do you plan to use?
You need to get the description of the syntax and its token from somewhere.
For example, if you use LLVM, you can proceed faster. It already provides a lot of tools, assembler, linker, optimiser....
You can get a c preprocessor from boost project..
You need to create a test suite to test your compiler automatically.
It can take a year if you work on it each day or much less you have more talent and motivation.

Unless you have already written several other compilers; C++ is not a language you even want to start writing a compiler from scratch for, the language has a lot of places were the meaning requires a lot of context before the situation can be disambiguated.
Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).
If you do a google for "C++ grammars" there are a couple around to get you started.
C++ LEX Tokens: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammer: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y

A few years - if you can get research grant to re-write new lex/yacc :-)
People keep chasing their tails on this a lot - starting with Stroustrup who was always fancied being a language "designer" rather than actual compiler writer (remember that his C++ was a mere codegen for ages andwould still be there if it wasn't for gcc and other folks).
The core issue is that real research on parser generators pretty much ceased to exist ever since CPU-s became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does exhaustive search till it nabs one "rule" that fires. Once you are content with that you kind of loose interest in researching how to do it efficiently.
What you'd essentially need is a reasonable middle-ground - like LALR(2) with fixed, limited backtraching (plus static checker to yell if "desiogner" splurges into a nondeterministic tree) and also limited and partitioned symbol table feedback (modern parser need to be concurrency-friendly).
Sounds like a research grant proposal, doesn't it :-) Now if we'd find someone to actually fund it, that would be something :-))

A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.

Recursive decent is a good choice to parse C++. GCC and clang use it.
The Elsa parser (and my ellcc compiler) use the Elkhound GLR compiler generator.
In either case, writing a C++ compiler is a BIG job.

Well, what do you mean by write a compiler?
I doubt any one guy has made a true C++ compiler that took it down all the way to assembly code, but I have used lex and yacc to make a C compiler and I have done it without.
Using both you can make a compiler that leaves out the semantics in a couple days, but figuring out how to use them can take weeks or months easily. Figuring out how to make a compiler at all will take weeks or months no matter what, but the figure I remember is once you know how it works it took a few days with lex and yacc and a few weeks without but the second had better results and fewer bugs so really it's questionable whether they are worth using at all.
The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
With C++ the big issue is templates, but there's so many little issues and rules I can't imagine someone ever wanting to do this. Even if you DO finish, the problem is you won't necessarily have binary compatibility ie be able to be recognized as a runnable program by a linker or the OS because there's more to it than just C++ and its hard to pin down standard but there's also yet more standards to worry about which are even less widely available.

Best practices for writing a programming language parser

Are there any best practices that I should follow while writing a parser?

The received wisdom is to use parser generators + grammars and it seems like good advice, because you are using a rigorous tool and presumably reducing effort and potential for bugs in doing so.
To use a parser generator the grammar has to be context free. If you are designing the languauge to be parsed then you can control this. If you are not sure then it could cost you a lot of effort if you start down the grammar route. Even if it is context free in practice, unless the grammar is enormous, it can be simpler to hand code a recursive decent parser.
Being context free does not only make the parser generator possible, but it also makes hand coded parsers a lot simpler. What you end up with is one (or two) functions per phrase. Which is if you organise and name the code cleanly is not much harder to see than a grammar (if your IDE can show you call hierachies then you can pretty much see what the grammar is).
The advantages:-
Simpler build
Better performance
Better control of output
Can cope with small deviations, e.g. work with a grammar that is not 100% context free
I am not saying grammars are always unsuitable, but often the benefits are minimal and are often out weighed by the costs and risks.
(I believe the arguments for them are speciously appealing and that there is a general bias for them as it is a way of signaling that one is more computer-science literate.)

Few pieces of advice:
Know your grammar - write it down in a suitable form
Choose the right tool. Do it from within C++ with Spirit2x, or choose external parser tools like antlr, yacc, or whatever suits you
Do you need a parser? Maybe regexp will suffice? Or maybe hack a perl script to do the trick? Writing complex parsers take time.

Don't overuse regular expressions - while they have their place, they simply don't have the power to handle any kind of real parsing. You can push them, but you're eventually going to hit a wall or end up with an unmaintainable mess. You're better off finding a parser generator that can handle a larger language set. If you really don't want to get into tools, you can look at recursive descent parsers - it's a really simple pattern for hand-writing a small parser. They aren't as flexible or as powerful as the big parser generators, but they have a much shorter learning curve.
Unless you have very tight performance requirements, try and keep your layers separate - the lexer reads in individual tokens, the parser arranges those into a tree, and then semantic analysis checks over everything and links up references, and then a final phase to output whatever is being produced. Keeping the different parts of logic separate will make things easier to maintain later.

Read most of the Dragon book first.
Parsers are not complicated if you know how to build them, but they are NOT the type of thing that if you put in enough time, you'll eventually get there. It's way better to build on the existing knowledge base. (Otherwise expect to write it and throw it away a few dozen times).

Yep. Try to generate it, not write. Consider using yacc, ANTLR, Flex/Bison, Coco/R, GOLD Parser generator, etc. Resort to manually writing a parser only if none of existing parser generators fit your needs.

Choose the right kind of parser, sometimes a Recursive Descendant will be enough, sometimes you should use an LR parser (also, there are many types of LR parsers).
If you have a complex grammar, build an Abstract Syntax Tree.
Try to identify very well what goes into the lexer, what is part of the syntax and what is a matter of semantics.
Try to make the parser the least coupled to the lexer implementation as possible.
Provide a good interface to the user so he is agnostic of the parser implementation.

First, don't try to apply the same techniques to parsing everything. There are numerous possible use cases, from something like IP addresses (a bit of ad hoc code) to C++ programs (which need an industrial-strength parser with feedback from the symbol table), and from user input (which needs to be processed very fast) to compilers (which normally can afford to spend a little time parsing). You might want to specify what you're doing if you want useful answers.
Second, have a grammar in mind to parse with. The more complicated it is, the more formal the specification needs to be. Try to err on the side of being too formal.
Third, well, that depends on what you're doing.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js