How are textual data files parsed in modern C++?

How are textual data files parsed in modern C++? - c++

I am (too) often confronted with the task of having to parse textual data files -- the kind of textual structured data representation you used before "everyone" used XML -- that are some kind of industry standard. (There are too many of these.)
Anyways, the basic task is always taking a text file and stuffing what's in there in some kind of datastructure so that our C++ code can do something with the info.
Now, I have implemented a few simple (and oh so buggy) parsers by hand, and there is little I despise more. :-)
So - I was wondering what the current state of the art is when I want to "parse" structured textual data into a in-memory representation (think: XML data binding for an arbitrary language).
What I found so far was "What parser generator do you recommend", but I'm not so sure I'm after a parser generator (like ANTLR).
Obvious candidates seem to be pegtl and Boost.Spirit but they both seem rather complicated (but at least they're in-language) and last time I tried Spirit, the compiler errors drove me nuts. (And pegtl needs a C++11 compatible compiler which is still a problem here (VC++ 2005).)
So am I missing a simpler solution for just getting something like
/begin COMPU_METHOD
DEC " Decimal value"
RAT_FUNC
"%3.0"
"dec"
COEFFS 0 1.000000 0.000000 0 0.000000 1.000000
/end COMPU_METHOD
into a C++ datastructure? (This is just an arbitrary example of how part of such a file may look. For this format I could (and probably should) buy a library to parse it, as it is widespread enough -- which is not the case for all formats I encounter.)
-- or should I just go for the complexity of, say Boost.Spirit?

Boost Spirit
See
my answer here for a demo that resembles your sample;
a more advanced, shorter demo here that parses into a tree structure
more samples search
Coco/R (C++)
I have had good results with this very pragmatic parser generator that supports many lnaguages/platforms using a common grammar format. The speed of parsing is comparable to Boost Spirit (allthough the processing of parsed data may be more efficient using generic programming)
Edit To make things perfectly clear, there never has been a thing that I wasn't able to do with Coco/R.
However, I'm really addicted to the ease with which Spirit deduces attribute type (conversions) for me generically. That is the main timesaver. There is a cost involved though:
learning curve, maintenance
compile time (but parsers don't often change)

I highly recommend biting the bullet and using Boost.Spirit. Although the error messages can be enough to put one out of one's skull, it's been worth it for me. I have used it to implement parsers for under- (or un-) documented custom file formats in a matter of hours, instead of days.
I found that the best way to approach it was to view it as an "std::istream on steroids", since it uses the same double-angle notation to denote separation.

You do not mention how sophisticated were the parsers you created by hand. But I believe such simple files could be definitely parsed by hand crafted routines as long as you split your work to lexical and syntactic parsing performed by dedicated state machines. The first one recognizes tokens like in your example keywords, numbers and strings, and feeds them to the second trying to recognize longer sentences and create corresponding data structures. With a simple files following regular grammars with no ambiguities and other conflicts it should be really simple and manageable.

Related

Can I do versatile mathematical (AST) pattern matching and manipulation with Boost.Spirit?

I was looking into pattern matching in C++, and among things like Mach7, which seems to be a functional approach to the problem, and the more general Visitor Pattern, which seems to be the lowest common denominator: it can do everything but excels at only specific cases.
I would like to manipulate mathematical expressions (simplify, evaluate, and also perform calculations like differential equation solving, and integration, symbollically). Yes, I'm looking to end up with a Computer Algebra System.
For the input, I'm looking at using Boost.Spirit (X3) to parse some form of input (currently playing with getting basic LaTeX support in there, although index vs sub/superscript is a problem for this...).
I then came to the crazy idea of using Boost.Spirit to not only parse the input "text", but also use the non-parser components of the library to actually perform the mathematical manipulations on a resulting AST.
Is this versatile enough for the pattern matching my goal requires or should I look at other solutions? I have tried to find documentation on how other CAS work internally, but short of going through the undoubtetly brilliant code of things like Maxima, I can't seem to find any information on anything but very simple implementations of mathematical ASTs. So I have little input info to determine if Boost.Spirit can do what I eventually will need to do.

I'm not qualified to advise on the subject of symbolic algebra and the requirements there.
I do however know a thing or two about Boost Spirit.
All I can say is: don't do it!
You do not want to burden the parser with such complex responsibilities that are just going to be more difficult to design right inside the "warped" reality that is EDSLs and Phoenix actors.
In fact, I have oft repeated this advice (see e.g. Boost Spirit: "Semantic actions are evil"?, is the most linked to for this, but I've deepened it out in several chat rooms and on occasion in answers where the problem seemed to arise from conflating parsing with processing).

Parse tree for SQL statements - precisely for "SELECT" statement

I am writing (hand written) recursive descent parser for SQL select statement in c++, i need to know whether the parse tree created by me is correct or not. I want to check but i didn't get a good sources for sql parse trees. My way of approach is - writing a function for each production and in that function the result is adding to the root tree. Can any one help me? Thanks in advance.

I don't know how you'll go about verifying your code is correct, but if you're concerned about your understanding of the SQL grammar, then here is a website that lists BNF grammars for various dialects of SQL. You ought to be able construct your parser in terms of these rules.

My company builds a lot of parsers, and have your same problem. We recently finished a SQL 2011 parser based on the draft standard.
Pretty much you decide if the parse tree is right by hand-inspecting it for many source code cases. This presumes that you can print the parse tree in a form that you can easily inspect; this is easily accomplished by a recursive tree walk of the parse tree. [You have to already believe that your abstract syntax tree nodes correctly model what you intend to capture!]. You choose the cases carefully to exercise different parts of the grammar (think "unit tests for grammars"). For a langauge as rich as SQL, this is a big job.
You also need to validate that the parser works in general, and you do that by feeding a lot of real code for the particular dialect of SQL you are handling. I typically try to find 100K-1M SLOC, and if the parser can't eat all of that, I have still have work left to do. Once you get to that level, you sort of consider that your parser is OK and treat further errors as "maintenance issues".
While the following may not help you directly, it might hint at a direction in which you could head. I use a somewhat different approach, based on having extremely strong parsing machinery available. Our tool, the DMS Software Reengineering Toolkit, given a grammar, will produce ASTs automatically, and has built-in facilities to print such parse trees (in one form as XML). The AST has sufficient information to regenerate ("prettyprint") the source text, and DMS has a built-in prettyprinter. So after hand inspecting a variety of cases, what I do is to take a large body of code, and for each file, parse it (getting no parse errors by virtue of the work done above), prettyprint the source, and reparse the source (expecting to get no errors). This is strong hint that we haven't lost anything in the round trip.
We have a new tool available, the Smart Differencer that compares the text of two programs to see if they are "the same" ignoring language layout rules. It works in essence by parsing two files and comparting their parse trees, ignoring the formatting (line/column/escapes/radix/comments/whitespace). What we are starting to do is to parse the source code, prettyprint it, and the smart-diffing the prettytprinted result against the original file. SmartDiff should say "no AST differences". This is a much stronger hint that we haven't lost anything. You can do pretty the much the same if you are willing to compare your before-and-after printed parse trees.

This parser, based on pyparsing, might be helpful as a second SELECT parsing resource (although it is in Python, not C++, sorry).

How should I go about building a simple LR parser?

I am trying to build a simple LR parser for a type of template (configuration) file that will be used to generate some other files. I've read and read about LR parsers, but I just can't seem to understand it! I understand that there is a parse stack, a state stack and a parsing table. Tokens are read onto the parse stack, and when a rule is matched then the tokens are shifted or reduced, depending on the parsing table. This continues recursively until all of the tokens are reduced and the parsing is then complete.
The problem is I don't really know how to generate the parsing table. I've read quite a few descriptions, but the language is technical and I just don't understand it. Can anyone tell me how I would go about this?
Also, how would I store things like the rules of my grammar?
http://codepad.org/oRjnKacH is a sample of the file I'm trying to parse with my attempt at a grammar for its language.
I've never done this before, so I'm just looking for some advice, thanks.

In your study of parser theory, you seem to have missed a much more practical fact: virtually nobody ever even considers hand writing a table-driven, bottom-up parser like you're discussing. For most practical purposes, hand-written parsers use a top-down (usually recursive descent) structure.
The primary reason for using a table-driven parser is that it lets you write a (fairly) small amount of code that manipulates the table and such, that's almost completely generic (i.e. it works for any parser). Then you encode everything about a specific grammar into a form that's easy for a computer to manipulate (i.e. some tables).
Obviously, it would be entirely possible to do that by hand if you really wanted to, but there's almost never a real point. Generating the tables entirely by hand would be pretty excruciating all by itself.
For example, you normally start by constructing an NFA, which is a large table -- normally, one row for each parser state, and one column for each possible input. At each cell, you encode the next state to enter when you start in that state, and then receive that input. Most of these transitions are basically empty (i.e. they just say that input isn't allowed when you're in that state). Note: since the valid transitions are so sparse, most parser generators support some way of compressing these tables, but that doesn't change the basic idea).
You then step through all of those and follow some fairly simple rules to collect sets of NFA states together to become a state in the DFA. The rules are simple enough that it's pretty easy to program them into a computer, but you have to repeat them for every cell in the NFA table, and do essentially perfect book-keeping to produce a DFA that works correctly.
A computer can and will do that quite nicely -- for it, applying a couple of simple rules to every one of twenty thousand cells in the NFA state table is a piece of cake. It's hard to imagine subjecting a person to doing the same though -- I'm pretty sure under UN guidelines, that would be illegal torture.

The classic solution is the lex/yacc combo:
http://dinosaur.compilertools.net/yacc/index.html
Or, as gnu calls them - flex/bison.
edit:
Perl has Parse::RecDescent, which is a recursive descent parser, but it may work better for simple work.

you need to read about ANTLR

I looked at the definition of your fileformat, while I am missing some of the context why you would want specifically a LR parser, my first thought was why not use existing formats like xml, or json. Going down the parsergenerator route usually has a high startup cost that will not pay off for the simple data that you are looking to parse.
As paul said lex/yacc are an option, you might also want to have a look at Boost::Spirit.
I have worked with neither, a year ago wrote a much larger parser using QLALR by the Qt/Nokia people. When I researched parsers this one even though very underdocumented had the smallest footprint to get started (only 1 tool) but it does not support lexical analysis. IIRC I could not figure out C++ support in ANTLR at that time.
10,000 mile view: In general you are looking at two components a lexer that takes the input symbols and turns them into higher order tokens. To work of the tokens your grammar description will state rules, usually you will include some code with the rules, this code will be executed when the rule is matched. The compiler generator (e.g. yacc) will take your description of the rules and the code and turn it into compilable code. Unless you are doing this by hand you would not be manipulating the tables yourself.

Well you can't understand it like
"Function A1 does f to object B, then function A2 does g to D etc"
its more like
"Function A does action {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o or p, or no-op} and shifts/reduces a certain count to objects {1-1567} at stack head of type {B,C,D,E,F,or G} and its containing objects up N levels which may have types {H,I,J,K or L etc} in certain combinations according to a rule list"
It really does need a data table (or code generated from a data table like thing, like a set of BNF grammar data) telling the function what to do.
You CAN write it from scratch. You can also paint walls with eyelash brushes. You can interpret the data table at run-time. You can also put Sleep(1000); statements in your code every other line. Not that I've tried either.
Compilers is complex. Hence compiler generators.
EDIT
You are attempting to define the tokens in terms of content in the file itself.
I assume the reason you "don't want to use regexes" is that you want to be able to access line number information for different tokens within a block of text and not just for the block of text as a whole. If line numbers for each word are unnecessary, and entire blocks are going to fit into memory, I'd be inclined to model the entire bracketed block as a token, as this may increase processing speed. Either way you'll need a custom yylex function. Start by generating one with lex with fixed markers "[" and "]" for content start and end, then freeze it and modify it to take updated data about what markers to look for from the yacc code.

Best practices for writing a programming language parser

Are there any best practices that I should follow while writing a parser?

The received wisdom is to use parser generators + grammars and it seems like good advice, because you are using a rigorous tool and presumably reducing effort and potential for bugs in doing so.
To use a parser generator the grammar has to be context free. If you are designing the languauge to be parsed then you can control this. If you are not sure then it could cost you a lot of effort if you start down the grammar route. Even if it is context free in practice, unless the grammar is enormous, it can be simpler to hand code a recursive decent parser.
Being context free does not only make the parser generator possible, but it also makes hand coded parsers a lot simpler. What you end up with is one (or two) functions per phrase. Which is if you organise and name the code cleanly is not much harder to see than a grammar (if your IDE can show you call hierachies then you can pretty much see what the grammar is).
The advantages:-
Simpler build
Better performance
Better control of output
Can cope with small deviations, e.g. work with a grammar that is not 100% context free
I am not saying grammars are always unsuitable, but often the benefits are minimal and are often out weighed by the costs and risks.
(I believe the arguments for them are speciously appealing and that there is a general bias for them as it is a way of signaling that one is more computer-science literate.)

Few pieces of advice:
Know your grammar - write it down in a suitable form
Choose the right tool. Do it from within C++ with Spirit2x, or choose external parser tools like antlr, yacc, or whatever suits you
Do you need a parser? Maybe regexp will suffice? Or maybe hack a perl script to do the trick? Writing complex parsers take time.

Don't overuse regular expressions - while they have their place, they simply don't have the power to handle any kind of real parsing. You can push them, but you're eventually going to hit a wall or end up with an unmaintainable mess. You're better off finding a parser generator that can handle a larger language set. If you really don't want to get into tools, you can look at recursive descent parsers - it's a really simple pattern for hand-writing a small parser. They aren't as flexible or as powerful as the big parser generators, but they have a much shorter learning curve.
Unless you have very tight performance requirements, try and keep your layers separate - the lexer reads in individual tokens, the parser arranges those into a tree, and then semantic analysis checks over everything and links up references, and then a final phase to output whatever is being produced. Keeping the different parts of logic separate will make things easier to maintain later.

Read most of the Dragon book first.
Parsers are not complicated if you know how to build them, but they are NOT the type of thing that if you put in enough time, you'll eventually get there. It's way better to build on the existing knowledge base. (Otherwise expect to write it and throw it away a few dozen times).

Yep. Try to generate it, not write. Consider using yacc, ANTLR, Flex/Bison, Coco/R, GOLD Parser generator, etc. Resort to manually writing a parser only if none of existing parser generators fit your needs.

Choose the right kind of parser, sometimes a Recursive Descendant will be enough, sometimes you should use an LR parser (also, there are many types of LR parsers).
If you have a complex grammar, build an Abstract Syntax Tree.
Try to identify very well what goes into the lexer, what is part of the syntax and what is a matter of semantics.
Try to make the parser the least coupled to the lexer implementation as possible.
Provide a good interface to the user so he is agnostic of the parser implementation.

First, don't try to apply the same techniques to parsing everything. There are numerous possible use cases, from something like IP addresses (a bit of ad hoc code) to C++ programs (which need an industrial-strength parser with feedback from the symbol table), and from user input (which needs to be processed very fast) to compilers (which normally can afford to spend a little time parsing). You might want to specify what you're doing if you want useful answers.
Second, have a grammar in mind to parse with. The more complicated it is, the more formal the specification needs to be. Try to err on the side of being too formal.
Third, well, that depends on what you're doing.

Compiler-Programming: What are the most fundamental ingredients?

I am interested in writing a very minimalistic compiler.
I want to write a small piece of software (in C/C++) that fulfills the following criteria:
output in ELF format (*nix)
input is a single textfile
C-like grammar and syntax
no linker
no preprocessor
very small (max. 1-2 KLOC)
Language features:
native data types: char, int and floats
arrays (for all native data types)
variables
control structures (if-else)
functions
loops (would be nice)
simple algebra (div, add, sub, mul, boolean expressions, bit-shift, etc.)
inline asm (for system calls)
Can anybody tell me how to start? I don't know what parts a compiler consists of (at least not in the sense that I just could start right off the shelf) and how to program them. Thank you for your ideas.

With all that you hope to accomplish, the most challenging requirement might be "very small (max. 1-2 KLOC)". I think your first requirement alone (generating ELF output) might take well over a thousand lines of code by itself.
One way to simplify the problem, at least to start with, is to generate code in assembly language text that you then feed into an existing assembler (nasm would be a good choice). The assembler would take care of generating the actual machine code, as well as all the ELF specific code required to build an actual runnable executable. Then your job is reduced to language parsing and assembly code generation. When your project matures to the point where you want to remove the dependency on an assembler, you can rewrite this part yourself and plug it in at any time.
If I were you, I might start with an assembler and build pieces on top of it. The simplest "compiler" might take a language with just a few very simple possible statements:
print "hello"
a = 5
print a
and translate that to assembly language. Once you get that working, then you can build a lexer and parser and abstract syntax tree and code generator, which are most of the parts you'll need for a modern block structured language.
Good luck!

Firstly, you need to decide whether you are going to make a compiler or an interpreter. A compiler translates your code into something that can be run either directly on hardware, in an interpreter, or get compiled into another language which then is interpreted in some way. Both types of languages are turing complete so they have the same expressive capabilities. I would suggest that you create a compiler which compiles your code into either .net or Java bytecode, as it gives you a very optimized interpreter to run on as well as a lot of standard libraries.
Once you made your decision there are some common steps to follow
Language definition Firstly, you have to define how your language should look syntactically.
Lexer The second step is to create the keywords of your code, known as tokens. Here, we are talking about very basic elements such as numbers, addition sign, and strings.
Parsing The next step is to create a grammar that matches your list of tokens. You can define your grammar using e.g. a context-free grammar. A number of tools can be fed with one of these grammars and create the parser for you. Usually, the parsed tokens are organized into a parse tree. A parse tree is the representation of your grammar as a data structure which you can move around in.
Compiling or Interpreting The last step is to run some logic on your parse tree. A simple way to make your own interpreter is to create some logic associated to each node type in your tree and walk through the tree either bottom-up or top-down. If you want to compile to another language you can insert the logic of how to translate the code in the nodes instead.
Wikipedia is great for learning more, you might want to start here.
Concerning real-world reading material I would suggest "Programming language processors in JAVA" by David A Watt & Deryck F Brown. I used that book in my compilers course and learning by example is great in this field.

These are the absolutely essential parts:
Scanner: This breaks the input file into tokens
Parser: This constructs an abstract syntax tree (AST) from the tokens identified by the scanner.
Code generation: This produces the output from the AST.
You'll also probably want:
Error handling: This tells the parser what to do if it encounters an unexpected token
Optimization: This will enable the compiler to produce more efficient machine code
Edit: Have you already designed the language? If not, you'll want to look into language design, too.

I don't know what you hope to get out of this, but if it is learning, and looking at existing code works for you, there is always tcc.

The number one essential is a book on compiler writing. A lot of people will tell you to read the "Dragon Book" by Aho et al, but the best book I've read on compilers is "Brinch Hansen on Pascal Compilers". I suspect it's out of print (Amazon is your friend), but it takes you through all the steps of designing and writing a compiler using recursive descent, which is the easiest method for compiler newbies to understand.
Although the book uses Pascal as the implementation and target languages, the lessons and techniques presented apply equally to all other languages.

The examples are all in Perl, but Exploring Programming Language Architecture in Perl is a good book (and free).

A really good set of free references, IMHO, are:
Overall compiler tutorial: Let's Build a Compiler by Jack Crenshaw (http://compilers.iecc.com/crenshaw/) It's wordy, but I like it.
Assembler: NASM (nasm.us) good for Linux and Windows/DOS, and most importantly lots of doco and examples/tutorials. (FASM is also good but less documentation/tutorials out there)
Other sources
The PC Assembly book (http://www.drpaulcarter.com/pcasm/index.php)
I'm trying to write a LISP, so I'm using the Lisp 1.5 Manual. You may want to get the language spec for whatever language you're writing.
As far as 1-2KLOC, assuming you use a high level language (like Py or Rb) you should be close if you're not too ambitious.

I always recommend flex and bison for this kind of work as a beginner. You can always learn the ins and outs of writing your own scanner and parser later, although they may increase the code size at least they will be generated for you by tools. :)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js