How should I go about building a simple LR parser? - c++

I am trying to build a simple LR parser for a type of template (configuration) file that will be used to generate some other files. I've read and read about LR parsers, but I just can't seem to understand it! I understand that there is a parse stack, a state stack and a parsing table. Tokens are read onto the parse stack, and when a rule is matched then the tokens are shifted or reduced, depending on the parsing table. This continues recursively until all of the tokens are reduced and the parsing is then complete.
The problem is I don't really know how to generate the parsing table. I've read quite a few descriptions, but the language is technical and I just don't understand it. Can anyone tell me how I would go about this?
Also, how would I store things like the rules of my grammar?
http://codepad.org/oRjnKacH is a sample of the file I'm trying to parse with my attempt at a grammar for its language.
I've never done this before, so I'm just looking for some advice, thanks.

In your study of parser theory, you seem to have missed a much more practical fact: virtually nobody ever even considers hand writing a table-driven, bottom-up parser like you're discussing. For most practical purposes, hand-written parsers use a top-down (usually recursive descent) structure.
The primary reason for using a table-driven parser is that it lets you write a (fairly) small amount of code that manipulates the table and such, that's almost completely generic (i.e. it works for any parser). Then you encode everything about a specific grammar into a form that's easy for a computer to manipulate (i.e. some tables).
Obviously, it would be entirely possible to do that by hand if you really wanted to, but there's almost never a real point. Generating the tables entirely by hand would be pretty excruciating all by itself.
For example, you normally start by constructing an NFA, which is a large table -- normally, one row for each parser state, and one column for each possible input. At each cell, you encode the next state to enter when you start in that state, and then receive that input. Most of these transitions are basically empty (i.e. they just say that input isn't allowed when you're in that state). Note: since the valid transitions are so sparse, most parser generators support some way of compressing these tables, but that doesn't change the basic idea).
You then step through all of those and follow some fairly simple rules to collect sets of NFA states together to become a state in the DFA. The rules are simple enough that it's pretty easy to program them into a computer, but you have to repeat them for every cell in the NFA table, and do essentially perfect book-keeping to produce a DFA that works correctly.
A computer can and will do that quite nicely -- for it, applying a couple of simple rules to every one of twenty thousand cells in the NFA state table is a piece of cake. It's hard to imagine subjecting a person to doing the same though -- I'm pretty sure under UN guidelines, that would be illegal torture.

The classic solution is the lex/yacc combo:
http://dinosaur.compilertools.net/yacc/index.html
Or, as gnu calls them - flex/bison.
edit:
Perl has Parse::RecDescent, which is a recursive descent parser, but it may work better for simple work.

you need to read about ANTLR

I looked at the definition of your fileformat, while I am missing some of the context why you would want specifically a LR parser, my first thought was why not use existing formats like xml, or json. Going down the parsergenerator route usually has a high startup cost that will not pay off for the simple data that you are looking to parse.
As paul said lex/yacc are an option, you might also want to have a look at Boost::Spirit.
I have worked with neither, a year ago wrote a much larger parser using QLALR by the Qt/Nokia people. When I researched parsers this one even though very underdocumented had the smallest footprint to get started (only 1 tool) but it does not support lexical analysis. IIRC I could not figure out C++ support in ANTLR at that time.
10,000 mile view: In general you are looking at two components a lexer that takes the input symbols and turns them into higher order tokens. To work of the tokens your grammar description will state rules, usually you will include some code with the rules, this code will be executed when the rule is matched. The compiler generator (e.g. yacc) will take your description of the rules and the code and turn it into compilable code. Unless you are doing this by hand you would not be manipulating the tables yourself.

Well you can't understand it like
"Function A1 does f to object B, then function A2 does g to D etc"
its more like
"Function A does action {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o or p, or no-op} and shifts/reduces a certain count to objects {1-1567} at stack head of type {B,C,D,E,F,or G} and its containing objects up N levels which may have types {H,I,J,K or L etc} in certain combinations according to a rule list"
It really does need a data table (or code generated from a data table like thing, like a set of BNF grammar data) telling the function what to do.
You CAN write it from scratch. You can also paint walls with eyelash brushes. You can interpret the data table at run-time. You can also put Sleep(1000); statements in your code every other line. Not that I've tried either.
Compilers is complex. Hence compiler generators.
EDIT
You are attempting to define the tokens in terms of content in the file itself.
I assume the reason you "don't want to use regexes" is that you want to be able to access line number information for different tokens within a block of text and not just for the block of text as a whole. If line numbers for each word are unnecessary, and entire blocks are going to fit into memory, I'd be inclined to model the entire bracketed block as a token, as this may increase processing speed. Either way you'll need a custom yylex function. Start by generating one with lex with fixed markers "[" and "]" for content start and end, then freeze it and modify it to take updated data about what markers to look for from the yacc code.

Related

document level regex

I am looking to create a system that extracts blocks of code from C++ files. For example, if I wanted to extract every while loop, I would look for a pattern that begins with while and ends with }. The problem with that specific example is that while loops may contain other scope blocks, so I'd need to:
Find the string while - regex can easily do this
match braces starting with the open brace after the while and ending with its matching brace
Also match while loops that contain a single line and no braces
Handle as many special cases as possible, such as while loops declared in comments etc, as per #Cid's suggestion
I can do this with a parser and a lot of code, but I was wondering if anything existed that perhaps extends regex to this sort of document level query?
There are parser libraries and tools, even free open-source ones. Clang has one, for example. So does GCC. There are others.
It's a lot of code because C++ is hard to parse. But if someone else gas written the code and it works, that's nit a problem. The usual difficulty with using these products is finding good documentation, but you can always try asking specific questions here
But just doing a lexical analysis of C++ is less difficult, and would be sufficient for crude analysis of program structure if you don't care that it will fail on corner cases. If you start with preprocessed code (or make the dubious assumption that preprocessing doesn't change the program structure) and don't worry about identifying template brackets (in particular, distinguishing between the right shift operator and two consecutive close angle brackets), you should be able to build a lexical analyser with a reasonably short scanner generator specification.
That might be sufficient for crude analysis of program structure if you don't care that it will fail on corner cases.

c++ parser and formatter using a single grammar declaration

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.
A parser generator (e.g. antlr) is able to generate the parser from a bnf grammar.
But is there a way to use the same grammar to generate the formatting code?
I just want to avoid manually having to sync the parsing code (generated) with a manually written formatting code, since the grammar is the same.
could I use the abstract syntax tree?
boost::spirit? metaprogramming?
anyone tried this?
thanks
It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out-of-scope for Stack Overflow), or in algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.
There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.
Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.
In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted and all the additional newlines and indents which must be inserted to conform with the operators, advancing until there is no newline between the fingers.
The on-line algorithm might not produced optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)
Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.
It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.
Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism is described by M.G.J. van der Brand in Pretty Printing in the ASF+SDF Meta-environment Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)
One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:
// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;
Or the common case of trailing comments used to document variables:
/* Tracking the current allocation */
int needed_; // Bytes required.
int current_; // Bytes currently allocated.
// int adjustment_; // (TODO) Why is this needed?
/* Either points to the current allocation, or is 0 */
char* buffer_;
In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.
Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.
OP asks, Has anyone tried this?
Yes. Our DMS Software Reengineering Toolkit can do this: you give it just a grammar, you get a parser that builds ASTs and you get a prettyprinter. We've used this on
lots of parse/changeAST/unparse tasks for many language over the last 20 years, preserving the meaning of the source program exactly.
The process is to parse according to the grammar, build an AST, and then walk the AST to carry out prettyprinting operations.
However, you don't get a good prettyprinter. Nice layout of the reformatted source code requires that language cues for block nesting (e.g., matching '{' ... '}', 'BEGIN' ... 'END' pairs, special keywords 'if', 'for', etc.) be used to drive the formatting and indentation. While one can guess what these elements are (as I just did), that's just a guess and in practice a human being needs to inspect the grammar to determine which things are cues and how to format each construct. (The default prettyprinter derived from the grammar makes such guesses).
DMS provides support for that problem in the form of prettyprinter declarations woven into the grammar to provide the formatter engineer quite a lot of control over the layout. (See this SO answer for detailed discussion: https://stackoverflow.com/a/5834775/120163) This produces
(our opinion) pretty good prettyprinters. And DMS does have an explicit grammar/formatter for full C++14. [EDIT Aug 2018: full C++17 in MS and GCC dialects)]
EDIT: rici's answer suggests that comments are difficult to handle. He's right, in the sense that you must handle them, and yes, it is hard to handle them if they are removed as whitespace while parsing. The essense of the problem is "removed as whitespace"; and goes away if you don't do that. DMS provides means to capture the comments (rather than ignoring them as whitespace) and attach them (automatically) to AST nodes. The decision as to which AST node captures the comments is handled in the lexer by declaring comments as "pre" (happening before a token) or "post"; this decision is heuristic on the part of the grammer/lexer engineer, but works actually pretty well. The token with comments is passed to the parser, which builds an AST node from it. With comments attached to AST nodes, the prettyprinter can re-generate them, too.

How are textual data files parsed in modern C++?

I am (too) often confronted with the task of having to parse textual data files -- the kind of textual structured data representation you used before "everyone" used XML -- that are some kind of industry standard. (There are too many of these.)
Anyways, the basic task is always taking a text file and stuffing what's in there in some kind of datastructure so that our C++ code can do something with the info.
Now, I have implemented a few simple (and oh so buggy) parsers by hand, and there is little I despise more. :-)
So - I was wondering what the current state of the art is when I want to "parse" structured textual data into a in-memory representation (think: XML data binding for an arbitrary language).
What I found so far was "What parser generator do you recommend", but I'm not so sure I'm after a parser generator (like ANTLR).
Obvious candidates seem to be pegtl and Boost.Spirit but they both seem rather complicated (but at least they're in-language) and last time I tried Spirit, the compiler errors drove me nuts. (And pegtl needs a C++11 compatible compiler which is still a problem here (VC++ 2005).)
So am I missing a simpler solution for just getting something like
/begin COMPU_METHOD
DEC " Decimal value"
RAT_FUNC
"%3.0"
"dec"
COEFFS 0 1.000000 0.000000 0 0.000000 1.000000
/end COMPU_METHOD
into a C++ datastructure? (This is just an arbitrary example of how part of such a file may look. For this format I could (and probably should) buy a library to parse it, as it is widespread enough -- which is not the case for all formats I encounter.)
-- or should I just go for the complexity of, say Boost.Spirit?
Boost Spirit
See
my answer here for a demo that resembles your sample;
a more advanced, shorter demo here that parses into a tree structure
more samples search
Coco/R (C++)
I have had good results with this very pragmatic parser generator that supports many lnaguages/platforms using a common grammar format. The speed of parsing is comparable to Boost Spirit (allthough the processing of parsed data may be more efficient using generic programming)
Edit To make things perfectly clear, there never has been a thing that I wasn't able to do with Coco/R.
However, I'm really addicted to the ease with which Spirit deduces attribute type (conversions) for me generically. That is the main timesaver. There is a cost involved though:
learning curve, maintenance
compile time (but parsers don't often change)
I highly recommend biting the bullet and using Boost.Spirit. Although the error messages can be enough to put one out of one's skull, it's been worth it for me. I have used it to implement parsers for under- (or un-) documented custom file formats in a matter of hours, instead of days.
I found that the best way to approach it was to view it as an "std::istream on steroids", since it uses the same double-angle notation to denote separation.
You do not mention how sophisticated were the parsers you created by hand. But I believe such simple files could be definitely parsed by hand crafted routines as long as you split your work to lexical and syntactic parsing performed by dedicated state machines. The first one recognizes tokens like in your example keywords, numbers and strings, and feeds them to the second trying to recognize longer sentences and create corresponding data structures. With a simple files following regular grammars with no ambiguities and other conflicts it should be really simple and manageable.

Parse tree for SQL statements - precisely for "SELECT" statement

I am writing (hand written) recursive descent parser for SQL select statement in c++, i need to know whether the parse tree created by me is correct or not. I want to check but i didn't get a good sources for sql parse trees. My way of approach is - writing a function for each production and in that function the result is adding to the root tree. Can any one help me? Thanks in advance.
I don't know how you'll go about verifying your code is correct, but if you're concerned about your understanding of the SQL grammar, then here is a website that lists BNF grammars for various dialects of SQL. You ought to be able construct your parser in terms of these rules.
My company builds a lot of parsers, and have your same problem. We recently finished a SQL 2011 parser based on the draft standard.
Pretty much you decide if the parse tree is right by hand-inspecting it for many source code cases. This presumes that you can print the parse tree in a form that you can easily inspect; this is easily accomplished by a recursive tree walk of the parse tree. [You have to already believe that your abstract syntax tree nodes correctly model what you intend to capture!]. You choose the cases carefully to exercise different parts of the grammar (think "unit tests for grammars"). For a langauge as rich as SQL, this is a big job.
You also need to validate that the parser works in general, and you do that by feeding a lot of real code for the particular dialect of SQL you are handling. I typically try to find 100K-1M SLOC, and if the parser can't eat all of that, I have still have work left to do. Once you get to that level, you sort of consider that your parser is OK and treat further errors as "maintenance issues".
While the following may not help you directly, it might hint at a direction in which you could head. I use a somewhat different approach, based on having extremely strong parsing machinery available. Our tool, the DMS Software Reengineering Toolkit, given a grammar, will produce ASTs automatically, and has built-in facilities to print such parse trees (in one form as XML). The AST has sufficient information to regenerate ("prettyprint") the source text, and DMS has a built-in prettyprinter. So after hand inspecting a variety of cases, what I do is to take a large body of code, and for each file, parse it (getting no parse errors by virtue of the work done above), prettyprint the source, and reparse the source (expecting to get no errors). This is strong hint that we haven't lost anything in the round trip.
We have a new tool available, the Smart Differencer that compares the text of two programs to see if they are "the same" ignoring language layout rules. It works in essence by parsing two files and comparting their parse trees, ignoring the formatting (line/column/escapes/radix/comments/whitespace). What we are starting to do is to parse the source code, prettyprint it, and the smart-diffing the prettytprinted result against the original file. SmartDiff should say "no AST differences". This is a much stronger hint that we haven't lost anything. You can do pretty the much the same if you are willing to compare your before-and-after printed parse trees.
This parser, based on pyparsing, might be helpful as a second SELECT parsing resource (although it is in Python, not C++, sorry).

Efficient memory storage and retrieval of categorized string literals in C++

Note: This is a follow up to this question.
I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.
The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches so they need to be in memory. Also there can be 100+ copies of this process running at the same time so large I/O on startup could cause slow I/O for other copies.
With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).
This program currently runs on Windows with Microsoft compiler but I'd like to keep the solution as cross-platform as possible so I don't think I want to use PE resource files or something.
Mmapping an external file might work but then I have the issue of keeping program version and data version in sync, one does not normally change without the other. Also this requires some file "format" which adds a layer of complexity I'd rather not have.
So after all of this pre-amble it seems like the best solution is to have a bunch arrays of strings which I can then iterate over. This seems kind of messy as I'm mixing code and data heavily, but with the above requirements is there any better way to handle this sort of situation?
I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).
Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.
Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can't be too big, but it's another option)
The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.
If the strings that need to be matched can be locked down at compile time you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is used often in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably easily generate the regular expressions from the string literals in your existing code, but it may take a good bit of work to rework your program to use the code that lex generated.
If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.
The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:
"string1"
"string2"
"string3"
and the input:
"string2"
The partial match for "string" would be done just once for a DFA (Deterministic Finite-state Automaton) regular expression system (like lex) which would probably speed up your system. Building these things does require a lot of work on lex 's behalf, but all of the hard work is done up front.
Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try and adjust the working set size to try and see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides your program needs to have quota privileges to adjust the working set size).
There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.
The bottomline is that you need to experiment to see what works best and remember to measure after each change :)...