DOM parsing, structured document traversal under the hood - regex

As a developer, and I'm certain I'm far from alone here, I'm always curious to understand what's "under the hood". DOM parsers are one of the list-toppers of this curiosity for me. We all know the famous post. I have even hacked together a bit of an "O RLY?", out of both temporary necessity and curiosity.
However, my need to meet the man behind the curtain remains unmet. How do DOM parsers, or any structured document parsers for that matter, parse documents? As far as my intermediate web application developer understanding can muster, it's a combination of recursive string parsing and state-keeping logic, not unlike my own hackish attempt.
Magicians should never reveal their secrets, but seriously, where is he hiding the rabbit?

There's a well-developed theory of parsing, and untold numbers of tools to support it. In general, you look at each character, one at a time, and decide when the characters you've read so far constitute a token. Then you look at the series of tokens, and decide when a sequence of tokens constitutes a higher-level grammatical construct -- in this case, an HTML element. As you recognize constructs, you build a tree of nodes to represent them -- in this case, the DOM tree.
So are you familiar with context-free grammars, and compiler-compilers like yacc, bison, and their more modern counterparts? If you understand those, a DOM parser shouldn't be a mystery.
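To make that concrete, here is a deliberately tiny sketch (nothing like a production DOM parser): a lexer that turns a toy HTML subset into tag and text tokens, and a recursive function that turns the token stream into a tree. Attributes, entities, comments, and error recovery are all ignored, and every name is invented for the illustration:

// Deliberately tiny sketch, not how a real DOM parser works: a lexer for a toy
// HTML subset (open tags, close tags, text; no attributes, entities, comments,
// or error recovery) and a recursive parser that builds a tree from the tokens.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Token { enum Kind { Open, Close, Text } kind; std::string value; };

// Lexer: scan characters one at a time, deciding when they form a token.
std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < input.size()) {
        if (input[i] == '<') {
            bool closing = (i + 1 < input.size() && input[i + 1] == '/');
            size_t start = i + (closing ? 2 : 1);
            size_t end = input.find('>', start);      // assumes well-formed input
            tokens.push_back({closing ? Token::Close : Token::Open,
                              input.substr(start, end - start)});
            i = end + 1;
        } else {
            size_t end = input.find('<', i);
            if (end == std::string::npos) end = input.size();
            tokens.push_back({Token::Text, input.substr(i, end - i)});
            i = end;
        }
    }
    return tokens;
}

struct Node {
    std::string name;                                 // tag name, or "#text"
    std::string text;                                 // only used by text nodes
    std::vector<std::unique_ptr<Node>> children;
};

// Parser: consume tokens, recursing on each open tag to build the nested tree.
std::unique_ptr<Node> parseElement(const std::vector<Token>& toks, size_t& pos) {
    auto node = std::make_unique<Node>();
    node->name = toks[pos++].value;                   // consume the open tag
    while (pos < toks.size() && toks[pos].kind != Token::Close) {
        if (toks[pos].kind == Token::Text) {
            auto text = std::make_unique<Node>();
            text->name = "#text";
            text->text = toks[pos++].value;
            node->children.push_back(std::move(text));
        } else {
            node->children.push_back(parseElement(toks, pos));  // nested element
        }
    }
    ++pos;                                            // consume the close tag
    return node;
}

void print(const Node& n, int depth = 0) {
    std::cout << std::string(depth * 2, ' ')
              << (n.name == "#text" ? "\"" + n.text + "\"" : n.name) << "\n";
    for (const auto& c : n.children) print(*c, depth + 1);
}

int main() {
    auto tokens = lex("<html><body><p>hello</p><p>world</p></body></html>");
    size_t pos = 0;
    print(*parseElement(tokens, pos));
}

The recursion in parseElement is the part a regex can't give you: each open tag suspends the current element and starts a nested one, which is exactly the nesting structure the DOM tree records.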

Related

c++ parser and formatter using a single grammar declaration

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.
A parser generator (e.g. ANTLR) is able to generate the parser from a BNF grammar.
But is there a way to use the same grammar to generate the formatting code?
I just want to avoid manually having to sync the parsing code (generated) with a manually written formatting code, since the grammar is the same.
could I use the abstract syntax tree?
boost::spirit? metaprogramming?
anyone tried this?
thanks
It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out-of-scope for Stack Overflow), or for algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.
There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.
Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.
In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted and all the additional newlines and indents which must be inserted to conform with the operators, advancing until there is no newline between the fingers.
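As a rough taste of those operators (and emphatically not Oppen's actual linear-time two-finger algorithm), here is a greedy sketch: Begin/End delimit groups carrying extra indentation, OptBreak marks where a newline may go, and the printer breaks only when the next token would overflow the line. The names and the line width are invented for the example:

// Greedy sketch of the flavour of pretty-printing operators, not Oppen's
// actual algorithm: the parser (or a grammar-generated action) would emit
// Text, OptBreak, and Begin/End markers; the printer decides which optional
// breaks become newlines.
#include <iostream>
#include <string>
#include <vector>

struct Item {
    enum Kind { Text, OptBreak, Begin, End } kind;
    std::string text;   // used by Text
    int indent;         // used by Begin: extra indentation inside the group
};

void prettyPrint(const std::vector<Item>& items, int width) {
    std::vector<int> indents{0};   // current indentation, one entry per open group
    int column = 0;
    for (size_t i = 0; i < items.size(); ++i) {
        const Item& it = items[i];
        switch (it.kind) {
        case Item::Begin:
            indents.push_back(indents.back() + it.indent);
            break;
        case Item::End:
            indents.pop_back();
            break;
        case Item::OptBreak: {
            // Peek at the next text item; break only if it would not fit.
            size_t j = i + 1;
            while (j < items.size() && items[j].kind != Item::Text) ++j;
            int next = (j < items.size()) ? (int)items[j].text.size() : 0;
            if (column + 1 + next > width) {
                std::cout << "\n" << std::string(indents.back(), ' ');
                column = indents.back();
            } else {
                std::cout << " ";
                ++column;
            }
            break;
        }
        case Item::Text:
            std::cout << it.text;
            column += (int)it.text.size();
            break;
        }
    }
    std::cout << "\n";
}

int main() {
    // Roughly: if (a == b) { doSomethingRatherLong(); doSomethingElseEntirely(); }
    std::vector<Item> items = {
        {Item::Text, "if (a == b) {", 0}, {Item::Begin, "", 4},
        {Item::OptBreak, "", 0}, {Item::Text, "doSomethingRatherLong();", 0},
        {Item::OptBreak, "", 0}, {Item::Text, "doSomethingElseEntirely();", 0},
        {Item::End, "", 0}, {Item::OptBreak, "", 0}, {Item::Text, "}", 0},
    };
    prettyPrint(items, 30);   // a narrow width forces the group to break
}

Oppen's algorithm makes these decisions on-line in linear time with space bounded by the line width, and prefers to break lines higher up in the tree; the greedy version above can make locally poor choices, which is exactly what the two fingers are there to avoid.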
The on-line algorithm might not produce optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)
Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.
It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.
Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism are described by M.G.J. van den Brand in Pretty Printing in the ASF+SDF Meta-environment: Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)
One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:
// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;
Or the common case of trailing comments used to document variables:
/* Tracking the current allocation */
int needed_; // Bytes required.
int current_; // Bytes currently allocated.
// int adjustment_; // (TODO) Why is this needed?
/* Either points to the current allocation, or is 0 */
char* buffer_;
In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.
Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.
OP asks, "Has anyone tried this?"
Yes. Our DMS Software Reengineering Toolkit can do this: you give it just a grammar, you get a parser that builds ASTs, and you get a prettyprinter. We've used this on lots of parse/changeAST/unparse tasks for many languages over the last 20 years, preserving the meaning of the source program exactly.
The process is to parse according to the grammar, build an AST, and then walk the AST to carry out prettyprinting operations.
However, you don't get a good prettyprinter. Nice layout of the reformatted source code requires that language cues for block nesting (e.g., matching '{' ... '}', 'BEGIN' ... 'END' pairs, special keywords 'if', 'for', etc.) be used to drive the formatting and indentation. While one can guess what these elements are (as I just did), that's just a guess and in practice a human being needs to inspect the grammar to determine which things are cues and how to format each construct. (The default prettyprinter derived from the grammar makes such guesses).
DMS provides support for that problem in the form of prettyprinter declarations woven into the grammar to give the formatter engineer quite a lot of control over the layout. (See this SO answer for detailed discussion: https://stackoverflow.com/a/5834775/120163) This produces (in our opinion) pretty good prettyprinters. And DMS does have an explicit grammar/formatter for full C++14. [EDIT Aug 2018: full C++17 in MS and GCC dialects.]
EDIT: rici's answer suggests that comments are difficult to handle. He's right, in the sense that you must handle them, and yes, it is hard to handle them if they are removed as whitespace while parsing. The essence of the problem is "removed as whitespace"; it goes away if you don't do that. DMS provides means to capture the comments (rather than ignoring them as whitespace) and attach them (automatically) to AST nodes. The decision as to which AST node captures a comment is handled in the lexer by declaring comments as "pre" (happening before a token) or "post"; this decision is a heuristic on the part of the grammar/lexer engineer, but it actually works pretty well. The token with comments is passed to the parser, which builds an AST node from it. With comments attached to AST nodes, the prettyprinter can regenerate them, too.
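This is not DMS's actual machinery, but a guess at what "pre"/"post" capture might look like in a hand-rolled lexer: each comment is attached to a neighbouring token instead of being discarded, so it is still present for the prettyprinter later. The types and the same-line heuristic are invented for the sketch:

// Not DMS code; a hand-rolled sketch of the idea: instead of discarding
// comments as whitespace, the lexer attaches each one to a neighbouring token,
// so the parser (and later the prettyprinter) still see them.
#include <string>
#include <vector>

struct RawLexeme {
    bool isComment;
    std::string text;
    int line;                                // source line, used by the heuristic
};

struct Token {
    std::string text;
    int line;
    std::vector<std::string> preComments;    // emitted before the token on output
    std::vector<std::string> postComments;   // trailing comments on the same line
};

// The heuristic (a judgement call, as the answer says): a comment on the same
// line as the previous token is a "post" comment of that token; anything else
// becomes a "pre" comment of the next token.
std::vector<Token> attachComments(const std::vector<RawLexeme>& lexemes) {
    std::vector<Token> tokens;
    std::vector<std::string> pending;        // pre-comments waiting for a token
    for (const RawLexeme& lx : lexemes) {
        if (!lx.isComment) {
            tokens.push_back({lx.text, lx.line, pending, {}});
            pending.clear();
        } else if (!tokens.empty() && tokens.back().line == lx.line) {
            tokens.back().postComments.push_back(lx.text);
        } else {
            pending.push_back(lx.text);
        }
    }
    return tokens;
}

The parser then builds its AST nodes from these tokens and carries the attached comments along, so regenerating them is just a matter of emitting preComments before, and postComments after, the node's own text.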

Uses of writing Lexers and Parsers other than for Compilers?

What kind of problems other than writing compilers can be solved using Lexers and Parsers?
What are the advantages / disadvantages of using Lexers and Parsers over just writing regular expression statements in a programming language.
Are there any situations where only a Lexer or only a Parser is used?
PS: Precise Comparison Examples would be nice
Lexers and parsers are good for computerized interpretation of anything that is a context-free language but not a regular language.
In more practical terms, this means that they're good for interpreting anything that has a defined structure but is beyond the capabilities of (or more difficult to do with) regex.
For instance, it is difficult if not impossible to write a regular expression which will determine if a given document is valid HTML (due to things like tag nesting, escape characters, required attributes, et cetera). On the other hand, it's (relatively) trivial to write a parser for HTML.
Similarly, you would probably not want to even try to write a regex to determine the order of operations in a mathematical expression. On the other hand, a parser can do it easily.
As for your question regarding individual lexers or parsers:
Neither is "necessary" for the other, or at all.
For instance, one could have human-readable words which translate directly to machine opcodes that would get lexed directly into machine code (this would essentially be a very basic "assembly language"). This would not require a parser.
One could also simply write programs in a way that was already expressed in machine-readable individual symbols and thus easy for a machine to parse - for instance, boolean algebra expressions that used only the symbols 0, 1, &, |, ~, (, and ). This would not require a lexer.
Or you could do without either - for instance, Brainfuck needs neither lexing nor parsing because it is simply a set of ordered instructions; the interpreter just maps symbols to things to do. Machine opcodes, similarly, do not require either.
Mostly, lexers and parsers are written to make things nicer and easier. It's nicer not to have to write everything in individual single-meaning glyphs. It's easier to be able to write out complex expressions in whatever way is convenient (say, with parentheses, (3+4)*2) than it is to force ourselves to write them in ways that machines work (say, RPN: 3 4 + 2 *).
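Since the question asks for a precise comparison, here is a small sketch of that last point: a regex (or a trivial scanner) can pull the tokens out of "(3 + 4) * 2" easily enough, but respecting precedence and parentheses takes the mutually recursive functions of a recursive descent parser. The grammar and names are made up for the example, and there is essentially no error handling:

// Minimal sketch: a lexer producing tokens and a recursive descent evaluator
// for +, *, parentheses, and integers. The recursion is what handles nesting
// and the order of operations that a plain regex cannot.
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Lexer {
    std::string src;
    size_t pos = 0;
    // Returns the next token ("+", "*", "(", ")", a number, or "" at the end).
    std::string next() {
        while (pos < src.size() && std::isspace((unsigned char)src[pos])) ++pos;
        if (pos >= src.size()) return "";
        if (std::isdigit((unsigned char)src[pos])) {
            size_t start = pos;
            while (pos < src.size() && std::isdigit((unsigned char)src[pos])) ++pos;
            return src.substr(start, pos - start);
        }
        return std::string(1, src[pos++]);
    }
};

// Grammar, one function per rule:
//   expr   -> term ('+' term)*
//   term   -> factor ('*' factor)*
//   factor -> NUMBER | '(' expr ')'
struct Parser {
    std::vector<std::string> toks;
    size_t pos = 0;

    long expr() {
        long value = term();
        while (pos < toks.size() && toks[pos] == "+") { ++pos; value += term(); }
        return value;
    }
    long term() {
        long value = factor();
        while (pos < toks.size() && toks[pos] == "*") { ++pos; value *= factor(); }
        return value;
    }
    long factor() {
        if (toks[pos] == "(") {
            ++pos;                          // '('
            long value = expr();
            ++pos;                          // ')'
            return value;
        }
        return std::stol(toks[pos++]);      // NUMBER
    }
};

int main() {
    Lexer lexer;
    lexer.src = "(3 + 4) * 2";
    Parser parser;
    for (std::string t = lexer.next(); !t.empty(); t = lexer.next())
        parser.toks.push_back(t);
    std::cout << parser.expr() << "\n";     // prints 14: precedence and nesting respected
}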
A famous example where parsing is more adapted than regular expressions (because the object of processing is, inherently, a non-regular context-free language) is X?(HT)?ML manipulation. See Jeff Atwood's famous blog post on the subject, derived from a famous answer on this site.

Can extended regex implementations parse HTML?

I know what you're thinking - "oh my god, seriously, not again" - but please bear with me, my question is more than the title. Before we begin, I promise I will never try to parse arbitrary HTML with a regex, or ask anyone else how.
All of the many, many answers here explaining why you cannot do this rely on the formal definition of regular expressions. They parse regular languages, HTML is context-free but not regular, so you can't do it. But I have also heard that many regex implementations in various languages are not strictly regular; they come with extra tricks that break outside the bounds of formal regular expressions.
Since I don't know the details of any particular implementations, such as perl, my questions are:
Which features of regex tools are non-regular? Is it the back references? And in which languages are they found?
Are any of these extra tricks sufficient to parse all context-free languages?
If "no" to #2, then is there a formal category or class of languages that these extra features cover exactly? How can we quickly know whether the problem we are trying to solve is within the power of our not-necessarily-regular expressions?
The answer to your question is that yes, so-called “extended regexes” — which are perhaps more properly called patterns than regular expressions in the formal sense — such as those found in Perl and PCRE are indeed capable of recursive descent parsing of context-free grammars.
This posting’s pair of approaches illustrate not so much theoretical as rather practical limits to applying regexes to X/HTML. The first approach given there, the one labelled naïve, is more like the sort you are apt to find in most programs that make such an attempt. This can be made to work on well-defined, non-generic X/HTML, often with very little effort. That is its best application, just as open-ended X/HTML is its worst.
The second approach, labelled wizardly, uses an actual grammar for parsing. As such, it is fully as powerful as any other grammatical approach. However, it is also far beyond the powers of the overwhelming majority of casual programmers. It also risks re-creating a perfectly fine wheel for negative benefit. I wrote it to show what can be done, but which under virtually no circumstances whatsoever ever should be done. I wanted to show people why they want to use a parser on open-ended X/HTML by showing them how devilishly hard it is to come even close to getting right even using some of the most powerful of pattern-matching facilities currently available.
Many have misread my posting as somehow advocating the opposite of what I am actually saying. Please make no mistake: I’m saying that it is far too complicated to use. It is a proof by counter-example. I had hoped that by showing how to do it with regexes, people would realize why they did not want to go down that road. While all things are possible, not all are expedient.
My personal rule of thumb is that if the required regex is of only the first category, I may well use it, but that if it requires the fully grammatical treatment of the second category, I use someone else’s already-written parser. So even though I can write a parser, I see no reason to do so, and plenty not to.
When carefully crafted for that explicit purpose, patterns can be more resilient to malformed X/HTML than off-the-shelf parsers tend to be, particularly if you have no real opportunity to hack on said parsers to make them more resilient to the common failure cases that web browsers tend to tolerate but validators do not. However, the grammatical patterns I provide above were designed for only well-formed but reasonably generic HTML (albeit without entity replacement, which is easily enough added). Error recovery in parsers is a separate issue altogether, and by no means a pleasant one.
Patterns, especially the far more commonplace non-grammatical ones most people are used to seeing and using, are much better suited for grabbing up discrete chunks one at a time than they are for producing a full syntactic analysis. In other words, regexes usually work better for lexing than they do for parsing. Without grammatical regexes, you should not try parsing grammars.
But don’t take that too far. I certainly do not mean to imply that you should immediately turn to a full-blown parser just because you want to tackle something that is recursively defined. The easiest and perhaps most commonly seen example of this sort of thing is a pattern to detect nested items, like parentheses. It’s extremely common for me to just plop down something simple like this in my code, and be done with it:
# delete all nested parens
s/\((?:[^()]*+|(?0))*\)//g;
Yes, the extensions in question are backreferences, and they technically make "regexps" NP-complete; see the Wikipedia paragraph.
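A quick illustration of how far outside "regular" a single backreference already takes you: the language of a word, a space, and then the same word again is a textbook non-regular language, yet one pattern recognizes it. This is just a sketch using std::regex; the specific pattern is only an example:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // "Some word, a space, then the same word again" is not a regular
    // language (classic pumping-lemma example), yet a backreference handles it.
    std::regex doubled(R"((\w+) \1)");
    for (const std::string s : {"abc abc", "abc abd"})
        std::cout << s << ": "
                  << (std::regex_match(s, doubled) ? "match" : "no match") << "\n";
}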

Parse tree for SQL statements - precisely for "SELECT" statement

I am writing a (hand-written) recursive descent parser for the SQL SELECT statement in C++, and I need to know whether the parse tree I create is correct or not. I want to check, but I haven't found good sources for SQL parse trees. My approach is to write a function for each production, and in that function the result is added to the root tree. Can anyone help me? Thanks in advance.
I don't know how you'll go about verifying your code is correct, but if you're concerned about your understanding of the SQL grammar, then here is a website that lists BNF grammars for various dialects of SQL. You ought to be able to construct your parser in terms of these rules.
My company builds a lot of parsers, and we have the same problem. We recently finished a SQL 2011 parser based on the draft standard.
Pretty much you decide if the parse tree is right by hand-inspecting it for many source code cases. This presumes that you can print the parse tree in a form that you can easily inspect; this is easily accomplished by a recursive tree walk of the parse tree. [You have to already believe that your abstract syntax tree nodes correctly model what you intend to capture!] You choose the cases carefully to exercise different parts of the grammar (think "unit tests for grammars"). For a language as rich as SQL, this is a big job.
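As an illustration of that recursive tree walk (the node shape and names below are invented, not taken from any particular SQL parser), something this small is usually enough to make hand inspection of test statements practical:

// Illustration only: a simple hand-rolled AST node and a recursive dump that
// prints the tree with indentation so each test case can be eyeballed.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct AstNode {
    std::string kind;                                // e.g. "select", "column", "from"
    std::string value;                               // an identifier or literal, if any
    std::vector<std::unique_ptr<AstNode>> children;
};

void dump(const AstNode& node, int depth = 0) {
    std::cout << std::string(depth * 2, ' ') << node.kind;
    if (!node.value.empty()) std::cout << " '" << node.value << "'";
    std::cout << "\n";
    for (const auto& child : node.children) dump(*child, depth + 1);
}

int main() {
    // Hand-built tree standing in for the parse of: SELECT name FROM users
    AstNode select;
    select.kind = "select";
    auto columns = std::make_unique<AstNode>();
    columns->kind = "columns";
    auto col = std::make_unique<AstNode>();
    col->kind = "column";
    col->value = "name";
    columns->children.push_back(std::move(col));
    auto from = std::make_unique<AstNode>();
    from->kind = "from";
    from->value = "users";
    select.children.push_back(std::move(columns));
    select.children.push_back(std::move(from));
    dump(select);
}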
You also need to validate that the parser works in general, and you do that by feeding it a lot of real code for the particular dialect of SQL you are handling. I typically try to find 100K-1M SLOC, and if the parser can't eat all of that, I still have work left to do. Once you get to that level, you sort of consider that your parser is OK and treat further errors as "maintenance issues".
While the following may not help you directly, it might hint at a direction in which you could head. I use a somewhat different approach, based on having extremely strong parsing machinery available. Our tool, the DMS Software Reengineering Toolkit, given a grammar, will produce ASTs automatically, and has built-in facilities to print such parse trees (in one form as XML). The AST has sufficient information to regenerate ("prettyprint") the source text, and DMS has a built-in prettyprinter. So after hand inspecting a variety of cases, what I do is to take a large body of code, and for each file, parse it (getting no parse errors by virtue of the work done above), prettyprint the source, and reparse the source (expecting to get no errors). This is strong hint that we haven't lost anything in the round trip.
We have a new tool available, the Smart Differencer, that compares the text of two programs to see if they are "the same" ignoring language layout rules. It works in essence by parsing the two files and comparing their parse trees, ignoring the formatting (line/column/escapes/radix/comments/whitespace). What we are starting to do is to parse the source code, prettyprint it, and then smart-diff the prettyprinted result against the original file. SmartDiff should say "no AST differences". This is a much stronger hint that we haven't lost anything. You can do pretty much the same if you are willing to compare your before-and-after printed parse trees.
This parser, based on pyparsing, might be helpful as a second SELECT parsing resource (although it is in Python, not C++, sorry).

Best practices for writing a programming language parser

Are there any best practices that I should follow while writing a parser?
The received wisdom is to use parser generators + grammars and it seems like good advice, because you are using a rigorous tool and presumably reducing effort and potential for bugs in doing so.
To use a parser generator the grammar has to be context free. If you are designing the language to be parsed then you can control this. If you are not sure then it could cost you a lot of effort if you start down the grammar route. Even if it is context free, in practice, unless the grammar is enormous, it can be simpler to hand code a recursive descent parser.
Being context free not only makes a parser generator possible, it also makes hand-coded parsers a lot simpler. What you end up with is one (or two) functions per phrase. If you organise and name the code cleanly, it is not much harder to read than a grammar (and if your IDE can show you call hierarchies, you can pretty much see what the grammar is).
The advantages:-
Simpler build
Better performance
Better control of output
Can cope with small deviations, e.g. work with a grammar that is not 100% context free
I am not saying grammars are always unsuitable, but often the benefits are minimal and are outweighed by the costs and risks.
(I believe the arguments for them are speciously appealing and that there is a general bias for them as it is a way of signaling that one is more computer-science literate.)
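To make the "one (or two) functions per phrase" point concrete, here is a skeleton for a hypothetical tiny statement language (condition and expression are left as stubs, and there is no bounds checking beyond expect()); the grammar reads straight off the comments and the call structure:

// Skeleton of a hand-coded recursive descent parser for an invented grammar;
// each phrase of the grammar gets its own function, so the code mirrors the rules.
#include <stdexcept>
#include <string>
#include <vector>

struct HandParser {
    std::vector<std::string> tokens;
    size_t pos = 0;

    bool peek(const std::string& t) const {
        return pos < tokens.size() && tokens[pos] == t;
    }
    void expect(const std::string& t) {
        if (!peek(t)) throw std::runtime_error("expected '" + t + "'");
        ++pos;
    }

    // statement   -> ifStatement | assignment
    void statement()   { if (peek("if")) ifStatement(); else assignment(); }

    // ifStatement -> 'if' condition 'then' statement
    void ifStatement() { expect("if"); condition(); expect("then"); statement(); }

    // assignment  -> IDENT '=' expression
    void assignment()  { ++pos; expect("="); expression(); }

    // condition and expression would each get the same treatment, one function apiece.
    void condition()   { ++pos; }
    void expression()  { ++pos; }
};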
Few pieces of advice:
Know your grammar - write it down in a suitable form
Choose the right tool. Do it from within C++ with Spirit2x, or choose external parser tools like antlr, yacc, or whatever suits you
Do you need a parser? Maybe a regexp will suffice? Or maybe hack a perl script to do the trick? Writing complex parsers takes time.
Don't overuse regular expressions - while they have their place, they simply don't have the power to handle any kind of real parsing. You can push them, but you're eventually going to hit a wall or end up with an unmaintainable mess. You're better off finding a parser generator that can handle a larger language set. If you really don't want to get into tools, you can look at recursive descent parsers - it's a really simple pattern for hand-writing a small parser. They aren't as flexible or as powerful as the big parser generators, but they have a much shorter learning curve.
Unless you have very tight performance requirements, try and keep your layers separate - the lexer reads in individual tokens, the parser arranges those into a tree, and then semantic analysis checks over everything and links up references, and then a final phase to output whatever is being produced. Keeping the different parts of logic separate will make things easier to maintain later.
Read most of the Dragon book first.
Parsers are not complicated if you know how to build them, but they are NOT the type of thing where, if you put in enough time, you'll eventually get there. It's way better to build on the existing knowledge base. (Otherwise expect to write it and throw it away a few dozen times.)
Yep. Try to generate it, not write. Consider using yacc, ANTLR, Flex/Bison, Coco/R, GOLD Parser generator, etc. Resort to manually writing a parser only if none of existing parser generators fit your needs.
Choose the right kind of parser; sometimes a recursive descent parser will be enough, sometimes you should use an LR parser (and there are many types of LR parsers).
If you have a complex grammar, build an Abstract Syntax Tree.
Try to identify very well what goes into the lexer, what is part of the syntax and what is a matter of semantics.
Try to make the parser the least coupled to the lexer implementation as possible.
Provide a good interface to the user so that they are agnostic of the parser implementation.
First, don't try to apply the same techniques to parsing everything. There are numerous possible use cases, from something like IP addresses (a bit of ad hoc code) to C++ programs (which need an industrial-strength parser with feedback from the symbol table), and from user input (which needs to be processed very fast) to compilers (which normally can afford to spend a little time parsing). You might want to specify what you're doing if you want useful answers.
Second, have a grammar in mind to parse with. The more complicated it is, the more formal the specification needs to be. Try to err on the side of being too formal.
Third, well, that depends on what you're doing.