Why do on-line parsers seem to stop at regexps? - regex

I've been wondering for a long time why there don't seem to be any parsers for, say, BNF, that behave the way regexps do in various libraries.
Sure, there are things like ANTLR, Yacc and many others that generate code which, in turn, can parse a CFG, but there doesn't seem to be a library that can do that without the intermediate step.
I'm interested in writing a packrat parser, to do away with all those nested-parenthesis quirks associated with regexps (and, perhaps even more so, for the sport of it), but somehow I have this feeling that I'm just walking into another halting-problem-like class of swamps.
Is there a technical/theoretical limitation for these parsers, or am I just missing something?

I think it's more of a cultural thing. The use of context-free grammars is mostly confined to compilers, which typically have code associated with each production rule. In some languages, it's easier to output code than to simulate callbacks. In others, you'll see parser libraries: parser combinators in Haskell, for example. On the other hand, regular expressions see wide use in tools like grep, where it's inconvenient to run the C compiler every time the user gives a new regular expression.

Boost.Spirit looks like what you are after.
If you are looking to make your own, I've used BNFC for my latest compiler project and it provides the grammar used in its own implementation. This might be a good starting point...

There isn't any technical/theoretical limitation lurking in the shadows. I can't say why they aren't more popular, but I know of at least one library that provides this sort of "on-line" parsing that you seek.
SimpleParse is a Python library that lets you simply paste your hairy EBNF grammar into your program and use it to parse things right away, no intermediate steps. I've used it for several projects where I wanted a custom input language but really didn't want to commit to any formal build process.
Here's a tiny example off the top of my head:
from simpleparse.parser import Parser

decl = r"""
root   := expr
expr   := term, ("|", term)*
term   := factor+
factor := ("(", expr, ")") / [a-z]
"""
parser = Parser(decl)
# parse() returns (success flag, result trees, index of the next unparsed character)
success, trees, next_char = parser.parse("(a(b|def)|c)def")
The parser combinator libraries for Haskell and Scala also let you express the grammar for your parser in the same chunk of code that uses it. However you can't, say, let the user type in a grammar at runtime (which might only be of interest to people making software to help people understand grammars anyway).

Pyparsing (http://pyparsing.wikispaces.com) has built-in support for packrat parsing and it is pure Python, so you can see the actual implementation.
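Turning it on is a one-liner; here's a minimal sketch (the little arithmetic grammar is purely illustrative, only ParserElement.enablePackrat() is the documented switch):

from pyparsing import Word, nums, infixNotation, opAssoc, ParserElement

ParserElement.enablePackrat()   # switch pyparsing to packrat (memoizing) parsing

integer = Word(nums)
expr = infixNotation(integer, [
    ("*", 2, opAssoc.LEFT),
    ("+", 2, opAssoc.LEFT),
])

print(expr.parseString("1 + 2 * 3"))   # nested result, '*' bound tighter than '+'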

Because full-blown context-free grammars are confusing enough as they are without some cryptically dense and incomprehensible syntax to make them even more confusing?
It's hard to know what you're asking. Are you trying to create something like a regular expression, but for context-free grammars? Like, using $var =~ /expr = expr + expr/ (in Perl) and having that match "1 + 1" or "1 + 1 + 1" or "1 + 1 + 1 + 1 + 1 + ..."? I think one of the limitations of this is going to be syntax: Having more than about three rules is going to make your "grammar-expression" even more unreadable than any modern-day regular expression.

Side effects are the only thing I can see that will get you. Most parser generators include embedded code for processing, and you would need an eval to make that work.
One way around that would be to name actions and then make an "action" function that takes the name of the action to do and the args to do it with.
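Something like this sketch, say (the action names and handlers below are made up for illustration):

# Grammar rules would carry only an action name; this dispatcher runs it.
def make_assignment(name, value):
    print(f"assign {name} = {value}")

def make_call(func, *args):
    print(f"call {func}({', '.join(args)})")

ACTIONS = {"assign": make_assignment, "call": make_call}

def action(name, *args):
    # Look up the named action and apply it to the parsed arguments.
    return ACTIONS[name](*args)

action("assign", "x", "42")
action("call", "print", "x")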

You could theoretically do it with Boost.Spirit in C++, but it is mainly made for static grammars. I think the reason this is not common is that CFGs are not as commonly used as regexes. I've never had to use a grammar except for compiler construction, but I have used regexes many times. CFGs are generally much more complex than regexes, so it makes sense to generate code statically with a tool like YACC or ANTLR.

tcllib has something like that, if you can put up with Parsing Expression Grammars and also Tcl. If Perl is your thing, CPAN has Parse::Earley. Here's a pure Perl variation which looks promising. PLY seems to be a plausible solution for Python.

Related

Drop-in, portable parsing

I see umpteen posts a day about "how to do X with regexen". And the best response to most of them seems like it would honestly be, "Why are you trying to drive a screw with a hammer?" But regexen are everywhere, and the syntax is mostly portable, particularly if you keep away from the fancy bits.
Is there anything equivalent to regexen but at the next level up in power and configurability? A "you can use it anywhere" parsing library of some variety, preferably with a gloriously concise DSL as its interface?
I've used Ragel somewhat, but because of the preprocessing step, I'd hesitate to recommend it to someone as "use this instead of some hairy regex". It's awkward to use from Obj-C, and I expect it will be terribly awkward from a language that doesn't have compile-link-run as part of its standard operating procedure.
What I'm looking for is something that will pass the "inline-online-universal" test.
(inline) You can write the notation inline with your other code, as you would with a regex.
(online) You can run the resulting parser just as you would your other code, which would mean right after input to a REPL in the case of something like Python.
(universal) You can move to a different language/platform and use virtually the same code for your parser, modulo dialect differences. In reality, I'd be happy with something that works from Python, Ruby, C, Java, and Haskell.
Most tools I know of fall down at "online". They preprocess a grammar offline and spit out code in the target language (C, Python, Java, C++…). They're standalone tools that aren't themselves integrated into the language environment.
I've had suggestions of PEG parsers and lex/yacc combos. Parser combinator libraries might also be a good fit. Whatever you might propose, I'd like to see demonstrated that it meets these tests. Your answer should demonstrate that the proposed solution meets the inline-online-universal requirements by providing a working demo parser in Python, C, and Haskell. The demo example is up to the author, but it should be something painful using just regexen but trivial using a proper parser.
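(For a flavour of the kind of demo being asked for: matching arbitrarily nested parentheses is painful with a plain regex but a one-liner with, say, pyparsing's nestedExpr - inline and online, at least on the Python side; the other languages would need their own equivalents.)

from pyparsing import nestedExpr

print(nestedExpr().parseString("(a (b (c d)) e)"))   # [['a', ['b', ['c', 'd']], 'e']]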
https://github.com/leblancmeneses/NPEG
Implements PEG.
Meets all 3 ... let me explain.
It is inline only with C# and offline with all the others. C# has an offline version also.
I currently support offline versions for C/C++/JavaScript (local right now)/Java, and they all pass the unit tests - that's what makes it universal. Adding another language takes about 25.84 hrs (how long it took to create the offline JavaScript version).
Making it online for every language would be too much maintenance (though possible); it already took me a lot of work and time just to support the current offline versions. I can now focus my energy on building grammar optimizers and tooling to unit test grammar rules, where all offline versions benefit.
Have a look at Lex/Yacc or their counterparts Flex/Bison (or Coco, or all the other "compiler" generators). The combination can be used to parse complex textual data with an (arguably) much more readable syntax than with regexen.
For simple problems though, where regexen are more than sufficient, by all means do use them.

Uses of writing Lexers and Parsers other than for Compilers?

What kind of problems other than writing compilers can be solved using Lexers and Parsers?
What are the advantages / disadvantages of using Lexers and Parsers over just writing regular expression statements in a programming language.
Are there any situations where only a Lexer or only a Parser is used?
PS: Precise Comparison Examples would be nice
Lexers and parsers are good for computerized interpretation of anything that is a context-free language but not a regular language.
In more practical terms, this means that they're good for interpreting anything that has a defined structure but is beyond the capabilities of (or more difficult to do with) regex.
For instance, it is difficult if not impossible to write a regular expression which will determine if a given document is valid HTML (due to things like tag nesting, escape characters, required attributes, et cetera). On the other hand, it's (relatively) trivial to write a parser for HTML.
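As a rough illustration of why the parser route is the easy one, here is the kind of nesting check a single regex can't reliably do, using Python's built-in html.parser (the class name is made up, and void elements like <br> are ignored for the sake of the sketch):

from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    # Track open tags on a stack; a mismatched close tag means bad nesting.
    def __init__(self):
        super().__init__()
        self.stack = []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            raise ValueError(f"mismatched </{tag}>")

NestingChecker().feed("<div><p>hello</p></div>")   # fine
NestingChecker().feed("<div><p>hello</div></p>")   # raises ValueError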
Similarly, you would probably not want to even try to write a regex to determine the order of operations in a mathematical expression. On the other hand, a parser can do it easily.
As for your question regarding individual lexers or parsers:
Neither is "necessary" for the other, or at all.
For instance, one could have human-readable words which translate directly to machine opcodes that would get lexed directly into machine code (this would essentially be a very basic "assembly language"). This would not require a parser.
One could also simply write programs in a way that already was expressed in machine-readable individual symbols and thus easy for a machine to parse - for instance, boolean algebra expressions that used only the symbols 0, 1, &, |, ~, (, and ). This would not require a lexer.
Or you could do without either - for instance, Brainfuck needs neither lexing nor parsing because it is simply a set of ordered instructions; the interpreter just maps symbols to things to do. Machine opcodes, similarly, do not require either.
Mostly, lexers and parsers are written to make things nicer and easier. It's nicer not to have to write everything in individual single-meaning glyphs. It's easier to be able to write out complex expressions in whatever way is convenient (say, with parentheses, (3+4)*2) than it is to force ourselves to write them in ways that machines work (say, RPN: 3 4 + 2 *).
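A quick sketch of that last point in Python (purely illustrative): evaluating the RPN form needs only a stack and a token loop, no parser at all.

import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def eval_rpn(text):
    # Push numbers; on an operator, pop two operands and push the result.
    stack = []
    for token in text.split():
        if token in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[token](a, b))
        else:
            stack.append(float(token))
    return stack.pop()

print(eval_rpn("3 4 + 2 *"))   # 14.0, the same value as the infix (3+4)*2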
A famous example where parsing is better suited than regular expressions (because the thing being processed is, inherently, a non-regular context-free language) is X?(HT)?ML manipulation. See Jeff Atwood's famous blog post on the subject, derived from a famous answer on this site.

Search string parser in C/C++

I work on an open source project focused around Biblical texts. I would like to create a standard string format to build up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from scope of the search, to searching multiple texts, to wildcards, etc.
I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses Lemon to achieve a similar goal. My question is, is using one (or more) of these tools the best way to accomplish this?
In addition to the question, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are either geared towards a programming language or something simple like a calculator rather than parsing a string format.
Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while a programming language parser generates a parse tree from which code is generated).
I assume your syntax will contain rules like the following:
expression : word
| expression AND expression
| expression OR expression
| NOT expression
| '(' expression ')'
all of which are easy to express in Yacc.
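For a concrete feel (shown in Python rather than C, since PLY - Python Lex-Yacc - was mentioned above and mirrors Yacc closely), a sketch of those rules might look like this; the token names and the tuple output are illustrative assumptions, and a C Yacc/Bison grammar would be structurally the same:

import ply.lex as lex
import ply.yacc as yacc

tokens = ("WORD", "AND", "OR", "NOT", "LPAREN", "RPAREN")

t_LPAREN = r"\("
t_RPAREN = r"\)"
t_ignore = " \t"

def t_WORD(t):
    r"[A-Za-z]+"
    if t.value in ("AND", "OR", "NOT"):   # keywords beat plain search words
        t.type = t.value
    return t

def t_error(t):
    t.lexer.skip(1)

precedence = (("left", "OR"), ("left", "AND"), ("right", "NOT"))

def p_expression_word(p):
    "expression : WORD"
    p[0] = ("word", p[1])

def p_expression_binop(p):
    """expression : expression AND expression
                  | expression OR expression"""
    p[0] = (p[2].lower(), p[1], p[3])

def p_expression_not(p):
    "expression : NOT expression"
    p[0] = ("not", p[2])

def p_expression_group(p):
    "expression : LPAREN expression RPAREN"
    p[0] = p[2]

def p_error(p):
    pass

lex.lex()
parser = yacc.yacc()
print(parser.parse("grace AND NOT (law OR works)"))
# ('and', ('word', 'grace'), ('not', ('or', ('word', 'law'), ('word', 'works'))))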
You can look at A Compact Guide to Lex & Yacc which I've found very useful for learning Lex and Yacc
If you're trying to build a parser in C++, have a look at
boost::spirit
It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but using and modifying the samples was straightforward. I also recommend reading the following book:
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy
Keep "syntax error diagnosis and messages" uppermost in your mind - if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea, based on what it has scanned so far, of what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos - genius programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.

When to use a parser generator, and when is regex enough?

I have not gotten into the field of formal languages in computer science yet, so maybe my question is silly. I am writing a simple NMEA parser in C++, and I have to choose:
My first idea was to build a simple finite state machine manually, but then I thought that maybe I could do it with less work, and even more efficiently. I have used regular expressions before, but I think the NMEA regular expression would be very long and would take a "long time" to match.
Then I thought about using a parser generator. I think they all use the same method: they generate an FSA. But I don't know which is more efficient. When do you normally use parser generators instead of regexes (I think you could write a regex in a parser generator)?
Please explain the differences, I'm interested in both theory and experience.
Well, a simple rule of thumb is: If the grammar of the data you are trying to parse is regular, use regular expressions. If it is not, regular expressions may still work (as most regex engines also support non-regular grammars), but it might well be painful (complicated / bad performance).
Another aspect is what you are trying to do with the parsed data. If you are only interested in one field, a regex is probably easier to read. If you need to read deeply nested structures, a parser is likely to be more maintainable.
Regex is a parser-generator.
From wikipedia:
Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
If you're going over a list that only needs to be gone over once, then save the list to a file and read it from there. If you're checking things that are different every time, use regex and store the results in an array or something.
It's much faster than you would assume it to be. I've seen expressions bigger than this post.
Adding that you can nest as much as you'd like, in whatever language you decide to code it in. You could even do it in sections, for maximum re-usability.
As Sneakyness points out, you can have a large and complicated regular expression that is surprisingly powerful. I've seen some examples of this, but none were maintainable by mere mortals. Even using Expresso only helped so much; it was still difficult to understand and risky to modify. So unless you're a savant with a fixation on Grep, I would not recommend this direction.
Instead, consider focusing on the grammar and letting a compiler compiler do the heavy lifting for you.

When is it best to use Regular Expressions over basic string splitting / substring'ing?

It seems that the choice to use string parsing vs. regular expressions comes up on a regular basis for me anytime a situation arises that I need part of a string, information about said string, etc.
The reason that this comes up is that we're evaluating a SOAP header's action, after it has been parsed into something manageable via the OperationContext object for WCF, and then making decisions on that. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
So I have to ask, what's the typical threshold that people use when trying to decide to use RegEx over typical string parsing. Note that I'm not very strong in Regular Expressions, and because of this, I try to shy away unless it's absolutely vital to avoid introducing more complication than I need.
If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.
EDIT: It seems as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to help give clues as to what I was doing, not mislead people.
I'm basically looking for a guideline as to when to use substring, and variations thereof, over Regular Expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted accordingly.
My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.
One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.
For example, consider a tiny "expression language" for evaluating parenthesized arithmetic expressions. Examples of "programs" in this language would look like this:
1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3
A grammar is easy to write, and looks something like this:
DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/" )
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"
With that grammar, you can build a recursive descent parser in a jiffy.
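For instance, a bare-bones sketch in Python (the function names are mine, and it keeps the grammar's right-associativity rather than doing the usual left-to-right evaluation):

import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    # NUMBERs, OPERATORs and parentheses become individual tokens.
    tokens, pos = [], 0
    while pos < len(text.rstrip()):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_expression(tokens):
    # EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
    left = parse_group(tokens) if tokens[0] == "(" else int(tokens.pop(0))
    if tokens and tokens[0] in "+-*/":
        return (tokens.pop(0), left, parse_expression(tokens))
    return left

def parse_group(tokens):
    # GROUP := "(" EXPRESSION ")"
    tokens.pop(0)                       # consume "("
    inner = parse_expression(tokens)
    tokens.pop(0)                       # consume ")"
    return inner

print(parse_expression(tokenize("5 * (10 - 6)")))   # ('*', 5, ('-', 10, 6))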
An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.
Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for context-free grammars and recursive descent parsers.
Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.
I interpreted it as "when should you use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"
Given that interpretation, my answer is: never.
Okay.... one more edit.
I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)
I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:
if (str.equals("DooWahDiddy")) // No problemo.
if (str.contains("destroy the earth")) // Okay.
if (str.indexOf(";") < str.length / 2) // Not bad.
Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.
if (str.startsWith("I") && str.endsWith("Widget") &&
(!str.contains("Monkey") || !str.contains("Pox"))) // Madness.
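(For what it's worth, that "madness" condition does collapse into a single - admittedly dense - pattern; a Python sketch using lookaheads, just to show the regex side of the trade-off:)

import re

# starts with "I", ends with "Widget", and not (contains "Monkey" AND contains "Pox")
pattern = re.compile(r"^(?!(?=.*Monkey)(?=.*Pox))I.*Widget$")

print(bool(pattern.search("IAcmeWidget")))        # True
print(bool(pattern.search("IMonkeyWidget")))      # True  (only one of the two words)
print(bool(pattern.search("IMonkeyPoxWidget")))   # False (contains both)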
Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).
Here's a great reference:
http://www.regular-expressions.info/
PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.
The regex can be
easier to understand
clearer in expressing the intent
much shorter
easier to change/adapt
In some situations all of those advantages would be achieved by using a regex, in others only some are achieved (the regex is not really easy to understand for example) and in yet other situations the regex is harder to understand, obfuscates the intent, longer and hard to change.
The more of those (and possibly other) advantages I gain from the regex, the more likely I am to use them.
Possible rule of thumb: if understanding the regex would take minutes for someone who is somewhat familiar with regular expressions, then you don't want to use it (unless the "normal" code is even more convoluted ;-).
Hm ... still no simple rule-of-thumb, sorry.
[W]e're evaluating a SOAP header's action and making decisions on that
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
To answer your question, in general the usage of regular expressions should be minimized as they're not very readable. Oftentimes you can combine string parsing and regular expressions (perhaps in a loop) to create a much simpler solution than regular expressions alone.
I would agree with what benjismith said, but want to elaborate just a bit. For very simple syntaxes, basic string parsing can work well, but so can regexes. I wouldn't call them overkill. If it works, it works - go with what you find simplest. And for moderate to intermediate string parsing, a regex is usually the way to go.
As soon as you start finding yourself needing to define a grammar, however - i.e. complex string parsing - get back to using some sort of finite state machine or the like as quickly as you can. Regexes simply don't scale well, to use the term loosely. They get complex, hard to interpret, and eventually just incapable.
I've seen at least one project where the use of regexes kept growing and growing and soon they had trouble inserting new functionality. When it finally came time to do a new major release, they dumped all the regexes and went the route of a grammar parser.
When your required transformation isn't basic -- but is still conceptually simple.
No reason to pull out regex if you're doing a straight string replacement, for example... it's easier to just use string.Replace.
On the other hand, a complex rule with many conditionals or special cases that would take more than 50 characters of regex can be a nightmare to maintain later on if you don't explicitly write it out.
I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.
I don't subscribe to the view that regexes are hard or complicated. It's one tool that every developer should learn and learn well. They have a myriad of uses, and once learned, this is exactly the sort of thing you never have to worry about ever again.
Regexes are rarely overkill - if the match is simple, so is the regex.
I would think the easiest way to know when to use regular expressions and when not to is this: when your string search requires an IF/THEN statement or anything resembling this-or-that logic, you need something better than a simple string comparison, and that's where regex shines.