I work on an open source project focused on Biblical texts. I would like to create a standard string format for building up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from the scope of the search to searching multiple texts to wildcards, etc.
I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses Lemon to achieve a similar goal. My question is, is using one (or more) of these tools the best way to accomplish this?
In addition to the question itself, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are geared either towards a programming language or towards something simple like a calculator, rather than towards parsing a string format.
Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while a programming language parser generates a parse tree from which code is generated).
I assume your syntax will contain rules like the following:
    expression : word
               | expression AND expression
               | expression OR expression
               | NOT expression
               | '(' expression ')'
all of which are easy to express in Yacc.
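To make "rules guiding the search" concrete, here is a minimal sketch (all names hypothetical) of the kind of query tree the Yacc actions for that grammar might build, which the search code would then walk:

    #include <memory>
    #include <string>

    // Hypothetical node type: each grammar rule's action builds one of
    // these, and the search engine walks the finished tree.
    struct Query {
        enum Kind { Word, And, Or, Not } kind;
        std::string word;                    // used when kind == Word
        std::unique_ptr<Query> left, right;  // right stays empty for Not
        explicit Query(Kind k, std::string w = "")
            : kind(k), word(std::move(w)) {}
    };

    int main() {
        // What the action for "expression AND expression" might build
        // for the query string: faith AND hope
        auto q = std::make_unique<Query>(Query::And);
        q->left  = std::make_unique<Query>(Query::Word, "faith");
        q->right = std::make_unique<Query>(Query::Word, "hope");
    }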
You can look at A Compact Guide to Lex & Yacc, which I've found very useful for learning Lex and Yacc.
If you're trying to build a parser in C++, have a look at Boost.Spirit.
It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but after that, using and modifying the samples was straightforward. I also recommend reading the following book:
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy
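For a feel of what Spirit code looks like, here is a minimal sketch along the lines of the comma-separated-numbers example from the Spirit tutorials (assuming a reasonably recent Boost):

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;
    namespace ascii = boost::spirit::ascii;

    int main() {
        std::string input = "1.5, 2, 3.25";
        std::vector<double> values;
        auto first = input.begin(), last = input.end();
        // The grammar (comma-separated doubles) is ordinary C++;
        // no separate code-generation step is involved.
        bool ok = qi::phrase_parse(first, last,
                                   qi::double_ % ',',  // grammar
                                   ascii::space,       // skipper
                                   values);
        if (ok && first == last)
            std::cout << "parsed " << values.size() << " numbers\n";
    }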
Keep "syntax error diagnosis and message" in your mind uppermost - if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea based on what it has scanned so far, what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos - genius-programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.
I need to parse a language which is similar to a minimized version of Java. Since efficiency is the most important factor, I chose a hand-written parser instead of LALR parser generators like GOLD, Bison and Yacc.
However, I can't find the theory behind good hand-written parsers. It seems like there are only tutorials on those generators and the mechanisms behind them.
Do I have to drop regular expressions? I can imagine they are slow compared to hand-written tokenizers.
Does anybody know a good class or tutorial for hand-written parsing?
In case it helps, here is (not a class or a tutorial, but) an example of a hand-written parser: https://github.com/tabatkins/css-parser (note, however, that it's explicitly coded for simple, correct correspondence to the specification, not optimized for high performance).
The bigger problem is, I expect, developing the specification for the parsing. Examples of parser specifications include http://dev.w3.org/csswg/css3-syntax/ and a similar one for parsing HTML5.
The prerequisite to using a parser generator is that the language syntax has been defined by a grammar (where the grammar format is supported by the parser generator), instead of by a parsing algorithm.
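To give a flavor of the technique (a minimal sketch, not from any particular class or tutorial): a hand-written recursive-descent parser typically has one function per grammar rule and a trivially simple tokenizer, e.g. for sums of integers:

    #include <cctype>
    #include <stdexcept>
    #include <string>

    // Minimal recursive-descent sketch for the grammar:
    //   expr := num ('+' num)*
    // One function per rule; peek()/pos play the role of the tokenizer.
    struct Parser {
        std::string src;
        std::size_t pos = 0;

        char peek() const { return pos < src.size() ? src[pos] : '\0'; }

        long number() {
            if (!std::isdigit(static_cast<unsigned char>(peek())))
                throw std::runtime_error("expected digit");
            long v = 0;
            while (std::isdigit(static_cast<unsigned char>(peek())))
                v = v * 10 + (src[pos++] - '0');
            return v;
        }

        long expr() {                       // expr := num ('+' num)*
            long v = number();
            while (peek() == '+') { ++pos; v += number(); }
            return v;
        }
    };

    int main() {
        Parser p{"1+22+333"};
        return p.expr() == 356 ? 0 : 1;
    }

As for regular expressions: whether they are too slow depends on the engine; a hand-rolled character-by-character tokenizer like the one above sidesteps the question entirely.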
I'm trying to parse a custom language (not too dissimilar to JSON), and I decided to try using Boost.Xpressive, as it looked fun.
However, when an xpressive match fails, it simply fails. Is there any way I can implement some kind of error reporting, like "the expression matched up until the 47th character"? (I can get the line numbers from that.)
I can sort of see how one could tailor each sub-expression to look for other tokens or matches after looking for the one it wants, and report an error in that case, but it seems that would be a very complex way of doing it.
Is there any functionality in xpressive (or can anyone suggest an approach) that would allow me to do this?
Thanks.
I suggest using ANTLR instead. It is a good compromise between cool, bleeding-edge stuff like Boost Spirit/Qi and stalwart tools like lex and yacc. It can do some amount of smarter error reporting like you want without too much effort.
Note that ANTLR versions 2 and 3 are both in common usage at the moment; version 2 includes C++ code generation whereas version 3 does not, so you might want to stick with the "older" version for now (porting should be fairly straightforward if v3 eventually gains a C++ target).
I am embarking on some learning and I want to write my own syntax highlighting for files in C++.
Can anyone give me ideas on how to go about doing this?
To me it seems that when a file is opened:
1. It would need to be parsed to decide what type of source file it is; trusting the extension might not be fool-proof.
2. There would need to be a way to know which keywords/commands apply to which language.
3. There would need to be a way to decide what color each keyword/command gets.
I want to do this on OS X, using C++ or Objective-C.
Can anyone provide pointers on how I might get started with this?
Syntax highlighters typically don't go beyond lexical analysis, which means you don't have to parse the whole language into statements and declarations and expressions and whatnot. You only have to write a lexer, which is fairly easy with regular expressions. I recommend you start by learning regular expressions, if you haven't already. It'll take all of 30 minutes.
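To make that concrete, here is a minimal sketch (hypothetical, using std::regex rather than any particular tool) of keyword highlighting with a single regular expression:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        std::string code = "int main() { return 0; }";
        // A handful of keywords for illustration; a real highlighter
        // would use the full keyword list for the language.
        std::regex keywords(R"(\b(int|return|if|else|while|for)\b)");
        // Wrap each keyword in a span tag for HTML output.
        std::cout << std::regex_replace(code, keywords,
                                        "<span class=\"kw\">$1</span>")
                  << '\n';
    }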
You may want to consider toying with Flex (the lexical analyzer generator; https://github.com/westes/flex) as a learning exercise. It should be quite easy to implement a basic syntax highlighter in Flex that outputs highlighted HTML or something.
In short, you would give Flex a set of regular expressions and tell it what to do with matching text; the generated lexer then repeatedly matches the longest possible token against your expressions. You can make your lexer transition among exclusive states (e.g. in and out of string literals, comments, etc.) as shown in the Flex FAQ. Here's a canonical example of a lexer for C written in Flex: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
Making an extensible syntax highlighter would be the next part of your journey. Although I am by no means a fan of XML, take a look at how Kate syntax highlighting files are defined, such as this one for C++. Your task would be to figure out how you want to define syntax highlighters, then make a program that uses those definitions to generate HTML or whatever you please.
You may want to look at how GeSHi implements highlighting, etc. In addition, it has a whole bunch of language packs that contain all the keywords you'll ever want.
Assuming that you are using the Cocoa frameworks, you can use UTIs (Uniform Type Identifiers) to determine the file type.
For an overview of the API:
http://developer.apple.com/mac/library/documentation/FileManagement/Conceptual/understanding_utis/understand_utis_intro/understand_utis_intro.html#//apple_ref/doc/uid/TP40001319-CH201-SW1
For a list of known UTIs:
http://developer.apple.com/mac/library/documentation/Miscellaneous/Reference/UTIRef/Articles/System-DeclaredUniformTypeIdentifiers.html#//apple_ref/doc/uid/TP40009259-SW1
The two types you are probably most interested in are kUTTypeObjectiveCPlusPlusSource and kUTTypeCPlusPlusHeader.
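As a hedged sketch (not a complete solution), looking up a UTI from a filename extension with the LaunchServices C API looks roughly like this (compile as C++ or Objective-C and link against CoreServices):

    #include <CoreServices/CoreServices.h>
    #include <iostream>

    int main() {
        // Map a filename extension to a UTI, then test conformance.
        CFStringRef uti = UTTypeCreatePreferredIdentifierForTag(
            kUTTagClassFilenameExtension, CFSTR("cpp"), NULL);
        if (uti != NULL) {
            if (UTTypeConformsTo(uti, CFSTR("public.c-plus-plus-source")))
                std::cout << "Treat as C++ source\n";
            CFRelease(uti);
        }
    }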
For the highlighting you might find the information on this page helpful as it discusses syntax highlighting with an NSView and temporary attributes:
http://www.cocoadev.com/index.pl?ImplementSyntaxHighlightingUsingTemporaryAttributes
I think (1) isn't possible, since the only way to tell if a file is valid C++ is to run it through a C++ parser and see if it parses... but if you used that as your standard, you couldn't operate on code that doesn't compile because it is a work-in-progress, which you probably want to do. It's probably best just to trust the extension, as I don't think any other method will work better than that.
You can get a list of C++ keywords here: http://www.cppreference.com/wiki/keywords/start
The colors are up to you (or if you want, you can make them configurable and leave the choice to the user)
I'm about to start developing a sub-component of an application to evaluate math functions whose operands are C++ objects. This will be accessed via a user interface that provides drag and drop and feedback on appropriate types, followed by an execute button.
I'm quite interested in using Flex and Bison for this, having looked at equation parsing and the like, both here and further afield. What I'm unsure of is whether flex/bison is appropriate when you're trying to parse with custom C++ types. Obviously normal parsing works on text, and this is quite a departure from that, so I wanted to see what people thought and whether I'm trying to put a square peg in a round hole.
What do you think?
Edit
There are some very good sources of information in the links people have provided below. One that looks promising but hasn't been mentioned yet is Boost.Spirit. I was taking a look through the examples earlier today, and there are some informative calculator-based examples in the boost/libs/spirit/examples directory, should you have Boost downloaded and be interested. Their homepage is here.
Please check out muParser.
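For context, a minimal sketch of what using muParser looks like (based on its documented Parser class; treat the details as approximate):

    #include <iostream>
    #include "muParser.h"

    int main() {
        try {
            double x = 2.0;
            mu::Parser p;
            p.DefineVar("x", &x);  // bind a C++ variable into the expression
            p.SetExpr("x^2 + 3*x + 1");
            std::cout << p.Eval() << '\n';  // prints 11
        } catch (mu::Parser::exception_type& e) {
            std::cerr << e.GetMsg() << '\n';
        }
    }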
Flex and Bison are the right tool for parsing arithmetic expressions, equations and the like.
Here are a few examples:
Parsing arithmetic expressions - Bison and Flex. It uses C (not C++), but you can adapt it.
Flex Bison C++ Template/Example. A "framework" to integrate Flex and Bison "into a modern C++ program". An arithmetic calculator is included as an example.
Certainly sounds like a square peg in a round hole to me (unless I grossly misunderstand the question):
Flex would create a state machine to tokenize a stream; in your case, the contents are already tokenized.
Bison sounds a bit more relevant, since it can deal with operator precedence, but integrating with it would be too much of a pain for the relatively small benefit.
I've been wondering for a long time why there don't seem to be any parsers for, say, BNF that behave like the regexes in various libraries.
Sure, there are things like ANTLR, Yacc and many others that generate code which, in turn, can parse a CFG, but there doesn't seem to be a library that can do that without the intermediate step.
I'm interested in writing a Packrat parser, to boot out all those nested-parenthesis quirks associated with regexes (and, perhaps even more so, for the sport of it), but somehow I have this feeling that I'm just walking into another halting-problem-like class of swamps.
Is there a technical/theoretical limitation for these parsers, or am I just missing something?
I think it's more of a cultural thing. The use of context-free grammars is mostly confined to compilers, which typically have code associated with each production rule. In some languages, it's easier to output code than to simulate callbacks. In others, you'll see parser libraries: parser combinators in Haskell, for example. On the other hand, regular expressions see wide use in tools like grep, where it's inconvenient to run the C compiler every time the user gives a new regular expression.
Boost.Spirit looks like what you are after.
If you are looking to make your own, I've used BNFC for my latest compiler project and it provides the grammar used in its own implementation. This might be a good starting point...
There isn't any technical/theoretical limitation lurking in the shadows. I can't say why such parsers aren't more popular, but I know of at least one library that provides the sort of "on-line" parsing you seek.
SimpleParse is a Python library that lets you simply paste your hairy EBNF grammar into your program and use it to parse things right away, with no intermediate steps. I've used it for several projects where I wanted a custom input language but really didn't want to commit to any formal build process.
Here's a tiny example off the top of my head:
decl = r"""
root := expr
expr := term, ("|", term)*
term := factor+
factor := ("(" expr ")") / [a-z]
"""
parser = Parser(decl)
success, trees, next = parser.parse("(a(b|def)|c)def")
The parser combinator libraries for Haskell and Scala also let you express the grammar for your parser in the same chunk of code that uses it. However, you can't, say, let the user type in a grammar at runtime (which might only be of interest to people making software to help people understand grammars anyway).
Pyparsing (http://pyparsing.wikispaces.com) has built-in support for packrat parsing, and it is pure Python, so you can see the actual implementation.
Because full-blown context-free grammars are confusing enough as they are without some cryptically dense and incomprehensible syntax to make them even more confusing?
It's hard to know what you're asking. Are you trying to create something like a regular expression, but for context-free grammars? Like, using $var =~ /expr = expr + expr/ (in Perl) and having that match "1 + 1" or "1 + 1 + 1" or "1 + 1 + 1 + 1 + 1 + ..."? I think one of the limitations of this is going to be syntax: Having more than about three rules is going to make your "grammar-expression" even more unreadable than any modern-day regular expression.
Side effects are the only thing I can see that will get you. Most of the parser generators include embedded code for processing, and you would need an eval to make that work.
One way around that would be to name actions and then make an "action" function that takes the name of the action to do and the args to do it with.
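A minimal sketch of that idea (all names hypothetical): a dispatch table maps action names to functions, so the grammar can stay pure data and no eval is needed:

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Args = std::vector<std::string>;
    using Action = std::function<void(const Args&)>;

    int main() {
        // The host program registers implementations by name...
        std::map<std::string, Action> actions;
        actions["print"] = [](const Args& a) {
            for (const auto& s : a) std::cout << s << ' ';
            std::cout << '\n';
        };
        // ...and the parser, on reducing a rule tagged "print",
        // looks the action up and calls it with the rule's arguments.
        actions.at("print")({"hello", "world"});
    }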
You could theoretically do it with Boost.Spirit in C++, but it is mainly made for static grammars. I think the reason this is not common is that CFGs are not as commonly used as regexes. I've never had to use a grammar except for compiler construction, but I have used regexes many times. CFGs are generally much more complex than regexes, so it makes sense to generate code statically with a tool like YACC or ANTLR.
tcllib has something like that, if you can put up with Parsing Expression Grammars and also Tcl. If Perl is your thing, CPAN has Parse::Earley. Here's a pure-Perl variation which looks promising. PLY seems to be a plausible solution for Python.