Extracting information using BNF grammars - c++

I would like to extract information from a body of text and be able to query it.
The structure of this body of text would be specified by a BNF grammar (or variant), and the information to extract would be specified at runtime (the syntax of the query does not matter at the moment).
So the requirements are simple, really:
Receive some structured body of text
Load it in an exploitable form using a grammar to parse it
Run a query to select some portions of it
To illustrate with an example, suppose that we have such grammar (in a customized BNF format):
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<id> ::= 15 * <digit>
<hex> ::= 10 * (<digit> | a | b | c | d | e | f)
<anything> ::= <digit> | .... (all characters)
<match> ::= <id> (" " <hex>)*
<nomatch> ::= "." <anything>*
<line> ::= (<match> | <nomatch> | "") [<CR>] <LF>
<text> ::= <line>+
The following text would conform to it:
012345678901234
012345678901234 abcdef0123
Nor the previous line nor this one would match
And then I would want to list all tags that appear in the rule, so for example using an XPath like syntax:
match//id
which would return a list.
This sounds relatively easy, except that I have two big constraints:
the BNF grammar should be read at runtime (from a string/vector like structure)
the queries will be read at runtime too
Some clarifications:
the grammar is not expected to change often so a "compilation" step to produce an in-memory structure is acceptable (and perhaps necessary to achieve good speed)
speed is of the essence, bonus points for on-the-fly collection of the wanted portions
bonus points for the possibility to have callbacks to disambiguate (sometimes the necessary disambiguation information might require DB access for example)
bonus points for multipart grammars (favoring modularity and reuse of grammar elements)
I know of lex/yacc and flex/bison, for example; however, they appear to only generate C/C++ code to be compiled, which is not what I am looking for.
Do you know of a robust library (preferably free and open-source) that can transform a BNF grammar into a parser "on-the-fly" and produce a structured in-memory output from a body of text using this parser?
EDIT: I am open to alternatives. My initial idea was that regexes might allow this extraction; however, given the complexity of the grammars involved, this could get ugly quickly, and maintaining the regexes would be a horrendous task. Furthermore, by separating grammars and extraction I hope to be able to reuse the same grammar for different extraction needs rather than having slightly different regexes each time.

I have a proprietary solution that can convert grammar source into an in-memory representation. The result is a pure data structure. Any code can use it. I also have a C++ class that actually implements the parser. Rule handlers are implemented as virtual methods.
The primary difference between our solution and YACC/Bison is that no C/C++ code is generated. This means that grammar can be reloaded without recompiling the app. The grammar can be annotated with application IDs that are used in the code of the rule handlers.
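To make the idea concrete, here is a minimal sketch of a grammar held as plain data and interpreted at runtime, with rule handlers as virtual methods. It is not the proprietary solution described above: the `Node`, `Parser`, `Collector` and `exampleGrammar` names are hypothetical, matching is greedy without backtracking into repetitions, and handlers fire as soon as a rule matches, even if an enclosing rule fails later.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical grammar representation: plain data, no generated code.
struct Node {
    enum Kind { Literal, CharSet, Sequence, Choice, Repeat, Rule } kind;
    std::string text;            // Literal: text; CharSet: allowed chars; Rule: rule name
    std::vector<Node> children;  // Sequence/Choice: parts; Repeat: the repeated node
    std::size_t min, max;        // Repeat bounds (unused by other kinds)
};
using Grammar = std::map<std::string, Node>;
const std::size_t kUnbounded = static_cast<std::size_t>(-1);

Node lit(std::string s)       { return Node{Node::Literal, std::move(s), {}, 0, 0}; }
Node chars(std::string s)     { return Node{Node::CharSet, std::move(s), {}, 0, 0}; }
Node ref(std::string s)       { return Node{Node::Rule, std::move(s), {}, 0, 0}; }
Node seq(std::vector<Node> c) { return Node{Node::Sequence, "", std::move(c), 0, 0}; }
Node alt(std::vector<Node> c) { return Node{Node::Choice, "", std::move(c), 0, 0}; }
Node rep(Node c, std::size_t min, std::size_t max) {
    return Node{Node::Repeat, "", {std::move(c)}, min, max};
}

class Parser {
public:
    explicit Parser(const Grammar& g) : grammar_(g) {}
    virtual ~Parser() = default;
    // True if `rule` matches the whole input.
    bool parse(const std::string& rule, const std::string& input) {
        std::size_t pos = 0;
        return match(ref(rule), input, pos) && pos == input.size();
    }
protected:
    // Rule handler: called each time a named rule matches.
    virtual void onRule(const std::string& /*name*/, const std::string& /*matched*/) {}
private:
    // On failure, `pos` is always left where it started.
    bool match(const Node& n, const std::string& in, std::size_t& pos) {
        const std::size_t start = pos;
        switch (n.kind) {
        case Node::Literal:
            if (in.compare(pos, n.text.size(), n.text) != 0) return false;
            pos += n.text.size();
            return true;
        case Node::CharSet:
            if (pos >= in.size() || n.text.find(in[pos]) == std::string::npos) return false;
            ++pos;
            return true;
        case Node::Sequence:
            for (const Node& c : n.children)
                if (!match(c, in, pos)) { pos = start; return false; }
            return true;
        case Node::Choice:  // ordered choice: the first matching alternative wins
            for (const Node& c : n.children)
                if (match(c, in, pos)) return true;
            return false;
        case Node::Repeat: {
            std::size_t count = 0;
            while (count < n.max) {
                const std::size_t before = pos;
                if (!match(n.children[0], in, pos)) break;
                ++count;
                if (pos == before) break;  // do not loop forever on an empty match
            }
            if (count < n.min) { pos = start; return false; }
            return true;
        }
        case Node::Rule: {
            const auto it = grammar_.find(n.text);
            if (it == grammar_.end() || !match(it->second, in, pos)) return false;
            onRule(n.text, in.substr(start, pos - start));
            return true;
        }
        }
        return false;
    }
    const Grammar& grammar_;
};

// Collects every <id> on the fly, roughly what a query like match//id asks for.
struct Collector : Parser {
    using Parser::Parser;
    std::vector<std::string> ids;
    void onRule(const std::string& name, const std::string& matched) override {
        if (name == "id") ids.push_back(matched);
    }
};

// Part of the grammar from the question, built at runtime.
Grammar exampleGrammar() {
    Grammar g;
    g.emplace("id", rep(chars("0123456789"), 15, 15));
    g.emplace("hex", rep(chars("0123456789abcdef"), 10, 10));
    g.emplace("match", seq({ref("id"), rep(seq({lit(" "), ref("hex")}), 0, kUnbounded)}));
    return g;
}
```

Parsing `012345678901234 abcdef0123` with the `match` rule through a `Collector` leaves the 15-digit identifier in `ids`. Reading the grammar from text instead of building it with helper calls would only add a front end that produces the same `Grammar` map.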

The GOLD parser system produces an LALR parse table that, AFAIK, is loaded at runtime. I believe it has a C++ "parsing" engine, so that should be easy to integrate.
You'd read your grammar, fork a subprocess to get the GOLD parser generator to produce the table, and then call your wired-in GOLD parser to load-and-parse.
I don't know how you attach actions to the reductions, which you'd probably like to do.
I have no specific experience with GOLD. "Gold" luck to you.

Related

Dynamically switch parser while parsing

I'm parsing spice netlists, for which I already have a parser. Since I actually use spectre (cadence, integrated electronics), I want to support both simulator languages (they differ, unfortunately). I could use a switch (e.g. commandline) and use the correct parser from start. However, spectre allows simulator lang=spectre statements, which I would also want to support (and vice versa, of course). How can this be done with boost::spirit?
My grammar looks roughly like this:
line = component_parser |
command_parser |
comment_parser |
subcircuit_parser |
subcircuit_instance_parser;
main = -line % qi::eol >> qi::eoi;
This toplevel structure is fine for both languages, so I need to change the subparsers. My first idea would be to have the toplevel parser hold instances (or objects) of the respective parsers and to switch on finding the simulator lang statement (with a semantic action). Is this a good approach? If not, how else would one do this?
You can use qi::lazy (https://www.boost.org/doc/libs/1_68_0/libs/spirit/doc/html/spirit/qi/reference/auxiliary/lazy.html).
There's an idiomatic pattern related to that, known as The Nabialek Trick.
I have several answers up on this site that show these various techniques.
https://stackoverflow.com/search?q=user%3A85371+qi%3A%3Alazy
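As I understand it, the Nabialek Trick keeps pointers to rules in a symbol table and uses qi::lazy to invoke whichever rule the last keyword selected. The sketch below shows only that underlying idea in plain C++, without Boost.Spirit: a table maps a language name to the parser to use for the following lines. The `parseSpice` and `parseSpectre` functions are hypothetical stand-ins for real sub-parsers, and the default language is an assumption.

```cpp
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-language line parsers; real ones would build an AST.
using LineParser = std::function<std::string(const std::string&)>;

std::string parseSpice(const std::string& line)   { return "spice:" + line; }
std::string parseSpectre(const std::string& line) { return "spectre:" + line; }

std::vector<std::string> parseNetlist(const std::string& text) {
    // The "symbol table": language keyword -> parser to use from now on.
    const std::map<std::string, LineParser> parsers = {
        {"spice", parseSpice}, {"spectre", parseSpectre}};
    const std::string directive = "simulator lang=";
    const LineParser* current = &parsers.at("spice");  // assumed default language

    std::vector<std::string> out;
    std::istringstream in(text);
    for (std::string line; std::getline(in, line);) {
        if (line.compare(0, directive.size(), directive) == 0) {
            // Switch the active parser; unknown languages leave it unchanged.
            const auto it = parsers.find(line.substr(directive.size()));
            if (it != parsers.end()) current = &it->second;
            continue;
        }
        if (!line.empty()) out.push_back((*current)(line));
    }
    return out;
}
```

In Spirit the switch would happen inside a semantic action and the dispatch through qi::lazy, but the control flow is the same: the toplevel loop stays fixed and only the pointed-to sub-parser changes.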

Should regex be used in a parser for an interpreter or compiler?

When parsing a grammar, should RegEx be used to match grammars that can be expressed as regular languages or should the current parser design be used exclusively?
For example, the EBNF grammar for JSON can be expressed as:
object ::= '{' '}' | '{' members '}';
members ::= pair | pair ',' members;
pair ::= string ':' value;
array ::= '[' ']' | '[' elements ']';
elements ::= value | value ',' elements;
value ::= string | number | object | array | 'true' | 'false' | 'null';
So the grammar would need to be matched using some type of parser (such as a recursive descent parser or an ad hoc parser), but the grammar for some of the values (such as the number) can be expressed as a regular language, like this RegEx pattern for number:
-?\d+(\.\d+)?([eE][+-]?\d+)?
Given this example, assuming one is creating a recursive descent JSON parser... should the number be matched via the recursive descent technique or should the number be matched via RegEx since it can be matched easily using RegEx?
This is a very broad and opinion-based question. That said, you will usually want a parser to be as fast as possible and to have the smallest possible memory footprint, especially if it needs to parse in real time (on demand).
A RegEx will surely do the job, but it is like shooting a fly with a nuclear weapon!
This is why many parsers are written in a low-level language like C, to take advantage of string pointers and avoid the overhead of high-level languages like Java (immutable fields, garbage collector, ...).
Meanwhile, this heavily depends on your use case and cannot be truly answered in a generic way. You should consider the tradeoff between the developer's convenience to use RegEx versus the performance of the parser.
One additional consideration is that you will usually want the parser to indicate where a syntax error occurred and what kind of error it is. A RegEx will simply fail to match, and you will have a hard time finding out why it stopped in order to display a proper error message. With an old-school parser, you can stop parsing as soon as you encounter a syntax error, and you know precisely what did not match and where.
In your specific case for JSON parsing and using RegEx only for numbers, I suppose you are probably using a high-level language already, so what many implementations do is to rely on the language's native parsing for numbers. So just pick the value (string, number,...) using the delimiters and let the programming language throw an exception for number parsing.
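For comparison, here is a minimal hand-written scanner for the same number syntax as the RegEx in the question, sketched in C++. The `ScanResult` and `scanNumber` names are hypothetical. Unlike a failed RegEx match, it reports the position where scanning stopped and what it expected there. One difference from the RegEx: it rejects a dangling `.` or exponent marker instead of silently matching a shorter prefix.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct ScanResult {
    bool ok;            // true if a number was recognized starting at `start`
    std::size_t end;    // success: one past the number; failure: offending position
    const char* error;  // empty string on success
};

// Hand-written equivalent of -?\d+(\.\d+)?([eE][+-]?\d+)?
ScanResult scanNumber(const std::string& s, std::size_t start) {
    std::size_t i = start;
    // Consumes a run of digits; returns false if there was none.
    auto digits = [&]() {
        const std::size_t first = i;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
        return i > first;
    };
    if (i < s.size() && s[i] == '-') ++i;
    if (!digits()) return {false, i, "expected digit"};
    if (i < s.size() && s[i] == '.') {
        ++i;
        if (!digits()) return {false, i, "expected digit after '.'"};
    }
    if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
        if (!digits()) return {false, i, "expected digit in exponent"};
    }
    return {true, i, ""};
}
```

On success, the substring from `start` to `end` can be handed to the language's native conversion (for example `std::stod`) to get the value, which matches the advice above.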

How Can I demonstrate this grammar is not ambiguous?

I know I need to show that there is no string with two different leftmost derivations, and hence two different parse trees. But how can I do it? I know there is no simple way of doing it, but since this exercise is in the Dragon book on compilers, I am pretty sure there is a way of showing it (it need not be a formal proof, just a justification).
The grammar is:
S-> SS* | SS+ | a
What this grammar represents is another way of writing simple arithmetic (I do not remember the name of this technique; if anyone knows, please tell me): normal arithmetic has the form a+a, and this is just another way of writing sums and products. So aa+ also means a+a, aaa*+ is a*a+a, and so on.
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer).
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a symbol, you know precisely which production to use to reduce; there is only one which can apply, and there is no left-hand side you can shift into.
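A minimal C++ sketch of that intuition follows (the function name is hypothetical, and this is a hand-written recognizer rather than a table generated from the LR(0) construction). Because the parse stack only ever holds S symbols, a counter is enough, and at every input symbol exactly one action is possible.

```cpp
#include <cstddef>
#include <string>

// Deterministic shift-reduce recognizer for  S -> S S * | S S + | a
bool isPostfixSentence(const std::string& input) {
    std::size_t depth = 0;  // number of S symbols currently on the parse stack
    for (char c : input) {
        if (c == 'a') {
            ++depth;                      // shift 'a', then reduce by S -> a
        } else if (c == '*' || c == '+') {
            if (depth < 2) return false;  // the operator needs S S beneath it
            --depth;                      // reduce S S op to a single S
        } else {
            return false;                 // symbol not in the alphabet
        }
    }
    return depth == 1;  // accept only if exactly one S remains
}
```

There is never a choice between two reductions or between shifting and reducing, which is the informal reason each accepted string has exactly one parse tree.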
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to prove ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
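The two parses correspond to grouping the sentence as (a + a) + a or as a + (a + a). As a rough check, the following C++ sketch counts parse trees under the infix grammar by trying every operator as the top-level one. The `countParses` name is hypothetical, and the input is assumed to be written without spaces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts parse trees of `s` under  S -> S * S | S + S | a
unsigned long countParses(const std::string& s) {
    const std::size_t n = s.size();
    if (n == 0) return 0;
    // ways[i][j] = number of parse trees deriving s[i..j] (inclusive) from S
    std::vector<std::vector<unsigned long>> ways(n, std::vector<unsigned long>(n, 0));
    for (std::size_t len = 1; len <= n; ++len) {
        for (std::size_t i = 0; i + len <= n; ++i) {
            const std::size_t j = i + len - 1;
            if (len == 1) {
                ways[i][j] = (s[i] == 'a') ? 1 : 0;  // S -> a
                continue;
            }
            // k is the position of the operator at the root of the tree
            for (std::size_t k = i + 1; k < j; ++k)
                if (s[k] == '*' || s[k] == '+')
                    ways[i][j] += ways[i][k - 1] * ways[k + 1][j];
        }
    }
    return ways[0][n - 1];
}
```

For `a+a+a` this returns 2, while `a+a` has a single tree; any count above 1 demonstrates ambiguity.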
I think the example string aa actually shows what you need. Can it not be parsed as:
S => SS* => aa OR S => SS+ => aa

How to implement a computer language translator using two separate grammars

I want to create a computer language translator between two languages LANG1 and LANG2. More specifically, I want to translate code written in LANG1 into source code in LANG2.
I have the BNF Grammar for both LANG1 and LANG2.
LANG1 is a small DSL I wrote by myself, and is essentially, an "easier" version of LANG2.
I want to be able to generate statements in LANG2, from input statements written in LANG1.
I am in the process of compiling a compiler for LANG1, but I don't know what to do next (in order to translate LANG1 statements to LANG2 statements).
My understanding of the steps involved is as follows:
1. BNF for my DSL (LANG1) DONE
2. Reverse engineered the BNF for LANG2 DONE
3. Learning how to generate a compiler for LANG1 TODO
4. Translate LANG1 statements to LANG2 statements ???
What are the steps involved in order to generate LANG2 statements from LANG1 statements?
My code base is in C++, so I can use a parser generated in either C or C++.
PS: I will be using ANTLR3 to generate the compiler for LANG1
In the most general case you have to translate each possible section of the grammar from LANG1 into something appropriate for LANG2, or you have to dumb down into the simplest possible primitives of both languages, like assembly or combinators. This is a little time consuming and not very fun.
However, if the grammars are equivalent or share a lot of common ground, you might be able to get away with just parsing into the same tree for both grammars and having output functions that take your standardized tree and convert it back into LANG1 or LANG2 source (which is mostly the same as the general case but takes a lot more shortcuts).
EDIT: I've just reread your question and realized that you only want to translate one way, so you need only worry about making the form of the tree suit LANG1 and just have your translate function for LANG2. But I hope my example is helpful anyway.
Example Tree construction:
Here are two different ANTLR grammars that produce the same extremely simple AST:
Grammar 1
The first one is the standard way of expressing addition:
grammar simpleAdd;
options {output=AST;}
tokens {
PLUS = '+';
}
expression : addition EOF!;
addition : NUMBER (PLUS NUMBER)+ -> ^(PLUS NUMBER*);
NUMBER :'0'..'9'+;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n')+ { $channel = HIDDEN; } ;
This will accept two or more integers and produce a tree with a PLUS node and all the numbers in a list to be added together. E.g.,
1 + 1 + 2 + 3 + 5
Grammar 2
The second grammar takes a less elegant form:
grammar horribleAdd;
options {output=AST;}
tokens {
PLUS = '+';
PLUS_FUNC = 'plus';
COMMA = ',';
LEFT_PARENS ='(';
RIGHT_PARENS=')';
}
expression : addition EOF!;
addition : PLUS_FUNC LEFT_PARENS NUMBER (COMMA NUMBER)+ RIGHT_PARENS -> ^(PLUS NUMBER*);
NUMBER :'0'..'9'+;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n')+ { $channel = HIDDEN; } ;
This grammar expects numbers given to a function (yes, I know functions don't really work like this, I'm just trying to keep the example as clear as possible). E.g.,
plus(1, 1, 2, 3, 5)
It produces exactly the same tree as the first grammar (a PLUS node with the numbers as children).
Now you don't need to worry what language your instructions came from, you can output it in whatever form you like. All you need to do is write a function that can convert this AST back into a language of your choosing.
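Since the question mentions a C++ code base, such output functions might look like the sketch below. The `PlusNode` struct is a hypothetical stand-in for the tree ANTLR would build (a PLUS node with the numbers as children), not ANTLR's own tree type.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical language-neutral AST: a PLUS node whose children are numbers.
struct PlusNode {
    std::vector<int> numbers;
};

// Joins the numbers with the given separator.
std::string join(const std::vector<int>& values, const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out += sep;
        out += std::to_string(values[i]);
    }
    return out;
}

// Emits the syntax accepted by the first grammar: 1 + 1 + 2
std::string emitInfix(const PlusNode& node) {
    return join(node.numbers, " + ");
}

// Emits the syntax accepted by the second grammar: plus(1, 1, 2)
std::string emitCall(const PlusNode& node) {
    return "plus(" + join(node.numbers, ", ") + ")";
}
```

Translating one way then amounts to parsing with the LANG1 grammar and calling only the LANG2 emitter on the resulting tree.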

Dynamically Describing Mathematical Rules

I want to be able to specify mathematical rules in an external location to my application and have my application process them at run time and perform actions (again described externally). What is the best way to do this?
For example, I might want to execute function MyFunction1() when the following evaluates to true:
(a < b) & MyFunction2() & (myWord == "test").
Thanks in advance for your help.
(If it is of any relevance, I wish to use C++, C or C++/CLI)
I'd consider not reinventing the wheel --- use an embedded scripting engine. This means you'd be using a standard form for describing the actions and logic. There are several great options out there that will probably be fine for your needs.
Good options include:
JavaScript through Google V8. (I don't love this from an embedding point of view, but JavaScript is easy to work with, and many people already know it.)
Lua. It's fast and portable. The syntax is maybe not as nice as JavaScript's, but embedding is easy.
Python. Clean syntax, lots of libraries. Not much fun to embed though.
I'd consider using SWIG to help generate the bindings ... I know it works for python and lua, not sure about v8.
I would look at the command design pattern to handle calling external mathematical predicates, and the Factory design pattern to run externally defined code.
If your mathematical expression language is that simple then you could define a grammar for it, e.g.:
expr = bin-op-expr | rel-expr | func-expr | var-expr | "(" expr ")"
bin-op = "&" | "|" | "!"
bin-op-expr = expr bin-op expr
rel-op = "<" | ">" | "==" | "!=" | "<=" | ">="
rel-expr = expr rel-op expr
func-args = "(" ")"
func-expr = func-name func-args
var-expr = name
and then translate that into a grammar for a parser. E.g. you could use Boost.Spirit which provides a DSL to allow you to express a grammar within your C++ code.
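If pulling in a parser library is too much, a grammar this small can also be handled by a hand-written recursive descent evaluator. The sketch below is not Boost.Spirit code, and it makes several assumptions: the grammar is restructured to avoid left recursion, `&` and `|` share one precedence level and associate to the left, operands are integers or integer variables looked up in a map, and function calls and string comparison are left out. The `Evaluator` class name is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

class Evaluator {
public:
    Evaluator(std::string text, std::map<std::string, long> vars)
        : text_(std::move(text)), vars_(std::move(vars)) {}

    // Evaluates the whole text; throws std::runtime_error on a syntax error.
    long evaluate() {
        pos_ = 0;
        const long value = expr();
        skipSpaces();
        if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
        return value;
    }

private:
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool isAlnum(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
    void skipSpaces() {
        while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
    }
    // Consumes `token` if it comes next (after optional whitespace).
    bool accept(const std::string& token) {
        skipSpaces();
        if (text_.compare(pos_, token.size(), token) != 0) return false;
        pos_ += token.size();
        return true;
    }
    long expr() {  // expr := rel (('&' | '|') rel)*
        long left = rel();
        for (;;) {
            if (accept("&"))      { const long r = rel(); left = (left != 0 && r != 0); }
            else if (accept("|")) { const long r = rel(); left = (left != 0 || r != 0); }
            else return left;
        }
    }
    long rel() {  // rel := primary (rel-op primary)?
        const long left = primary();
        if (accept("<=")) return left <= primary();
        if (accept(">=")) return left >= primary();
        if (accept("==")) return left == primary();
        if (accept("!=")) return left != primary();
        if (accept("<"))  return left < primary();
        if (accept(">"))  return left > primary();
        return left;
    }
    long primary() {  // primary := number | name | '(' expr ')'
        if (accept("(")) {
            const long value = expr();
            if (!accept(")")) throw std::runtime_error("expected ')'");
            return value;
        }
        skipSpaces();
        const std::size_t start = pos_;
        const bool number = pos_ < text_.size() && isDigit(text_[pos_]);
        while (pos_ < text_.size() && (number ? isDigit(text_[pos_]) : isAlnum(text_[pos_]))) ++pos_;
        if (pos_ == start) throw std::runtime_error("expected operand");
        const std::string word = text_.substr(start, pos_ - start);
        if (number) return std::stol(word);
        const auto it = vars_.find(word);
        if (it == vars_.end()) throw std::runtime_error("unknown variable: " + word);
        return it->second;
    }

    std::string text_;
    std::map<std::string, long> vars_;
    std::size_t pos_ = 0;
};
```

With `a = 1`, `b = 2` and `c = 3`, evaluating `(a < b) & (c == 3)` yields 1 (true). Function calls such as MyFunction2() could be added as another alternative in `primary`, dispatching through a map from names to callables.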
If that calculation happens in an inner loop and you want high performance, you cannot go with scripting languages. Depending on how "deployable" and how platform-independent you would like it to be:
1) You could express the equations in C++ and let g++ compile it for you at run-time, and you could link to the resulting shared object. But this method is very much platform dependent at every step! The necessary system calls, the compiler to use, the flags, loading a shared object (or a DLL)... That would be super-fast in the end though, especially if you compile the innermost loop with the equation. The equation would be inlined and all.
2) You could use Java in the same way. You can get a nice Java compiler written in Java (from Eclipse I think, but you can embed it easily). With this solution, the result would be slightly slower (depending on how much template magic you want), I would expect, by a factor of 2 for most purposes. But this solution is extremely portable. Once you get it working, there's no reason it shouldn't work anywhere, and you don't need anything external to your program. Another downside is having to write your equations in Java syntax, which is ugly for complex math. The first solution is much better in that respect, since operator overloading greatly helps math equations.
3) I don't know much about C#, but there could be a solution similar to (2). If there is, I know that there's operator overloading in C#, so your equations would be more pleasant to write and look at.