Dynamically switch parser while parsing - c++

I'm parsing SPICE netlists, for which I already have a parser. Since I actually use Spectre (Cadence's simulator for integrated electronics), I want to support both simulator languages (they differ, unfortunately). I could use a switch (e.g. on the command line) and use the correct parser from the start. However, Spectre allows simulator lang=spectre statements, which I would also want to support (and vice versa, of course). How can this be done with boost::spirit?
My grammar looks roughly like this:
line = component_parser
     | command_parser
     | comment_parser
     | subcircuit_parser
     | subcircuit_instance_parser;
main = -line % qi::eol >> qi::eoi;
This toplevel structure is fine for both languages, so I need to change only the subparsers. A first idea would be to have the toplevel parser hold instances of (or references to) both sets of sub-parsers and to switch between them on finding the simulator lang statement (with a semantic action). Is this a good approach? If not, how else would one do this?

You can use qi::lazy (https://www.boost.org/doc/libs/1_68_0/libs/spirit/doc/html/spirit/qi/reference/auxiliary/lazy.html).
There's an idiomatic pattern related to that, known as The Nabialek Trick.
I have several answers up on this site that show these various techniques.
https://stackoverflow.com/search?q=user%3A85371+qi%3A%3Alazy
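For the dialect-switching case specifically, a minimal sketch of that idea (the two sub-rules below are placeholders, not your real component/command/comment parsers): keep a pointer to the currently active line parser, dispatch through qi::lazy, and flip the pointer in a semantic action when the simulator lang statement is seen.

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

namespace qi  = boost::spirit::qi;
namespace phx = boost::phoenix;

template <typename It>
struct netlist : qi::grammar<It, qi::blank_type> {
    netlist() : netlist::base_type(main) {
        // placeholders standing in for the real dialect-specific line grammars
        spice_line   = qi::char_('*') >> *(qi::char_ - qi::eol);
        spectre_line = qi::lit("//")  >> *(qi::char_ - qi::eol);

        active = &spice_line;                       // start out in spice mode

        // "simulator lang=spectre" (or =spice) flips the active sub-parser
        lang_switch = qi::lit("simulator") >> "lang" >> '='
            >> ( qi::lit("spectre")[phx::ref(active) = &spectre_line]
               | qi::lit("spice")  [phx::ref(active) = &spice_line] );

        // every line is dispatched through whatever parser is active right now
        line = lang_switch | qi::lazy(*phx::ref(active));

        main = -line % qi::eol >> qi::eoi;
    }

    qi::rule<It, qi::blank_type> spice_line, spectre_line, lang_switch, line, main;
    qi::rule<It, qi::blank_type> const* active;
};

You would then parse with qi::phrase_parse(first, last, netlist<It>(), qi::blank). The Nabialek Trick is the generalisation of this: a qi::symbols table maps keywords to rule pointers and qi::lazy invokes whatever rule the table lookup produced.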

Related

How to parse DSL input to high performance expression template

(EDITED both title and main text and created a spin-off question that arose)
For our application it would be ideal to parse a simple DSL of logical expressions. However, the way I'd like to do this is to parse (at runtime) the input text containing the expressions into some lazily evaluated structure (an expression template), which can then be used later within more performance-sensitive code.
Ideally the evaluation is as fast as possible using this technique, as it will be run a large number of times with different values substituted into the placeholders each time. I'm not expecting the expression template to match the performance of, say, a hardcoded function that models the same expression as the given input text string, i.e. there is no need to go down the route of actually compiling, say, C++ in situ within a running program (I believe other questions cover dynamic library compiling/loading).
My own thoughts from reading the Boost examples are that I can use boost::spirit to do the parsing of the input text, and I'm confident I can develop the grammar I need. However, I'm not sure how I can combine the parser with boost::proto to build an executable expression template. Most examples of Spirit that I've seen are merely interpreters or end up building some kind of syntax tree but go no further. Most examples of Proto that I've seen assume the DSL is embedded in the host source code and does not need to be initially interpreted from a string. I'm aware that boost::spirit is actually implemented with boost::proto, but I'm not sure whether this is relevant to the problem or whether that fact suggests a convenient solution.
To reiterate, I need to be able to make real something like the following:
const std::string input_text("a && b || c");
// const std::string input_text(get_dsl_string_from_file("expression1.dsl"));
Expression expr(input_text);
while(keep_intensively_processing) {
    ...
    Context context(…);
    // e.g. context.a = false; context.b=false; context.c=true;
    bool result(evaluate(expr, context));
    ...
}
I would really appreciate a minimal example or even just a small kernel that I can build upon that creates an expression from input text which is evaluated later in context.
I don't think this is exactly the same question as posted here: parsing boolean expressions with boost spirit
as I'm not convinced that is necessarily the fastest-executing way of doing this, even though it looks very clever. In time I'll try to benchmark all the answers posted.
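For what it's worth, here is one possible shape of such a kernel, hand-rolled without Boost.Proto; all the names (Context, Expression, evaluate, var/and_/or_) are invented to mirror the snippet above, and a Spirit grammar's semantic actions would be what calls the node builders. Evaluation afterwards is just invoking the closure tree.

#include <functional>
#include <map>
#include <string>

struct Context {
    std::map<std::string, bool> vars;   // e.g. vars["a"] = false;
};

using Expression = std::function<bool(Context const&)>;

// leaf: a placeholder resolved against the context at evaluation time
Expression var(std::string name) {
    return [name](Context const& c) { return c.vars.at(name); };
}
// interior nodes: combine already-built sub-expressions
Expression and_(Expression l, Expression r) {
    return [l, r](Context const& c) { return l(c) && r(c); };
}
Expression or_(Expression l, Expression r) {
    return [l, r](Context const& c) { return l(c) || r(c); };
}

bool evaluate(Expression const& e, Context const& c) { return e(c); }

int main() {
    // what a parser for "a && b || c" could build once via semantic actions
    Expression expr = or_(and_(var("a"), var("b")), var("c"));

    Context context;
    context.vars = {{"a", false}, {"b", false}, {"c", true}};
    return evaluate(expr, context) ? 0 : 1;   // evaluates to true here
}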

Extracting information using BNF grammars

I would like to extract information from a body of text and be able to query it.
The structure of this body of text would be specified by a BNF grammar (or variant), and the information to extract would be specified at runtime (the syntax of the query does not matter at the moment).
So the requirements are simple, really:
Receive some structured body of text
Load it in an exploitable form using a grammar to parse it
Run a query to select some portions of it
To illustrate with an example, suppose that we have such grammar (in a customized BNF format):
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<id> ::= 15 * digit
<hex> ::= 10 * (<digit> | a | b | c | d | e | f)
<anything> ::= <digit> | .... (all characters)
<match> ::= <id> (" " <hex>)*
<nomatch> ::= "." <anything>*
<line> ::= (<match> | <nomatch> | "") [<CR>] <LF>
<text> ::= <line>+
For which such text would be conforming:
012345678901234
012345678901234 abcdef0123
Nor the previous line nor this one would match
And then I would want to list all tags that appear in the rule, for example using an XPath-like syntax:
match//id
which would return a list.
This sounds relatively easy, except that I have two big constraints:
the BNF grammar should be read at runtime (from a string/vector like structure)
the queries will be read at runtime too
Some clarifications:
the grammar is not expected to change often so a "compilation" step to produce an in-memory structure is acceptable (and perhaps necessary to achieve good speed)
speed is of the essence, bonus points for on-the-fly collection of the wanted portions
bonus points for the possibility to have callbacks to disambiguate (sometimes the necessary disambiguation information might require DB access for example)
bonus points for multipart grammars (favoring modularity and reuse of grammar elements)
I know of lex/yacc and flex/bison, for example; however, they appear to only create C/C++ code to be compiled, which is not what I am looking for.
Do you know of a robust library (preferably free and open-source) that can transform a BNF grammar into a parser "on the fly" and produce a structured in-memory output from a body of text using this parser?
EDIT: I am open to alternatives. At the moment, the idea was that perhaps regexes could allow this extraction, however given the complexity of the grammars involved, this could get ugly quickly and thus maintaining the regexes would be quite a horrendous task. Furthermore, by separating grammars and extraction I hope to be able to reuse the same grammar for different extractions needs rather than having slightly different regexes each time.
I have a proprietary solution that can convert grammar source into an in-memory representation. The result is a pure data structure; any code can use it. I also have a C++ class that actually implements the parser. Rule handlers are implemented as virtual methods.
The primary difference between our solution and YACC/Bison is that no C/C++ code is generated. This means that the grammar can be reloaded without recompiling the app. The grammar can be annotated with application IDs that are used in the code of the rule handlers.
The GOLD parser system produces an LALR parse table that is, AFAIK, loaded at runtime. I believe it has a C++ "parsing" engine, so that should be easy to integrate.
You'd read your grammar, fork a subprocess to get the GOLD parser generator to produce the table, and then call your wired-in GOLD parser to load-and-parse.
I don't know how you attach actions to the reductions, which you'd probably like to do.
I have no specific experience with GOLD. "Gold" luck to you.

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea.)
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from a different change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining new lines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}', balanced '(' ')' and all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function, to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then the parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
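A rough sketch of that kind of dumb scanner (deliberately ignoring comments, string literals and the preprocessor, so it is only an illustration of the brace/paren tracking idea): it records any identifier seen at file scope that is followed by a parenthesised parameter list and an opening brace.

#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1]);
    std::string src((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());

    int brace = 0, paren = 0;          // nesting depth of {} and ()
    std::string last_ident;            // last identifier seen at file scope
    bool saw_params = false;           // was it followed by (...) ?
    std::vector<std::string> functions;

    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (std::isalpha((unsigned char)c) || c == '_') {
            std::size_t j = i;
            while (j < src.size() && (std::isalnum((unsigned char)src[j]) || src[j] == '_')) ++j;
            if (brace == 0 && paren == 0) { last_ident = src.substr(i, j - i); saw_params = false; }
            i = j - 1;
        } else if (c == '(') { if (brace == 0 && paren == 0) saw_params = true; ++paren; }
        else if (c == ')') { --paren; }
        else if (c == '{') {
            // identifier followed by (...) then '{' at file scope: call it a function definition
            if (brace == 0 && paren == 0 && saw_params && !last_ident.empty())
                functions.push_back(last_ident);
            ++brace;
        }
        else if (c == '}') { --brace; }
        else if (c == ';') { if (brace == 0) { last_ident.clear(); saw_params = false; } }
    }

    for (std::size_t k = 0; k < functions.size(); ++k)
        std::cout << functions[k] << '\n';
}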
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for the parser generator.
If the code is Java, then ANTLR is a possible, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.

Dynamically Describing Mathematical Rules

I want to be able to specify mathematical rules in an external location to my application and have my application process them at run time and perform actions (again described externally). What is the best way to do this?
For example, I might want to execute function MyFunction1() when the following evaluates to true:
(a < b) & MyFunction2() & (myWord == "test").
Thanks in advance for your help.
(If it is of any relevance, I wish to use C++, C or C++/CLI)
I'd consider not reinventing the wheel --- use an embedded scripting engine. This means you'd be using a standard form for describing the actions and logic. There are several great options out there that will probably be fine for your needs.
Good options include:
Javascript through Google v8. (I don't love this from an embedding point of view, but Javascript is easy to work with, and many people already know it.)
Lua. It's fast and portable. Syntax is maybe not as nice as Javascript's, but embedding is easy.
Python. Clean syntax, lots of libraries. Not much fun to embed though.
I'd consider using SWIG to help generate the bindings ... I know it works for python and lua, not sure about v8.
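To make that concrete, embedding Lua for the rule in the question takes only a handful of calls; the variable names and MyFunction2 come from the example above, and error handling is omitted.

#include <lua.hpp>   // Lua's C API, C++-friendly header

// the externally-callable predicate, exposed to the script
static int MyFunction2(lua_State* L) {
    lua_pushboolean(L, 1);   // pretend it evaluated to true
    return 1;                // one return value
}

int main() {
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);
    lua_register(L, "MyFunction2", MyFunction2);

    // expose the application's values to the rule
    lua_pushnumber(L, 1.0);      lua_setglobal(L, "a");
    lua_pushnumber(L, 2.0);      lua_setglobal(L, "b");
    lua_pushstring(L, "test");   lua_setglobal(L, "myWord");

    // the externally stored rule, evaluated at run time
    bool result = false;
    if (luaL_dostring(L, "return (a < b) and MyFunction2() and (myWord == 'test')") == 0)
        result = lua_toboolean(L, -1) != 0;

    lua_close(L);
    return result ? 0 : 1;
}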
I would look at the command design pattern to handle calling external mathematical predicates, and the Factory design pattern to run externally defined code.
If your mathematical expression language is that simple, then you could define a grammar for it, e.g.:
expr = bin-op-expr | rel-expr | func-expr | var-expr | "(" expr ")"
bin-op = "&" | "|" | "!"
bin-op-expr = expr bin-op expr
rel-op = "<" | ">" | "==" | "!=" | "<=" | ">="
rel-expr = expr rel-op expr
func-args = "(" ")"
func-expr = func-name func-args
var-expr = name
and then translate that into a grammar for a parser. E.g. you could use Boost.Spirit which provides a DSL to allow you to express a grammar within your C++ code.
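A hedged sketch of that translation into Boost.Spirit (Qi), written as a recognizer only (no attributes or semantic actions yet), with precedence layered into the rules rather than the ambiguous left-recursive form above:

#include <boost/spirit/include/qi.hpp>
#include <string>

namespace qi = boost::spirit::qi;

template <typename It>
struct rule_grammar : qi::grammar<It, qi::space_type> {
    rule_grammar() : rule_grammar::base_type(expr) {
        name      = qi::lexeme[qi::alpha >> *(qi::alnum | qi::char_('_'))];
        quoted    = qi::lexeme['"' >> *(qi::char_ - '"') >> '"'];
        func_expr = name >> '(' >> ')';
        primary   = func_expr | quoted | qi::double_ | name | '(' >> expr >> ')';
        rel_expr  = primary >> -((qi::lit("==") | "!=" | "<=" | ">=" | '<' | '>') >> primary);
        expr      = rel_expr % (qi::lit("&") | '|');   // no precedence between & and |
    }
    qi::rule<It, qi::space_type> expr, rel_expr, primary, func_expr, name, quoted;
};

int main() {
    std::string const input = "(a < b) & MyFunction2() & (myWord == \"test\")";
    rule_grammar<std::string::const_iterator> g;
    std::string::const_iterator f = input.begin(), l = input.end();
    bool ok = qi::phrase_parse(f, l, g, qi::space) && f == l;
    return ok ? 0 : 1;
}

Attaching semantic actions to func_expr and rel_expr is then the natural place to hook in the externally defined predicates and commands.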
If that calculation happens in an inner loop and you want high performance, you cannot go with scripting languages. Based on how "deployable" and how platform-independent you would like it to be:
1) You could express the equations in C++ and let g++ compile them for you at run time, then link to the resulting shared object (a rough sketch follows after this list). But this method is very much platform-dependent at every step! The necessary system calls, the compiler to use, the flags, loading a shared object (or a DLL)... That would be super-fast in the end though, especially if you compile the innermost loop together with the equation. The equation would be inlined and all.
2) You could use Java in the same way. You can get a nice Java compiler in Java (from Eclipse I think, and you can embed it easily). With this solution the result would be slightly slower, I would expect by a factor of 2 for most purposes (depending on how much template magic you want). But this solution is extremely portable. Once you get it working, there's no reason it shouldn't work anywhere, and you don't need anything external to your program. Another downside is having to write your equations in Java syntax, which is ugly for complex math. The first solution is much better in that respect, since operator overloading greatly helps with math equations.
3) I don't know much about C#, but there could be a solution similar to (2). If there is, I know that there's operator overloading in C#, so your equations would be more pleasant to write and look at.
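A bare-bones sketch of option 1 on a POSIX system; the file names, compiler invocation and the extern "C" entry point are all assumptions for illustration, and error handling beyond the bare minimum is omitted.

#include <dlfcn.h>     // dlopen / dlsym / dlclose
#include <cstdlib>
#include <fstream>

typedef bool (*rule_fn)(double a, double b);

int main() {
    // 1. write the externally-described rule out as C++ source
    {
        std::ofstream src("rule.cpp");
        src << "extern \"C\" bool rule(double a, double b) { return a < b; }\n";
    }

    // 2. compile it into a shared object at run time
    if (std::system("g++ -O2 -shared -fPIC rule.cpp -o librule.so") != 0)
        return 1;

    // 3. load it and call the compiled (and optimised) rule
    void* handle = dlopen("./librule.so", RTLD_NOW);
    if (!handle) return 1;
    rule_fn rule = reinterpret_cast<rule_fn>(dlsym(handle, "rule"));
    bool ok = rule(1.0, 2.0);
    dlclose(handle);
    return ok ? 0 : 1;
}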

calculating user defined formulas (with c++)

We would like to have user-defined formulas in our C++ program.
e.g. the value v = x + ( y - (z - 2)) / 2. Later in the program the user would define x, y and z, and the program should return the result of the calculation. Sometime later the formula may get changed, so the next time the program should parse the new formula and apply the new values. Any ideas / hints on how to do something like this? So far my only idea is to write a parser to calculate these formulas - maybe any ideas about that?
If it will be used frequently and if it will be extended in the future, I would almost recommend adding either Python or Lua into your code. Lua is a very lightweight scripting language which you can hook into and provide new functions, operators etc. If you want to do more robust and complicated things, use Python instead.
You can represent your formula as a tree of operations and sub-expressions. You may want to define types or constants for Operation types and Variables.
You can then easily enough write a method that recurses through the tree, applying the appropriate operations to whatever values you pass in.
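A bare-bones sketch of such a tree (all names invented): each node is either a constant, a variable resolved against a value map at evaluation time, or an operation over two sub-trees, and evaluation is a simple recursion.

#include <map>
#include <memory>
#include <string>

struct Node {
    enum Kind { Constant, Variable, Operation } kind;
    double value = 0;                 // used when kind == Constant
    std::string name;                 // used when kind == Variable
    char op = 0;                      // '+', '-', '*', '/' when kind == Operation
    std::unique_ptr<Node> lhs, rhs;   // sub-expressions of an Operation

    double eval(std::map<std::string, double> const& vars) const {
        if (kind == Constant) return value;
        if (kind == Variable) return vars.at(name);
        double l = lhs->eval(vars), r = rhs->eval(vars);
        switch (op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            default : return l / r;
        }
    }
};

Whatever parses the formula builds this tree once; re-evaluating with new user-supplied values is just another call to eval with a different variable map.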
Building your own parser for this should be a straight-forward operation (a sketch follows after this list):
1) convert the equation from infix to postfix notation (a typical compsci assignment) (I'd use a stack)
2) wait to get the values you want
3) pop the stack of postfix items, dropping in the value for each variable where needed
4) display results
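A compact sketch of those steps (single-character variables, single-digit constants and + - * / only, so strictly an illustration of the idea):

#include <cctype>
#include <iostream>
#include <map>
#include <stack>
#include <string>
#include <vector>

static int prec(char op) { return (op == '+' || op == '-') ? 1 : 2; }

// step 1: infix -> postfix (shunting-yard, using a stack of operators)
std::vector<char> to_postfix(std::string const& in) {
    std::vector<char> out;
    std::stack<char> ops;
    for (char c : in) {
        if (std::isspace((unsigned char)c)) continue;
        if (std::isalnum((unsigned char)c)) { out.push_back(c); continue; }
        if (c == '(') { ops.push(c); continue; }
        if (c == ')') {
            while (ops.top() != '(') { out.push_back(ops.top()); ops.pop(); }
            ops.pop();                               // discard the '('
            continue;
        }
        // an operator: pop anything of equal or higher precedence first
        while (!ops.empty() && ops.top() != '(' && prec(ops.top()) >= prec(c)) {
            out.push_back(ops.top()); ops.pop();
        }
        ops.push(c);
    }
    while (!ops.empty()) { out.push_back(ops.top()); ops.pop(); }
    return out;
}

// step 3: pop through the postfix items, substituting the variable values
double evaluate(std::vector<char> const& postfix, std::map<char, double> const& vars) {
    std::stack<double> st;
    for (char c : postfix) {
        if (std::isalpha((unsigned char)c)) { st.push(vars.at(c)); continue; }
        if (std::isdigit((unsigned char)c)) { st.push(c - '0'); continue; }
        double r = st.top(); st.pop();
        double l = st.top(); st.pop();
        switch (c) {
            case '+': st.push(l + r); break;
            case '-': st.push(l - r); break;
            case '*': st.push(l * r); break;
            default : st.push(l / r); break;
        }
    }
    return st.top();
}

int main() {
    std::vector<char> postfix = to_postfix("x + ( y - (z - 2)) / 2");       // the formula from the question
    std::cout << evaluate(postfix, {{'x', 1}, {'y', 8}, {'z', 4}}) << '\n'; // prints 4
}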
Using Spirit (for example) to parse (and the 'semantic actions' it provides to construct an expression tree that you can then manipulate, e.g., evaluate) seems like quite a simple solution. You can find a grammar for arithmetic expressions there for example, if needed... (it's quite simple to come up with your own).
Note: Spirit is very simple to learn, and quite adapted for such tasks.
There are generally two ways of doing it, with three possible implementations:
as you've touched on yourself, a library to evaluate formulas
compiling the formula into code
The second option here is usually done either by compiling something that can be loaded in as a kind of plugin, or it can be compiled into a separate program that is then invoked and produces the necessary output.
For C++ I would guess that a library for evaluation would probably exist somewhere so that's where I would start.
If you want to write your own, search for "formal automata" and/or "finite state machine grammar".
In general, what you will do is parse the string, pushing characters onto a stack as you go. Then start popping the characters off and perform tasks based on what is popped. It's easier to code if you force equations to be in reverse Polish notation.
To make your life easier, I think getting this kind of input is best done through a GUI where users are restricted in what they can type in.
If you plan on doing it from the command line (that is the impression I get from your post), then you should probably define a strict set of allowable inputs (e.g. only single letter variables, no whitespace, and only certain mathematical symbols: ()+-*/ etc.).
Then, you will need to:
Read in the input char array
Parse it in order to build up a list of variables and actions
Carry out those actions - in BOMDAS order
With ANTLR you can create a parser/compiler that will interpret the user input, then execute the calculations using the Visitor pattern. A good example is here, but it is in C#. You should be able to adapt it quickly to your needs and remain using C++ as your development platform.