The file generated by ocamllex - ocaml

The theory about lex tools (I'm reading about ocamllex) says that they convert a collection of regular expressions into C (here, OCaml) code implementing a DFA (actually building an NFA first and then converting the NFA to a DFA). The formal definition of a DFA M is a 5-tuple M = (Q, Sigma, delta, q0, F), where delta is the transition function. What I found in the generated file is the following:
a record called __ocaml_lex_tables with fields from the Lexing module
a recursive function
Is there a mapping between the objects/structures of a DFA and the structures generated by ocamllex? I cannot 'see' it... I also googled for some help and did not find any useful example.
The output of the ocamllex tool is meaningful in a DFA context, e.g. '7 states, 279 transitions, table size 1158 bytes'.
Is it a state transition table? How do I 'read' it?
Thank you for any link/hint!

ocamllex is focused on speed, so it will not have explicit states visible in the generated code. The theoretical representation is not always the fastest one; in practice it is usually transformed to gain constant-factor speed improvements. The states are most probably represented as indexes into the generated arrays. You can think of it as mapping assembly code back to the real source code: in the general case it is not immediately possible, because the compiler performs optimizations and strives for the most compact and efficient code, and the same goes for ocamllex. And the interesting question is: why do you want to do that?
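To illustrate the idea only (this is not ocamllex's actual table layout, which is private to the generated code and the Lexing module), here is a hedged C++ sketch of how a DFA is typically flattened into arrays, with the states of Q appearing only as row indices:

#include <string>

// Toy DFA for the regex (ab)+ over the alphabet {a, b}.
// Q = {0, 1, 2}, q0 = 0, F = {2}; -1 marks a dead transition.
// ocamllex packs comparable data into __ocaml_lex_tables; a report
// like "7 states, 279 transitions" counts the rows and entries of
// tables of this kind.
const int trans[3][2] = {
    /* state 0 */ {  1, -1 },  // 'a' -> 1
    /* state 1 */ { -1,  2 },  // 'b' -> 2 (accepting)
    /* state 2 */ {  1, -1 },  // 'a' -> 1, start another "ab"
};

bool accepts(const std::string& s)
{
    int state = 0;                                      // q0
    for (char c : s) {
        int sym = (c == 'a') ? 0 : (c == 'b') ? 1 : -1; // Sigma -> column
        if (sym < 0 || (state = trans[state][sym]) < 0)
            return false;                               // dead state: reject
    }
    return state == 2;                                  // is state in F?
}

Reading the real tables is harder because ocamllex additionally encodes backtracking information and action indices, but the principle is the same: the transition function is a flat array indexed by state and input class.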

Related

What is the most efficient way to recalculate attributes of a Boost Spirit parse with a different symbol table?

I'm using Boost Spirit to implement functionality in some software that allows the user to enter a mathematical equation that will be repeatedly applied to an input stream. Input stream values are represented as symbols using boost::spirit::qi::symbols which the user can reference in their equation. (e.g. out1 = 3 * in1 + in2)
Parsing and compiling the user's equation is not performance sensitive but calculating its output value is as it forms part of a time-critical pipeline.
The standard way Spirit is used within the documentation is to calculate the output (attribute) of an input as it is parsed. However, as between each calculation only the attribute values of the symbols (out1, in1, etc.) will have changed, it feels like there might be a more efficient way to achieve this, perhaps by caching the abstract syntax tree of the expression and reiterating through it.
What is the most efficient way to recalculate the value of this (fixed) equation given a new set of symbol values?
The standard way Spirit is used is not as limited as you seem to think.
While you can use it to calculate an immediate value on the fly, it is much more usual to build an AST in the output attribute, which can then be transformed (simplified, optimized) and interpreted (e.g. emitting virtual machine or even assembly instructions).
The compiler tutorials show this in full, but the calculator samples are very close to what you seem to be looking for: http://www.boost.org/doc/libs/1_55_0/libs/spirit/example/qi/compiler_tutorial/
calc1 in example/qi/compiler_tutorial/calc1.cpp
Plain calculator example demonstrating the grammar. The parser is a syntax
checker only and does not do any semantic evaluation.
calc2 in example/qi/compiler_tutorial/calc2.cpp
A Calculator example demonstrating the grammar and semantic actions using
plain functions. The parser prints code suitable for a stack based virtual
machine.
calc3 in example/qi/compiler_tutorial/calc3.cpp
A calculator example demonstrating the grammar and semantic actions using
phoenix to do the actual expression evaluation. The parser is essentially
an "interpreter" that evaluates expressions on the fly.
Here's where it gets interesting for you, since it stops doing the calculations during parsing:
calc4 in example/qi/compiler_tutorial/calc4.cpp
A calculator example demonstrating generation of an AST. The AST, once
created, is traversed, both to print its contents and to evaluate the result.
calc5 in example/qi/compiler_tutorial/calc5.cpp
Same as Calc4, this time, we'll incorporate debugging support, plus error
handling and reporting.
calc6 in example/qi/compiler_tutorial/calc6.cpp
Yet another calculator example! This time, we will compile to a simple
virtual machine. This is actually one of the very first Spirit examples,
circa 2000. Now it's ported to Spirit2.
calc7 in example/qi/compiler_tutorial/calc7/main.cpp
Now we'll introduce variables and assignment. This time, we'll also be
renaming some of the rules -- a strategy for a grander scheme to come ;-)
This version also shows off grammar modularization. Here you will see how
expressions and statements are built as modular grammars.
calc8 in example/qi/compiler_tutorial/calc8/main.cpp
Now we'll introduce boolean expressions and control structures. Is it
obvious now what we are up to? ;-)
I'm sure you'll find lots of inspiration towards the end of the tutorial!
You can build your own tree of calculator objects that mirrors the AST. For your example out1 = 3 * in1 + in2, with multiplication binding tighter than addition, the AST is:
+
    *
        3
        in1
    in2
So you'd build an object hierarchy like this:
Adder(
    Multiplier(
        Constant(3),
        Variable(&in1)
    ),
    Variable(&in2)
)
With classes something like:
class Result {
public:
    virtual ~Result() = default;
    virtual double value() = 0;
};

class Multiplier : public Result {
public:
    Multiplier(Result* lhs, Result* rhs) : _lhs(lhs), _rhs(rhs) {}
    double value() override { return _lhs->value() * _rhs->value(); }
private:
    Result* _lhs;
    Result* _rhs;
};
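To round out the sketch, the remaining node types could follow the same pattern (hypothetical classes, matching the hierarchy above):

class Constant : public Result {
public:
    explicit Constant(double v) : _v(v) {}
    double value() override { return _v; }
private:
    double _v;
};

class Adder : public Result {
public:
    Adder(Result* lhs, Result* rhs) : _lhs(lhs), _rhs(rhs) {}
    double value() override { return _lhs->value() + _rhs->value(); }
private:
    Result* _lhs;
    Result* _rhs;
};

class Variable : public Result {
public:
    explicit Variable(const double* slot) : _slot(slot) {}
    double value() override { return *_slot; } // reads the current value on every call
private:
    const double* _slot;
};

Recalculating out1 after the input stream updates in1 and in2 is then a single value() call on the root node; nothing is re-parsed, only the pointed-to doubles change between calls.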

How do C/C++ compilers know which line an error is on

There is probably a very obvious answer to this, but I was wondering how the compiler knows which line of code my error is on. In some cases it even knows the column.
The only way I can think to do this is to tokenize the input string into a 2D array. This would store [lines][tokens].
C/C++ could be tokenized into one long 1D array, which would probably be more efficient. I am wondering what the usual parsing method is and how it keeps line information.
Actually, most of it is covered in the Dragon Book.
Compilers do lexing/parsing, i.e. transform the source code into a tree representation.
When doing so, each keyword, variable, etc. is associated with a line and column number.
However, during parsing the exact origin of the failure might get lost and the reported position might be off.
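A hedged sketch of that bookkeeping: the scanner counts newlines and columns as it consumes characters, and stamps each token with the position where it started:

#include <string>
#include <vector>

struct Token {
    std::string text;
    int line;   // 1-based position where the token started
    int column;
};

// Toy scanner: splits on whitespace, recording each token's origin.
std::vector<Token> scan(const std::string& src)
{
    std::vector<Token> tokens;
    int line = 1, column = 1;
    Token cur{"", 0, 0};
    for (char c : src) {
        if (c == ' ' || c == '\t' || c == '\n') {
            if (!cur.text.empty()) { tokens.push_back(cur); cur.text.clear(); }
            if (c == '\n') { ++line; column = 1; } else { ++column; }
        } else {
            if (cur.text.empty()) { cur.line = line; cur.column = column; }
            cur.text += c;
            ++column;
        }
    }
    if (!cur.text.empty()) tokens.push_back(cur);
    return tokens;
}

Real lexers split on token boundaries rather than whitespace, but the position tracking works the same way.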
This is the first step on the long, complicated path towards "Engineering a Compiler" or compiler theory.
The short answer to that is: there's a module called the "front-end" that usually takes care of many phases:
Scanning
Parsing
IR generator
IR optimizer ...
The structure isn't fixed, so each compiler will have its own set of modules, but more or less the steps involved in front-end processing are:
Scanning - maps the character stream into words or tokens (whitespace/comments are ignored)
Parsing - this is where syntax and (some) semantic analysis take place and where syntax errors are reported
To sum up: the compiler knows the location of your error because when something doesn't fit into the structure called an "abstract syntax tree" (i.e. it cannot be constructed), or doesn't follow any of the syntax-directed translation rules, there's something wrong, and the compiler indicates the location where the match failed. If the grammar error is on just one word/token, then even a precise column location can be returned, since nothing matched a terminal: a basic token like the if keyword in C/C++.
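Once tokens carry their positions, the error message itself is just formatting. A hypothetical helper (the Token type mirrors what a scanner would record):

#include <stdexcept>
#include <string>

struct Token { std::string text; int line; int column; }; // as recorded by the scanner

// Called when the parser cannot match the token against the grammar.
[[noreturn]] void syntaxError(const Token& got, const std::string& expected)
{
    throw std::runtime_error(
        "error: expected '" + expected + "' but found '" + got.text +
        "' at line " + std::to_string(got.line) +
        ", column " + std::to_string(got.column));
}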
If you want to know more about this topic my suggestion is to start with the classic academic approach of the "Compiler Book" or "Dragon Book" and then, later on, possibly study an open-source front-end like Clang

Parsing a tokenized free form grammar with Boost.Spirit

I've got stuck trying to create a Boost.Spirit parser for the callgrind tool's output which is part of valgrind. Callgrind outputs a domain specific embedded programming language (DSEL) which lets you do all sorts of cool stuff like custom expressions for synthetic counters, but it's not easy to parse.
I've placed some sample callgrind output at https://gist.github.com/ned14/5452719#file-sample-callgrind-output. I've placed my current best attempt at a Boost.Spirit lexer and parser at https://gist.github.com/ned14/5452719#file-callgrindparser-hpp and https://gist.github.com/ned14/5452719#file-callgrindparser-cxx. The Lexer part is straightforward: it tokenises tag-values, non-whitespace text, comments, end of lines, integers, hexadecimals, floats and operators (ignore the punctuators in the sample code, they're unused). White space is skipped.
So far so good. The problem is parsing the tokenised input stream. I haven't even attempted the main stanzas yet, I'm still trying to parse the tag-values which can occur at any point in the file. Tag values look like this:
tagtext: unknown series of tokens<eol>
It could be freeform text e.g.
desc: I1 cache: 32768 B, 64 B, 8-way associative, 157 picosec hit latency
In this situation you'd want to convert the set of tokens to a string i.e. to an iterator_range and extract.
It could however be an expression e.g.
event: EPpsec = 316 Ir + 1120 I1mr + 1120 D1mr + 1120 D1mw + 1362 ILmr + 1362 DLmr + 1362 DLmw
This says that from now on, event EPpsec is to be synthesised as Ir multiplied by 316 added to I1mr multiplied by 1120 added to ... etc.
The point I want to make here is that tag-value pairs need to be accumulated as arbitrary sets of tokens, and post-processed into whatever we turn them into later.
To that end, Boost.Spirit's utree() class looked exactly what I wanted, and that's what the sample code uses. But on VS2012 using the November CTP compiler with variadic templates I'm currently seeing this compile error:
1>C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(56): error C2440: 'static_cast' : cannot convert from 'boost::spirit::detail::list::node_iterator<const boost::spirit::utree>' to 'base_iterator_type'
1> No constructor could take the source type, or constructor overload resolution was ambiguous
1> C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(186) : see reference to function template instantiation 'IteratorT boost::iterator_range_detail::iterator_range_impl<IteratorT>::adl_begin<const Range>(ForwardRange &)' being compiled
1> with
1> [
1> IteratorT=base_iterator_type
1> , Range=boost::spirit::utree
1> , ForwardRange=boost::spirit::utree
1> ]
... which suggests that my base_iterator_type, which is a Boost.Spirit multi_pass<> wrap of an istreambuf_iterator for forward iterator nature, is somehow not understood by Boost.Spirit's utree() implementation. Thing is, I'm not sure if this is my bad code or bad Boost.Spirit code seeing as line_pos_iterator<> was failing to correctly specify its forward_iterator concept tag.
Thanks to past Stack Overflow help I could write a pure non-tokenised grammar, but it would be brittle. The right solution is to tokenise and use a freeform grammar capable of handling fairly arbitrary input. Sadly there are very few real-world (rather than toy) examples of getting Boost.Spirit's Lex and Qi grammars working together to achieve this. Therefore any help would be greatly appreciated.
Niall
The token attribute exposes a variant which, in addition to the base-iterator range, can assume the types declared in the token_type typedef:
typedef lex::lexertl::token<base_iterator_type, mpl::vector<std::string, int, double>> token_type;
So: string, int and double. Note however that coercion into one of the possible types will only occur lazily, when the parser actually uses the value.
utrees are a very versatile container [1]. Hence, when you expose a spirit::utree attribute on a rule, and the token value variant contains an iterator_range, then it attempts to assign that into the utree object (this fails, because the iterators are ... 'funky').
The easiest way to get your desired behaviour is to force Qi to interpret the attribute of the tag token as a string, and have that assigned to the utree. Therefore the following line constitutes a fix that will make compilation succeed:
unknowntagvalue = qi::as_string[tok.tag] >> restofline;
Notes
Having said all this, I would indeed suggest the following
Consider using the Nabialek Trick to dispatch different lazy rules depending on the tag matched - this makes it unnecessary to deal with raw utrees later on (a minimal sketch follows after these notes)
You might have had success specializing boost::spirit::traits::assign_to_XXXXXX traits (see documentation)
consider using a pure Qi parser. While I can "feel" your sentiment that "it is going to be brittle" [2], it seems you have already demonstrated that it raises the complexity to such a degree that it might not have net merit:
the unexpected ways in which attributes materialize (this question)
the problem with line-pos iterators (this is a frequently asked question, and AFAIR it mostly has hard or inelegant solutions)
the inflexibility regarding e.g. ad-hoc debugging (access to source data in SA), switching/disabling skippers etc.
my personal experience was that looking at lexer states to drive these isn't helpful, because switching lexer state can only work reliably from lexer token semantic actions, whereas often, the disambiguation would happen in the Qi phase
but I'm diverging :)
[1] e.g. they have facilities for very lightweight 'referencing' of iterator ranges (e.g. for symbols, or to avoid copying characters from a source buffer into the attribute unless wanted)
[2] In effect, only because using a sequential lexer (scanner) vastly reduces the number of backtrack opportunities, so it simplifies the mental model of the parser. However, you can use expectation points to much the same effect.
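For reference, a minimal sketch of the Nabialek Trick mentioned above, in its classic form (the rule names and bodies here are hypothetical stand-ins, not the OP's actual grammar):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <string>

namespace qi = boost::spirit::qi;

int main()
{
    using It   = std::string::const_iterator;
    using Rule = qi::rule<It, qi::space_type>;

    // One body rule per known tag.
    Rule desc_body  = *(qi::char_ - qi::eol);   // freeform text
    Rule event_body = qi::lexeme[+qi::alnum] >> '='
                   >> (qi::int_ >> qi::lexeme[+qi::alnum]) % '+';

    // The trick: a symbol table mapping each tag to the rule for the rest of the line.
    qi::symbols<char, Rule*> tag;
    tag.add("desc:", &desc_body)("event:", &event_body);

    // Match the tag, store the rule pointer in a local, then invoke it lazily.
    qi::rule<It, qi::space_type, qi::locals<Rule*>> line
        = tag[qi::_a = qi::_1] >> qi::lazy(*qi::_a);

    std::string input = "desc: I1 cache: 32768 B, 64 B, 8-way associative";
    It f = input.begin(), l = input.end();
    return qi::phrase_parse(f, l, line, qi::space) ? 0 : 1;
}

Each tag dispatches directly to a dedicated rule, so no raw utree ever needs to be post-processed.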

Expression Evaluation in C++

I'm writing an Excel-like C++ console app for homework.
My app should be able to accept formulas for its cells; for example, it should evaluate something like this:
Sum(tablename\fieldname[recordnumber], fieldname[recordnumber], ...)
tablename\fieldname[recordnumber] points to a cell in another table,
fieldname[recordnumber] points to a cell in current table
or
Sin(fieldname[recordnumber])
or
anotherfieldname[recordnumber]
or
"10" // (simply a number)
something like that.
functions are Sum, Ave, Sin, Cos, Tan, Cot, Mul, Div, Pow, Log (10), Ln, Mod
It's pathetic, I know, but it's my homework :'(
So does anyone know a trick to evaluate something like this?
Ok, nice homework question by the way.
It really depends on how heavy you want this to be. You can create a full expression parser (which is fun but also time consuming).
In order to do that, you need to describe the full grammar and write a frontend (have a look at lex and yacc, or flex and bison).
But as I read your question, you can limit yourself to three subcases:
a simple value
a lookup (possibly into another table)
a function whose inputs are lookups
I think a little OO design can help you out here.
I'm not sure if you have to deal with real-time refresh and circular dependency checks. If so, they can be tricky too.
For the parsing, I'd look at Recursive descent parsing. Then have a table that maps all possible function names to function pointers:
struct FunctionTableEntry {
    std::string name;
    double (*f)(double);
};
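For illustration, a hedged sketch of how such a table might be populated and used (hypothetical names; the wrapper lambdas avoid taking the address of the overloaded std:: math functions):

#include <cmath>
#include <stdexcept>
#include <string>
#include <vector>

static const std::vector<FunctionTableEntry> functionTable = {
    {"Sin", [](double x) { return std::sin(x); }},
    {"Cos", [](double x) { return std::cos(x); }},
    {"Tan", [](double x) { return std::tan(x); }},
    {"Log", [](double x) { return std::log10(x); }},
    {"Ln",  [](double x) { return std::log(x); }},
};

// Called by the recursive descent parser after it has read a
// function name and evaluated the argument expression.
double callFunction(const std::string& name, double arg)
{
    for (const auto& e : functionTable)
        if (e.name == name) return e.f(arg);
    throw std::runtime_error("unknown function: " + name);
}

Two-argument functions such as Pow or Mod would need a second table whose member is a double (*)(double, double).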
You should write a parser. The parser should take the expression, i.e. each line, identify the command, and construct the parse tree. This is the first phase. In the second phase you can evaluate the tree by substituting the data for each element of the command.
Previous responders have hit it on the head: you need to parse the cell contents, and interpret them.
StackOverflow already has a whole slew of questions on building compilers and interpreters where you can find pointers to resources. Some of them are:
Learning to write a compiler (#1669 people!)
Learning Resources on Parsers, Interpreters, and Compilers
What are good resources on compilation?
References Needed for Implementing an Interpreter in C/C++
...
and so on.
Aside: I never have the energy to link them all together, or even try to build a comprehensive list.
I guess you cannot use yacc/lex (or the like) so you have to parse "manually":
Iterate over the string and divide it into its parts. What a part is depends on your grammar (syntax). That way you can find the function names and the parameters. The difficulty of this depends on the complexity of your syntax.
Maybe you should read a bit about lexical analysis.

Calculating user-defined formulas (with C++)

We would like to have user-defined formulas in our C++ program.
e.g. the value v = x + (y - (z - 2)) / 2. Later in the program the user would define x, y and z, and the program should return the result of the calculation. Sometime later the formula may be changed, so the next time the program should parse the new formula and apply the new values. Any ideas/hints on how to do something like this? So far the only solution I have come up with is to write a parser to calculate these formulas - maybe any ideas about that?
If it will be used frequently and extended in the future, I would almost recommend embedding either Python or Lua into your code. Lua is a very lightweight scripting language which you can hook into to provide new functions, operators, etc. If you want to do more robust and complicated things, use Python instead.
You can represent your formula as a tree of operations and sub-expressions. You may want to define types or constants for Operation types and Variables.
You can then easily enough write a method that recurses through the tree, applying the appropriate operations to whatever values you pass in.
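A minimal sketch of such a tree, assuming a simple tagged node and a std::map holding the user-supplied variable values (all names here are illustrative):

#include <map>
#include <memory>
#include <string>

// One node of the formula tree, e.g. for v = x + (y - (z - 2)) / 2.
struct Expr {
    enum Kind { Num, Var, Add, Sub, Div } kind;
    double num = 0;                  // used when kind == Num
    std::string name;                // used when kind == Var
    std::unique_ptr<Expr> lhs, rhs;  // used by the binary operators
};

double eval(const Expr& e, const std::map<std::string, double>& vars)
{
    switch (e.kind) {
    case Expr::Num: return e.num;
    case Expr::Var: return vars.at(e.name); // current user value, looked up per call
    case Expr::Add: return eval(*e.lhs, vars) + eval(*e.rhs, vars);
    case Expr::Sub: return eval(*e.lhs, vars) - eval(*e.rhs, vars);
    case Expr::Div: return eval(*e.lhs, vars) / eval(*e.rhs, vars);
    }
    return 0; // unreachable
}

The tree is built once when the formula is parsed; when the user later supplies new values for x, y and z, only the map changes and eval is called again.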
Building your own parser for this should be a straight-forward operation:
1) convert the equation from infix to postfix notation (a typical compsci assignment; I'd use a stack)
2) wait to get the values you want
3) pop the postfix items off the stack, dropping in the value for each variable where needed
4) display the results
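A minimal sketch of steps 3 and 4, assuming the formula has already been converted to postfix with space-separated tokens (so v = x + (y - (z - 2)) / 2 becomes "x y z 2 - - 2 / +"):

#include <map>
#include <sstream>
#include <stack>
#include <string>

double evalPostfix(const std::string& expr, const std::map<std::string, double>& vars)
{
    std::stack<double> st;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            double rhs = st.top(); st.pop();
            double lhs = st.top(); st.pop();
            if      (tok == "+") st.push(lhs + rhs);
            else if (tok == "-") st.push(lhs - rhs);
            else if (tok == "*") st.push(lhs * rhs);
            else                 st.push(lhs / rhs);
        } else if (vars.count(tok)) {
            st.push(vars.at(tok));     // variable: drop in its current value
        } else {
            st.push(std::stod(tok));   // numeric literal
        }
    }
    return st.top();
}

With x = 1, y = 5, z = 3 this returns 1 + (5 - 1) / 2 = 3.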
Using Spirit (for example) to parse (and the 'semantic actions' it provides to construct an expression tree that you can then manipulate, e.g., evaluate) seems like quite a simple solution. You can find a grammar for arithmetic expressions there for example, if needed... (it's quite simple to come up with your own).
Note: Spirit is very simple to learn, and quite adapted for such tasks.
There are generally two ways of doing it, with three possible implementations:
as you've touched on yourself, a library to evaluate formulas
compiling the formula into code
The second option here is usually done either by compiling something that can be loaded in as a kind of plugin, or it can be compiled into a separate program that is then invoked and produces the necessary output.
For C++ I would guess that a library for evaluation would probably exist somewhere so that's where I would start.
If you want to write your own, search for "formal automata" and/or "finite state machine grammar"
In general, you will parse the string, pushing characters onto a stack as you go. Then start popping the characters off and perform tasks based on what is popped. It's easier to code if you convert equations to reverse Polish notation.
To make your life easier, I think getting this kind of input is best done through a GUI where users are restricted in what they can type in.
If you plan on doing it from the command line (that is the impression I get from your post), then you should probably define a strict set of allowable inputs (e.g. only single letter variables, no whitespace, and only certain mathematical symbols: ()+-*/ etc.).
Then, you will need to:
Read in the input char array
Parse it in order to build up a list of variables and actions
Carry out those actions - in BOMDAS order
With ANTLR you can create a parser/compiler that will interpret the user input, then execute the calculations using the Visitor pattern. A good example is here, but it is in C#. You should be able to adapt it quickly to your needs while keeping C++ as your development platform.