boost spirit expression grammar issue while creating AST - c++

I'm trying to parse expression grammar (with variables) into an Abstract Syntax Tree (AST) so that later on I could make a use of this AST and calculate values basing on those expressions (they may be a part of a function for example, so there is no need to store those expressions rather than calculate a value right away).
To my surprise, after handling with loops and instructions (which require nested structures in AST as well), I got nothing but seg faults after trying to parse any expression.. After hours of struggling with this I decided to ask here, because I have no idea what is it (maybe something with the grammar)
This statement-loop part works perfectly well. The struct 'loop' gets as a parameter only a number of repetitions - string so far (later on I want to put an expression here):
statement %= loop | inst;
inst %= lexeme[+(char_ - (';'|char_('}')) )] >> ';';
loop = "do(" > lexeme[+(char_ - "){")] // parse a number of loop repetitions
> "){"
> *statement > "}"
;
Structures are like:
typedef boost::variant<
boost::recursive_wrapper<s_loop>
, std::string>
s_statement;
struct s_loop
{
std::string name; // tag name
//s_expression exp; // TODO
std::vector<s_statement> children; // children
};
I use recursive wrapper, so I thought that maybe it is because of "deep" wrapping in case of expression-term-factor, why I can't do it. For loop-statement it goes simply like:
loop --(contains)--> statement (statement may be a loop!)
And in case of expressions it should be finally implemented like:
expression -> term -> factor (factor may be an expression!)
So, to be sure it's because of 'deep' wrapping, I tried with trivial grammar:
expression -> factor (factor may be an expression)
AST structures are copy-paste of above, everything is quite similar and.... it does not work! :(
I am quite sure that it must be something wrong with my grammar.. To be honest, I am not an expert of spirit. Here's the grammar:
expression = factor > * ( (char_('+')|char_('-')) > factor ) ;
factor %= uint_ | my_var | my_dat | my_rec_exp;
// factor %= uint_ | my_var | my_dat; //this WORKS! I've made procedures to traverse an AST
// strings and ints are parsed and stored well inside the expression structure
// factor %= uint_ | my_rec_exp; // even this simple version (of course I adjust a stucture s_expression) doesn't work.. WHY? :( , it's even less complex than loop-statement
my_rec_exp = '(' > expression > ')';
my_var %= char_('!') >> lexeme[+ ( char_ - ( ('+')|char_('-')|char_('*')|char_('/')|char_('(')|char_(')') ) ) ] ;
my_dat %= char_('#') >> lexeme[+ ( char_ - ( ('+')|char_('-')|char_('*')|char_('/')|char_('(')|char_(')') ) ) ] ;
Structures are here:
struct s_expression;
typedef boost::variant<
boost::recursive_wrapper<s_expression>,
// s_expression,
std::string,
unsigned int
>
s_factor;
struct s_term{ // WE DO NOT USE THIS IN THE SIMPLIFIED VERSION
s_factor factor0;
std::vector<std::pair<char, s_factor> >
factors;
};
struct s_expression{
s_factor term0;
std::vector<std::pair<char, s_factor> >
terms;
};
I will say one more time that without recursive expression it works well (parses to en expression containing a set of numers / strings connected with operators + / - ). But if I add expression as a variant of factor it crashes on exec.
Thank you for any advice / suggestion !

Related

how to use list syntax with defaults spirit

I am attempting to parse comma separated integers, with possible blanks. For instance, 1,2,,3,,-1 should be parsed as {1,2,n,3,n,-1} where is n is some constant.
The expression,
(int_ | eps) % ','
works when n == 0. More specifically, the following code works special cased for 0:
#include <boost/spirit/include/qi.hpp>
#include <iostream>
int main() {
using namespace boost::qi;
std::vector<int> v;
std::string s("1,2,,3,4,,-1");
phrase_parse(s.begin(), s.end(),
(int_|eps) % ','
, space, v);
}
I tried the following expression for arbitrary n:
(int_ | eps[_val = 3]) % ','
But apparently this is wrong. The compiler generates an error novel. I refrain from pasting all that here, as most likely what I am trying is incorrect (rather than specific compiler issues).
What would be the right way?
Nick
The attr() parser exists for this purpose:
(int_ | attr(3)) % ','

How can I implement grouping preference in antlr?

I want to write a lexer and parser which could take expressions like
(4+y)*8
4+5*x
(3)+(z*(4+w))*6
And then parse them considering the priority of multiplication over addition. In particular, I can't figure out how I could avoid
4+5*x
being grouped as
MULTIPLICATION(ADDITION(4,5),5) instead of ADDITION(4+MULTIPLICATION)
My lexer looks like that:
PLUS : '+';
TIMES : '*';
NUMBER : [0-9]+'.'?[0-9]*;
VARIABLE : [(a-z)|(A-Z)]+;
OPENING : '(';
CLOSING : ')';
WHITESPACE : [ \t\r\n]+ -> skip ;
The correct grouping will happen automatically if you define your lower-priority operations closer to your "root expression" rule than the higher-priority ones:
expr
: e=multDivExpr
( PLUS e=multDivExpr
| MINUS e=multDivExpr
)*
;
multDivExpr
: e=atomExpr
( TIMES e=atom
| DIV e=atom
| REM e=atom
)*
;
atom
: NUMBER
| VARIABLE
| OPENING e=expr CLOSING
;
A simple way to understand what's going on is to think that the recursive descent parser generated by ANTLR will use multDivExpr non-terminals as "building blocks" for the "additive" expr non-terminal, therefore applying the grouping to multiplication and division before considering addition and subtraction.

How to get the AST of a regular expression string?

How can I get the abstract syntax tree (AST) of a regular expression (in C++)?
For example,
(XYZ)|(123)
should yield a tree of:
|
/ \
. .
/ \ / \
. Z . 3
/ \ / \
X Y 1 2
Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?
I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.
So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:
namespace ast
{
struct multiplicity
{
unsigned minoccurs;
boost::optional<unsigned> maxoccurs;
bool greedy;
multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1)
: minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
{ }
bool unbounded() const { return !maxoccurs; }
bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
};
struct charset
{
bool negated;
using range = boost::tuple<char, char>; // from, till
using element = boost::variant<char, range>;
std::set<element> elements;
// TODO: single set for loose elements, simplify() method
};
struct start_of_match {};
struct end_of_match {};
struct any_char {};
struct group;
typedef boost::variant< // unquantified expression
start_of_match,
end_of_match,
any_char,
charset,
std::string, // literal
boost::recursive_wrapper<group> // sub expression
> simple;
struct atom // quantified simple expression
{
simple expr;
multiplicity mult;
};
using sequence = std::vector<atom>;
using alternative = std::vector<sequence>;
using regex = boost::variant<atom, sequence, alternative>;
struct group {
alternative root;
group() = default;
group(alternative root) : root(std::move(root)) { }
};
}
This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).
The grammar ended up being only slightly longer:
template <typename It>
struct parser : qi::grammar<It, ast::alternative()>
{
parser() : parser::base_type(alternative)
{
using namespace qi;
using phx::construct;
using ast::multiplicity;
alternative = sequence % '|';
sequence = *atom;
simple =
(group)
| (charset)
| ('.' >> qi::attr(ast::any_char()))
| ('^' >> qi::attr(ast::start_of_match()))
| ('$' >> qi::attr(ast::end_of_match()))
// optimize literal tree nodes by grouping unquantified literal chars
| (as_string [ +(literal >> !char_("{?+*")) ])
| (as_string [ literal ]) // lone char/escape + explicit_quantifier
;
atom = (simple >> quantifier); // quantifier may be implicit
explicit_quantifier =
// bounded ranges:
lit('?') [ _val = construct<multiplicity>( 0, 1) ]
| ('{' >> uint_ >> '}' ) [ _val = construct<multiplicity>(_1, _1) ]
// repeating ranges can be marked non-greedy:
| (
lit('+') [ _val = construct<multiplicity>( 1, boost::none) ]
| lit('*') [ _val = construct<multiplicity>( 0, boost::none) ]
| ('{' >> uint_ >> ",}") [ _val = construct<multiplicity>(_1, boost::none) ]
| ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ]
| ("{," >> uint_ >> '}' ) [ _val = construct<multiplicity>( 0, _1) ]
) >> -lit('?') [ phx::bind(&multiplicity::greedy, _val) = false ]
;
quantifier = explicit_quantifier | attr(ast::multiplicity());
charset = '['
>> (lit('^') >> attr(true) | attr(false)) // negated
>> *(range | charset_el)
> ']'
;
range = charset_el >> '-' >> charset_el;
group = '(' >> alternative >> ')';
literal = unescape | ~char_("\\+*?.^$|{()") ;
unescape = ('\\' > char_);
// helper to optionally unescape waiting for raw ']'
charset_el = !lit(']') >> (unescape|char_);
}
private:
qi::rule<It, ast::alternative()> alternative;
qi::rule<It, ast::sequence()> sequence;
qi::rule<It, ast::atom()> atom;
qi::rule<It, ast::simple()> simple;
qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier;
qi::rule<It, ast::charset()> charset;
qi::rule<It, ast::charset::range()> range;
qi::rule<It, ast::group()> group;
qi::rule<It, char()> literal, unescape, charset_el;
};
Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:
int main()
{
std::cout << "digraph common {\n";
for (std::string pattern: {
"abc?",
"ab+c",
"(ab)+c",
"[^-a\\-f-z\"\\]aaaa-]?",
"abc|d",
"a?",
".*?(a|b){,9}?",
"(XYZ)|(123)",
})
{
std::cout << "// ================= " << pattern << " ========\n";
ast::regex tree;
if (doParse(pattern, tree))
{
check_roundtrip(tree, pattern);
regex_todigraph printer(std::cout, pattern);
boost::apply_visitor(printer, tree);
}
}
std::cout << "}\n";
}
This program results in the following graphs:
The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:
I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.
Full code at this gist: https://gist.github.com/sehe/8678988
I reckon that Boost Xpressive must be able to 'almost' do this out of the box.
xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. Regular expressions can refer to each other and to themselves recursively, allowing you to build arbitrarily complicated grammars out of them.
I'll see whether I can confirm (with a small sample).
Other thoughts include using Boost Spirit with the generic utree facility to 'store' the AST. You'd have to reproduce a grammar (which is relatively simple for common subsets of Regex syntax), so it might mean more work.
Progress Report 1
Looking at Xpressive, I made some inroads. I got some pretty pictures using DDD's great graphical data display. But not pretty enough.
Then I explored the 'code' side more: Xpressive is built upon Boost Proto. It uses Proto to define a DSEL that models regular expressions directly in C++ code.
Proto generates the expression tree (generic AST, if you will) completely generically from C++ code (by overloading all possible operators). The library (Xpressive, in this case) then needs to define the semantics by walking the tree and e.g.
building a domain specific expression tree
annotating/decorating it with semantic information
possibly taking semantic action directly (e.g. how Boost Spirit does semantic actions in Qi and Karma1)
As you can see, the sky is really the limit there, and things are looking disturbingly similar to compiler macros like in Boo, Nemerle, Lisp etc.
Visualizing Expression Trres
Now, Boost Proto expression trees can be generically visualized:
Working from the example from Expressive C++: Playing with Syntax I slightly extended Xpressive's "Hello World" example to display the expression tree:
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
#include <boost/proto/proto.hpp>
using namespace boost::xpressive;
int main()
{
std::string hello( "hello world!" );
sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
// equivalent proto based expression
rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';
boost::proto::display_expr( (s1= +_w) >> ' ' >> (s2= +_w) >> '!');
smatch what;
if( regex_match( hello, what, rex ) )
{
std::cout << what[0] << '\n'; // whole match
std::cout << what[1] << '\n'; // first capture
std::cout << what[2] << '\n'; // second capture
}
return 0;
}
The output of which is close to (note the compiler ABI specific typeid names):
shift_right(
shift_right(
shift_right(
assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
, terminal( )
)
, assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
)
, terminal(!)
)
hello world!
hello
world
DISCLAIMER You should realize that this is not actually displaying the Regex AST, but rather the generic expression tree from Proto, so it is devoid of domain specific (Regex) information. I mention it because the difference is likely going to cause some more work (? unless I find a hook into Xpressive's compilation structures) for it to become truly useful for the original question.
That's it for now
I'll leave on that note, as it's lunch time and I'm picking up the kids, but this certainly grabbed my interest, so I intend to post more later!
Conclusions / Progress Report 1.0000001
The bad news right away: it won't work.
Here's why. That disclaimer was right on the money. When weekend arrived, I had already been thinking things through a bit more and 'predicted' that the whole thing would break down right where I left it off: the AST is being based on the proto expression tree (not the regex matchable_ex).
This fact was quickly confirmed after some code inspection: after compilation, the proto expression tree isn't available anymore to be displayed. Let alone when the basic_regex was specified as a dynamic pattern in the first place (there never was a proto expression for it).
I had been half hoping that the matching had been implemented directly on the proto expression tree (using proto evalutation/evaluation contexts), but quickly confirmed out that this is not the case.
So, the main takeaway is:
this is all not going to work for displaying any regex AST
the best you can do with the above is visualize a proto expression, that you'll necessarily have to create directly in your code. That is a fancy way of just writing the AST manually in that same code...
Slightly less strict observations include
Boost Proto and Boost Expressive are highly interesting libraries (I didn't mind going fishing in there). I have obviously learned a few important lessons about template meta-programming libraries, and these libraries in particular.
It is hard to design a regex parser that builds a statically typed expression tree. In fact it is impossible in the general case - it would require the compiler to instantiate all possible expressions tree combinations to a certain depth. This would obviously not scale. You could get around that by introducing polymorphic composition and using polymorphic invocation, but this would remove the benefits of template metaprogramming (compile-time optimization for the statically instantiated types/specializations).
Both Boost Regex and Boost Expressive will likely support some kind of regex AST internally (to support the matching evaluation) but
it hasn't been exposed/documented
there is no obvious display facility for that
1 Even Spirit Lex supports them, for that matter (but not by default)
boost::regex seems to have a hand-written recursive-descent parser in basic_regex_parser.hpp. Even though it feels awfully like re-inventing the wheel, you are probably faster when writing up the grammar in boost::spirit yourself, especially with the multitude of regex formats around.

Implementing "NOT" in boost::spirit mini_c

I tried to modify the mini_c example of boost::spirit to match to my existing vocabulary.
I therefore added a operator "NOT that should behave equal as "!":
unary_expr =
primary_expr
| ("NOT" > primary_expr [op(op_not)]) // This does not work
| ('!' > primary_expr [op(op_not)])
| ('-' > primary_expr [op(op_neg)])
| ('+' > primary_expr)
;
I can compile the modified source code, but when i try to execute it it fails to parse. How can i solve this?
EDIT:
As my want to access external variables, i had made another modification in order to build a list of these variables when compiling:
identifier %=
raw[lexeme[alpha >> *(alnum | 'ยง' | '_' | '.' | '-' )]]
;
variable =
identifier [add_var(_1)]
;
Where add_var and identifier are defined as
rule<Iterator, std::string(), white_space> identifier;
function<var_adder> add_var;
If i don't use this modification, "NOT" can be used. With the modification, using "NOT" generates a parsing error.
EDIT 2:
The following conditional expressions do work though:
logical_expr =
relational_expr
>> *( ("AND" > relational_expr [op(op_and)])
| ("OR" > relational_expr [op(op_or)])
)
;
With your change the small test:
int main()
{
return NOT 1;
}
parses successfully and returns 0. So it is not obvious to me what doesn't work for you. Could you provide a failing input example as well, please?

Parsing string, with Boost Spirit 2, to fill data in user defined struct

I'm using Boost.Spirit which was distributed with Boost-1.42.0 with VS2005. My problem is like this.
I've this string which was delimted with commas. The first 3 fields of it are strings and rest are numbers. like this.
String1,String2,String3,12.0,12.1,13.0,13.1,12.4
My rule is like this
qi::rule<string::iterator, qi::skip_type> stringrule = *(char_ - ',')
qi::rule<string::iterator, qi::skip_type> myrule= repeat(3)[*(char_ - ',') >> ','] >> (double_ % ',') ;
I'm trying to store the data in a structure like this.
struct MyStruct
{
vector<string> stringVector ;
vector<double> doubleVector ;
} ;
MyStruct var ;
I've wrapped it in BOOST_FUSION_ADAPT_STRUCTURE to use it with spirit.
BOOST_FUSION_ADAPT_STRUCT (MyStruct, (vector<string>, stringVector) (vector<double>, doubleVector))
My parse function parses the line and returns true and after
qi::phrase_parse (iterBegin, iterEnd, myrule, boost::spirit::ascii::space, var) ;
I'm expecting var.stringVector and var.doubleVector are properly filled. but it is not the case.
What is going wrong ?
The code sample is located here
Thanks in advance,
Surya
qi::skip_type is not something you could use a skipper. qi::skip_type is the type of the placeholder qi::skip, which is applicable for the skip[] directive only (to enable skipping inside a lexeme[] or to change skipper in use) and which is not a parser component matching any input on its own. You need to specify your specific skipper type instead (in your case that's boost::spirit::ascii:space_type).
Moreover, in order for your rules to return the parsed attribute, you need to specify the type of the expected attribute while defining your rule. That leaves you with:
qi::rule<string::iterator, std::string(), ascii:space_type>
stringrule = *(char_ - ',');
qi::rule<string::iterator, MyStruct(), ascii:space_type>
myrule = repeat(3)[*(char_ - ',') >> ','] >> (double_ % ',');
which should do exactly what you expect.