I am attempting to parse comma separated integers, with possible blanks. For instance, 1,2,,3,,-1 should be parsed as {1,2,n,3,n,-1} where is n is some constant.
The expression,
(int_ | eps) % ','
works when n == 0. More specifically, the following code works special cased for 0:
#include <boost/spirit/include/qi.hpp>
#include <iostream>
int main() {
using namespace boost::qi;
std::vector<int> v;
std::string s("1,2,,3,4,,-1");
phrase_parse(s.begin(), s.end(),
(int_|eps) % ','
, space, v);
}
I tried the following expression for arbitrary n:
(int_ | eps[_val = 3]) % ','
But apparently this is wrong. The compiler generates an error novel. I refrain from pasting all that here, as most likely what I am trying is incorrect (rather than specific compiler issues).
What would be the right way?
Nick
The attr() parser exists for this purpose:
(int_ | attr(3)) % ','
Related
I'm writting a recursive descent parser LL(1) in C++, but I have a problem because I don't know exactly how to get the next token. I know I have to use regular expressions for getting a terminal but I don't know how to get the largest next token.
For example, this lexical and this grammar (without left recursion, left factoring and without cycles):
//LEXICAL IN FLEX
TIME [0-9]+
DIRECTION UR|DR|DL|UL|U|D|L|R
ACTION A|J|M
%%
{TIME} {printf("TIME"); return (TIME);}
{DIRECTION} {printf("DIRECTION"); return (DIRECTION);}
{ACTION} {printf("ACTION"); return (ACTION);}
"~" {printf("RELEASED"); return (RELEASED);}
"+" {printf("PLUS_OP"); return (PLUS_OP);}
"*" {printf("COMB_OP"); return (COMB_OP);}
//GRAMMAR IN BISON
command : list_move PLUS_OP list_action
| list_move COMB_OP list_action
| list_move list_action
| list_move
| list_action
;
list_move: move list_move_prm
;
list_move_prm: move
| move list_move_prm
| ";"
;
list_action: ACTION list_action_prm
;
list_action_prm: PLUS_OP ACTION list_action_prm
| COMB_OP ACTION list_action_prm
| ACTION list_action_prm
| ";" //epsilon
;
move: TIME RELEASED DIRECTION
| RELEASED DIRECTION
| DIRECTION
;
I have a string that contains: "D DR R + A" it should validate it, but getting "DR" I have problems because "D" it's a token too, I don't know how to get "DR" instead "D".
There are a number of ways of hand-writing a tokenizer
you can use a recusive descent LL(1) parser "all the way down" -- rewrite your grammar in terms of single characters rather than tokens, and left factor it. Then your nextToken() routine becomes just getchar(). You'll end up with additional rules like:
TIME: DIGIT more_digits ;
more_digits: /* epsilon */ | DIGIT more_digits ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R' ;
dir_suffix: /* epsilon */ | 'L' | 'R' ;
You can use regexes. Generally this means keeping around a buffer and reading the input into it. nextToken() then runs a series of regexes on the buffer, figuring out which one returns the longest token and returns that, advancing the buffer as needed.
You can do what flex does -- this is the buffer approach above, combined with building a single DFA that evaluates all of the regexes simultaneously. Running this DFA on the buffer then returns the longest token (based on the last accepting state reached before getting an error).
Note that in all cases, you'll need to consider how to handle whitespace as well. You can just ignore whitespace everywhere (FORTRAN style) or you can allow whitespace between some tokens, but not others (eg, not between the digits of TIME or within a DIRECTION, but everywhere else in the grammar). This can make the grammar much more complex (and the process of hand-writing the recursive descent parser much more tedious).
“I don't know exactly how to get the next token”
Your input comes from a stream (std::istream). You must write a get_token(istream) function (or a tokenizer class). The function must first discard white spaces, then read a character (or more if necessary) analyze it and returns the associated token. The following functions will help you achieve your goal:
ws – discards white-space.
istream::get – reads a character.
istream::putback – puts back in the stream a character (think “undo get”).
"I don't know how to get "DR" instead "D""
Both "D" and "DR" are words. Just read them as you would read a word: is >> word. You will also need a keyword to token map (see std::map). If you read the "D" string, you can ask the map what the associated token is. If not found, throw an exception.
A starting point (run it):
#include <iostream>
#include <iomanip>
#include <map>
#include <string>
enum token_t
{
END,
PLUS,
NUMBER,
D,
DR,
R,
A,
// ...
};
// ...
using keyword_to_token_t = std::map < std::string, token_t >;
keyword_to_token_t kwtt =
{
{"A", A},
{"D", D},
{"R", R},
{"DR", DR}
// ...
};
// ...
std::string s;
int n;
// ...
token_t get_token( std::istream& is )
{
char c;
std::ws( is ); // discard white-space
if ( !is.get( c ) ) // read a character
return END; // failed to read or eof
// analyze the character
switch ( c )
{
case '+': // simple token
return PLUS;
case '0': case '1': // rest of digits
is.putback( c ); // it starts with a digit: it must be a number, so put it back
is >> n; // and let the library to the hard work
return NUMBER;
//...
default: // keyword
is.putback( c );
is >> s;
if ( kwtt.find( s ) == kwtt.end() )
throw "keyword not found";
return kwtt[ s ];
}
}
int main()
{
try
{
while ( get_token( std::cin ) )
;
std::cout << "valid tokens";
}
catch ( const char* e )
{
std::cout << e;
}
}
I have a grammar that works perfectly fine and contains the following lines.
element = container | list | pair;
container = name >> '(' >> -(arg % ',') >> ')' >> '{' >> +element > '}';
// trying to put an expectation operator here --------^
list = name >> '(' > (value % ',') > ')' > ';';
pair = name >> ':' > value > ';';
To have meaningful error messages, I want to make sure that container does not backtrack as soon as it hits '{'. But for some reason, if I replace the sequence operator with an expectation operator right after the '{', I get a huge compiler error. Any ideas what the problem might be?
element is a boost::variant; container, list and pair are own structs with BOOST_FUSION_ADAPT_STRUCT applied. Please have a look here for the full source code: https://github.com/fklemme/liberty_tool/blob/master/src/liberty_grammar.hpp#L24
Yes. Because the precedences of operator>> and operator> aren't equal, the resulting synthesized attribute type is different.
In fact, it is no longer automatically compatible with the intended exposed attribute type.
In this case the problem can be quickly neutralized with some disambiguating parentheses around the sub-expression:
container = name >> '(' >> -(arg % ',') >> ')' >> ('{' > +element > '}');
I'm trying to parse expression grammar (with variables) into an Abstract Syntax Tree (AST) so that later on I could make a use of this AST and calculate values basing on those expressions (they may be a part of a function for example, so there is no need to store those expressions rather than calculate a value right away).
To my surprise, after handling with loops and instructions (which require nested structures in AST as well), I got nothing but seg faults after trying to parse any expression.. After hours of struggling with this I decided to ask here, because I have no idea what is it (maybe something with the grammar)
This statement-loop part works perfectly well. The struct 'loop' gets as a parameter only a number of repetitions - string so far (later on I want to put an expression here):
statement %= loop | inst;
inst %= lexeme[+(char_ - (';'|char_('}')) )] >> ';';
loop = "do(" > lexeme[+(char_ - "){")] // parse a number of loop repetitions
> "){"
> *statement > "}"
;
Structures are like:
typedef boost::variant<
boost::recursive_wrapper<s_loop>
, std::string>
s_statement;
struct s_loop
{
std::string name; // tag name
//s_expression exp; // TODO
std::vector<s_statement> children; // children
};
I use recursive wrapper, so I thought that maybe it is because of "deep" wrapping in case of expression-term-factor, why I can't do it. For loop-statement it goes simply like:
loop --(contains)--> statement (statement may be a loop!)
And in case of expressions it should be finally implemented like:
expression -> term -> factor (factor may be an expression!)
So, to be sure it's because of 'deep' wrapping, I tried with trivial grammar:
expression -> factor (factor may be an expression)
AST structures are copy-paste of above, everything is quite similar and.... it does not work! :(
I am quite sure that it must be something wrong with my grammar.. To be honest, I am not an expert of spirit. Here's the grammar:
expression = factor > * ( (char_('+')|char_('-')) > factor ) ;
factor %= uint_ | my_var | my_dat | my_rec_exp;
// factor %= uint_ | my_var | my_dat; //this WORKS! I've made procedures to traverse an AST
// strings and ints are parsed and stored well inside the expression structure
// factor %= uint_ | my_rec_exp; // even this simple version (of course I adjust a stucture s_expression) doesn't work.. WHY? :( , it's even less complex than loop-statement
my_rec_exp = '(' > expression > ')';
my_var %= char_('!') >> lexeme[+ ( char_ - ( ('+')|char_('-')|char_('*')|char_('/')|char_('(')|char_(')') ) ) ] ;
my_dat %= char_('#') >> lexeme[+ ( char_ - ( ('+')|char_('-')|char_('*')|char_('/')|char_('(')|char_(')') ) ) ] ;
Structures are here:
struct s_expression;
typedef boost::variant<
boost::recursive_wrapper<s_expression>,
// s_expression,
std::string,
unsigned int
>
s_factor;
struct s_term{ // WE DO NOT USE THIS IN THE SIMPLIFIED VERSION
s_factor factor0;
std::vector<std::pair<char, s_factor> >
factors;
};
struct s_expression{
s_factor term0;
std::vector<std::pair<char, s_factor> >
terms;
};
I will say one more time that without recursive expression it works well (parses to en expression containing a set of numers / strings connected with operators + / - ). But if I add expression as a variant of factor it crashes on exec.
Thank you for any advice / suggestion !
While going through the documentation I read that
for a string of doubles separated by a comma we could go like this (which I understand)
double_ >> * (',' >> double_) or double_ %
but what does the following expression mean. Its supposed to split comma separated strings into a vector and it works. I would appreciate it if someone could kindly clarify it. I am confused with - operator I believe its a difference operator but I cant figure out its role here
*(qi::char_ - ',') % ','
*(char_ - ',') means "match zero or more characters but ','", and it can also be written like this: *~char_(","). On the other hand, *char_ means just "match zero or more characters".
To understand, why the exclusion is needed, just try with and without it:
#include <string>
#include <boost/spirit/home/qi.hpp>
int main()
{
using namespace boost::spirit::qi;
std::vector<std::string> out1, out2;
std::string s = "str1, str2, str3";
bool b = parse(s.begin(), s.end(), *~char_(",") % ",", out1); // out1: ["str1", "str2", "str3"]
b = parse(s.begin(), s.end(), *char_ % ",", out2); // out2: ["str1, str2, str3"]
}
qi::char_ - ',' matches all characters but , to prevent the inner expression from being too greedy.
You really need to read EBNF standard to understand Boost.Spirit.
How can I get the abstract syntax tree (AST) of a regular expression (in C++)?
For example,
(XYZ)|(123)
should yield a tree of:
|
/ \
. .
/ \ / \
. Z . 3
/ \ / \
X Y 1 2
Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?
I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.
So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:
namespace ast
{
struct multiplicity
{
unsigned minoccurs;
boost::optional<unsigned> maxoccurs;
bool greedy;
multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1)
: minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
{ }
bool unbounded() const { return !maxoccurs; }
bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
};
struct charset
{
bool negated;
using range = boost::tuple<char, char>; // from, till
using element = boost::variant<char, range>;
std::set<element> elements;
// TODO: single set for loose elements, simplify() method
};
struct start_of_match {};
struct end_of_match {};
struct any_char {};
struct group;
typedef boost::variant< // unquantified expression
start_of_match,
end_of_match,
any_char,
charset,
std::string, // literal
boost::recursive_wrapper<group> // sub expression
> simple;
struct atom // quantified simple expression
{
simple expr;
multiplicity mult;
};
using sequence = std::vector<atom>;
using alternative = std::vector<sequence>;
using regex = boost::variant<atom, sequence, alternative>;
struct group {
alternative root;
group() = default;
group(alternative root) : root(std::move(root)) { }
};
}
This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).
The grammar ended up being only slightly longer:
template <typename It>
struct parser : qi::grammar<It, ast::alternative()>
{
parser() : parser::base_type(alternative)
{
using namespace qi;
using phx::construct;
using ast::multiplicity;
alternative = sequence % '|';
sequence = *atom;
simple =
(group)
| (charset)
| ('.' >> qi::attr(ast::any_char()))
| ('^' >> qi::attr(ast::start_of_match()))
| ('$' >> qi::attr(ast::end_of_match()))
// optimize literal tree nodes by grouping unquantified literal chars
| (as_string [ +(literal >> !char_("{?+*")) ])
| (as_string [ literal ]) // lone char/escape + explicit_quantifier
;
atom = (simple >> quantifier); // quantifier may be implicit
explicit_quantifier =
// bounded ranges:
lit('?') [ _val = construct<multiplicity>( 0, 1) ]
| ('{' >> uint_ >> '}' ) [ _val = construct<multiplicity>(_1, _1) ]
// repeating ranges can be marked non-greedy:
| (
lit('+') [ _val = construct<multiplicity>( 1, boost::none) ]
| lit('*') [ _val = construct<multiplicity>( 0, boost::none) ]
| ('{' >> uint_ >> ",}") [ _val = construct<multiplicity>(_1, boost::none) ]
| ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ]
| ("{," >> uint_ >> '}' ) [ _val = construct<multiplicity>( 0, _1) ]
) >> -lit('?') [ phx::bind(&multiplicity::greedy, _val) = false ]
;
quantifier = explicit_quantifier | attr(ast::multiplicity());
charset = '['
>> (lit('^') >> attr(true) | attr(false)) // negated
>> *(range | charset_el)
> ']'
;
range = charset_el >> '-' >> charset_el;
group = '(' >> alternative >> ')';
literal = unescape | ~char_("\\+*?.^$|{()") ;
unescape = ('\\' > char_);
// helper to optionally unescape waiting for raw ']'
charset_el = !lit(']') >> (unescape|char_);
}
private:
qi::rule<It, ast::alternative()> alternative;
qi::rule<It, ast::sequence()> sequence;
qi::rule<It, ast::atom()> atom;
qi::rule<It, ast::simple()> simple;
qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier;
qi::rule<It, ast::charset()> charset;
qi::rule<It, ast::charset::range()> range;
qi::rule<It, ast::group()> group;
qi::rule<It, char()> literal, unescape, charset_el;
};
Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:
int main()
{
std::cout << "digraph common {\n";
for (std::string pattern: {
"abc?",
"ab+c",
"(ab)+c",
"[^-a\\-f-z\"\\]aaaa-]?",
"abc|d",
"a?",
".*?(a|b){,9}?",
"(XYZ)|(123)",
})
{
std::cout << "// ================= " << pattern << " ========\n";
ast::regex tree;
if (doParse(pattern, tree))
{
check_roundtrip(tree, pattern);
regex_todigraph printer(std::cout, pattern);
boost::apply_visitor(printer, tree);
}
}
std::cout << "}\n";
}
This program results in the following graphs:
The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:
I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.
Full code at this gist: https://gist.github.com/sehe/8678988
I reckon that Boost Xpressive must be able to 'almost' do this out of the box.
xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. Regular expressions can refer to each other and to themselves recursively, allowing you to build arbitrarily complicated grammars out of them.
I'll see whether I can confirm (with a small sample).
Other thoughts include using Boost Spirit with the generic utree facility to 'store' the AST. You'd have to reproduce a grammar (which is relatively simple for common subsets of Regex syntax), so it might mean more work.
Progress Report 1
Looking at Xpressive, I made some inroads. I got some pretty pictures using DDD's great graphical data display. But not pretty enough.
Then I explored the 'code' side more: Xpressive is built upon Boost Proto. It uses Proto to define a DSEL that models regular expressions directly in C++ code.
Proto generates the expression tree (generic AST, if you will) completely generically from C++ code (by overloading all possible operators). The library (Xpressive, in this case) then needs to define the semantics by walking the tree and e.g.
building a domain specific expression tree
annotating/decorating it with semantic information
possibly taking semantic action directly (e.g. how Boost Spirit does semantic actions in Qi and Karma1)
As you can see, the sky is really the limit there, and things are looking disturbingly similar to compiler macros like in Boo, Nemerle, Lisp etc.
Visualizing Expression Trres
Now, Boost Proto expression trees can be generically visualized:
Working from the example from Expressive C++: Playing with Syntax I slightly extended Xpressive's "Hello World" example to display the expression tree:
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
#include <boost/proto/proto.hpp>
using namespace boost::xpressive;
int main()
{
std::string hello( "hello world!" );
sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
// equivalent proto based expression
rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';
boost::proto::display_expr( (s1= +_w) >> ' ' >> (s2= +_w) >> '!');
smatch what;
if( regex_match( hello, what, rex ) )
{
std::cout << what[0] << '\n'; // whole match
std::cout << what[1] << '\n'; // first capture
std::cout << what[2] << '\n'; // second capture
}
return 0;
}
The output of which is close to (note the compiler ABI specific typeid names):
shift_right(
shift_right(
shift_right(
assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
, terminal( )
)
, assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
)
, terminal(!)
)
hello world!
hello
world
DISCLAIMER You should realize that this is not actually displaying the Regex AST, but rather the generic expression tree from Proto, so it is devoid of domain specific (Regex) information. I mention it because the difference is likely going to cause some more work (? unless I find a hook into Xpressive's compilation structures) for it to become truly useful for the original question.
That's it for now
I'll leave on that note, as it's lunch time and I'm picking up the kids, but this certainly grabbed my interest, so I intend to post more later!
Conclusions / Progress Report 1.0000001
The bad news right away: it won't work.
Here's why. That disclaimer was right on the money. When weekend arrived, I had already been thinking things through a bit more and 'predicted' that the whole thing would break down right where I left it off: the AST is being based on the proto expression tree (not the regex matchable_ex).
This fact was quickly confirmed after some code inspection: after compilation, the proto expression tree isn't available anymore to be displayed. Let alone when the basic_regex was specified as a dynamic pattern in the first place (there never was a proto expression for it).
I had been half hoping that the matching had been implemented directly on the proto expression tree (using proto evalutation/evaluation contexts), but quickly confirmed out that this is not the case.
So, the main takeaway is:
this is all not going to work for displaying any regex AST
the best you can do with the above is visualize a proto expression, that you'll necessarily have to create directly in your code. That is a fancy way of just writing the AST manually in that same code...
Slightly less strict observations include
Boost Proto and Boost Expressive are highly interesting libraries (I didn't mind going fishing in there). I have obviously learned a few important lessons about template meta-programming libraries, and these libraries in particular.
It is hard to design a regex parser that builds a statically typed expression tree. In fact it is impossible in the general case - it would require the compiler to instantiate all possible expressions tree combinations to a certain depth. This would obviously not scale. You could get around that by introducing polymorphic composition and using polymorphic invocation, but this would remove the benefits of template metaprogramming (compile-time optimization for the statically instantiated types/specializations).
Both Boost Regex and Boost Expressive will likely support some kind of regex AST internally (to support the matching evaluation) but
it hasn't been exposed/documented
there is no obvious display facility for that
1 Even Spirit Lex supports them, for that matter (but not by default)
boost::regex seems to have a hand-written recursive-descent parser in basic_regex_parser.hpp. Even though it feels awfully like re-inventing the wheel, you are probably faster when writing up the grammar in boost::spirit yourself, especially with the multitude of regex formats around.