my parser is nearly working :)
(still amazed by Spirit feature set (and compiletimes) and the very welcoming community here on stack overflow)
small sample for online try:
http://coliru.stacked-crooked.com/a/1c1bf88909dce7e3
So I've learned to use more lexeme rules and to avoid no_skip.
My rules are smaller and easier to read as a result, but now I'm stuck
combining lexeme rules with skipping rules, which seems not to be possible (compile-time error with a warning about something not being castable to Skipper).
My problem is the comma-separated list in subscriptions,
which does not skip spaces around expressions.
parses:
"a.b[a,b]"
fails:
"a.b[ a , b ]"
these are my rules:
qi::rule<std::string::const_iterator, std::string()> identifier_chain;
qi::rule<std::string::const_iterator, std::string()>
expression_list = identifier_chain >> *(qi::char_(',') >> identifier_chain);
qi::rule < std::string::const_iterator, std::string() >
subscription = qi::char_('[') >> expression_list >> qi::char_(']');
qi::rule<std::string::const_iterator, std::string()>
identifier = qi::ascii::alpha >> *(qi::ascii::alnum | '_');
identifier_chain = identifier >> *(('.' >> identifier) | subscription);
As you can see, all rules are "lexeme", and I think the subscription rule should use an ascii::space_type skipper, but that does not compile.
Should I add space eaters before and after the identifier_chains in the expression_list?
Feels like writing a regex :(
expression_list = *qi::blank >> identifier_chain >> *(*qi::blank >> qi::char_(',') >> *qi::blank >> identifier_chain >> *qi::blank);
It works, but I've read that this will lead to a much bigger parser in the end (handling all the space skipping myself).
thx for any advice
btw: any idea why it doesn't compile when I replace the '.' in identifier_chain with qi::char_('.')?
identifier_chain = identifier >> *(('.' >> identifier) | subscription);
UPDATE:
I've updated my expression list as suggested by sehe:
qi::rule<std::string::const_iterator, spirit::ascii::blank_type, std::string()>
expression_list = identifier_chain >> *(qi::char_(',') >> identifier_chain);
qi::rule < std::string::const_iterator, std::string() >
subscription = qi::char_('[') >> qi::skip(qi::blank)[expression_list] >> qi::char_(']');
but I still get a compile error due to a non-castable Skipper: http://coliru.stacked-crooked.com/a/adcf665742b055dd
I also tried changing identifier_chain to
identifier_chain = identifier >> *(('.' >> identifier) | qi::skip(qi::blank)[subscription]);
but I still can't compile the example.
The answer I linked to earlier describes all the combinations (if I remember correctly): Boost spirit skipper issues
In short:
any rule that declares a skipper (so rule<It, Skipper[, Attr()]> or rule<It, Attr(), Skipper>) MUST be invoked with a compatible skipper (an expression that can be assigned to the type of Skipper).
any rule that does NOT declare a skipper (so of the form rule<It[, Attr()]>) will implicitly behave like a lexeme, meaning no input characters are skipped.
That's it. The slightly subtler ramifications are that given two rules:
rule<It, blank_type> a;
rule<It> b; // b is implicitly lexeme
You can invoke b from a:
a = "test" >> b;
But when you wish to invoke a from b you will find that you have to provide the skipper:
b = "oops" >> a; // DOES NOT COMPILE
b = "okay" >> qi::skip(qi::blank) [ a ];
That's almost all there is to it. There are a few more directives around skippers and lexemes in Qi, see again the answer linked above.
Side Question:
Should I add space eaters before and after the identifier_chains in the expression_list?
If you look closely at the answer example here Parse a '.' chained identifier list, with qi::lexeme and prevent space skipping, you can see that it already does pre- and post skipping correctly, because I used phrase_parse:
" a.b " OK: ( "a" "b" )
----
"a . b" Failed
Remaining unparsed: "a . b"
----
You COULD also wrap the whole thing in an "outer" rule:
rule<std::string::const_iterator> main_rule =
qi::skip(qi::blank) [ identifier_chain ];
That's just the same but allows users to call parse without specifying the skipper.
I am trying to parse a string into a struct using boost spirit x3:
struct identifier {
std::vector<std::string> namespaces;
std::vector<std::string> classes;
std::string identifier;
};
Now I have a parser rule to match strings like these:
foo::bar::baz.bla.blub
foo.bar
boo::bar
foo
my parser rule looks like this.
auto const nested_identifier_def =
x3::lexeme[
-(id_string % "::")
>> -(id_string % ".")
>> id_string
];
where id_string parses combinations of alphanum.
I know this rule doesn't parse the way I want, because when parsing foo.bar, for example, the -(id_string % ".") part of the rule consumes the whole string.
How can I change the rule so that it parses correctly into the struct?
Assuming your id_string is something like this:
auto const id_string = x3::rule<struct id_string_tag, std::string>{} =
x3::lexeme[
(x3::alpha | '_')
>> *(x3::alnum | '_')
];
then I think this is what you're after:
auto const nested_identifier_def =
*(id_string >> "::")
>> *(id_string >> '.')
>> id_string;
Online Demo
The issue is that p % delimit is shorthand for p >> *(delimit >> p), i.e. it always consumes one p after the delimiter. However what you want is *(p >> delimit) so that no p is consumed after the delimiter and is instead left for the next rule.
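To make the backtracking concrete, here is a hand-rolled sketch of what *(p >> delimit) does, written in plain C++ without Spirit. The names identifier_parts, match_id, match_prefixes and parse_nested_identifier are made up for illustration, and the struct only mirrors the one in the question:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct identifier_parts {  // hypothetical mirror of the question's struct
    std::vector<std::string> namespaces;
    std::vector<std::string> classes;
    std::string name;
};

// Matches (alpha | '_') >> *(alnum | '_') at pos; advances pos only on success.
bool match_id(std::string const& s, std::size_t& pos, std::string& out) {
    auto is_start = [](unsigned char c) { return std::isalpha(c) || c == '_'; };
    auto is_rest  = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
    std::size_t p = pos;
    if (p >= s.size() || !is_start(s[p])) return false;
    ++p;
    while (p < s.size() && is_rest(s[p])) ++p;
    out = s.substr(pos, p - pos);
    pos = p;
    return true;
}

// Matches *(id >> delim): an id is kept only if the delimiter follows it.
// Otherwise pos is rewound, so the id is left for the next part of the grammar.
void match_prefixes(std::string const& s, std::size_t& pos,
                    std::string const& delim, std::vector<std::string>& out) {
    for (;;) {
        std::size_t save = pos;
        std::string id;
        if (match_id(s, pos, id) && s.compare(pos, delim.size(), delim) == 0) {
            pos += delim.size();
            out.push_back(id);
        } else {
            pos = save;  // backtrack: nothing consumed by the failed attempt
            return;
        }
    }
}

// Mirrors *(id >> "::") >> *(id >> '.') >> id and requires a full match.
bool parse_nested_identifier(std::string const& s, identifier_parts& out) {
    std::size_t pos = 0;
    match_prefixes(s, pos, "::", out.namespaces);
    match_prefixes(s, pos, ".", out.classes);
    return match_id(s, pos, out.name) && pos == s.size();
}
```

For foo.bar the namespace loop reads foo, sees no "::" and rewinds, so foo ends up in classes and bar in name.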
I'm a newbie with Boost. I have a string delimited with tabs ( '\t' ).
How can I parse it with boost::spirit?
parser code from boost's samples
The boost sample code isn't the same as the actual boost sample, which was comma delimited, so presumably there are your modifications?
The ascii::space parser will handle the tabs for you as delimiters, so something like:
start %=
lit("employee")
>> '{'
>> int_
>> quoted_string
>> quoted_string
>> double_
>> '}'
;
That should work (without the lit('\t')). Note, however, that the skipper will also accept other whitespace characters between the terms (e.g. spaces and newlines), not just tabs.
If you actually need there to explicitly be single tabs ONLY between the terms, then leave in the lit('\t') and wrap it in a lexeme[] to disable skipping by the skip parser.
How can I get the abstract syntax tree (AST) of a regular expression (in C++)?
For example,
(XYZ)|(123)
should yield a tree of:
|
/ \
. .
/ \ / \
. Z . 3
/ \ / \
X Y 1 2
Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?
I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.
So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:
namespace ast
{
struct multiplicity
{
unsigned minoccurs;
boost::optional<unsigned> maxoccurs;
bool greedy;
multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1)
: minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
{ }
bool unbounded() const { return !maxoccurs; }
bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
};
struct charset
{
bool negated;
using range = boost::tuple<char, char>; // from, till
using element = boost::variant<char, range>;
std::set<element> elements;
// TODO: single set for loose elements, simplify() method
};
struct start_of_match {};
struct end_of_match {};
struct any_char {};
struct group;
typedef boost::variant< // unquantified expression
start_of_match,
end_of_match,
any_char,
charset,
std::string, // literal
boost::recursive_wrapper<group> // sub expression
> simple;
struct atom // quantified simple expression
{
simple expr;
multiplicity mult;
};
using sequence = std::vector<atom>;
using alternative = std::vector<sequence>;
using regex = boost::variant<atom, sequence, alternative>;
struct group {
alternative root;
group() = default;
group(alternative root) : root(std::move(root)) { }
};
}
This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).
The grammar ended up being only slightly longer:
template <typename It>
struct parser : qi::grammar<It, ast::alternative()>
{
parser() : parser::base_type(alternative)
{
using namespace qi;
using phx::construct;
using ast::multiplicity;
alternative = sequence % '|';
sequence = *atom;
simple =
(group)
| (charset)
| ('.' >> qi::attr(ast::any_char()))
| ('^' >> qi::attr(ast::start_of_match()))
| ('$' >> qi::attr(ast::end_of_match()))
// optimize literal tree nodes by grouping unquantified literal chars
| (as_string [ +(literal >> !char_("{?+*")) ])
| (as_string [ literal ]) // lone char/escape + explicit_quantifier
;
atom = (simple >> quantifier); // quantifier may be implicit
explicit_quantifier =
// bounded ranges:
lit('?') [ _val = construct<multiplicity>( 0, 1) ]
| ('{' >> uint_ >> '}' ) [ _val = construct<multiplicity>(_1, _1) ]
// repeating ranges can be marked non-greedy:
| (
lit('+') [ _val = construct<multiplicity>( 1, boost::none) ]
| lit('*') [ _val = construct<multiplicity>( 0, boost::none) ]
| ('{' >> uint_ >> ",}") [ _val = construct<multiplicity>(_1, boost::none) ]
| ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ]
| ("{," >> uint_ >> '}' ) [ _val = construct<multiplicity>( 0, _1) ]
) >> -lit('?') [ phx::bind(&multiplicity::greedy, _val) = false ]
;
quantifier = explicit_quantifier | attr(ast::multiplicity());
charset = '['
>> (lit('^') >> attr(true) | attr(false)) // negated
>> *(range | charset_el)
> ']'
;
range = charset_el >> '-' >> charset_el;
group = '(' >> alternative >> ')';
literal = unescape | ~char_("\\+*?.^$|{()") ;
unescape = ('\\' > char_);
// helper to optionally unescape waiting for raw ']'
charset_el = !lit(']') >> (unescape|char_);
}
private:
qi::rule<It, ast::alternative()> alternative;
qi::rule<It, ast::sequence()> sequence;
qi::rule<It, ast::atom()> atom;
qi::rule<It, ast::simple()> simple;
qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier;
qi::rule<It, ast::charset()> charset;
qi::rule<It, ast::charset::range()> range;
qi::rule<It, ast::group()> group;
qi::rule<It, char()> literal, unescape, charset_el;
};
Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:
int main()
{
std::cout << "digraph common {\n";
for (std::string pattern: {
"abc?",
"ab+c",
"(ab)+c",
"[^-a\\-f-z\"\\]aaaa-]?",
"abc|d",
"a?",
".*?(a|b){,9}?",
"(XYZ)|(123)",
})
{
std::cout << "// ================= " << pattern << " ========\n";
ast::regex tree;
if (doParse(pattern, tree))
{
check_roundtrip(tree, pattern);
regex_todigraph printer(std::cout, pattern);
boost::apply_visitor(printer, tree);
}
}
std::cout << "}\n";
}
This program results in the following graphs:
The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:
I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.
Full code at this gist: https://gist.github.com/sehe/8678988
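For readers who only want the idea behind the DOT output step, a much-simplified sketch could look like this. It does not use the ast types above or the code from the gist: dot_node, emit_dot and to_dot are hypothetical names, and labels are assumed not to need escaping:

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct dot_node {  // hypothetical, far simpler than ast::atom
    std::string label;
    bool repeating = false;  // draw a self-edge when true
    bool greedy = true;      // red self-edge if greedy, blue otherwise
    std::vector<dot_node> children;
};

// Writes the node and its subtree; returns the numeric id assigned to the node.
int emit_dot(std::ostream& os, dot_node const& n, int& next_id) {
    int id = next_id++;
    os << "  n" << id << " [label=\"" << n.label << "\"];\n";
    if (n.repeating)
        os << "  n" << id << " -> n" << id
           << " [color=" << (n.greedy ? "red" : "blue") << "];\n";
    for (auto const& child : n.children) {
        int child_id = emit_dot(os, child, next_id);
        os << "  n" << id << " -> n" << child_id << ";\n";
    }
    return id;
}

// Wraps the tree in a digraph block that can be fed to Graphviz.
std::string to_dot(dot_node const& root) {
    std::ostringstream os;
    int next_id = 0;
    os << "digraph regex {\n";
    emit_dot(os, root, next_id);
    os << "}\n";
    return os.str();
}
```

The returned string can be written to a file and rendered with the dot tool.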
I reckon that Boost Xpressive must be able to 'almost' do this out of the box.
xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. Regular expressions can refer to each other and to themselves recursively, allowing you to build arbitrarily complicated grammars out of them.
I'll see whether I can confirm (with a small sample).
Other thoughts include using Boost Spirit with the generic utree facility to 'store' the AST. You'd have to reproduce a grammar (which is relatively simple for common subsets of Regex syntax), so it might mean more work.
Progress Report 1
Looking at Xpressive, I made some inroads. I got some pretty pictures using DDD's great graphical data display. But not pretty enough.
Then I explored the 'code' side more: Xpressive is built upon Boost Proto. It uses Proto to define a DSEL that models regular expressions directly in C++ code.
Proto generates the expression tree (generic AST, if you will) completely generically from C++ code (by overloading all possible operators). The library (Xpressive, in this case) then needs to define the semantics by walking the tree and e.g.
building a domain specific expression tree
annotating/decorating it with semantic information
possibly taking semantic action directly (e.g. how Boost Spirit does semantic actions in Qi and Karma [1])
As you can see, the sky is really the limit there, and things are looking disturbingly similar to compiler macros like in Boo, Nemerle, Lisp etc.
Visualizing Expression Trees
Now, Boost Proto expression trees can be generically visualized:
Working from the example from Expressive C++: Playing with Syntax I slightly extended Xpressive's "Hello World" example to display the expression tree:
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
#include <boost/proto/proto.hpp>
using namespace boost::xpressive;
int main()
{
std::string hello( "hello world!" );
sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
// equivalent proto based expression
rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';
boost::proto::display_expr( (s1= +_w) >> ' ' >> (s2= +_w) >> '!');
smatch what;
if( regex_match( hello, what, rex ) )
{
std::cout << what[0] << '\n'; // whole match
std::cout << what[1] << '\n'; // first capture
std::cout << what[2] << '\n'; // second capture
}
return 0;
}
The output of which is close to (note the compiler ABI specific typeid names):
shift_right(
shift_right(
shift_right(
assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
, terminal( )
)
, assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
)
, terminal(!)
)
hello world!
hello
world
DISCLAIMER You should realize that this is not actually displaying the Regex AST, but rather the generic expression tree from Proto, so it is devoid of domain specific (Regex) information. I mention it because the difference is likely going to cause some more work (? unless I find a hook into Xpressive's compilation structures) for it to become truly useful for the original question.
That's it for now
I'll leave on that note, as it's lunch time and I'm picking up the kids, but this certainly grabbed my interest, so I intend to post more later!
Conclusions / Progress Report 1.0000001
The bad news right away: it won't work.
Here's why. That disclaimer was right on the money. When the weekend arrived, I had already been thinking things through a bit more and 'predicted' that the whole thing would break down right where I left it off: the AST is being based on the proto expression tree (not the regex matchable_ex).
This fact was quickly confirmed after some code inspection: after compilation, the proto expression tree isn't available anymore to be displayed. Let alone when the basic_regex was specified as a dynamic pattern in the first place (there never was a proto expression for it).
I had been half hoping that the matching had been implemented directly on the proto expression tree (using proto evaluation contexts), but quickly found out that this is not the case.
So, the main takeaway is:
this is all not going to work for displaying any regex AST
the best you can do with the above is visualize a proto expression, that you'll necessarily have to create directly in your code. That is a fancy way of just writing the AST manually in that same code...
Slightly less strict observations include
Boost Proto and Boost Xpressive are highly interesting libraries (I didn't mind going fishing in there). I have obviously learned a few important lessons about template meta-programming libraries, and these libraries in particular.
It is hard to design a regex parser that builds a statically typed expression tree. In fact it is impossible in the general case: it would require the compiler to instantiate all possible expression tree combinations to a certain depth. This would obviously not scale. You could get around that by introducing polymorphic composition and using polymorphic invocation, but this would remove the benefits of template metaprogramming (compile-time optimization for the statically instantiated types/specializations).
Both Boost Regex and Boost Xpressive will likely support some kind of regex AST internally (to support the matching evaluation) but
it hasn't been exposed/documented
there is no obvious display facility for that
[1] Even Spirit Lex supports them, for that matter (but not by default)
boost::regex seems to have a hand-written recursive-descent parser in basic_regex_parser.hpp. Even though it feels awfully like re-inventing the wheel, you are probably faster when writing up the grammar in boost::spirit yourself, especially with the multitude of regex formats around.
I'm using the Boost.Spirit that was distributed with Boost 1.42.0, with VS2005. My problem is as follows.
I have a string that is delimited with commas. The first 3 fields are strings and the rest are numbers, like this:
String1,String2,String3,12.0,12.1,13.0,13.1,12.4
My rule is like this
qi::rule<string::iterator, qi::skip_type> stringrule = *(char_ - ',')
qi::rule<string::iterator, qi::skip_type> myrule= repeat(3)[*(char_ - ',') >> ','] >> (double_ % ',') ;
I'm trying to store the data in a structure like this.
struct MyStruct
{
vector<string> stringVector ;
vector<double> doubleVector ;
} ;
MyStruct var ;
I've wrapped it in BOOST_FUSION_ADAPT_STRUCT to use it with Spirit.
BOOST_FUSION_ADAPT_STRUCT (MyStruct, (vector<string>, stringVector) (vector<double>, doubleVector))
My parse function parses the line and returns true. After
qi::phrase_parse (iterBegin, iterEnd, myrule, boost::spirit::ascii::space, var) ;
I expect var.stringVector and var.doubleVector to be properly filled, but that is not the case.
What is going wrong ?
The code sample is located here
Thanks in advance,
Surya
qi::skip_type is not something you can use as a skipper. qi::skip_type is the type of the placeholder qi::skip, which is applicable to the skip[] directive only (to enable skipping inside a lexeme[] or to change the skipper in use) and which is not a parser component matching any input on its own. You need to specify your actual skipper type instead (in your case that's boost::spirit::ascii::space_type).
Moreover, in order for your rules to return the parsed attribute, you need to specify the type of the expected attribute while defining your rule. That leaves you with:
qi::rule<string::iterator, std::string(), ascii::space_type>
stringrule = *(char_ - ',');
qi::rule<string::iterator, MyStruct(), ascii::space_type>
myrule = repeat(3)[*(char_ - ',') >> ','] >> (double_ % ',');
which should do exactly what you expect.