How to get the AST of a regular expression string? - c++

How can I get the abstract syntax tree (AST) of a regular expression (in C++)?
For example,
(XYZ)|(123)
should yield a tree of:
       |
     /   \
    .     .
   / \   / \
  .   Z .   3
 / \   / \
X   Y 1   2
Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?

I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.
So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:
namespace ast
{
struct multiplicity
{
unsigned minoccurs;
boost::optional<unsigned> maxoccurs;
bool greedy;
multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1)
: minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
{ }
bool unbounded() const { return !maxoccurs; }
bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
};
struct charset
{
bool negated;
using range = boost::tuple<char, char>; // from, till
using element = boost::variant<char, range>;
std::set<element> elements;
// TODO: single set for loose elements, simplify() method
};
struct start_of_match {};
struct end_of_match {};
struct any_char {};
struct group;
typedef boost::variant< // unquantified expression
start_of_match,
end_of_match,
any_char,
charset,
std::string, // literal
boost::recursive_wrapper<group> // sub expression
> simple;
struct atom // quantified simple expression
{
simple expr;
multiplicity mult;
};
using sequence = std::vector<atom>;
using alternative = std::vector<sequence>;
using regex = boost::variant<atom, sequence, alternative>;
struct group {
alternative root;
group() = default;
group(alternative root) : root(std::move(root)) { }
};
}
This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).
The grammar ended up being only slightly longer:
template <typename It>
struct parser : qi::grammar<It, ast::alternative()>
{
parser() : parser::base_type(alternative)
{
using namespace qi;
using phx::construct;
using ast::multiplicity;
alternative = sequence % '|';
sequence = *atom;
simple =
(group)
| (charset)
| ('.' >> qi::attr(ast::any_char()))
| ('^' >> qi::attr(ast::start_of_match()))
| ('$' >> qi::attr(ast::end_of_match()))
// optimize literal tree nodes by grouping unquantified literal chars
| (as_string [ +(literal >> !char_("{?+*")) ])
| (as_string [ literal ]) // lone char/escape + explicit_quantifier
;
atom = (simple >> quantifier); // quantifier may be implicit
explicit_quantifier =
// bounded ranges:
lit('?') [ _val = construct<multiplicity>( 0, 1) ]
| ('{' >> uint_ >> '}' ) [ _val = construct<multiplicity>(_1, _1) ]
// repeating ranges can be marked non-greedy:
| (
lit('+') [ _val = construct<multiplicity>( 1, boost::none) ]
| lit('*') [ _val = construct<multiplicity>( 0, boost::none) ]
| ('{' >> uint_ >> ",}") [ _val = construct<multiplicity>(_1, boost::none) ]
| ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ]
| ("{," >> uint_ >> '}' ) [ _val = construct<multiplicity>( 0, _1) ]
) >> -lit('?') [ phx::bind(&multiplicity::greedy, _val) = false ]
;
quantifier = explicit_quantifier | attr(ast::multiplicity());
charset = '['
>> (lit('^') >> attr(true) | attr(false)) // negated
>> *(range | charset_el)
> ']'
;
range = charset_el >> '-' >> charset_el;
group = '(' >> alternative >> ')';
literal = unescape | ~char_("\\+*?.^$|{()") ;
unescape = ('\\' > char_);
// helper to optionally unescape waiting for raw ']'
charset_el = !lit(']') >> (unescape|char_);
}
private:
qi::rule<It, ast::alternative()> alternative;
qi::rule<It, ast::sequence()> sequence;
qi::rule<It, ast::atom()> atom;
qi::rule<It, ast::simple()> simple;
qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier;
qi::rule<It, ast::charset()> charset;
qi::rule<It, ast::charset::range()> range;
qi::rule<It, ast::group()> group;
qi::rule<It, char()> literal, unescape, charset_el;
};
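To make the explicit_quantifier rule easier to follow, here is a hand-written sketch of the same decision table using only the standard library. It is not part of the gist: the names Multiplicity, read_uint and parse_quantifier are made up for illustration, and the sketch assumes the same subset as the grammar above ("?" and "{n}" are bounded and never lazy, while "+", "*", "{n,}", "{n,m}" and "{,m}" may carry a trailing "?" to become non-greedy).

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Mirrors ast::multiplicity; std::nullopt for max_occurs means "unbounded".
struct Multiplicity {
    unsigned min_occurs = 1;
    std::optional<unsigned> max_occurs = 1;
    bool greedy = true;
};

// Reads an unsigned decimal number at s[pos]; advances pos only on success.
static std::optional<unsigned> read_uint(const std::string& s, std::size_t& pos) {
    std::size_t p = pos;
    unsigned value = 0;
    while (p < s.size() && std::isdigit(static_cast<unsigned char>(s[p])))
        value = value * 10 + static_cast<unsigned>(s[p++] - '0');
    if (p == pos) return std::nullopt;
    pos = p;
    return value;
}

// Parses an optional quantifier at s[pos]. If there is no explicit quantifier,
// the implicit {1,1} is returned and pos is left untouched.
Multiplicity parse_quantifier(const std::string& s, std::size_t& pos) {
    Multiplicity m;
    if (pos >= s.size()) return m;
    std::size_t p = pos;
    bool may_be_lazy = true;
    switch (s[p]) {
    case '?': m = {0, 1, true}; ++p; may_be_lazy = false; break;
    case '+': m = {1, std::nullopt, true}; ++p; break;
    case '*': m = {0, std::nullopt, true}; ++p; break;
    case '{': {
        ++p;
        std::optional<unsigned> lo = read_uint(s, p);
        const bool comma = p < s.size() && s[p] == ',';
        if (comma) ++p;
        std::optional<unsigned> hi = comma ? read_uint(s, p) : lo;
        // "{}", "{,}" and unterminated braces are not quantifiers here.
        if (p >= s.size() || s[p] != '}' || (!lo && !hi)) return m;
        ++p;
        m = {lo.value_or(0), hi, true};
        may_be_lazy = comma;  // "{n}" is a bounded range, so no lazy suffix
        break;
    }
    default: return m;
    }
    if (may_be_lazy && p < s.size() && s[p] == '?') { m.greedy = false; ++p; }
    pos = p;
    return m;
}
```

For example, calling parse_quantifier on "{,9}?" at position 0 gives min 0, max 9, non-greedy, and leaves the position at 5.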
Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:
int main()
{
std::cout << "digraph common {\n";
for (std::string pattern: {
"abc?",
"ab+c",
"(ab)+c",
"[^-a\\-f-z\"\\]aaaa-]?",
"abc|d",
"a?",
".*?(a|b){,9}?",
"(XYZ)|(123)",
})
{
std::cout << "// ================= " << pattern << " ========\n";
ast::regex tree;
if (doParse(pattern, tree))
{
check_roundtrip(tree, pattern);
regex_todigraph printer(std::cout, pattern);
boost::apply_visitor(printer, tree);
}
}
std::cout << "}\n";
}
This program results in the following graphs:
The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:
I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.
Full code at this gist: https://gist.github.com/sehe/8678988
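For readers who only want the general idea of the DOT output, here is a minimal sketch. It does not reproduce regex_todigraph from the gist: Node, emit_dot and to_dot are hypothetical names, the tree is a plain label-plus-children structure rather than the ast:: variants, and labels are written without escaping, so they must not contain double quotes.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tree node (not the ast:: types above): a label plus children.
struct Node {
    std::string label;
    std::vector<Node> children;
};

// Writes one DOT node per tree node and one edge per parent/child pair.
// Returns the numeric id that was assigned to `node`.
int emit_dot(const Node& node, std::ostream& os, int& next_id) {
    const int id = next_id++;
    os << "  n" << id << " [label=\"" << node.label << "\"];\n";
    for (const Node& child : node.children) {
        const int child_id = emit_dot(child, os, next_id);
        os << "  n" << id << " -> n" << child_id << ";\n";
    }
    return id;
}

// Wraps the whole tree in a digraph block.
std::string to_dot(const Node& root) {
    std::ostringstream os;
    int next_id = 0;
    os << "digraph regex {\n";
    emit_dot(root, os, next_id);
    os << "}\n";
    return os.str();
}
```

Feeding the result to Graphviz (for instance with dot -Tpng) should render the tree. Repeats and greediness would need extra edge attributes, which this sketch leaves out.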

I reckon that Boost Xpressive must be able to 'almost' do this out of the box.
xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. Regular expressions can refer to each other and to themselves recursively, allowing you to build arbitrarily complicated grammars out of them.
I'll see whether I can confirm (with a small sample).
Other thoughts include using Boost Spirit with the generic utree facility to 'store' the AST. You'd have to reproduce a grammar (which is relatively simple for common subsets of Regex syntax), so it might mean more work.
Progress Report 1
Looking at Xpressive, I made some inroads. I got some pretty pictures using DDD's great graphical data display. But not pretty enough.
Then I explored the 'code' side more: Xpressive is built upon Boost Proto. It uses Proto to define a DSEL that models regular expressions directly in C++ code.
Proto generates the expression tree (generic AST, if you will) completely generically from C++ code (by overloading all possible operators). The library (Xpressive, in this case) then needs to define the semantics by walking the tree and e.g.
building a domain specific expression tree
annotating/decorating it with semantic information
possibly taking semantic action directly (e.g. how Boost Spirit does semantic actions in Qi and Karma, see footnote 1)
As you can see, the sky is really the limit there, and things are looking disturbingly similar to compiler macros like in Boo, Nemerle, Lisp etc.
Visualizing Expression Trees
Now, Boost Proto expression trees can be generically visualized:
Working from the example from Expressive C++: Playing with Syntax I slightly extended Xpressive's "Hello World" example to display the expression tree:
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
#include <boost/proto/proto.hpp>
using namespace boost::xpressive;
int main()
{
std::string hello( "hello world!" );
sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
// equivalent proto based expression
rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';
boost::proto::display_expr( (s1= +_w) >> ' ' >> (s2= +_w) >> '!');
smatch what;
if( regex_match( hello, what, rex ) )
{
std::cout << what[0] << '\n'; // whole match
std::cout << what[1] << '\n'; // first capture
std::cout << what[2] << '\n'; // second capture
}
return 0;
}
The output of which is close to (note the compiler ABI specific typeid names):
shift_right(
shift_right(
shift_right(
assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
, terminal( )
)
, assign(
terminal(N5boost9xpressive6detail16mark_placeholderE)
, unary_plus(
terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
)
)
)
, terminal(!)
)
hello world!
hello
world
DISCLAIMER You should realize that this is not actually displaying the Regex AST, but rather the generic expression tree from Proto, so it is devoid of domain specific (Regex) information. I mention it because the difference is likely going to cause some more work (? unless I find a hook into Xpressive's compilation structures) for it to become truly useful for the original question.
That's it for now
I'll leave on that note, as it's lunch time and I'm picking up the kids, but this certainly grabbed my interest, so I intend to post more later!
Conclusions / Progress Report 1.0000001
The bad news right away: it won't work.
Here's why. That disclaimer was right on the money. By the time the weekend arrived, I had already been thinking things through a bit more and 'predicted' that the whole thing would break down right where I left it off: the AST is based on the proto expression tree (not the regex matchable_ex).
This fact was quickly confirmed after some code inspection: after compilation, the proto expression tree isn't available anymore to be displayed. Let alone when the basic_regex was specified as a dynamic pattern in the first place (there never was a proto expression for it).
I had been half hoping that the matching had been implemented directly on the proto expression tree (using proto evaluation contexts), but quickly found out that this is not the case.
So, the main takeaway is:
this is all not going to work for displaying any regex AST
the best you can do with the above is visualize a proto expression, that you'll necessarily have to create directly in your code. That is a fancy way of just writing the AST manually in that same code...
Slightly less strict observations include
Boost Proto and Boost Xpressive are highly interesting libraries (I didn't mind going fishing in there). I have obviously learned a few important lessons about template meta-programming libraries, and these libraries in particular.
It is hard to design a regex parser that builds a statically typed expression tree. In fact it is impossible in the general case: it would require the compiler to instantiate all possible expression tree combinations to a certain depth. This would obviously not scale. You could get around that by introducing polymorphic composition and using polymorphic invocation, but this would remove the benefits of template metaprogramming (compile-time optimization for the statically instantiated types/specializations).
Both Boost Regex and Boost Xpressive will likely support some kind of regex AST internally (to support the matching evaluation) but
it hasn't been exposed/documented
there is no obvious display facility for that
1 Even Spirit Lex supports them, for that matter (but not by default)

boost::regex seems to have a hand-written recursive-descent parser in basic_regex_parser.hpp. Even though it feels awfully like re-inventing the wheel, you are probably faster when writing up the grammar in boost::spirit yourself, especially with the multitude of regex formats around.
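To give an idea of how small a hand-written recursive-descent parser can be, here is a sketch for a tiny subset only: literal characters, concatenation, '|' and parentheses. It is not taken from boost::regex. Ast, RegexParser and to_string are names invented for this example, and quantifiers, character sets and escapes are deliberately left out. It produces the binary tree shape asked for in the question, printed in prefix form.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct Ast {
    char op;                        // '|', '.', or 0 for a literal leaf
    char literal;                   // meaningful only when op == 0
    std::unique_ptr<Ast> lhs, rhs;  // children of an operator node
};
using AstPtr = std::unique_ptr<Ast>;

class RegexParser {
public:
    explicit RegexParser(std::string pattern) : src_(std::move(pattern)) {}

    AstPtr parse() {
        AstPtr tree = alternation();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected character");
        return tree;
    }

private:
    static AstPtr make(char op, char lit, AstPtr l, AstPtr r) {
        return AstPtr(new Ast{op, lit, std::move(l), std::move(r)});
    }
    // alternation := sequence ('|' sequence)*
    AstPtr alternation() {
        AstPtr left = sequence();
        while (pos_ < src_.size() && src_[pos_] == '|') {
            ++pos_;
            AstPtr right = sequence();
            left = make('|', 0, std::move(left), std::move(right));
        }
        return left;
    }
    // sequence := atom+   (concatenation, left-associative)
    AstPtr sequence() {
        AstPtr left = atom();
        while (pos_ < src_.size() && src_[pos_] != '|' && src_[pos_] != ')') {
            AstPtr right = atom();
            left = make('.', 0, std::move(left), std::move(right));
        }
        return left;
    }
    // atom := '(' alternation ')' | literal
    AstPtr atom() {
        if (pos_ >= src_.size()) throw std::runtime_error("unexpected end of pattern");
        const char c = src_[pos_];
        if (c == '|' || c == ')') throw std::runtime_error("operand expected");
        ++pos_;
        if (c != '(') return make(0, c, nullptr, nullptr);
        AstPtr inner = alternation();
        if (pos_ >= src_.size() || src_[pos_] != ')') throw std::runtime_error("missing ')'");
        ++pos_;
        return inner;
    }

    std::string src_;
    std::size_t pos_ = 0;
};

// Prefix rendering: leaves print as themselves, operators as "(op lhs rhs)".
std::string to_string(const Ast& n) {
    if (n.op == 0) return std::string(1, n.literal);
    return std::string("(") + n.op + ' ' + to_string(*n.lhs) + ' ' + to_string(*n.rhs) + ')';
}
```

With this sketch, to_string(*RegexParser("(XYZ)|(123)").parse()) yields "(| (. (. X Y) Z) (. (. 1 2) 3))". Each grammar rule maps to one member function, which is what makes this style easy to extend by hand.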

Related

How to combine skipping and non-skipping (lexeme) rules?

my parser is nearly working :)
(still amazed by Spirit feature set (and compiletimes) and the very welcoming community here on stack overflow)
small sample for online try:
http://coliru.stacked-crooked.com/a/1c1bf88909dce7e3
so i've learned to use more lexeme-rules and to avoid no_skip -
my rules are smaller and easier to read as a result, but now i'm stuck with
combining lexeme-rules and skipping-rules, which seems not to be possible (compile-time error with a warning about not being castable to Skipper)
my problem is the comma-separated list in subscriptions,
which does not skip spaces around expressions
parses:
"a.b[a,b]"
fails:
"a.b[ a , b ]"
these are my rules:
qi::rule<std::string::const_iterator, std::string()> identifier_chain;
qi::rule<std::string::const_iterator, std::string()>
expression_list = identifier_chain >> *(qi::char_(',') >> identifier_chain);
qi::rule < std::string::const_iterator, std::string() >
subscription = qi::char_('[') >> expression_list >> qi::char_(']');
qi::rule<std::string::const_iterator, std::string()>
identifier = qi::ascii::alpha >> *(qi::ascii::alnum | '_');
identifier_chain = identifier >> *(('.' >> identifier) | subscription);
as you can see all rules are "lexeme" and i think the subscription rule should have an ascii::space_type skipper, but that does not compile
should i add space eaters in the front and back of identifier_chains in the expression_list?
feels like writing a regex :(
expression_list = *qi::blank >> identifier_chain >> *(*qi::blank >> qi::char_(',') >> *qi::blank >> identifier_chain >> *qi::blank);
it works, but i've read that this will get me a much bigger parser in the end (handling all the space skipping by myself)
thx for any advice
btw: any idea why i can't compile if i replace the '.' in the identifier_chain with qi::char_('.')?
identifier_chain = identifier >> *(('.' >> identifier) | subscription);
UPDATE:
i've updated my expression list as suggested by sehe
qi::rule<std::string::const_iterator, spirit::ascii::blank_type, std::string()>
expression_list = identifier_chain >> *(qi::char_(',') >> identifier_chain);
qi::rule < std::string::const_iterator, std::string() >
subscription = qi::char_('[') >> qi::skip(qi::blank)[expression_list] >> qi::char_(']');
but still get compile error due to non castable Skipper: http://coliru.stacked-crooked.com/a/adcf665742b055dd
i also tried changing the identifier_chain to
identifier_chain = identifier >> *(('.' >> identifier) | qi::skip(qi::blank)[subscription]);
but i still can't compile the example
The answer I linked to earlier describes all the combinations (if I remember correctly): Boost spirit skipper issues
In short:
any rule that declares a skipper (so rule<It, Skipper[, Attr()]> or rule<It, Attr(), Skipper>) MUST be invoked with a compatible skipper (an expression that can be assigned to the type of Skipper).
any rule that does NOT declare a skipper (so of the form rule<It[, Attr()]>) will implicitly behave like a lexeme, meaning no input characters are skipped.
That's it. The slightly subtler ramifications are that given two rules:
rule<It, blank_type> a;
rule<It> b; // b is implicitly lexeme
You can invoke b from a:
a = "test" >> b;
But when you wish to invoke a from b you will find that you have to provide the skipper:
b = "oops" >> a; // DOES NOT COMPILE
b = "okay" >> qi::skip(qi::blank) [ a ];
That's almost all there is to it. There are a few more directives around skippers and lexemes in Qi, see again the answer linked above.
Side Question:
should i add space eaters in the front and back of identifier_chains in the expression_list?
If you look closely at the answer example here Parse a '.' chained identifier list, with qi::lexeme and prevent space skipping, you can see that it already does pre- and post skipping correctly, because I used phrase_parse:
" a.b " OK: ( "a" "b" )
----
"a . b" Failed
Remaining unparsed: "a . b"
----
You COULD also wrap the whole thing in an "outer" rule:
rule<std::string::const_iterator> main_rule =
qi::skip(qi::blank) [ identifier_chain ];
That's just the same but allows users to call parse without specifying the skipper.

Starting with Spirit X3

I've just started using Spirit X3 and I have a small question related to my first test. Do you know why this function is returning "false"?
bool parse()
{
std::string rc = "a 6 literal 8";
auto iter_begin = rc.begin();
auto iter_end = rc.end();
bool bOK= phrase_parse( iter_begin, iter_end,
// ----- start parser -----
alpha >> *alnum >> "literal" >> *alnum
// ----- end parser -----
, space);
return bOK && iter_begin == iter_end;
}
I've seen that the problem is related to how I write the grammar. If I replace it with this one, it returns "true":
alpha >> -alnum >> "literal" >> *alnum
I'm using the Spirit version included in Boost 1.61.0.
Thanks in advance,
Sen
Your problem is a combination of the greediness of operator * and the use of a skipper. You need to keep in mind that alnum is a PrimitiveParser and that means that before every time this parser is tried, Spirit will pre-skip, and so the behaviour of your parser is:
alpha parses a.
The kleene operator starts.
alnum skips the space and then parses 6.
alnum skips the space and then parses l.
alnum parses i.
...
alnum parses l.
alnum skips the space and then parses 8.
alnum tries and fails to parse more. This completes the kleene operator with a parsed attribute of 6literal8.
"literal" tries and fails to parse.
The sequence operator fails and the invocation of phrase_parse returns false.
You can easily avoid this problem using the lexeme directive (barebones x3 docs, qi docs). Something like this should work:
alpha >> lexeme[*alnum] >> "literal" >> lexeme[*alnum];
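The difference can be reproduced without Spirit at all. The following is only a simulation of the behaviour described in the steps above, written with the standard library: kleene_alnum is a made-up helper, not an X3 API. With pre_skip_each set to true it skips whitespace before every character (like *alnum under a space skipper); with false it skips once up front and then matches contiguously (the effect of lexeme[*alnum]).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Simulates `*alnum` starting at input[pos].
//  pre_skip_each == true : whitespace is skipped before every character.
//  pre_skip_each == false: whitespace is skipped once, then matching is contiguous.
std::string kleene_alnum(const std::string& input, std::size_t& pos, bool pre_skip_each) {
    auto skip_spaces = [&] {
        while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos])))
            ++pos;
    };
    std::string matched;
    if (!pre_skip_each) skip_spaces();
    for (;;) {
        if (pre_skip_each) skip_spaces();
        if (pos < input.size() && std::isalnum(static_cast<unsigned char>(input[pos])))
            matched += input[pos++];
        else
            break;
    }
    return matched;
}
```

Starting at position 1 of "a 6 literal 8" (just after the leading a), the skipping variant returns "6literal8" and reaches the end of the input, so nothing is left for "literal" to match. The contiguous variant returns "6" and stops at position 3, right before " literal".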

how to split a C++ string to get whole string individually and some parts/characters of it

My question is, how can I split a string in C++? For example, I have
string str = "[ (a*b) + {(c-d)/f} ]";
I need to get every element of the expression individually, like [,(,a,*,b,......
And I also want to get only the brackets, like [,(,),{,(,),},], each at its proper position.
Is there an easy way to do both?
This is called lexical analysis (getting tokens from some sequence or stream of characters) and should be followed by parsing. Read e.g. the first half of the Dragon Book.
Maybe LL parsing is enough for you....
There are many tools for that, see this question (I would suggest ANTLR). You probably should build some abstract syntax tree at some point.
But it might not be worth the effort. Did you consider embedding some scripting language in your application, e.g. Lua (see this and this...), or GNU Guile, Python, etc...
Here is the way I ended up doing this:
string expression = "[ (a*b) + {(c-d)/f} ]" ;
string token ;
// appending an extra character that i'm sure will never occur in my expression
// and will be used for splitting here
expression.append("~") ;
istringstream iss(expression);
getline(iss, token, '~');
for(size_t i = 0 ; i < token.length() ; i++ ) { // size_t avoids a signed/unsigned comparison
if(token[i] != ' ' ) {
cout<<token[i] << ",";
}
}
Output will be: [,(,a,*,b,),+,{,(,c,-,d,),/,f,},],
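The snippet above covers the first half of the question. A sketch that covers both halves (all single-character tokens, and the brackets together with their positions) could look like this. The function names tokenize and brackets are made up for this example, and it assumes that every token is exactly one character, as in the sample expression.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Every non-space character becomes one token.
std::vector<char> tokenize(const std::string& expr) {
    std::vector<char> tokens;
    for (char c : expr)
        if (!std::isspace(static_cast<unsigned char>(c))) tokens.push_back(c);
    return tokens;
}

// Brackets only, each paired with its index in the original string.
std::vector<std::pair<std::size_t, char>> brackets(const std::string& expr) {
    static const std::string kinds = "[](){}";
    std::vector<std::pair<std::size_t, char>> found;
    for (std::size_t i = 0; i < expr.size(); ++i)
        if (kinds.find(expr[i]) != std::string::npos) found.emplace_back(i, expr[i]);
    return found;
}
```

For "[ (a*b) + {(c-d)/f} ]" the first function returns 17 tokens and the second returns the 8 brackets, starting with '[' at index 0 and ending with ']' at index 20. Multi-character names or numbers would need a real lexer, as the other answer points out.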

boost spirit expression grammar issue while creating AST

I'm trying to parse an expression grammar (with variables) into an Abstract Syntax Tree (AST) so that later on I can make use of this AST and calculate values based on those expressions (they may be part of a function, for example, so I need to store the expressions rather than calculate a value right away).
To my surprise, after dealing with loops and instructions (which require nested structures in the AST as well), I got nothing but seg faults when trying to parse any expression. After hours of struggling with this I decided to ask here, because I have no idea what the cause is (maybe something in the grammar).
This statement-loop part works perfectly well. The struct 'loop' takes only the number of repetitions as a parameter, as a string so far (later on I want to put an expression here):
statement %= loop | inst;
inst %= lexeme[+(char_ - (';'|char_('}')) )] >> ';';
loop = "do(" > lexeme[+(char_ - "){")] // parse a number of loop repetitions
> "){"
> *statement > "}"
;
Structures are like:
typedef boost::variant<
boost::recursive_wrapper<s_loop>
, std::string>
s_statement;
struct s_loop
{
std::string name; // tag name
//s_expression exp; // TODO
std::vector<s_statement> children; // children
};
I use a recursive wrapper, so I thought that maybe the "deep" wrapping in the expression-term-factor case is the reason it doesn't work. For loop-statement it goes simply like:
loop --(contains)--> statement (statement may be a loop!)
And in case of expressions it should be finally implemented like:
expression -> term -> factor (factor may be an expression!)
So, to be sure it's because of 'deep' wrapping, I tried with trivial grammar:
expression -> factor (factor may be an expression)
AST structures are copy-paste of above, everything is quite similar and.... it does not work! :(
I am quite sure that it must be something wrong with my grammar.. To be honest, I am not an expert of spirit. Here's the grammar:
expression = factor > * ( (char_('+')|char_('-')) > factor ) ;
factor %= uint_ | my_var | my_dat | my_rec_exp;
// factor %= uint_ | my_var | my_dat; //this WORKS! I've made procedures to traverse an AST
// strings and ints are parsed and stored well inside the expression structure
// factor %= uint_ | my_rec_exp; // even this simple version (of course I adjust the structure s_expression) doesn't work.. WHY? :( , it's even less complex than loop-statement
my_rec_exp = '(' > expression > ')';
my_var %= char_('!') >> lexeme[+ ( char_ - ( ('+')|char_('-')|char_('*')|char_('/')|char_('(')|char_(')') ) ) ] ;
my_dat %= char_('#') >> lexeme[+ ( char_ - ( ('+')|char_('-')|char_('*')|char_('/')|char_('(')|char_(')') ) ) ] ;
Structures are here:
struct s_expression;
typedef boost::variant<
boost::recursive_wrapper<s_expression>,
// s_expression,
std::string,
unsigned int
>
s_factor;
struct s_term{ // WE DO NOT USE THIS IN THE SIMPLIFIED VERSION
s_factor factor0;
std::vector<std::pair<char, s_factor> >
factors;
};
struct s_expression{
s_factor term0;
std::vector<std::pair<char, s_factor> >
terms;
};
I will say one more time that without the recursive expression it works well (it parses to an expression containing a set of numbers / strings connected with the operators + / -). But if I add expression as a variant of factor, it crashes at run time.
Thank you for any advice / suggestion !
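One thing that may be worth checking (this is an assumption, since the complete program isn't shown): boost::variant default-constructs its first bounded type, and a default-constructed boost::recursive_wrapper<T> allocates a T. s_factor lists recursive_wrapper<s_expression> first and s_expression holds an s_factor member, so creating a default s_expression would create an s_factor, which creates an s_expression, and so on without end. The loop case does not have this problem, because s_loop only holds a string and a vector. If that is the cause, listing a non-recursive type first should help. The sketch below shows that ordering, using std::variant and std::unique_ptr purely as stand-ins for the Boost types; Factor, Expression and evaluate are illustrative names.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct Expression;

// Non-recursive alternatives come first, so a default-constructed Factor is
// simply unsigned{0} and never has to build a nested Expression.
using Factor = std::variant<unsigned, std::string, std::unique_ptr<Expression>>;

struct Expression {
    Factor first;
    std::vector<std::pair<char, Factor>> rest;  // ('+' or '-', operand)
};

long evaluate(const Expression& e);

// Numbers evaluate to themselves, sub-expressions recurse; variables
// (strings) count as 0 in this sketch.
long evaluate(const Factor& f) {
    if (const unsigned* n = std::get_if<unsigned>(&f)) return static_cast<long>(*n);
    if (const auto* sub = std::get_if<std::unique_ptr<Expression>>(&f)) return evaluate(**sub);
    return 0;
}

long evaluate(const Expression& e) {
    long total = evaluate(e.first);
    for (const auto& term : e.rest)
        total += (term.first == '-' ? -1 : 1) * evaluate(term.second);
    return total;
}
```

In the Boost version the equivalent change would be to move unsigned int (or std::string) to the front of the s_factor type list and keep recursive_wrapper<s_expression> last.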

Need to parse a string, having a mask (something like this "%yr-%mh-%dy"), so i get the int values

For example, I have to find a date in the format mentioned in the title (the order of the %-tags can differ) in a string such as "The date is 2009-August-25." How can I make the program interpret the tags, and what construct is best for storing them along with information about how to handle each piece of the date string?
First look into the boost::date_time library. It has an IO system which may be what you want, but as far as I can see it lacks searching.
To do custom date searching you can use boost::xpressive. It contains everything you will need. Let's look at my hastily written example. First you should parse your custom pattern, which is easy with Xpressive. Start with the headers you need:
#include <string>
#include <iostream>
#include <map>
#include <boost/xpressive/xpressive_static.hpp>
#include <boost/xpressive/regex_actions.hpp>
//make example shorter but less clear
using namespace boost::xpressive;
Second define map of your special tags:
std::map<std::string, int > number_map;
number_map["%yr"] = 0;
number_map["%mh"] = 1;
number_map["%dy"] = 2;
number_map["%%"] = 3; // escape a %
The next step is to create a regex which will parse our pattern with tags and save the value from the map into the variable tag_id when it finds a tag, or save -1 otherwise:
int tag_id;
sregex rx=((a1=number_map)|(s1=+~as_xpr('%')))[ref(tag_id)=(a1|-1)];
For more information and a description, look here and here.
Now let's parse some pattern:
std::string pattern("%yr-%mh-%dy"); // this will be parsed
sregex_token_iterator begin( pattern.begin(), pattern.end(), rx ), end;
if(begin == end) throw std::runtime_error("The pattern is empty!");
The sregex_token_iterator will iterate over our tokens, and each time it will set the tag_id variable. All we have to do is build a regex from these tokens. We will construct it from the parts of a static regex, stored in an array, that correspond to each tag:
sregex regex_group[] = {
range('1','9') >> repeat<3,3>( _d ), // 4 digit year
as_xpr( "January" ) | "February" | "August", // not all months listed, for brevity
repeat<2,2>( range('0','9') )[ // two digit day
check(as<int>(_) >= 1 && as<int>(_) <= 31) ], // only between 1 and 31
as_xpr( '%' ) // match escaped %
};
Finally, let's start building our special regex. The first match will construct the first part of it. If a tag is matched and tag_id is non-negative, we choose the regex from the array; otherwise the match is probably a delimiter and we construct a regex which matches it:
sregex custom_regex = (tag_id>=0) ? regex_group[tag_id] : as_xpr(begin->str());
Next we will iterate from begin to end and append next regex:
while(++begin != end)
{
if(tag_id>=0)
{
sregex nextregex = custom_regex >> regex_group[tag_id];
custom_regex = nextregex;
}
else
{
sregex nextregex = custom_regex >> as_xpr(begin->str());
custom_regex = nextregex;
}
}
Now our regex is ready, let's find some dates :-]
std::string input = "The date is 2009-August-25.";
smatch mydate;
if( regex_search( input, mydate, custom_regex ) )
std::cout << "Found " << mydate.str() << "." << std::endl;
The xpressive library is very powerful and fast. It also makes beautiful use of patterns.
If you like this example, let me know in comment or points ;-)
I'd transform the tagged string into a regular expression with a capture for each of the 3 fields and search for it. The complexity of the regular expression will depend on what you want to accept for %yr. You can also use a less strict expression and then check for valid values; this can lead to better error messages ("Invalid month: Augsut" instead of "date not found") or to false positives, depending on the context.
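A minimal sketch of that idea with std::regex follows. The function name find_date and the sub-patterns chosen for each tag are assumptions made for illustration (a 4-digit year, a capitalised word for the month, a 2-digit day); they are deliberately loose, so value checks such as valid month names would have to happen afterwards. The "%%" escape is not handled.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Translates a mask such as "%yr-%mh-%dy" into a regex with one capture group
// per tag, searches `text`, and returns the captured text keyed by tag name.
// An empty map means that no date was found.
std::map<std::string, std::string> find_date(const std::string& mask, const std::string& text) {
    static const std::map<std::string, std::string> tag_patterns = {
        {"%yr", "([1-9][0-9]{3})"},
        {"%mh", "([A-Z][a-z]+)"},
        {"%dy", "([0-9]{2})"},
    };
    static const std::string special = R"x(\^$.|?*+()[]{})x";

    std::string pattern;
    std::vector<std::string> order;  // order[i] is the tag of capture group i + 1
    for (std::size_t i = 0; i < mask.size();) {
        const std::string tag = mask.substr(i, 3);
        const auto it = tag_patterns.find(tag);
        if (it != tag_patterns.end()) {
            pattern += it->second;
            order.push_back(tag);
            i += 3;
        } else {
            // Literal delimiter: escape it if it is a regex metacharacter.
            if (special.find(mask[i]) != std::string::npos) pattern += '\\';
            pattern += mask[i++];
        }
    }

    std::map<std::string, std::string> fields;
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
        for (std::size_t g = 0; g < order.size(); ++g)
            fields[order[g]] = m[g + 1].str();
    return fields;
}
```

Calling find_date("%yr-%mh-%dy", "The date is 2009-August-25.") returns the three fields "2009", "August" and "25". Because the group order is recorded while the mask is translated, the tags can appear in any order.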