boost::spirit::qi Expectation Parser and parser grouping unexpected behaviour - c++

I'm hoping someone can shine a light through my ignorance of using the > and >> operators in spirit parsing.
I have a working grammar, where the top-level rule looks like
test = identifier >> operationRule >> repeat(1,3)[any_string] >> arrow >> any_string >> conditionRule;
It relies on attributes to automatically allocated parsed values to a fusion-adapted struct (ie a boost tuple).
However, I know that once we match the operationRule, we must continue or fail (ie we don't want to allow backtracking to try other rules that begin with identifier).
test = identifier >>
operationRule > repeat(1,3)[any_string] > arrow > any_string > conditionRule;
This causes a cryptic compiler error ('boost::Container' : use of class template requires template argument list). Futz around a bit and the following compiles:
test = identifier >>
(operationRule > repeat(1,3)[any_string] > arrow > any_string > conditionRule);
but the attribute setting no longer works - my data structure contains garbage after parsing. This can be fixed by adding actions like [at_c<0>(_val) = _1], but that seems a little clunky - as well as making things slower according to the boost docs.
So, my questions are
Is it worth preventing back-tracking?
Why do I need the grouping operator ()
Does my last example above really stop back-tracking after operationRule is matched (I suspect not, it seems that if the entire parser inside the (...) fails backtracking will be allowed)?
If the answer to the previous question is /no/, how do I construct a rule that allows backtracking if operation is /not/ matched, but does not allow backtracking once operation /is/ matched?
Why does the grouping operator destroy the attribute grammar - requiring actions?
I realise this is quite a broad question - any hints that point in the right direction will be greatly appreciated!

Is it worth preventing back-tracking?
Absolutely. Preventing back tracking in general is a proven way to improve parser performance.
reduce the use of (negative) lookahead (operator !, operator - and some operator &)
order branches (operator |, operator ||, operator^ and some operator */-/+) such that the most frequent/likely branch is ordered first, or that the most costly branch is tried last
Using expectation points (>) does not essentially reduce backtracking: it will just disallow it. This will enable targeted error messages, prevent useless 'parsing-into-the-unknown'.
Why do I need the grouping operator ()
I'm not sure. I had a check using my what_is_the_attr helpers from here
ident >> op >> repeat(1,3)[any] >> "->" >> any
synthesizes attribute:
fusion::vector4<string, string, vector<string>, string>
ident >> op > repeat(1,3)[any] > "->" > any
synthesizes attribute:
fusion::vector3<fusion::vector2<string, string>, vector<string>, string>
I haven't found the need to group subexpressions using parentheses (things compile), but obviously DataT needs to be modified to match the changed layout.
typedef boost::tuple<
boost::tuple<std::string, std::string>,
std::vector<std::string>,
std::string
> DataT;
The full code below shows how I'd prefer to do that, using adapted structs.
Does my above example really stop back-tracking after operationRule is matched (I suspect not, it seems that if the entire parser inside the (...) fails backtracking will be allowed)?
Absolutely. If the expectation(s) is not met, a qi::expectation_failure<> exception is thrown. This by default aborts the parse. You could use qi::on_error to retry, fail, accept or rethrow. The MiniXML example has very good examples on using expectation points with qi::on_error
If the answer to the previous question is /no/, how do I construct a rule that allows backtracking if operation is /not/ matched, but does not allow backtracking once operation /is/ matched?
Why does the grouping operator destroy the attribute grammar - requiring actions?
It doesn't destroy the attribute grammar, it just changes the exposed type. So, if you bind an appropriate attribute reference to the rule/grammar, you won't need semantic actions. Now, I feel there should be ways to go without the grouping , so let me try it (preferrably on your short selfcontained sample). And indeed I have found no such need. I've added a full example to help you see what is happening in my testing, and not using semantic actions.
Full Code
The full code show 5 scenarios:
OPTION 1: Original without expectations
(no relevant changes)
OPTION 2: with expectations
Using the modified typedef for DataT (as shown above)
OPTION 3: adapted struct, without expectations
Using a userdefined struct with BOOST_FUSION_ADAPT_STRUCT
OPTION 4: adapted struct, with expectations
Modifying the adapted struct from OPTION 3
OPTION 5: lookahead hack
This one leverages a 'clever' (?) hack, by making all >> into expectations, and detecting the presence of a operationRule-match beforehand. This is of course suboptimal, but allows you to keep DataT unmodified, and without using semantic actions.
Obviously, define OPTION to the desired value before compiling.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/adapted.hpp>
#include <iostream>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
#ifndef OPTION
#define OPTION 5
#endif
#if OPTION == 1 || OPTION == 5 // original without expectations (OR lookahead hack)
typedef boost::tuple<std::string, std::string, std::vector<std::string>, std::string> DataT;
#elif OPTION == 2 // with expectations
typedef boost::tuple<boost::tuple<std::string, std::string>, std::vector<std::string>, std::string> DataT;
#elif OPTION == 3 // adapted struct, without expectations
struct DataT
{
std::string identifier, operation;
std::vector<std::string> values;
std::string destination;
};
BOOST_FUSION_ADAPT_STRUCT(DataT, (std::string, identifier)(std::string, operation)(std::vector<std::string>, values)(std::string, destination));
#elif OPTION == 4 // adapted struct, with expectations
struct IdOpT
{
std::string identifier, operation;
};
struct DataT
{
IdOpT idop;
std::vector<std::string> values;
std::string destination;
};
BOOST_FUSION_ADAPT_STRUCT(IdOpT, (std::string, identifier)(std::string, operation));
BOOST_FUSION_ADAPT_STRUCT(DataT, (IdOpT, idop)(std::vector<std::string>, values)(std::string, destination));
#endif
template <typename Iterator>
struct test_parser : qi::grammar<Iterator, DataT(), qi::space_type, qi::locals<char> >
{
test_parser() : test_parser::base_type(test, "test")
{
using namespace qi;
quoted_string =
omit [ char_("'\"") [_a =_1] ]
>> no_skip [ *(char_ - char_(_a)) ]
> lit(_a);
any_string = quoted_string | +qi::alnum;
identifier = lexeme [ alnum >> *graph ];
operationRule = string("add") | "sub";
arrow = "->";
#if OPTION == 1 || OPTION == 3 // without expectations
test = identifier >> operationRule >> repeat(1,3)[any_string] >> arrow >> any_string;
#elif OPTION == 2 || OPTION == 4 // with expectations
test = identifier >> operationRule > repeat(1,3)[any_string] > arrow > any_string;
#elif OPTION == 5 // lookahead hack
test = &(identifier >> operationRule) > identifier > operationRule > repeat(1,3)[any_string] > arrow > any_string;
#endif
}
qi::rule<Iterator, qi::space_type/*, qi::locals<char> */> arrow;
qi::rule<Iterator, std::string(), qi::space_type/*, qi::locals<char> */> operationRule;
qi::rule<Iterator, std::string(), qi::space_type/*, qi::locals<char> */> identifier;
qi::rule<Iterator, std::string(), qi::space_type, qi::locals<char> > quoted_string, any_string;
qi::rule<Iterator, DataT(), qi::space_type, qi::locals<char> > test;
};
int main()
{
std::string str("addx001 add 'str1' \"str2\" -> \"str3\"");
test_parser<std::string::const_iterator> grammar;
std::string::const_iterator iter = str.begin();
std::string::const_iterator end = str.end();
DataT data;
bool r = phrase_parse(iter, end, grammar, qi::space, data);
if (r)
{
using namespace karma;
std::cout << "OPTION " << OPTION << ": " << str << " --> ";
#if OPTION == 1 || OPTION == 3 || OPTION == 5 // without expectations (OR lookahead hack)
std::cout << format(delimit[auto_ << auto_ << '[' << auto_ << ']' << " --> " << auto_], data) << "\n";
#elif OPTION == 2 || OPTION == 4 // with expectations
std::cout << format(delimit[auto_ << '[' << auto_ << ']' << " --> " << auto_], data) << "\n";
#endif
}
if (iter!=end)
std::cout << "Remaining: " << std::string(iter,end) << "\n";
}
Output for all OPTIONS:
for a in 1 2 3 4 5; do g++ -DOPTION=$a -I ~/custom/boost/ test.cpp -o test$a && ./test$a; done
OPTION 1: addx001 add 'str1' "str2" -> "str3" --> addx001 add [ str1 str2 ] --> str3
OPTION 2: addx001 add 'str1' "str2" -> "str3" --> addx001 add [ str1 str2 ] --> str3
OPTION 3: addx001 add 'str1' "str2" -> "str3" --> addx001 add [ str1 str2 ] --> str3
OPTION 4: addx001 add 'str1' "str2" -> "str3" --> addx001 add [ str1 str2 ] --> str3
OPTION 5: addx001 add 'str1' "str2" -> "str3" --> addx001 add [ str1 str2 ] --> str3

Related

Automatically building a parser from a generator and vice versa [duplicate]

I have a Qi grammar definition that I use to parse an input. Later I have a Karma generator to output in a way that should be similar to the input.
Is this possible at all? It seem that a parser grammar can be transformed into a generator grammar automatically (??).
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <iostream>
int main(){
//test input
std::string s = "Xx 1.233 pseudo";
//input variables
std::string element;
double mass;
std::string pseudo;
auto GRAMMAR =
boost::spirit::qi::lexeme[+(boost::spirit::qi::char_ - ' ' - '\n')]
>> boost::spirit::qi::double_
>> boost::spirit::qi::lexeme[+(boost::spirit::qi::char_ - ' ' - '\n')];
bool r = boost::spirit::qi::phrase_parse(
s.begin(), s.end(),
GRAMMAR,
boost::spirit::qi::space, element, mass, pseudo
);
std::cout << boost::spirit::karma::format(
GRAMMAR ??? is it possible?
,
element,
mass,
pseudo
);
}
Sadly it's not possible to achieve what you want in a general way (or at least I don't know how), but if you are willing to just use a limited subset of Spirit.Qi the approach below could work.
The first thing to know is that when you use something like:
int_ >> double_
You just have a Boost.Proto expression that describes several terminals and how they are related. That expression by itself doesn't "know" anything about how to parse one int and then one double. Whenever you use parse/phrase_parse or assign one of these Proto expressions to a rule Spirit "compiles" that expression for a domain (Qi or Karma) and creates the parsers/generators that do the actual work.
Here you can see a small example that shows the exact types of the Proto and compiled Qi expressions:
Raw proto type:
boost::proto::exprns_::expr<boost::proto::tagns_::tag::shift_right, boost::proto::argsns_::list2<boost::spirit::terminal<boost::spirit::tag::int_> const&, boost::spirit::terminal<boost::spirit::tag::double_> const&>, 2l>
"Pretty" proto type:
shift_right(
terminal(boost::spirit::tag::int_)
, terminal(boost::spirit::tag::double_)
)
Compiled Qi type:
boost::spirit::qi::sequence<boost::fusion::cons<boost::spirit::qi::any_int_parser<int, 10u, 1u, -1>, boost::fusion::cons<boost::spirit::qi::any_real_parser<double, boost::spirit::qi::real_policies<double> >, boost::fusion::nil_> > >
As long as you have access to the original expression you can use Proto transforms/grammars to convert it to a suitable Karma expression.
In the example below I have used the following transformations:
Qi |Karma |Reason
------------|---------------|------
lexeme[expr]|verbatim[expr] | lexeme does not exist in Karma
omit[expr] |no_delimit[eps]| omit consumes an attribute in Karma
a >> b |a << b |
a > b |a << b | < does not exist in Karma
a - b |a | - does not exist in Karma
In order to achieve this transformations you can use boost::proto::or_ getting something similar to:
struct Grammar : proto::or_<
proto::when<Matcher1,Transform1>,
proto::when<Matcher2,Transform2>,
Matcher3,
Matcher4
>{};
I'll try to explain how this works.
MatcherN in the example below can be:
proto::terminal<boost::spirit::tag::omit>: matches only that specific terminal.
proto::terminal<proto::_>: matches any terminal not specifically matched before.
proto::subscript<proto::terminal<boost::spirit::tag::omit>,proto::_>: matches omit[expr] where expr can be anything.
proto::shift_right<ToKarma,ToKarma>: matches expr1 >> expr2 where expr1 and expr2 must recursively conform to the ToKarma grammar.
proto::nary_expr<proto::_,proto::vararg<ToKarma> >: matches any n-ary (unary, binary or actually n-ary like a function call a(b,c,d,e)) where each one of the elements of the expression conforms to the ToKarma grammar.
All the TransformN in this example are expression builders, here are some explanations:
_make_terminal(boost::spirit::tag::lexeme()): builds a proto::terminal<boost::spirit::tag::lexeme> (note that it is necessary to add () after the tag, you'll get an awful error if you forget them).
_make_subscript(_make_terminal(tag::no_delimit()), _make_terminal(tag::eps())): builds a proto::subscript<proto::terminal<tag::no_delimit>, proto::terminal<tag::eps> >, or the equivalent to no_delimit[eps].
_make_shift_left(ToKarma(proto::_left), ToKarma(proto::_right)): proto::_left means take the lhs of the original expression. ToKarma(proto::_left) means recursively apply the ToKarma grammar/transform to the lhs of the original expression. The whole _make_shift_left basically builds transformed_lhs << transformed_rhs.
A MatcherN by itself (not inside proto::when) is a shorthand for build an expression of the same type using as elements the result of recursively applying the transform to the original elements.
Full Sample (Running on WandBox)
#include <iostream>
#include <string>
#include <tuple>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/fusion/include/std_tuple.hpp>
namespace proto= boost::proto;
struct ToKarma: proto::or_<
//translation of directives
proto::when<proto::terminal<boost::spirit::tag::lexeme>, proto::_make_terminal(boost::spirit::tag::verbatim())>, //lexeme -> verbatim
proto::when<
proto::subscript<proto::terminal<boost::spirit::tag::omit>,proto::_>, //omit[expr] -> no_delimit[eps]
proto::_make_subscript(proto::_make_terminal(boost::spirit::tag::no_delimit()),proto::_make_terminal(boost::spirit::tag::eps()))
>,
proto::terminal<proto::_>, //if the expression is any other terminal leave it as is
//translation of operators
proto::when<proto::shift_right<ToKarma,ToKarma>, proto::_make_shift_left(ToKarma(proto::_left),ToKarma(proto::_right)) >, //changes '>>' into '<<'
proto::when<proto::greater<ToKarma,ToKarma>, proto::_make_shift_left(ToKarma(proto::_left),ToKarma(proto::_right)) >, //changes '>' into '<<'
proto::when<proto::minus<ToKarma,ToKarma>, ToKarma(proto::_left)>, //changes 'expr-whatever' into 'expr'
proto::nary_expr<proto::_,proto::vararg<ToKarma> > //if it's anything else leave it unchanged and recurse into the expression tree
>{};
template <typename ... Attr, typename Parser>
void test(const std::string& input, const Parser& parser)
{
std::cout << "Original: \"" << input << "\"\n";
std::tuple<Attr...> attr;
std::string::const_iterator iter = input.begin(), end = input.end();
bool result = boost::spirit::qi::phrase_parse(iter,end,parser,boost::spirit::qi::space,attr);
if(result && iter==end)
{
ToKarma to_karma;
std::cout << "Generated: \"" << boost::spirit::karma::format_delimited(to_karma(parser), boost::spirit::karma::space, attr) << '"' << std::endl;
}
else
{
std::cout << "Parsing failed. Unparsed: ->" << std::string(iter,end) << "<-" << std::endl;
}
}
int main(){
using namespace boost::spirit::qi;
test<std::string,double,std::string >("Xx 1.233 pseudo", lexeme[+(char_-' '-'\n')] >> double_ >> lexeme[+(char_-' '-'\n')]);
test<int,double>("foo 1 2.5", omit[lexeme[+alpha]] > int_ > double_);
}
PS:
Things that definitely won't work:
qi::rule
qi::grammar
qi::symbols
Things that don't exist in Karma:
qi::attr
qi::matches
qi::hold
Permutation parser ^
Sequential Or parser ||
Things that have different semantics in Karma:
qi::skip
And-predicate parser &
Not-predicate parser !

Constraining the existing Boost.Spirit real_parser (with a policy)

I want to parse a float, but not allow NaN values, so I generate a policy which inherits from the default policy and create a real_parser with it:
// using boost::spirit::qi::{real_parser,real_policies,
// phrase_parse,double_,char_};
template <typename T>
struct no_nan_policy : real_policies<T>
{
template <typename I, typename A>
static bool
parse_nan(I&, I const&, A&) {
return false;
}
};
real_parser<double, no_nan_policy<double> > no_nan;
// then I can use no_nan to parse, as in the following grammar
bool ok = phrase_parse(first, last,
no_nan[ref(valA) = _1] >> char_('#') >> double_[ref(b) = _1],
space);
But now I also want to ensure that the overall length of the string parsed with no_nan does not exceed 4, i.e. "1.23" or ".123" or even "2.e6" or "inf" is ok, "3.2323" is not, nor is "nan". I can not do that in the parse_n/parse_frac_n section of the policy, which separately looks left/right of the dot and can not communicate (...cleanly), which they would have to since the overall length is relevant.
The idea then was to extend real_parser (in boost/spirit/home/qi/numeric/real.hpp) and wrap the parse method -- but this class has no methods. Next to real_parser is the any_real_parser struct which does have parse, but these two structs do not seem to interact in any obvious way.
Is there a way to easily inject my own parse(), do some pre-checks, and then call the real parse (return boost::spirit::qi::any_real_parser<T, RealPolicy>::parse(...)) which then adheres to the given policies? Writing a new parser would be a last-resort method, but I hope there is a better way.
(Using Boost 1.55, i.e. Spirit 2.5.2, with C++11)
It seems I am so close, i.e. just a few changes to the double_ parser and I'd be done. This would probably be a lot more maintainable than adding a new grammar, since all the other parsing is done that way. – toting 7 hours ago
Even more maintainable would be to not write another parser at all.
You basically want to parse a floating point numbers (Spirit has got you covered) but apply some validations afterward. I'd do the validations in a semantic action:
raw [ double_ [_val = _1] ] [ _pass = !isnan_(_val) && px::size(_1)<=4 ]
That's it.
Explanations
Anatomy:
double_ [_val = _1] parses a double and assigns it to the exposed attribute as usual¹
raw [ parser ] matches the enclosed parser but exposes the raw source iterator range as an attribute
[ _pass = !isnan_(_val) && px::size(_1)<=4 ] - the business part!
This semantic action attaches to the raw[] parser. Hence
_1 now refers to the raw iterator range that already parsed the double_
_val already contains the "cooked" value of a successful match of double_
_pass is a Spirit context flag that we can set to false to make parsing fail.
Now the only thing left is to tie it all together. Let's make a deferred version of ::isnan:
boost::phoenix::function<decltype(&::isnan)> isnan_(&::isnan);
We're good to go.
Test Program
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <cmath>
#include <iostream>
int main ()
{
using It = std::string::const_iterator;
auto my_fpnumber = [] { // TODO encapsulate in a grammar struct
using namespace boost::spirit::qi;
using boost::phoenix::size;
static boost::phoenix::function<decltype(&::isnan)> isnan_(&::isnan);
return rule<It, double()> (
raw [ double_ [_val = _1] ] [ _pass = !isnan_(_val) && size(_1)<=4 ]
);
}();
for (std::string const s: { "1.23", ".123", "2.e6", "inf", "3.2323", "nan" })
{
It f = s.begin(), l = s.end();
double result;
if (parse(f, l, my_fpnumber, result))
std::cout << "Parse success: '" << s << "' -> " << result << "\n";
else
std::cout << "Parse rejected: '" << s << "' at '" << std::string(f,l) << "'\n";
}
}
Prints
Parse success: '1.23' -> 1.23
Parse success: '.123' -> 0.123
Parse success: '2.e6' -> 2e+06
Parse success: 'inf' -> inf
Parse rejected: '3.2323' at '3.2323'
Parse rejected: 'nan' at 'nan'
¹ The assignment has to be done explicitly here because we use semantic actions and they normally suppress automatic attribute propagation

boost spirit qi - conditional parsing

I need to match some input, construct a complex object and then match the rest input in two ways, depending on some props. of constructed object.
I've tried the qi::eps(/condition/) >> p1 | p2 but results is obvious for me.
The simplified code http://liveworkspace.org/code/1NzThA$6
In code snippet I match int_ from input and if the value == 0 try to match 'a' either - 'b'
But I've got OK for '0b' input! I've tried to play with braces but no luck.
I personally wouldn't use semantic actions (or phoenix) this lightly. It's not the "Spirit" of Qi (pun half-intended).
Boost Spirit: "Semantic actions are evil"?
This is my take:
rule<char const*, char()> r =
(omit [ int_(0) ] >> char_('a')) |
(omit [ int_(1) ] >> char_('b'))
;
See? Much cleaner. Also: automatic attribute propagation. See it live on http://liveworkspace.org/code/1T9h5
Output:
ok: a
fail
fail
ok: b
Full sample code:
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phoenix = boost::phoenix;
template <typename P>
inline
void test_parser(char const* input, P const& p, bool full_match = true)
{
char const* f(input);
char const* l(f + strlen(f));
char result;
if (qi::parse(f, l, p, result) && (!full_match || (f == l)))
std::cout << "ok: " << result << std::endl;
else
std::cout << "fail" << std::endl;
}
int main()
{
int p;
using namespace qi;
rule<char const*, char()> r =
(omit [ int_(0) ] >> char_('a')) |
(omit [ int_(1) ] >> char_('b'))
;
BOOST_SPIRIT_DEBUG_NODE(r);
test_parser("0a", r); //should match
test_parser("0b", r); //should not match
test_parser("1a", r); //should not match
test_parser("1b", r); //should match
}
Here's your rule:
qi::rule<char const*> r =
qi::int_ [phoenix::ref(p) = qi::_1]
>> (qi::eps(phoenix::ref(p) == 0)
>> qi::char_('a') | qi::char_('b'))
That to me says: accept '0a' or anything that ends with 'b'. That matches the results you get in your code snippet.
I confess I don't entirely understand your question, but if you are trying to get a kind of 'exclusive or' thing happening (as indicated by the comments in your code snippet), then this rule is incomplete. The workaround (actually more of a 'fix' than a 'workaround') you presented in your comment is one solution, though you don't need the qi::lazy as the phoenix-based Qi locals are already lazy, but you are on the right track. Here's another (more readable?) solution.
qi::rule<char const*> r =
qi::int_ [phoenix::ref(p) = qi::_1]
>> ((qi::eps(phoenix::ref(p) == 0) >> qi::char_('a')) |
(qi::eps(phoenix::ref(p) == 1) >> qi::char_('b')))
;
If you prefer to use the locals<> you added in your comment, that's fine too, but using the reference to p adds less overhead to the code, as long as you remember not to set p anywhere else in your grammar, and you don't end up building a grammar that recurses that rule :)

Using Boost.Spirit to extract certain tags/attributes from HTML

So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.
One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.
This is my current regex:
(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)
I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.
And of course, the Boost Spirit variant couldn't be missed:
sehe#natty:/tmp$ time ./spirit < bench > /dev/null
real 0m3.895s
user 0m3.820s
sys 0m0.070s
To be honest the Spirit code is slightly more versatile than the other variations:
it actually parses attributes a bit smarter, so it would be easy to handle a variety of attributes at the same time, perhaps depending on the containing element
the Spirit parser would be easier to adapt to cross-line matching. This could be most easily achieved
using spirit::istream_iterator<> (which is unfortunately notoriously slow)
using a memory-mapped file with raw const char* as iterators; The latter approach works equally well for the other techniques
The code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)
//#define BOOST_SPIRIT_DEBUG
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
void handle_attr(
const std::string& elem,
const std::string& attr,
const std::string& value)
{
if (elem == "img" && attr == "src")
std::cout << "value : " << value << std::endl;
}
typedef std::string::const_iterator It;
typedef qi::space_type Skipper;
struct grammar : qi::grammar<It, Skipper>
{
grammar() : grammar::base_type(html)
{
using namespace boost::spirit::qi;
using phx::bind;
attr = as_string [ +~char_("= \t\r\n/>") ] [ _a = _1 ]
>> '=' >> (
as_string [ '"' >> lexeme [ *~char_('"') ] >> '"' ]
| as_string [ "'" >> lexeme [ *~char_("'") ] >> "'" ]
) [ bind(handle_attr, _r1, _a, _1) ]
;
elem = lit('<')
>> as_string [ lexeme [ ~char_("-/>") >> *(char_ - space - char_("/>")) ] ] [ _a = _1 ]
>> *attr(_a);
html = (-elem) % +("</" | (char_ - '<'));
BOOST_SPIRIT_DEBUG_NODE(html);
BOOST_SPIRIT_DEBUG_NODE(elem);
BOOST_SPIRIT_DEBUG_NODE(attr);
}
qi::rule<It, Skipper> html;
qi::rule<It, Skipper, qi::locals<std::string> > elem;
qi::rule<It, qi::unused_type(std::string), Skipper, qi::locals<std::string> > attr;
};
int main(int argc, const char *argv[])
{
std::string s;
const static grammar html_;
while (std::getline(std::cin, s))
{
It f = s.begin(),
l = s.end();
if (!phrase_parse(f, l, html_, qi::space) || (f!=l))
std::cerr << "unparsed: " << std::string(f,l) << std::endl;
}
return 0;
}
Update
I did benchmarks.
Full disclosure is here: https://gist.github.com/c16725584493b021ba5b
It includes the full code used, the compilation flags and the body of test data (file bench) used.
In short
Regular expressions are indeed faster and way simpler here
Do not underestimate the time I spent debugging the Spirit grammar to get it correct!
Care has been taken to eliminate 'accidental' differences (by e.g.
keeping handle_attribute unchanged across the implementations, even though it makes sense mostly only for the Spirit implementation).
using the same line-wise input style and string iterators for both
Right now, all three implementations result in the exact same output
Everything built/timed on g++ 4.6.1 (c++03 mode), -O3
Edit in reply to the knee-jerk (and correct) response that you shouldn't be parsing HTML using Regexes:
You shouldn't be using regexen to parse non-trivial inputs (mainly, anything with a grammar. Of course Perl 5.10+ 'regex grammars' are an exception, because they are not isolated regexes anymore
HTML basically cannot be parsed, it is non-standard tag soup. Strict (X)HTML, are a different matter
According to Xaade, if you haven't got enough time to produce a perfect implementation using a standards compliant HTML reader, you should
"ask client if they want shit or not. If they want shit, you charge them more. Shit costs you more than them." -- Xaade
That said there are scenarios in which I'd do precisely what I suggest here: use a regex. Mainly, if it is to do a one-off quick search or to get daily, rough statistics of known data etc. YMMV and you should make your own call.
For timings and summaries, see:
Boost Regex answer below
Boost Xpressive answer here
Spirit answer here
I heartily suggest using a regex here:
typedef std::string::const_iterator It;
int main(int argc, const char *argv[])
{
const boost::regex re("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
std::string s;
boost::smatch what;
while (std::getline(std::cin, s))
{
It f = s.begin(), l = s.end();
do
{
if (!boost::regex_search(f, l, what, re))
break;
handle_attr("img", "src", what[2]);
f = what[0].second;
} while (f!=s.end());
}
return 0;
}
Use it like:
./test < index.htm
I cannot see any reason why the spirit based approach should/could be any faster?
Edit PS. Iff you claim that static optimization would be the key, why not just convert it into a Boost Expressive, static, regular expression?
Out of curiosity I redid my regex sample based on Boost Xpressive, using statically compiled regexes:
sehe#natty:/tmp$ time ./expressive < bench > /dev/null
real 0m2.146s
user 0m2.110s
sys 0m0.030s
Interestingly, there is no discernable speed difference when using the dynamic regular expression; however, on the whole the Xpressive version performs better than the Boost Regex version (by roughly 10%)
What is really nice, IMO, is that it was really almost matter of including the xpressive.hpp and changing a few namespaces around to change from Boost Regex to Xpressive. The API interface (as far as it was being used) is exactly the same.
The relevant code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)
typedef std::string::const_iterator It;
int main(int argc, const char *argv[])
{
using namespace boost::xpressive;
#if DYNAMIC
const sregex re = sregex::compile
("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
#else
const sregex re = "<img" >> +_s >> -*(~(set = '\\','>')) >>
"src" >> *_s >> '=' >> *_s
>> (s1 = as_xpr('"') | '\'') >> (s2 = -*_) >> s1;
#endif
std::string s;
smatch what;
while (std::getline(std::cin, s))
{
It f = s.begin(), l = s.end();
do
{
if (!regex_search(f, l, what, re))
break;
handle_attr("img", "src", what[2]);
f = what[0].second;
} while (f!=s.end());
}
return 0;
}

How would I perform this text pattern matching

Migrated from [Spirit-general] list
Good morning,
I'm trying to parse a relatively simple pattern across 4 std::strings,
extracting whatever the part which matches the pattern into a separate
std::string.
In an abstracted sense, here is what I want:
s1=<string1><consecutive number>, s2=<consecutive number><string2>,
s3=<string1><consecutive number>, s4=<consecutive number><string2>
Less abstracted:
s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
Actual contents:
s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu
j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc
jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj
xcvjlxk jcvljxlck jxcvl 4 cheese"
How would I perform this pattern matching?
Thanks for all suggestions,
Alec Taylor
Update 2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
Edit for context:
Feel free to add it to stackoverflow (I've been having some trouble
posting there)
I can't give you what I've done so far, because I wasn't sure if it
was within the capabilities of the boost::spirit libraries to do what
I'm trying to do
Edit: Re Update2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
It starts looking like a job for:
Tokenizing the 'garbage text/names' - you could make a symbol table of sorts on the fly and use it to match patterns (spirit Lex and Qi's symbol table (qi::symbol) could facilitate it, but I feel you could write that in any number of ways)
conversely, use regular expressions, as suggested before (below, and at least twice in mailing list).
Here's a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
\d+ match a sequence of digits in a "(subexpression)" (NUM1)
([a-z]+) match a name (just picked a simple definition of 'name')
.*? skip any length of garbage, but as little as possible before starting subsequent match
\d+ match another number (sequence of digits) (NUM2)
\2 followed by the same name (backreference)
You can see how you'd already be narrowing the list of matches to inspect down to 'potential' hits. You'd only have to /post-validate/ to see that NUM2 == NUM1+2
Two notes:
Add (...)+ around the tail part to allow repeated matching of patterns
(\d+) ([a-z]+)(.*?(\d+) \2)+
You may wish to make the garbage skip (.*?) aware of separators (by doing negative zerowidth assertions) to avoid more than 2 skipping delimiters (e.g. s\d+=" as a delimiting pattern). I leave it out of scope for clarity now, here's the gist:
((?!s\d+=").)*? -- beware of potential performance degradation
Alec, The following is a show-case of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about what is required input structure; I assumed
whitespace was strict (spaces as shown, no newlines)
the sequence numbers should be in increasing order
the sequence numbers should recur exactly in the text values
the keywords 'apple' and 'cheese' are in strict alternation
whether the keyword comes before or after the the sequence number in the text value, is also in strict alternation
Note There are about a dozen places in the implementation below, where significantly less complex choices could possibly have been made. For example, I could have hardcoded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However I wanted to
make no more assumptions than necessary
learn from the experience. Especially the topic of qi::locals<> and inherited attributes have been on my agenda for a while.
However, the solution allows a great deal of flexibility:
the keywords aren't hardcoded, and you could e.g. easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok. Look at Unsigned Integer Parsers for info on how to tune that behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry
{
int sequence;
std::string text;
};
both versions can be switched with the single #if 1/0 line
The sample uses Boost Spirit Qi for parsing.
Conversely, Boost Spirit Karma is used to display the result of parsing:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the actual contents given in the post is:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
On to the code.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
#if 1 // using fusion adapted struct
#include <boost/fusion/adapted/struct.hpp>
struct Entry
{
int sequence;
std::string text;
};
BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
#include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
typedef std::pair<int, std::string> Entry;
#endif
int main()
{
std::string input =
"s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
"j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
"jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
"xcvjlxk jcvljxlck jxcvl 4 cheese\"";
using namespace qi;
typedef std::string::const_iterator It;
It f(input.begin()), l(input.end());
int next = 1;
qi::rule<It, std::string(int)> label;
qi::rule<It, std::string(int)> value;
qi::rule<It, int()> number;
qi::rule<It, Entry(), qi::locals<int> > assign;
label %= qi::raw [
( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
| qi::uint_(qi::_r1) > qi::string(" cheese")
];
value %= '"'
>> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
>> label(qi::_r1)
>> qi::omit[ *(~qi::char_('"')) ]
>> '"';
number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);
std::vector<Entry> parsed;
bool ok = false;
try
{
ok = parse(f, l, assign % ", ", parsed);
if (ok)
{
using namespace karma;
std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
}
} catch(qi::expectation_failure<It>& e)
{
std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
} catch(const std::exception& e)
{
std::cerr << e.what() << std::endl;
}
if (!ok || (f!=l))
std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
}
Provided you can use c++11 compiler, parsing these patterns is pretty simple using AXE†:
#include <axe.h>
#include <string>
template<class I>
void num_value(I i1, I i2)
{
unsigned n;
unsigned next = 1;
// rule to match unsigned decimal number and compare it with another number
auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
// rule to match a single word
auto word = axe::r_alphastr();
// rule to match space characters
auto space = axe::r_any(" \t\n");
// semantic action - print to cout and increment next
auto e_cout = axe::e_ref([&](I i1, I i2)
{
std::cout << std::string(i1, i2) << '\n';
++next;
});
// there are only two patterns in this example
auto pattern1 = (word & +space & num) >> e_cout;
auto pattern2 = (num & +space & word) >> e_cout;
auto s1 = axe::r_find(pattern1);
auto s2 = axe::r_find(pattern2);
auto text = s1 & s2 & s1 & s2 & axe::r_end();
text(i1, i2);
}
To parse the text simply call num_value(text.begin(), text.end()); No changes required to parse unicode strings.
† I didn't test it.
Look into Boost.Regex. I've seen an almost-identical poosting in boost-users and the solution is to use regexes for some of the match work.