Boost Spirit optional parser and backtracking - c++

Why does this parser leave 'b' in the attribute, even though the optional part wasn't matched?
using namespace boost::spirit::qi;
std::string str = "abc";
auto a = char_("a");
auto b = char_("b");
qi::rule<std::string::iterator, std::string()> expr;
expr = +a >> -(b >> +a);
std::string res;
bool r = qi::parse(
str.begin(),
str.end(),
expr >> lit("bc"),
res
);
It parses successfully, but res is "ab".
If I parse "abac" with expr alone, the optional part is matched and the attribute is "aba".
The same goes for "aac": the optional part doesn't start to match and the attribute is "aa".
But with "ab", the attribute is "ab", even though b gets backtracked and, as in the example, is matched by the next parser.
UPD
With expr.name("expr"); and debug(expr); I got
<expr>
<try>abc</try>
<success>bc</success>
<attributes>[[a, b]]</attributes>
</expr>

Firstly, it's UB to use the auto variables to keep the expression templates, because they hold references to the temporaries "a" and "b" [1].
Instead write
expr = +qi::char_("a") >> -(qi::char_("b") >> +qi::char_("a"));
or, if you insist:
auto a = boost::proto::deep_copy(qi::char_("a"));
auto b = boost::proto::deep_copy(qi::char_("b"));
expr = +a >> -(b >> +a);
Now, the >> lit("bc") part hiding in the parse call suggests you may expect backtracking over successfully matched tokens when a parse failure happens down the road.
That doesn't happen: Spirit generates PEG grammars, and always greedily matches from left to right.
As for the sample given: the result is "ab" because, even though backtracking does occur, the effects on the attribute are not rolled back without qi::hold: Live On Coliru
Container attributes are passed along by reference, and the effects of previous (successful) expressions are not rolled back unless you tell Spirit to. This way, you "pay for what you use" (copying temporaries all the time would be costly).
See e.g.
boost::spirit::qi duplicate parsing on the output
Understanding Boost.spirit's string parser
Boost spirit revert parsing
<a>
<try>abc</try>
<success>bc</success>
<attributes>[a]</attributes>
</a>
<a>
<try>bc</try>
<fail/>
</a>
<b>
<try>bc</try>
<success>c</success>
<attributes>[b]</attributes>
</b>
<a>
<try>c</try>
<fail/>
</a>
<bc>
<try>bc</try>
<success></success>
<attributes>[]</attributes>
</bc>
Success: 'ab'
[1] see here:
Assigning parsers to auto variables
Generating Spirit parser expressions from a variadic list of alternative parser expressions
boost spirit V2 qi bug associated with optimization level

Quoting @sehe from this SO question:
A string attribute is a container attribute and many elements could be
assigned into it by different parser subexpressions. Now for
efficiency reasons, Spirit doesn't rollback the values of emitted
attributes on backtracking.
So I wrapped the optional part in qi::hold, and that fixed it:
expr = +qi::char_("a") >> -(qi::hold[qi::char_("b") >> +qi::char_("a")]);
For more information, see the mentioned question and the hold docs.

Related

Parse key, value pairs when key is not unique

My input are multiple key, value pairs e.g.:
A=1, B=2, C=3, ..., A=4
I want to parse the input into the following type:
std::map< char, std::vector< int > > m
Values for equal keys shall be appended to the vector. So the parsed output should be equal to:
m['A']={1,4};
m['B']={2};
m['C']={3};
What is the simplest solution using 'boost::spirit::qi' ?
Here is one way to do it:
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/at_c.hpp>
#include <iostream>
#include <utility>
#include <string>
#include <vector>
#include <map>
namespace qi = boost::spirit::qi;
namespace fusion = boost::fusion;
int main()
{
std::string str = "A=1, B=2, C=3, A=4";
std::map< char, std::vector< int > > m;
auto inserter = [&m](fusion::vector< char, int > const& parsed,
qi::unused_type, qi::unused_type)
{
m[fusion::at_c< 0 >(parsed)].push_back(fusion::at_c< 1 >(parsed));
};
auto it = str.begin(), end = str.end();
bool res = qi::phrase_parse(it, end,
((qi::char_ >> '=' >> qi::int_)[inserter]) % ',',
qi::space);
if (res && it == end)
std::cout << "Parsing complete" << std::endl;
else
std::cout << "Parsing incomplete" << std::endl;
for (auto const& elem : m)
{
std::cout << "m['" << elem.first << "'] = {";
for (auto value : elem.second)
std::cout << " " << value;
std::cout << " }" << std::endl;
}
return 0;
}
A few comments about the implementation:
1. qi::phrase_parse is a Boost.Spirit algorithm that takes a pair of iterators, a parser, and a skip parser, and runs the parsers on the input denoted by the iterators. In the process, it updates the beginning iterator (it in this example) so that it points to the end of the consumed input upon return. The returned res value indicates whether the parsers have succeeded (i.e. the consumed input could be successfully parsed). There are other forms of qi::phrase_parse that allow extracting attributes (the parsed data, in Boost.Spirit terms), but we're not using attributes here because you have a peculiar requirement for the resulting container structure.
2. The skip parser is used to skip portions of the input between the elements of the main parser. In this case, qi::space means that any whitespace characters will be ignored in the input, so that e.g. "A = 1" and "A=1" are both parsed the same way. There is also the qi::parse family of algorithms, which do not take a skip parser and therefore require the main parser to handle all input without skips.
3. The (qi::char_ >> '=' >> qi::int_) part of the main parser matches a single character, followed by the equals sign character, followed by a signed integer. The equals sign is expressed as a literal (i.e. it is equivalent to the qi::lit('=') parser), which means it only matches the input but does not produce any parsed data. Therefore the result of this parser is an attribute that is a sequence of two elements: a character and an integer.
4. The % ',' part of the parser is a list parser, which parses any number of pieces of input described by the parser on the left (the parser described in item 3), separated by the pieces described by the parser on the right (comma characters in our case). As before, the comma character is a literal parser, so it doesn't produce output.
5. The [inserter] part is a semantic action, which is a function that is called by the parser every time it matches a portion of the input string. The parser passes all its parsed output as the first argument to this function. In our case the semantic action is attached to the parser described in item 3, which means a sequence of a character and an integer is passed. Boost.Spirit uses a fusion::vector to pass this data. The other two arguments of the semantic action are not used in this example and can be ignored.
6. The inserter function in this example is a lambda function, but it could be any other kind of function object, including a regular function, a function generated by std::bind, etc. The important part is that it has the specified signature and that the type of its first argument is compatible with the attribute of the parser to which it is attached as a semantic action. So, if we had a different parser in item 3, this argument would have to be changed accordingly.
7. fusion::at_c< N >() in the inserter obtains the element of the vector at index N. It is very similar to std::get< N >() when applied to std::tuple.

Boost spirit: Parse char_ with changing local variable value

I want to implement a grammar that requires parsing instance names and paths, where a path is a list of instance names separated by a divider. The divider can be either . (period) or / (slash) given in the input file before the paths are listed, e.g.:
DIVIDER .
a.b.c
x.y.z
Once set, the divider never changes for the whole file (i.e. if set to ., encountering a path like a/b/c should not parse correctly). Since I don't know what the divider is in advance, I'm thinking about storing it in a variable of my grammar and using that value in the corresponding char_ parsers (of course, the actual grammar is much more complex, but this is the part where I'm having trouble).
This is somewhat similar to this question: Boost spirit using local variables, but not quite what I want, since using the Nabialek trick allows parsing "invalid" paths after the divider is set.
I'm not asking for a complete solution here, but my question is essentially this: Can I parse values into members of my grammar and then use these values for further parsing of remaining input?
I'd use an inherited attribute:
qi::rule<It, std::string(char)> element = *~qi::char_(qi::_r1);
qi::rule<It, std::vector<std::string>(char)> path = element(qi::_r1) % qi::char_(qi::_r1);
// use it like:
std::vector<std::string> data;
bool ok = qi::parse(f, l, path('/'), data);
Alternatively you /can/ indeed bind to a local variable:
char delim = '/';
qi::rule<It, std::string()> element = *~qi::char_(delim);
qi::rule<It, std::vector<std::string>()> path = element % qi::char_(delim);
// use it like:
std::vector<std::string> data;
bool ok = qi::parse(f, l, path, data);
If you need it to be dynamic, use boost::phoenix::ref:
char delim = '/';
qi::rule<It, std::string()> element = *~qi::char_(boost::phoenix::ref(delim));
qi::rule<It, std::vector<std::string>()> path = element % qi::char_(boost::phoenix::ref(delim));
// use it like:
std::vector<std::string> data;
bool ok = qi::parse(f, l, path, data);

Using boost::spirit to match words

I want to create a parser that will match exactly two alphanumeric words from a string, such as:
message1 message2
and then save that into two variables of type std::string.
I've read this previous answer which seems to work for an endless amount of repetitions, which uses the following parser:
+qi::alnum % +qi::space
However when I try to do this:
bool const result = qi::phrase_parse(
input.begin(), input.end(),
+qi::alnum >> +qi::alnum,
+qi::space,
words
);
the words vector contains every single letter in a different string:
't'
'h'
'i'
's'
'i'
's'
This is extremely counter-intuitive, and I'm not sure as to why it's happening. Could someone please explain that?
Also, can I have two predefined strings to be populated instead of a std::vector?
Final note: I would like to avoid the using statement, as I would like to have every namespace clearly defined to help me understand how Spirit works.
Yes, but the skipper ignores the whitespace before you can act on it.
Use lexeme to control the skipper:
bool const result = qi::phrase_parse(
input.begin(), input.end(),
qi::lexeme [+qi::alnum] >> qi::lexeme [+qi::alnum],
qi::space,
words
);
Note the skipper should be qi::space instead of +qi::space.
See also Boost spirit skipper issues

Why doesn't boost::spirit match foo123 with (+alpha | +alnum) grammar?

I have a more complex boost::spirit grammar that doesn't match like I expected.
I was able to break it down to this minimal example: http://ideone.com/oPu2e7 (doesn't compile there, but compiles with VS2010)
Basically this is my grammar:
my_grammar() : my_grammar::base_type(start)
{
start %=
(+alpha | +alnum)
;
}
qi::rule<Iterator, std::string(), ascii::space_type> start;
It matches foobar, 123foo but doesn't match foo123. Why? I would expect it to match all three.
PEG parsers match greedily, left-to-right. That should be enough to explain.
But let's look at foo123: it matches "1 or more alpha" (+alpha), so the first branch is taken. The second branch is not taken, so the numerics 123 remain unparsed.
There's no "inherent" backtracking on the Kleene operators. You /can/ employ backtracking if you know e.g. that you need to parse the full input:
(+alpha >> eoi | +alnum >> eoi)

Boost Qi Composing rules using Functions

I'm trying to define some Boost::spirit::qi parsers for multiple subsets of a language with minimal code duplication. To do this, I created a few basic rule building functions. The original parser works fine, but once I started to use the composing functions, my parsers no longer seem to work.
The general language is of the form:
A B: C
There are subsets of the language where A, B, or C must be specific types, such as A is an int while B and C are floats. Here is the parser I used for that sub language:
using entry = boost::tuple<int, float, float>;
template <typename Iterator>
struct sublang : grammar<Iterator, entry(), ascii::space_type>
{
sublang() : sublang::base_type(start)
{
start = int_ >> float_ >> ':' >> float_;
}
rule<Iterator, entry(), ascii::space_type> start;
};
But since there are many subsets, I tried to create a function to build my parser rules:
template<typename AttrName, typename Value>
auto attribute(AttrName attrName, Value value)
{
return attrName >> ':' >> value;
}
So that I could build parsers for each subset more easily without duplicate information:
// in sublang
start = int_ >> attribute(float_, float_);
This fails however and I'm not sure why. In my clang testing, parsing just fails. In g++, it seems the program crashes.
Here's the full example code: http://coliru.stacked-crooked.com/a/8636f19b2e9bff8d
What is wrong with the current code and what would be the correct approach for this problem? I would like to avoid specifying the grammar of attributes and other elements in each sublanguage parser.
Quite simply: using auto with Spirit (or any EDSL based on Boost Proto and Boost Phoenix) is most likely Undefined Behaviour¹
Now, you can usually fix this using
BOOST_SPIRIT_AUTO
boost::proto::deep_copy
the new facility that's coming in the most recent version of Boost (TODO add link)
In this case,
template<typename AttrName, typename Value>
auto attribute(AttrName attrName, Value value) {
return boost::proto::deep_copy(attrName >> ':' >> value);
}
fixes it: Live On Coliru
Alternatively
you could use qi::lazy[] with inherited attributes.
I do very similar things in the prop_key rule in Reading JSON file with C++ and BOOST.
you could have a look at the Keyword List Operator from the Spirit Repository. It's designed to allow easier construction of grammars like:
no_constraint_person_rule %=
kwd("name")['=' > parse_string ]
/ kwd("age") ['=' > int_]
/ kwd("size") ['=' > double_ > 'm']
;
This you could potentially combine with the Nabialek Trick. I'd search the answers on SO for examples. (One is Grammar balancing issue)
¹ Except for entirely stateless actors (Eric Niebler on this) and expression placeholders. See e.g.
Assigning parsers to auto variables
undefined behaviour somewhere in boost::spirit::qi::phrase_parse
C++ Boost qi recursive rule construction
boost spirit V2 qi bug associated with optimization level
Some examples
Define parsers parameterized with sub-parsers in Boost Spirit
Generating Spirit parser expressions from a variadic list of alternative parser expressions