boost spirit parser : getting around the greedy kleene * - c++

I have a grammar that should match a sequence of characters followed by a single character drawn from a subset of the first set.
For example,
boost::spirit::qi::rule<Iterator, std::string()> grammar = *char_('a', 'z') >> char_('b', 'z');
Since the Kleene * is a greedy operator, it gobbles up everything and leaves nothing for the second parser, so it fails to match strings like "abcd".
Is there any way to get around this?

Yes, though your sample lacks the context for us to know what you actually need.
We need to know what constitutes a complete match, because right now "b" would be a valid match, as would "bb" or "bbb". So when the input is "bbb", what is the match going to be ("b", "bb" or "bbb")?
And when you (likely) answer "Obviously, bbb", then what happens for "bbbb"? When do you stop accepting chars from the subset? And if the Kleene star were not greedy, how much would you still want it to consume?
The above dialog is annoying, but the goal is to make you THINK about what you need. You do not need a non-greedy Kleene star. You probably want a validation constraint on the last char. Most likely, if the input is "bbba" you do not want to simply match "bbb", leaving "a"; instead you likely want the parse to fail, because "bbba" is not a valid token.
Assuming That...
I'd write
grammar = +char_("a-z") >> eps(px::back(_val) != 'a');
Meaning that we accept at least one character as long as it matches, asserting that the last character accepted is not 'a'.
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_stl.hpp>
namespace qi = boost::spirit::qi;
namespace px = boost::phoenix;
template <typename It>
struct P : qi::grammar<It, std::string()>
{
P() : P::base_type(start) {
using namespace qi;
start = +char_("a-z") >> eps(px::back(_val) != 'a');
}
private:
qi::rule<It, std::string()> start;
};
#include <iomanip>
int main() {
using It = std::string::const_iterator;
P<It> const p;
for (std::string const input : { "", "b", "bb", "bbb", "aaab", "a", "bbba" }) {
std::cout << std::quoted(input) << ": ";
std::string out;
It f = input.begin(), l = input.end();
if (parse(f, l, p, out)) {
std::cout << std::quoted(out);
} else {
std::cout << "(failed) ";
}
if (f != l)
std::cout << " Remaining: " << std::quoted(std::string(f,l));
std::cout << "\n";
}
}
Prints
"": (failed)
"b": "b"
"bb": "bb"
"bbb": "bbb"
"aaab": "aaab"
"a": (failed) Remaining: "a"
"bbba": (failed) Remaining: "bbba"
BONUS
A more generic, albeit less efficient, approach would be to match the leading characters with a look-ahead assertion that it isn't the last character of its kind:
start = *(char_("a-z") >> &char_("a-z")) >> char_("b-z");
A benefit here is that no Phoenix usage is required:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
template <typename It>
struct P : qi::grammar<It, std::string()>
{
P() : P::base_type(start) {
using namespace qi;
start = *(char_("a-z") >> &char_("a-z")) >> char_("b-z");
}
private:
qi::rule<It, std::string()> start;
};
#include <iomanip>
int main() {
using It = std::string::const_iterator;
P<It> const p;
for (std::string const input : { "", "b", "bb", "bbb", "aaab", "a", "bbba" }) {
std::cout << std::quoted(input) << ": ";
std::string out;
It f = input.begin(), l = input.end();
if (parse(f, l, p, out)) {
std::cout << std::quoted(out);
} else {
std::cout << "(failed) ";
}
if (f != l)
std::cout << " Remaining: " << std::quoted(std::string(f,l));
std::cout << "\n";
}
}

Related

Extract messages from stream and ignore data between the messages using a boost::spirit parser

I'm trying to create a (pretty simple) parser using boost::spirit::qi to extract messages from a stream. Each message starts with a short marker and ends with \r\n. The message body is ASCII text (letters and numbers) separated by commas. For example:
!START,01,2.3,ABC\r\n
!START,456.2,890\r\n
I'm using unit tests to check the parser and everything works well when I pass only correct messages one by one. But when I try to emulate some invalid input, like:
!START,01,2.3,ABC\r\n
trash-message
!START,456.2,890\r\n
The parser doesn't see any further messages after the unexpected text.
I'm new to boost::spirit and I'd like to know how a parser based on boost::spirit::qi::grammar is supposed to work.
My question is:
Should the parser slide along the input stream and try to find the beginning of a message?
Or should the caller check the parsing result and, in case of failure, advance the iterator and then call the parser again?
Many thanks for considering my request.
My question is: Should the parser slide along the input stream and try to find the beginning of a message?
Only when you tell it to. It's called qi::parse, not qi::search. But obviously you can make a grammar ignore things.
Live On Coliru
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
namespace qi = boost::spirit::qi;
struct Command {
enum Type { START, QUIT, TRASH } type = TRASH;
std::vector<std::string> args;
};
using Commands = std::vector<Command>;
BOOST_FUSION_ADAPT_STRUCT(Command, type, args)
template <typename It> struct CmdParser : qi::grammar<It, Commands()> {
CmdParser() : CmdParser::base_type(commands_) {
type_.add("!start", Command::START);
type_.add("!quit", Command::QUIT);
trash_ = *~qi::char_("\r\n"); // just ignore the entire line
arg_ = *~qi::char_(",\r\n");
command_ = qi::no_case[type_] >> *(',' >> arg_);
commands_ = *((command_ | trash_) >> +qi::eol);
BOOST_SPIRIT_DEBUG_NODES((trash_)(arg_)(command_)(commands_))
}
private:
qi::symbols<char, Command::Type> type_;
qi::rule<It, Commands()> commands_;
qi::rule<It, Command()> command_;
qi::rule<It, std::string()> arg_;
qi::rule<It> trash_;
};
int main() {
std::string_view input = "!START,01,2.3,ABC\r\n"
"trash-message\r\n"
"!START,456.2,890\r\n";
using It = std::string_view::const_iterator;
static CmdParser<It> const parser;
Commands parsed;
auto f = input.begin(), l = input.end();
if (parse(f, l, parser, parsed)) {
std::cout << "Parsed:\n";
for(Command const& cmd : parsed) {
std::cout << cmd.type;
for (auto& arg: cmd.args)
std::cout << ", " << quoted(arg);
std::cout << "\n";
}
} else {
std::cout << "Parse failed\n";
}
if (f != l)
std::cout << "Remaining unparsed: " << quoted(std::string(f, l)) << "\n";
}
Printing
Parsed:
0, "01", "2.3", "ABC"
2
0, "456.2", "890"

How can I keep certain semantic actions out of the AST in boost::spirit::qi

I have a huge amount of files I am trying to parse using boost::spirit::qi. Parsing is not a problem, but some of the files contain noise that I want to skip. Building a simple parser (not using boost::spirit::qi) verifies that I can avoid the noise by skipping anything that doesn't match rules at the beginning of a line. So, I'm looking for a way to write a line based parser that skip lines when not matching any rule.
The example below allows the grammar to skip lines if they don't match at all, but the 'junk' rule still inserts an empty instance of V(), which is unwanted behaviour.
The use of \r instead of \n in the example is intentional as I have encountered both \n, \r and \r\n in the files.
#include <iostream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/include/std_tuple.hpp>
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
namespace phx = boost::phoenix;
using V = std::tuple<std::string, double, double, double>;
namespace client {
template <typename Iterator>
struct VGrammar : qi::grammar<Iterator, std::vector<V>(), ascii::space_type> {
VGrammar() : VGrammar::base_type(start) {
using namespace qi;
v %= string("v") > double_ > double_ > double_;
junk = +(char_ - eol);
start %= +(v | junk);
v.name("v");
junk.name("junk");
start.name("start");
using phx::val;
using phx::construct;
on_error<fail>(
start,
std::cout
<< val("Error! Expecting \n\n'")
<< qi::_4
<< val("'\n\n here: \n\n'")
<< construct<std::string>(qi::_3, qi::_2)
<< val("'")
<< std::endl
);
//debug(v);
//debug(junk);
//debug(start);
}
qi::rule<Iterator> junk;
//qi::rule<Iterator, qi::unused_type()> junk; // Doesn't work either
//qi::rule<Iterator, qi::unused_type(), qi::unused_type()> junk; // Doesn't work either
qi::rule<Iterator, V(), ascii::space_type> v;
qi::rule<Iterator, std::vector<V>(), ascii::space_type> start;
};
} // namespace client
int main(int argc, char* argv[]) {
using iterator_type = std::string::const_iterator;
std::string input = "";
input += "v 1 2 3\r"; // keep v 1 2 3
input += "o a b c\r"; // parse as junk
input += "v 4 5 6 v 7 8 9\r"; // keep v 4 5 6, but parse v 7 8 9 as junk
input += " v 10 11 12\r\r"; // parse as junk
iterator_type iter = input.begin();
const iterator_type end = input.end();
std::vector<V> parsed_output;
client::VGrammar<iterator_type> v_grammar;
std::cout << "run" << std::endl;
bool r = phrase_parse(iter, end, v_grammar, ascii::space, parsed_output);
std::cout << "done ... r: " << (r ? "true" : "false") << ", iter==end: " << ((iter == end) ? "true" : "false") << std::endl;
if (r && (iter == end)) {
BOOST_FOREACH(V const& v_row, parsed_output) {
std::cout << std::get<0>(v_row) << ", " << std::get<1>(v_row) << ", " << std::get<2>(v_row) << ", " << std::get<3>(v_row) << std::endl;
}
}
return EXIT_SUCCESS;
}
Here's the output from the example:
run
done ... r: true, iter==end: true
v, 1, 2, 3
, 0, 0, 0
v, 4, 5, 6
v, 7, 8, 9
v, 10, 11, 12
And here is what I actually want the parser to return.
run
done ... r: true, iter==end: true
v, 1, 2, 3
v, 4, 5, 6
My main problem right now is to keep the 'junk' rule from adding an empty V() object. How do I accomplish this? Or am I overthinking the problem?
I have tried adding lit(junk) to the start rule, since lit() doesn't return anything, but this will not compile. It fails with: "static assertion failed: error_invalid_expression".
I have also tried to set the semantic action on the junk rule to qi::unused_type() but the rule still creates an empty V() in that case.
I am aware of the following questions, but they don't address this particular issue. I have tried out the comment skipper earlier, but it looks like I'll have to reimplement all the parse rules in the skipper in order to identify noise. My example is inspired by the solution in the last link:
How to skip line/block/nested-block comments in Boost.Spirit?
How to parse entries followed by semicolon or newline (boost::spirit)?
Version info:
Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
#define BOOST_VERSION 106200
and:
Linux raspberrypi 4.14.24-v7+ #1097 SMP Mon Mar 5 16:42:05 GMT 2018 armv7l GNU/Linux
g++ (Raspbian 4.9.2-10+deb8u1) 4.9.2
#define BOOST_VERSION 106200
For those who wonder: yes, I'm trying to parse files similar to Wavefront OBJ files, and I'm aware that there are already a bunch of parsers available. However, the data I'm parsing is part of a larger data structure which also requires parsing, so it does make sense to build a new parser.
What you want to achieve is called error recovery.
Unfortunately, Spirit does not have a nice way of doing it (there are also some internal design decisions which make it hard to add externally). However, in your case it is simple to achieve with a grammar rewrite.
#include <iostream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/include/std_tuple.hpp>
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
namespace phx = boost::phoenix;
using V = std::tuple<std::string, double, double, double>;
namespace client {
template <typename Iterator>
struct VGrammar : qi::grammar<Iterator, std::vector<V>()> {
VGrammar() : VGrammar::base_type(start) {
using namespace qi;
v = skip(blank)[no_skip[string("v")] > double_ > double_ > double_];
junk = +(char_ - eol);
start = (v || -junk) % eol;
v.name("v");
junk.name("junk");
start.name("start");
using phx::val;
using phx::construct;
on_error<fail>(
start,
std::cout
<< val("Error! Expecting \n\n'")
<< qi::_4
<< val("'\n\n here: \n\n'")
<< construct<std::string>(qi::_3, qi::_2)
<< val("'")
<< std::endl
);
//debug(v);
//debug(junk);
//debug(start);
}
qi::rule<Iterator> junk;
//qi::rule<Iterator, qi::unused_type()> junk; // Doesn't work either
//qi::rule<Iterator, qi::unused_type(), qi::unused_type()> junk; // Doesn't work either
qi::rule<Iterator, V()> v;
qi::rule<Iterator, std::vector<V>()> start;
};
} // namespace client
int main(int argc, char* argv[]) {
using iterator_type = std::string::const_iterator;
std::string input = "";
input += "v 1 2 3\r"; // keep v 1 2 3
input += "o a b c\r"; // parse as junk
input += "v 4 5 6 v 7 8 9\r"; // keep v 4 5 6, but parse v 7 8 9 as junk
input += " v 10 11 12\r\r"; // parse as junk
iterator_type iter = input.begin();
const iterator_type end = input.end();
std::vector<V> parsed_output;
client::VGrammar<iterator_type> v_grammar;
std::cout << "run" << std::endl;
bool r = parse(iter, end, v_grammar, parsed_output);
std::cout << "done ... r: " << (r ? "true" : "false") << ", iter==end: " << ((iter == end) ? "true" : "false") << std::endl;
if (r && (iter == end)) {
BOOST_FOREACH(V const& v_row, parsed_output) {
std::cout << std::get<0>(v_row) << ", " << std::get<1>(v_row) << ", " << std::get<2>(v_row) << ", " << std::get<3>(v_row) << std::endl;
}
}
return EXIT_SUCCESS;
}
I have tried adding lit(junk) to the start rule, since lit() doesn't return anything, but this will not compile. It fails with: "static assertion failed: error_invalid_expression".
What you're looking for would be omit[junk], but it should make no difference because it will still make the synthesized attribute optional<>.
Fixing Things
First of all, you need newlines to be significant. Which means you cannot skip space. Because it eats newlines. What's worse, you need leading whitespace to be significant as well (to junk that last line, e.g.). You cannot even use qi::blank for the skipper then. (See Boost spirit skipper issues).
So that you can still have whitespace inside the v rule, use a local skipper (one that doesn't eat newlines):
v %= &lit("v") >> skip(blank) [ string("v") > double_ > double_ > double_ ];
It engages the skipper only after establishing that there was no unexpected leading whitespace.
Note that the string("v") is a bit redundant this way, but that brings us to the second motive:
Second of all, I'm with you in avoiding semantic actions. However, this means you have to make your rules reflect your data structures.
In this particular instance, it means you should probably turn the line skipping a bit inside-out. What if you express the grammar as a straight repeat of v, interspersed with /whatever/, instead of just /newline/? I'd write that like:
junk = *(char_ - eol);
other = !v >> junk;
start = *(v >> junk >> eol % other);
Note that
the delimiter expression now uses the operator% (list operator) itself: (eol % other). What this cleverly accomplishes is that it keeps eating newlines as long as they are only delimited by "other" lines (anything !v at this point).
other is more constrained than junk, because junk may eat v, whereas other makes sure that never happens
therefore v >> junk allows the third line of your sample to be correctly processed (the line that has v 4 5 6 v 7 8 9\r)
Now it all works: Live On Coliru:
run
done ... r: true, iter==end: true
v, 1, 2, 3
v, 4, 5, 6
Perfecting It
You might be aware of the fact that this does not handle the case when the first line(s) are not v lines. Let's add that case to the sample and make sure it works as well:
Live On Coliru:
//#define BOOST_SPIRIT_DEBUG
#include <iostream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/include/std_tuple.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
using V = std::tuple<std::string, double, double, double>;
namespace client {
template <typename Iterator>
struct VGrammar : qi::grammar<Iterator, std::vector<V>()> {
VGrammar() : VGrammar::base_type(start) {
using namespace qi;
v %= &lit("v") >> skip(blank) [ string("v") > double_ > double_ > double_ ];
junk = *(char_ - eol);
other = !v >> junk;
start =
other >> eol % other >>
*(v >> junk >> eol % other);
BOOST_SPIRIT_DEBUG_NODES((v)(junk)(start))
on_error<fail>(
start,
std::cout
<< phx::val("Error! Expecting \n\n'") << qi::_4
<< "'\n\n here: \n\n'" << phx::construct<std::string>(qi::_3, qi::_2)
<< "'\n"
);
}
private:
qi::rule<Iterator> other, junk;
qi::rule<Iterator, V()> v;
qi::rule<Iterator, std::vector<V>()> start;
};
} // namespace client
int main() {
using iterator_type = std::string::const_iterator;
std::string input = "";
input += "o a b c\r"; // parse as junk
input += "v 1 2 3\r"; // keep v 1 2 3
input += "o a b c\r"; // parse as junk
input += "v 4 5 6 v 7 8 9\r"; // keep v 4 5 6, but parse v 7 8 9 as junk
input += " v 10 11 12\r\r"; // parse as junk
iterator_type iter = input.begin();
const iterator_type end = input.end();
std::vector<V> parsed_output;
client::VGrammar<iterator_type> v_grammar;
std::cout << "run" << std::endl;
bool r = parse(iter, end, v_grammar, parsed_output);
std::cout << "done ... r: " << (r ? "true" : "false") << ", iter==end: " << ((iter == end) ? "true" : "false") << std::endl;
if (iter != end)
std::cout << "Remaining unparsed: '" << std::string(iter, end) << "'\n";
if (r) {
BOOST_FOREACH(V const& v_row, parsed_output) {
std::cout << std::get<0>(v_row) << ", " << std::get<1>(v_row) << ", " << std::get<2>(v_row) << ", " << std::get<3>(v_row) << std::endl;
}
}
return EXIT_SUCCESS;
}

Regex: Finding all subexpressions (using boost::regex)

I have a file which contains some "entity" data in Valve's format. It's basically a key-value deal, and it looks like this:
{
"world_maxs" "3432 4096 822"
"world_mins" "-2408 -4096 -571"
"skyname" "sky_alpinestorm_01"
"maxpropscreenwidth" "-1"
"detailvbsp" "detail_sawmill.vbsp"
"detailmaterial" "detail/detailsprites_sawmill"
"classname" "worldspawn"
"mapversion" "1371"
"hammerid" "1"
}
{
"origin" "553 -441 322"
"targetname" "tonemap_global"
"classname" "env_tonemap_controller"
"hammerid" "90580"
}
Each pair of {} counts as one entity, and the rows inside count as KeyValues. As you can see, it's fairly straightforward.
I want to process this data into a vector<map<string, string> > in C++. To do this, I've tried using regular expressions that come with Boost. Here is what I have so far:
static const boost::regex entityRegex("\\{(\\s*\"([A-Za-z0-9_]+)\"\\s*\"([^\"]+)\")+\\s*\\}");
boost::smatch what;
while (regex_search(entitiesString, what, entityRegex)) {
cout << what[0] << endl;
cout << what[1] << endl;
cout << what[2] << endl;
cout << what[3] << endl;
break; // TODO
}
Easier-to-read regex:
\{(\s*"([A-Za-z0-9_]+)"\s*"([^"]+)")+\s*\}
I'm not sure the regex is well-formed for my problem yet, but it seems to print the last key-value pair (hammerid, 1) at least.
My question is, how would I go about extracting the "nth" matched subexpression within an expression? Or is there not really a practical way to do this? Would it perhaps be better to write two nested while-loops, one which searches for the {} patterns, and then one which searches for the actual key-value pairs?
Thanks!
Using a parser generator you can code a proper parser.
For example, using Boost Spirit you can define the rules of the grammar inline as C++ expressions:
start = *entity;
entity = '{' >> *entry >> '}';
entry = text >> text;
text = '"' >> *~char_('"') >> '"';
Here's a full demo:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <map>
using Entity = std::map<std::string, std::string>;
using ValveData = std::vector<Entity>;
namespace qi = boost::spirit::qi;
template <typename It, typename Skipper = qi::space_type>
struct Grammar : qi::grammar<It, ValveData(), Skipper>
{
Grammar() : Grammar::base_type(start) {
using namespace qi;
start = *entity;
entity = '{' >> *entry >> '}';
entry = text >> text;
text = '"' >> *~char_('"') >> '"';
BOOST_SPIRIT_DEBUG_NODES((start)(entity)(entry)(text))
}
private:
qi::rule<It, ValveData(), Skipper> start;
qi::rule<It, Entity(), Skipper> entity;
qi::rule<It, std::pair<std::string, std::string>(), Skipper> entry;
qi::rule<It, std::string()> text;
};
int main()
{
using It = boost::spirit::istream_iterator;
Grammar<It> parser;
It f(std::cin >> std::noskipws), l;
ValveData data;
bool ok = qi::phrase_parse(f, l, parser, qi::space, data);
if (ok) {
std::cout << "Parsing success:\n";
int count = 0;
for(auto& entity : data)
{
++count;
for (auto& entry : entity)
std::cout << "Entity " << count << ": [" << entry.first << "] -> [" << entry.second << "]\n";
}
} else {
std::cout << "Parsing failed\n";
}
if (f!=l)
std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
}
Which prints (for the input shown):
Parsing success:
Entity 1: [classname] -> [worldspawn]
Entity 1: [detailmaterial] -> [detail/detailsprites_sawmill]
Entity 1: [detailvbsp] -> [detail_sawmill.vbsp]
Entity 1: [hammerid] -> [1]
Entity 1: [mapversion] -> [1371]
Entity 1: [maxpropscreenwidth] -> [-1]
Entity 1: [skyname] -> [sky_alpinestorm_01]
Entity 1: [world_maxs] -> [3432 4096 822]
Entity 1: [world_mins] -> [-2408 -4096 -571]
Entity 2: [classname] -> [env_tonemap_controller]
Entity 2: [hammerid] -> [90580]
Entity 2: [origin] -> [553 -441 322]
Entity 2: [targetname] -> [tonemap_global]
I think doing it all with one regex expression is hard because of the variable number of entries inside each entity {}. Personally, I would consider simply using std::getline to do your parsing.
#include <map>
#include <vector>
#include <string>
#include <sstream>
#include <iostream>
std::istringstream iss(R"~(
{
"world_maxs" "3432 4096 822"
"world_mins" "-2408 -4096 -571"
"skyname" "sky_alpinestorm_01"
"maxpropscreenwidth" "-1"
"detailvbsp" "detail_sawmill.vbsp"
"detailmaterial" "detail/detailsprites_sawmill"
"classname" "worldspawn"
"mapversion" "1371"
"hammerid" "1"
}
{
"origin" "553 -441 322"
"targetname" "tonemap_global"
"classname" "env_tonemap_controller"
"hammerid" "90580"
}
)~");
int main()
{
std::string skip;
std::string entity;
std::vector<std::map<std::string, std::string> > vm;
// skip to open brace, read entity until close brace
while(std::getline(iss, skip, '{') && std::getline(iss, entity, '}'))
{
// turn entity into input stream
std::istringstream iss(entity);
// temporary map
std::map<std::string, std::string> m;
std::string key, val;
// skip to open quote, read key to close quote
while(std::getline(iss, skip, '"') && std::getline(iss, key, '"'))
{
// skip to open quote read val to close quote
if(std::getline(iss, skip, '"') && std::getline(iss, val, '"'))
m[key] = val;
}
// move map (no longer needed)
vm.push_back(std::move(m));
}
for(auto& m: vm)
{
for(auto& p: m)
std::cout << p.first << ": " << p.second << '\n';
std::cout << '\n';
}
}
Output:
classname: worldspawn
detailmaterial: detail/detailsprites_sawmill
detailvbsp: detail_sawmill.vbsp
hammerid: 1
mapversion: 1371
maxpropscreenwidth: -1
skyname: sky_alpinestorm_01
world_maxs: 3432 4096 822
world_mins: -2408 -4096 -571
classname: env_tonemap_controller
hammerid: 90580
origin: 553 -441 322
targetname: tonemap_global
I would have written it like this:
^\{(\s*"([A-Za-z0-9_]+)"\s*"([^"]+)")+\s*\}$
Or split the regex into two expressions: first match the curly braces, then loop through the content of the curly braces line by line.
Match curly braces: ^(\{[^\}]+)$
Match the lines: ^(\s*"([A-Za-z0-9_]+)"\s*"([^"]+)"\s*)$
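If you prefer to stay with boost::regex, that two-level plan can be written with nested regex iterators. The sketch below is mine, not the answerer's: it uses a simplified outer expression \{([^}]+)\} instead of the anchored variants above, together with the key/value pattern from the question, and collects everything into the vector<map<string, string> > the asker wanted:
#include <boost/regex.hpp>
#include <iostream>
#include <map>
#include <string>
#include <vector>
int main() {
    std::string const entitiesString =
        "{\n\"world_maxs\" \"3432 4096 822\"\n\"classname\" \"worldspawn\"\n}\n"
        "{\n\"origin\" \"553 -441 322\"\n\"classname\" \"env_tonemap_controller\"\n}\n";

    boost::regex const entityRegex("\\{([^}]+)\\}");                     // one {...} block
    boost::regex const pairRegex("\"([A-Za-z0-9_]+)\"\\s*\"([^\"]+)\""); // one "key" "value" pair

    std::vector<std::map<std::string, std::string> > entities;
    for (boost::sregex_iterator e(entitiesString.begin(), entitiesString.end(), entityRegex), eend;
         e != eend; ++e) {
        std::map<std::string, std::string> kv;
        std::string const body = (*e)[1].str();                          // text between the braces
        for (boost::sregex_iterator p(body.begin(), body.end(), pairRegex), pend; p != pend; ++p)
            kv[(*p)[1].str()] = (*p)[2].str();
        entities.push_back(kv);
    }

    for (auto const& entity : entities) {
        for (auto const& pair : entity)
            std::cout << pair.first << ": " << pair.second << "\n";
        std::cout << "\n";
    }
}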

C++ Boost Spirit, parsing data and storing the maximum

I'm trying the code sehe gave here: Boolean expression (grammar) parser in c++
I would like to create a string variable max that would store the maximum variable encountered during each parse (in lexicographic order, for example).
I tried things like:
var_ = qi::lexeme[ +alpha ] [_val = _1, if_(phx::ref(m) < _1) [phx::ref(m) = _1]];, but there is a (really long) compilation error.
var_ = qi::lexeme[ +alpha [_val = _1, if_(phx::ref(m) < _1) [phx::ref(m) = _1]]]; but with this one I only get the first character of a variable, which is restricting.
I also tried to simplify things using integers instead of strings for variables, but var_ = int_ [...] didn't work either, because int_ is already a parser (I think).
Do you have any ideas?
Thanks in advance
I'd say that
start = *word [ if_(_1>_val) [_val=_1] ];
should be fine. However, due to a bug (?) Phoenix statements in a single-statement semantic action do not compile. You can easily work around it using a no-op statement, like e.g. _pass=true in this context:
start = *word [ if_(_1>_val) [_val=_1], _pass = true ];
Now, for this I assumed a
rule<It, std::string()> word = +alpha;
If you insist you can cram it all into one rule though:
start = *as_string[lexeme[+alpha]] [ if_(_1>_val) [_val=_1], _pass = true ];
I don't recommend that.
Demo
Live On Coliru
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
template <typename It, typename Skipper>
struct max_parser : qi::grammar<It, std::string(), Skipper> {
max_parser() : max_parser::base_type(start) {
using namespace qi;
using phx::if_;
#if 1
word = lexeme [ +alpha ];
start = *word [ if_(_1>_val) [_val=_1], _pass = true ];
#else
start = *as_string[lexeme[+alpha]] [ if_(_1>_val) [_val=_1], _pass = true ];
#endif
}
private:
qi::rule<It, std::string(), Skipper> start, word;
};
int main() {
std::string const input("beauty shall be in ze eye of the beholder");
using It = std::string::const_iterator;
max_parser<It, qi::space_type> parser;
std::string data;
It it = input.begin(), end = input.end();
bool ok = qi::phrase_parse(it, end, parser, qi::space, data);
if (ok) {
std::cout << "Parse success: " << data << "\n";
} else {
std::cout << "Parse failed\n";
}
if (it != end)
std::cout << "Remaining unparsed: '" << std::string(it,end) << "'\n";
}
Prints:
Parse success: ze
Re: comment:
Thanks for your answers. I wanted to do both usual parsing and keeping the maximum encountered string, and it worked with: var_ = *as_string[qi::lexeme[ +digit ]] [if_(phx::ref(m) < _1) [phx::ref(m) = _1], _val = _1];
For even more fun, and in the interest of complete overkill, I've come up with something that I think is close to useful:
Live On Coliru
int main() {
do_test<int>(" 1 99 -1312 4 1014", -9999);
do_test<double>(" 1 NaN -4 7e3 7e4 -31e9");
do_test<std::string>("beauty shall be in ze eye of the beholder", "", qi::as_string[qi::lexeme[+qi::graph]]);
}
The sample prints:
Parse success: 5 elements with maximum of 1014
values: 1 99 -1312 4 1014
Parse success: 6 elements with maximum of 70000
values: 1 nan -4 7000 70000 -3.1e+10
Parse success: 9 elements with maximum of ze
values: beauty shall be in ze eye of the beholder
As you can see, with string we need to help Spirit a bit because it doesn't know how you would like to "define" a single "word". The test driver is completely generic:
template <typename T, typename ElementParser = typename boost::spirit::traits::create_parser<T>::type>
void do_test(std::string const& input,
T const& start_value = std::numeric_limits<T>::lowest(),
ElementParser const& element_parser = boost::spirit::traits::create_parser<T>::call())
{
using It = std::string::const_iterator;
vector_and_max<T> data;
It it = input.begin(), end = input.end();
bool ok = qi::phrase_parse(it, end, max_parser<It, T>(start_value, element_parser), qi::space, data);
if (ok) {
std::cout << "Parse success: " << data.first.size() << " elements with maximum of " << data.second << "\n";
std::copy(data.first.begin(), data.first.end(), std::ostream_iterator<T>(std::cout << "\t values: ", " "));
std::cout << "\n";
} else {
std::cout << "Parse failed\n";
}
if (it != end)
std::cout << "Remaining unparsed: '" << std::string(it,end) << "'\n";
}
The start-element and element-parser are passed to the constructor of our grammar:
template <typename T>
using vector_and_max = std::pair<std::vector<T>, T>;
template <typename It, typename T, typename Skipper = qi::space_type>
struct max_parser : qi::grammar<It, vector_and_max<T>(), Skipper> {
template <typename ElementParser>
max_parser(T const& start_value, ElementParser const& element_parser) : max_parser::base_type(start) {
using namespace qi;
using phx::if_;
_a_type running_max;
vector_with_max %=
eps [ running_max = start_value ]
>> *boost::proto::deep_copy(element_parser)
[ if_(_1>running_max) [running_max=_1], _pass = true ]
>> attr(running_max)
;
start = vector_with_max;
}
private:
qi::rule<It, vector_and_max<T>(), Skipper> start;
qi::rule<It, vector_and_max<T>(), Skipper, qi::locals<T> > vector_with_max;
};
Full Listing
For reference
Live On Coliru
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
template <typename T>
using vector_and_max = std::pair<std::vector<T>, T>;
template <typename It, typename T, typename Skipper = qi::space_type>
struct max_parser : qi::grammar<It, vector_and_max<T>(), Skipper> {
template <typename ElementParser>
max_parser(T const& start_value, ElementParser const& element_parser) : max_parser::base_type(start) {
using namespace qi;
using phx::if_;
_a_type running_max;
vector_with_max %=
eps [ running_max = start_value ]
>> *boost::proto::deep_copy(element_parser)
[ if_(_1>running_max) [running_max=_1], _pass = true ]
>> attr(running_max)
;
start = vector_with_max;
}
private:
qi::rule<It, vector_and_max<T>(), Skipper> start;
qi::rule<It, vector_and_max<T>(), Skipper, qi::locals<T> > vector_with_max;
};
template <typename T, typename ElementParser = typename boost::spirit::traits::create_parser<T>::type>
void do_test(std::string const& input,
T const& start_value = std::numeric_limits<T>::lowest(),
ElementParser const& element_parser = boost::spirit::traits::create_parser<T>::call())
{
using It = std::string::const_iterator;
vector_and_max<T> data;
It it = input.begin(), end = input.end();
bool ok = qi::phrase_parse(it, end, max_parser<It, T>(start_value, element_parser), qi::space, data);
if (ok) {
std::cout << "Parse success: " << data.first.size() << " elements with maximum of " << data.second << "\n";
std::copy(data.first.begin(), data.first.end(), std::ostream_iterator<T>(std::cout << "\t values: ", " "));
std::cout << "\n";
} else {
std::cout << "Parse failed\n";
}
if (it != end)
std::cout << "Remaining unparsed: '" << std::string(it,end) << "'\n";
}
int main() {
do_test<int>(" 1 99 -1312 4 1014");
do_test<double>(" 1 NaN -4 7e3 7e4 -31e9");
do_test<std::string>("beauty shall be in ze eye of the beholder", "", qi::as_string[qi::lexeme[+qi::graph]]);
}
Just for fun, here's how to do roughly¹ the same as in my other answer, and more, but without using boost spirit at all:
Live On Coliru
#include <algorithm>
#include <sstream>
#include <iterator>
#include <iostream>
int main() {
std::istringstream iss("beauty shall be in ze eye of the beholder");
std::string top2[2];
auto end = std::partial_sort_copy(
std::istream_iterator<std::string>(iss), {},
std::begin(top2), std::end(top2),
std::greater<std::string>());
for (auto it=top2; it!=end; ++it)
std::cout << "(Next) highest word: '" << *it << "'\n";
}
Output:
(Next) highest word: 'ze'
(Next) highest word: 'the'
¹ we're not nearly as specific about isalpha and isspace character types here
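If that distinction matters, a plain-C++ version can still restrict words to alphabetic characters by classifying them by hand. A rough sketch (again keeping only the single maximum, as in the first demo):
#include <cctype>
#include <iostream>
#include <string>
int main() {
    std::string const input = "beauty shall be in ze eye of the beholder";
    std::string word, best;
    auto flush = [&] {                 // compare the finished word against the running maximum
        if (!word.empty() && word > best) best = word;
        word.clear();
    };
    for (unsigned char c : input) {
        if (std::isalpha(c)) word += static_cast<char>(c);
        else flush();                  // anything non-alphabetic ends the current word
    }
    flush();
    std::cout << "Highest word: '" << best << "'\n";
}
which again prints 'ze' for this input.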

Boost Spirit Signals Successful Parsing Despite Token Being Incomplete

I have a very simple path construct that I am trying to parse with boost spirit.lex.
We have the following grammar:
token := [a-z]+
path := (token : path) | (token)
So we're just talking about colon separated lower-case ASCII strings here.
I have three examples "xyz", "abc:xyz", "abc:xyz:".
The first two should be deemed valid. The third one, which has a trailing colon, should not be deemed valid. Unfortunately the parser I have recognizes all three as being valid. The grammar should not allow an empty token, but apparently spirit is doing just that. What am I missing to get the third one rejected?
Also, if you read the code below, in comments there is another version of the parser that demands that all paths end with semicolons. I can get appropriate behavior when I activate those lines (i.e. rejection of "abc:xyz:;"), but this is not really what I want.
Anyone have any ideas?
Thanks.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <iostream>
#include <string>
using namespace boost::spirit;
using boost::phoenix::val;
template<typename Lexer>
struct PathTokens : boost::spirit::lex::lexer<Lexer>
{
PathTokens()
{
identifier = "[a-z]+";
separator = ":";
this->self.add
(identifier)
(separator)
(';')
;
}
boost::spirit::lex::token_def<std::string> identifier, separator;
};
template <typename Iterator>
struct PathGrammar
: boost::spirit::qi::grammar<Iterator>
{
template <typename TokenDef>
PathGrammar(TokenDef const& tok)
: PathGrammar::base_type(path)
{
using boost::spirit::_val;
path
=
(token >> tok.separator >> path)[std::cerr << _1 << "\n"]
|
//(token >> ';')[std::cerr << _1 << "\n"]
(token)[std::cerr << _1 << "\n"]
;
token
= (tok.identifier) [_val=_1]
;
}
boost::spirit::qi::rule<Iterator> path;
boost::spirit::qi::rule<Iterator, std::string()> token;
};
int main()
{
typedef std::string::iterator BaseIteratorType;
typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::string> > TokenType;
typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
typedef PathTokens<LexerType>::iterator_type TokensIterator;
typedef std::vector<std::string> Tests;
Tests paths;
paths.push_back("abc");
paths.push_back("abc:xyz");
paths.push_back("abc:xyz:");
/*
paths.clear();
paths.push_back("abc;");
paths.push_back("abc:xyz;");
paths.push_back("abc:xyz:;");
*/
for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
{
std::string str = *iter;
std::cerr << "*****" << str << "*****\n";
PathTokens<LexerType> tokens;
PathGrammar<TokensIterator> grammar(tokens);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
bool r = boost::spirit::lex::tokenize_and_parse(first, last, tokens, grammar);
std::cerr << r << " " << (first==last) << "\n";
}
}
In addition to what llonesmiz already said, here's a trick using qi::eoi that I sometimes use:
path = (
(token >> tok.separator >> path) [std::cerr << _1 << "\n"]
| token [std::cerr << _1 << "\n"]
) >> eoi;
This makes the grammar require eoi (end-of-input) at the end of a successful match. This leads to the desired result:
http://liveworkspace.org/code/23a7adb11889bbb2825097d7c553f71d
*****abc*****
abc
1 1
*****abc:xyz*****
xyz
abc
1 1
*****abc:xyz:*****
xyz
abc
0 1
The problem lies in the meaning of first and last after your call to tokenize_and_parse. first==last checks whether your string has been completely tokenized; you can't infer anything about the grammar from it. If you isolate the parsing like this, you obtain the expected result:
PathTokens<LexerType> tokens;
PathGrammar<TokensIterator> grammar(tokens);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
LexerType::iterator_type lexfirst = tokens.begin(first,last);
LexerType::iterator_type lexlast = tokens.end();
bool r = parse(lexfirst, lexlast, grammar);
std::cerr << r << " " << (lexfirst==lexlast) << "\n";
This is what I finally ended up with. It uses the suggestions from both #sehe and #llonesmiz. Note the conversion to std::wstring and the use of actions in the grammar definition, which were not present in the original post.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/bind.hpp>
#include <iostream>
#include <string>
//
// This example uses boost spirit to parse a simple
// colon-delimited grammar.
//
// The grammar we want to recognize is:
// identifier := [a-z]+
// separator = :
// path= (identifier separator path) | identifier
//
// From the boost spirit perspective this example shows
// a few things I found hard to come by when building my
// first parser.
// 1. How to flag an incomplete token at the end of input
// as an error. (use of boost::spirit::eoi)
// 2. How to bind an action on an instance of an object
// that is taken as input to the parser.
// 3. Use of std::wstring.
// 4. Use of the lexer iterator.
//
// This using directive will cause issues with boost::bind
// when referencing placeholders such as _1.
// using namespace boost::spirit;
//! A class that tokenizes our input.
template<typename Lexer>
struct Tokens : boost::spirit::lex::lexer<Lexer>
{
Tokens()
{
identifier = L"[a-z]+";
separator = L":";
this->self.add
(identifier)
(separator)
;
}
boost::spirit::lex::token_def<std::wstring, wchar_t> identifier, separator;
};
//! This class provides a callback that echoes strings to stderr.
struct Echo
{
void echo(boost::fusion::vector<std::wstring> const& t) const
{
using namespace boost::fusion;
std::wcerr << at_c<0>(t) << "\n";
}
};
//! The definition of our grammar, as described above.
template <typename Iterator>
struct Grammar : boost::spirit::qi::grammar<Iterator>
{
template <typename TokenDef>
Grammar(TokenDef const& tok, Echo const& e)
: Grammar::base_type(path)
{
using boost::spirit::_val;
path
=
((token >> tok.separator >> path)[boost::bind(&Echo::echo, e,::_1)]
|
(token)[boost::bind(&Echo::echo, &e, ::_1)]
) >> boost::spirit::eoi; // Look for end of input.
token
= (tok.identifier) [_val=boost::spirit::qi::_1]
;
}
boost::spirit::qi::rule<Iterator> path;
boost::spirit::qi::rule<Iterator, std::wstring()> token;
};
int main()
{
// A set of typedefs to make things a little clearer. This stuff is
// well described in the boost spirit documentation/examples.
typedef std::wstring::iterator BaseIteratorType;
typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::wstring> > TokenType;
typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
typedef Tokens<LexerType>::iterator_type TokensIterator;
typedef LexerType::iterator_type LexerIterator;
// Define some paths to parse.
typedef std::vector<std::wstring> Tests;
Tests paths;
paths.push_back(L"abc");
paths.push_back(L"abc:xyz");
paths.push_back(L"abc:xyz:");
paths.push_back(L":");
// Parse 'em.
for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
{
std::wstring str = *iter;
std::wcerr << L"*****" << str << L"*****\n";
Echo e;
Tokens<LexerType> tokens;
Grammar<TokensIterator> grammar(tokens, e);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
// Have the lexer consume our string.
LexerIterator lexFirst = tokens.begin(first, last);
LexerIterator lexLast = tokens.end();
// Have the parser consume the output of the lexer.
bool r = boost::spirit::qi::parse(lexFirst, lexLast, grammar);
// Print the status and whether or note all output of the lexer
// was processed.
std::wcerr << r << L" " << (lexFirst==lexLast) << L"\n";
}
}