I am using Boost Spirit X3 to parse escaped ASCII strings. I came across this answer, but it throws an expectation exception. In the code below I have changed the expectation operator in the original to the sequence operator to disable the exception. When run, the code parses the input and assigns the correct value to the attribute, but it returns false and does not consume the input. Any ideas what I've done wrong here?
gcc version 10.3.0
boost 1.71
std = c++17
#include <boost/spirit/home/x3.hpp>
#include <string>
#include <iostream>
namespace x3 = boost::spirit::x3;
using namespace std::string_literals;
//changed expectation to sequence
auto const qstring = x3::lexeme['"' >> *(
"\\n" >> x3::attr('\n')
| "\\b" >> x3::attr('\b')
| "\\f" >> x3::attr('\f')
| "\\t" >> x3::attr('\t')
| "\\v" >> x3::attr('\v')
| "\\0" >> x3::attr('\0')
| "\\r" >> x3::attr('\r')
| "\\n" >> x3::attr('\n')
| "\\" >> x3::char_("\"\\")
| "\\\"" >> x3::char_('"')
| ~x3::char_('"')
) >> '"'];
int main(int, char**){
auto const quoted = "\"Hel\\\"lo Wor\\\"ld"s;
auto const expected = "Hel\"lo Wor\"ld"s;
std::string result;
auto first = quoted.begin();
auto const last = quoted.end();
bool ok = x3::phrase_parse(first, last, qstring, x3::ascii::space, result);
std::cout << "parse returned " << std::boolalpha << ok << '\n';
std::cout << result << " == " << expected << " is " << std::boolalpha << (result == expected) << '\n';
std::cout << "first == last = " << (first == last) << '\n';
std::cout << "first = " << *first << '\n';
return 0;
}
Your input isn't terminated with a quote character. Writing it as a raw string literal helps:
std::string const qinput = R"("Hel\"lo Wor\"ld)";
Should be
std::string const qinput = R"("Hel\"lo Wor\"ld")";
Now, the rest is common container handling: in Spirit, when a rule fails (also when it just backtracks a branch) the container attribute is not rolled back. See e.g. boost::spirit::qi duplicate parsing on the output, Understanding Boost.spirit's string parser, etc.
Basically, you cannot rely on the result if the parse failed. This is likely why the original had an expectation point: to raise an exception.
A full demonstration of the correct behavior:
Live On Coliru
#include <boost/spirit/home/x3.hpp>
#include <string>
#include <iostream>
#include <iomanip>
namespace x3 = boost::spirit::x3;
auto escapes = []{
x3::symbols<char> sym;
sym.add
("\\b", '\b')
("\\f", '\f')
("\\t", '\t')
("\\v", '\v')
("\\0", '\0')
("\\r", '\r')
("\\n", '\n')
("\\\\", '\\')
("\\\"", '"')
;
return sym;
}();
auto const qstring = x3::lexeme['"' >> *(escapes | ~x3::char_('"')) >> '"'];
int main(){
auto squote = [](std::string_view s) { return std::quoted(s, '\''); };
std::string const expected = R"(Hel"lo Wor"ld)";
for (std::string const qinput : {
R"("Hel\"lo Wor\"ld)", // oops no closing quote
R"("Hel\"lo Wor\"ld")",
"\"Hel\\\"lo Wor\\\"ld\"", // if you insist
R"("Hel\"lo Wor\"ld" trailing data)",
})
{
std::cout << "\n -- input " << squote(qinput) << "\n";
std::string result;
auto first = cbegin(qinput);
auto last = cend(qinput);
bool ok = x3::phrase_parse(first, last, qstring, x3::space, result);
ok &= (first == last);
std::cout << "parse returned " << std::boolalpha << ok << "\n";
std::cout << squote(result) << " == " << squote(expected) << " is "
<< (result == expected) << "\n";
if (first != last)
std::cout << "Remaining input unparsed: " << squote({first, last})
<< "\n";
}
}
Prints
-- input '"Hel\\"lo Wor\\"ld'
parse returned false
'Hel"lo Wor"ld' == 'Hel"lo Wor"ld' is true
Remaining input unparsed: '"Hel\\"lo Wor\\"ld'
-- input '"Hel\\"lo Wor\\"ld"'
parse returned true
'Hel"lo Wor"ld' == 'Hel"lo Wor"ld' is true
-- input '"Hel\\"lo Wor\\"ld"'
parse returned true
'Hel"lo Wor"ld' == 'Hel"lo Wor"ld' is true
-- input '"Hel\\"lo Wor\\"ld" trailing data'
parse returned false
'Hel"lo Wor"ld' == 'Hel"lo Wor"ld' is true
Remaining input unparsed: 'trailing data'
I am a beginner with regex in C++, and I was wondering why this code:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main() {
std::string s = "? 8==2 : true ! false";
boost::regex re("\\?\\s+(.*)\\s*:\\s*(.*)\\s*\\!\\s*(.*)");
boost::sregex_token_iterator p(s.begin(), s.end(), re, -1); // token iterator over the sequence, using that regex
boost::sregex_token_iterator end; // default-constructed end-of-sequence marker
while (p != end)
std::cout << *p++ << '\n';
}
prints an empty string. I put the regex in regexTester and it matches the string correctly, but here, when I try to iterate over the matches, it returns nothing.
I think the tokenizer is actually meant to split text by some delimiter, and the delimiter is not included. Compare with std::regex_token_iterator:
std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).
Indeed you invoke exactly this mode as per the docs:
if submatch is -1, then enumerates all the text sequences that did not match the expression re (that is to performs field splitting).
(emphasis mine).
So, just fix that:
for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
++p)
{
boost::sub_match<It> const& current = *p;
if (current.matched) {
std::cout << std::quoted(current.str()) << '\n';
} else {
std::cout << "non matching" << '\n';
}
}
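The difference between the two submatch modes can be sketched with the standard library's std::sregex_token_iterator, which the documentation quoted above describes. This is an illustration under that assumption, not the Boost code from the question, and the helper name tokens is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration): collects what a
// std::sregex_token_iterator yields for the given submatch index.
//   -1 : the text BETWEEN matches (field splitting)
//    0 : the whole text of each match
std::vector<std::string> tokens(std::string const& s, std::regex const& re, int submatch) {
    std::vector<std::string> out;
    for (std::sregex_token_iterator p(s.begin(), s.end(), re, submatch), e; p != e; ++p)
        out.push_back(p->str());
    return out;
}
```

For the input "a, b,c" and a comma-plus-whitespace regex, submatch -1 yields the fields a, b and c, while submatch 0 yields the two delimiters themselves. A regex that matches the entire input therefore leaves nothing but empty text to report in -1 mode.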
Other Observations
All the greedy Kleene stars are a recipe for trouble. You won't ever find a second match, because the first one's .* at the end will by definition gobble up all remaining input.
Instead, make them non-greedy (.*?) and/or much more precise (like isolating some character set, or mandating non-space characters?).
boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
// Or, if you don't want raw string literals:
boost::regex re("\\?\\s+(.*?)\\s*:\\s*(.*?)\\s*\\!\\s*(.*?)");
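The effect of greediness on the number of matches can be checked in isolation. The sketch below uses std::regex with its default ECMAScript grammar rather than boost::regex, and the helper name count_matches is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper (an assumption for illustration): counts the
// non-overlapping matches of `pattern` found in `s`, scanning left to right.
std::size_t count_matches(std::string const& s, std::string const& pattern) {
    std::regex const re(pattern);
    auto const first = std::sregex_iterator(s.begin(), s.end(), re);
    auto const last = std::sregex_iterator(); // end-of-sequence iterator
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With an input holding two ternary-like statements, such as "? a : b ! c;? d : e ! f;", the greedy pattern should report a single match spanning the whole input, while the non-greedy pattern should report two.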
Live Demo
#include <boost/regex.hpp>
#include <iomanip>
#include <iostream>
#include <string>
int main() {
using It = std::string::const_iterator;
std::string const s =
"? 8==2 : true ! false;"
"? 9==3 : 'book' ! 'library';";
boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
{
std::cout << "=== regex_search:\n";
boost::smatch results;
for (It b = s.begin(); boost::regex_search(b, s.end(), results, re); b = results[0].end()) {
std::cout << results.str() << "\n";
std::cout << "remain: " << std::quoted(std::string(results[0].second, s.end())) << "\n";
}
}
std::cout << "=== token iteration:\n";
for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
++p)
{
boost::sub_match<It> const& current = *p;
if (current.matched) {
std::cout << std::quoted(current.str()) << '\n';
} else {
std::cout << "non matching" << '\n';
}
}
}
Prints
=== regex_search:
? 8==2 : true !
remain: "false;? 9==3 : 'book' ! 'library';"
? 9==3 : 'book' !
remain: "'library';"
=== token iteration:
"? 8==2 : true ! "
"? 9==3 : 'book' ! "
BONUS: Parser Expressions
Instead of abusing regexen to do parsing, you could generate a parser, e.g. using Boost Spirit:
Live On Coliru
#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/adapted.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
namespace x3 = boost::spirit::x3;
int main() {
std::string const s =
"? 8==2 : true ! false;"
"? 9==3 : 'book' ! 'library';";
using expression = std::string;
using ternary = std::tuple<expression, expression, expression>;
std::vector<ternary> parsed;
auto expr_ = x3::lexeme [+(x3::graph - ';')];
auto ternary_ = "?" >> expr_ >> ":" >> expr_ >> "!" >> expr_;
std::cout << "=== parser approach:\n";
if (x3::phrase_parse(begin(s), end(s), *x3::seek[ ternary_ ], x3::space, parsed)) {
for (auto [cond, e1, e2] : parsed) {
std::cout
<< " condition " << std::quoted(cond) << "\n"
<< " true expression " << std::quoted(e1) << "\n"
<< " else expression " << std::quoted(e2) << "\n"
<< "\n";
}
} else {
std::cout << "non matching" << '\n';
}
}
Prints
=== parser approach:
condition "8==2"
true expression "true"
else expression "false"
condition "9==3"
true expression "'book'"
else expression "'library'"
This is much more extensible, will easily support recursive grammars and will be able to synthesize a typed representation of your syntax tree, instead of just leaving you with scattered bits of string.
I tried to use qi::uint_parser<int>(), but it behaves the same as qi::uint_: both match integers in the range from 0 to std::numeric_limits<unsigned int>::max().
Is qi::uint_parser<int>() designed to be like this? Which parser should I use to match an integer in the range from 0 to std::numeric_limits<int>::max()? Thanks.
Simplest demo, attaching a semantic action to do the range check:
uint_ [ _pass = (_1>=0 && _1<=std::numeric_limits<int>::max()) ];
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <limits>
#include <string>
template <typename It>
struct MyInt : boost::spirit::qi::grammar<It, int()> {
MyInt() : MyInt::base_type(start) {
using namespace boost::spirit::qi;
start %= uint_ [ _pass = (_1>=0 && _1<=std::numeric_limits<int>::max()) ];
}
private:
boost::spirit::qi::rule<It, int()> start;
};
template <typename Int>
void test(Int value, char const* logical) {
MyInt<std::string::const_iterator> p;
std::string const input = std::to_string(value);
std::cout << " ---------------- Testing '" << input << "' (" << logical << ")\n";
auto f = input.begin(), l = input.end();
int parsed;
if (parse(f, l, p, parsed)) {
std::cout << "Parse success: " << parsed << "\n";
} else {
std::cout << "Parse failed\n";
}
if (f!=l) {
std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}
}
int main() {
unsigned maxint = std::numeric_limits<int>::max();
MyInt<std::string::const_iterator> p;
test(maxint , "maxint");
test(maxint-1, "maxint-1");
test(maxint+1, "maxint+1");
test(0 , "0");
test(-1 , "-1");
}
Prints
---------------- Testing '2147483647' (maxint)
Parse success: 2147483647
---------------- Testing '2147483646' (maxint-1)
Parse success: 2147483646
---------------- Testing '2147483648' (maxint+1)
Parse failed
Remaining unparsed: '2147483648'
---------------- Testing '0' (0)
Parse success: 0
---------------- Testing '-1' (-1)
Parse failed
Remaining unparsed: '-1'
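For comparison, the same acceptance rule (decimal digits only, value no larger than std::numeric_limits<int>::max()) can be sketched without Spirit using std::from_chars from C++17. This swaps in a different technique from the semantic action above. The helper name parse_nonneg_int is hypothetical, and unlike the Qi demo it rejects input with trailing characters instead of reporting them.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (an assumption for illustration): accepts only an
// unsigned decimal literal whose value fits in [0, INT_MAX].
std::optional<int> parse_nonneg_int(std::string_view s) {
    // std::from_chars for int would accept a leading '-', so anything that
    // does not start with a digit is rejected up front.
    if (s.empty() || s.front() < '0' || s.front() > '9')
        return std::nullopt;

    int value = 0;
    char const* const end = s.data() + s.size();
    auto const [ptr, ec] = std::from_chars(s.data(), end, value);

    // ec reports out-of-range values; ptr != end means trailing characters.
    if (ec != std::errc{} || ptr != end)
        return std::nullopt;
    return value;
}
```

Because the target type is int, the overflow check comes from from_chars itself, so no separate comparison against the maximum is needed.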
I am currently implementing a parser which succeeds on the "strongest" match for spirit::qi. There are meaningful applications for such a thing. E.g matching references to either simple refs (eg "willy") or namespace qualified refs (eg. "willy::anton"). That's not my actual real world case but it is almost self-explanatory, I guess. At least it helped me to track down the issue.
I found a solution for that. It works perfectly, when the skipper parser is not involved (i.e. there is nothing to skip). It does not work as expected if there are areas which need skipping.
I believe I tracked down the problem. It seems that under certain conditions spaces are not skipped although they should be.
Below is a self-contained working example. It loops over some rules and some inputs to provide enough information. If you run it with BOOST_SPIRIT_DEBUG enabled, you get, in particular, this output:
<qualifier>
<try> :: anton</try>
<fail/>
</qualifier>
I think this one should not have failed. Am I right in guessing so? Does anyone know a way around that? Or is it just my poor understanding of qi semantics? Thank you very much for your time. :)
My environment: MSVC 2015 latest, target win32 console
#define BOOST_SPIRIT_DEBUG
#include <io.h>
#include<map>
#include <boost/spirit/include/qi.hpp>
typedef std::string::const_iterator iterator_type;
namespace qi = boost::spirit::qi;
using map_type = std::map<std::string, qi::rule<iterator_type, std::string()>&>;
namespace maxence { namespace parser {
template <typename Iterator>
struct ident : qi::grammar<Iterator, std::string()>
{
ident();
qi::rule<Iterator, std::string()>
id, id_raw;
qi::rule<Iterator, std::string()>
not_used,
qualifier,
qualified_id, simple_id,
id_reference, id_reference_final;
map_type rules = {
{ "E1", id },
{ "E2", id_raw}
};
};
template <typename Iterator>
// we actually don't need the start rule (see below)
ident<Iterator>::ident() : ident::base_type(not_used)
{
id_reference = (!simple_id >> qualified_id) | (!qualified_id >> simple_id);
id_reference_final = id_reference;
///////////////////////////////////////////////////
// standard simple id (not followed by
// delimiter "::")
simple_id = (qi::alpha | '_') >> *(qi::alnum | '_') >> !qi::lit("::");
///////////////////////////////////////////////////
// this is qualifier <- "::" simple_id
// I repeat the simple_id pattern here to make sure
// this demo has no "early match" issues
qualifier = qi::string("::") > (qi::alpha | '_') >> *(qi::alnum | '_');
///////////////////////////////////////////////////
// this is: qualified_id <- simple_id qualifier*
qualified_id = (qi::alpha | '_') >> *(qi::alnum | '_') >> +(qualifier) >> !qi::lit("::");
id = id_reference_final;
id_raw = qi::raw[id_reference_final];
BOOST_SPIRIT_DEBUG_NODES(
(id)
(id_raw)
(qualifier)
(qualified_id)
(simple_id)
(id_reference)
(id_reference_final)
)
}
}}
int main()
{
maxence::parser::ident<iterator_type> ident;
using ss_map_type = std::map<std::string, std::string>;
ss_map_type parser_input =
{
{ "Simple id (behaves ok)", "willy" },
{ "Qualified id (behaves ok)", "willy::anton" },
{ "Skipper involved (unexpected)", "willy :: anton" }
};
for (ss_map_type::const_iterator input = parser_input.begin(); input != parser_input.end(); input++) {
for (map_type::const_iterator example = ident.rules.begin(); example != ident.rules.end(); example++) {
std::string to_parse = input->second;
std::string result;
std::string parser_name = (example->second).name();
std::cout << "--------------------------------------------" << std::endl;
std::cout << "Description: " << input->first << std::endl;
std::cout << "Parser [" << parser_name << "] parsing [" << to_parse << "]" << std::endl;
auto b(to_parse.begin()), e(to_parse.end());
// --- test for parser success
bool success = qi::phrase_parse(b, e, (example)->second, qi::space, result);
if (success) std::cout << "Parser succeeded. Result: " << result << std::endl;
else std::cout << " Parser failed. " << std::endl;
//--- test for EOI
if (b == e) {
std::cout << "EOI reached.";
if (success) std::cout << " The sun is shining brightly. :)";
} else {
std::cout << "Failure: EOI not reached. Remaining: [";
while (b != e) std::cout << *b++; std::cout << "]";
}
std::cout << std::endl << "--------------------------------------------" << std::endl;
}
}
return 0;
}
I have the following line:
/90pv-RKSJ-UCS2C usecmap
std::string const line = "/90pv-RKSJ-UCS2C usecmap";
auto first = line.begin(), last = line.end();
std::string label, token;
bool ok = qi::phrase_parse(
first, last,
qi::lexeme [ "/" >> +~qi::char_(" ") ] >> ' ' >> qi::lexeme[+~qi::char_(' ')] , qi::space, label, token);
if (ok)
std::cout << "Parse success: label='" << label << "', token='" << token << "'\n";
else
std::cout << "Parse failed\n";
if (first!=last)
std::cout << "Remaining unparsed input: '" << std::string(first, last) << "'\n";
I want 90pv-RKSJ-UCS2C in the label variable and usecmap in the token variable.
I can extract the 90pv-RKSJ-UCS2C value, but not usecmap.
With space as the skipper, you cannot ever match ' ' (it is skipped!). See also: Boost spirit skipper issues
So, either don't use a skipper, or allow the skipper to eat it:
bool ok = qi::phrase_parse(
first, last,
qi::lexeme [ "/" >> +qi::graph ] >> qi::lexeme[+qi::graph], qi::blank, label, token);
Notes:
I used qi::graph instead of the ~qi::char_(" ") formulation
I used qi::blank as the skipper because you described the input as a "line", which implies that line ends should not be skipped
Demo
Live On Coliru
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
int main()
{
std::string const line = "/90pv-rksj-ucs2c usecmap";
auto first = line.begin(), last = line.end();
std::string label, token;
bool ok = qi::phrase_parse(
first, last,
qi::lexeme [ "/" >> +qi::graph ] >> qi::lexeme[+qi::graph], qi::blank, label, token);
if (ok)
std::cout << "parse success: label='" << label << "', token='" << token << "'\n";
else
std::cout << "parse failed\n";
if (first!=last)
std::cout << "remaining unparsed input: '" << std::string(first, last) << "'\n";
}
Prints:
parse success: label='90pv-rksj-ucs2c', token='usecmap'
If you are using C++11, I suggest using a regular expression.
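#include <cstddef>
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
regex re("^/([^\\s]*)\\s([^\\s]*)"); // 1st () captures
// 90pv-RKSJ-UCS2C and 2nd () captures usecmap
smatch sm;
string s="/90pv-RKSJ-UCS2C usecmap";
if (!regex_match(s,sm,re)) { // sm is only meaningful after a successful match
cout<<"no match"<<endl;
return 1;
}
for(size_t i=0;i<sm.size();i++) { // sm[0] is the whole match, sm[1] and sm[2] the captures
cout<<sm[i]<<endl;
}
string label=sm[1],token=sm[2];
}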
Using boost spirit, I'd like to extract a string that is followed by some data in parentheses. The relevant string is separated by a space from the opening parenthesis. Unfortunately, the string itself may contain spaces. I'm looking for a concise solution that returns the string without a trailing space.
The following code illustrates the problem:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <string>
#include <iostream>
namespace qi = boost::spirit::qi;
using std::string;
using std::cout;
using std::endl;
void
test_input(const string &input)
{
string::const_iterator b = input.begin();
string::const_iterator e = input.end();
string parsed;
bool const r = qi::parse(b, e,
*(qi::char_ - qi::char_("(")) >> qi::lit("(Spirit)"),
parsed
);
if(r) {
cout << "PASSED:" << endl;
} else {
cout << "FAILED:" << endl;
}
cout << " Parsed: \"" << parsed << "\"" << endl;
cout << " Rest: \"" << string(b, e) << "\"" << endl;
}
int main()
{
test_input("Fine (Spirit)");
test_input("Hello, World (Spirit)");
return 0;
}
Its output is:
PASSED:
Parsed: "Fine "
Rest: ""
PASSED:
Parsed: "Hello, World "
Rest: ""
With this simple grammar, the extracted string is always followed by a space (that I'd like to eliminate).
The solution should work within Spirit since this is only part of a larger grammar. (Thus, it would probably be clumsy to trim the extracted strings after parsing.)
Thank you in advance.
Like the comment said, in the case of a single space, you can just hard code it. If you need to be more flexible or tolerant:
I'd use a skipper with raw to "cheat" the skipper for your purposes:
bool const r = qi::phrase_parse(b, e,
qi::raw [ *(qi::char_ - qi::char_("(")) ] >> qi::lit("(Spirit)"),
qi::space,
parsed
);
This works, and prints
PASSED:
Parsed: "Fine"
Rest: ""
PASSED:
Parsed: "Hello, World"
Rest: ""
See it Live on Coliru
Full program for reference:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <string>
#include <iostream>
namespace qi = boost::spirit::qi;
using std::string;
using std::cout;
using std::endl;
void
test_input(const string &input)
{
string::const_iterator b = input.begin();
string::const_iterator e = input.end();
string parsed;
bool const r = qi::phrase_parse(b, e,
qi::raw [ *(qi::char_ - qi::char_("(")) ] >> qi::lit("(Spirit)"),
qi::space,
parsed
);
if(r) {
cout << "PASSED:" << endl;
} else {
cout << "FAILED:" << endl;
}
cout << " Parsed: \"" << parsed << "\"" << endl;
cout << " Rest: \"" << string(b, e) << "\"" << endl;
}
int main()
{
test_input("Fine (Spirit)");
test_input("Hello, World (Spirit)");
return 0;
}