Got stuck porting legacy boost::spirit code - c++
I am porting some legacy code from VS2010 and Boost 1.53 to VS2017 and Boost 1.71.
I have been stuck for the last two hours trying to compile it.
The code is:
#include <string>
#include <vector>
#include <fstream>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
using qi::_1; using qi::_2; using qi::_3; using qi::_4;
enum TYPE { SEND, CHECK, COMMENT };
struct Command
{
    TYPE type;
    std::string id;
    std::string arg1;
    std::string arg2;
    bool checking;
};
class Parser
{
    typedef boost::spirit::istream_iterator It;
    typedef std::vector<Command> Commands;
    struct deferred_fill
    {
        template <typename R, typename S, typename T, typename U> struct result { typedef void type; };//Not really sure still necessary
        typedef void result_type;//Not really sure still necessary
        void operator() (boost::iterator_range<It> const& id, boost::iterator_range<It> const& arg1, bool checking, Command& command) const
        {
            command.type = TYPE::SEND;
            command.id.assign(id.begin(), id.end());
            command.arg1.assign(arg1.begin(), arg1.end());
            command.checking = checking;
        }
    };
private:
    qi::symbols<char, bool> value;
    qi::rule<It> ignore;
    qi::rule<It, Command()> send;
    qi::rule<It, Commands()> start;
    boost::phoenix::function<deferred_fill> fill;
public:
    std::vector<Command> commands;
    Parser()
    {
        using namespace qi;
        using boost::phoenix::push_back;
        value.add("TRUE", true)
                 ("FALSE", false);
        send = ("SEND_CONFIRM" >> *blank >> '(' >> *blank >> raw[*~char_(',')] >> ','
                >> *blank >> raw[*~char_(',')] >> ','
                >> *blank >> value >> *blank >> ')' >> *blank >> ';')[fill(_1, _2, _3, _val)];
        ignore = *~char_("\r\n");
        start = (send[push_back(_val, _1)] | ignore) % eol;
    }
    void parse(const std::string& path)
    {
        std::ifstream in(path, std::ios_base::in);
        if (!in) return;
        in >> std::noskipws;//No white space skipping
        boost::spirit::istream_iterator first(in);
        boost::spirit::istream_iterator last;
        qi::parse(first, last, start, commands);
    }
};
int main(int argc, char* argv[])
{
    Parser parser;
    parser.parse("file.txt");
    return 0;
}
The compiler complains as follows (only the first lines are copied):
1>z:\externos\boost_1_71_0\boost\phoenix\core\detail\function_eval.hpp(116): error C2039: 'type': is not a member of 'boost::result_of<const Parser::deferred_fill (std::vector<Value,std::allocator<char>> &,std::vector<Value,std::allocator<char>> &,boost::iterator_range<Parser::It> &,Command &)>'
1> with
1> [
1> Value=char
1> ]
1>z:\externos\boost_1_71_0\boost\phoenix\core\detail\function_eval.hpp(114): note: see declaration of 'boost::result_of<const Parser::deferred_fill (std::vector<Value,std::allocator<char>> &,std::vector<Value,std::allocator<char>> &,boost::iterator_range<Parser::It> &,Command &)>'
1> with
1> [
1> Value=char
1> ]
1>z:\externos\boost_1_71_0\boost\phoenix\core\detail\function_eval.hpp(89): note: see reference to class template instantiation 'boost::phoenix::detail::function_eval::result_impl<F,void (Head,const boost::phoenix::actor<boost::spirit::argument<1>>&,const boost::phoenix::actor<boost::spirit::argument<2>>&,const boost::phoenix::actor<boost::spirit::attribute<0>>&),const boost::phoenix::vector2<Env,Actions> &>' being compiled
1> with
1> [
1> F=const boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal,boost::proto::argsns_::term<Parser::deferred_fill>,0> &,
1> Head=const boost::phoenix::actor<boost::spirit::argument<0>> &,
1> Env=boost::phoenix::vector4<const boost::phoenix::actor<boost::proto::exprns_::basic_expr<boost::phoenix::detail::tag::function_eval,boost::proto::argsns_::list5<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal,boost::proto::argsns_::term<Parser::deferred_fill>,0>,boost::phoenix::actor<boost::spirit::argument<0>>,boost::phoenix::actor<boost::spirit::argument<1>>,boost::phoenix::actor<boost::spirit::argument<2>>,boost::phoenix::actor<boost::spirit::attribute<0>>>,5>> *,boost::fusion::vector<std::vector<char,std::allocator<char>>,std::vector<char,std::allocator<char>>,boost::iterator_range<Parser::It>,std::vector<char,std::allocator<char>>,boost::iterator_range<Parser::It>,std::vector<char,std::allocator<char>>,bool,std::vector<char,std::allocator<char>>,std::vector<char,std::allocator<char>>> &,boost::spirit::context<boost::fusion::cons<Command &,boost::fusion::nil_>,boost::fusion::vector<>> &,bool &> &,
1> Actions=const boost::phoenix::default_actions &
1> ]
I guess the error is related to the use of boost::spirit::istream_iterator instead of char*, but I can't figure out how to fix it so that it works again.
I have run out of ideas. Can anyone see where my mistake is?
Aw. You're doing awesome things. Sadly/fortunately it's overcomplicated.
So let's first fix, and then simplify.
The Error
It's like you said,
void operator() (boost::iterator_range<It> const& id, boost::iterator_range<It> const& arg1, bool checking, Command& command) const
Doesn't match what actually gets invoked:
void Parser::deferred_fill::operator()(T&& ...) const [with T =
{std::vector<char>&, std::vector<char>&,
boost::iterator_range<boost::spirit::basic_istream_iterator<...> >&,
Command&}]
The reason is NOT the iterator (as you can see, it's boost::spirit::istream_iterator all right).
Rather, it's because you're getting other things as attributes. It turns out *blank exposes its attribute as a vector<char>. You could "fix" that by omit[]-ing those. Instead, let's wrap it in an attribute-less rule like ignore, so we reduce the clutter.
Now the invocation is with
void Parser::deferred_fill::operator()(T&& ...) const [with T = {boost::iterator_range<It>&, boost::iterator_range<It>&, bool&, Command&}]
So it is compatible and compiles. Parsing:
SEND_CONFIRM("this is the id part", "this is arg1", TRUE);
With
Parser parser;
parser.parse("file.txt");
std::cout << std::boolalpha;
for (auto& cmd : parser.commands) {
    std::cout << '{' << cmd.id << ", "
              << cmd.arg1 << ", "
              << cmd.arg2 << ", "
              << cmd.checking << "}\n";
}
Prints
{"this is the id part", "this is arg1", , TRUE}
Let's improve this
This calls for a skipper
This calls for automatic attribute propagation
Some other elements of style
Skippers
Instead of "calling" a skipper explicitly, let's use the built in capability:
rule<It, Attr(), Skipper> x;
defines a rule that skips over input sequences matched by a parser of the Skipper type. You need to actually pass in a skipper of that type, either:
by using qi::phrase_parse instead of qi::parse, or
by using the qi::skip() directive
I always advocate the second approach, because it makes for a friendlier, less error-prone interface.
So declaring the skipper type:
qi::rule<It, Command(), qi::blank_type> send;
We can reduce the rule to:
send = (lit("SEND_CONFIRM") >> '('
>> raw[*~char_(',')] >> ','
>> raw[*~char_(',')] >> ','
>> value >> ')' >> ';')
[fill(_1, _2, _3, _val)];
And then pass a skipper from the start rule:
start = skip(blank) [
(send[push_back(_val, _1)] | ignore) % eol
];
That's all. Still compiles and matches the same.
Skipping with Lexemes
Still on the same topic: lexeme[] inhibits the skipper¹, so you don't need raw[] anymore. This also changes the exposed attributes to vector<char>:
void operator() (std::vector<char> const& id, std::vector<char> const& arg1, bool checking, Command& command) const
Automatic Attribute Propagation
Qi has semantic actions, but its real strength is that they are optional; see Boost Spirit: "Semantic actions are evil"?
push_back(_val, _1) is exactly what automatic attribute propagation already does for *p, +p and p % delim², so just drop it:
start = skip(blank) [
(send | ignore) % eol
];
(note that send|ignore actually synthesizes optional<Command> which is fine for automatic propagation)
std::vector<char> is attribute-compatible with std::string, for example. So if we add a placeholder for arg2, we can match the Command structure layout:
send = lit("SEND_CONFIRM") >> '('
>> attr(SEND) // fill type
>> lexeme[*~char_(',')] >> ','
>> lexeme[*~char_(',')] >> ','
>> attr(std::string()) // fill for arg2
>> value >> ')' >> ';'
;
Now to be able to drop fill and its implementation, we have to adapt Command as a fusion sequence:
BOOST_FUSION_ADAPT_STRUCT(Command, type, id, arg1, arg2, checking)
Elements of Style 1
Putting your Command types in a namespace lets ADL find the operator<< overloads for them, so we can just write std::cout << cmd;
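A minimal sketch of the ADL point, using only the standard library. The trimmed-down Command struct and the to_text helper are assumptions made for illustration; the real struct has more members.

```cpp
#include <ostream>
#include <sstream>
#include <string>

namespace Commands {
    // Reduced, hypothetical version of the Command struct.
    struct Command {
        std::string id;
        bool checking;
    };

    // Lives in the same namespace as Command, so argument-dependent
    // lookup finds it from any call site.
    inline std::ostream& operator<<(std::ostream& os, Command const& cmd) {
        return os << '{' << cmd.id << ", " << (cmd.checking ? "TRUE" : "FALSE") << '}';
    }
}

// Call site outside the namespace: no using-declaration or qualification needed.
std::string to_text(Commands::Command const& cmd) {
    std::ostringstream oss;
    oss << cmd;
    return oss.str();
}
```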
At this point, it all works in a fraction of the code.
Elements of Style 2
If you can, make your parser stateless. That means it can be const, so:
you can reuse it without costly construction
the optimizer has more to work with
it's more testable (stateful things are harder to prove idempotent)
So, instead of having commands as a member, just return them. While we're at it, we can make parse a static function.
Instead of hardcoding the iterator type, it's more flexible to have it as a template argument. That way you're not stuck with the overhead of the multi_pass adaptor and istream_iterator if you have a command in a char[] buffer, string or string_view at some point.
Also, deriving your Parser from qi::grammar with a suitable entry-point means you can use it as a parser expression (actually a non-terminal, just like rule<>) as any other parser.
Consider enabling rule debugging (see example)
Full Code
#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/io.hpp>
#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;
namespace Commands {
    enum TYPE { SEND, CHECK, COMMENT };
    enum BOOL { FALSE, TRUE };
    struct Command {
        TYPE type;
        std::string id;
        std::string arg1;
        std::string arg2;
        BOOL checking;
    };
    typedef std::vector<Command> Commands;
    // for (debug) output
    static inline std::ostream& operator<<(std::ostream& os, TYPE t) {
        switch (t) {
            case SEND: return os << "SEND";
            case CHECK: return os << "CHECK";
            case COMMENT: return os << "COMMENT";
        }
        return os << "(unknown)";
    }
    static inline std::ostream& operator<<(std::ostream& os, BOOL b) {
        return os << (b?"TRUE":"FALSE");
    }
    using boost::fusion::operator<<;
}
BOOST_FUSION_ADAPT_STRUCT(Commands::Command, type, id, arg1, arg2, checking)
namespace Commands {
    template <typename It>
    class Parser : public qi::grammar<It, Commands()> {
      public:
        Commands commands;
        Parser() : Parser::base_type(start) {
            using namespace qi;
            value.add("TRUE", TRUE)
                     ("FALSE", FALSE);
            send = lit("SEND_CONFIRM") >> '('
                >> attr(SEND) // fill type
                >> lexeme[*~char_(',')] >> ','
                >> lexeme[*~char_(',')] >> ','
                >> attr(std::string()) // fill for arg2
                >> value >> ')' >> ';'
                ;
            ignore = +~char_("\r\n");
            start = skip(blank) [
                (send | ignore) % eol
            ];
            BOOST_SPIRIT_DEBUG_NODES((start)(send)(ignore))
        }
      private:
        qi::symbols<char, BOOL> value;
        qi::rule<It> ignore;
        qi::rule<It, Command(), qi::blank_type> send;
        qi::rule<It, Commands()> start;
    };
    static Commands parse(std::istream& in) {
        using It = boost::spirit::istream_iterator;
        static const Parser<It> parser;
        It first(in >> std::noskipws), //No white space skipping
           last;
        Commands commands;
        if (!qi::parse(first, last, parser, commands)) {
            throw std::runtime_error("command parse error");
        }
        return commands; // c++11 move semantics
    }
}
int main() {
    try {
        for (auto& cmd : Commands::parse(std::cin))
            std::cout << cmd << "\n";
    } catch(std::exception const& e) {
        std::cout << e.what() << "\n";
    }
}
Prints
(SEND "this is the id part" "this is arg1" TRUE)
Or indeed with BOOST_SPIRIT_DEBUG defined:
<start>
<try>SEND_CONFIRM("this i</try>
<send>
<try>SEND_CONFIRM("this i</try>
<success>\n</success>
<attributes>[[SEND, [", t, h, i, s, , i, s, , t, h, e, , i, d, , p, a, r, t, "], [", t, h, i, s, , i, s, , a, r, g, 1, "], [], TRUE]]</attributes>
</send>
<send>
<try></try>
<fail/>
</send>
<ignore>
<try></try>
<fail/>
</ignore>
<success>\n</success>
<attributes>[[[SEND, [", t, h, i, s, , i, s, , t, h, e, , i, d, , p, a, r, t, "], [", t, h, i, s, , i, s, , a, r, g, 1, "], [], TRUE]]]</attributes>
</start>
¹ while pre-skipping as you require; see Boost spirit skipper issues
² (and then some, but let's not digress)
Related
How to overcome a Boost Spirit AST snafu
For starters, I have an AST in which I have to do a forward declaration, but apparently this is not exactly kosher in the latest C++ compilers? Overcoming this I can work through the rest of the grammar, I believe. For reference, I am writing the parser more or less faithfully to the Google Protobuf v2 specification. If memory serves, this has something to do with perhaps introducing a type def? And/or Boost Spirit recursive descent, i.e. recursive_wrapper? But it's been a little while, I'm a bit fuzzy on those details. Would someone mind taking a look? But for the forward declaration issue I think the posted code is mostly grammar complete. TBD are Protobuf service, rpc, stream, and, of course, comments. There may be a couple of variant gremlins lurking in there as well I'm not sure what to do with; i.e. how to synthesize a "nil" or empty_statement, for instance, pops up a couple of times throughout the grammatical alternatives.
How does one end up with such a vast body of untested code? I suppose it makes sense to look at a minimized version of this code from scratch and stop at the earliest point it stops working, instead of postponing sanity checks until it's become unmanageable.¹ I'm going to point you at some places where you can see what to do. Recursive using declaration with boost variant C++ Mutually Recursive Variant Type (Again) I have to warn I don't think std::variant or std::optional are supported yet by Qi. I could be wrong. Review And Fixup Round I spent entirely too much time trying to fix the many issues, subtle and not so subtle. I'll be happy to explain a bit, but for now I'm just dropping the result: Live On Coliru #define BOOST_SPIRIT_DEBUG #include <iostream> #include <string> #include <vector> #include <boost/fusion/include/adapt_struct.hpp> #include <boost/spirit/include/qi.hpp> #include <boost/spirit/include/phoenix.hpp> #include <boost/spirit/include/qi_auto.hpp> //#include <boost/container/vector.hpp> namespace AST { using boost::variant; using boost::optional; enum class bool_t { false_, true_ }; enum class syntax_t { proto2 }; using str_t = std::string; struct full_id_t { std::string full_id; }; using int_t = intmax_t; using float_t = double; /// See: http://www.boost.org/doc/libs/1_68_0/libs/spirit/example/qi/compiler_tutorial/calc8/ast.hpp /// Specifically, struct nil {}. struct empty_statement_t {}; // TODO: TBD: we may need/want to dissect this one still further... i.e. to ident, message/enum-name, etc. struct element_type_t : std::string { using std::string::string; using std::string::operator=; }; // TODO: TBD: let's not get too fancy with the inheritance, ... 
// TODO: TBD: however, scanning the other types, we could potentially do more of it, strategically, here and there struct msg_type_t : element_type_t {}; struct enum_type_t : element_type_t {}; struct package_t { std::string full_id; }; using const_t = variant<full_id_t, int_t, float_t, str_t, bool_t>; struct import_modifier_t { std::string val; }; struct import_t { optional<import_modifier_t> mod; std::string target_name; }; struct option_t { std::string name; const_t val; }; using label_t = std::string; using type_t = variant<std::string, msg_type_t, enum_type_t>; // TODO: TBD: could potentially get more meta-dissected based on the specification: struct field_opt_t { std::string name; const_t val; }; struct field_t { label_t label; // this would benefit from being an enum instead type_t type; std::string name; int_t number; std::vector<field_opt_t> opts; }; // TODO: TBD: add extend_t after msg_t ... struct field_t; struct enum_t; struct msg_t; struct extend_t; struct extensions_t; struct group_t; struct option_t; struct oneof_t; struct map_field_t; struct reserved_t; using msg_body_t = std::vector<variant< field_t, enum_t, msg_t, extend_t, extensions_t, group_t, option_t, oneof_t, map_field_t, reserved_t, empty_statement_t >>; struct group_t { label_t label; std::string name; int_t number; msg_body_t body; }; struct oneof_field_t { type_t type; std::string name; int_t number; optional<std::vector<field_opt_t>> opts; }; struct oneof_t { std::string name; std::vector<variant<oneof_field_t, empty_statement_t>> choices; }; struct key_type_t { std::string val; }; struct map_field_t { key_type_t key_type; type_t type; std::string name; int_t number; optional<std::vector<field_opt_t>> opts; }; struct range_t { int_t min; optional<int_t> max; }; struct extensions_t { std::vector<range_t> ranges; }; struct reserved_t { variant<std::vector<range_t>, std::vector<std::string>> val; }; struct enum_val_opt_t { std::string name; const_t val; }; struct enum_field_t { std::string 
name; std::string ordinal; std::vector<enum_val_opt_t> opt; // consistency }; using enum_body_t = std::vector<variant<option_t, enum_field_t, empty_statement_t> >; struct enum_t { std::string name; enum_body_t body; }; struct msg_t { std::string name; // TODO: TBD: here is another case where forward declaration is necessary in terms of the AST definition. msg_body_t body; }; struct extend_t { using content_t = variant<field_t, group_t, empty_statement_t>; // TODO: TBD: actually, this use case may beg the question whether // "message type", et al, in some way deserve a first class definition? msg_type_t msg_type; std::vector<content_t> content; }; struct top_level_def_t { // TODO: TBD: may add svc_t after extend_t ... variant<msg_t, enum_t, extend_t> content; }; struct proto_t { syntax_t syntax; std::vector<variant<import_t, package_t, option_t, top_level_def_t, empty_statement_t>> content; }; template <typename T> static inline std::ostream& operator<<(std::ostream& os, T const&) { std::operator<<(os, "["); std::operator<<(os, typeid(T).name()); std::operator<<(os, "]"); return os; } } BOOST_FUSION_ADAPT_STRUCT(AST::option_t, name, val) BOOST_FUSION_ADAPT_STRUCT(AST::full_id_t, full_id) BOOST_FUSION_ADAPT_STRUCT(AST::package_t, full_id) BOOST_FUSION_ADAPT_STRUCT(AST::import_modifier_t, val) BOOST_FUSION_ADAPT_STRUCT(AST::import_t, mod, target_name) BOOST_FUSION_ADAPT_STRUCT(AST::field_opt_t, name, val) BOOST_FUSION_ADAPT_STRUCT(AST::field_t, label, type, name, number, opts) BOOST_FUSION_ADAPT_STRUCT(AST::group_t, label, name, number, body) BOOST_FUSION_ADAPT_STRUCT(AST::oneof_field_t, type, name, number, opts) BOOST_FUSION_ADAPT_STRUCT(AST::oneof_t, name, choices) BOOST_FUSION_ADAPT_STRUCT(AST::key_type_t, val) BOOST_FUSION_ADAPT_STRUCT(AST::map_field_t, key_type, type, name, number, opts) BOOST_FUSION_ADAPT_STRUCT(AST::range_t, min, max) BOOST_FUSION_ADAPT_STRUCT(AST::extensions_t, ranges) BOOST_FUSION_ADAPT_STRUCT(AST::reserved_t, val) 
BOOST_FUSION_ADAPT_STRUCT(AST::enum_val_opt_t, name, val) BOOST_FUSION_ADAPT_STRUCT(AST::enum_field_t, name, ordinal, opt) BOOST_FUSION_ADAPT_STRUCT(AST::enum_t, name, body) BOOST_FUSION_ADAPT_STRUCT(AST::msg_t, name, body) BOOST_FUSION_ADAPT_STRUCT(AST::extend_t, msg_type, content) BOOST_FUSION_ADAPT_STRUCT(AST::top_level_def_t, content) BOOST_FUSION_ADAPT_STRUCT(AST::proto_t, syntax, content) namespace qi = boost::spirit::qi; template<typename It> struct ProtoGrammar : qi::grammar<It, AST::proto_t()> { using char_rule_type = qi::rule<It, char()>; using string_rule_type = qi::rule<It, std::string()>; using skipper_type = qi::space_type; ProtoGrammar() : ProtoGrammar::base_type(start) { using qi::lit; using qi::digit; using qi::lexeme; // redundant, because no rule declares a skipper using qi::char_; // Identifiers id = lexeme[qi::alpha >> *char_("A-Za-z0-9_")]; full_id = id; msg_name = id; enum_name = id; field_name = id; oneof_name = id; map_name = id; service_name = id; rpc_name = id; stream_name = id; // These distincions aren't very useful until in the semantic analysis // stage. I'd suggest to not conflate that with parsing. 
msg_type = qi::as_string[ -char_('.') >> *(qi::hold[id >> char_('.')]) >> msg_name ]; enum_type = qi::as_string[ -char_('.') >> *(qi::hold[id >> char_('.')]) >> enum_name ]; // group_name = lexeme[qi::upper >> *char_("A-Za-z0-9_")]; // simpler: group_name = &qi::upper >> id; // Integer literals oct_lit = &char_('0') >> qi::uint_parser<AST::int_t, 8>{}; hex_lit = qi::no_case["0x"] >> qi::uint_parser<AST::int_t, 16>{}; dec_lit = qi::uint_parser<AST::int_t, 10>{}; int_lit = lexeme[hex_lit | oct_lit | dec_lit]; // ordering is important // Floating-point literals float_lit = qi::real_parser<double, qi::strict_real_policies<double> >{}; // String literals oct_esc = '\\' >> qi::uint_parser<unsigned char, 8, 3, 3>{}; hex_esc = qi::no_case["\\x"] >> qi::uint_parser<unsigned char, 16, 2, 2>{}; // The last bit in this phrase is literally, "Or Any Characters Not in the Sequence" (fixed) char_val = hex_esc | oct_esc | char_esc | ~char_("\0\n\\"); str_lit = lexeme["'" >> *(char_val - "'") >> "'"] | lexeme['"' >> *(char_val - '"') >> '"'] ; // Empty Statement - likely redundant empty_statement = ';' >> qi::attr(AST::empty_statement_t{}); // Constant const_ = bool_lit | str_lit | float_lit // again, ordering is important | int_lit | full_id ; // keyword helper #define KW(p) (lexeme[(p) >> !(qi::alnum | '_')]) // Syntax syntax = KW("syntax") >> '=' >> lexeme[ lit("'proto2'") | "\"proto2\"" ] >> ';' >> qi::attr(AST::syntax_t::proto2); // Import Statement import_modifier = KW("weak") | KW("public"); import = KW("import") >> -import_modifier >> str_lit >> ';'; // Package package = KW("package") >> full_id >> ';'; // Option opt_name = qi::raw[ (id | '(' >> full_id >> ')') >> *('.' 
>> id) ]; opt = KW("option") >> opt_name >> '=' >> const_ >> ';'; // Fields field_num = int_lit; label = KW("required") | KW("optional") | KW("repeated") ; type = KW(builtin_type) | msg_type | enum_type ; // Normal field field_opt = opt_name >> '=' >> const_; field_opts = -('[' >> field_opt % ',' >> ']'); field = label >> type >> field_name >> '=' >> field_num >> field_opts >> ';'; // Group field group = label >> KW("group") >> group_name >> '=' >> field_num >> msg_body; // Oneof and oneof field oneof_field = type >> field_name >> '=' >> field_num >> field_opts >> ';'; oneof = KW("oneof") >> oneof_name >> '{' >> *( oneof_field // TODO: TBD: ditto how to handle "empty" not synthesizing any attributes ... | empty_statement ) >> '}'; // Map field key_type = KW(builtin_type); // mapField = "map" "<" keyType "," type ">" mapName "=" fieldNumber [ "[" fieldOptions "]" ] ";" map_field = KW("map") >> '<' >> key_type >> ',' >> type >> '>' >> map_name >> '=' >> field_num >> field_opts >> ';'; // Extensions and Reserved, Extensions ... range = int_lit >> -(KW("to") >> (int_lit | KW("max"))); ranges = range % ','; extensions = KW("extensions") >> ranges >> ';'; // Reserved reserved = KW("reserved") >> (ranges | field_names) >> ';'; field_names = field_name % ','; // Enum definition enum_val_opt = opt_name >> '=' >> const_; enum_val_opts = -('[' >> (enum_val_opt % ',') >> ']'); enum_field = id >> '=' >> int_lit >> enum_val_opts >> ';'; enum_body = '{' >> *(opt | enum_field | empty_statement) >> '}'; enum_ = KW("enum") >> enum_name >> enum_body; // Message definition msg = KW("message") >> msg_name >> msg_body; msg_body = '{' >> *( field | enum_ | msg | extend | extensions | group | opt | oneof | map_field | reserved //// TODO: TBD: how to "include" an empty statement ... ? "empty" does not synthesize anything, right? 
| empty_statement ) >> '}'; // Extend extend_content = field | group | empty_statement; extend_contents = '{' >> *extend_content >> '}'; extend = KW("extend") >> msg_type >> extend_contents; top_level_def = msg | enum_ | extend /*| service*/; proto = syntax >> *(import | package | opt | top_level_def | empty_statement); start = qi::skip(qi::space) [ proto ]; BOOST_SPIRIT_DEBUG_NODES( (id) (full_id) (msg_name) (enum_name) (field_name) (oneof_name) (map_name) (service_name) (rpc_name) (stream_name) (group_name) (msg_type) (enum_type) (oct_lit) (hex_lit) (dec_lit) (int_lit) (float_lit) (oct_esc) (hex_esc) (char_val) (str_lit) (empty_statement) (const_) (syntax) (import_modifier) (import) (package) (opt_name) (opt) (field_num) (label) (type) (field_opt) (field_opts) (field) (group) (oneof_field) (oneof) (key_type) (map_field) (range) (ranges) (extensions) (reserved) (field_names) (enum_val_opt) (enum_val_opts) (enum_field) (enum_body) (enum_) (msg) (msg_body) (extend_content) (extend_contents) (extend) (top_level_def) (proto)) } private: struct escapes_t : qi::symbols<char, char> { escapes_t() { this->add ("\\a", '\a') ("\\b", '\b') ("\\f", '\f') ("\\n", '\n') ("\\r", '\r') ("\\t", '\t') ("\\v", '\v') ("\\\\", '\\') ("\\'", '\'') ("\\\"", '"'); } } char_esc; string_rule_type id, full_id, msg_name, enum_name, field_name, oneof_name, map_name, service_name, rpc_name, stream_name, group_name; qi::rule<It, AST::msg_type_t(), skipper_type> msg_type; qi::rule<It, AST::enum_type_t(), skipper_type> enum_type; qi::rule<It, AST::int_t()> int_lit, dec_lit, oct_lit, hex_lit; qi::rule<It, AST::float_t()> float_lit; /// true | false struct bool_lit_t : qi::symbols<char, AST::bool_t> { bool_lit_t() { this->add ("true", AST::bool_t::true_) ("false", AST::bool_t::false_); } } bool_lit; char_rule_type oct_esc, hex_esc, char_val; qi::rule<It, AST::str_t()> str_lit; // TODO: TBD: there are moments when this is a case in a variant or vector<variant> qi::rule<It, AST::empty_statement_t(), 
skipper_type> empty_statement; qi::rule<It, AST::const_t(), skipper_type> const_; /// syntax = {'proto2' | "proto2"} ; qi::rule<It, AST::syntax_t(), skipper_type> syntax; /// import [weak|public] <targetName/> ; qi::rule<It, AST::import_t(), skipper_type> import; qi::rule<It, AST::import_modifier_t(), skipper_type> import_modifier; /// package <fullIdent/> ; qi::rule<It, AST::package_t(), skipper_type> package; /// option <optionName/> = <const/> ; qi::rule<It, AST::option_t(), skipper_type> opt; /// <ident/> | "(" <fullIdent/> ")" ("." <ident/>)* string_rule_type opt_name; qi::rule<It, AST::label_t(), skipper_type> label; qi::rule<It, AST::type_t(), skipper_type> type; struct builtin_type_t : qi::symbols<char, std::string> { builtin_type_t() { this->add ("double", "double") ("float", "float") ("int32", "int32") ("int64", "int64") ("uint32", "uint32") ("uint64", "uint64") ("sint32", "sint32") ("sint64", "sint64") ("fixed32", "fixed32") ("fixed64", "fixed64") ("sfixed32", "sfixed32") ("sfixed64", "sfixed64") ("bool", "bool") ("string", "string") ("bytes", "bytes"); } } builtin_type; qi::rule<It, AST::int_t()> field_num; qi::rule<It, AST::field_opt_t(), skipper_type> field_opt; qi::rule<It, std::vector<AST::field_opt_t>(), skipper_type> field_opts; qi::rule<It, AST::field_t(), skipper_type> field; qi::rule<It, AST::group_t(), skipper_type> group; qi::rule<It, AST::oneof_t(), skipper_type> oneof; qi::rule<It, AST::oneof_field_t(), skipper_type> oneof_field; qi::rule<It, AST::key_type_t(), skipper_type> key_type; qi::rule<It, AST::map_field_t(), skipper_type> map_field; /// <int/> [ to ( <int/> | "max" ) ] qi::rule<It, AST::range_t(), skipper_type> range; qi::rule<It, std::vector<AST::range_t>(), skipper_type> ranges; /// extensions <ranges/> ; qi::rule<It, AST::extensions_t(), skipper_type> extensions; /// reserved <ranges/>|<fieldNames/> ; qi::rule<It, AST::reserved_t(), skipper_type> reserved; qi::rule<It, std::vector<std::string>(), skipper_type> field_names; /// 
<optionName/> = <constant/> qi::rule<It, AST::enum_val_opt_t(), skipper_type> enum_val_opt; qi::rule<It, std::vector<AST::enum_val_opt_t>(), skipper_type> enum_val_opts; /// <ident/> = <int/> [ +<enumValueOption/> ] ; qi::rule<It, AST::enum_field_t(), skipper_type> enum_field; qi::rule<It, AST::enum_body_t(), skipper_type> enum_body; qi::rule<It, AST::enum_t(), skipper_type> enum_; // TODO: TBD: continue here: https://developers.google.com/protocol-buffers/docs/reference/proto2-spec#message_definition /// message <messageName/> <messageBody/> qi::rule<It, AST::msg_t(), skipper_type> msg; /// *{ <field/> | <enum/> | <message/> | <extend/> | <extensions/> | <group/> /// | <option/> | <oneof/> | <mapField/> | <reserved/> | <emptyStatement/> } qi::rule<It, AST::msg_body_t(), skipper_type> msg_body; // TODO: TBD: not sure how appropriate it would be to reach these cases, but we'll see what happens... /// extend <messageType/> *{ <field/> | <group/> | <emptyStatement/> } qi::rule<It, AST::extend_t::content_t(), skipper_type> extend_content; qi::rule<It, std::vector<AST::extend_t::content_t>(), skipper_type> extend_contents; qi::rule<It, AST::extend_t(), skipper_type> extend; // TODO: TBD: ditto comments in the rule definition section. 
// service; rpc; stream; /// topLevelDef = <message/> | <enum/> | <extend/> | <service/> qi::rule<It, AST::top_level_def_t(), skipper_type> top_level_def; /// <syntax/> { <import/> | <package/> | <option/> | <option/> | <emptyStatement/> } qi::rule<It, AST::proto_t(), skipper_type> proto; qi::rule<It, AST::proto_t()> start; }; #include <fstream> int main() { std::ifstream ifs("sample.proto"); std::string const input(std::istreambuf_iterator<char>(ifs), {}); using It = std::string::const_iterator; It f = input.begin(), l = input.end(); ProtoGrammar<It> const g; AST::proto_t parsed; bool ok = qi::parse(f, l, g, parsed); if (ok) { std::cout << "Parse succeeded\n"; } else { std::cout << "Parse failed\n"; } if (f != l) { std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n"; } } Which for a sample input of syntax = "proto2"; import "demo_stuff.proto"; package StackOverflow; message Sample { optional StuffMsg foo_list = 1; optional StuffMsg bar_list = 2; optional StuffMsg qux_list = 3; } message TransportResult { message Sentinel {} oneof Chunk { Sample payload = 1; Sentinel end_of_stream = 2; } } message ShowTime { optional uint32 magic = 1 [ default = 0xBDF69E88 ]; repeated string parameters = 2; optional string version_info = 3; } Prints <proto> <try>syntax = "proto2";\ni</try> <syntax> <try>syntax = "proto2";\ni</try> <success>\nimport "demo_stuff.</success> <attributes>[[N3AST8syntax_tE]]</attributes> </syntax> <import> <try>\nimport "demo_stuff.</try> <import_modifier> <try> "demo_stuff.proto";</try> <fail/> </import_modifier> <str_lit> <try>"demo_stuff.proto";\n</try> [ ... much snipped ... 
] <empty_statement> <try>\n\n</try> <fail/> </empty_statement> <success>\n\n</success> <attributes>[[[N3AST8syntax_tE], [[[empty], [d, e, m, o, _, s, t, u, f, f, ., p, r, o, t, o]], [[S, t, a, c, k, O, v, e, r, f, l, o, w]], [[[S, a, m, p, l, e], [[[], [S, t, u, f, f, M, s, g], [f, o, o, _, l, i, s, t], 1, []], [[], [S, t, u, f, f, M, s, g], [b, a, r, _, l, i, s, t], 2, []], [[], [S, t, u, f, f, M, s, g], [q, u, x, _, l, i, s, t], 3, []]]]], [[[T, r, a, n, s, p, o, r, t, R, e, s, u, l, t], [[[S, e, n, t, i, n, e, l], []], [[C, h, u, n, k], [[[S, a, m, p, l, e], [p, a, y, l, o, a, d], 1, []], [[S, e, n, t, i, n, e, l], [e, n, d, _, o, f, _, s, t, r, e, a, m], 2, []]]]]]], [[[S, h, o, w, T, i, m, e], [[[], [u, i, n, t, 3, 2], [m, a, g, i, c], 1, [[[d, e, f, a, u, l, t], 3187056264]]], [[], [s, t, r, i, n, g], [p, a, r, a, m, e, t, e, r, s], 2, []], [[], [s, t, r, i, n, g], [v, e, r, s, i, o, n, _, i, n, f, o], 3, []]]]]]]]</attributes> </proto> Parse succeeded Remaining unparsed input: ' ' ¹ (Conflating "recursive descent" (a parsing concept) with recursive variants is confusing too). ² Sadly it exceeds the capacity of both Wandbox and Coliru
I'll just summarize a couple of key points I observed digesting. First, wow, some of it I do not think is documented, in Spirit Qi pages, etc, unless you happened to have heard about it through little birds, etc. That to say, thanks so much for the insights! Interesting, transformed things directly to language level whenever possible. For instance, bool_t, deriving directly from std::string, and even syntax_t, to name a few. Did not think one could even do that from a parser/AST perspective, but it makes sense. Very interesting, deriving from std::string. As above, did not know that. struct element_type_t : std::string { using std::string::string; using std::string::operator=; }; In particular, with emphasis on string and operator=, I'm assuming to help the parser rules, attribute propagation, etc. Yes, I wondered about support for std::optional and std::variant, but it would make sense considering Boost.Spirit maturity. Good points re: leveraging boost constructs of the same in lieu of std. Did not know you could define aliases. It would make sense instead of defining a first class struct. For instance, using const_t = variant<full_id_t, int_t, float_t, str_t, bool_t>; Interesting label_t aliasing. Although I may pursue that being a language level enum with corresponding rule attribution. Still, up-vote for so much effort here. using label_t = std::string; Then the forward declaration and alias of the problem area, msg_body_t. INTERESTING I had no idea. Really. struct field_t; struct enum_t; struct msg_t; struct extend_t; struct extensions_t; struct group_t; struct option_t; struct oneof_t; struct map_field_t; struct reserved_t; using msg_body_t = std::vector<variant< field_t, enum_t, msg_t, extend_t, extensions_t, group_t, option_t, oneof_t, map_field_t, reserved_t, empty_statement_t >>; Still, I'm not sure how that avoids the C++ C2079 (VS2017) forward declaration issue? 
I'll have to double check in my project code, but it ran for you, obviously, so something about that must be more kosher than I am thinking. BOOST_FUSION_ADAPT_STRUCT(AST::option_t, name, val) // etc ... And which simplifies the struct adaptation significantly, I imagine. Eventually, yes, I would want a skipper involved. I had not quite gotten that far yet when I stumbled on the forward declaration issue. using skipper_type = qi::space_type; // ... start = qi::skip(qi::space) [ proto ]; // ... qi::rule<It, AST::msg_type_t(), skipper_type> msg_type; For many of the rule definitions, = or %=? The prevailing wisdom I've heard over the years is to prefer %=. Your thoughts? i.e. id = lexeme[qi::alpha >> *char_("A-Za-z0-9_")]; // ^, or: id %= lexeme[qi::alpha >> *char_("A-Za-z0-9_")]; // ^^ ? Makes sense for these to land in an language friendly AST attribution: oct_lit = &char_('0') >> qi::uint_parser<AST::int_t, 8>{}; hex_lit = qi::no_case["0x"] >> qi::uint_parser<AST::int_t, 16>{}; dec_lit = qi::uint_parser<AST::int_t, 10>{}; int_lit = lexeme[hex_lit | oct_lit | dec_lit]; // ordering is important // Yes, I understand why, because 0x... | 0... | dig -> that to say, great point! I did not spend as much time as I should, perhaps, exploring the bits exposed by Qi here, i.e. qi::upper, etc, but it's a great point: group_name = &qi::upper >> id; Did not know this operator was a thing for char_. I do not think it is documented, however, unless you happened to have heard it from little birdies: // Again, great points re: numerical/parser ordering. char_val = hex_esc | oct_esc | char_esc | ~char_("\0\n\\"); // ^ Not sure what you mean here, "likely redundant". However, it is VERY interesting that you can produce the attribution here. I like that a lot. 
    // Empty Statement - likely redundant
    empty_statement = ';' >> qi::attr(AST::empty_statement_t{});
    //                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you mean in terms of whether the semi-colon is redundant, I thought so as well at first. Then I studied the rest of the grammar and, rather than second-guess things, I agree with it: "empty statement" really is an empty statement, at least when you accept it in the context of the grammatical alternatives. Sometimes, however, the semi-colon does indicate what you and I thought it was, that is, "end of statement" or eos, which caused me to raise an eyebrow at first as well.

I also did not know you could attribute things directly during a Qi rule. In fact, the general guidance I'd been considering was to avoid semantic actions. But I suppose these are different animals than qi::attr(...) per se.

Nice approach here. Plus it lends consistency to the rule definitions. Cannot up-vote enough for this one, among others.

    #define KW(p) (lexeme[(p) >> !(qi::alnum | '_')])

Here I am considering language level enumerated values, but it is interesting nonetheless.

    label = KW("required")
          | KW("optional")
          | KW("repeated")
          ;

Here, the long and the short of it is, fewer rules involved. It is a bit messier in terms of all the strings, etc, but I like that it more or less reads one-for-one with the grammar to inform the definition.
// mapField = "map" "<" keyType "," type ">" mapName "=" fieldNumber [ "[" fieldOptions "]" ] ";" map_field = KW("map") >> '<' >> key_type >> ',' >> type >> '>' >> map_name >> '=' >> field_num >> field_opts >> ';'; I wondered about whether Qi symbols might be useful, but I had no idea these bits would be that useful: struct escapes_t : qi::symbols<char, char> { escapes_t() { this->add ("\\a", '\a') ("\\b", '\b') ("\\f", '\f') ("\\n", '\n') ("\\r", '\r') ("\\t", '\t') ("\\v", '\v') ("\\\\", '\\') ("\\'", '\'') ("\\\"", '"'); } } char_esc; Ditto symbols, up-vote: struct builtin_type_t : qi::symbols<char, std::string> { /* ... */ }; In summary, very impressed here. Thank you so much for the insights.
There was a slight oversight in the ranges, I think. Referring to the proto2 Extensions specification, literally we have:

    range = intLit [ "to" ( intLit | "max" ) ]

Then adjusting in the AST:

    enum range_max_t { max };
    struct range_t {
        int_t min;
        boost::optional<boost::variant<int_t, range_max_t>> max;
    };

And last but not least in the grammar:

    range %= int_lit >> -(KW("to") >> (int_lit | KW_ATTR("max", ast::range_max_t::max)));

With the helper:

    #define KW_ATTR(p, a) (qi::lexeme[(p) >> !(qi::alnum | '_')] >> qi::attr(a))

Untested, but my confidence is higher today than it was yesterday that this approach is on the right track. Worst case, if there are any type conflicts between int_t, which is basically defined as long long, and the enumerated type range_max_t, then I could just store the keyword "max" for the same effect. That's a worst case; I'd like to keep it as simple as possible without losing sight of the specification.

Anyway, thanks again for the insights! Up-vote.
I'm not positive I completely understand this aspect, apart from the fact that one builds and the other does not. With extend_t you introduce a using type alias content_t. I "get it", in the sense that this magically "just works". For instance:

    struct extend_t {
        using content_t = boost::variant<field_t, group_t, empty_statement_t>;
        msg_type_t msg_type;
        std::vector<content_t> content;
    };

However, contrasting that with a more traditional template inheritance and type definition, I'm not sure why that does not work. For instance:

    template<typename Content>
    struct has_content {
        typedef Content content_type;
        content_type content;
    };

    // It is noteworthy, would need to identify the std::vector::value_type as well...
    struct extend_t : has_content<std::vector<boost::variant<field_t, group_t, empty_statement_t>>> {
        msg_type_t msg_type;
    };

In which case, I start seeing symptoms of the forward declaration issue in the form of incomplete type errors. I am hesitant to accept the one as "gospel", as it were, without having a better understanding of why it works.
boost spirit - improving error output
This question leads on from its predecessor here: decoding an http header value

The Question:

In my test assertion failure, I am printing out the following contents of error_message:

    Error! Expecting <alternative><media_type_no_parameters><media_type_with_parameters> in header value: "text/html garbage ; charset = \"ISO-8859-5\"" at position: 0

Which is unhelpful...

What is the correct way to get a nice syntax error that says:

    Error! token_pair has invalid syntax here:
        text/html garbage ; charset = "ISO-8859-5"
                  ^ must be eoi or separator of type ;

Background:

HTTP Content-Type in a request has the following form:

    type/subtype *( ; param[=param_value]) <eoi>

Where type and subtype may not be quoted or be separated by spaces, param is not quoted, and param_value is both optional and optionally quoted.

Except between type and subtype, spaces or horizontal tabs may be used as white space. There may also be space before type/subtype.

For now I am ignoring the possibility of HTTP line breaks or comments, as I understand that they are deprecated.

Summary:

There shall be one type, one subtype and zero or more parameters. type and subtype are HTTP tokens, which is to say that they may not contain delimiters ("/\[]<>,; and so on) or spaces.

Thus, the following header is legal:

    text/html ; charset = "ISO-8859-5"

And the following header is illegal:

    text/html garbage ; charset = "ISO-8859-5"
              ^^^^^^^ illegal - must be either ; or <eoi>

The code I am using to parse this (seemingly simple, but actually quite devious) protocol component is below.
Code

My code, adapted from sehe's fantastic answer here (warning: prerequisites are Google Test and Boost):

    //#define BOOST_SPIRIT_DEBUG
    #include <boost/config/warning_disable.hpp>
    #include <gtest/gtest.h>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/fusion/include/adapted.hpp>
    #include <utility>
    #include <vector>
    #include <string>
    #include <iostream>
    #include <iomanip>   // std::quoted
    #include <sstream>   // std::ostringstream

    using token_pair = std::pair<std::string, std::string>;

    struct parameter {
        std::string name;
        std::string value;
        bool has_value;
    };

    struct media_type {
        token_pair type_subtype;
        std::vector<parameter> params;
    };

    BOOST_FUSION_ADAPT_STRUCT(parameter, name, value, has_value)
    BOOST_FUSION_ADAPT_STRUCT(media_type, type_subtype, params)

    namespace qi = boost::spirit::qi;
    namespace phoenix = boost::phoenix;
    using namespace std::literals;

    template<class Iterator>
    struct components
    {
        components()
        {
            using qi::ascii::char_;
            spaces          = char_(" \t");
            token           = +~char_( "()<>@,;:\\\"/[]?={} \t");
            token_pair_rule = token >> '/' >> token;
            quoted_string   = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';
            value           = quoted_string | token;
            name_only       = token >> qi::attr("") >> qi::attr(false);
            nvp             = token >> '=' >> value >> qi::attr(true);
            any_parameter   = ';' >> (nvp | name_only);
            some_parameters = +any_parameter;
            parameters      = *any_parameter;

            qi::on_error<qi::fail>(
                token,
                this->report_error(qi::_1, qi::_2, qi::_3, qi::_4)
            );

            BOOST_SPIRIT_DEBUG_NODES((token)
                                     (quoted_string)
                                     (value)
                                     (name_only)
                                     (nvp)
                                     (any_parameter)
                                     (parameters)
            )
        }

    protected:
        using Skipper = qi::space_type;
        Skipper spaces;
        qi::rule<Iterator, std::string()> quoted_string, token, value;
        qi::rule<Iterator, parameter(), Skipper> nvp, name_only, any_parameter;
        qi::rule<Iterator, std::vector<parameter>(), Skipper> parameters, some_parameters;
        qi::rule<Iterator, token_pair()> token_pair_rule;

    public:
        std::string error_message;

    protected:
        struct ReportError {
            // the result type must be explicit for Phoenix
            template<typename, typename, typename, typename>
            struct result { typedef void type; };

            ReportError(std::string& error_message)
            : error_message(error_message) {}

            // contract the string to the surrounding new-line characters
            template<typename Iter>
            void operator()(Iter first, Iter last, Iter error,
                            const qi::info& what) const
            {
                using namespace std::string_literals;
                std::ostringstream ss;
                ss << "Error! Expecting " << what
                   << " in header value: " << std::quoted(std::string(first, last))
                   << " at position: " << error - first;
                error_message = ss.str();
            }
            std::string& error_message;
        };

        const phoenix::function<ReportError> report_error = ReportError(error_message);
    };

    template<class Iterator>
    struct token_grammar
    : components<Iterator>
    , qi::grammar<Iterator, media_type()>
    {
        token_grammar() : token_grammar::base_type(media_type_rule)
        {
            media_type_with_parameters = token_pair_rule >> qi::skip(spaces)[some_parameters];
            media_type_no_parameters   = token_pair_rule >> qi::attr(std::vector<parameter>()) >> qi::skip(spaces)[qi::eoi];
            media_type_rule            = qi::eps > (qi::hold[media_type_no_parameters]
                                                  | qi::hold[media_type_with_parameters]);

            BOOST_SPIRIT_DEBUG_NODES((media_type_with_parameters)
                                     (media_type_no_parameters)
                                     (media_type_rule))

            qi::on_error<qi::fail>(
                media_type_rule,
                this->report_error(qi::_1, qi::_2, qi::_3, qi::_4)
            );
        }

    private:
        using Skipper = typename token_grammar::components::Skipper;
        using token_grammar::components::spaces;
        using token_grammar::components::token;
        using token_grammar::components::token_pair_rule;
        using token_grammar::components::value;
        using token_grammar::components::any_parameter;
        using token_grammar::components::parameters;
        using token_grammar::components::some_parameters;

    public:
        qi::rule<Iterator, media_type()> media_type_no_parameters, media_type_with_parameters, media_type_rule;
    };

    TEST(spirit_test, test1)
    {
        token_grammar<std::string::const_iterator> grammar{};

        auto test = R"__test(application/json )__test"s;
        auto ct = media_type {};
        bool r = parse(test.cbegin(), test.cend(), grammar, ct);
        EXPECT_EQ("application", ct.type_subtype.first);
        EXPECT_EQ("json", ct.type_subtype.second);
        EXPECT_EQ(0, ct.params.size());

        ct = {};
        test = R"__test(text/html ; charset = "ISO-8859-5")__test"s;
        parse(test.cbegin(), test.cend(), grammar, ct);
        EXPECT_EQ("text", ct.type_subtype.first);
        EXPECT_EQ("html", ct.type_subtype.second);
        ASSERT_EQ(1, ct.params.size());
        EXPECT_TRUE(ct.params[0].has_value);
        EXPECT_EQ("charset", ct.params[0].name);
        EXPECT_EQ("ISO-8859-5", ct.params[0].value);

        auto mt = media_type {};
        parse(test.cbegin(), test.cend(), grammar.media_type_rule, mt);
        EXPECT_EQ("text", mt.type_subtype.first);
        EXPECT_EQ("html", mt.type_subtype.second);
        EXPECT_EQ(1, mt.params.size());

        //
        // Introduce a failure case
        //
        mt = media_type {};
        test = R"__test(text/html garbage ; charset = "ISO-8859-5")__test"s;
        r = parse(test.cbegin(), test.cend(), grammar.media_type_rule, mt);
        EXPECT_FALSE(r);
        EXPECT_EQ("", grammar.error_message);
    }
Boost spirit, returned value from a semantic action interferes with the rule attribute
The following program is an artificial example (reduced from a larger grammar on which I'm working) to exhibit a strange behaviour. The output of the program, run as is, is "hello", which is incorrect. If I remove the (useless in this example) semantic action from the quoted_string rule, the output is the expected "foo=hello".

    #define BOOST_RESULT_OF_USE_DECLTYPE
    #define BOOST_SPIRIT_USE_PHOENIX_V3
    #include <vector>
    #include <string>
    #include <iostream>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include "utils.hpp"

    namespace t {
        using std::vector;
        using std::string;
        namespace qi = boost::spirit::qi;
        namespace phx = boost::phoenix;

        template <typename Iterator, typename Skipper=qi::space_type>
        struct G1 : qi::grammar<Iterator, string(), Skipper> {

            template <typename T>
            using rule = qi::rule<Iterator, T, Skipper>;

            qi::rule<Iterator, string(), qi::locals<char>> quoted_string;
            rule<string()> start;

            G1() : G1::base_type(start, "G1") {
                {
                    using qi::_1;
                    using qi::_a;

                    using attr_signature = vector<char>;
                    auto handler = [](attr_signature const& elements) -> string {
                        string output;
                        for(auto& e : elements) {
                            output += e;
                        }
                        return output;
                    };
                    quoted_string = (qi::omit[qi::char_("'\"")[_a = _1]]
                        >> +(qi::char_ - qi::char_(_a))
                        >> qi::lit(_a))[qi::_val = phx::bind(handler, _1)];
                }
                start = qi::string("foo") >> -(qi::string("=") >> quoted_string);
            }
        };

        string parse(string const input) {
            G1<string::const_iterator> g;
            string result;
            phrase_parse(begin(input), end(input), g, qi::standard::space, result);
            return result;
        }
    };

    int main() {
        using namespace std;
        auto r = t::parse("foo='hello'");
        cout << r << endl;
    }

I can definitely find a workaround, but I'd like to figure out what I am missing.
Like @cv_and_he explained, you're overwriting the attribute with the result of handler(_1). Since attributes are passed by reference, you lose the original value.

Automatic attribute propagation rules know how to concatenate "string" container values, so why don't you just use the default implementation?

    quoted_string %= qi::omit[qi::char_("'\"")[_a = _1]]
        >> +(qi::char_ - qi::char_(_a))
        >> qi::lit(_a);

(Note the %=; this enables automatic propagation even in the presence of semantic actions.)

Alternatively, you can push back from inside the semantic action:

    >> +(qi::char_ - qi::char_(_a)) [ phx::push_back(qi::_val, _1) ]

And, if you really need some processing done in handler, make it take the string by reference:

    auto handler = [](attr_signature const& elements, std::string& attr) {
        for(auto& e : elements) {
            attr += e;
        }
    };

    quoted_string = (qi::omit[qi::char_("'\"")[_a = _1]]
        >> +(qi::char_ - qi::char_(_a))
        >> qi::lit(_a)) [ phx::bind(handler, _1, qi::_val) ];

All these approaches work.

For really heavy-duty things, I have in the past used a custom string type with boost::spirit::traits customization points to do the transformations:

http://www.boost.org/doc/libs/1_55_0/libs/spirit/doc/html/spirit/advanced/customize.html
Handling utf-8 in Boost.Spirit with utf-32 parser
I have an issue similar to How to use boost::spirit to parse UTF-8? and How to match unicode characters with boost::spirit?, but neither of these solves the issue I'm facing.

I have a std::string with UTF-8 characters. I used the u8_to_u32_iterator to wrap the std::string and used unicode terminals like this:

    BOOST_NETWORK_INLINE void parse_headers(std::string const & input, std::vector<request_header_narrow> & container) {
        using namespace boost::spirit::qi;
        u8_to_u32_iterator<std::string::const_iterator> begin(input.begin()), end(input.end());
        std::vector<request_header_narrow_utf8_wrapper> wrapper_container;
        parse(
            begin, end,
            *(
                +(alnum|(punct-':'))
                >> lit(": ")
                >> +((unicode::alnum|space|punct) - '\r' - '\n')
                >> lit("\r\n")
            )
            >> lit("\r\n")
            , wrapper_container
            );
        BOOST_FOREACH(request_header_narrow_utf8_wrapper header_wrapper, wrapper_container)
        {
            request_header_narrow header;
            u32_to_u8_iterator<request_header_narrow_utf8_wrapper::string_type::iterator> name_begin(header_wrapper.name.begin()),
                                                                                          name_end(header_wrapper.name.end()),
                                                                                          value_begin(header_wrapper.value.begin()),
                                                                                          value_end(header_wrapper.value.end());
            for(; name_begin != name_end; ++name_begin)
                header.name += *name_begin;
            for(; value_begin != value_end; ++value_begin)
                header.value += *value_begin;
            container.push_back(header);
        }
    }

The request_header_narrow_utf8_wrapper is defined and mapped to Fusion like this (don't mind the missing namespace declarations):

    struct request_header_narrow_utf8_wrapper
    {
        typedef std::basic_string<boost::uint32_t> string_type;
        std::basic_string<boost::uint32_t> name, value;
    };

    BOOST_FUSION_ADAPT_STRUCT(
        boost::network::http::request_header_narrow_utf8_wrapper,
        (std::basic_string<boost::uint32_t>, name)
        (std::basic_string<boost::uint32_t>, value)
        )

This works fine, but I was wondering: can I somehow make the parser assign directly to a struct containing std::string members, instead of doing the for-each loop with the u32_to_u8_iterator?
I was thinking that one way could be making a wrapper for std::string that would have an assignment operator taking boost::uint32_t, so that the parser could assign directly, but are there other solutions?

EDIT

After reading some more, I ended up with this:

    namespace boost { namespace spirit { namespace traits {

        typedef std::basic_string<uint32_t> u32_string;

        /* template <>
        struct is_string<u32_string> : mpl::true_ {};*/

        template <> // <typename Attrib, typename T, typename Enable>
        struct assign_to_container_from_value<std::string, u32_string, void>
        {
            static void call(u32_string const& val, std::string& attr) {
                u32_to_u8_iterator<u32_string::const_iterator> begin(val.begin()), end(val.end());
                for(; begin != end; ++begin)
                    attr += *begin;
            }
        };

    } // namespace traits
    } // namespace spirit
    } // namespace boost

and this:

    BOOST_NETWORK_INLINE void parse_headers(std::string const & input, std::vector<request_header_narrow> & container) {
        using namespace boost::spirit::qi;
        u8_to_u32_iterator<std::string::const_iterator> begin(input.begin()), end(input.end());
        parse(
            begin, end,
            *(
                as<boost::spirit::traits::u32_string>()[+(alnum|(punct-':'))]
                >> lit(": ")
                >> as<boost::spirit::traits::u32_string>()[+((unicode::alnum|space|punct) - '\r' - '\n')]
                >> lit("\r\n")
            )
            >> lit("\r\n")
            , container
            );
    }

Any comments or advice on whether this is the best I can get?
Another job for a attribute trait. I've simplified your datatypes for demonstration purposes: typedef std::basic_string<uint32_t> u32_string; struct Value { std::string value; }; Now you can have the conversion happen "auto-magically" using: namespace boost { namespace spirit { namespace traits { template <> // <typename Attrib, typename T, typename Enable> struct assign_to_attribute_from_value<Value, u32_string, void> { typedef u32_to_u8_iterator<u32_string::const_iterator> Conv; static void call(u32_string const& val, Value& attr) { attr.value.assign(Conv(val.begin()), Conv(val.end())); } }; }}} Consider a sample parser that parses JSON-style strings in UTF-8, while also allowing Unicode escape sequences of 32-bit codepoints: \uXXXX. It is convenient to have the intermediate storage be a u32_string for this purpose: /////////////////////////////////////////////////////////////// // Parser /////////////////////////////////////////////////////////////// namespace qi = boost::spirit::qi; namespace encoding = qi::standard_wide; //namespace encoding = qi::unicode; template <typename It, typename Skipper = encoding::space_type> struct parser : qi::grammar<It, Value(), Skipper> { parser() : parser::base_type(start) { string = qi::lexeme [ L'"' >> *char_ >> L'"' ]; static qi::uint_parser<uint32_t, 16, 4, 4> _4HEXDIG; char_ = +( ~encoding::char_(L"\"\\")) [ qi::_val += qi::_1 ] | qi::lit(L"\x5C") >> ( // \ (reverse solidus) qi::lit(L"\x22") [ qi::_val += L'"' ] | // " quotation mark U+0022 qi::lit(L"\x5C") [ qi::_val += L'\\' ] | // \ reverse solidus U+005C qi::lit(L"\x2F") [ qi::_val += L'/' ] | // / solidus U+002F qi::lit(L"\x62") [ qi::_val += L'\b' ] | // b backspace U+0008 qi::lit(L"\x66") [ qi::_val += L'\f' ] | // f form feed U+000C qi::lit(L"\x6E") [ qi::_val += L'\n' ] | // n line feed U+000A qi::lit(L"\x72") [ qi::_val += L'\r' ] | // r carriage return U+000D qi::lit(L"\x74") [ qi::_val += L'\t' ] | // t tab U+0009 qi::lit(L"\x75") // uXXXX U+XXXX >> _4HEXDIG [ 
qi::_val += qi::_1 ] ); // entry point start = string; } private: qi::rule<It, Value(), Skipper> start; qi::rule<It, u32_string()> string; qi::rule<It, u32_string()> char_; }; As you can see, the start rule simply assigns the attribute value to the Value struct - which implicitely invokes our assign_to_attribute_from_value trait! A simple test program Live on Coliru to prove that it does work: // input assumed to be utf8 Value parse(std::string const& input) { auto first(begin(input)), last(end(input)); typedef boost::u8_to_u32_iterator<decltype(first)> Conv2Utf32; Conv2Utf32 f(first), saved = f, l(last); static const parser<Conv2Utf32, encoding::space_type> p; Value parsed; if (!qi::phrase_parse(f, l, p, encoding::space, parsed)) { std::cerr << "whoops at position #" << std::distance(saved, f) << "\n"; } return parsed; } #include <iostream> int main() { Value parsed = parse("\"Footnote: ¹ serious busineş\\u1e61\n\""); std::cout << parsed.value; } Now observe that the output is encoded in UTF8 again: $ ./test | tee >(file -) >(xxd) Footnote: ¹ serious busineşṡ /dev/stdin: UTF-8 Unicode text 0000000: 466f 6f74 6e6f 7465 3a20 c2b9 2073 6572 Footnote: .. ser 0000010: 696f 7573 2062 7573 696e 65c5 9fe1 b9a1 ious busine..... 0000020: 0a The U+1E61 code-point has been correctly encoded as [0xE1,0xB9,0xA1].
Parsing the end of a phrase before the beginning in Boost.Spirit
I'm trying to get Boost.Spirit to parse MSVC mangled symbols. These take the form: ?myvolatileStaticMember#myclass##2HC which means "volatile int myclass::myvolatileStaticMember". The "key" to the parse is the double at symbol "##". Prior to the ## is the name of the symbol which consists of a C++ identifier followed by zero or more "#" additional identifiers to fully represent the symbol in its absolute namespace hierarchy. After the ## is the specification of what the identifier is (a variable, a function etc.) Now, I can get Boost.Spirit to parse either the part preceding the ## or the part after the ##. I haven't figured out yet how to get Boost.Spirit to find the ## and feed what comes before that to one custom parser and what comes after it to a different custom parser. Here's my parser for the part preceding the ##: // This grammar is for a MSVC mangled identifier template<typename iterator> struct msvc_name : grammar<iterator, SymbolType(), locals<SymbolType, string>> { SymbolTypeDict &typedict; void name_writer(SymbolType &val, const string &i) const { val.name=i; } void dependent_writer(SymbolType &val, const string &i) const { SymbolTypeDict::const_iterator dt=typedict.find(i); if(dt==typedict.end()) { auto _dt=typedict.emplace(make_pair(i, SymbolType(SymbolTypeQualifier::None, SymbolTypeType::Namespace, i))); dt=_dt.first; } val.dependents.push_back(&dt->second); } // These work by spreading the building of a templated type over multiple calls using local variables _a and _b // We accumulate template parameters into _a and accumulate mangled symbolness into _b void begin_template_dependent_writer(SymbolType &, SymbolType &a, string &b, const string &i) const { a=SymbolType(SymbolTypeQualifier::None, SymbolTypeType::Class, i); b=i; } void add_template_constant_dependent_writer(SymbolType &a, string &b, long long constant) const { string i("_c"+to_string(constant)); SymbolTypeDict::const_iterator dt=typedict.find(i); if(dt==typedict.end()) { auto 
_dt=typedict.emplace(make_pair(i, SymbolType(SymbolTypeQualifier::None, SymbolTypeType::Constant, to_string(constant)))); dt=_dt.first; } a.templ_params.push_back(&dt->second); b.append(i); } void add_template_type_dependent_writer(SymbolType &a, string &b, SymbolTypeType type) const { string i("_t"+to_string(static_cast<int>(type))); SymbolTypeDict::const_iterator dt=typedict.find(i); if(dt==typedict.end()) { auto _dt=typedict.emplace(make_pair(i, SymbolType(SymbolTypeQualifier::None, type))); dt=_dt.first; } a.templ_params.push_back(&dt->second); b.append(i); } void finish_template_dependent_writer(SymbolType &val, SymbolType &a, string &b) const { SymbolTypeDict::const_iterator dt=typedict.find(b); if(dt==typedict.end()) { auto _dt=typedict.emplace(make_pair(b, a)); dt=_dt.first; } val.dependents.push_back(&dt->second); } msvc_name(SymbolTypeDict &_typedict) : msvc_name::base_type(start), typedict(_typedict) { identifier=+(char_ - '#'); identifier.name("identifier"); template_dependent_identifier=+(char_ - '#'); template_dependent_identifier.name("template_dependent_identifier"); dependent_identifier=+(char_ - '#'); dependent_identifier.name("dependent_identifier"); start = identifier [ boost::phoenix::bind(&msvc_name::name_writer, this, _val, _1) ] >> *( lit("##") >> eps | (("#?$" > template_dependent_identifier [ boost::phoenix::bind(&msvc_name::begin_template_dependent_writer, this, _val, _a, _b, _1) ]) > "#" > +(( "$0" > constant [ boost::phoenix::bind(&msvc_name::add_template_constant_dependent_writer, this, _a, _b, _1) ]) | type [ boost::phoenix::bind(&msvc_name::add_template_type_dependent_writer, this, _a, _b, _1) ]) >> eps [ boost::phoenix::bind(&msvc_name::finish_template_dependent_writer, this, _val, _a, _b) ]) | ("#" > dependent_identifier [ boost::phoenix::bind(&msvc_name::dependent_writer, this, _val, _1) ])) ; BOOST_SPIRIT_DEBUG_NODE(start); start.name("msvc_name"); on_error<boost::spirit::qi::fail, iterator>(start, cerr << 
boost::phoenix::val("Parsing error: Expected ") << _4 << boost::phoenix::val(" here: \"") << boost::phoenix::construct<string>(_3, _2) << boost::phoenix::val("\"") << endl); } rule<iterator, SymbolType(), locals<SymbolType, string>> start; rule<iterator, string()> identifier, template_dependent_identifier, dependent_identifier; msvc_type type; msvc_constant<iterator> constant; }; You'll note the "lit("##") >> eps" where I'm trying to get it to stop matching once it sees a ##. Now here is the part which is supposed to match a full mangled symbol: template<typename iterator> struct msvc_symbol : grammar<iterator, SymbolType()> { SymbolTypeDict &typedict; /* The key to Microsoft symbol mangles is the operator '##' which consists of a preamble and a postamble. Immediately following the '##' operator is: Variable: 3<type><storage class> Static member variable: 2<type><storage class> Function: <near|far><calling conv>[<stor ret>] <return type>[<parameter type>...]<term>Z <Y |Z ><A|E|G >[<?A|?B|?C|?D>]<MangledToSymbolTypeType...> <#>Z Member Function: <protection>[<const>]<calling conv>[<stor ret>] <return type>[<parameter type>...]<term>Z <A-V >[<A-D> ]<A|E|G >[<?A|?B|?C|?D>]<MangledToSymbolTypeType...> <#>Z */ msvc_symbol(SymbolTypeDict &_typedict) : msvc_symbol::base_type(start), typedict(_typedict), name(_typedict), variable(_typedict) { start="?" >> name >> ("##" >> variable); BOOST_SPIRIT_DEBUG_NODE(start); on_error<boost::spirit::qi::fail, iterator>(start, cerr << boost::phoenix::val("Parsing error: Expected ") << _4 << boost::phoenix::val(" here: \"") << boost::phoenix::construct<string>(_3, _2) << boost::phoenix::val("\"") << endl); } rule<iterator, SymbolType()> start; msvc_name<iterator> name; msvc_variable<iterator> variable; }; So, it matches the "?" easily enough ;). The problem is that it sends everything after the "?" 
to the msvc_name parser, so instead of the bit from ## onwards going to msvc_variable and the remainder going to msvc_name, msvc_name consumes everything up to and including ##. This isn't intuitive, as one would have thought that the brackets mean to do that thing first. Therefore if I replace: start="?" >> name >> ("##" >> variable); with start="?" >> name >> variable; ... it all works fine. However I'd really prefer not to do it this way. Ideally, I want Boost.Spirit to split at ## cleanly in msvc_symbol and "do the right thing" as it were. I'm thinking I'm probably not thinking recursively enough? Either way I'm stumped. Note: Yes I am aware I can break the string at ## and run two separate parsers. That isn't what I'm asking - rather, I'm asking how to configure Boost.Spirit to parse the end of a phrase before the beginning. Note also: I'm aware a skipper could be used to make the ## whitespace and do the split that way. The problem is that what comes before the ## is very specific, as is what comes after the ##. Therefore it's not really whitespace. Many thanks in advance to anyone who can help. From searching Google and Stackoverflow for questions related to this one, overcoming Boost.Spirit's "left-to-right greediness" is a problem for a lot of people. Niall
It would seem that you can do a number of things:

explicitly disallow the double "##" where you expect "#". See also

http://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/qi/reference/operator/difference.html
http://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/qi/reference/operator/not_predicate.html

tokenize first (use Spirit Lex?)

Here I show you a working example of the first approach:

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/karma.hpp>
    #include <boost/proto/deep_copy.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    template<typename T> T reversed(T c) { return T(c.rbegin(), c.rend()); }

    int main (int argc, char** argv)
    {
        const std::string input("?myvolatileStaticMember#myclass##2HC");

        auto f = begin(input), l = end(input);

        // deep_copy stores the sub-expressions by value; a plain `auto`
        // would keep references to temporaries that are already gone
        auto identifier = boost::proto::deep_copy(+~qi::char_("#"));
        auto delimit    = boost::proto::deep_copy(qi::lit("#") - "##");

        std::vector<std::string> qualifiedName;
        std::string typeId;

        if (qi::parse(f,l,
                    '?' >> identifier % delimit >> "##" >> +qi::char_,
                    qualifiedName,
                    typeId))
        {
            using namespace boost::spirit::karma;
            qualifiedName = reversed(qualifiedName);
            std::cout << "Qualified name: " << format(auto_ % "::" << "\n", qualifiedName);
            std::cout << "Type indication: '" << typeId << "'\n";
        }
    }

Output:

    Qualified name: myclass::myvolatileStaticMember
    Type indication: '2HC'