I want to parse the header columns of a text file. Column names may be quoted and may use any letter case. Currently I am using the following grammar:
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
template <typename Iterator, typename Skipper>
struct Grammar : qi::grammar<Iterator, void(), Skipper>
{
static constexpr char colsep = '|';
Grammar() : Grammar::base_type(header)
{
using namespace qi;
using ascii::char_;
#define COL(name) (no_case[name] | ('"' >> no_case[name] >> '"'))
header = (COL("columna") | COL("column_a")) >> colsep >>
(COL("columnb") | COL("column_b")) >> colsep >>
(COL("columnc") | COL("column_c")) >> eol >> eoi;
#undef COL
}
qi::rule<Iterator, void(), Skipper> header;
};
int main()
{
const std::string s{"columnA|column_B|column_c\n"};
auto begin(std::begin(s)), end(std::end(s));
Grammar<std::string::const_iterator, qi::blank_type> p;
bool ok = qi::phrase_parse(begin, end, p, qi::blank);
if (ok && begin == end)
std::cout << "Header ok" << std::endl;
else if (ok && begin != end)
std::cout << "Remaining unparsed: '" << std::string(begin, end) << "'" << std::endl;
else
std::cout << "Parse failed" << std::endl;
return 0;
}
Is this possible without the use of a macro? Furthermore, I would like to ignore underscores entirely. Can this be achieved with a custom skipper? Ideally, one could write:
header = col("columna") >> colsep >> col("columnb") >> colsep >> col("columnc") >> eol >> eoi;
where col would be an appropriate grammar or rule.
@sehe how can I fix this grammar to support "\"Column_A\"" as well?
By this time you should probably have realized that there are two different things going on here.
Separate Yo Concerns
On the one hand you have a grammar (that allows |-separated columns like columna or "Column_A").
On the other hand you have semantic analysis (the phase where you check that the parsed contents match certain criteria).
The thing that is making your life hard is trying to conflate the two. Now, don't get me wrong, there could be (very rare) circumstances where fusing those responsibilities together is absolutely required - but I feel that would always be an optimization. If you need that, Spirit is not your thing, and you're much more likely to be better served by a handwritten parser.
Parsing
So let's get brain-dead simple about the grammar:
static auto headers = (quoted|bare) % '|' > (eol|eoi);
The bare and quoted rules can be pretty much the same as before:
static auto quoted = lexeme['"' >> *('\\' >> char_ | "\"\"" >> attr('"') | ~char_('"')) >> '"'];
static auto bare = *(graph - '|');
As you can see, this will implicitly take care of quoting and escaping as well as whitespace skipping outside lexemes. When applied simply, it will result in a clean list of column names:
std::string const s = "\"columnA\"|column_B| column_c \n";
std::vector<std::string> headers;
bool ok = phrase_parse(begin(s), end(s), Grammar::headers, x3::blank, headers);
std::cout << "Parse " << (ok?"ok":"invalid") << std::endl;
if (ok) for(auto& col : headers) {
std::cout << std::quoted(col) << "\n";
}
Prints (Live On Coliru):
Parse ok
"columnA"
"column_B"
"column_c"
INTERMEZZO: Coding Style
Let's structure our code so that the separation of concerns is reflected. Our parsing code might use X3, but our validation code doesn't need to be in the same translation unit (cpp file).
Have a header defining some basic types:
#include <string>
#include <vector>
using Header = std::string;
using Headers = std::vector<Header>;
Define the operations we want to perform on them:
Headers parse_headers(std::string const& input);
bool header_match(Header const& actual, Header const& expected);
bool headers_match(Headers const& actual, Headers const& expected);
Now, main can be rewritten as just:
auto headers = parse_headers("\"columnA\"|column_B| column_c \n");
for(auto& col : headers) {
std::cout << std::quoted(col) << "\n";
}
bool valid = headers_match(headers, {"columna","columnb","columnc"});
std::cout << "Validation " << (valid?"passed":"failed") << "\n";
And e.g. a parse_headers.cpp could contain:
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
namespace Grammar {
using namespace x3;
static auto quoted = lexeme['"' >> *('\\' >> char_ | "\"\"" >> attr('"') | ~char_('"')) >> '"'];
static auto bare = *(graph - '|');
static auto headers = (quoted|bare) % '|' > (eol|eoi);
}
Headers parse_headers(std::string const& input) {
Headers output;
if (phrase_parse(begin(input), end(input), Grammar::headers, x3::blank, output))
return output;
return {}; // or throw, if you prefer
}
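To see what the quoted rule accepts, its three alternatives can be spelled out by hand. The following is only a sketch in plain C++17, not part of the X3 grammar: the function name unquote is hypothetical, and it works on a single already-isolated token rather than on a whole header line.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper mirroring the three alternatives inside `quoted`:
//   '\\' >> char_        a backslash escapes the next character
//   "\"\"" >> attr('"')  a doubled quote stands for one literal quote
//   ~char_('"')          any other character is taken as-is
// Returns std::nullopt when the token is not a well-formed quoted string.
inline std::optional<std::string> unquote(std::string_view in) {
    if (in.size() < 2 || in.front() != '"')
        return std::nullopt;

    std::string out;
    std::size_t i = 1; // skip the opening quote
    while (i < in.size()) {
        char const ch = in[i];
        if (ch == '\\' && i + 1 < in.size()) {
            out.push_back(in[i + 1]); // escaped character
            i += 2;
        } else if (ch == '"' && i + 1 < in.size() && in[i + 1] == '"') {
            out.push_back('"'); // doubled quote
            i += 2;
        } else if (ch == '"') {
            if (i + 1 != in.size())
                return std::nullopt; // text after the closing quote
            return out;
        } else {
            out.push_back(ch);
            ++i;
        }
    }
    return std::nullopt; // closing quote never found
}
```

Calling unquote on the token "\"Column_A\"" from the question yields Column_A, which is what the semantic check later receives.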
Validating
This is what is known as "semantic checks". You take the vector of strings and check them according to your logic:
#include <boost/algorithm/string.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/algorithm/equal.hpp> // boost::equal
#include <cctype>                          // std::isgraph
bool header_match(Header const& actual, Header const& expected) {
using namespace boost::adaptors;
auto significant = [](unsigned char ch) {
return ch != '_' && std::isgraph(ch);
};
return boost::algorithm::iequals(actual | filtered(significant), expected);
}
bool headers_match(Headers const& actual, Headers const& expected) {
return boost::equal(actual, expected, header_match);
}
That's all. All the power of algorithms and modern C++ at your disposal, no need to fight with constraints due to parsing context.
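If Boost.Range is not at hand, the same semantic check can be approximated with the standard library alone. This is a sketch under assumptions, not the code from the demo: it assumes ASCII column names, and the helper name normalized is made up for illustration. Like the Boost version, it filters only the actual header and compares it case-insensitively against the expected one.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

using Header  = std::string;
using Headers = std::vector<Header>;

// Hypothetical helper: drop underscores and non-graphical characters,
// then lower-case what is left (ASCII only).
inline Header normalized(Header const& in) {
    Header out;
    for (unsigned char ch : in) {
        if (ch != '_' && std::isgraph(ch))
            out.push_back(static_cast<char>(std::tolower(ch)));
    }
    return out;
}

// Only `actual` is filtered; `expected` is merely case-folded while comparing.
inline bool header_match(Header const& actual, Header const& expected) {
    Header const lhs = normalized(actual);
    return std::equal(lhs.begin(), lhs.end(), expected.begin(), expected.end(),
                      [](unsigned char a, unsigned char b) {
                          return a == std::tolower(b);
                      });
}

// Element-wise comparison; differing lengths never match.
inline bool headers_match(Headers const& actual, Headers const& expected) {
    return std::equal(actual.begin(), actual.end(),
                      expected.begin(), expected.end(), header_match);
}
```

With this, headers_match({"columnA", "column_B", "column_c"}, {"columna", "columnb", "columnc"}) evaluates to true, while a missing or misspelled column makes it false.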
Full Demo
The above, Live On Wandbox
Both parts got significantly simpler:
your parser doesn't have to deal with quirky comparison logic
your comparison logic doesn't have to deal with grammar concerns (quotes, escapes, delimiters and whitespace)
Related
I am trying to solve an issue with positive and negative values in Boost Spirit.
The parser should use unsigned (positive) numbers 99% of the time.
The program reads a string that defines variables of 1 to 32 bits, which are later read from another stream (given for context; not shown in the example). There is one special case: the variable "D_REF" may be a 16-bit signed number (2's complement).
The program stores all checks as unsigned values in a std::vector, so I need to encode that value as unsigned too. Before storing it in the unsigned int member of the struct, however, I need to cast it to unsigned short.
This need comes from a later step in which a data stream is read, values are extracted from it as unsigned, and the parsed comparisons are applied to them.
I know this request may look weird, but it is a must for a current project, so can anyone help me with this?
Godbolt link: https://godbolt.org/z/8j615Mecx
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
namespace engine
{
struct Check
{
std::string variable;
unsigned int number;
};
using Checks = std::vector<Check>;
}
BOOST_FUSION_ADAPT_STRUCT(engine::Check, variable, number)
namespace engine
{
namespace qi = boost::spirit::qi;
template <typename It>
class Parser : public qi::grammar<It, Checks()>
{
private:
qi::rule<It, Check(), qi::blank_type> equal1, equal2;
qi::rule<It, Checks()> start;
public:
Parser() : Parser::base_type(start)
{
using namespace qi;
//equal1 = as_string["MSG33.D_REF"] >> "==" >> int_[static_cast<unsigned short>(_1)];// This is the idea...
equal1 = as_string["MSG33.D_REF"] >> "==" >> int_;// This may contain negative numbers, but they are only 16 bits long, so they must be cast to "unsigned short" and not to "unsigned int"
equal2 = +(alnum | char_("._")) >> "==" >> uint_;
start = skip(blank)[(equal1 | equal2) % "&&"] > eoi;
}
};
Checks parse(const std::string& str)
{
using It = std::string::const_iterator;
static const Parser<It> parser;
Checks checks;
It first = str.begin(), last = str.end();
if (!qi::parse(first, last, parser, checks))
return {};
return checks;
}
}
int main()
{
auto checks1 = engine::parse("MSG33.ANYTHING == 25");// Normal case. All the checks are done with positive variable values
auto checks2 = engine::parse("MSG33.D_REF == 25");// Special case extended from the normal case. Checks a positive/negative variable against a positive value.
auto checks3 = engine::parse("MSG33.D_REF == -25");// Special case. Checks a negative value. D_REF should be encoded as 16-bit 2's complement, but it is converted to 32-bit unsigned
std::cout << std::hex << "Obtained: " << checks3.front().number << std::endl << "Wished: " << static_cast<unsigned short>(checks3.front().number);// It displays 0xffffffe7, but I need 0xffe7. Possible semantic action to force conversion prior to vector insertion???
}
First: A word of caution
Automatic attribute propagation already does exactly what you need. That's pretty much what you'd expect since it compiles.
Your problem really has nothing to do with the parsing at all. It has to do with how you interpret the correctly parsed negative number, correctly converted to the integer type you chose (unsigned int).
Indeed, if you want to treat an unsigned int value as a short (signed or unsigned) you have to coerce it, or use a bitmask to clear the high bits: c.number & 0xffff.
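To make the arithmetic concrete, here is a small standard-library-only sketch of what happens to -25 under both conversions. The function names as_u32 and low16 are invented for this illustration and do not appear in the question's code.

```cpp
#include <cstdint>

// Converting a negative int to a 32-bit unsigned value wraps modulo 2^32.
constexpr std::uint32_t as_u32(int value) {
    return static_cast<std::uint32_t>(value);
}

// Keeping only the low 16 bits leaves the 16-bit two's complement pattern.
constexpr std::uint32_t low16(std::uint32_t value) {
    return value & 0xffffu;
}

static_assert(as_u32(-25) == 0xffffffe7u, "the value the question observed");
static_assert(low16(as_u32(-25)) == 0xffe7u, "the value the question wanted");
static_assert(static_cast<std::uint16_t>(-25) == 0xffe7u, "a cast yields the same bits");
```

Positive values that fit in 16 bits pass through low16 unchanged, so the mask is harmless for the ordinary case.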
Storing 0xffe7 inside the unsigned int is of course possible. But it is technically just INCORRECT 2's complement encoding. Experience tells me it will lead to error-prone code.
If I were to go for a design like this, I'd choose an integer representation type that is expressly NOT an arithmetic type. Something like
struct Number {
_implementation_defined_ storage;
uint32_t as_uint32() const { return /*some implementation logic on storage*/; }
int16_t as_int16() const { return /*some other implementation logic on storage*/; }
// etc.
};
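One way to fill in that outline is sketched below. The choice of a signed 64-bit storage member and the extra accessor as_uint16_bits are assumptions made here for illustration; the outline above deliberately leaves the representation open.

```cpp
#include <cstdint>

// Sketch only: the parsed value is kept in a wide signed integer (an
// assumption), and every narrower view is an explicit, named conversion.
struct Number {
    std::int64_t storage;

    // 32-bit unsigned view: wraps modulo 2^32 for negative values.
    std::uint32_t as_uint32() const { return static_cast<std::uint32_t>(storage); }

    // 16-bit signed view: exact for values in [-32768, 32767]; before C++20
    // the result for values outside that range is implementation-defined.
    std::int16_t as_int16() const { return static_cast<std::int16_t>(storage); }

    // Hypothetical extra accessor: the raw 16-bit pattern, handy when
    // comparing against bits extracted from a data stream.
    std::uint16_t as_uint16_bits() const { return static_cast<std::uint16_t>(storage); }
};
```

For Number{-25}, as_int16() gives -25, as_uint16_bits() gives 0xffe7 and as_uint32() gives 0xffffffe7, so the caller has to say which interpretation is meant.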
In the land of parsed AST's, I'd prefer
template <typename V>
struct VarCheck {
    std::string name;
    V number;
};
using Check = boost::variant<VarCheck<uint32_t>, VarCheck<int16_t>>;
With that out of the way, let's see some answers to your question:
Using static cast in the semantic action
You can force the issue using Boost Phoenix: Live On Coliru
assign_d_ref %= qi::string("MSG33.D_REF") >> "==" >>
qi::int_[_1 = boost::phoenix::static_cast_<uint16_t>(_1)];
IMO, a slightly better approach¹ is to have a parser that parses uint16_t in the first place: Live On Coliru
qi::int_parser<uint16_t> uint16_;
assign_d_ref = qi::string("MSG33.D_REF") >> "==" >> uint16_;
Other Improvements
I'd also improve the expressiveness some more using e.g.:
qi::symbols<char> s16_vars;
s16_vars += "MSG33.D_REF", "MSG34.D_REF";
assign_s16 = qi::raw[s16_vars] >> "==" >> uint16_;
To generalize for signed 16 bit variables.
qi::rule<It, std::string()> name;
name = +(qi::alnum | qi::char_("._"));
This fixes the missing lexeme[] around the name (by declaring the rule without skipper²).
assign_u32 = name >> "==" >> qi::uint_;
assign = assign_s16 | assign_u32;
start = qi::skip(qi::blank)[assign % "&&" > qi::eoi];
Apart from the readability, it fixes the edge case where blanks are immediately before end-of-input.
See the combined result Live On Coliru
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
namespace engine {
struct Check {
std::string variable;
uint32_t number;
friend std::ostream& operator<<(std::ostream& os, Check const& c) {
auto f = os.flags();
os << "{" << std::quoted(c.variable) << " == " //
<< std::hex << std::showbase << c.number << "}";
os.setf(f);
return os;
}
};
using Checks = std::vector<Check>;
} // namespace engine
BOOST_FUSION_ADAPT_STRUCT(engine::Check, variable, number)
namespace engine {
namespace qi = boost::spirit::qi;
template <typename It> class Parser : public qi::grammar<It, Checks()> {
public:
Parser() : Parser::base_type(start) {
using namespace qi::labels;
s16_vars += "MSG33.D_REF", "MSG34.D_REF";
name = +(qi::alnum | qi::char_("._"));
assign_s16 = qi::raw[s16_vars] >> "==" >> uint16_;
assign_u32 = name >> "==" >> qi::uint_;
assign = assign_s16 | assign_u32;
start = qi::skip(qi::blank)[assign % "&&" > qi::eoi];
BOOST_SPIRIT_DEBUG_NODES((start)(assign)(assign_u32)(assign_s16)(name))
}
private:
qi::int_parser<uint16_t> uint16_;
qi::symbols<char> s16_vars;
qi::rule<It, Check(), qi::blank_type> assign, assign_s16, assign_u32;
qi::rule<It, Checks()> start;
// lexeme:
qi::rule<It, std::string()> name;
};
Checks parse(const std::string& str) {
using It = std::string::const_iterator;
static const Parser<It> parser;
Checks checks;
It first = str.begin(), last = str.end();
if (!qi::parse(first, last, parser, checks))
return {};
return checks;
}
} // namespace engine
int main() {
for (auto sep = ""; auto& c : engine::parse(
"MSG33.ANYTHING == 25 && MSG33.D_REF == 25 && MSG33.D_REF == -25"))
std::cout << std::exchange(sep, " && ") << c;
std::cout << "\n";
}
Printing (like all samples above):
{"MSG33.ANYTHING" == 0x19} && {"MSG33.D_REF" == 0x19} && {"MSG33.D_REF" == 0xffe7}
BONUS: Variant Style
Because you might be interested, here's a version using the variant AST:
Live On Coliru
#include <boost/core/demangle.hpp>
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
namespace engine {
template <typename T>
struct VarCheck {
std::string variable;
T number;
friend std::ostream& operator<<(std::ostream& os, VarCheck const& c) {
auto f = os.flags();
os << " {" << std::quoted(c.variable) << " == " << std::hex
<< std::showbase << c.number << ":"
<< boost::core::demangle(typeid(T).name()) << "}";
os.setf(f);
return os;
}
};
using S16Var = VarCheck<int16_t>;
using U32Var = VarCheck<uint32_t>;
using Check = boost::variant<U32Var, S16Var>;
using Checks = std::vector<Check>;
} // namespace engine
// BOOST_FUSION_ADAPT_STRUCT(engine::S16Var, variable, number)
// BOOST_FUSION_ADAPT_STRUCT(engine::U32Var, variable, number)
// Or, generically: https://www.boost.org/doc/libs/1_80_0/libs/fusion/doc/html/fusion/adapted/adapt_tpl_struct.html
BOOST_FUSION_ADAPT_TPL_STRUCT((T), (engine::VarCheck)(T), variable, number)
namespace engine {
namespace qi = boost::spirit::qi;
template <typename It> class Parser : public qi::grammar<It, Checks()> {
public:
Parser() : Parser::base_type(start) {
using namespace qi::labels;
s16_vars += "MSG33.D_REF", "MSG34.D_REF";
name = +(qi::alnum | qi::char_("._"));
assign_s16 = qi::raw[s16_vars] >> "==" >> uint16_;
assign_u32 = name >> "==" >> qi::uint_;
assign = assign_s16 | assign_u32;
start = qi::skip(qi::blank)[assign % "&&" > qi::eoi];
BOOST_SPIRIT_DEBUG_NODES((start)(assign)(assign_u32)(assign_s16)(name))
}
private:
qi::int_parser<uint16_t> uint16_;
qi::symbols<char> s16_vars;
qi::rule<It, Check(), qi::blank_type> assign;
qi::rule<It, U32Var(), qi::blank_type> assign_u32;
qi::rule<It, S16Var(), qi::blank_type> assign_s16;
qi::rule<It, Checks()> start;
// lexeme:
qi::rule<It, std::string()> name;
};
Checks parse(const std::string& str) {
using It = std::string::const_iterator;
static const Parser<It> parser;
Checks checks;
It first = str.begin(), last = str.end();
if (!qi::parse(first, last, parser, checks))
return {};
return checks;
}
} // namespace engine
int main() {
for (auto sep = "";
auto& c : engine::parse("MSG33.ANYTHING == 25 && MSG33.D_REF == 25 && "
"MSG33.D_REF == -25")) {
std::cout << std::exchange(sep, "\n && ") << c;
}
std::cout << "\n";
}
I've extended the output with the static type information for visibility:
{"MSG33.ANYTHING" == 0x19:unsigned int}
&& {"MSG33.D_REF" == 0x19:short}
&& {"MSG33.D_REF" == 0xffe7:short}
It's easy to generalize for more variable types here:
using S16Var = VarCheck<int16_t>;
using U32Var = VarCheck<uint32_t>;
using DblVar = VarCheck<double>;
using StrVar = VarCheck<std::string>;
using Check = boost::variant<U32Var, S16Var, DblVar, StrVar>;
See it Live On Coliru, with the output
{"MSG33.ANYTHING" == 0x19:unsigned int}
&& {"MSG33.D_REF" == 0x19:short}
&& {"SEHE.DBL_1" == 4.2e+10:double}
&& {"SEHE.DBL_2" == -inf:double}
&& {"SEHE.STR_42" == Life The Universe and everything:std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >}
&& {"SEHE.STR_300" == Three hundred:std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >}
&& {"MSG33.D_REF" == 0xffe7:short}
¹ E.g. Boost Spirit: "Semantic actions are evil"?
² Boost spirit skipper issues
The following Spirit x3 grammar for a simple robot command language generates compiler errors in Windows Visual Studio 17. For this project, I am required to compile with the warning level set to 4 (/W4) and to treat warnings as errors (/WX).
Warning C4127 conditional expression is constant SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\char\detail\cast_char.hpp 29
Error C2039 'insert': is not a member of 'boost::spirit::x3::unused_type' SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\core\detail\parse_into_container.hpp 259
Error C2039 'end': is not a member of 'boost::spirit::x3::unused_type' SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\core\detail\parse_into_container.hpp 259
Error C2039 'empty': is not a member of 'boost::spirit::x3::unused_type' SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\core\detail\parse_into_container.hpp 254
Error C2039 'begin': is not a member of 'boost::spirit::x3::unused_type' SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\core\detail\parse_into_container.hpp 259
Clearly, something is wrong with my grammar, but the error messages are completely unhelpful. I have found that if I remove the Kleene star in the last line of the grammar (*parameter to just parameter) the errors disappear, but then I get lots of warnings like this:
Warning C4459 declaration of 'digit' hides global declaration SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\support\numeric_utils\detail\extract_int.hpp 174
Warning C4127 conditional expression is constant SpiritTest e:\data\boost\boost_1_65_1\boost\spirit\home\x3\char\detail\cast_char.hpp 29
#include <string>
#include <iostream>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
//
// Grammar for simple command language
//
namespace scl
{
using boost::spirit::x3::char_;
using boost::spirit::x3::double_;
using boost::spirit::x3::int_;
using boost::spirit::x3::lexeme;
using boost::spirit::x3::lit;
using boost::spirit::x3::no_case;
auto valid_identifier_chars = char_ ("a-zA-Z_");
auto quoted_string = '"' >> *(lexeme [~char_ ('"')]) >> '"';
auto keyword_value_chars = char_ ("a-zA-Z0-9$_.");
auto qual = lexeme [!(no_case [lit ("no")]) >> +valid_identifier_chars] >> -('=' >> (quoted_string | int_ | double_ | +keyword_value_chars));
auto neg_qual = lexeme [no_case [lit ("no")] >> +valid_identifier_chars];
auto qualifier = lexeme ['/' >> (qual | neg_qual)];
auto verb = +valid_identifier_chars >> *qualifier;
auto parameter = +keyword_value_chars >> *qualifier;
auto command = verb >> *parameter;
}; // End namespace scl
using namespace std; // Must be after Boost stuff!
int
main ()
{
vector <string> input =
{
"show/out=\"somefile.txt\" motors/all cameras/full",
"start/speed=5 motors arm1 arm2/speed=2.5/track arm3",
"rotate camera1/notrack/axis=y/angle=45"
};
//
// Parse each of the strings in the input vector
//
for (string str : input)
{
auto b = str.begin ();
auto e = str.end ();
cout << "Parsing: " << str << endl;
x3::phrase_parse (b, e, scl::command, x3::space);
if (b != e)
{
cout << "Error, only parsed to position: " << b - str.begin () << endl;
}
} // End for
return 0;
} // End main
There is a regression since Boost 1.65 that causes problems with some rules that potentially propagate into container type attributes.
They dispatch to the wrong overload when instantiated without an actual bound attribute. When this happens there is a "mock" attribute type called unused_type. The errors you are seeing indicate that unused_type is being treated as if it were a concrete attribute type, and clearly that won't fly.
The regression was fixed in https://github.com/boostorg/spirit/commit/ee4943d5891bdae0706fb616b908e3bf528e0dfa
You can see that it's a regression by compiling with Boost 1.64:
Boost 1.64 compiles it fine: GCC and Clang
Boost 1.65 breaks it: GCC and Clang again
Now, latest develop is supposed to fix it, but you can simply copy the patched file, even just the 7-line patch.
All of the above was already available when I linked the duplicate question How to make a recursive rule in boost spirit x3 in VS2017, which highlights the same regression
Review
using namespace std; // Must be after Boost stuff!
Actually, it probably needs to be nowhere unless very locally scoped, where you can see the impact of any potential name collisions.
Consider encapsulating the skipper, since it's likely logically part of your grammar spec, not something to be overridden by the caller.
This is a bug:
auto quoted_string = '"' >> *(lexeme[~char_('"')]) >> '"';
You probably meant to assert the whole literal is lexeme, not individual characters (that's... moot because whitespace would never hit the parser anyways, because of the skipper).
auto quoted_string = lexeme['"' >> *~char_('"') >> '"'];
Likewise, you might have intended +keyword_value_chars to be lexeme, because right now one=two three four would parse the "qualifier" one with a "keyword value" of twothreefour, not two¹
x3::space skips embedded newlines; if that's not the intent, use x3::blank
Since PEG grammars are parsed left-to-right greedy, you can order the qualifier production and do without the !(no_case["no"]) lookahead assertion. That not only removes duplication but also makes the grammar simpler and more efficient:
auto qual = lexeme[+valid_identifier_chars] >>
-('=' >> (quoted_string | int_ | double_ | +keyword_value_chars)); // TODO lexeme
auto neg_qual = lexeme[no_case["no"] >> +valid_identifier_chars];
auto qualifier = lexeme['/' >> (neg_qual | qual)];
¹ Note (Post-Scriptum) now that we notice qualifier is, itself, already a lexeme, there's no need to lexeme[] things inside (unless, of course they're reused in contexts with skippers).
However, this also gives rise to the question whether whitespace around the = operator should be accepted (currently, it is not), or whether qualifiers can be separated with whitespace (like id /a /b; currently they can).
Perhaps verb needed some lexeme[] as well (unless you really did want to parse "one two three" as a verb)
If "no" is the prefix for negative qualifiers, then maybe it is simply part of the identifier itself, too? This could simplify the grammar
The ordering of int_ and double_ makes it so that most doubles are mis-parsed as int before they could ever be recognized. Consider something more explicit like x3::real_parser<double, x3::strict_real_policies<double>>{} | int_
If you're parsing quoted constructs, perhaps you want to recognize escapes too ('\"' and '\\' for example):
auto quoted_string = lexeme['"' >> *('\\' >> char_ | ~char_('"')) >> '"'];
If you have a need for "keyword values" consider listing known values in x3::symbols<>. This can also be used to parse directly into an enum type.
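The int_-before-double_ ordering problem from the review can be reproduced without Spirit. The sketch below uses std::from_chars from C++17 as a stand-in for an integer parser; the function name int_prefix_length is hypothetical. It shows that an integer parser tried first happily accepts the leading "2" of "2.5", leaving ".5" behind for whatever comes next in the grammar.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical helper: how many leading characters would an integer
// parser consume from `text`? Zero means it does not match at all.
inline std::size_t int_prefix_length(std::string_view text) {
    int value = 0;
    char const* const first = text.data();
    char const* const last  = first + text.size();
    auto const result = std::from_chars(first, last, value);
    if (result.ec != std::errc{})
        return 0; // no integer at the start of the input
    return static_cast<std::size_t>(result.ptr - first);
}
```

For "2.5" the helper reports one consumed character, which is why an ordered choice has to try the strict real parser (one that insists on a decimal point or exponent) before falling back to int_.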
Here's a version that parses into AST types and prints it back for demonstration purposes:
Live On Coliru
#include <boost/config/warning_disable.hpp>
#include <string>
#include <vector>
#include <boost/variant.hpp>
namespace Ast {
struct Keyword : std::string { // needs to be strong-typed to distinguish from quoted values
using std::string::string;
using std::string::operator=;
};
struct Nil {};
using Value = boost::variant<Nil, std::string, int, double, Keyword>;
struct Qualifier {
enum Kind { positive, negative } kind;
std::string identifier;
Value value;
};
struct Param {
Keyword keyword;
std::vector<Qualifier> qualifiers;
};
struct Command {
std::string verb;
std::vector<Qualifier> qualifiers;
std::vector<Param> params;
};
}
#include <boost/fusion/adapted/struct.hpp>
BOOST_FUSION_ADAPT_STRUCT(Ast::Qualifier, kind, identifier, value)
BOOST_FUSION_ADAPT_STRUCT(Ast::Param, keyword, qualifiers)
BOOST_FUSION_ADAPT_STRUCT(Ast::Command, verb, qualifiers, params)
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
namespace scl {
//
// Grammar for simple command language
//
using x3::char_;
using x3::int_;
using x3::lexeme;
using x3::no_case;
// lexeme tokens
auto keyword = x3::rule<struct _keyword, Ast::Keyword> { "keyword" }
= lexeme [ +char_("a-zA-Z0-9$_.") ];
auto identifier = lexeme [ +char_("a-zA-Z_") ];
auto quoted_string = lexeme['"' >> *('\\' >> x3::char_ | ~x3::char_('"')) >> '"'];
auto value
= quoted_string
| x3::real_parser<double, x3::strict_real_policies<double>>{}
| x3::int_
| keyword;
auto qual
= x3::attr(Ast::Qualifier::positive) >> identifier >> -('=' >> value);
auto neg_qual
= x3::attr(Ast::Qualifier::negative) >> lexeme[no_case["no"] >> identifier] >> x3::attr(Ast::Nil{}); // never a value
auto qualifier
= lexeme['/' >> (neg_qual | qual)];
auto verb
= identifier;
auto parameter = x3::rule<struct _parameter, Ast::Param> {"parameter"}
= keyword >> *qualifier;
auto command = x3::rule<struct _command, Ast::Command> {"command"}
= x3::skip(x3::space) [ verb >> *qualifier >> *parameter ];
} // End namespace scl
// For Demo, Debug: printing the Ast types back
#include <iostream>
#include <iomanip>
namespace Ast {
static inline std::ostream& operator<<(std::ostream& os, Value const& v) {
struct {
std::ostream& _os;
void operator()(std::string const& s) const { _os << std::quoted(s); }
void operator()(int i) const { _os << i; }
void operator()(double d) const { _os << d; }
void operator()(Keyword const& kwv) const { _os << kwv; }
void operator()(Nil) const { }
} vis{os};
boost::apply_visitor(vis, v);
return os;
}
static inline std::ostream& operator<<(std::ostream& os, Qualifier const& q) {
os << "/" << (q.kind==Qualifier::negative?"no":"") << q.identifier;
if (q.value.which())
os << "=" << q.value;
return os;
}
static inline std::ostream& operator<<(std::ostream& os, std::vector<Qualifier> const& qualifiers) {
for (auto& qualifier : qualifiers)
os << qualifier;
return os;
}
static inline std::ostream& operator<<(std::ostream& os, Param const& p) {
return os << p.keyword << p.qualifiers;
}
static inline std::ostream& operator<<(std::ostream& os, Command const& cmd) {
os << cmd.verb << cmd.qualifiers;
for (auto& param : cmd.params) os << " " << param;
return os;
}
}
int main() {
for (std::string const str : {
"show/out=\"somefile.txt\" motors/all cameras/full",
"start/speed=5 motors arm1 arm2/speed=2.5/track arm3",
"rotate camera1/notrack/axis=y/angle=45",
})
{
auto b = str.begin(), e = str.end();
Ast::Command cmd;
bool ok = parse(b, e, scl::command, cmd);
std::cout << (ok?"OK":"FAIL") << '\t' << std::quoted(str) << '\n';
if (ok) {
std::cout << " -- Full AST: " << cmd << "\n";
std::cout << " -- Verb+Qualifiers: " << cmd.verb << cmd.qualifiers << "\n";
for (auto& param : cmd.params)
std::cout << " -- Param+Qualifiers: " << param << "\n";
}
if (b != e) {
std::cout << " -- Remaining unparsed: " << std::quoted(std::string(b,e)) << "\n";
}
}
}
Prints
OK "show/out=\"somefile.txt\" motors/all cameras/full"
-- Full AST: show/out="somefile.txt" motors/all cameras/full
-- Verb+Qualifiers: show/out="somefile.txt"
-- Param+Qualifiers: motors/all
-- Param+Qualifiers: cameras/full
OK "start/speed=5 motors arm1 arm2/speed=2.5/track arm3"
-- Full AST: start/speed=5 motors arm1 arm2/speed=2.5/track arm3
-- Verb+Qualifiers: start/speed=5
-- Param+Qualifiers: motors
-- Param+Qualifiers: arm1
-- Param+Qualifiers: arm2/speed=2.5/track
-- Param+Qualifiers: arm3
OK "rotate camera1/notrack/axis=y/angle=45"
-- Full AST: rotate camera1/notrack/axis=y/angle=45
-- Verb+Qualifiers: rotate
-- Param+Qualifiers: camera1/notrack/axis=y/angle=45
For completeness
Demo also Live On MSVC (Rextester) - note that RexTester uses Boost 1.60
Coliru uses Boost 1.66 but the problem doesn't manifest itself because now, there are concrete attribute values bound to parsers
Think about a preprocessor which will read the raw text (no significant white space or tokens).
There are 3 rules.
resolve_para_entry should resolve a single argument inside a call. The top-level text is returned as a string.
resolve_para should resolve the whole parameter list and put all the top-level parameters in a string list.
resolve is the entry point.
Along the way I track the iterator to get the matching portion of the text.
Samples:
sometext(para) → expect para in the string list
sometext(para1,para2) → expect para1 and para2 in string list
sometext(call(a)) → expect call(a) in the string list
sometext(call(a,b)) ← here it fails; it seems that the "!lit(',')" won't let the parser step outside...
Rules:
resolve_para_entry = +(
(iter_pos >> lit('(') >> (resolve_para_entry | eps) >> lit(')') >> iter_pos) [_val= phoenix::bind(&appendString, _val, _1,_3)]
| (!lit(',') >> !lit(')') >> !lit('(') >> (wide::char_ | wide::space)) [_val = phoenix::bind(&appendChar, _val, _1)]
);
resolve_para = (lit('(') >> lit(')'))[_val = std::vector<std::wstring>()] // empty para -> old style
| (lit('(') >> resolve_para_entry >> *(lit(',') >> resolve_para_entry) > lit(')'))[_val = phoenix::bind(&appendStringList, _val, _1, _2)]
| eps;
;
resolve = (iter_pos >> name_valid >> iter_pos >> resolve_para >> iter_pos);
In the end, it doesn't seem very elegant. Maybe there is a better way to parse such input without a skipper?
Indeed this should be a lot simpler.
First off, I fail to see why the absence of a skipper is at all relevant.
Second, exposing the raw input is best done using qi::raw[] instead of dancing with iter_pos and clumsy semantic actions¹.
Among the other observations I see:
negating a charset is done with ~, so e.g. ~char_(",()")
(p|eps) would be better spelled -p
(lit('(') >> lit(')')) could be just "()" (after all, there's no skipper, right)
p >> *(',' >> p) is equivalent to p % ','
With the above, resolve_para simplifies to this:
resolve_para = '(' >> -(resolve_para_entry % ',') >> ')';
resolve_para_entry seems weird, to me. It appears that any nested parentheses are simply swallowed. Why not actually parse a recursive grammar so you detect syntax errors?
Here's my take on it:
Define An AST
I prefer to make this the first step because it helps me think about the parser productions:
namespace Ast {
using ArgList = std::list<std::string>;
struct Resolve {
std::string name;
ArgList arglist;
};
using Resolves = std::vector<Resolve>;
}
Creating The Grammar Rules
qi::rule<It, Ast::Resolves()> start;
qi::rule<It, Ast::Resolve()> resolve;
qi::rule<It, Ast::ArgList()> arglist;
qi::rule<It, std::string()> arg, identifier;
And their definitions:
identifier = char_("a-zA-Z_") >> *char_("a-zA-Z0-9_");
arg = raw [ +('(' >> -arg >> ')' | +~char_(",)(")) ];
arglist = '(' >> -(arg % ',') >> ')';
resolve = identifier >> arglist;
start = *qr::seek[hold[resolve]];
Notes:
No more semantic actions
No more eps
No more iter_pos
I've opted to make arglist not-optional. If you really wanted that, change it back:
resolve = identifier >> -arglist;
But in our sample it will generate a lot of noisy output.
Of course your entry point (start) will be different. I just did the simplest thing that could possibly work, using another handy parser directive from the Spirit Repository (like iter_pos that you were already using): seek[]
The hold is there for this reason: boost::spirit::qi duplicate parsing on the output - You might not need it in your actual parser.
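For comparison, here is roughly what the arg and arglist rules accomplish, written as a plain loop in C++17. It is a hand-rolled approximation for illustration, not the Spirit approach recommended here: the function name split_top_level_args is hypothetical, it expects only the text between the outer parentheses, and unlike the grammar it does not reject empty arguments such as in "a,,b".

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: split `inside` (the text between the outer
// parentheses) at commas that are not nested in further parentheses.
// Returns std::nullopt when the parentheses are unbalanced.
inline std::optional<std::vector<std::string>>
split_top_level_args(std::string_view inside) {
    std::vector<std::string> args;
    std::string current;
    int depth = 0;

    for (char ch : inside) {
        if (ch == '(') {
            ++depth;
        } else if (ch == ')') {
            if (--depth < 0)
                return std::nullopt; // stray ')'
        }
        if (ch == ',' && depth == 0) {
            args.push_back(current); // top-level separator ends an argument
            current.clear();
        } else {
            current.push_back(ch);   // nested commas stay inside the argument
        }
    }
    if (depth != 0)
        return std::nullopt;         // missing ')'
    if (!inside.empty())
        args.push_back(current);
    return args;
}
```

Given "call(a,b)" this yields the single argument call(a,b), which is the case that failed in the question, while "para1,para2" yields two arguments. The recursive arg rule gets the same nesting behaviour declaratively and also exposes the raw text through raw[].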
Live On Coliru
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/repository/include/qi_seek.hpp>
#include <list>
#include <string>
#include <vector>
namespace Ast {
using ArgList = std::list<std::string>;
struct Resolve {
std::string name;
ArgList arglist;
};
using Resolves = std::vector<Resolve>;
}
BOOST_FUSION_ADAPT_STRUCT(Ast::Resolve, name, arglist)
namespace qi = boost::spirit::qi;
namespace qr = boost::spirit::repository::qi;
template <typename It>
struct Parser : qi::grammar<It, Ast::Resolves()>
{
Parser() : Parser::base_type(start) {
using namespace qi;
identifier = char_("a-zA-Z_") >> *char_("a-zA-Z0-9_");
arg = raw [ +('(' >> -arg >> ')' | +~char_(",)(")) ];
arglist = '(' >> -(arg % ',') >> ')';
resolve = identifier >> arglist;
start = *qr::seek[hold[resolve]];
}
private:
qi::rule<It, Ast::Resolves()> start;
qi::rule<It, Ast::Resolve()> resolve;
qi::rule<It, Ast::ArgList()> arglist;
qi::rule<It, std::string()> arg, identifier;
};
#include <iostream>
int main() {
using It = std::string::const_iterator;
std::string const samples = R"--(
Samples:
sometext(para) → expect para in the string list
sometext(para1,para2) → expect para1 and para2 in string list
sometext(call(a)) → expect call(a) in the string list
sometext(call(a,b)) ← here it fails; it seams that the "!lit(',')" wont make the parser step outside
)--";
It f = samples.begin(), l = samples.end();
Ast::Resolves data;
if (parse(f, l, Parser<It>{}, data)) {
std::cout << "Parsed " << data.size() << " resolves\n";
} else {
std::cout << "Parsing failed\n";
}
for (auto& resolve: data) {
std::cout << " - " << resolve.name << "\n (\n";
for (auto& arg : resolve.arglist) {
std::cout << " " << arg << "\n";
}
std::cout << " )\n";
}
}
Prints
Parsed 6 resolves
- sometext
(
para
)
- sometext
(
para1
para2
)
- sometext
(
call(a)
)
- call
(
a
)
- call
(
a
b
)
- lit
(
'
'
)
More Ideas
That last output shows you a problem with your current grammar: lit(',') should obviously not be seen as a call with two parameters.
I recently did an answer on extracting (nested) function calls with parameters which does things more neatly:
Boost spirit parse rule is not applied
or this one boost spirit reporting semantic error
BONUS
Bonus version that uses string_view and also shows exact line/column information of all extracted words.
Note that it still doesn't require any phoenix or semantic actions. Instead it simply defines the necessary trait to assign to boost::string_view from an iterator range.
Live On Coliru
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/repository/include/qi_seek.hpp>
#include <boost/utility/string_view.hpp>
#include <algorithm>
#include <list>
#include <string>
#include <vector>
namespace Ast {
using Source = boost::string_view;
using ArgList = std::list<Source>;
struct Resolve {
Source name;
ArgList arglist;
};
using Resolves = std::vector<Resolve>;
}
BOOST_FUSION_ADAPT_STRUCT(Ast::Resolve, name, arglist)
namespace boost { namespace spirit { namespace traits {
template <typename It>
struct assign_to_attribute_from_iterators<boost::string_view, It, void> {
static void call(It f, It l, boost::string_view& attr) {
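// NOTE: base() is not part of the standard iterator interface; this relies on the
// standard library implementation (libstdc++ does this) exposing the raw pointer that way.
attr = boost::string_view { f.base(), size_t(std::distance(f.base(),l.base())) };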
}
};
} } }
namespace qi = boost::spirit::qi;
namespace qr = boost::spirit::repository::qi;
template <typename It>
struct Parser : qi::grammar<It, Ast::Resolves()>
{
Parser() : Parser::base_type(start) {
using namespace qi;
identifier = raw [ char_("a-zA-Z_") >> *char_("a-zA-Z0-9_") ];
arg = raw [ +('(' >> -arg >> ')' | +~char_(",)(")) ];
arglist = '(' >> -(arg % ',') >> ')';
resolve = identifier >> arglist;
start = *qr::seek[hold[resolve]];
}
private:
qi::rule<It, Ast::Resolves()> start;
qi::rule<It, Ast::Resolve()> resolve;
qi::rule<It, Ast::ArgList()> arglist;
qi::rule<It, Ast::Source()> arg, identifier;
};
#include <iostream>
struct Annotator {
using Ref = boost::string_view;
struct Manip {
Ref fragment, context;
friend std::ostream& operator<<(std::ostream& os, Manip const& m) {
return os << "[" << m.fragment << " at line:" << m.line() << " col:" << m.column() << "]";
}
size_t line() const {
return 1 + std::count(context.begin(), fragment.begin(), '\n');
}
size_t column() const {
return 1 + (fragment.begin() - start_of_line().begin());
}
Ref start_of_line() const {
return context.substr(context.substr(0, fragment.begin()-context.begin()).find_last_of('\n') + 1);
}
};
Ref context;
Manip operator()(Ref what) const { return {what, context}; }
};
int main() {
using It = std::string::const_iterator;
std::string const samples = R"--(Samples:
sometext(para) → expect para in the string list
sometext(para1,para2) → expect para1 and para2 in string list
sometext(call(a)) → expect call(a) in the string list
sometext(call(a,b)) ← here it fails; it seams that the "!lit(',')" wont make the parser step outside
)--";
It f = samples.begin(), l = samples.end();
Ast::Resolves data;
if (parse(f, l, Parser<It>{}, data)) {
std::cout << "Parsed " << data.size() << " resolves\n";
} else {
std::cout << "Parsing failed\n";
}
Annotator annotate{samples};
for (auto& resolve: data) {
std::cout << " - " << annotate(resolve.name) << "\n (\n";
for (auto& arg : resolve.arglist) {
std::cout << " " << annotate(arg) << "\n";
}
std::cout << " )\n";
}
}
Prints
Parsed 6 resolves
- [sometext at line:3 col:1]
(
[para at line:3 col:10]
)
- [sometext at line:4 col:1]
(
[para1 at line:4 col:10]
[para2 at line:4 col:16]
)
- [sometext at line:5 col:1]
(
[call(a) at line:5 col:10]
)
- [call at line:5 col:34]
(
[a at line:5 col:39]
)
- [call at line:6 col:10]
(
[a at line:6 col:15]
[b at line:6 col:17]
)
- [lit at line:6 col:62]
(
[' at line:6 col:66]
[' at line:6 col:68]
)
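The line and column numbers above follow from nothing more than the fragment's position inside the original buffer. A minimal stand-alone sketch of that computation, assuming the fragment is identified by a byte offset rather than a string_view (LineCol and line_col are names invented for this example):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

struct LineCol {
    std::size_t line;
    std::size_t column;
};

// Hypothetical helper: 1-based line and column of `offset` within `text`.
LineCol line_col(const std::string& text, std::size_t offset) {
    assert(offset <= text.size());
    // line = 1 + number of newlines strictly before the offset
    std::size_t line = 1 + static_cast<std::size_t>(
        std::count(text.begin(), text.begin() + offset, '\n'));
    // start of line = position just after the last newline before the offset
    std::size_t nl = (offset == 0) ? std::string::npos
                                   : text.rfind('\n', offset - 1);
    std::size_t start_of_line = (nl == std::string::npos) ? 0 : nl + 1;
    return LineCol{line, 1 + (offset - start_of_line)};
}
```

This is the same idea as Annotator::Manip::line() and column(), only expressed with offsets, which avoids comparing iterators from different views.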
¹ Boost Spirit: "Semantic actions are evil"?
In my Boost Spirit grammar I would like to have a rule that does this:
rule<...> noCaseLit = no_case[ lit( "KEYWORD" ) ];
but for a custom keyword so that I can do this:
... >> noCaseLit( "SomeSpecialKeyword" ) >> ... >> noCaseLit( "OtherSpecialKeyword1" )
Is this possible with Boost Spirit rules and if so how?
P.S. I use the case insensitive thing as an example, what I'm after is rule parameterization in general.
Edits:
Through the link provided by 'sehe' in the comments I was able to come close to what I wanted but I'm not quite there yet.
/* Defining the noCaseLit rule */
rule<Iterator, string(string)> noCaseLit = no_case[lit(_r1)];
/* Using the noCaseLit rule */
rule<...> someRule = ... >> noCaseLit(phx::val("SomeSpecialKeyword")) >> ...
I haven't yet figured out a way to automatically convert the literal string to the Phoenix value so that I can use the rule like this:
rule<...> someRule = ... >> noCaseLit("SomeSpecialKeyword") >> ...
The easiest way is to simply create a function that returns your rule/parser. In the example near the end of this page you can find a way to declare the return value of your function. (The same here in a commented example).
#include <iostream>
#include <string>
#include <boost/spirit/include/qi.hpp>
namespace ascii = boost::spirit::ascii;
namespace qi = boost::spirit::qi;
typedef boost::proto::result_of::deep_copy<
BOOST_TYPEOF(ascii::no_case[qi::lit(std::string())])
>::type nocaselit_return_type;
nocaselit_return_type nocaselit(const std::string& keyword)
{
return boost::proto::deep_copy(ascii::no_case[qi::lit(keyword)]);
}
//C++11 VERSION EASIER TO MODIFY (AND DOESN'T REQUIRE THE TYPEDEF)
//auto nocaselit(const std::string& keyword) -> decltype(boost::proto::deep_copy(ascii::no_case[qi::lit(keyword)]))
//{
// return boost::proto::deep_copy(ascii::no_case[qi::lit(keyword)]);
//}
int main()
{
std::string test1="MyKeYWoRD";
std::string::const_iterator iter=test1.begin();
std::string::const_iterator end=test1.end();
if(qi::parse(iter,end,nocaselit("mYkEywOrd"))&& (iter==end))
std::cout << "Parse 1 Successful" << std::endl;
else
std::cout << "Parse 1 Failed. Remaining: " << std::string(iter,end) << std::endl;
qi::rule<std::string::const_iterator,ascii::space_type> myrule =
*(
( nocaselit("double") >> ':' >> qi::double_ )
| ( nocaselit("keyword") >> '-' >> *(qi::char_ - '.') >> '.')
);
std::string test2=" DOUBLE : 3.5 KEYWORD-whatever.Double :2.5";
iter=test2.begin();
end=test2.end();
if(qi::phrase_parse(iter,end,myrule,ascii::space)&& (iter==end))
std::cout << "Parse 2 Successful" << std::endl;
else
std::cout << "Parse 2 Failed. Remaining: " << std::string(iter,end) << std::endl;
return 0;
}
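The deep_copy matters because, to my understanding, a Proto expression holds its operands by reference, so returning it without copying would leave references to temporaries that no longer exist. The same "return a self-contained parser from a function" idea can be sketched without Spirit. The names Matcher and nocase_keyword below are invented for this illustration, and the lambda capturing the keyword by value plays the role that deep_copy plays above.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>

using It = std::string::const_iterator;
// A "parser" here is anything that tries to match at `first` and, on
// success, advances it past the matched text.
using Matcher = std::function<bool(It&, It)>;

// Hypothetical factory: returns a matcher for `keyword`, ignoring case.
Matcher nocase_keyword(std::string keyword) {
    // Capture by value so the matcher owns its keyword and stays valid
    // after this function returns.
    return [keyword](It& first, It last) -> bool {
        It cur = first;
        for (char expected : keyword) {
            if (cur == last)
                return false;
            if (std::tolower(static_cast<unsigned char>(*cur)) !=
                std::tolower(static_cast<unsigned char>(expected)))
                return false;
            ++cur;
        }
        first = cur; // commit only on a full match
        return true;
    };
}
```

Usage mirrors the Spirit version: build the matcher once with nocase_keyword("mYkEywOrd") and apply it to an iterator pair. On failure the iterator is left where it was.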
As part of a simple skeleton utility I'm hacking on, I have a grammar for triggering substitutions in text. I thought it a wonderful way to get comfortable with Boost.Spirit, but the template errors are a joy of a unique kind.
Here is the code in its entirety:
#include <iostream>
#include <iterator>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace bsq = boost::spirit::qi;
namespace {
template<typename Iterator>
struct skel_grammar : public bsq::grammar<Iterator> {
skel_grammar();
private:
bsq::rule<Iterator> macro_b;
bsq::rule<Iterator> macro_e;
bsq::rule<Iterator, bsq::ascii::space_type> id;
bsq::rule<Iterator> macro;
bsq::rule<Iterator> text;
bsq::rule<Iterator> start;
};
template<typename Iterator>
skel_grammar<Iterator>::skel_grammar() : skel_grammar::base_type(start)
{
text = bsq::no_skip[+(bsq::char_ - macro_b)[bsq::_val += bsq::_1]];
macro_b = bsq::lit("<<");
macro_e = bsq::lit(">>");
macro %= macro_b >> id >> macro_e;
id %= -(bsq::ascii::alpha | bsq::char_('_'))
>> +(bsq::ascii::alnum | bsq::char_('_'));
start = *(text | macro);
}
} // namespace
int main(int argc, char* argv[])
{
std::string input((std::istreambuf_iterator<char>(std::cin)),
std::istreambuf_iterator<char>());
skel_grammar<std::string::iterator> grammar;
bool r = bsq::parse(input.begin(), input.end(), grammar);
std::cout << std::boolalpha << r << '\n';
return 0;
}
What's wrong with this code?
Mmm. I feel that we have discussed a few more details in chat than have been reflected in the question as it is.
Let me entertain you with my 'toy' implementation, complete with test cases, of a grammar that will recognize <<macros>> like this, including nested expansion of the same.
Notable features:
Expansion is done using a callback (process()), giving you maximum flexibility (you could use a lookup table, cause parsing to fail depending on the macro content, or even have side effects independent of the output)
The parser is optimized to favour streaming mode. Look at spirit::istream_iterator on how to parse input in streaming mode (Stream-based Parsing Made Easy). This has obvious benefits if your input stream is 10 GB and contains only 4 macros: it is the difference between crawling performance (or running out of memory) and just scaling.
Note that the demo still writes to a string buffer (via oss). You could, however, easily hook the output directly to std::cout or, say, a std::ofstream instance
Expansion is done eagerly, so you can have nifty effects using indirect macros. See the testcases
I even demoed a simplistic way to support escaping the << or >> delimiters (#define SUPPORT_ESCAPES)
Without further ado:
The Code
Note: due to laziness, I require -std=c++0x, but only when SUPPORT_ESCAPES is defined
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
namespace qi = boost::spirit::qi;
namespace phx= boost::phoenix;
namespace fsn= boost::fusion;
namespace
{
#define SUPPORT_ESCAPES
static bool process(std::string& macro)
{
if (macro == "error") {
return false; // fail the parse
}
if (macro == "hello") {
macro = "bye";
} else if (macro == "bye") {
macro = "We meet again";
} else if (macro == "sideeffect") {
std::cerr << "this is a side effect while parsing\n";
macro = "(done)";
} else if (std::string::npos != macro.find('~')) {
std::reverse(macro.begin(), macro.end());
macro.erase(std::remove(macro.begin(), macro.end(), '~'), macro.end()); // erase the whole removed tail, not just one element
} else {
macro = std::string("<<") + macro + ">>"; // this makes the unsupported macros appear unchanged
}
return true;
}
template<typename Iterator, typename OutIt>
struct skel_grammar : public qi::grammar<Iterator>
{
struct fastfwd {
template<typename,typename> struct result { typedef bool type; };
template<typename R, typename O>
bool operator()(const R&r,O& o) const
{
#ifndef SUPPORT_ESCAPES
o = std::copy(r.begin(),r.end(),o);
#else
auto f = std::begin(r), l = std::end(r);
while(f!=l)
{
if (('\\'==*f) && (l == ++f))
break;
*o++ = *f++;
}
#endif
return true; // false to fail the parse
}
} copy;
skel_grammar(OutIt& out) : skel_grammar::base_type(start)
{
using namespace qi;
#ifdef SUPPORT_ESCAPES
rawch = ('\\' >> char_) | char_;
#else
# define rawch qi::char_
#endif
macro = ("<<" >> (
(*(rawch - ">>" - "<<") [ _val += _1 ])
% macro [ _val += _1 ] // allow nests
) >>
">>")
[ _pass = phx::bind(process, _val) ];
start =
raw [ +(rawch - "<<") ] [ _pass = phx::bind(copy, _1, phx::ref(out)) ]
% macro [ _pass = phx::bind(copy, _1, phx::ref(out)) ]
;
BOOST_SPIRIT_DEBUG_NODE(start);
BOOST_SPIRIT_DEBUG_NODE(macro);
# undef rawch
}
private:
#ifdef SUPPORT_ESCAPES
qi::rule<Iterator, char()> rawch;
#endif
qi::rule<Iterator, std::string()> macro;
qi::rule<Iterator> start;
};
}
int main(int argc, char* argv[])
{
std::string input =
"Greeting is <<hello>> world!\n"
"Side effects are <<sideeffect>> and <<other>> vars are untouched\n"
"Empty <<>> macros are ok, as are stray '>>' pairs.\n"
"<<nested <<macros>> (<<hello>>?) work>>\n"
"The order of expansion (evaluation) is _eager_: '<<<<hello>>>>' will expand to the same as '<<bye>>'\n"
"Lastly you can do algorithmic stuff too: <<!esrever ~ni <<hello>>>>\n"
#ifdef SUPPORT_ESCAPES // bonus: escapes
"You can escape \\<<hello>> (not expanded to '<<hello>>')\n"
"Demonstrate how it <<avoids <\\<nesting\\>> macros>>.\n"
#endif
;
std::ostringstream oss;
std::ostream_iterator<char> out(oss);
skel_grammar<std::string::iterator, std::ostream_iterator<char> > grammar(out);
std::string::iterator f(input.begin()), l(input.end());
bool r = qi::parse(f, l, grammar);
std::cout << "parse result: " << (r?"success":"failure") << "\n";
if (f!=l)
std::cout << "unparsed remaining: '" << std::string(f,l) << "'\n";
std::cout << "Streamed output:\n\n" << oss.str() << '\n';
return 0;
}
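The escape handling inside fastfwd boils down to a small loop: drop each backslash and copy the character after it verbatim. Here is that loop on its own as a stand-alone sketch; the function name unescape is mine and does not appear in the grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-alone version of the SUPPORT_ESCAPES branch of fastfwd:
// a backslash is dropped and the following character is copied as-is.
std::string unescape(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\\' && ++i == raw.size())
            break; // lone trailing backslash: nothing left to copy
        out += raw[i];
    }
    return out;
}
```

So an escaped delimiter such as \<< comes out as a literal <<, and a doubled backslash comes out as a single one.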
The Test Output
this is a side effect while parsing
parse result: success
Streamed output:
Greeting is bye world!
Side effects are (done) and <<other>> vars are untouched
Empty <<>> macros are ok, as are stray '>>' pairs.
<<nested <<macros>> (bye?) work>>
The order of expansion (evaluation) is _eager_: 'We meet again' will expand to the same as 'We meet again'
Lastly you can do algorithmic stuff too: eyb in reverse!
You can escape <<hello>> (not expanded to 'bye')
Demonstrate how it <<avoids <<nesting>> macros>>.
There is quite a lot of functionality hidden there to grok. I suggest you look at the test cases and the process() callback alongside each other to see what is going on.
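If the Spirit version is hard to follow, the core idea (eager, inside-out expansion through a callback) can be sketched in plain C++. This is a simplified illustration under my own assumptions, not a port: the names expand_macro, expand_all and demo_callback are invented, escapes are not handled, the callback cannot fail the parse, and an unterminated macro is simply emitted verbatim.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

using Callback = std::function<std::string(const std::string&)>;

// Expands one macro body; `pos` points just past the opening "<<".
// Nested macros are expanded first (eagerly), then the callback sees the result.
std::string expand_macro(const std::string& in, std::size_t& pos, const Callback& cb) {
    std::string body;
    while (pos < in.size()) {
        if (in.compare(pos, 2, ">>") == 0) {
            pos += 2;
            return cb(body);
        }
        if (in.compare(pos, 2, "<<") == 0) {
            pos += 2;
            body += expand_macro(in, pos, cb);
        } else {
            body += in[pos++];
        }
    }
    return "<<" + body; // assumption: unterminated macros pass through unchanged
}

// Copies plain text and expands every top-level macro it encounters.
std::string expand_all(const std::string& in, const Callback& cb) {
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in.compare(pos, 2, "<<") == 0) {
            pos += 2;
            out += expand_macro(in, pos, cb);
        } else {
            out += in[pos++]; // stray ">>" pairs are ordinary text here
        }
    }
    return out;
}

// A cut-down stand-in for process(): two known macros, the rest unchanged.
std::string demo_callback(const std::string& name) {
    if (name == "hello") return "bye";
    if (name == "bye") return "We meet again";
    return "<<" + name + ">>";
}
```

With demo_callback, the input '<<<<hello>>>>' first turns the inner macro into bye and then expands <<bye>>, which matches the eager behaviour shown in the test output.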
Cheers & HTH :)