Parsing a SQL INSERT with Boost Spirit Classic - c++

I'm trying to learn Boost Spirit and as an exercise, I've tried to parse a SQL INSERT statement using Boost Spirit Classic.
This is the string I'm trying to parse:
INSERT INTO example_tab (cola, colb, colc, cold) VALUES (vala, valb, valc, vald);
From this SELECT example I've created this little grammar:
struct microsql_grammar : public grammar<microsql_grammar>
{
template <typename ScannerT>
struct definition
{
definition(microsql_grammar const& self)
{
keywords = "insert", "into", "values";
chlit<> LPAREN('(');
chlit<> RPAREN(')');
chlit<> SEMI(';');
chlit<> COMMA(',');
typedef inhibit_case<strlit<> > token_t;
token_t INSERT = as_lower_d["insert"];
token_t INTO = as_lower_d["into"];
token_t VALUES = as_lower_d["values"];
identifier =
nocase_d
[
lexeme_d
[
(alpha_p >> *(alnum_p | '_'))
]
];
string_literal =
lexeme_d
[
ch_p('\'') >> +( anychar_p - ch_p('\'') )
>> ch_p('\'')
];
program = +(query);
query = insert_into_clause >> SEMI;
insert_into_clause = insert_clause >> into_clause;
insert_clause = INSERT >> INTO >> identifier >> LPAREN >> var_list_clause >> RPAREN;
into_clause = VALUES >> LPAREN >> var_list_clause >> RPAREN;
var_list_clause = list_p( identifier, COMMA );
}
rule<ScannerT> const& start() const { return program; }
symbols<> keywords;
rule<ScannerT> identifier, string_literal, program, query, insert_into_clause, insert_clause,
into_clause, var_list_clause;
};
};
Using a minimal function to test it:
void test_it(const string& example)
{
microsql_grammar g;
if (!parse(example.c_str(), g, space_p).full)
{
// point a - FAIL
throw exception(); // throw by value, not a heap-allocated pointer
}
// point b - OK
}
Unfortunately it always reaches point A and throws the exception. Since I'm new to this, I have no idea where my error lies. I have two questions:
What's the proper way to debug parsing errors when using Boost Spirit?
Why does parsing fail in this example?

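To get visibility into what is failing to parse, assign the result of parse to a parse_info<>, then log or examine its stop field. It marks the position in your input where the parser stopped, that is, just past the last part that matched your grammar. With the const char * overload used in the question, stop is a const char *; with the iterator overload used below, it is a std::string::const_iterator.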
microsql_grammar g;
parse_info<std::string::const_iterator> result = parse(example.begin(), example.end(), g, space_p);
if (!result.full)
{
std::string parsed(example.begin(), result.stop);
std::cout << parsed << std::endl;
// point a - FAIL
}
// point b - OK
Apologies if this doesn't compile, but it should be a starting point.
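To make the stop position easier to read in a log, it can help to print a caret under the point where parsing gave up. The helper below is a minimal sketch using only the standard library. The name describe_stop is hypothetical (it is not part of Boost.Spirit), the input is assumed to be a single line, and the offset is assumed to be computed as result.stop - example.begin().

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Boost.Spirit: returns the input followed
// by a second line with a caret under the position where parsing stopped.
// `stop_offset` is assumed to be `result.stop - example.begin()`.
std::string describe_stop(const std::string& input, std::size_t stop_offset)
{
    if (stop_offset > input.size())
        stop_offset = input.size();   // clamp defensively

    std::string out = input;
    out += '\n';
    out.append(stop_offset, ' ');     // pad up to the stop position
    out += '^';
    return out;
}
```

Printing describe_stop(example, result.stop - example.begin()) inside the failure branch shows at a glance which token the grammar could not get past.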

Related

Update a parser to admit parentheses within quoted strings

I need to update a parser to admit these new features, but I am not able to manage all of them at once:
The commands must admit an indeterminate number of parameters (> 0).
Parameters might be numbers, unquoted strings or quoted strings.
Parameters are separated by commas.
Within quoted strings, it shall be permitted to use opening/closing parentheses.
(It is easier to understand these requirements by looking at the source code example.)
My current code, including checks, is as follows:
Godbolt link: https://godbolt.org/z/5d6o53n9h
#include <boost/fusion/adapted/struct/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
namespace script
{
struct Command
{
enum Type { NONE, WRITE_LOG, INSERT_LABEL, START_PROCESS, END_PROCESS, COMMENT, FAIL };
Type type{ Type::NONE };
std::vector<std::string> args;
};
using Commands = std::vector<Command>;
}//namespace script
BOOST_FUSION_ADAPT_STRUCT(script::Command, type, args)
namespace script
{
namespace qi = boost::spirit::qi;
template <typename It>
class Parser : public qi::grammar<It, Commands()>
{
private:
qi::symbols<char, Command::Type> type;
qi::rule<It, Command(), qi::blank_type> none, command, comment, fail;//By its very nature "fail" must be the last one to be checked
qi::rule<It, Commands()> start;
public:
Parser() : Parser::base_type(start)
{
using namespace qi;//NOTE: "as_string" is necessary in all args due to std::vector<std::string>
auto empty_args = copy(attr(std::vector<std::string>{}));
type.add
("WriteLog", Command::WRITE_LOG)
("InsertLabel", Command::INSERT_LABEL)
("StartProcess", Command::START_PROCESS)
("EndProcess", Command::END_PROCESS);
none = omit[*blank] >> &(eol | eoi)
>> attr(Command::NONE)
>> empty_args;//ignore args
command = type >> '('
>> as_string[lexeme[+~char_("(),\r\n")]] % ',' >> ')';
comment = lit("//")
>> attr(Command::COMMENT)
>> as_string[lexeme[*~char_("\r\n")]];
fail = omit[*~char_("\r\n")]
>> attr(Command::FAIL)
>> empty_args;//ignore args
start = skip(blank)[(none | command | comment | fail) % eol] >> eoi;
}
};
Commands parse(std::istream& in)
{
using It = boost::spirit::istream_iterator;
static const Parser<It> parser;
Commands commands;
It first(in >> std::noskipws), last;//No white space skipping
if (!qi::parse(first, last, parser, commands))
throw std::runtime_error("command parse error");
return commands;
}
}//namespace script
std::stringstream ss{
R"(// just a comment
WriteLog("this is a log")
WriteLog("this is also (in another way) a log")
WriteLog("but this is just a fail)
StartProcess(17, "program.exe", True)
StartProcess(17, "this_is_a_fail.exe, True)
)"};
int main()
{
using namespace script;
try
{
auto commands = script::parse(ss);
std::array args{ 0, 0, 1, 1, -1, 0, 3, -1, 0 };//Fails may have any number of arguments. It doesn't care. Sets as -1 by convenience flag
std::array types{ Command::COMMENT, Command::NONE, Command::WRITE_LOG, Command::WRITE_LOG, Command::FAIL, Command::NONE, Command::START_PROCESS, Command::FAIL, Command::NONE };
std::cout << std::boolalpha << "size correct? " << (commands.size() == 9) << std::endl;
std::cout << "types correct? " << std::equal(commands.begin(), commands.end(), types.begin(), types.end(), [](auto& cmd, auto& type) { return cmd.type == type; }) << std::endl;
std::cout << "arguments correct? " << std::equal(commands.begin(), commands.end(), args.begin(), args.end(), [](auto& cmd, auto arg) { return cmd.args.size() == arg || arg == -1; }) << std::endl;
}
catch (std::exception const& e)
{
std::cout << e.what() << "\n";
}
}
Any help with this will be appreciated.
You say you want to allow parentheses within quoted strings. But you don't even support quoted strings!
So the problem is your argument rule, which doesn't even exist. It would be roughly this part:
argument = +~char_("(),\r\n");
command = type >> '(' >> argument % ',' >> ')';
Where argument might be declared as
qi::rule<It, Argument()> argument;
In fact, rewriting the tests in an organized fashion, here's what we get right now:
Live On Compiler Explorer
static const Commands expected{
{Command::COMMENT, {"just a comment"}},
{Command::NONE, {}},
{Command::WRITE_LOG, {"this is a log"}},
{Command::WRITE_LOG, {"this is also (in another way) a log"}},
{Command::FAIL, {}},
{Command::NONE, {}},
{Command::START_PROCESS, {"17", "program.exe", "True"}},
{Command::FAIL, {}},
{Command::NONE, {}},
};
try {
auto parsed = script::parse(ss);
fmt::print("Parsed all correct? {} -- {} parsed (vs. {} expected)\n",
(parsed == expected), parsed.size(), expected.size());
for (auto i = 0u; i < std::min(expected.size(), parsed.size()); ++i) {
if (expected[i] != parsed[i]) {
fmt::print("index #{} expected {}\n"
" actual: {}\n",
i, expected[i], parsed[i]);
} else {
fmt::print("index #{} CORRECT ({})\n", i, parsed[i]);
}
}
} catch (std::exception const& e) {
fmt::print("Exception: {}\n", e.what());
}
Prints
Parsed all correct? false -- 9 parsed (vs. 9 expected)
index #0 CORRECT (Command(COMMENT, ["just a comment"]))
index #1 CORRECT (Command(NONE, []))
index #2 expected Command(WRITE_LOG, ["this is a log"])
actual: Command(WRITE_LOG, ["\"this is a log\""])
index #3 expected Command(WRITE_LOG, ["this is also (in another way) a log"])
actual: Command(FAIL, [])
index #4 expected Command(FAIL, [])
actual: Command(WRITE_LOG, ["\"but this is just a fail"])
index #5 CORRECT (Command(NONE, []))
index #6 expected Command(START_PROCESS, ["17", "program.exe", "True"])
actual: Command(START_PROCESS, ["17", "\"program.exe\"", "True"])
index #7 expected Command(FAIL, [])
actual: Command(START_PROCESS, ["17", "\"this_is_a_fail.exe", "True"])
index #8 CORRECT (Command(NONE, []))
As you can see, it fails on quoted strings too, by my expectation. That's because the quoting is a language construct. In the AST (parsed results) you do not care about how exactly it was written in code. E.g. "hello\ world\041" might be equivalent to "hello world!" so both should result in the argument value hello world!.
So, let's do as we say:
argument = quoted_string | number | boolean | raw_string;
We can add a few rules:
// notice these are lexemes (no internal skipping):
qi::rule<It, Argument()> argument, quoted_string, number, boolean, raw_string;
And define them:
quoted_string = '"' >> *~char_('"') >> '"';
number = raw[double_];
boolean = raw[bool_];
raw_string = +~char_("(),\r\n");
argument = quoted_string | number | boolean | raw_string;
If you want to allow escaped quotes, use something like this:
quoted_string = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';
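For readers who find the operator soup hard to follow, here is a hand-written sketch of what that escaped-quote rule accepts. It is plain standard C++ rather than Spirit code, the name parse_quoted is made up for illustration, and it only mirrors the rule's behaviour as I read it: a backslash makes the next character literal, anything else up to the closing quote is taken as-is.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Illustration only (not Spirit): hand-written equivalent of
//   '"' >> *('\\' >> char_ | ~char_('"')) >> '"'
// On success returns the unescaped contents and moves `pos` past the closing
// quote. On failure returns std::nullopt and leaves `pos` unchanged.
std::optional<std::string> parse_quoted(const std::string& in, std::size_t& pos)
{
    std::size_t i = pos;
    if (i >= in.size() || in[i] != '"')
        return std::nullopt;               // must start with a quote
    ++i;

    std::string value;
    while (i < in.size()) {
        char c = in[i];
        if (c == '\\' && i + 1 < in.size()) {
            value += in[i + 1];            // escaped character, taken literally
            i += 2;
        } else if (c == '"') {
            pos = i + 1;                   // consume the closing quote
            return value;
        } else {
            value += c;                    // ordinary character
            ++i;
        }
    }
    return std::nullopt;                   // closing quote never found
}
```

Note that the quotes and the backslashes are not part of the returned value, which is exactly the point made above about the AST not caring how the string was written.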
Now, I'd say you probably want Argument to be something like variant<double, std::string, bool>, instead of just std::string.
With only this change, all the problems have practically vanished: Live On Compiler Explorer:
Parsed all correct? false -- 9 parsed (vs. 9 expected)
index #0 CORRECT (Command(COMMENT, ["just a comment"]))
index #1 CORRECT (Command(NONE, []))
index #2 CORRECT (Command(WRITE_LOG, ["this is a log"]))
index #3 CORRECT (Command(WRITE_LOG, ["this is also (in another way) a log"]))
index #4 CORRECT (Command(FAIL, []))
index #5 CORRECT (Command(NONE, []))
index #6 CORRECT (Command(START_PROCESS, ["17", "program.exe", "True"]))
index #7 expected Command(FAIL, [])
actual: Command(START_PROCESS, ["17", "this_is_a_fail.exe, True)\n\"this_is_a_fail.exe", "True"])
index #8 CORRECT (Command(NONE, []))
Now, index #7 looks very funky, but it's actually a well-known phenomenon in Spirit¹. Enabling BOOST_SPIRIT_DEBUG demonstrates it:
<argument>
<try>"this_is_a_fail.exe,</try>
<quoted_string>
<try>"this_is_a_fail.exe,</try>
<fail/>
</quoted_string>
<number>
<try>"this_is_a_fail.exe,</try>
<fail/>
</number>
<boolean>
<try>"this_is_a_fail.exe,</try>
<fail/>
</boolean>
<raw_string>
<try>"this_is_a_fail.exe,</try>
<success>, True)</success>
<attributes>[[t, h, i, s, _, i, s, _, a, _, f, a, i, l, ., e, x, e, ,, , T, r, u, e, ), ", t, h, i, s, _, i, s, _, a, _, f, a, i, l, ., e, x, e]]</attributes>
</raw_string>
<success>, True)</success>
<attributes>[[t, h, i, s, _, i, s, _, a, _, f, a, i, l, ., e, x, e, ,, , T, r, u, e, ), ", t, h, i, s, _, i, s, _, a, _, f, a, i, l, ., e, x, e]]</attributes>
</argument>
So, the string gets accepted as a raw string, even though it started with ". That's easily fixed, but we don't even need to. We could just apply qi::hold to avoid the duplication:
argument = qi::hold[quoted_string] | number | boolean | raw_string;
Result:
actual: Command(START_PROCESS, ["17", "\"this_is_a_fail.exe", "True"])
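If the duplication still looks like magic, the following toy model may help. It is a simplified sketch in plain C++, not Spirit's actual implementation, and all function names are invented for the illustration. Each "parser" appends to a shared attribute while it matches and does not undo that when it fails, which is how a failed branch of an alternative can leave residue behind. The hold variant works on a copy and only commits it on success.

```cpp
#include <cstddef>
#include <string>

// Toy model, not Spirit code. Appends the quoted contents to `attr`.
// Returns false (without cleaning up `attr`) if the closing quote is missing.
bool quoted(const std::string& in, std::size_t& pos, std::string& attr)
{
    std::size_t i = pos;
    if (i >= in.size() || in[i] != '"') return false;
    for (++i; i < in.size(); ++i) {
        if (in[i] == '"') { pos = i + 1; return true; }
        attr += in[i];                    // partial output survives a failure
    }
    return false;
}

// Toy model of +~char_("(),\r\n"): at least one character outside the set.
bool raw(const std::string& in, std::size_t& pos, std::string& attr)
{
    static const std::string stop = "(),\r\n";
    std::size_t i = pos;
    while (i < in.size() && stop.find(in[i]) == std::string::npos)
        attr += in[i++];
    if (i == pos) return false;
    pos = i;
    return true;
}

// quoted | raw : both branches write into the same attribute.
bool alt_plain(const std::string& in, std::size_t& pos, std::string& attr)
{
    return quoted(in, pos, attr) || raw(in, pos, attr);
}

// hold[quoted] | raw : the first branch writes into a copy that is swapped
// in only when the branch succeeds.
bool alt_hold(const std::string& in, std::size_t& pos, std::string& attr)
{
    std::string copy = attr;
    if (quoted(in, pos, copy)) { attr.swap(copy); return true; }
    return raw(in, pos, attr);
}
```

Feeding both variants an unterminated quoted argument shows the difference: alt_plain ends up with the leftovers of the failed branch followed by the raw match, while alt_hold yields only the raw match.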
However, if you expect it to fail, fix that other problem:
raw_string = +~char_("\"(),\r\n"); // note the \"
Note: In the off-chance you really only require it to not start with
a quote:
raw_string = !lit('"') >> +~char_("(),\r\n");
I guess by now you see the problem with a "loose rule" like that, so I
don't recommend it.
You could express the requirement another way though, saying "if an
argument starts with '"' then it MUST be a quoted_string". Use
an expectation point there:
quoted_string = '"' > *('\\' >> char_ | ~char_('"')) > '"';
This has the effect that failure to parse a complete quoted_string
will throw an expectation_failed exception.
Summary / Listing
This is what we end up with:
Live On Compiler Explorer
//#define BOOST_SPIRIT_DEBUG
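#include <boost/fusion/adapted/struct/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <fmt/ranges.h>
#include <algorithm> // std::min
#include <array>
#include <compare>   // defaulted operator<=>
#include <sstream>
#include <string>
#include <vector>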
namespace script {
using Argument = std::string;
using Arguments = std::vector<Argument>;
struct Command {
enum Type {
NONE,
WRITE_LOG,
INSERT_LABEL,
START_PROCESS,
END_PROCESS,
COMMENT,
FAIL
};
Type type{Type::NONE};
Arguments args;
auto operator<=>(Command const&) const = default;
};
using Commands = std::vector<Command>;
} // namespace script
BOOST_FUSION_ADAPT_STRUCT(script::Command, type, args)
namespace script {
namespace qi = boost::spirit::qi;
template <typename It> class Parser : public qi::grammar<It, Commands()> {
public:
Parser() : Parser::base_type(start) {
using namespace qi; // NOTE: "as_string" is necessary in all args
auto empty_args = copy(attr(Arguments{}));
type.add //
("WriteLog", Command::WRITE_LOG) //
("InsertLabel", Command::INSERT_LABEL) //
("StartProcess", Command::START_PROCESS) //
("EndProcess", Command::END_PROCESS); //
none = omit[*blank] >> &(eol | eoi) //
>> attr(Command{Command::NONE, {}});
quoted_string = '"' >> *('\\' >> char_ | ~char_('"')) >> '"';
number = raw[double_];
boolean = raw[bool_];
raw_string = +~char_("\"(),\r\n");
argument = qi::hold[quoted_string] | number | boolean | raw_string;
command = type >> '(' >> argument % ',' >> ')';
comment = "//" //
>> attr(Command::COMMENT) //
>> as_string[lexeme[*~char_("\r\n")]]; //
fail = omit[*~char_("\r\n")] >> attr(Command{Command::FAIL, {}});
line = none | command | comment | fail; // keep fail last
start = skip(blank)[line % eol] >> eoi;
BOOST_SPIRIT_DEBUG_NODES((start)(line)(fail)(comment)(command)(
argument)(none)(quoted_string)(raw_string)(boolean)(number))
}
private:
qi::symbols<char, Command::Type> type;
qi::rule<It, Command(), qi::blank_type> line, none, command, comment, fail;
// notice these are lexemes (no internal skipping):
qi::rule<It, Argument()> argument, quoted_string, number, boolean, raw_string;
qi::rule<It, Commands()> start;
};
Commands parse(std::istream& in)
{
using It = boost::spirit::istream_iterator;
static const Parser<It> parser;
Commands commands;
return qi::parse(It{in >> std::noskipws}, {}, parser, commands)
? commands
: throw std::runtime_error("command parse error");
}
struct Formatter {
static constexpr auto name(script::Command::Type type) {
return std::array{"NONE", "WRITE_LOG", "INSERT_LABEL",
"START_PROCESS", "END_PROCESS", "COMMENT",
"FAIL"}
.at(static_cast<int>(type));
}
auto parse(auto& ctx) const { return ctx.begin(); }
auto format(script::Command const& cmd, auto& ctx) const {
return format_to(ctx.out(), "Command({}, {})", name(cmd.type), cmd.args);
}
};
} // namespace script
template <> struct fmt::formatter<script::Command> : script::Formatter {};
std::stringstream ss{
R"(// just a comment
WriteLog("this is a log")
WriteLog("this is also (in another way) a log")
WriteLog("but this is just a fail)
StartProcess(17, "program.exe", True)
StartProcess(17, "this_is_a_fail.exe, True)
)"};
int main() {
using namespace script;
static const Commands expected{
{Command::COMMENT, {"just a comment"}},
{Command::NONE, {}},
{Command::WRITE_LOG, {"this is a log"}},
{Command::WRITE_LOG, {"this is also (in another way) a log"}},
{Command::FAIL, {}},
{Command::NONE, {}},
{Command::START_PROCESS, {"17", "program.exe", "True"}},
{Command::FAIL, {}},
{Command::NONE, {}},
};
try {
auto parsed = script::parse(ss);
fmt::print("Parsed all correct? {} -- {} parsed (vs. {} expected)\n",
(parsed == expected), parsed.size(), expected.size());
for (auto i = 0u; i < std::min(expected.size(), parsed.size()); ++i) {
if (expected[i] != parsed[i]) {
fmt::print("index #{} expected {}\n"
" actual: {}\n",
i, expected[i], parsed[i]);
} else {
fmt::print("index #{} CORRECT ({})\n", i, parsed[i]);
}
}
} catch (std::exception const& e) {
fmt::print("Exception: {}\n", e.what());
}
}
Prints
Parsed all correct? true -- 9 parsed (vs. 9 expected)
index #0 CORRECT (Command(COMMENT, ["just a comment"]))
index #1 CORRECT (Command(NONE, []))
index #2 CORRECT (Command(WRITE_LOG, ["this is a log"]))
index #3 CORRECT (Command(WRITE_LOG, ["this is also (in another way) a log"]))
index #4 CORRECT (Command(FAIL, []))
index #5 CORRECT (Command(NONE, []))
index #6 CORRECT (Command(START_PROCESS, ["17", "program.exe", "True"]))
index #7 CORRECT (Command(FAIL, []))
index #8 CORRECT (Command(NONE, []))
¹ see e.g. boost::spirit alternative parsers return duplicates (which links to three more of the same kind)

Boost spirit core dump on parsing bracketed expression

I have a simplified grammar that should parse a sequence of terminal literals: id, '<', '>' and ":action".
I need to allow brackets '(' ')' that do nothing but improve readability. (The full example is here: http://coliru.stacked-crooked.com/a/dca93f5c8f37a889 )
Snip of my grammar:
start = expression % eol;
expression = (simple_def >> -expression)
| (qi::lit('(') > expression > ')');
simple_def = qi::lit('<') [qi::_val = Command::left]
| qi::lit('>') [qi::_val = Command::right]
| key [qi::_val = Command::id]
| qi::lit(":action") [qi::_val = Command::action]
;
key = +qi::char_("a-zA-Z_0-9");
When I try to parse: const std::string s = "(a1 > :action)"; Everything works like a charm.
But when I add a little more complexity with brackets, "(a1 (>) :action)", I get a core dump. Just for information: the core dump happens on Coliru, while the MSVC-compiled example just reports a failed parse.
So my questions: (1) what's wrong with the brackets, and (2) how exactly can brackets be introduced into expression?
P.S. This is a simplified grammar; my real case is more complicated, but this is a minimal reproducible example.
You should just handle the expectation failure:
terminate called after throwing an instance of 'boost::wrapexcept<boost::spir
it::qi::expectation_failure<__gnu_cxx::__normal_iterator<char const*, std::__
cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >
>'
what(): boost::spirit::qi::expectation_failure
Aborted (core dumped)
If you handle the expectation failure, the program will not have to terminate.
Fixing The Grammar
Your 'nested expression' rule only accepts a single expression. I think that
expression = (simple_def >> -expression)
is intended to match "1 or more simple_def". However, the alternative branch:
| ('(' > expression > ')');
doesn't accept the same: it just stops after parsing ')'. This means that your input is simply invalid according to the grammar.
I suggest a simplification by expressing intent. You were on the right path with semantic typedefs. Let's avoid the "weasely" Line Of Lines (what even is that?):
using Id = std::string;
using Line = std::vector<Command>;
using Script = std::vector<Line>;
And use these typedefs consistently. Now, we can express the grammar as we "think" about it:
start = skip(blank)[script];
script = line % eol;
line = +simple;
simple = group | command;
group = '(' > line > ')';
See, by simplifying our mental model and sticking to it, we avoided the entire problem you had a hard time spotting.
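To see why the restructured rules accept nested groups, it may help to read them as a tiny recursive-descent parser. The sketch below is plain standard C++, not Spirit. The class name LineParser is hypothetical, a std::runtime_error stands in for the expectation failure, and every key is simply recorded as Command::id.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum class Command { id, left, right, action };
using Line = std::vector<Command>;

// Hand-written sketch (not Spirit) of:
//   line   = +simple;
//   simple = group | command;
//   group  = '(' > line > ')';
class LineParser {
public:
    explicit LineParser(std::string text) : in_(std::move(text)) {}

    // True if the whole input is one valid line.
    bool parse(Line& out) {
        if (!line(out)) return false;
        skip();
        return pos_ == in_.size();
    }

private:
    bool at(char c) const { return pos_ < in_.size() && in_[pos_] == c; }
    void skip() { while (at(' ') || at('\t')) ++pos_; }

    bool line(Line& out) {                       // +simple
        if (!simple(out)) return false;
        while (simple(out)) {}
        return true;
    }
    bool simple(Line& out) { return group(out) || command(out); }

    bool group(Line& out) {
        skip();
        if (!at('(')) return false;
        ++pos_;
        // After '(' there is no way back: this models the '>' operator.
        if (!line(out)) throw std::runtime_error("expected line after '('");
        skip();
        if (!at(')')) throw std::runtime_error("expected ')'");
        ++pos_;
        return true;
    }
    bool command(Line& out) {
        skip();
        if (pos_ >= in_.size()) return false;
        if (at('<')) { ++pos_; out.push_back(Command::left); return true; }
        if (at('>')) { ++pos_; out.push_back(Command::right); return true; }
        if (in_.compare(pos_, 7, ":action") == 0) {
            pos_ += 7; out.push_back(Command::action); return true;
        }
        std::size_t start = pos_;                // key = +char_("a-zA-Z_0-9")
        while (pos_ < in_.size() &&
               (std::isalnum(static_cast<unsigned char>(in_[pos_])) || in_[pos_] == '_'))
            ++pos_;
        if (pos_ == start) return false;
        out.push_back(Command::id);
        return true;
    }

    std::string in_;
    std::size_t pos_ = 0;
};
```

Because a group contains a whole line and a line is a sequence of groups or commands, "(a1 (>) :action)" is just a group whose line happens to contain another group. An unterminated "(a1 >" throws instead of silently failing, which matches the behaviour you saw, only now it is meant to be caught.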
Here's a quick demo that includes error handling, optional debug output, both test cases and encapsulating the skipper as it is part of the grammar: Live On Compiler Explorer
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
enum class Command { id, left, right, action };
static inline std::ostream& operator<<(std::ostream& os, Command cmd) {
switch (cmd) {
case Command::id: return os << "[ID]";
case Command::left: return os << "[LEFT]";
case Command::right: return os << "[RIGHT]";
case Command::action: return os << "[ACTION]";
}
return os << "[???]";
}
using Id = std::string;
using Line = std::vector<Command>;
using Script = std::vector<Line>;
template <typename It>
struct ExprGrammar : qi::grammar<It, Script()> {
ExprGrammar() : ExprGrammar::base_type(start) {
using namespace qi;
start = skip(blank)[script];
script = line % eol;
line = +simple;
simple = group | command;
group = '(' > line > ')';
command =
lit('<') [ _val = Command::left ] |
lit('>') [ _val = Command::right ] |
key [ _val = Command::id ] |
lit(":action") [ _val = Command::action ] ;
key = +char_("a-zA-Z_0-9");
BOOST_SPIRIT_DEBUG_NODES((command)(line)(simple)(group)(script)(key));
}
private:
qi::rule<It, Script()> start;
qi::rule<It, Line(), qi::blank_type> line, simple, group;
qi::rule<It, Script(), qi::blank_type> script;
qi::rule<It, Command(), qi::blank_type> command;
// lexemes
qi::rule<It, Id()> key;
};
int main() {
using It = std::string::const_iterator;
ExprGrammar<It> const p;
for (const std::string s : {
"a1 > :action\na1 (>) :action",
"(a1 > :action)\n(a1 (>) :action)",
"a1 (> :action)",
}) {
It f(begin(s)), l(end(s));
try {
Script parsed;
bool ok = qi::parse(f, l, p, parsed);
if (ok) {
fmt::print("Parsed {}\n", parsed);
} else {
fmt::print("Parse failed\n");
}
if (f != l) {
fmt::print("Remaining unparsed: '{}'\n", std::string(f, l));
}
} catch (qi::expectation_failure<It> const& ef) {
fmt::print("{}\n", ef.what()); // TODO add more details :)
}
}
}
Prints
Parsed {{[ID], [RIGHT], [ACTION]}, {[ID], [RIGHT], [ACTION]}}
Parsed {{[ID], [RIGHT], [ACTION]}, {[ID], [RIGHT], [ACTION]}}
Parsed {{[ID], [RIGHT], [ACTION]}}
BONUS
However, I think this can all be greatly simplified using qi::symbols for the commands. In fact it looks like you're only tokenizing (you confirm this when you say that the parentheses are not important).
line = +simple;
simple = group | command | (omit[key] >> attr(Command::id));
group = '(' > line > ')';
key = +char_("a-zA-Z_0-9");
Now you don't need Phoenix at all: Live On Compiler Explorer, printing
ok? true {{[ID], [RIGHT], [ACTION]}, {[ID], [RIGHT], [ACTION]}}
ok? true {{[ID], [RIGHT], [ACTION]}, {[ID], [RIGHT], [ACTION]}}
ok? true {{[ID], [RIGHT], [ACTION]}}
Even Simpler?
Since I observe that you're basically tokenizing line-wise, why not simply skip the parentheses, and simplify all the way down to:
script = line % eol;
line = *(command | omit[key] >> attr(Command::id));
That's all. See it Live On Compiler Explorer again:
#include <boost/spirit/include/qi.hpp>
#include <fmt/ostream.h>
#include <fmt/ranges.h>
namespace qi = boost::spirit::qi;
enum class Command { id, left, right, action };
using Id = std::string;
using Line = std::vector<Command>;
using Script = std::vector<Line>;
static inline std::ostream& operator<<(std::ostream& os, Command cmd) {
return os << (std::array{"ID", "LEFT", "RIGHT", "ACTION"}.at(int(cmd)));
}
template <typename It>
struct ExprGrammar : qi::grammar<It, Script()> {
ExprGrammar() : ExprGrammar::base_type(start) {
using namespace qi;
start = skip(skipper.alias())[line % eol];
line = *(command | omit[key] >> attr(Command::id));
key = +char_("a-zA-Z_0-9");
BOOST_SPIRIT_DEBUG_NODES((line)(key));
}
private:
using Skipper = qi::rule<It>;
qi::rule<It, Script()> start;
qi::rule<It, Line(), Skipper> line;
Skipper skipper = qi::char_(" \t\b\f()");
qi::rule<It /*, Id()*/> key; // omit attribute for efficiency
struct cmdsym : qi::symbols<char, Command> {
cmdsym() { this->add("<", Command::left)
(">", Command::right)
(":action", Command::action);
}
} command;
};
int main() {
using It = std::string::const_iterator;
ExprGrammar<It> const p;
for (const std::string s : {
"a1 > :action\na1 (>) :action",
"(a1 > :action)\n(a1 (>) :action)",
"a1 (> :action)",
})
try {
It f(begin(s)), l(end(s));
Script parsed;
bool ok = qi::parse(f, l, p, parsed);
fmt::print("ok? {} {}\n", ok, parsed);
if (f != l)
fmt::print(" -- Remaining '{}'\n", std::string(f, l));
} catch (qi::expectation_failure<It> const& ef) {
fmt::print("{}\n", ef.what()); // TODO add more details :)
}
}
Prints
ok? true {{ID, RIGHT, ACTION}, {ID, RIGHT, ACTION}}
ok? true {{ID, RIGHT, ACTION}, {ID, RIGHT, ACTION}}
ok? true {{ID, RIGHT, ACTION}}
Note I very subtly changed +() to *() so it would accept empty lines as well. This may or may not be what you want.
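The idea of treating parentheses as skippable noise does not depend on Spirit at all. As a rough sketch in plain standard C++ (the function name tokenize_line is invented here, and unlike the grammar above it rejects the whole line on an unknown character instead of reporting the unparsed remainder):

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

enum class Command { id, left, right, action };

// Sketch only: tokenizes one line, ignoring blanks and parentheses in the
// same way the skipper rule char_(" \t\b\f()") does in the grammar above.
// Returns std::nullopt when an unrecognized character is found.
std::optional<std::vector<Command>> tokenize_line(const std::string& line)
{
    static const std::string skipped = " \t\b\f()";
    auto is_key_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };

    std::vector<Command> out;
    std::size_t i = 0;
    while (i < line.size()) {
        char c = line[i];
        if (skipped.find(c) != std::string::npos) {
            ++i;                                   // skipper: nothing emitted
        } else if (c == '<') {
            out.push_back(Command::left); ++i;
        } else if (c == '>') {
            out.push_back(Command::right); ++i;
        } else if (line.compare(i, 7, ":action") == 0) {
            out.push_back(Command::action); i += 7;
        } else if (is_key_char(c)) {
            while (i < line.size() && is_key_char(line[i])) ++i;
            out.push_back(Command::id);            // key text itself is dropped
        } else {
            return std::nullopt;
        }
    }
    return out;                                    // may be empty, like *()
}
```

The trade-off is the same as in the grammar: unbalanced parentheses are silently accepted, which is fine only if they really carry no meaning.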

Boost Spirit X3: How to recover both matched and unmatched results of a rule

Consider the following sample text line:
"Hello : World 2020 :tag1:tag2:tag3"
I want to design a spirit X3 parser that can extract:
Content := "Hello : world 2020 "
Tags := { tag1,tag2,tag3 }
The problem: Content is defined as the leftover character sequence (excluding eol) after matching the tags, and I am not sure how to write a rule that can synthesize two attributes: one representing the extracted tags and another representing the leftover characters (the content).
So far I've written the rule for extracting the tags:
...
namespace ast {
struct sample {
std::u32string content;
std::vector<std::u32string> tags;
};
//BOOST FUSION STUFF .....
}
namespace grammar {
namespace x3 = boost::spirit::x3;
using x3::unicode::lit;
using x3::unicode::char_;
using x3::unicode::alnum;
auto const tag
= x3::rule<class tag_class, std::u32string> {"tag"}
%=
lit(U":")
>>
+(alnum | lit(U"_") | lit(U"#") | lit(U"#") | lit(U"%") )
;
auto const tags
= x3::rule<class tags_class, std::vector<std::u32string>> {"tags"}
%= +tag >> lit(U":");
}
But stuck over here:
auto const sample_rule
= x3::rule<class sample_rule_class, ast::sample> {"sample"}
= ?? // something like +(char_ - (eol | tags));
I'm sure there is a much more elegant solution out there. In the meantime, a messy solution:
Parse each sample line as a single string unit.
Use a semantic action to filter out the tags from each matched string unit.
Discard the filtered tags from the string unit to be left with only the content.
sample_ast.h
#pragma once
#include <string>
#include <vector>
namespace ast {
struct sample {
std::u32string content;
std::vector<std::u32string> tags;
};
}
sample.h
#pragma once
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/char_encoding/unicode.hpp>
#include <boost/spirit/home/x3.hpp>
#include "sample_ast.h"
//tags property is intentionally ignored.
//It will be synthesized
//manually using semantic actions
BOOST_FUSION_ADAPT_STRUCT( ast::sample,content )
namespace grammar {
namespace detail {
namespace x3 = boost::spirit::x3;
using x3::unicode::char_;
using x3::eol;
using x3::eoi;
using x3::lexeme;
auto const sample_line
= x3::rule<class sample_line_class, std::u32string>{"sample_line"}
= lexeme[ +(char_ - (eol|eoi)) ];
auto filter_tags = /*.... definition moved to next page for clarity */
auto const sample
= x3::rule<class sample_class, ast::sample >{"sample"}
%= sample_line[ filter_tags ]; // an action is attached as parser[action]
}}
namespace grammar {
using grammar::detail::sample;
}
filter_tags definition
iterate the matched data right to left collecting
colon separated tags until an invalid tag char is
encountered or all chars have been exhausted.
pos_saved is used to track the beginning
of the tag list, which is used to discard the tags
from the content after collecting them into the ast.
auto filter_tags = []( auto& context )
{
auto &attr = _attr(context); // content string
auto &val = _val(context); // ast::sample
std::stack<char32_t> mem;
auto pos = attr.rbegin();
auto const pos_end = attr.rend();
auto pos_saved = attr.end();
do{
//tag start or end
if( *pos == U':' ){
if( mem.empty() ) { //tag start
mem.push(U':');
}
else { //tag end
//tag closed state:
//all chars for the current tag
//are ready for transfer into
//the ast.
std::u32string tag;
while( mem.top() != ':' ){
//since we're reverse iterating the data
//the tags wont be backwards
tag.push_back( mem.top());
mem.pop();
}
val.tags.push_back(tag);
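//update the start offset of the tags:
//std::next(pos).base() refers to this ':' itself, so the
//separator is erased together with the tags (needs <iterator>)
pos_saved = std::next(pos).base();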
}
} else { // tag char or not
using u = boost::spirit::char_encoding::unicode;
if( !mem.empty() ) {
if(u::isalnum(*pos)) mem.push( *pos ); //tag char found
else break; //invalid tag char found
}
else {
//space after tag list but before content end
if(u::isspace(*pos)) pos_saved = pos.base();
}
}
}while(++pos != pos_end);
if( pos_saved != attr.end()) attr.erase(pos_saved, attr.end() );
if( attr.empty() ) _pass(context) = false;
};
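Since the filtering itself does not need anything from Spirit, the same split can be sketched as an ordinary function that runs after the line has been parsed. The version below is a simplification under stated assumptions, not a drop-in replacement for the action above: it uses narrow std::string instead of std::u32string, it assumes the tag list is the last whitespace-delimited token of the line, it accepts the tag characters from the question's tag rule (alphanumerics, '_', '#', '%'), and the names Sample and split_tags are made up for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Sample {
    std::string content;
    std::vector<std::string> tags;
};

// Sketch: splits "text :tag1:tag2" into content and tags. If the last token
// is not a well-formed tag list, the whole line is returned as content.
Sample split_tags(const std::string& line)
{
    Sample result{line, {}};

    // Find the last token [begin, end), ignoring trailing blanks.
    std::size_t end = line.find_last_not_of(" \t");
    if (end == std::string::npos) return result;        // blank line
    ++end;
    std::size_t begin = line.find_last_of(" \t", end - 1);
    begin = (begin == std::string::npos) ? 0 : begin + 1;

    if (line[begin] != ':') return result;               // no tag list

    std::vector<std::string> tags;
    std::string current;
    for (std::size_t i = begin + 1; i < end; ++i) {
        char c = line[i];
        if (c == ':') {
            if (current.empty()) return result;          // "::" is rejected
            tags.push_back(current);
            current.clear();
        } else if (std::isalnum(static_cast<unsigned char>(c)) ||
                   c == '_' || c == '#' || c == '%') {
            current += c;
        } else {
            return result;                               // invalid tag char
        }
    }
    if (!current.empty()) tags.push_back(current);       // no trailing ':'
    if (tags.empty()) return result;                     // a lone ':' is content

    result.content = line.substr(0, begin);
    result.tags = std::move(tags);
    return result;
}
```

With the sample line from the question this keeps the " : " inside the content intact, because only the final token is ever considered as a tag list.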

How to capture the value parsed by a boost::spirit::x3 parser to be used within the body of a semantic action?

I have a parser for string literals, and I'd like to attach a semantic action to the parser that will manipulate the parsed value. It seems that boost::spirit::x3::_val() returns a reference to the parsed value when given the context, but for some reason the parsed string always enters the body of the semantic action as just an empty string, which obviously makes it difficult to read from it. It is the right string though, I've made sure by checking the addresses. Anyone know how I could have a reference to the parsed value within the semantic action attached to the parser? This here is the parser I currently use:
x3::lexeme[quote > *("\\\"" >> x3::attr('\"') | ~x3::char_(quote)) > quote]
And I'd like to add the semantic action to the end of it. Thank you in advance!
EDIT: it seems that whenever I attach any semantic action in general to the parser, the value is nullified. I suppose the question now is how could I access the value before that happens? I just need to be able to manipulate the parsed string before it is given to the AST.
In X3, semantic actions are much simpler. They're unary callables that take just the context.
Then you use free functions to extract information from the context:
x3::_val(ctx) is like qi::_val
x3::_attr(ctx) is like qi::_0 (or qi::_1 for simple parsers)
x3::_pass(ctx) is like qi::_pass
So, to get your semantic action, you could do:
auto qstring
= x3::rule<struct rule_type, std::string> {"qstring"}
= x3::lexeme[quote > *("\\" >> x3::char_(quote) | ~x3::char_(quote)) > quote]
;
Now to make a very odd string rule that reverses the text (after de-escaping) and requires the number of characters to be even:
auto odd_reverse = [](auto& ctx) {
auto& attr = x3::_attr(ctx);
auto& val = x3::_val(ctx);
x3::traits::move_to(attr, val);
std::reverse(val.begin(), val.end());
x3::_pass(ctx) = val.size() % 2 == 0;
};
auto odd_string
= x3::rule<struct odd_type, std::string> {"odd_string"}
= qstring [ odd_reverse ]
;
DEMO
Live On Coliru
#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iomanip>
int main() {
namespace x3 = boost::spirit::x3;
auto constexpr quote = '"';
auto qstring
= x3::rule<struct rule_type, std::string> {"qstring"}
= x3::lexeme[quote > *("\\" >> x3::char_(quote) | ~x3::char_(quote)) > quote]
;
auto odd_reverse = [](auto& ctx) {
auto& attr = x3::_attr(ctx);
auto& val = x3::_val(ctx);
x3::traits::move_to(attr, val);
std::reverse(val.begin(), val.end());
x3::_pass(ctx) = val.size() % 2 == 0;
};
auto odd_string
= x3::rule<struct odd_type, std::string> {"odd_string"}
= qstring [ odd_reverse ]
;
for (std::string const input : {
R"("test \"hello\" world")",
R"("test \"hello\" world!")",
}) {
std::string output;
auto f = begin(input), l = end(input);
if (x3::phrase_parse(f, l, odd_string, x3::blank, output)) {
std::cout << "[" << output << "]\n";
} else {
std::cout << "Failed\n";
}
if (f != l) {
std::cout << "Remaining unparsed: " << std::quoted(std::string(f,l)) << "\n";
}
}
}
Printing
[dlrow "olleh" tset]
Failed
Remaining unparsed: "\"test \\\"hello\\\" world!\""
UPDATE
To the added question:
EDIT: it seems that whenever I attach any semantic action in general
to the parser, the value is nullified. I suppose the question now is
how could I access the value before that happens? I just need to be
able to manipulate the parsed string before it is given to the AST.
Yes, if you attach an action, automatic attribute propagation is inhibited. This is the same in Qi, where you could assign rules with %= instead of = to force automatic attribute propagation.
To get the same effect in X3, use the third template argument to x3::rule: x3::rule<X, T, true> to indicate you want automatic propagation.
Really, try not to fight the system. In practice, the automatic transformation system is way more sophisticated than I am willing to re-discover on my own, so I usually post-process the whole AST or at most apply some minor tweaks in an action. See also Boost Spirit: "Semantic actions are evil"?

Is it a good idea to use boost::program_options to parse a text file?

I have to deal with a lot of files with a well-defined syntax and semantics, for example:
the first line is a header with special info
the other lines contain a key at the start of the line that tells you how to parse and deal with the content of that line
if there is a comment, it starts with a given token
etc.
Now boost::program_options, as far as I can tell, does pretty much the same job, but I only care about importing the content of those text files without any extra work in between: just parse it and store it in my data structure.
The key step for me is that I would like to be able to do this parsing with:
regular expressions, since I need to detect different semantics and I can't really imagine another way to do this
error checking (corrupted file, unmatched keys even after parsing the entire file, etc.)
So, can I use this library for this job? Is there a more functional approach?
Okay, a starting point for a Spirit grammar
_Name = "newmtl" >> lexeme [ +graph ];
_Ns = "Ns" >> double_;
_Ka = "Ka" >> double_ >> double_ >> double_;
_Kd = "Kd" >> double_ >> double_ >> double_;
_Ks = "Ks" >> double_ >> double_ >> double_;
_d = "d" >> double_;
_illum %= "illum" >> qi::int_ [ _pass = (_1>=0) && (_1<=10) ];
comment = '#' >> *(char_ - eol);
statement=
comment
| _Ns [ bind(&material::_Ns, _r1) = _1 ]
| _Ka [ bind(&material::_Ka, _r1) = _1 ]
| _Kd [ bind(&material::_Kd, _r1) = _1 ]
| _Ks [ bind(&material::_Ks, _r1) = _1 ]
| _d [ bind(&material::_d, _r1) = _1 ]
| _illum [ bind(&material::_illum, _r1) = _1 ]
;
_material = -comment % eol
>> _Name [ bind(&material::_Name, _val) = _1 ] >> eol
>> -statement(_val) % eol;
start = _material % -eol;
I only implemented the MTL file subset grammar from your sample files.
Note: This is a rather simplistic grammar. But, you know, first things first. In reality I'd probably consider using the keyword list parser from the Spirit repository. It has facilities to 'require' a certain number of occurrences for the different 'field types'.
Note: Spirit Karma (and some ~50 other lines of code) are only here for demonstration purposes.
With the following contents of untitled.mtl
# Blender MTL File: 'None'
# Material Count: 2
newmtl None
Ns 0
Ka 0.000000 0.000000 0.000000
Kd 0.8 0.8 0.8
Ks 0.8 0.8 0.8
d 1
illum 2
# Added just for testing:
newmtl Demo
Ns 1
Ks 0.9 0.9 0.9
d 42
illum 7
The output reads
phrase_parse -> true
remaining input: ''
void dump(const T&) [with T = std::vector<blender::mtl::material>]
-----
material {
Ns:0
Ka:{r:0,g:0,b:0}
Kd:{r:0.8,g:0.8,b:0.8}
Ks:{r:0.8,g:0.8,b:0.8}
d:1
illum:2(Highlight on)
}
material {
Ns:1
Ka:(unspecified)
Kd:(unspecified)
Ks:{r:0.9,g:0.9,b:0.9}
d:42
illum:7(Transparency: Refraction on/Reflection: Fresnel on and Ray trace on)
}
-----
Here's the listing
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp> // for debug output/streaming
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/optional.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;
namespace phx= boost::phoenix;
namespace wavefront { namespace obj
{
} }
namespace blender { namespace mtl // material?
{
struct Ns { int exponent; }; // specular exponent
struct Reflectivity { double r, g, b; };
using Name = std::string;
using Ka = Reflectivity;
using Kd = Reflectivity;
using Ks = Reflectivity;
using dissolve_factor = double;
enum class illumination_model {
color, // 0 Color on and Ambient off
color_ambient, // 1 Color on and Ambient on
highlight, // 2 Highlight on
reflection_ray, // 3 Reflection on and Ray trace on
glass_ray, // 4 Transparency: Glass on
// Reflection: Ray trace on
fresnel_ray, // 5 Reflection: Fresnel on and Ray trace on
refract_ray, // 6 Transparency: Refraction on
// Reflection: Fresnel off and Ray trace on
refract_ray_fresnel,// 7 Transparency: Refraction on
// Reflection: Fresnel on and Ray trace on
reflection, // 8 Reflection on and Ray trace off
glass, // 9 Transparency: Glass on
// Reflection: Ray trace off
shadow_invis, // 10 Casts shadows onto invisible surfaces
};
struct material
{
Name _Name;
boost::optional<Ns> _Ns;
boost::optional<Reflectivity> _Ka;
boost::optional<Reflectivity> _Kd;
boost::optional<Reflectivity> _Ks;
boost::optional<dissolve_factor> _d;
boost::optional<illumination_model> _illum;
};
using mtl_file = std::vector<material>;
///////////////////////////////////////////////////////////////////////
// Debug output helpers
std::ostream& operator<<(std::ostream& os, blender::mtl::illumination_model o)
{
using blender::mtl::illumination_model;
switch(o)
{
case illumination_model::color: return os << "0(Color on and Ambient off)";
case illumination_model::color_ambient: return os << "1(Color on and Ambient on)";
case illumination_model::highlight: return os << "2(Highlight on)";
case illumination_model::reflection_ray: return os << "3(Reflection on and Ray trace on)";
case illumination_model::glass_ray: return os << "4(Transparency: Glass on/Reflection: Ray trace on)";
case illumination_model::fresnel_ray: return os << "5(Reflection: Fresnel on and Ray trace on)";
case illumination_model::refract_ray: return os << "6(Transparency: Refraction on/Reflection: Fresnel off and Ray trace on)";
case illumination_model::refract_ray_fresnel: return os << "7(Transparency: Refraction on/Reflection: Fresnel on and Ray trace on)";
case illumination_model::reflection: return os << "8(Reflection on and Ray trace off)";
case illumination_model::glass: return os << "9(Transparency: Glass on/Reflection: Ray trace off)";
case illumination_model::shadow_invis: return os << "10(Casts shadows onto invisible surfaces)";
default: return os << "ILLEGAL VALUE";
}
}
std::ostream& operator<<(std::ostream& os, blender::mtl::Reflectivity const& o)
{
return os << "{r:" << o.r << ",g:" << o.g << ",b:" << o.b << "}";
}
std::ostream& operator<<(std::ostream& os, blender::mtl::material const& o)
{
using namespace boost::spirit::karma;
return os << format("material {"
"\n\tNs:" << (auto_ | "(unspecified)")
<< "\n\tKa:" << (stream | "(unspecified)")
<< "\n\tKd:" << (stream | "(unspecified)")
<< "\n\tKs:" << (stream | "(unspecified)")
<< "\n\td:" << (stream | "(unspecified)")
<< "\n\tillum:" << (stream | "(unspecified)")
<< "\n}", o);
}
} }
BOOST_FUSION_ADAPT_STRUCT(blender::mtl::Reflectivity,(double, r)(double, g)(double, b))
BOOST_FUSION_ADAPT_STRUCT(blender::mtl::Ns, (int, exponent))
BOOST_FUSION_ADAPT_STRUCT(blender::mtl::material,
(boost::optional<blender::mtl::Ns>, _Ns)
(boost::optional<blender::mtl::Ka>, _Ka)
(boost::optional<blender::mtl::Kd>, _Kd)
(boost::optional<blender::mtl::Ks>, _Ks)
(boost::optional<blender::mtl::dissolve_factor>, _d)
(boost::optional<blender::mtl::illumination_model>, _illum))
namespace blender { namespace mtl { namespace parsing
{
template <typename It>
struct grammar : qi::grammar<It, qi::blank_type, mtl_file()>
{
template <typename T=qi::unused_type> using rule = qi::rule<It, qi::blank_type, T>;
rule<Name()> _Name;
rule<Ns()> _Ns;
rule<Reflectivity()> _Ka;
rule<Reflectivity()> _Kd;
rule<Reflectivity()> _Ks;
rule<dissolve_factor()> _d;
rule<illumination_model()> _illum;
rule<mtl_file()> start;
rule<material()> _material;
rule<void(material&)> statement;
rule<> comment;
grammar() : grammar::base_type(start)
{
using namespace qi;
using phx::bind;
using blender::mtl::material;
_Name = "newmtl" >> lexeme [ +graph ];
_Ns = "Ns" >> double_;
_Ka = "Ka" >> double_ >> double_ >> double_;
_Kd = "Kd" >> double_ >> double_ >> double_;
_Ks = "Ks" >> double_ >> double_ >> double_;
_d = "d" >> double_;
_illum %= "illum" >> qi::int_ [ _pass = (_1>=0) && (_1<=10) ];
comment = '#' >> *(char_ - eol);
statement=
comment
| _Ns [ bind(&material::_Ns, _r1) = _1 ]
| _Ka [ bind(&material::_Ka, _r1) = _1 ]
| _Kd [ bind(&material::_Kd, _r1) = _1 ]
| _Ks [ bind(&material::_Ks, _r1) = _1 ]
| _d [ bind(&material::_d, _r1) = _1 ]
| _illum [ bind(&material::_illum, _r1) = _1 ]
;
_material = -comment % eol
>> _Name [ bind(&material::_Name, _val) = _1 ] >> eol
>> -statement(_val) % eol;
start = _material % -eol;
BOOST_SPIRIT_DEBUG_NODES(
(start)
(statement)
(_material)
(_Name) (_Ns) (_Ka) (_Kd) (_Ks) (_d) (_illum)
(comment))
}
};
} } }
#include <fstream>
template <typename T>
void dump(T const& data)
{
using namespace boost::spirit::karma;
std::cout << __PRETTY_FUNCTION__
<< "\n-----\n"
<< format(stream % eol, data)
<< "\n-----\n";
}
void testMtl(const char* const fname)
{
std::ifstream mtl(fname, std::ios::binary);
mtl.unsetf(std::ios::skipws);
boost::spirit::istream_iterator f(mtl), l;
using namespace blender::mtl::parsing;
static const grammar<decltype(f)> p;
blender::mtl::mtl_file data;
bool ok = qi::phrase_parse(f, l, p, qi::blank, data);
std::cout << "phrase_parse -> " << std::boolalpha << ok << "\n";
std::cout << "remaining input: '" << std::string(f,l) << "'\n";
dump(data);
}
int main()
{
testMtl("untitled.mtl");
}
Yes, at least if your config file is as simple as a map of key-value pairs (something like a simple .ini).
From the documentation:
The program_options library allows program developers to obtain
program options, that is (name, value) pairs from the user, via
conventional methods such as command line and config file.
...
Options can be read from anywhere. Sooner or later the command line
will be not enough for your users, and you'll want config files or
maybe even environment variables. These can be added without
significant effort on your part.
See "multiple sources" sample for details.
But if you need (or might need in the future) more sophisticated config files (XML, JSON or binary, for example), it is worth using a standalone library.
It's most likely possible, but not necessarily convenient. If you want to parse anything, you want to use a parser. Whether you use an existing one or write one yourself depends on what you are parsing.
If there is no way to parse your format with any existing tool, then just write your own parser. You can use lex/flex/flex++ with yacc/bison/bison++, or boost::spirit.
I think that in the long run, learning to maintain your own parser will be more useful than forcefully adjusting a boost::program_options config, but not as convenient as using an existing parser that already matches your needs.
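For a sense of scale, here is a rough sketch of a hand-written parser for the kind of file described in the question, using only the standard library. It assumes a layout the question does not spell out: the first line is a header, lines starting with `#` are comments, and every other non-empty line is a key followed by one or more numbers. `ParsedFile` and `parse_keyed_lines` are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <exception>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: the header line plus the numbers found per key.
struct ParsedFile {
    std::string header;
    std::map<std::string, std::vector<double>> values;
};

// Assumed format: header on line 1, '#' starts a comment line, other
// non-empty lines are "<key> <number>...". Throws std::runtime_error on
// missing header, non-numeric fields, keys without values, or repeated keys.
ParsedFile parse_keyed_lines(std::string const& text)
{
    std::istringstream in(text);
    ParsedFile result;
    if (!std::getline(in, result.header))
        throw std::runtime_error("missing header line");

    std::string line;
    for (std::size_t lineno = 2; std::getline(in, line); ++lineno) {
        std::istringstream fields(line);
        std::string key;
        if (!(fields >> key) || key[0] == '#')
            continue; // blank line or comment

        std::string const where = "line " + std::to_string(lineno) + ": ";
        std::vector<double> numbers;
        for (std::string tok; fields >> tok;) {
            std::size_t used = 0;
            double d = 0;
            try { d = std::stod(tok, &used); }
            catch (std::exception const&) { used = 0; }
            if (used != tok.size()) // reject tokens that are not fully numeric
                throw std::runtime_error(where + "bad number '" + tok + "'");
            numbers.push_back(d);
        }
        if (numbers.empty())
            throw std::runtime_error(where + "no value for key '" + key + "'");
        if (!result.values.emplace(key, std::move(numbers)).second)
            throw std::runtime_error(where + "duplicate key '" + key + "'");
    }
    return result;
}
```

The error checks shown here are only a starting point. Once keys need different value types or nested structure, a grammar-based tool such as Spirit or a lexer/parser generator tends to scale better than growing this loop.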