I have a very simple path construct that I am trying to parse with boost spirit.lex.
We have the following grammar:
token := [a-z]+
path := (token : path) | (token)
So we're just talking about colon-separated lower-case ASCII strings here.
I have three examples: "abc", "abc:xyz", "abc:xyz:".
The first two should be deemed valid. The third one, which has a trailing colon, should not be deemed valid. Unfortunately the parser I have recognizes all three as being valid. The grammar should not allow an empty token, but apparently spirit is doing just that. What am I missing to get the third one rejected?
Also, if you read the code below, there is (in comments) another version of the parser that demands that all paths end with semicolons. I get the appropriate behavior when I activate those lines (i.e. "abc:xyz:;" is rejected), but that is not really what I want.
Anyone have any ideas?
Thanks.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <iostream>
#include <string>
using namespace boost::spirit;
using boost::phoenix::val;
template<typename Lexer>
struct PathTokens : boost::spirit::lex::lexer<Lexer>
{
PathTokens()
{
identifier = "[a-z]+";
separator = ":";
this->self.add
(identifier)
(separator)
(';')
;
}
boost::spirit::lex::token_def<std::string> identifier, separator;
};
template <typename Iterator>
struct PathGrammar
: boost::spirit::qi::grammar<Iterator>
{
template <typename TokenDef>
PathGrammar(TokenDef const& tok)
: PathGrammar::base_type(path)
{
using boost::spirit::_val;
path
=
(token >> tok.separator >> path)[std::cerr << _1 << "\n"]
|
//(token >> ';')[std::cerr << _1 << "\n"]
(token)[std::cerr << _1 << "\n"]
;
token
= (tok.identifier) [_val=_1]
;
}
boost::spirit::qi::rule<Iterator> path;
boost::spirit::qi::rule<Iterator, std::string()> token;
};
int main()
{
typedef std::string::iterator BaseIteratorType;
typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::string> > TokenType;
typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
typedef PathTokens<LexerType>::iterator_type TokensIterator;
typedef std::vector<std::string> Tests;
Tests paths;
paths.push_back("abc");
paths.push_back("abc:xyz");
paths.push_back("abc:xyz:");
/*
paths.clear();
paths.push_back("abc;");
paths.push_back("abc:xyz;");
paths.push_back("abc:xyz:;");
*/
for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
{
std::string str = *iter;
std::cerr << "*****" << str << "*****\n";
PathTokens<LexerType> tokens;
PathGrammar<TokensIterator> grammar(tokens);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
bool r = boost::spirit::lex::tokenize_and_parse(first, last, tokens, grammar);
std::cerr << r << " " << (first==last) << "\n";
}
}
In addition to what llonesmiz already said, here's a trick using qi::eoi that I sometimes use:
path = (
(token >> tok.separator >> path) [std::cerr << _1 << "\n"]
| token [std::cerr << _1 << "\n"]
) >> eoi;
This makes the grammar require eoi (end-of-input) at the end of a successful match. This leads to the desired result:
http://liveworkspace.org/code/23a7adb11889bbb2825097d7c553f71d
*****abc*****
abc
1 1
*****abc:xyz*****
xyz
abc
1 1
*****abc:xyz:*****
xyz
abc
0 1
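(A side note, not from the original answer: if you'd rather not touch the grammar, the same effect can be had by appending eoi at the call site instead. This is just a hedged variation on the same idea, assuming the rest of the question's code stays unchanged and that tokenize_and_parse accepts an arbitrary parser expression, which it does in the Boost versions I've used:)
bool r = boost::spirit::lex::tokenize_and_parse(first, last, tokens,
             grammar >> qi::eoi); // reject input that leaves unmatched tokens at the end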
The problem lies in the meaning of first and last after your call to tokenize_and_parse. first==last only checks whether your string has been completely tokenized; you can't infer anything about the grammar from it. If you isolate the parsing like this, you obtain the expected result:
PathTokens<LexerType> tokens;
PathGrammar<TokensIterator> grammar(tokens);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
LexerType::iterator_type lexfirst = tokens.begin(first,last);
LexerType::iterator_type lexlast = tokens.end();
bool r = parse(lexfirst, lexlast, grammar);
std::cerr << r << " " << (lexfirst==lexlast) << "\n";
This is what I finally ended up with. It uses the suggestions from both sehe and llonesmiz. Note the conversion to std::wstring and the use of actions in the grammar definition, which were not present in the original post.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/bind.hpp>
#include <iostream>
#include <string>
//
// This example uses boost spirit to parse a simple
// colon-delimited grammar.
//
// The grammar we want to recognize is:
// identifier := [a-z]+
// separator  := :
// path       := (identifier separator path) | identifier
//
// From the boost spirit perspective this example shows
// a few things I found hard to come by when building my
// first parser.
// 1. How to flag an incomplete token at the end of input
// as an error. (use of boost::spirit::eoi)
// 2. How to bind an action on an instance of an object
// that is taken as input to the parser.
// 3. Use of std::wstring.
// 4. Use of the lexer iterator.
//
// This using directive will cause issues with boost::bind
// when referencing placeholders such as _1.
// using namespace boost::spirit;
//! A class that tokenizes our input.
template<typename Lexer>
struct Tokens : boost::spirit::lex::lexer<Lexer>
{
Tokens()
{
identifier = L"[a-z]+";
separator = L":";
this->self.add
(identifier)
(separator)
;
}
boost::spirit::lex::token_def<std::wstring, wchar_t> identifier, separator;
};
//! This class provides a callback that echoes strings to stderr.
struct Echo
{
void echo(boost::fusion::vector<std::wstring> const& t) const
{
using namespace boost::fusion;
std::wcerr << at_c<0>(t) << "\n";
}
};
//! The definition of our grammar, as described above.
template <typename Iterator>
struct Grammar : boost::spirit::qi::grammar<Iterator>
{
template <typename TokenDef>
Grammar(TokenDef const& tok, Echo const& e)
: Grammar::base_type(path)
{
using boost::spirit::_val;
path
=
((token >> tok.separator >> path)[boost::bind(&Echo::echo, &e, ::_1)]
|
(token)[boost::bind(&Echo::echo, &e, ::_1)]
) >> boost::spirit::eoi; // Look for end of input.
token
= (tok.identifier) [_val=boost::spirit::qi::_1]
;
}
boost::spirit::qi::rule<Iterator> path;
boost::spirit::qi::rule<Iterator, std::wstring()> token;
};
int main()
{
// A set of typedefs to make things a little clearer. This stuff is
// well described in the boost spirit documentation/examples.
typedef std::wstring::iterator BaseIteratorType;
typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::wstring> > TokenType;
typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
typedef Tokens<LexerType>::iterator_type TokensIterator;
typedef LexerType::iterator_type LexerIterator;
// Define some paths to parse.
typedef std::vector<std::wstring> Tests;
Tests paths;
paths.push_back(L"abc");
paths.push_back(L"abc:xyz");
paths.push_back(L"abc:xyz:");
paths.push_back(L":");
// Parse 'em.
for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
{
std::wstring str = *iter;
std::wcerr << L"*****" << str << L"*****\n";
Echo e;
Tokens<LexerType> tokens;
Grammar<TokensIterator> grammar(tokens, e);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
// Have the lexer consume our string.
LexerIterator lexFirst = tokens.begin(first, last);
LexerIterator lexLast = tokens.end();
// Have the parser consume the output of the lexer.
bool r = boost::spirit::qi::parse(lexFirst, lexLast, grammar);
// Print the status and whether or not all output of the lexer
// was processed.
std::wcerr << r << L" " << (lexFirst==lexLast) << L"\n";
}
}
I'm trying to create a (pretty simple) parser using boost::spirit::qi to extract messages from a stream. Each message starts with a short marker and ends with \r\n. The message body is ASCII text (letters and numbers) separated by commas. For example:
!START,01,2.3,ABC\r\n
!START,456.2,890\r\n
I'm using unit tests to check the parser and everything works well when I pass only correct messages one by one. But when I try to emulate some invalid input, like:
!START,01,2.3,ABC\r\n
trash-message
!START,456.2,890\r\n
The parser doesn't see the messages that follow the unexpected text.
I'm new to boost::spirit and I'd like to know how a parser based on boost::spirit::qi::grammar is supposed to work.
My question is:
Should the parser slide along the input stream and try to find the beginning of a message?
Or should the caller check the parsing result and, in case of failure, advance the iterator and call the parser again?
Many thanks for considering my request.
My question is: Should the parser slide along the input stream and try to find the beginning of a message?
Only when you tell it to. It's called qi::parse, not qi::search. But obviously you can make a grammar ignore things.
Live On Coliru
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>
namespace qi = boost::spirit::qi;
struct Command {
enum Type { START, QUIT, TRASH } type = TRASH;
std::vector<std::string> args;
};
using Commands = std::vector<Command>;
BOOST_FUSION_ADAPT_STRUCT(Command, type, args)
template <typename It> struct CmdParser : qi::grammar<It, Commands()> {
CmdParser() : CmdParser::base_type(commands_) {
type_.add("!start", Command::START);
type_.add("!quit", Command::QUIT);
trash_ = *~qi::char_("\r\n"); // just ignore the entire line
arg_ = *~qi::char_(",\r\n");
command_ = qi::no_case[type_] >> *(',' >> arg_);
commands_ = *((command_ | trash_) >> +qi::eol);
BOOST_SPIRIT_DEBUG_NODES((trash_)(arg_)(command_)(commands_))
}
private:
qi::symbols<char, Command::Type> type_;
qi::rule<It, Commands()> commands_;
qi::rule<It, Command()> command_;
qi::rule<It, std::string()> arg_;
qi::rule<It> trash_;
};
int main() {
std::string_view input = "!START,01,2.3,ABC\r\n"
"trash-message\r\n"
"!START,456.2,890\r\n";
using It = std::string_view::const_iterator;
static CmdParser<It> const parser;
Commands parsed;
auto f = input.begin(), l = input.end();
if (parse(f, l, parser, parsed)) {
std::cout << "Parsed:\n";
for(Command const& cmd : parsed) {
std::cout << cmd.type;
for (auto& arg: cmd.args)
std::cout << ", " << quoted(arg);
std::cout << "\n";
}
} else {
std::cout << "Parse failed\n";
}
if (f != l)
std::cout << "Remaining unparsed: " << quoted(std::string(f, l)) << "\n";
}
Printing
Parsed:
0, "01", "2.3", "ABC"
2
0, "456.2", "890"
I was toying with the Boost.Spirit X3 calculator example when I encountered an error I couldn't get my head around.
I minimized the program to reduce complexity while still producing the same error.
Say I want to parse input as a list of statements (strings), each followed by a delimiter (';').
This is my structure:
namespace client { namespace ast
{
struct program
{
std::list<std::string> stmts;
};
}}
BOOST_FUSION_ADAPT_STRUCT(client::ast::program,
(std::list<std::string>, stmts)
)
The grammar is as follows:
namespace client
{
namespace grammar
{
x3::rule<class program, ast::program> const program("program");
auto const program_def =
*((*char_) > ';')
;
BOOST_SPIRIT_DEFINE(
program
);
auto calculator = program;
}
using grammar::calculator;
}
Invoked
int
main()
{
std::cout <<"///////////////////////////////////////////\n\n";
std::cout << "Expression parser...\n\n";
std::cout << //////////////////////////////////////////////////\n\n";
std::cout << "Type an expression...or [q or Q] to quit\n\n";
typedef std::string::const_iterator iterator_type;
typedef client::ast::program ast_program;
std::string str;
while (std::getline(std::cin, str))
{
if (str.empty() || str[0] == 'q' || str[0] == 'Q')
break;
auto& calc = client::calculator; // Our grammar
ast_program program; // Our program (AST)
iterator_type iter = str.begin();
iterator_type end = str.end();
boost::spirit::x3::ascii::space_type space;
bool r = phrase_parse(iter, end, calc, space, program);
if (r && iter == end)
{
std::cout << "-------------------------\n";
std::cout << "Parsing succeeded\n";
std::cout<< '\n';
std::cout << "-------------------------\n";
}
else
{
std::cout << "-------------------------\n";
std::cout << "Parsing failed\n";
std::cout << "-------------------------\n";
}
}
std::cout << "Bye... :-) \n\n";
return 0;
}
Error I get is
/opt/boost_1_66_0/boost/spirit/home/x3/support/traits/container_traits.hpp: In instantiation of ‘struct boost::spirit::x3::traits::container_value<client::ast::program, void>’:
.
.
.
/opt/boost_1_66_0/boost/spirit/home/x3/support/traits/container_traits.hpp:76:12: error: no type named ‘value_type’ in ‘struct client::ast::program’
struct container_value
/opt/boost_1_66_0/boost/spirit/home/x3/operator/detail/sequence.hpp:497:72: error: no type named ‘type’ in ‘struct boost::spirit::x3::traits::container_value<client::ast::program, void>’
, typename traits::is_substitute<attribute_type, value_type>::type());
^~~~~~
Things I tried:
Following Getting boost::spirit::qi to use stl containers
Even though it uses Qi, I nonetheless tried:
namespace boost{namespace spirit{ namespace traits{
template<>
struct container_value<client::ast::program>
//also with struct container<client::ast::program, void>
{
typedef std::list<std::string> type;
};
}}}
As you can see, I'm kind of in the dark, so, as expected, it was to no avail.
parser2.cpp:41:8: error: ‘container_value’ is not a class template
struct container_value<client::ast::program>
^~~~~~~~~~~~~~~
In the same SO question the author says "There is one known limitation though, when you try to use a struct that has a single element that is also a container compilation fails unless you add qi::eps >> ... to your rule."
I did try adding a dummy eps also without success.
Please, help me decipher what that error means.
Yup. This looks like another limitation with automatic propagation of attributes when single-element sequences are involved.
I'd probably bite the bullet and change the rule definition from what it is (and what you'd expect to work) to:
x3::rule<class program_, std::vector<std::string> >
That removes the root of the confusion.
Other notes:
you had char_, which also eats ';', so the grammar would never succeed because no ';' would ever follow a "statement".
your statements aren't lexemes, so whitespace is discarded (is this what you meant? See Boost spirit skipper issues)
your statements could be empty, which meant parsing would ALWAYS fail at the end of input (where it would always read an empty statement, and then discover that the expected ';' was missing). Fix it by requiring at least 1 character before accepting a statement.
With some simplifications/style changes:
Live On Coliru
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/home/x3.hpp>
#include <list>
namespace x3 = boost::spirit::x3;
namespace ast {
using statement = std::string;
struct program {
std::list<statement> stmts;
};
}
BOOST_FUSION_ADAPT_STRUCT(ast::program, stmts)
namespace grammar {
auto statement
= x3::rule<class statement_, ast::statement> {"statement"}
= +~x3::char_(';');
auto program
= x3::rule<class program_, std::list<ast::statement> > {"program"}
= *(statement >> ';');
}
#include <iostream>
#include <iomanip>
int main() {
std::cout << "Type an expression...or [q or Q] to quit\n\n";
using It = std::string::const_iterator;
for (std::string str; std::getline(std::cin, str);) {
if (str.empty() || str[0] == 'q' || str[0] == 'Q')
break;
auto &parser = grammar::program;
ast::program program; // Our program (AST)
It iter = str.begin(), end = str.end();
if (phrase_parse(iter, end, parser, x3::space, program)) {
std::cout << "Parsing succeeded\n";
for (auto& s : program.stmts) {
std::cout << "Statement: " << std::quoted(s, '\'') << "\n";
}
}
else
std::cout << "Parsing failed\n";
if (iter != end)
std::cout << "Remaining unparsed: " << std::quoted(std::string(iter, end), '\'') << "\n";
}
}
Which for input "a;b;c;d;" prints:
Parsing succeeded
Statement: 'a'
Statement: 'b'
Statement: 'c'
Statement: 'd'
I know how to add token definitions with an identifier:
this->self.add(identifier, ID_IDENTIFIER);
And I know how to add token definitions with a semantic action:
this->self += whitespace [ lex::_pass = lex::pass_flags::pass_ignore ];
Unfortunately this doesn't work:
this->self.add(whitespace
[ lex::_pass = lex::pass_flags::pass_ignore ],
ID_IDENTIFIER);
It gives an error that the token can't be converted to a string (!?):
error C2664: 'const boost::spirit::lex::detail::lexer_def_>::adder &boost::spirit::lex::detail::lexer_def_>::adder::operator ()(wchar_t,unsigned int) const' : cannot convert argument 1 from 'const boost::proto::exprns_::expr' to 'const std::basic_string,std::allocator> &'
Interestingly, the adder in lexer.hpp has an operator () which takes an action as a third parameter – but it's commented out in my version of boost (1.55.0). Does this work in newer versions?
In the absence of this, how would I add token definitions with a semantic action and an ID to the lexer?
Looking at the header files it seems that there are at least two possible approaches:
You can use token_def's id member function in order to set the id after you have defined your token:
ellipses = "\\.\\.\\.";
...
ellipses.id(ID_ELLIPSES);
You can use token_def's two parameters constructor when you define your token:
number = lex::token_def<>("[0-9]+", ID_NUMBER);
And then you can simply add your semantic actions as you did before:
this->self = ellipses[phx::ref(std::cout) << "Found ellipses.\n"] | '(' | ')' | number[phx::ref(std::cout) << "Found: " << phx::construct<std::string>(lex::_start, lex::_end) << '\n'];
The code below is based on Boost.Spirit.Lex example3.cpp with minor changes (marked with //CHANGED or //ADDED) to achieve what you want.
Full Sample (Running on rextester)
#include <iostream>
#include <string>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix.hpp>
using namespace boost::spirit;
namespace phx = boost::phoenix;
enum token_id //ADDED
{
ID_ELLIPSES = lex::min_token_id + 1,
ID_NUMBER
};
///////////////////////////////////////////////////////////////////////////////
// Token definition
///////////////////////////////////////////////////////////////////////////////
template <typename Lexer>
struct example3_tokens : lex::lexer<Lexer>
{
example3_tokens()
{
// define the tokens to match
ellipses = "\\.\\.\\.";
number = lex::token_def<>("[0-9]+", ID_NUMBER); //CHANGED
ellipses.id(ID_ELLIPSES); //CHANGED
// associate the tokens and the token set with the lexer
this->self = ellipses[phx::ref(std::cout) << "Found ellipses.\n"] | '(' | ')' | number[phx::ref(std::cout) << "Found: " << phx::construct<std::string>(lex::_start, lex::_end) << '\n']; //CHANGED
// define the whitespace to ignore (spaces, tabs, newlines and C-style
// comments)
this->self("WS")
= lex::token_def<>("[ \\t\\n]+") // whitespace
| "\\/\\*[^*]*\\*+([^/*][^*]*\\*+)*\\/" // C style comments
;
}
// these tokens expose the iterator_range of the matched input sequence
lex::token_def<> ellipses, identifier, number;
};
///////////////////////////////////////////////////////////////////////////////
// Grammar definition
///////////////////////////////////////////////////////////////////////////////
template <typename Iterator, typename Lexer>
struct example3_grammar
: qi::grammar<Iterator, qi::in_state_skipper<Lexer> >
{
template <typename TokenDef>
example3_grammar(TokenDef const& tok)
: example3_grammar::base_type(start)
{
start
= +(couplet | qi::token(ID_ELLIPSES)) //CHANGED
;
// A couplet matches nested left and right parenthesis.
// For example:
// (1) (1 2) (1 2 3) ...
// ((1)) ((1 2)(3 4)) (((1) (2 3) (1 2 (3) 4))) ...
// (((1))) ...
couplet
= qi::token(ID_NUMBER) //CHANGED
| '(' >> +couplet >> ')'
;
BOOST_SPIRIT_DEBUG_NODE(start);
BOOST_SPIRIT_DEBUG_NODE(couplet);
}
qi::rule<Iterator, qi::in_state_skipper<Lexer> > start, couplet;
};
///////////////////////////////////////////////////////////////////////////////
int main()
{
// iterator type used to expose the underlying input stream
typedef std::string::iterator base_iterator_type;
// This is the token type to return from the lexer iterator
typedef lex::lexertl::token<base_iterator_type> token_type;
// This is the lexer type to use to tokenize the input.
// Here we use the lexertl based lexer engine.
typedef lex::lexertl::actor_lexer<token_type> lexer_type; //CHANGED
// This is the token definition type (derived from the given lexer type).
typedef example3_tokens<lexer_type> example3_tokens;
// this is the iterator type exposed by the lexer
typedef example3_tokens::iterator_type iterator_type;
// this is the type of the grammar to parse
typedef example3_grammar<iterator_type, example3_tokens::lexer_def> example3_grammar;
// now we use the types defined above to create the lexer and grammar
// object instances needed to invoke the parsing process
example3_tokens tokens; // Our lexer
example3_grammar calc(tokens); // Our parser
std::string str ="(1) (1 2) (1 2 3) ... ((1)) ((1 2)(3 4)) (((1) (2 3) (1 2 (3) 4))) ... (((1))) ..."; //CHANGED
// At this point we generate the iterator pair used to expose the
// tokenized input stream.
std::string::iterator it = str.begin();
iterator_type iter = tokens.begin(it, str.end());
iterator_type end = tokens.end();
// Parsing is done based on the token stream, not the character
// stream read from the input.
// Note how we use the lexer defined above as the skip parser.
bool r = qi::phrase_parse(iter, end, calc, qi::in_state("WS")[tokens.self]);
if (r && iter == end)
{
std::cout << "-------------------------\n";
std::cout << "Parsing succeeded\n";
std::cout << "-------------------------\n";
}
else
{
std::cout << "-------------------------\n";
std::cout << "Parsing failed\n";
std::cout << "-------------------------\n";
}
std::cout << "Bye... :-) \n\n";
return 0;
}
As part of a simple skeleton utility I'm hacking on, I have a grammar for triggering substitutions in text. I thought it would be a wonderful way to get comfortable with Boost.Spirit, but the template errors are a joy of a unique kind.
Here is the code in its entirety:
#include <iostream>
#include <iterator>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace bsq = boost::spirit::qi;
namespace {
template<typename Iterator>
struct skel_grammar : public bsq::grammar<Iterator> {
skel_grammar();
private:
bsq::rule<Iterator> macro_b;
bsq::rule<Iterator> macro_e;
bsq::rule<Iterator, bsq::ascii::space_type> id;
bsq::rule<Iterator> macro;
bsq::rule<Iterator> text;
bsq::rule<Iterator> start;
};
template<typename Iterator>
skel_grammar<Iterator>::skel_grammar() : skel_grammar::base_type(start)
{
text = bsq::no_skip[+(bsq::char_ - macro_b)[bsq::_val += bsq::_1]];
macro_b = bsq::lit("<<");
macro_e = bsq::lit(">>");
macro %= macro_b >> id >> macro_e;
id %= -(bsq::ascii::alpha | bsq::char_('_'))
>> +(bsq::ascii::alnum | bsq::char_('_'));
start = *(text | macro);
}
} // namespace
int main(int argc, char* argv[])
{
std::string input((std::istreambuf_iterator<char>(std::cin)),
std::istreambuf_iterator<char>());
skel_grammar<std::string::iterator> grammar;
bool r = bsq::parse(input.begin(), input.end(), grammar);
std::cout << std::boolalpha << r << '\n';
return 0;
}
What's wrong with this code?
Mmm. I feel that we have discussed a few more details in chat than have been reflected in the question as it is.
Let me entertain you with my 'toy' implementation, complete with test cases, of a grammar that will recognize <<macros>> like this, including nested expansion of the same.
Notable features:
Expansion is done using a callback (process()), giving you maximum flexibility (you could use a lookup table, cause parsing to fail depending on the macro content, or even have side effects independent of the output)
the parser is optimized to favour streaming mode. Look at spirit::istream_iterator for how to parse input in streaming mode (Stream-based Parsing Made Easy); a short sketch follows this list. This has obvious benefits if your input stream is 10 GB and contains only 4 macros: it is the difference between crawling performance (or running out of memory) and just scaling.
note that the demo still writes to a string buffer (via oss). You could, however, easily hook the output directly to std::cout or, say, an std::ofstream instance
Expansion is done eagerly, so you can have nifty effects using indirect macros. See the testcases
I even demoed a simplistic way to support escaping the << or >> delimiters (#define SUPPORT_ESCAPES)
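Here is a minimal sketch of what those two variations (streaming input, output hooked straight to std::cout) could look like, replacing the corresponding lines of main() in the demo below. This is an assumption on my part, not part of the original demo; it additionally needs boost/spirit/include/support_istream_iterator.hpp:
#include <boost/spirit/include/support_istream_iterator.hpp>
// ...
std::ostream_iterator<char> out(std::cout);        // write expansions straight to stdout instead of the oss buffer
boost::spirit::istream_iterator f(std::cin), l;    // default-constructed iterator is the end-of-stream marker
skel_grammar<boost::spirit::istream_iterator, std::ostream_iterator<char> > grammar(out);
bool r = qi::parse(f, l, grammar);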
Without further ado:
The Code
Note that, due to laziness, I require -std=c++0x, but only when SUPPORT_ESCAPES is defined
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
namespace qi = boost::spirit::qi;
namespace phx= boost::phoenix;
namespace fsn= boost::fusion;
namespace
{
#define SUPPORT_ESCAPES
static bool process(std::string& macro)
{
if (macro == "error") {
return false; // fail the parse
}
if (macro == "hello") {
macro = "bye";
} else if (macro == "bye") {
macro = "We meet again";
} else if (macro == "sideeffect") {
std::cerr << "this is a side effect while parsing\n";
macro = "(done)";
} else if (std::string::npos != macro.find('~')) {
std::reverse(macro.begin(), macro.end());
macro.erase(std::remove(macro.begin(), macro.end(), '~'), macro.end()); // erase-remove: strip all '~' characters
} else {
macro = std::string("<<") + macro + ">>"; // this makes the unsupported macros appear unchanged
}
return true;
}
template<typename Iterator, typename OutIt>
struct skel_grammar : public qi::grammar<Iterator>
{
struct fastfwd {
template<typename,typename> struct result { typedef bool type; };
template<typename R, typename O>
bool operator()(const R&r,O& o) const
{
#ifndef SUPPORT_ESCAPES
o = std::copy(r.begin(),r.end(),o);
#else
auto f = std::begin(r), l = std::end(r);
while(f!=l)
{
if (('\\'==*f) && (l == ++f))
break;
*o++ = *f++;
}
#endif
return true; // false to fail the parse
}
} copy;
skel_grammar(OutIt& out) : skel_grammar::base_type(start)
{
using namespace qi;
#ifdef SUPPORT_ESCAPES
rawch = ('\\' >> char_) | char_;
#else
# define rawch qi::char_
#endif
macro = ("<<" >> (
(*(rawch - ">>" - "<<") [ _val += _1 ])
% macro [ _val += _1 ] // allow nests
) >>
">>")
[ _pass = phx::bind(process, _val) ];
start =
raw [ +(rawch - "<<") ] [ _pass = phx::bind(copy, _1, phx::ref(out)) ]
% macro [ _pass = phx::bind(copy, _1, phx::ref(out)) ]
;
BOOST_SPIRIT_DEBUG_NODE(start);
BOOST_SPIRIT_DEBUG_NODE(macro);
# undef rawch
}
private:
#ifdef SUPPORT_ESCAPES
qi::rule<Iterator, char()> rawch;
#endif
qi::rule<Iterator, std::string()> macro;
qi::rule<Iterator> start;
};
}
int main(int argc, char* argv[])
{
std::string input =
"Greeting is <<hello>> world!\n"
"Side effects are <<sideeffect>> and <<other>> vars are untouched\n"
"Empty <<>> macros are ok, as are stray '>>' pairs.\n"
"<<nested <<macros>> (<<hello>>?) work>>\n"
"The order of expansion (evaluation) is _eager_: '<<<<hello>>>>' will expand to the same as '<<bye>>'\n"
"Lastly you can do algorithmic stuff too: <<!esrever ~ni <<hello>>>>\n"
#ifdef SUPPORT_ESCAPES // bonus: escapes
"You can escape \\<<hello>> (not expanded to '<<hello>>')\n"
"Demonstrate how it <<avoids <\\<nesting\\>> macros>>.\n"
#endif
;
std::ostringstream oss;
std::ostream_iterator<char> out(oss);
skel_grammar<std::string::iterator, std::ostream_iterator<char> > grammar(out);
std::string::iterator f(input.begin()), l(input.end());
bool r = qi::parse(f, l, grammar);
std::cout << "parse result: " << (r?"success":"failure") << "\n";
if (f!=l)
std::cout << "unparsed remaining: '" << std::string(f,l) << "'\n";
std::cout << "Streamed output:\n\n" << oss.str() << '\n';
return 0;
}
The Test Output
this is a side effect while parsing
parse result: success
Streamed output:
Greeting is bye world!
Side effects are (done) and <<other>> vars are untouched
Empty <<>> macros are ok, as are stray '>>' pairs.
<<nested <<macros>> (bye?) work>>
The order of expansion (evaluation) is _eager_: 'We meet again' will expand to the same as 'We meet again'
Lastly you can do algorithmic stuff too: eyb in reverse!
You can escape <<hello>> (not expanded to 'bye')
Demonstrate how it <<avoids <<nesting>> macros>>.
There is quite a lot of functionality hidden there to grok. I suggest you look at the test cases and the process() callback alongside each other to see what is going on.
Cheers & HTH :)
I would like to write a boost::spirit parser that parses a simple string in double quotes that uses escaped double quotes, e.g. "a \"b\" c".
Here is what I tried:
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
namespace client
{
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
template <typename Iterator>
bool parse(Iterator first, Iterator last)
{
using qi::char_;
qi::rule< Iterator, std::string(), ascii::space_type > text;
qi::rule< Iterator, std::string() > content;
qi::rule< Iterator, char() > escChar;
text = '"' >> content >> '"';
content = +(~char_('"') | escChar);
escChar = '\\' >> char_("\"");
bool r = qi::phrase_parse(first, last, text, ascii::space);
if (first != last) // fail if we did not get a full match
return false;
return r;
}
}
int main() {
std::string str = "\"a \\\"b\\\" c\"";
if (client::parse(str.begin(), str.end()))
std::cout << str << " Parses OK: " << std::endl;
else
std::cout << "Fail\n";
return 0;
}
It follows the example on Parsing escaped strings with boost spirit, but the output is "Fail". How can I get it to work?
Been a while since I had a go at spirit, but I think one of your rules is the wrong way round.
Try:
content = +(escChar | ~char_('"'))
instead of:
content = +(~char_('"') | escChar)
It is matching your \ using ~char_('"') and therefore never gets around to checking whether escChar matches. It then reads the next " as the end of the string and stops parsing.
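For reference, the rules in the question's parse function with that reordering applied would read as follows (only the content rule changes; everything else stays as posted):
text    = '"' >> content >> '"';
content = +(escChar | ~char_('"')); // try the escape sequence first, then fall back to any non-quote character
escChar = '\\' >> char_("\"");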