Boost Spirit Qi track line and parse unicode


I want to trace input position and input line for unicode strings.
For the position I store an iterator to the beginning and call std::distance at the desired position. That works well as long as the input is not unicode. With unicode symbols the position gets shifted, e.g. ä takes two bytes in the UTF-8 input stream, so the position is off by 1. So I switched to boost::u8_to_u32_iterator, and this works fine.
For the line I use boost::spirit::line_pos_iterator which also works well.
My problem is in combining both concepts to use the line iterator and the unicode iterator. Another solution allowing pos and line on unicode strings is of course also welcome.
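The off-by-one described above comes from std::distance counting bytes, while the intended position is in code points. As a minimal sketch using only the standard library (code_point_index is a hypothetical helper, not part of Boost, and it assumes the input is valid UTF-8), a byte offset can be converted by skipping UTF-8 continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost API): converts a byte offset into a
// code-point index by counting every byte that is not a UTF-8 continuation
// byte (bit pattern 10xxxxxx). Assumes valid UTF-8 input.
std::size_t code_point_index(const std::string& s, std::size_t byte_offset) {
    std::size_t index = 0;
    for (std::size_t i = 0; i < byte_offset && i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) ++index;  // ASCII or lead byte
    }
    return index;
}
```

For the UTF-8 input "Hallo ä!" the '!' sits at byte offset 8, and code_point_index(input, 8) yields 7, because ä occupies two bytes.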
Here is a small example for the unicode parser; as said I would like to wrap the iterator additionally with boost::spirit::line_pos_iterator but that doesn't even compile.
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace phx = boost::phoenix;
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iostream>
#include <string>
//==============================================================================
std::string to_utf8(const std::u32string& input) {
return std::string(
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}
//==============================================================================
int main() {
std::string input(u8"Hallo äöüß");
typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator_type;
iterator_type first(input.begin()), last(input.end());
qi::rule<iterator_type, std::u32string()> string_u32 = *(qi::char_ - qi::eoi);
qi::rule<iterator_type, std::string()> string =
string_u32[qi::_val = phx::bind(&to_utf8, qi::_1)];
qi::rule<iterator_type, std::string()> rule = string;
std::string ast;
bool result = qi::parse(first, last, rule, ast);
if (result) {
result = first == last;
}
if (result) {
std::cout << "Parsed: " << ast << std::endl;
} else {
std::cout << "Failure" << std::endl;
}
}

Update: demo added, Live on Coliru
I see the same problem when you try to wrap iterator_type in a line_pos_iterator.
After some thinking, I don't quite know what causes it (it might be possible to get around this by wrapping the u8_to_u32 converting iterator adapter inside a boost::spirit::multi_pass<> iterator adapter, but... that sounded so unwieldy I haven't even tried).
Instead, I think that the nature of line-breaking is that it is (mostly?) charset agnostic. So you could wrap the source iterator with line_pos_iterator first, before the encoding conversion.
This does compile. Of course, then you'll get position information in terms of the source iterators, not 'logical characters'[1].
Let me show a demonstration below. It parses space separated words into a vector of strings. The simplest way to show the position information was to use a vector of iterator_ranges instead of just strings. I used qi::raw[] to expose the iterators[2].
So after a successful parse I loop through the matched ranges and print their location information. First, I print the actual positions reported from line_pos_iterators. Remember these are 'raw' byte offsets, since the source iterator is byte-oriented.
Next, I do a little dance with get_current_line and the u8_to_u32 conversion to translate the offset within the line to a (more) logical count. You'll see that the range for e.g. 'äöüß' is L1:7-L1:15 in raw bytes but 6-10 in converted positions.
Note: I currently assume that ranges do not cross line boundaries (that is true for this grammar). Otherwise one would need to extract and convert two lines. The way I'm doing that now is rather expensive. Consider optimizing by e.g. using Boost String Algorithm's find_all facilities. You can build a list of line-ends and use std::lower_bound to locate the current line slightly more efficiently.
Note: There might be issues with the implementations of get_line_start and get_current_line; if you notice anything like this, there's a 10-line patch over at the [spirit-general] user list that you could try.
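The line-end table suggested in the first note could look roughly like the sketch below. It is standard-library only and works on plain byte offsets; line_starts and locate are hypothetical names, only '\n' is treated as a line break, and std::upper_bound is used instead of std::lower_bound because the table stores line starts rather than line ends.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: byte offsets at which each line starts.
std::vector<std::size_t> line_starts(const std::string& text) {
    std::vector<std::size_t> starts(1, 0);  // line 1 starts at offset 0
    for (std::size_t i = 0; i < text.size(); ++i)
        if (text[i] == '\n') starts.push_back(i + 1);
    return starts;
}

// Returns {1-based line number, 0-based byte offset within that line}.
std::pair<std::size_t, std::size_t>
locate(const std::vector<std::size_t>& starts, std::size_t offset) {
    // Find the first line start greater than offset; the line before it
    // is the one that contains offset.
    std::vector<std::size_t>::const_iterator it =
        std::upper_bound(starts.begin(), starts.end(), offset);
    const std::size_t line = static_cast<std::size_t>(it - starts.begin());
    return std::make_pair(line, offset - starts[line - 1]);
}
```

The table is built once per input, after which each lookup is a binary search instead of a backwards scan for the previous line break.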
Without further ado, the code and the output:
#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/phoenix/function/adapt_function.hpp>
namespace phx = boost::phoenix;
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace encoding = boost::spirit::unicode;
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iostream>
#include <string>
//==============================================================================
std::string to_utf8(const std::u32string& input) {
return std::string(
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.begin()),
boost::u32_to_u8_iterator<std::u32string::const_iterator>(input.end()));
}
BOOST_PHOENIX_ADAPT_FUNCTION(std::string, to_utf8_, to_utf8, 1)
//==============================================================================
int main() {
std::string input(u8"Hallo äöüß\n¡Bye! ✿➂➿♫");
typedef boost::spirit::line_pos_iterator<std::string::const_iterator> source_iterator;
typedef boost::u8_to_u32_iterator<source_iterator> iterator_type;
source_iterator soi(input.begin()),
eoi(input.end());
iterator_type first(soi),
last(eoi);
qi::rule<iterator_type, std::u32string()> string_u32 = +encoding::graph;
qi::rule<iterator_type, std::string()> string = string_u32 [qi::_val = to_utf8_(qi::_1)];
std::vector<boost::iterator_range<iterator_type> > ast;
// note the trick with `raw` to expose the iterators
bool result = qi::phrase_parse(first, last, *qi::raw[ string ], encoding::space, ast);
if (result) {
for (auto const& range : ast)
{
source_iterator
base_b(range.begin().base()),
base_e(range.end().base());
auto lbound = get_line_start(soi, base_b);
// RAW access to the base iterators:
std::cout << "Fragment: '" << std::string(base_b, base_e) << "'\t"
<< "raw: L" << get_line(base_b) << ":" << get_column(lbound, base_b, /*tabs:*/4)
<< "-L" << get_line(base_e) << ":" << get_column(lbound, base_e, /*tabs:*/4);
// "cooked" access:
auto line = get_current_line(lbound, base_b, eoi);
// std::cout << "Line: '" << line << "'\n";
// iterator_type is an alias for u8_to_u32_iterator<...>
size_t cur_pos = 0, start_pos = 0, end_pos = 0;
for(iterator_type it = line.begin(), _eol = line.end(); ; ++it, ++cur_pos)
{
if (it.base() == base_b) start_pos = cur_pos;
if (it.base() == base_e) end_pos = cur_pos;
if (it == _eol)
break;
}
std::cout << "\t// in u32 code _units_: positions " << start_pos << "-" << end_pos << "\n";
}
std::cout << "\n";
} else {
std::cout << "Failure" << std::endl;
}
if (first!=last)
{
std::cout << "Remaining: '" << std::string(first, last) << "'\n";
}
}
The output:
clang++ -std=c++11 -Os main.cpp && ./a.out
Fragment: 'Hallo' raw: L1:1-L1:6 // in u32 code _units_: positions 0-5
Fragment: 'äöüß' raw: L1:7-L1:15 // in u32 code _units_: positions 6-10
Fragment: '¡Bye!' raw: L2:2-L2:8 // in u32 code _units_: positions 1-6
Fragment: '✿➂➿♫' raw: L2:9-L2:21 // in u32 code _units_: positions 7-11
[1] I think there's not a useful definition of what a character is in this context. There are bytes, code units, code points, grapheme clusters, possibly more. Suffice it to say that the source iterator (std::string::const_iterator) deals with bytes (since it is charset/encoding unaware). In u32string you can /almost/ assume that a single position is roughly a code-point (although I think (?) that for >L2 UNICODE support you still would have to support code points combined from multiple code units).
[2] This means that currently the attribute conversion and the semantic action are redundant, but you'll get that :)

Related

Boost Spirit X3: skip parser that would do nothing

I'm familiarizing myself with Boost Spirit X3. My question is how to express that you don't want to use a skip parser at all.
Consider a simple example of parsing comma-separated sequence of integers:
#include <iostream>
#include <string>
#include <vector>
#include <boost/spirit/home/x3.hpp>
int main()
{
using namespace boost::spirit::x3;
const std::string input{"2,4,5"};
const auto parser = int_ % ',';
std::vector<int> numbers;
auto start = input.cbegin();
auto r = phrase_parse(start, input.end(), parser, space, numbers);
if(r && start == input.cend())
{
// success
for(const auto &item: numbers)
std::cout << item << std::endl;
return 0;
}
std::cerr << "Input was not parsed successfully" << std::endl;
return 1;
}
This works totally fine. However, I would like to forbid spaces in between (i.e. "2, 4,5" should fail to parse).
I tried using eps as a skip parser in phrase_parse, but as you can guess, the program ended up in an infinite loop because eps matches the empty string.
The solution I found is to use the no_skip directive (https://www.boost.org/doc/libs/1_75_0/libs/spirit/doc/html/spirit/qi/reference/directive/no_skip.html). So the parser now becomes:
const auto parser = no_skip[int_ % ','];
This works fine, but I don't find it an elegant solution (especially passing the "space" parser to phrase_parse when I want no whitespace skipping). Are there no skip parsers that would simply do nothing? Am I missing something?
Thanks for your time. Looking forward to any replies.

You can use either no_skip[] or lexeme[]. They're almost identical, except for pre-skip (Boost Spirit lexeme vs no_skip).
Are there no skip parsers that would simply do nothing? Am I missing something?
A wild guess, but you might be missing the parse API that doesn't accept a skipper in the first place
Live On Coliru
#include <iostream>
#include <iomanip>
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;
int main() {
std::string const input{ "2,4,5" };
auto f = begin(input), l = end(input);
const auto parser = x3::int_ % ',';
std::vector<int> numbers;
auto r = parse(f, l, parser, numbers);
if (r) {
// success
for (const auto& item : numbers)
std::cout << item << std::endl;
} else {
std::cerr << "Input was not parsed successfully" << std::endl;
return 1;
}
if (f!=l) {
std::cout << "Remaining input " << std::quoted(std::string(f,l)) << "\n";
return 2;
}
}
Prints
2
4
5

Boost regex to capture all repeated patterns

I have a simple text file with the following contents:
VALUE "foo"
ANOTHERVALUE "bar"
YETANOTHERVALUE "barbar"
The values in column 1 are known.
I want to capture everything in column 1 and in column 2.
My solution involves manually writing all the possible values of column 1 (which is known), into the regex string but obviously this is not ideal practice since I am basically repeating code and this does not allow the ordering to be flexible:
const char* re =
"^[[:space:]]*"
"(VALUE)[[:space:]]*\"(.*)\"[[:space:]]*"
"(ANOTHERVALUE)[[:space:]]*\"(.*)\"[[:space:]]*"
"(YETANOTHERVALUE)[[:space:]]*\"(.*)\"[[:space:]]*";

I'm citing commenter Igor Tandetnik here, because he almost gave the complete answer in his comment:
Regular expressions capture exactly as many substrings as there are
left parentheses in the expression [...]
The right way to solve this problem is to write a regex that matches a
single pair, [...]
\s*([a-zA-Z]+)\s*"(.*?)"
Notes:
\s is equivalent to [[:space:]]
.*? is used to stop searching after the 2nd " instead of the last " in the string
and apply it repeatedly, e.g. via std::regex_iterator
The boost equivalent is boost::regex_iterator.
#include <iostream>
#include <string>
#include <algorithm>
#include <boost/regex.hpp>
const boost::regex expr{ R"__(\s*([a-zA-Z]+)\s*"(.*?)")__" };
const std::string s =
R"(VALUE "foo"
ANOTHERVALUE "bar"
YETANOTHERVALUE "barbar"
)";
int main() {
boost::sregex_iterator it{ begin(s), end(s), expr }, itEnd;
std::for_each( it, itEnd, []( const boost::smatch& m ){
std::cout << m[1] << '\n' << m[2] << std::endl;
});
}
Live demo.
Notes:
I'm using raw string literals to make the code cleaner.
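Because the quoted comment mentions std::regex_iterator, here is a sketch of the same loop written against the standard <regex> header instead of Boost.Regex. The function name extract_pairs is an assumption for illustration; the pattern is the one from the answer above.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects every (key, quoted value) pair in text by
// applying a single-pair regex repeatedly.
std::vector<std::pair<std::string, std::string> >
extract_pairs(const std::string& text) {
    // The lazy .*? stops at the next '"' rather than the last one.
    static const std::regex pair_re(R"re(\s*([a-zA-Z]+)\s*"(.*?)")re");
    std::vector<std::pair<std::string, std::string> > result;
    for (std::sregex_iterator it(text.begin(), text.end(), pair_re), end;
         it != end; ++it) {
        result.push_back(std::make_pair((*it)[1].str(), (*it)[2].str()));
    }
    return result;
}
```

Calling extract_pairs on the three-line sample yields the pairs (VALUE, foo), (ANOTHERVALUE, bar) and (YETANOTHERVALUE, barbar) in input order.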

I would use a little Spirit Parser here:
Reading Into A Map
Live On Coliru
#include <boost/fusion/adapted/std_pair.hpp> // reading maps
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <iostream>
#include <fstream>
#include <map>
auto read_config_map(std::istream& stream) {
std::map<std::string, std::string> settings;
boost::spirit::istream_iterator f(stream >> std::noskipws), l;
using namespace boost::spirit::x3;
auto key_ = lexeme [ +upper ];
auto value_ = lexeme [ '"' >> *~char_('"') >> '"' ];
if (!phrase_parse(f, l, -(key_ >> value_) % eol >> eoi, blank, settings))
throw std::invalid_argument("cannot parse config map");
return settings;
}
auto read_config_map(std::string const& fname) {
std::ifstream stream(fname);
return read_config_map(stream);
}
int main() {
for (auto&& entry : read_config_map(std::cin))
std::cout << "Key:'" << entry.first << "' Value:'" << entry.second << "'\n";
}
Prints:
Key:'ANOTHERVALUE' Value:'bar'
Key:'VALUE' Value:'foo'
Key:'YETANOTHERVALUE' Value:'barbar'

boost spirit qi parser failed in release and pass in debug

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <iostream>
using namespace boost::spirit;
int main()
{
std::string s;
std::getline(std::cin, s);
auto specialtxt = *(qi::char_('-', '.', '_'));
auto txt = no_skip[*(qi::char_("a-zA-Z0-9_.\\:$\'-"))];
auto anytxt = *(qi::char_("a-zA-Z0-9_.\\:${}[]+/()-"));
qi::rule <std::string::iterator, void(),ascii::space_type> rule2 = txt ('=') >> ('[') >> (']');
auto begin = s.begin();
auto end = s.end();
if (qi::phrase_parse(begin, end, rule2, ascii::space))
{
std::cout << "MATCH" << std::endl;
}
else
{
std::cout << "NO MATCH" << std::endl;
}
}
This code works fine in debug mode, but the parser fails in release mode.
The rule is meant to parse just text=[]; anything else should fail. It works fine in debug mode, but in release mode it reports NO MATCH for any string.
If I enter a string like
abc=[];
it passes in debug as expected but fails in release.

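You can't use auto with Spirit v2 (see: Assigning parsers to auto variables).
The expression templates keep references to temporaries that no longer exist once the declaration statement ends, so you have Undefined Behaviour. That is why the result differs between debug and release builds.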
DEMO
I tried to make (more) sense of the rest of the code. There were various instances that would never work:
txt('=') is an invalid Qi expression. I assumed you wanted txt >> ('=') instead
qi::char_("a-zA-Z0-9_.\\:$\\-{}[]+/()") doesn't do what you think because $-{ is actually the character "range" \x24-\x7b... Escape the - (or put it at the very end/start of the set like in the other char_ call).
qi::char_('-','.','_') can't work. Did you mean qi::char_("-._")?
specialtxt and anytxt were unused...
prefer const_iterator
prefer namespace aliases above using namespace to prevent hard-to-detect errors
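To illustrate the range pitfall from the list above, here is a simplified model of how an "a-z"-style set definition expands. It is an assumption about the general convention (x-y denotes a range, a '-' at the very start or end is literal), written with the standard library only; expand_set is a hypothetical helper and not Spirit's implementation.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Simplified model of a character-set definition string (NOT Spirit's code):
// "x-y" adds every character from x to y, and a '-' with no character on one
// side of it is taken literally.
std::bitset<256> expand_set(const std::string& def) {
    std::bitset<256> set;
    for (std::size_t i = 0; i < def.size(); ++i) {
        const unsigned char lo = static_cast<unsigned char>(def[i]);
        if (i + 2 < def.size() && def[i + 1] == '-') {
            // A dash with a character on both sides: treat as a range.
            const unsigned char hi = static_cast<unsigned char>(def[i + 2]);
            for (unsigned int c = lo; c <= hi; ++c) set.set(c);
            i += 2;  // skip the dash and the upper bound
        } else {
            set.set(lo);  // single literal character
        }
    }
    return set;
}
```

With this model, expand_set("$-{") contains 'A' because 'A' lies between '$' (0x24) and '{' (0x7b), while expand_set("${-") contains only the three characters '$', '{' and '-'.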
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <iostream>
namespace qi = boost::spirit::qi;
int main() {
std::string const s = "abc=[];";
auto specialtxt = qi::copy(*(qi::char_("-._")));
auto anytxt = qi::copy(*(qi::char_("a-zA-Z0-9_.\\:$\\-{}[]+/()")));
(void) specialtxt;
(void) anytxt;
auto txt = qi::copy(qi::no_skip[*(qi::char_("a-zA-Z0-9_.\\:$\'-"))]);
qi::rule<std::string::const_iterator, qi::space_type> rule2 = txt >> '=' >> '[' >> ']';
auto begin = s.begin();
auto end = s.end();
if (qi::phrase_parse(begin, end, rule2, qi::space)) {
std::cout << "MATCH" << std::endl;
} else {
std::cout << "NO MATCH" << std::endl;
}
if (begin != end) {
std::cout << "Trailing unparsed: '" << std::string(begin, end) << "'\n";
}
}
Printing
MATCH
Trailing unparsed: ';'

How to split_regex only once?

The function template boost::algorithm::split_regex splits a string at every substring that matches the regex pattern passed to it. The question is: how can I split only once, at the first substring that matches? That is, is it possible to make split_regex stop after its first split? Please see the following code.
#include <boost/algorithm/string/regex.hpp>
#include <boost/format.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <locale>
int main(int argc, char *argv[])
{
using namespace std;
using boost::regex;
locale::global(locale(""));
// Create a standard string for experiment.
string strRequestLine("Host: 192.168.0.1:12345");
regex pat(R"(:\s*)", regex::perl | boost::regex_constants::match_stop);
// Try to split the request line.
vector<string> coll;
boost::algorithm::split_regex(coll, strRequestLine, pat);
// Output what we got.
for (const auto& elt : coll)
cout << boost::format("{%s}\n") % elt;
// Exit the program.
return 0;
}
How should the code be modified to produce the output
{Host}
{192.168.0.1:12345}
instead of the current output
{Host}
{192.168.0.1}
{12345}
Any suggestion/hint? Thanks.
Please note that I'm not asking how to do it with other functions or patterns. I'm asking whether split_regex can split only once and then stop. Because the regex object seems able to stop at the first match, I wonder whether passing it the proper flags would make it stop at the first match.

For your specific input, the simple fix seems to be changing the pattern to R"(:\s+)". Of course, this assumes that there is at least one space after Host: and no space between the IP address and the port.
Another alternative would be not to use split_regex() but rather std::regex_match():
#include <iostream>
#include <regex>
#include <string>
int main()
{
std::string strRequestLine("Host: 192.168.0.1:12345");
std::smatch results;
if (std::regex_match(strRequestLine, results, std::regex(R"(([^:]*):\s*(.*))"))) {
for (auto it(++results.begin()), end(results.end()); it != end; ++it) {
std::cout << "{" << *it << "}\n";
}
}
}
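If leaving split_regex is acceptable, another way to split only at the first match is std::regex_search, whose match result exposes the text before and after the match. The sketch below uses the standard <regex> header; split_once is a hypothetical helper name, and returning the whole input with an empty second part when nothing matches is an assumption about the desired behaviour.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: splits text at the FIRST match of separator only.
// If there is no match, the whole text is returned as the first part.
std::pair<std::string, std::string>
split_once(const std::string& text, const std::regex& separator) {
    std::smatch m;
    if (!std::regex_search(text, m, separator))
        return std::make_pair(text, std::string());
    // prefix() is everything before the match, suffix() everything after it.
    return std::make_pair(m.prefix().str(), m.suffix().str());
}
```

For "Host: 192.168.0.1:12345" and the separator pattern :\s*, this gives "Host" and "192.168.0.1:12345".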

Expanding from my comment:
You might be interested in the first sample I listed here: small HTTP response headers parsing function. Summary: use phrase_parse(f, e, token >> ':' >> lexeme[*(char_ - eol)], space, key, value)
Here's a simple sample:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
namespace {
typedef std::string::const_iterator It;
// 2.2 Basic Rules (rfc1945)
static const qi::rule<It, std::string()> rfc1945_token = +~qi::char_( " \t><#,;:\\\"/][?=}{:"); // FIXME? should filter CTLs
}
#include <iostream>
int main()
{
std::string const strRequestLine("Host: 192.168.0.1:12345");
std::string::const_iterator f(strRequestLine.begin()), l(strRequestLine.end());
std::string key, value;
if (qi::phrase_parse(f, l, rfc1945_token >> ':' >> qi::lexeme[*(qi::char_ - qi::eol)], qi::space, key, value))
std::cout << "'" << key << "' -> '" << value << "'\n";
}
Prints
'Host' -> '192.168.0.1:12345'

Boost Spirit Signals Successful Parsing Despite Token Being Incomplete

I have a very simple path construct that I am trying to parse with boost spirit.lex.
We have the following grammar:
token := [a-z]+
path := (token : path) | (token)
So we're just talking about colon separated lower-case ASCII strings here.
I have three examples "xyz", "abc:xyz", "abc:xyz:".
The first two should be deemed valid. The third one, which has a trailing colon, should not be deemed valid. Unfortunately the parser I have recognizes all three as being valid. The grammar should not allow an empty token, but apparently spirit is doing just that. What am I missing to get the third one rejected?
Also, if you read the code below, in comments there is another version of the parser that demands that all paths end with semi-colons. I can get appropriate behavior when I activate those lines, (i.e. rejection of "abc:xyz:;"), but this is not really what I want.
Anyone have any ideas?
Thanks.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <iostream>
#include <string>
using namespace boost::spirit;
using boost::phoenix::val;
template<typename Lexer>
struct PathTokens : boost::spirit::lex::lexer<Lexer>
{
PathTokens()
{
identifier = "[a-z]+";
separator = ":";
this->self.add
(identifier)
(separator)
(';')
;
}
boost::spirit::lex::token_def<std::string> identifier, separator;
};
template <typename Iterator>
struct PathGrammar
: boost::spirit::qi::grammar<Iterator>
{
template <typename TokenDef>
PathGrammar(TokenDef const& tok)
: PathGrammar::base_type(path)
{
using boost::spirit::_val;
path
=
(token >> tok.separator >> path)[std::cerr << _1 << "\n"]
|
//(token >> ';')[std::cerr << _1 << "\n"]
(token)[std::cerr << _1 << "\n"]
;
token
= (tok.identifier) [_val=_1]
;
}
boost::spirit::qi::rule<Iterator> path;
boost::spirit::qi::rule<Iterator, std::string()> token;
};
int main()
{
typedef std::string::iterator BaseIteratorType;
typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::string> > TokenType;
typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
typedef PathTokens<LexerType>::iterator_type TokensIterator;
typedef std::vector<std::string> Tests;
Tests paths;
paths.push_back("abc");
paths.push_back("abc:xyz");
paths.push_back("abc:xyz:");
/*
paths.clear();
paths.push_back("abc;");
paths.push_back("abc:xyz;");
paths.push_back("abc:xyz:;");
*/
for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
{
std::string str = *iter;
std::cerr << "*****" << str << "*****\n";
PathTokens<LexerType> tokens;
PathGrammar<TokensIterator> grammar(tokens);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
bool r = boost::spirit::lex::tokenize_and_parse(first, last, tokens, grammar);
std::cerr << r << " " << (first==last) << "\n";
}
}

In addition to what llonesmiz already said, here's a trick using qi::eoi that I sometimes use:
path = (
(token >> tok.separator >> path) [std::cerr << _1 << "\n"]
| token [std::cerr << _1 << "\n"]
) >> eoi;
This makes the grammar require eoi (end-of-input) at the end of a successful match. This leads to the desired result:
http://liveworkspace.org/code/23a7adb11889bbb2825097d7c553f71d
*****abc*****
abc
1 1
*****abc:xyz*****
xyz
abc
1 1
*****abc:xyz:*****
xyz
abc
0 1
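The effect of the end-of-input requirement can be seen without Spirit in a hand-written recognizer for the same grammar. This is only a sketch of the idea, not what Spirit.Lex or Qi do internally; is_valid_path is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hand-rolled model of:  token := [a-z]+   path := token (':' path)?
// Returns true only if the WHOLE input is a path, which plays the role of
// the end-of-input check.
bool is_valid_path(const std::string& input) {
    std::size_t pos = 0;
    for (;;) {
        const std::size_t token_start = pos;
        while (pos < input.size() && input[pos] >= 'a' && input[pos] <= 'z')
            ++pos;
        if (pos == token_start) return false;  // empty token
        if (pos == input.size()) return true;  // input ends right after a token
        if (input[pos] != ':') return false;   // unexpected character
        ++pos;  // consume ':'; another token must follow
    }
}
```

It accepts "abc" and "abc:xyz" but rejects "abc:xyz:" and ":", because after a ':' another non-empty token is required before the input may end.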

The problem lies in the meaning of first and last after your call to tokenize_and_parse. first==last only checks whether your string has been completely tokenized; it tells you nothing about whether the grammar consumed all the tokens. If you isolate the parsing like this, you obtain the expected result:
PathTokens<LexerType> tokens;
PathGrammar<TokensIterator> grammar(tokens);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
LexerType::iterator_type lexfirst = tokens.begin(first,last);
LexerType::iterator_type lexlast = tokens.end();
bool r = parse(lexfirst, lexlast, grammar);
std::cerr << r << " " << (lexfirst==lexlast) << "\n";

This is what I finally ended up with. It uses the suggestions from both @sehe and @llonesmiz. Note the conversion to std::wstring and the use of actions in the grammar definition, which were not present in the original post.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/bind.hpp>
#include <iostream>
#include <string>
//
// This example uses boost spirit to parse a simple
// colon-delimited grammar.
//
// The grammar we want to recognize is:
// identifier := [a-z]+
// separator = :
// path= (identifier separator path) | identifier
//
// From the boost spirit perspective this example shows
// a few things I found hard to come by when building my
// first parser.
// 1. How to flag an incomplete token at the end of input
// as an error. (use of boost::spirit::eoi)
// 2. How to bind an action on an instance of an object
// that is taken as input to the parser.
// 3. Use of std::wstring.
// 4. Use of the lexer iterator.
//
// This using directive will cause issues with boost::bind
// when referencing placeholders such as _1.
// using namespace boost::spirit;
//! A class that tokenizes our input.
template<typename Lexer>
struct Tokens : boost::spirit::lex::lexer<Lexer>
{
Tokens()
{
identifier = L"[a-z]+";
separator = L":";
this->self.add
(identifier)
(separator)
;
}
boost::spirit::lex::token_def<std::wstring, wchar_t> identifier, separator;
};
//! This class provides a callback that echoes strings to stderr.
struct Echo
{
void echo(boost::fusion::vector<std::wstring> const& t) const
{
using namespace boost::fusion;
std::wcerr << at_c<0>(t) << "\n";
}
};
//! The definition of our grammar, as described above.
template <typename Iterator>
struct Grammar : boost::spirit::qi::grammar<Iterator>
{
template <typename TokenDef>
Grammar(TokenDef const& tok, Echo const& e)
: Grammar::base_type(path)
{
using boost::spirit::_val;
path
=
((token >> tok.separator >> path)[boost::bind(&Echo::echo, &e, ::_1)]
|
(token)[boost::bind(&Echo::echo, &e, ::_1)]
) >> boost::spirit::eoi; // Look for end of input.
token
= (tok.identifier) [_val=boost::spirit::qi::_1]
;
}
boost::spirit::qi::rule<Iterator> path;
boost::spirit::qi::rule<Iterator, std::wstring()> token;
};
int main()
{
// A set of typedefs to make things a little clearer. This stuff is
// well described in the boost spirit documentation/examples.
typedef std::wstring::iterator BaseIteratorType;
typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::wstring> > TokenType;
typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
typedef Tokens<LexerType>::iterator_type TokensIterator;
typedef LexerType::iterator_type LexerIterator;
// Define some paths to parse.
typedef std::vector<std::wstring> Tests;
Tests paths;
paths.push_back(L"abc");
paths.push_back(L"abc:xyz");
paths.push_back(L"abc:xyz:");
paths.push_back(L":");
// Parse 'em.
for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
{
std::wstring str = *iter;
std::wcerr << L"*****" << str << L"*****\n";
Echo e;
Tokens<LexerType> tokens;
Grammar<TokensIterator> grammar(tokens, e);
BaseIteratorType first = str.begin();
BaseIteratorType last = str.end();
// Have the lexer consume our string.
LexerIterator lexFirst = tokens.begin(first, last);
LexerIterator lexLast = tokens.end();
// Have the parser consume the output of the lexer.
bool r = boost::spirit::qi::parse(lexFirst, lexLast, grammar);
// Print the status and whether or not all output of the lexer
// was processed.
std::wcerr << r << L" " << (lexFirst==lexLast) << L"\n";
}
}
