How to skip (not output) tokens in Boost Spirit? - c++

I'm new to Boost Spirit and haven't been able to find examples for some simple things. For example, suppose I have an even number of space-delimited integers. (That matches *(qi::int_ >> qi::int_). So far so good.) I'd like to save only every other one (the first of each pair) to a std::vector<int>. I've tried a variety of things like *(qi::int_ >> qi::skip[qi::int_]) https://godbolt.org/z/KPToo3xh6 but that still records every int, not just every other one.
#include <stdexcept>
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
// Example based off https://raw.githubusercontent.com/bingmann/2018-cpp-spirit-parsing/master/spirit1_simple.cpp:
// Helper to run a parser, check for errors, and capture the results.
template <typename Parser, typename Skipper, typename ... Args>
void PhraseParseOrDie(
const std::string& input, const Parser& p, const Skipper& s,
Args&& ... args)
{
std::string::const_iterator begin = input.begin(), end = input.end();
boost::spirit::qi::phrase_parse(begin, end, p, s, std::forward<Args>(args) ...);
if (begin != end) {
fmt::print("Unparseable: \"{}\"\n", std::string(begin, end));
}
}
void test(std::string input)
{
std::vector<int> out_int_list;
PhraseParseOrDie(
// input string
input,
// parser grammar
*(qi::int_ >> qi::skip[qi::int_]),
// skip parser
qi::space,
// output list
out_int_list);
fmt::print("test() parse result: {}\n", out_int_list);
}
int main(int argc, char* argv[])
{
test("12345 42 5 2");
return 0;
}
Prints
test() parse result: [12345, 42, 5, 2]

You're looking for qi::omit[]:
*(qi::int_ >> qi::omit[qi::int_])
Note that you can also omit things implicitly by declaring a rule without an attribute type (which makes it bind to qi::unused_type for silent compatibility).
Also note that if you're writing an ad hoc, sloppy grammar to scan for certain "landmarks" in a larger body of text, consider spirit::repository::qi::seek, which can be significantly faster and more expressive.
Finally, note that Spirit X3 comes with a similar seek[] directive out of the box.
Simplified Demo
Much simplified: https://godbolt.org/z/EY4KdxYv9
#include <fmt/ranges.h>
#include <boost/spirit/include/qi.hpp>
// Helper to run a parser, check for errors, and capture the results.
void test(std::string const& input)
{
std::vector<int> out_int_list;
namespace qi = boost::spirit::qi;
qi::parse(input.begin(), input.end(), //
qi::expect[ //
qi::skip(qi::space)[ //
*(qi::int_ >> qi::omit[qi::int_]) > qi::eoi]], //
out_int_list);
fmt::print("test() parse result: {}\n", out_int_list);
}
int main() { test("12345 42 5 2"); }
Prints
test() parse result: [12345, 5]
But Wait
Seeing your comment
// Parse a bracketed list of integers with spaces between symbols
Did you really mean that? Because that sounds a ton more like:
'[' > qi::auto_ % +qi::graph > ']'
See it live: https://godbolt.org/z/eK6Thzqea
//#define BOOST_SPIRIT_DEBUG
#include <fmt/ranges.h>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_auto.hpp>
//#include <boost/fusion/adapted.hpp>
// Helper to run a parser, check for errors, and capture the results.
template <typename T> auto test(std::string const& input) {
std::vector<T> out;
using namespace boost::spirit::qi;
rule<std::string::const_iterator, T()> v = auto_;
BOOST_SPIRIT_DEBUG_NODE(v);
phrase_parse( //
input.begin(), input.end(), //
'[' > -v % lexeme[+(graph - ']')] > ']', //
space, out);
return out;
}
int main() {
fmt::print("ints: {}\n", test<int>("[12345 USD 5 PUT]"));
fmt::print("doubles: {}\n", test<double>("[ 1.2345 42 -inf 'hello' 3.1415 ]"));
}
Prints
ints: [12345, 5]
doubles: [1.2345, -inf, 3.1415]

Related

Boost::Spirit doubles character when followed by a default value

I use boost::spirit to parse (a part) of a monomial like x, y, xy, x^2, x^3yz. I want to save the variables of the monomial into a map, which also stores the corresponding exponent. Therefore the grammar should also save the implicit exponent of 1 (so x stores as if it was written as x^1).
start = +(potVar);
potVar=(varName>>'^'>>exponent)|(varName>> qi::attr(1));// First try: This doubles the variable name
//potVar = varName >> (('^' >> exponent) | qi::attr(1));// Second try: This works as intended
exponent = qi::int_;
varName = qi::char_("a-z");
When using the default attribute as in the line "First try", Spirit doubles the variable name.
Everything works as intended when using the default attribute as in the line "Second try".
'First try' reads a variable x and stores the pair [xx, 1].
'Second try' reads a variable x and stores the pair [x, 1].
I think I solved the original problem myself. The second try works. However, I don't see how I doubled the variable name. Because I am about to get familiar with boost::spirit, which is a collection of challenges for me, and there are probably more to come, I would like to understand this behavior.
This is the whole code to recreate the problem. The frame of the grammar is copied from a presentation of the KIT https://panthema.net/2018/0912-Boost-Spirit-Tutorial/ , and Stackoverflow was already very helpful, when I needed the header, which enables me to use the std::pair.
#include <iostream>
#include <iomanip>
#include <stdexcept>
#include <cmath>
#include <map>
#include <utility>//for std::pair
#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_pair.hpp> //https://stackoverflow.com/questions/53953642/parsing-map-of-variants-with-boost-spirit-x3
namespace qi = boost::spirit::qi;
template <typename Parser, typename Skipper, typename ... Args>
void PhraseParseOrDie(
const std::string& input, const Parser& p, const Skipper& s,
Args&& ... args)
{
std::string::const_iterator begin = input.begin(), end = input.end();
boost::spirit::qi::phrase_parse(
begin, end, p, s, std::forward<Args>(args) ...);
if (begin != end) {
std::cout << "Unparseable: "
<< std::quoted(std::string(begin, end)) << std::endl;
throw std::runtime_error("Parse error");
}
}
class ArithmeticGrammarMonomial : public qi::grammar<
std::string::const_iterator,
std::map<std::string, int>(), qi::space_type>
{
public:
using Iterator = std::string::const_iterator;
ArithmeticGrammarMonomial() : ArithmeticGrammarMonomial::base_type(start)
{
start = +(potVar);
potVar=(varName>>'^'>>exponent)|(varName>> qi::attr(1));
//potVar = varName >> (('^' >> exponent) | qi::attr(1));
exponent = qi::int_;
varName = qi::char_("a-z");
}
qi::rule<Iterator, std::map<std::string, int>(), qi::space_type> start;
qi::rule<Iterator, std::pair<std::string, int>(), qi::space_type> potVar;
qi::rule<Iterator, int()> exponent;
qi::rule<Iterator, std::string()> varName;
};
void test2(std::string input)
{
std::map<std::string, int> out_map;
PhraseParseOrDie(input, ArithmeticGrammarMonomial(), qi::space, out_map);
std::cout << "test2() parse result: "<<std::endl;
for(auto &it: out_map)
std::cout<< it.first<<it.second << std::endl;
}
/******************************************************************************/
int main(int argc, char* argv[])
{
std::cout << "Parse Monomial 1" << std::endl;
test2(argc >= 2 ? argv[1] : "x^3y^1");
test2(argc >= 2 ? argv[1] : "xy");
return 0;
}
Live demo
"I think I solved the original problem myself. The second try works."
Indeed. It's how I'd do this (always match the AST with your parser expressions).
"However, I don't see how I doubled the variable name."
It's due to backtracking with container attributes: they don't get rolled back. The first branch parses varName into the string and then fails at '^'; the parser backtracks into the second branch, which parses varName again and appends to the same string. See also:
boost::spirit::qi duplicate parsing on the output
Understanding Boost.spirit's string parser
Parsing with Boost::Spirit (V2.4) into container
Boost Spirit optional parser and backtracking
boost::spirit alternative parsers return duplicates
It can also crop up with semantic actions:
Boost Semantic Actions causing parsing issues
Boost Spirit optional parser and backtracking
In short:
match your AST structure in your rule expression, or use qi::hold to force the issue (at performance cost)
avoid semantic actions (Boost Spirit: "Semantic actions are evil"?)
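To see why the name is doubled, the rollback behaviour described above can be imitated in a few lines of plain C++. This is only a sketch with hand-written stand-ins for the rules (var_name, caret_exponent, pot_var_naive and pot_var_hold are hypothetical names, not Spirit code): the input position is rewound when a branch fails, but the attribute keeps whatever the failed branch already appended. The second variant works on a copy per branch, which is conceptually what qi::hold[] does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-ins for the Qi rules; this is NOT Spirit code.
using Pair = std::pair<std::string, int>;

// varName: appends one lowercase letter to `out` (it never clears `out` first).
bool var_name(std::string const& in, std::size_t& pos, std::string& out) {
    if (pos < in.size() && std::islower(static_cast<unsigned char>(in[pos]))) {
        out += in[pos++];
        return true;
    }
    return false;
}

// '^' >> exponent: leaves `pos` untouched when it does not match.
bool caret_exponent(std::string const& in, std::size_t& pos, int& out) {
    std::size_t p = pos;
    if (p >= in.size() || in[p] != '^') return false;
    ++p;
    std::size_t const digits_begin = p;
    int value = 0;
    while (p < in.size() && std::isdigit(static_cast<unsigned char>(in[p])))
        value = value * 10 + (in[p++] - '0');
    if (p == digits_begin) return false;
    pos = p;
    out = value;
    return true;
}

// "First try": (varName >> '^' >> exponent) | (varName >> attr(1)).
// On failure only the input position is rewound, the attribute is not.
bool pot_var_naive(std::string const& in, std::size_t& pos, Pair& attr) {
    std::size_t const save = pos;
    if (var_name(in, pos, attr.first) && caret_exponent(in, pos, attr.second))
        return true;
    pos = save;                          // backtrack the input only
    if (var_name(in, pos, attr.first)) { // appends to the same string again
        attr.second = 1;
        return true;
    }
    pos = save;
    return false;
}

// Same alternative, but every branch fills a copy that is committed on success.
bool pot_var_hold(std::string const& in, std::size_t& pos, Pair& attr) {
    std::size_t const save = pos;
    Pair tmp = attr;
    if (var_name(in, pos, tmp.first) && caret_exponent(in, pos, tmp.second)) {
        attr = std::move(tmp);
        return true;
    }
    pos = save;
    tmp = attr;                          // start again from the untouched attribute
    if (var_name(in, pos, tmp.first)) {
        tmp.second = 1;
        attr = std::move(tmp);
        return true;
    }
    pos = save;
    return false;
}
```

Feeding "x" to pot_var_naive leaves the pair as ("xx", 1), while pot_var_hold yields ("x", 1); the extra copy per branch is the performance cost mentioned above.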
For inspiration, here's a simplified take using Spirit X3
Live On Compiler Explorer
#include <boost/fusion/adapted.hpp>
#include <boost/spirit/home/x3.hpp>
#include <fmt/ranges.h>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>
namespace Parsing {
namespace x3 = boost::spirit::x3;
auto exponent = '^' >> x3::int_ | x3::attr(1);
auto varName = x3::repeat(1)[x3::char_("a-z")];
auto potVar
= x3::rule<struct P, std::pair<std::string, int>>{}
= varName >> exponent;
auto start = x3::skip(x3::space)[+potVar >> x3::eoi];
template <typename T = x3::unused_type>
void StrictParse(std::string_view input, T&& into = {})
{
auto f = input.begin(), l = input.end();
if (!x3::parse(f, l, start, into)) {
fmt::print(stderr, "Error at: '{}'\n", std::string(f, l));
throw std::runtime_error("Parse error");
}
}
} // namespace Parsing
void test2(std::string input) {
std::map<std::string, int> out_map;
Parsing::StrictParse(input, out_map);
fmt::print("{} -> {}\n", input, out_map);
}
int main() {
for (auto s : {"x^3y^1", "xy"})
test2(s);
}
Prints
x^3y^1 -> [("x", 3), ("y", 1)]
xy -> [("x", 1), ("y", 1)]
Bonus Notes
It looks to me like you should be more careful. Even if you assume that all variables are one letter and that no terms can occur (only factors), you still need to handle x^5y^2x correctly, as x^6y^2, right?
Here's a Qi version that uses semantic actions to correctly accumulate like factors:
Live On Coliru
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
#include <map>
namespace qi = boost::spirit::qi;
using Iterator = std::string::const_iterator;
using Monomial = std::map<char, int>;
struct ArithmeticGrammarMonomial : qi::grammar<Iterator, Monomial()> {
ArithmeticGrammarMonomial() : ArithmeticGrammarMonomial::base_type(start) {
using namespace qi;
exp_ = '^' >> int_ | attr(1);
start = skip(space)[ //
+(char_("a-z") >> exp_)[_val[_1] += _2] //
];
}
private:
qi::rule<Iterator, Monomial()> start;
qi::rule<Iterator, int(), qi::space_type> exp_;
};
void do_test(std::string const& input) { // std::string, so that its iterators match Iterator
Monomial output;
static const ArithmeticGrammarMonomial p;
Iterator f(begin(input)), l(end(input));
qi::parse(f, l, qi::eps > p, output);
std::cout << std::quoted(input) << " -> " << std::endl;
for (auto& [var,exp] : output)
std::cout << " - " << var << '^' << exp << std::endl;
}
int main() {
for (auto s : {"x^3y^1", "xy", "x^5y^2x"})
do_test(s);
}
Prints
"x^3y^1" ->
- x^3
- y^1
"xy" ->
- x^1
- y^1
"x^5y^2x" ->
- x^6
- y^2

boost::spirit::x3 parsing is slower than strsep parsing

I wrote an X3 parser to parse a structured text file; here is the demo code:
#include <boost/spirit/home/x3.hpp>
#include <cstdio>
#include <cstring>
#include <vector>
namespace x3 = boost::spirit::x3;
namespace ascii = boost::spirit::x3::ascii;
struct type_t {
    int id;
    std::vector<int> fads;
    std::vector<int> fbds;
    std::vector<float> fvalues;
    float target;
    void clear() {
        fads.clear();
        fbds.clear();
        fvalues.clear();
    }
};
namespace client {
template <typename Iterator>
bool parse_numbers(Iterator first, Iterator last, type_t& example)
{
    using x3::int_;
    using x3::double_;
    using x3::phrase_parse;
    using x3::_attr;
    using ascii::space;
    auto fn_id = [&](auto& ctx) { example.id = _attr(ctx); };
    auto fn_fad = [&](auto& ctx) { example.fads.push_back(_attr(ctx)); };
    auto fn_fbd = [&](auto& ctx) { example.fbds.push_back(_attr(ctx)); };
    auto fn_value = [&](auto& ctx) { example.fvalues.push_back(_attr(ctx)); };
    auto fn_target = [&](auto& ctx) { example.target = _attr(ctx); };
    bool r = phrase_parse(first, last,
        // Begin grammar
        (
            int_[fn_id] >>
            double_[fn_target] >>
            +(int_[fn_fad] >> ':' >> int_[fn_fbd] >> ':' >> double_[fn_value])
        ),
        // End grammar
        space);
    if (first != last) // fail if we did not get a full match
        return false;
    return r;
}
} // namespace client
int main() {
    char buf[10240];
    type_t example;
    FILE* fp = fopen("text", "r");
    if (!fp)
        return 1;
    while (fgets(buf, sizeof buf, fp)) // read one line into the buffer
    {
        int n = strlen(buf);
        example.clear();
        if (client::parse_numbers(buf, buf + n, example)) {
            // do nothing here, only parse the buf and fill into the example
        }
    }
    fclose(fp);
}
Am I doing this the right way, and how could it be improved? I'd like to see whether any optimization can be done before I switch back to my strsep parsing implementation, since it's much faster than this X3 version.
Why do you use semantic actions for this? An interesting read is sehe's article Boost Spirit: "Semantic actions are evil"? and the related notes.
Parsing into an AST structure, as shown by the X3 examples (e.g. Employee - Parsing into structs), is IMO much more natural. You can then evaluate the data later on, e.g. with the visitor pattern.
One solution is shown here:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/home/x3.hpp>
namespace ast {
    struct triple {
        int fad;
        int fbd;
        double value;
    };
    struct data {
        int id;
        double target;
        std::vector<ast::triple> triple;
    };
}
BOOST_FUSION_ADAPT_STRUCT(ast::triple, fad, fbd, value)
BOOST_FUSION_ADAPT_STRUCT(ast::data, id, target, triple)
namespace x3 = boost::spirit::x3;
namespace parser {
    using x3::int_; using x3::double_;
    // each rule gets its own tag type
    auto const triple = x3::rule<struct triple_tag, ast::triple>{ "triple" } =
        int_ >> ':' >> int_ >> ':' >> double_;
    auto const data = x3::rule<struct data_tag, ast::data>{ "data" } =
        int_ >> double_ >> +triple;
}
int main()
{
    std::stringstream buffer;
    std::ifstream file{ R"(C:\data.txt)" };
    if(file.is_open()) {
        buffer << file.rdbuf();
        file.close();
    }
    // buffer.str() returns a temporary copy, so keep one copy alive and
    // take both iterators from that same string
    std::string const input = buffer.str();
    auto iter = input.begin();
    auto const end = input.end();
    ast::data data;
    bool parse_ok = x3::phrase_parse(iter, end, parser::data, x3::space, data);
    return (parse_ok && iter == end) ? EXIT_SUCCESS : EXIT_FAILURE;
}
It does compile (see Wandbox), but it isn't tested because the input data is missing (you can of course generate some yourself inside main()); you are only interested in benchmarking anyway.
Also note the use of stringstream to read the rdbuf. There are several ways to skin the cat; I refer here to How to read in a file in C++, where the rdbuf reading approach is fast.
Further, how did you benchmark? Did you measure only the time required by x3::phrase_parse() versus the strsep part, or the whole binary, file loading time included? The measurements must be comparable! Also consider OS filesystem caching etc.
BTW, it would be interesting to see the results and the test environment (data file size, strsep implementation etc).
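A minimal sketch of such a like-for-like measurement, assuming the lines have already been read into memory so that file I/O and cache effects stay outside the timed region. The names BenchResult and time_parser are hypothetical; parse_line stands for whichever parser is being compared (a wrapper around x3::phrase_parse or around the strsep code).

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: number of lines accepted and the time spent parsing.
struct BenchResult {
    std::size_t parsed;
    std::chrono::nanoseconds elapsed;
};

// Times only the parsing loop; `lines` must be loaded before calling this.
template <typename ParseFn>
BenchResult time_parser(std::vector<std::string> const& lines, ParseFn parse_line) {
    using clock = std::chrono::steady_clock; // monotonic, suited to intervals
    std::size_t ok = 0;
    auto const start = clock::now();
    for (auto const& line : lines)
        if (parse_line(line))
            ++ok;
    auto const stop = clock::now();
    return {ok, std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Running both parsers through the same function on the same vector of lines, several times each, gives numbers that can actually be compared.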
Addendum:
If you know approximately how much data to expect, you can pre-allocate memory for the vector using data.triple.reserve(10240); (or write your own constructor that takes this as an argument). This prevents reallocation during parsing (don't forget to enclose this in a try/catch block to catch std::bad_alloc etc.). A default-constructed std::vector typically starts with zero capacity and grows geometrically, so without reserve() several reallocations are likely.
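The effect of reserve() can be observed by watching the capacity change. This is a small sketch; count_reallocations is a hypothetical helper, and the exact number of reallocations without a reservation depends on the standard library's growth policy.

```cpp
#include <cstddef>
#include <vector>

// Counts how often push_back had to grow the buffer while appending n elements.
std::size_t count_reallocations(std::size_t n, std::size_t reserve_hint) {
    std::vector<int> v;
    v.reserve(reserve_hint); // reserve(0) changes nothing
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        auto const before = v.capacity();
        v.push_back(static_cast<int>(i));
        if (v.capacity() != before)
            ++reallocations; // capacity changed, so the buffer was reallocated
    }
    return reallocations;
}
```

count_reallocations(10240, 10240) returns 0, whereas count_reallocations(10240, 0) returns a positive, implementation-dependent number.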

Spirit Grammar To break Up a String by Number of Characters

I am working on learning to write spirit grammars and I am trying to create a basic base 16 to base 64 converter that takes in a string representing hex, for example:
49276d206b696c
parse out six or fewer characters (fewer if the string isn't a perfect multiple of six) and generate a base 64 encoded string from the input. One grammar I figured would probably work is something like this:
// 6 characters
(qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F")[/*action*/]) |
// or 5 characters
(qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F") >> qi::char_("0-9a-fA-F") >>
qi::char_("0-9a-fA-F")[/*action*/]) | ...
etc., all the way down to one character, or having a different rule defined for each number of characters, but I think there must be a better way to specify the grammar. I read about Spirit's repeat directive and was thinking maybe I could do something like
+(boost::spirit::repeat(1, 6)[qi::char_("0-9a-fA-F")][/*action on characters*/])
however, the compiler throws an error on this because of the semantic action portion of the grammar. Is there a simpler way to specify a grammar that operates on six or fewer characters at a time?
Edit
Here is what I have done so far...
base16convertergrammar.hpp
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <string>
#include <iostream>
namespace grammar {
namespace qi = boost::spirit::qi;
void toBase64(const std::string& p_input, std::string& p_output)
{
if (p_input.length() < 6)
{
// pad length
}
// use back inserter and generator to append to end of p_output.
}
template <typename Iterator>
struct Base16Grammar : qi::grammar<Iterator, std::string()>
{
Base16Grammar() : Base16Grammar::base_type(start, "base16grammar"),
m_base64String()
{
// get six characters at a time and send them off to be encoded
// if there is less than six characters just parse what we have
start = +(boost::spirit::repeat(1, 6)[qi::char_("0-9a-fA-F")][boost::phoenix::bind(toBase64, qi::_1,
boost::phoenix::ref(m_base64String))]);
}
qi::rule<Iterator, std::string()> start;
std::string m_base64String;
};
}
And here is the usage...
base16converter.cpp
#include "base16convertergrammar.hpp"
const std::string& convertHexToBase64(const std::string& p_hexString)
{
grammar::Base16Grammar<std::string::const_iterator> g;
bool r = boost::spirit::qi::parse(p_hexString.begin(), p_hexString.end(), g);
}
int main(int argc, char** argv)
{
std::string test("49276d206b696c6c");
convertHexToBase64(test);
}
First of all, repeat()[] exposes a vector, so vector<char>, not a string.
void toBase64(const std::vector<char>& p_input, std::string& p_output)
Secondly, please don't do all that work. You don't tell us what the input means, but as long as you want to group it in sixes, I'm assuming you want them interpreted as /something/. You could e.g. use the int_parser:
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <cassert>
#include <cstdint>
#include <string>
#include <iostream>
namespace grammar {
namespace qi = boost::spirit::qi;
namespace px = boost::phoenix;
template <typename Iterator>
struct Base16Grammar : qi::grammar<Iterator, std::string()>
{
Base16Grammar() : Base16Grammar::base_type(start, "base16grammar")
{
start = +qi::int_parser<uint64_t, 16, 1, 6>() [ qi::_val += to_string(qi::_1) + "; " ];
}
private:
struct to_string_f { template <typename T> std::string operator()(T const& v) const { return std::to_string(v); } };
px::function<to_string_f> to_string;
qi::rule<Iterator, std::string()> start;
};
}
std::string convertHexToBase64(const std::string& p_hexString)
{
grammar::Base16Grammar<std::string::const_iterator> g;
std::string result;
bool r = boost::spirit::qi::parse(p_hexString.begin(), p_hexString.end(), g, result);
assert(r);
return result;
}
int main()
{
for (std::string test : {"49276d206b696c6c"})
std::cout << test << " -> " << convertHexToBase64(test) << "\n";
}
Prints
49276d206b696c6c -> 4794221; 2124649; 27756;
Going out on a limb, you just want to transcode hex-encoded binary into base64.
Since you're already using Boost:
Live On Coliru
#include <boost/archive/iterators/base64_from_binary.hpp>
#include <boost/archive/iterators/insert_linebreaks.hpp>
#include <boost/archive/iterators/transform_width.hpp>
// for hex decoding
#include <boost/iterator/function_input_iterator.hpp>
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <iostream>
#include <functional>
std::string convertHexToBase64(const std::string &hex) {
struct get_byte_f {
using result_type = uint8_t;
std::string::const_iterator hex_it;
result_type operator()() {
auto nibble = [](uint8_t ch) {
if (!std::isxdigit(ch)) throw std::runtime_error("invalid hex input");
return std::isdigit(ch) ? ch - '0' : std::tolower(ch) - 'a' + 10;
};
auto hi = nibble(*hex_it++);
auto lo = nibble(*hex_it++);
return hi << 4 | lo;
}
} get_byte{ hex.begin() };
using namespace boost::archive::iterators;
using It = boost::iterators::function_input_iterator<get_byte_f, size_t>;
typedef insert_linebreaks< // insert line breaks every 72 characters
base64_from_binary< // convert binary values to base64 characters
transform_width< // retrieve 6 bit integers from a sequence of 8 bit bytes
It, 6, 8> >,
72> B64; // compose all the above operations in to a new iterator
return { B64(It{get_byte, 0}), B64(It{get_byte, hex.size()/2}) };
}
int main() {
for (std::string test : {
"49276d206b696c6c",
"736f6d65206c656e67746879207465787420746f2073686f77207768617420776f756c642068617070656e206174206c696e6520777261700a"
})
{
std::cout << " === hex: " << test << "\n" << convertHexToBase64(test) << "\n";
}
}
Prints
=== hex: 49276d206b696c6c
SSdtIGtpbGw
=== hex: 736f6d65206c656e67746879207465787420746f2073686f77207768617420776f756c642068617070656e206174206c696e6520777261700a
c29tZSBsZW5ndGh5IHRleHQgdG8gc2hvdyB3aGF0IHdvdWxkIGhhcHBlbiBhdCBsaW5lIHdy
YXAK
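For comparison, the same hex-to-base64 transcoding can be sketched with the standard library alone. This is an illustrative alternative, not the Boost iterator chain used above: hex_value and hex_to_base64 are hypothetical names, no line breaks are inserted, and, unlike the first output shown above (11 characters, no padding), the result is padded with '=' to a multiple of four characters.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Value of a single hex digit; throws on anything else.
inline unsigned hex_value(char c) {
    if (c >= '0' && c <= '9') return static_cast<unsigned>(c - '0');
    if (c >= 'a' && c <= 'f') return static_cast<unsigned>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return static_cast<unsigned>(c - 'A' + 10);
    throw std::invalid_argument("invalid hex digit");
}

// Decodes pairs of hex digits into bytes and re-encodes them as padded base64.
std::string hex_to_base64(std::string const& hex) {
    static char const alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    if (hex.size() % 2 != 0)
        throw std::invalid_argument("odd number of hex digits");

    std::string out;
    out.reserve((hex.size() / 2 + 2) / 3 * 4);
    std::uint32_t acc = 0; // bit accumulator; only the low `bits` bits are pending
    int bits = 0;
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        acc = (acc << 8) | (hex_value(hex[i]) << 4 | hex_value(hex[i + 1]));
        bits += 8;
        while (bits >= 6) { // emit every complete 6-bit group
            bits -= 6;
            out += alphabet[(acc >> bits) & 0x3F];
        }
    }
    if (bits > 0) // left-align the remaining 2 or 4 bits in a final group
        out += alphabet[(acc << (6 - bits)) & 0x3F];
    while (out.size() % 4 != 0)
        out += '=';
    return out;
}
```

For "49276d206b696c6c" this returns "SSdtIGtpbGw=", i.e. the first result above plus the padding character.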

Parse only specific numbers with Boost.Spirit

How can I build a Boost.Spirit parser that matches only numbers in a certain range?
Consider the simple parser qi::uint_. It matches all unsigned integers. Is it possible to construct a parser that matches the numbers 0 to 12345 but not 12346 and larger?
One way is to attach to the qi::uint_ parser a semantic action that checks the parser's attribute and sets the semantic action's third parameter accordingly:
#include <iostream>
#include <string>
#include <vector>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
int main() {
qi::rule<std::string::const_iterator, unsigned(), qi::ascii::space_type> rule;
const auto not_greater_than_12345 = [](const unsigned& attr, auto&, bool& pass) {
pass = !(attr > 12345U);
};
rule %= qi::uint_[not_greater_than_12345];
std::vector<std::string> numbers{"0", "123", "1234", "12345", "12346", "123456"};
for (const auto& number : numbers) {
unsigned result;
auto iter = number.cbegin();
if (qi::phrase_parse(iter, number.cend(), rule, qi::ascii::space, result) &&
iter == number.cend()) {
std::cout << result << '\n'; // 0 123 1234 12345
}
}
}
Live on Wandbox
The semantic action can be written more concisely with the Phoenix placeholders _pass and _1:
#include <iostream>
#include <string>
#include <vector>
#include <boost/phoenix/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
namespace qi = boost::spirit::qi;
int main() {
qi::rule<std::string::const_iterator, unsigned(), qi::ascii::space_type> rule;
rule %= qi::uint_[qi::_pass = !(qi::_1 > 12345U)];
std::vector<std::string> numbers{"0", "123", "1234", "12345", "12346", "123456"};
for (const auto& number : numbers) {
unsigned result;
auto iter = number.cbegin();
if (qi::phrase_parse(iter, number.cend(), rule, qi::ascii::space, result) &&
iter == number.cend()) {
std::cout << result << '\n'; // 0 123 1234 12345
}
}
}
Live on Wandbox
From Semantic Actions with Parsers
The possible signatures for functions to be used as semantic actions are:
...
template <typename Attrib, typename Context>
void fa(Attrib& attr, Context& context, bool& pass);
... Here Attrib is the attribute type of the parser attached to the semantic action. ... The third parameter, pass, can be used by the semantic action to force the associated parser to fail. If pass is set to false the action parser will immediately return false as well, while not invoking p and not generating any output.
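The "parse first, then veto" idea behind the pass parameter can also be sketched without Spirit. The following uses std::from_chars (C++17) instead of qi::uint_, so it illustrates the logic only, not Spirit's API; parse_uint_at_most is a hypothetical name.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Returns the value only if `text` is entirely an unsigned integer <= max_value.
std::optional<unsigned> parse_uint_at_most(std::string_view text, unsigned max_value) {
    unsigned value = 0;
    char const* const first = text.data();
    char const* const last = text.data() + text.size();
    auto const [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last)
        return std::nullopt; // not a complete unsigned integer
    if (value > max_value)
        return std::nullopt; // the equivalent of setting pass = false
    return value;
}
```

With a limit of 12345, the inputs "0" and "12345" yield a value, while "12346" and "123456" yield std::nullopt, matching the output of the Spirit versions above.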

Rcpp - Capture result of sregex_token_iterator to vector

I'm an R user and am learning C++ to leverage it in Rcpp. Recently, I wrote an alternative to R's strsplit in Rcpp using string.h, but it isn't regex-based (afaik). I've been reading about Boost and found sregex_token_iterator.
The website below has an example:
std::string input("This is his face");
sregex re = sregex::compile(" "); // find white space
// iterate over all non-white space in the input. Note the -1 below:
sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;
// write all the words to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );
My rcpp function runs just fine:
#include <Rcpp.h>
#include <boost/xpressive/xpressive.hpp>
using namespace Rcpp;
// [[Rcpp::export]]
StringVector testMe(std::string input,std::string uregex) {
boost::xpressive::sregex re = boost::xpressive::sregex::compile(uregex); // compile the delimiter regex
// iterate over the text between delimiter matches (note the -1)
boost::xpressive::sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end;
// write all the tokens to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );
return("Done");
}
/*** R
testMe("This is a funny sentence"," ")
*/
But all it does is print out the tokens. I am very new to C++ but I understand the idea of making a vector in rcpp with StringVector res(10); (make a vector named res of length 10) which I can then index res[1] = "blah".
My question is - how do I take the output of boost::xpressive::sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end; and store it in a vector so I can return it?
http://www.boost.org/doc/libs/1_54_0/doc/html/xpressive/user_s_guide.html#boost_xpressive.user_s_guide.string_splitting_and_tokenization
Final working Rcpp solution
Including this because my need was Rcpp specific and I had to make some minor changes to the solution provided.
#include <Rcpp.h>
#include <boost/xpressive/xpressive.hpp>
typedef std::vector<std::string> StringVector;
using boost::xpressive::sregex;
using boost::xpressive::sregex_token_iterator;
using Rcpp::List;
void tokenWorker(/*in*/ const std::string& input,
/*in*/ const sregex re,
/*inout*/ StringVector& v)
{
sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end;
// write all the words to v
std::copy(begin, end, std::back_inserter(v));
}
//[[Rcpp::export]]
List tokenize(StringVector t, std::string tok = " "){
List final_res(t.size());
sregex re = sregex::compile(tok);
for(std::size_t z=0;z<t.size();z++){
const std::string& x = t[z]; // no need to copy character by character
StringVector v;
tokenWorker(x, re, v);
final_res[z] = v;
}
return(final_res);
}
/*** R
tokenize("Please tokenize this sentence")
*/
"My question is - how do I take the output of boost::xpressive::sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end; and store it in a vector so I can return it?"
You're already halfway there.
The missing link is just std::back_inserter
#include <iostream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <string>
#include <boost/xpressive/xpressive.hpp>
typedef std::vector<std::string> StringVector;
using boost::xpressive::sregex;
using boost::xpressive::sregex_token_iterator;
void testMe(/*in*/ const std::string& input,
/*in*/ const std::string& uregex,
/*inout*/ StringVector& v)
{
sregex re = sregex::compile(uregex);
sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end;
// write all the words to v
std::copy(begin, end, std::back_inserter(v));
}
int main()
{
std::string input("This is his face");
std::string blank(" ");
StringVector v;
// find white space
testMe(input, blank, v);
std::copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "|"));
std::cout << std::endl;
return 0;
}
output:
This|is|his|face|
I used legacy C++ because you used a regex library from Boost instead of the standard <regex>. Since you are learning C++ right now, you might consider starting with C++14 straight away; it would have shortened even this small snippet and made it more expressive.
And here's the C++11 version.
Aside from the benefits of using a standardized <regex>, the <regex>-using version compiles roughly twice as fast as the boost::xpressive version with gcc-4.9 and clang-3.5 (-g -O0 -std=c++11) on a QuadCore-Box running with Debian x86_64 Jessie.
#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <iterator>
#include <vector>
#include <string>
//////////////////////////////////////////////////////////////////////////////
// A minimal adaption layer atop boost::xpressive and c++11 std's <regex> //
//--------------------------------------------------------------------------//
// remove the comment sign from the #define if your compiler suite's //
// <regex> implementation is not complete //
//#define USE_REGEX_FALLBACK_33509467 1 //
//////////////////////////////////////////////////////////////////////////////
#if defined(USE_REGEX_FALLBACK_33509467)
#include <boost/xpressive/xpressive.hpp>
using regex = boost::xpressive::sregex;
using sregex_iterator = boost::xpressive::sregex_token_iterator;
auto compile = [] (const std::string& s) {
return boost::xpressive::sregex::compile(s);
};
auto make_sregex_iterator = [] (const std::string& s, const regex& re) {
return sregex_iterator(s.begin(), s.end(), re ,-1);
};
#else // #if !defined(USE_REGEX_FALLBACK_33509467)
#include <regex>
using regex = std::regex;
using sregex_iterator = std::sregex_token_iterator;
auto compile = [] (const std::string& s) {
return regex(s);
};
auto make_sregex_iterator = [] (const std::string& s, const regex& re) {
return std::sregex_token_iterator(s.begin(), s.end(), re, -1);
};
#endif // #if defined(USE_REGEX_FALLBACK_33509467)
//////////////////////////////////////////////////////////////////////////////
typedef std::vector<std::string> StringVector;
StringVector testMe(/*in*/const std::string& input,
/*in*/const std::string& uregex)
{
regex re = compile(uregex);
sregex_iterator begin = make_sregex_iterator(input, re),
end;
return StringVector(begin, end); // copies the token strings into the vector;
// the vector itself is moved (or elided) on return
}
int main() {
std::string input("This is his face");
std::string blank(" ");
// tokenize by white space
StringVector v = testMe(input, blank);
std::copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "|"));
std::cout << std::endl;
return EXIT_SUCCESS;
}