Rcpp - Capture result of sregex_token_iterator to vector

Rcpp - Capture result of sregex_token_iterator to vector - c++

I'm an R user and am learning c++ to leverage in Rcpp. Recently, I wrote an alternative to R's strsplit in Rcpp using string.h but it isn't regex based (afaik). I've been reading about Boost and found sregex_token_iterator.
The website below has an example:
std::string input("This is his face");
sregex re = sregex::compile(" "); // find white space
// iterate over all non-white space in the input. Note the -1 below:
sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;
// write all the words to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );
My rcpp function runs just fine:
#include <Rcpp.h>
#include <boost/xpressive/xpressive.hpp>
using namespace Rcpp;
// [[Rcpp::export]]
StringVector testMe(std::string input,std::string uregex) {
boost::xpressive::sregex re = boost::xpressive::sregex::compile(uregex); // find a date
// iterate over the days, months and years in the input
boost::xpressive::sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end;
// write all the words to std::cout
std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
std::copy( begin, end, out_iter );
return("Done");
}
/*** R
testMe("This is a funny sentence"," ")
*/
But all it does is print out the tokens. I am very new to C++ but I understand the idea of making a vector in rcpp with StringVector res(10); (make a vector named res of length 10) which I can then index res[1] = "blah".
My question is - how do I take the output of boost::xpressive::sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end; and store it in a vector so I can return it?
http://www.boost.org/doc/libs/1_54_0/doc/html/xpressive/user_s_guide.html#boost_xpressive.user_s_guide.string_splitting_and_tokenization
Final working Rcpp solution
Including this because my need was Rcpp specific and I had to make some minor changes to the solution provided.
#include <Rcpp.h>
#include <boost/xpressive/xpressive.hpp>
typedef std::vector<std::string> StringVector;
using boost::xpressive::sregex;
using boost::xpressive::sregex_token_iterator;
using Rcpp::List;
void tokenWorker(/*in*/ const std::string& input,
/*in*/ const sregex re,
/*inout*/ StringVector& v)
{
sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end;
// write all the words to v
std::copy(begin, end, std::back_inserter(v));
}
//[[Rcpp::export]]
List tokenize(StringVector t, std::string tok = " "){
List final_res(t.size());
sregex re = sregex::compile(tok);
for(int z=0;z<t.size();z++){
std::string x = "";
for(int y=0;y<t[z].size();y++){
x += t[z][y];
}
StringVector v;
tokenWorker(x, re, v);
final_res[z] = v;
}
return(final_res);
}
/*** R
tokenize("Please tokenize this sentence")
*/

My question is - how do I take the output of
boost::xpressive::sregex_token_iterator begin( input.begin(),
input.end(), re ,-1), end; and store it in a vector so I can return
it?
You're already halfway there.
The missing link is just std::back_inserter
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <boost/xpressive/xpressive.hpp>
typedef std::vector<std::string> StringVector;
using boost::xpressive::sregex;
using boost::xpressive::sregex_token_iterator;
void testMe(/*in*/ const std::string& input,
/*in*/ const std::string& uregex,
/*inout*/ StringVector& v)
{
sregex re = sregex::compile(uregex);
sregex_token_iterator begin( input.begin(), input.end(), re ,-1), end;
// write all the words to v
std::copy(begin, end, std::back_inserter(v));
}
int main()
{
std::string input("This is his face");
std::string blank(" ");
StringVector v;
// find white space
testMe(input, blank, v);
std::copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "|"));
std::cout << std::endl;
return 0;
}
output:
This|is|his|face|
I used legacy C++ because you used a regex lib from boost instead of std <regex>; maybe you better consider C++14 right from the start when you learn c++ right now; C++14 would have shortened even this small snippet and made it more expressive.

And here's the C++11 version.
Aside from the benefits of using a standardized <regex>, the <regex>-using version compiles roughly twice as fast as the boost::xpressive version with gcc-4.9 and clang-3.5 (-g -O0 -std=c++11) on a QuadCore-Box running with Debian x86_64 Jessie.
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
//////////////////////////////////////////////////////////////////////////////
// A minimal adaption layer atop boost::xpressive and c++11 std's <regex> //
//--------------------------------------------------------------------------//
// remove the comment sign from the #define if your compiler suite's //
// <regex> implementation is not complete //
//#define USE_REGEX_FALLBACK_33509467 1 //
//////////////////////////////////////////////////////////////////////////////
#if defined(USE_REGEX_FALLBACK_33509467)
#include <boost/xpressive/xpressive.hpp>
using regex = boost::xpressive::sregex;
using sregex_iterator = boost::xpressive::sregex_token_iterator;
auto compile = [] (const std::string& s) {
return boost::xpressive::sregex::compile(s);
};
auto make_sregex_iterator = [] (const std::string& s, const regex& re) {
return sregex_iterator(s.begin(), s.end(), re ,-1);
};
#else // #if !defined(USE_REGEX_FALLBACK_33509467)
#include <regex>
using regex = std::regex;
using sregex_iterator = std::sregex_token_iterator;
auto compile = [] (const std::string& s) {
return regex(s);
};
auto make_sregex_iterator = [] (const std::string& s, const regex& re) {
return std::sregex_token_iterator(s.begin(), s.end(), re, -1);
};
#endif // #if defined(USE_REGEX_FALLBACK_33509467)
//////////////////////////////////////////////////////////////////////////////
typedef std::vector<std::string> StringVector;
StringVector testMe(/*in*/const std::string& input,
/*in*/const std::string& uregex)
{
regex re = compile(uregex);
sregex_iterator begin = make_sregex_iterator(input, re),
end;
return StringVector(begin, end); // doesn't steal the strings
// but try (and succeed) to move the vector
}
int main() {
std::string input("This is his face");
std::string blank(" ");
// tokenize by white space
StringVector v = testMe(input, blank);
std::copy(v.begin(), v.end(),
std::ostream_iterator<std::string>(std::cout, "|"));
std::cout << std::endl;
return EXIT_SUCCESS;
}

Related

string to numbers (vector array of int type)conversion

The function takes a string containing of comma(,) separated numbers as string and converts into numbers. Sometimes it produces a garbage value at the end.
vector<int> parseInts(string str)
{
int as[200]={0};
int i=0,j=0;
for(;str[i]!='\0';i++)
{
while(str[i]!=','&&str[i]!='\0')
{as[j]= as[j]*10 +str[i] -'0';
i++;}
j++;
}
vector<int>rr;
for(int i=0;i<j;i++)
rr.push_back(as[i]);
return rr;
}

If you're writing in C++, use C++ features instead of C-style string manipulation. You can combine std::istringstream, std::getline(), and std::stoi() into a very short solution. (Also note that you should take the argument by const reference since you do not modify it.)
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
std::vector<int> parseInts(std::string const & str) {
std::vector<int> values;
std::istringstream src{str};
std::string buf;
while (std::getline(src, buf, ',')) {
// Note no error checking on this conversion -- exercise for the reader.
values.push_back(std::stoi(buf));
}
return values;
}
(Demo)

The code doesn't handle whitespace and inputs with more than 200 numbers.
An alternative working solution:
#include <iostream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <vector>
std::vector<int> parseInts(std::string s) {
std::replace(s.begin(), s.end(), ',', ' ');
std::istringstream ss(std::move(s));
return std::vector<int>{
std::istream_iterator<int>{ss},
std::istream_iterator<int>{}
};
}
int main() {
auto v = parseInts("1,2 , 3 ,,, 4,5,,,");
for(auto i : v)
std::cout << i << '\n';
}
Output:
1
2
3
4
5

You never really asked a question. If you are looking for an elegant method, then I provide that below. If you are asking us to debug the code, then that is a different matter.
First here is a nice utility for splitting a string
std::vector<std::string> split(const std::string& str, char delim) {
std::vector<std::string> strings;
size_t start;
size_t end = 0;
while ((start = str.find_first_not_of(delim, end)) != std::string::npos) {
end = str.find(delim, start);
strings.push_back(str.substr(start, end - start));
}
return strings;
}
First split the string on commas:
std::vector<std::string> strings = split(str, ',');
Then covert each to an int
std::vector<int> ints;
for (auto s : strings)
ints.push_back(std::stoi(s))

Split string using multiple string delimiters

Suppose I have string like
Harry potter was written by J. K. Rowling
How to split string using was and by as a delimiter and get result in vector in C++?
I know split using multiple char but not using multiple string.

If you use c++11 and clang there is a solution using a regex string tokenizer:
#include <fstream>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>
int main()
{
std::string text = " Harry potter was written by J. K. Rowling.";
std::regex ws_re("(was)|(by)");
std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
std::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
The output is :
Harry potter
written
J. K. Rowling.
Sadly gcc4.8 does not have the regex fully integrated. But clang does compile and link this correctly.

Brute force approach, not boost, no c++11, optimizations more than welcome:
/** Split the string s by the delimiters, place the result in the
outgoing vector result */
void split(const std::string& s, const std::vector<std::string>& delims,
std::vector<std::string>& result)
{
// split the string into words
std::stringstream ss(s);
std::istream_iterator<std::string> begin(ss);
std::istream_iterator<std::string> end;
std::vector<std::string> splits(begin, end);
// then append the words together, except if they are delimiter
std::string current;
for(int i=0; i<splits.size(); i++)
{
if(std::find(delims.begin(), delims.end(), splits[i]) != delims.end())
{
result.push_back(current);
current = "";
}
else
{
current += splits[i] + " " ;
}
}
result.push_back(current.substr(0, current.size() - 1));
}

regex replace with callback in c++11?

Is there a function of regex replacement that will send the matches to user function and then substitute the return value:
I've tried this method, but it obviously doesn't work:
cout << regex_replace("my values are 9, 19", regex("\d+"), my_callback);
and function:
std::string my_callback(std::string &m) {
int int_m = atoi(m.c_str());
return std::to_string(int_m + 1);
}
and the result should be: my values are 10, 20
I mean similar mode of working like php's preg_replace_callback or python's re.sub(pattern, callback, subject)
And I mean the latest 4.9 gcc, that is capable of regex without boost.

I wanted this kind of function and didn't like the answer "use boost". The problem with Benjamin's answer is it provides all the tokens. This means you don't know which token is a match and it doesn't let you use capture groups. This does:
// clang++ -std=c++11 -stdlib=libc++ -o test test.cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <regex>
namespace std
{
template<class BidirIt, class Traits, class CharT, class UnaryFunction>
std::basic_string<CharT> regex_replace(BidirIt first, BidirIt last,
const std::basic_regex<CharT,Traits>& re, UnaryFunction f)
{
std::basic_string<CharT> s;
typename std::match_results<BidirIt>::difference_type
positionOfLastMatch = 0;
auto endOfLastMatch = first;
auto callback = [&](const std::match_results<BidirIt>& match)
{
auto positionOfThisMatch = match.position(0);
auto diff = positionOfThisMatch - positionOfLastMatch;
auto startOfThisMatch = endOfLastMatch;
std::advance(startOfThisMatch, diff);
s.append(endOfLastMatch, startOfThisMatch);
s.append(f(match));
auto lengthOfMatch = match.length(0);
positionOfLastMatch = positionOfThisMatch + lengthOfMatch;
endOfLastMatch = startOfThisMatch;
std::advance(endOfLastMatch, lengthOfMatch);
};
std::regex_iterator<BidirIt> begin(first, last, re), end;
std::for_each(begin, end, callback);
s.append(endOfLastMatch, last);
return s;
}
template<class Traits, class CharT, class UnaryFunction>
std::string regex_replace(const std::string& s,
const std::basic_regex<CharT,Traits>& re, UnaryFunction f)
{
return regex_replace(s.cbegin(), s.cend(), re, f);
}
} // namespace std
using namespace std;
std::string my_callback(const std::smatch& m) {
int int_m = atoi(m.str(0).c_str());
return std::to_string(int_m + 1);
}
int main(int argc, char *argv[])
{
cout << regex_replace("my values are 9, 19", regex("\\d+"),
my_callback) << endl;
cout << regex_replace("my values are 9, 19", regex("\\d+"),
[](const std::smatch& m){
int int_m = atoi(m.str(0).c_str());
return std::to_string(int_m + 1);
}
) << endl;
return 0;
}

You could use a regex_token_iterator
#include <iostream>
#include <algorithm>
#include <regex>
#include <string>
#include <sstream>
int main()
{
std::string input_text = "my values are 9, 19";
std::string output_text;
auto callback = [&](std::string const& m){
std::istringstream iss(m);
int n;
if(iss >> n)
{
output_text += std::to_string(n+1);
}
else
{
output_text += m;
}
};
std::regex re("\\d+");
std::sregex_token_iterator
begin(input_text.begin(), input_text.end(), re, {-1,0}),
end;
std::for_each(begin,end,callback);
std::cout << output_text;
}
Note that the {-1,0} in the argument list of the iterator constructor is a list specifying the submatches we want to iterate over. The -1 is for non-matching sections, and the 0 is for the first submatch.
Also note that I have not used the c++11 regex functionality extensively and am no expert in it. So there may be problems with this code. But for your specific input, I tested it and it seems to produce the expected results. If you find any input set for which it doesn't work, please let me know.

Maybe I arrived too late to this party (about 5 years thought), but I neither liked the answer "use boost", following function has less generalization (speaking about string types), but apparently works. However, I don't know if use a std::ostringstream is better than std::string::append:
std::string regex_replace(
const std::string& input,
const std::regex& regex,
std::function<std::string(std::smatch const& match)> format) {
std::ostringstream output;
std::sregex_iterator begin(input.begin(), input.end(), regex), end;
for(; begin != end; begin++){
output << begin->prefix() << format(*begin);
}
output << input.substr(input.size() - begin->position());
return output.str();
}
So, as you can see I used std::sregex_iterator instead of std::sregex_token_iterator.

That kind of functionality only exists in the Boost library version of regex_replace, which can have the custom formatter. Unfortunately, the standard C++11 implementation requires the replacement format argument must be a string.
Here is the documentation on regex_replace: http://www.cplusplus.com/reference/regex/match_replace/

c++11 regex slower than python

hi i would like to understand why the following code which does a split string split using regex
#include<regex>
#include<vector>
#include<string>
std::vector<std::string> split(const std::string &s){
static const std::regex rsplit(" +");
auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
auto rend = std::sregex_token_iterator();
auto res = std::vector<std::string>(rit, rend);
return res;
}
int main(){
for(auto i=0; i< 10000; ++i)
split("a b c", " ");
return 0;
}
is slower then the following python code
import re
for i in range(10000):
re.split(' +', 'a b c')
here's
> python test.py 0.05s user 0.01s system 94% cpu 0.070 total
./test 0.26s user 0.00s system 99% cpu 0.296 total
Im using clang++ on osx.
compiling with -O3 brings it to down to 0.09s user 0.00s system 99% cpu 0.109 total

Notice
See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.
I've augmented the loop to 1000000 to get a better timing measure.
This is my Python timing:
real 0m2.038s
user 0m2.009s
sys 0m0.024s
Here's an equivalent of your code, just a bit prettier:
#include <regex>
#include <vector>
#include <string>
std::vector<std::string> split(const std::string &s, const std::regex &r)
{
return {
std::sregex_token_iterator(s.begin(), s.end(), r, -1),
std::sregex_token_iterator()
};
}
int main()
{
const std::regex r(" +");
for(auto i=0; i < 1000000; ++i)
split("a b c", r);
return 0;
}
Timing:
real 0m5.786s
user 0m5.779s
sys 0m0.005s
This is an optimization to avoid construction/allocation of vector and string objects:
#include <regex>
#include <vector>
#include <string>
void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
auto rend = std::sregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
Timing:
real 0m3.034s
user 0m3.029s
sys 0m0.004s
This is near a 100% performance improvement.
The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there's no memory deallocation by clear(), the vector maintains the memory and construct strings in-place.
Another performance increase would be to avoid construction/destruction std::string completely, and hence, allocation/deallocation of its objects.
This is a tentative in this direction:
#include <regex>
#include <vector>
#include <string>
void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
Timing:
real 0m2.509s
user 0m2.503s
sys 0m0.004s
An ultimate improvement would be to have a std::vector of const char * as return, where each char pointer would point to a substring inside the original s c string itself. The problem is that, you can't do that because each of them would not be null terminated (for this, see usage of C++1y string_ref in a later sample).
This last improvement could also be achieved with this:
#include <regex>
#include <vector>
#include <string>
void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v); // the constant string("a b c") should be optimized
// by the compiler. I got the same performance as
// if it was an object outside the loop
return 0;
}
I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.
Boost.Regex
This is the boost::regex timing for the c string arguments sample:
real 0m1.284s
user 0m1.278s
sys 0m0.005s
Same code, boost::regex and std::regex interface in this sample are identical, just needed to change the namespace and include.
Best wishes for it to get better over time, C++ stdlib regex implementations are in their infancy.
EDIT
For sake of completion, I've tried this (the above mentioned "ultimate improvement" suggestion) and it didn't improved performance of the equivalent std::vector<std::string> &v version in anything:
#include <regex>
#include <vector>
#include <string>
template<typename Iterator> class intrusive_substring
{
private:
Iterator begin_, end_;
public:
intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}
Iterator begin() {return begin_;}
Iterator end() {return end_;}
};
using intrusive_char_substring = intrusive_substring<const char *>;
void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear(); // This can potentially be optimized away by the compiler because
// the intrusive_char_substring destructor does nothing, so
// resetting the internal size is the only thing to be done.
// Formerly allocated memory is maintained.
while(rit != rend)
{
v.emplace_back(rit->first, rit->second);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<intrusive_char_substring> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
This has to do with the array_ref and string_ref proposal. Here's a sample code using it:
#include <regex>
#include <vector>
#include <string>
#include <string_ref>
void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.emplace_back(rit->first, rit->length());
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string_ref> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
It will also be cheaper to return a vector of string_ref rather than string copies for the case of split with vector return.
EDIT 2
This new solution is able to get output by return. I have used Marshall Clow's string_view (string_ref got renamed) libc++ implementation found at https://github.com/mclow/string_view.
#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>
using namespace std;
using namespace std::experimental;
using namespace boost;
string_view stringfier(const cregex_token_iterator::value_type &match) {
return {match.first, static_cast<size_t>(match.length())};
}
using string_view_iterator =
transform_iterator<decltype(&stringfier), cregex_token_iterator>;
iterator_range<string_view_iterator> split(string_view s, const regex &r) {
return {
string_view_iterator(
cregex_token_iterator(s.begin(), s.end(), r, -1),
stringfier
),
string_view_iterator()
};
}
int main() {
const regex r(" +");
for (size_t i = 0; i < 1000000; ++i) {
split("a b c", r);
}
}
Timing:
real 0m0.385s
user 0m0.385s
sys 0m0.000s
Note how faster this is compared to previous results. Of course, it's not filling a vector inside the loop (nor matching anything in advance probably too), but you get a range anyway, which you can range over with range-based for, or even use it to fill a vector.
As ranging over the iterator_range creates string_views over an original string(or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.
Just to compare using this split implementation but actually filling a vector we could do this:
int main() {
const regex r(" +");
vector<string_view> v;
v.reserve(10);
for (size_t i = 0; i < 1000000; ++i) {
copy(split("a b c", r), back_inserter(v));
v.clear();
}
}
This uses boost range copy algorithm to fill the vector in each iteration, the timing is:
real 0m1.002s
user 0m0.997s
sys 0m0.004s
As can be seen, no much difference in comparison with the optimized string_view output param version.
Note also there's a proposal for a std::split that would work like this.

For optimizations, in general, you want to avoid two things:
burning away CPU cycles for unnecessary stuff
waiting idly for something to happen (memory read, disk read, network read, ...)
The two can be antithetic as sometimes it ends up being faster computing something than caching all of it in memory... so it's a game of balance.
Let's analyze your code:
std::vector<std::string> split(const std::string &s){
static const std::regex rsplit(" +"); // only computed once
// search for first occurrence of rsplit
auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
auto rend = std::sregex_token_iterator();
// simultaneously:
// - parses "s" from the second to the past the last occurrence
// - allocates one `std::string` for each match... at least! (there may be a copy)
// - allocates space in the `std::vector`, possibly multiple times
auto res = std::vector<std::string>(rit, rend);
return res;
}
Can we do better ? Well, if we could reuse existing storage instead of keeping allocating and deallocating memory, we should see a significant improvement [1]:
// Overwrites 'result' with the matches, returns the number of matches
// (note: 'result' is never shrunk, but may be grown as necessary)
size_t split(std::string const& s, std::vector<std::string>& result){
static const std::regex rsplit(" +"); // only computed once
auto rit = std::cregex_token_iterator(s.begin(), s.end(), rsplit, -1);
auto rend = std::cregex_token_iterator();
size_t pos = 0;
// As long as possible, reuse the existing strings (in place)
for (size_t max = result.size();
rit != rend && pos != max;
++rit, ++pos)
{
result[pos].assign(rit->first, rit->second);
}
// When more matches than existing strings, extend capacity
for (; rit != rend; ++rit, ++pos) {
result.emplace_back(rit->first, rit->second);
}
return pos;
} // split
In the test that you perform, where the number of submatches is constant across iterations, this version is unlikely to be beaten: it will only allocate memory on the first run (both for rsplit and result) and then keep reusing existing memory.
[1]: Disclaimer, I have only proved this code to be correct I have not tested it (as Donald Knuth would say).

How about this verion? It is no regex, but it solves the split pretty fast ...
#include <vector>
#include <string>
#include <algorithm>
size_t split2(const std::string& s, std::vector<std::string>& result)
{
size_t count = 0;
result.clear();
std::string::const_iterator p1 = s.cbegin();
std::string::const_iterator p2 = p1;
bool run = true;
do
{
p2 = std::find(p1, s.cend(), ' ');
result.push_back(std::string(p1, p2));
++count;
if (p2 != s.cend())
{
p1 = std::find_if(p2, s.cend(), [](char c) -> bool
{
return c != ' ';
});
}
else run = false;
} while (run);
return count;
}
int main()
{
std::vector<std::string> v;
std::string s = "a b c";
for (auto i = 0; i < 100000; ++i)
split2(s, v);
return 0;
}
$ time splittest.exe
real 0m0.132s
user 0m0.000s
sys 0m0.109s

I would say C++11 regex is much slower than perl and possibly than python.
To measure properly the performance it is better to do the tests
using some not trivial expression or else you are measuring everything
but the regex itself.
For example, to compare C++11 and perl
C++11 test code
#include <iostream>
#include <string>
#include <regex>
#include <chrono>
int main ()
{
int veces = 10000;
int count = 0;
std::regex expres ("([^-]*)-([^-]*)-(\\d\\d\\d:\\d\\d)---(.*)");
std::string text ("some-random-text-453:10--- etc etc blah");
std::smatch macha;
auto start = std::chrono::system_clock::now();
for (int ii = 0; ii < veces; ii ++)
count += std::regex_search (text, macha, expres);
auto milli = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();
std::cout << count << "/" << veces << " matches " << milli << " ms --> " << (milli / (float) veces) << " ms per regex_search" << std::endl;
return 0;
}
in my computer compiling with gcc 4.9.3 I get the ouput
10000/10000 matches 1704 ms --> 0.1704 ms per regex_search
perl test code
use Time::HiRes qw/ time sleep /;
my $veces = 1000000;
my $conta = 0;
my $phrase = "some-random-text-453:10--- etc etc blah";
my $start = time;
for (my $ii = 0; $ii < $veces; $ii++)
{
if ($phrase =~ m/([^-]*)-([^-]*)-(\d\d\d:\d\d)---(.*)/)
{
$conta = $conta + 1;
}
}
my $elapsed = (time - $start) * 1000.0;
print $conta . "/" . $veces . " matches " . $elapsed . " ms --> " . ($elapsed / $veces) . " ms per regex\n";
using perl v5.8.8 again in my computer
1000000/1000000 matches 2929.55303192139 ms --> 0.00292955303192139 ms per regex
So in this test the ratio C++11 / perl is
0.1704 / 0.002929 = 58.17 times slower than perl
in real scenarios I get ratios about 100 to 200 times slower.
So, for example parsing a big file with a million of lines takes
about a second for perl while it can take more minutes(!) for a C++11
program using regex.

How do I tokenize a string in C++?

Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?

The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}

Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}

C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);

Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is
really useful.

Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)

You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}

A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}

No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)

Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""

This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
Try it out live!

I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation

Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}

pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");

I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted else where on the stackoverflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}

Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20 provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example

Check this example. It might help you..
#include <iostream>
#include <sstream>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the dellimiter is the space");
while (is.good ()) {
is >> tmps;
cout << tmps << "\n";
}
return 0;
}

If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
this will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.

MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting Token: First
Resulting Token: Second
Resulting Token: Third

If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.

For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
unsigned prev_pos = 0;
unsigned pos = 0;
unsigned number_of_tokens = 0;
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, camera's). I never use this function for something more complicated or time-critical than reading external configuration files on startup.

You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).

Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}

I thought that was what the >> operator on string streams was for:
string word; sin >> word;

Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}

Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<int>(c)];
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ

I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator

you can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}

Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}

I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}

If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens withing the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.
Header file:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output:
Item1
Item2
Item3

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Rcpp - Capture result of sregex_token_iterator to vector - c++

Related

string to numbers (vector array of int type)conversion

Split string using multiple string delimiters

regex replace with callback in c++11?

c++11 regex slower than python

How do I tokenize a string in C++?

Categories

Resources