Most efficient way to iterate over words in a string - c++

If I wanted to iterate over individual words in a string (separated by whitespace), then the obvious solution would be:
std::istringstream s(myString);
std::string word;
while (s >> word)
do things
However that's quite inefficient. The entire string is copied while initializing the string stream, and then each extracted word is copied one at a time into the word variable (which is close to copying the entire string for a second time). Is there a way to improve on this without manually iterating over each character?

In most cases, copying represents a very small percentage of the overall costs, so having a clean, highly readable code becomes more important. In rare cases when the time profiler tells you that copying creates a bottleneck, you can iterate over characters in the string with some help from the standard library.
One approach that you could take is to iterate with std::string::find_first_of and std::string::find_first_not_of member functions, like this:
const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
std::size_t from = s.find_first_not_of(ws, pos);
if (from == std::string::npos) {
break;
}
std::size_t to = s.find_first_of(ws, from+1);
if (to == std::string::npos) {
to = s.size();
}
// If you want an individual word, copy it with substr.
// The code below simply prints it character-by-character:
std::cout << "'";
for (std::size_t i = from ; i != to ; i++) {
std::cout << s[i];
}
std::cout << "'" << std::endl;
pos = to;
}
Demo.
Unfortunately, the code becomes a lot harder to read, so you should avoid this change, or at least postpone it until it becomes requried.

Using boost string algorithms we can write it as follows.
The loop doesn't involve any copying of the string.
#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
int main()
{
std::string s = "stack over flow";
auto it = boost::make_split_iterator( s, boost::token_finder(
boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
decltype( it ) end;
for( ; it != end; ++it )
{
std::cout << "word: '" << *it << "'\n";
}
return 0;
}
Making it C++11-ish
Since pairs of iterators are so oldschool nowadays, we may use boost.range to define some generic helper functions. These finally allow us to loop over the words using range-for:
#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>
template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;
template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
auto first = boost::make_split_iterator( rng, finder );
decltype( first ) last;
return { first, last };
}
template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}
int main()
{
std::string str = "stack \tover\r\n flow";
for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
{
std::cout << "word: '" << substr << "'\n";
}
return 0;
}
Demo:
http://coliru.stacked-crooked.com/a/2f4b3d34086cc6ec

If you want to have it as fast as possible, you need to fall back to the good old C function strtok() (or its thread-safe companion strtok_r()):
const char* kWhiteSpace = " \t\v\n\r"; //whatever you call white space
char* token = std::strtok(myString.data(), kWhiteSpace);
while(token) {
//do things with token
token = std::strtok(nullptr, kWhiteSpace));
}
Beware that this will clobber the contents of myString: It works by replacing the first delimiter character after each token with a terminating null byte, and returning a pointer to the start of the tokens in turn. This is a legacy C function after all.
However, that weakness is also its strength: It does not perform any copy, nor does it allocate any dynamic memory (which likely is the most time consuming thing in your example code). As such, you won't find a native C++ method that beats strtok()'s speed.

What about spliting the string? You can check this post for more information.
Inside this post there is a detailed answer about how to split a string in tokens. In this answer maybe you could check the second way using iterators and the copy algorithm.

Related

How can you split string in C++ and store them in variables? [duplicate]

Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}
Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);
Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is
really useful.
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""
This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
Try it out live!
I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}
pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");
I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted else where on the stackoverflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20 provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example
Check this example. It might help you..
#include <iostream>
#include <sstream>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the dellimiter is the space");
while (is.good ()) {
is >> tmps;
cout << tmps << "\n";
}
return 0;
}
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
this will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.
MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting Token: First
Resulting Token: Second
Resulting Token: Third
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
unsigned prev_pos = 0;
unsigned pos = 0;
unsigned number_of_tokens = 0;
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, camera's). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}
I thought that was what the >> operator on string streams was for:
string word; sin >> word;
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}
Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<int>(c)];
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
you can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}
I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens withing the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.
Header file:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output:
Item1
Item2
Item3

How to parse a string in c++ [duplicate]

Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}
Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);
Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is
really useful.
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""
This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
Try it out live!
I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}
pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");
I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted else where on the stackoverflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20 provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example
Check this example. It might help you..
#include <iostream>
#include <sstream>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the dellimiter is the space");
while (is.good ()) {
is >> tmps;
cout << tmps << "\n";
}
return 0;
}
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
this will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.
MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting Token: First
Resulting Token: Second
Resulting Token: Third
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
unsigned prev_pos = 0;
unsigned pos = 0;
unsigned number_of_tokens = 0;
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, camera's). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}
I thought that was what the >> operator on string streams was for:
string word; sin >> word;
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}
Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<int>(c)];
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
you can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}
I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens withing the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.
Header file:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output:
Item1
Item2
Item3

How to implode a vector of strings into a string (the elegant way)

I'm looking for the most elegant way to implode a vector of strings into a string. Below is the solution I'm using now:
static std::string& implode(const std::vector<std::string>& elems, char delim, std::string& s)
{
for (std::vector<std::string>::const_iterator ii = elems.begin(); ii != elems.end(); ++ii)
{
s += (*ii);
if ( ii + 1 != elems.end() ) {
s += delim;
}
}
return s;
}
static std::string implode(const std::vector<std::string>& elems, char delim)
{
std::string s;
return implode(elems, delim, s);
}
Is there any others out there?
Use boost::algorithm::join(..):
#include <boost/algorithm/string/join.hpp>
...
std::string joinedString = boost::algorithm::join(elems, delim);
See also this question.
std::vector<std::string> strings;
const char* const delim = ", ";
std::ostringstream imploded;
std::copy(strings.begin(), strings.end(),
std::ostream_iterator<std::string>(imploded, delim));
(include <string>, <vector>, <sstream> and <iterator>)
If you want to have a clean end (no trailing delimiter) have a look here
You should use std::ostringstream rather than std::string to build the output (then you can call its str() method at the end to get a string, so your interface need not change, only the temporary s).
From there, you could change to using std::ostream_iterator, like so:
copy(elems.begin(), elems.end(), ostream_iterator<string>(s, delim));
But this has two problems:
delim now needs to be a const char*, rather than a single char. No big deal.
std::ostream_iterator writes the delimiter after every single element, including the last. So you'd either need to erase the last one at the end, or write your own version of the iterator which doesn't have this annoyance. It'd be worth doing the latter if you have a lot of code that needs things like this; otherwise the whole mess might be best avoided (i.e. use ostringstream but not ostream_iterator).
Because I love one-liners (they are very useful for all kinds of weird stuff, as you'll see at the end), here's a solution using std::accumulate and C++11 lambda:
std::accumulate(alist.begin(), alist.end(), std::string(),
[](const std::string& a, const std::string& b) -> std::string {
return a + (a.length() > 0 ? "," : "") + b;
} )
I find this syntax useful with stream operator, where I don't want to have all kinds of weird logic out of scope from the stream operation, just to do a simple string join. Consider for example this return statement from method that formats a string using stream operators (using std;):
return (dynamic_cast<ostringstream&>(ostringstream()
<< "List content: " << endl
<< std::accumulate(alist.begin(), alist.end(), std::string(),
[](const std::string& a, const std::string& b) -> std::string {
return a + (a.length() > 0 ? "," : "") + b;
} ) << endl
<< "Maybe some more stuff" << endl
)).str();
Update:
As pointed out by #plexando in the comments, the above code suffers from misbehavior when the array starts with empty strings due to the fact that the check for "first run" is missing previous runs that have resulted in no additional characters, and also - it is weird to run a check for "is first run" on all runs (i.e. the code is under-optimized).
The solution for both of these problems is easy if we know for a fact that the list has at least one element. OTOH, if we know for a fact that the list does not have at least one element, then we can shorten the run even more.
I think the resulting code isn't as pretty, so I'm adding it here as The Correct Solution, but I think the discussion above still has merrit:
alist.empty() ? "" : /* leave early if there are no items in the list */
std::accumulate( /* otherwise, accumulate */
++alist.begin(), alist.end(), /* the range 2nd to after-last */
*alist.begin(), /* and start accumulating with the first item */
[](auto& a, auto& b) { return a + "," + b; });
Notes:
For containers that support direct access to the first element, its probably better to use that for the third argument instead, so alist[0] for vectors.
As per the discussion in the comments and chat, the lambda still does some copying. This can be minimized by using this (less pretty) lambda instead: [](auto&& a, auto&& b) -> auto& { a += ','; a += b; return a; }) which (on GCC 10) improves performance by more than x10. Thanks to #Deduplicator for the suggestion. I'm still trying to figure out what is going on here.
I like to use this one-liner accumulate (no trailing delimiter):
(std::accumulate defined in <numeric>)
std::accumulate(
std::next(elems.begin()),
elems.end(),
elems[0],
[](std::string a, std::string b) {
return a + delimiter + b;
}
);
what about simple stupid solution?
std::string String::join(const std::vector<std::string> &lst, const std::string &delim)
{
std::string ret;
for(const auto &s : lst) {
if(!ret.empty())
ret += delim;
ret += s;
}
return ret;
}
With fmt you can do.
#include <fmt/format.h>
auto s = fmt::format("{}",fmt::join(elems,delim));
But I don't know if join will make it to std::format.
string join(const vector<string>& vec, const char* delim)
{
stringstream res;
copy(vec.begin(), vec.end(), ostream_iterator<string>(res, delim));
return res.str();
}
Especially with bigger collections, you want to avoid having to check if youre still adding the first element or not to ensure no trailing separator...
So for the empty or single-element list, there is no iteration at all.
Empty ranges are trivial: return "".
Single element or multi-element can be handled perfectly by accumulate:
auto join = [](const auto &&range, const auto separator) {
if (range.empty()) return std::string();
return std::accumulate(
next(begin(range)), // there is at least 1 element, so OK.
end(range),
range[0], // the initial value
[&separator](auto result, const auto &value) {
return result + separator + value;
});
};
Running sample (require C++14): http://cpp.sh/8uspd
A version that uses std::accumulate:
#include <numeric>
#include <iostream>
#include <string>
struct infix {
std::string sep;
infix(const std::string& sep) : sep(sep) {}
std::string operator()(const std::string& lhs, const std::string& rhs) {
std::string rz(lhs);
if(!lhs.empty() && !rhs.empty())
rz += sep;
rz += rhs;
return rz;
}
};
int main() {
std::string a[] = { "Hello", "World", "is", "a", "program" };
std::string sum = std::accumulate(a, a+5, std::string(), infix(", "));
std::cout << sum << "\n";
}
While I would normally recommend using Boost as per the top answer, I recognise that in some projects that's not desired.
The STL solutions suggested using std::ostream_iterator will not work as intended - it'll append a delimiter at the end.
There is now a way to do this with modern C++ using std::experimental::ostream_joiner:
std::ostringstream outstream;
std::copy(strings.begin(),
strings.end(),
std::experimental::make_ostream_joiner(outstream, delimiter.c_str()));
return outstream.str();
Here's what I use, simple and flexible
string joinList(vector<string> arr, string delimiter)
{
if (arr.empty()) return "";
string str;
for (auto i : arr)
str += i + delimiter;
str = str.substr(0, str.size() - delimiter.size());
return str;
}
using:
string a = joinList({ "a", "bbb", "c" }, "!##");
output:
a!##bbb!##c
Here is another one that doesn't add the delimiter after the last element:
std::string concat_strings(const std::vector<std::string> &elements,
const std::string &separator)
{
if (!elements.empty())
{
std::stringstream ss;
auto it = elements.cbegin();
while (true)
{
ss << *it++;
if (it != elements.cend())
ss << separator;
else
return ss.str();
}
}
return "";
Using part of this answer to another question gives you a joined this, based on a separator without a trailing comma,
Usage:
std::vector<std::string> input_str = std::vector<std::string>({"a", "b", "c"});
std::string result = string_join(input_str, ",");
printf("%s", result.c_str());
/// a,b,c
Code:
std::string string_join(const std::vector<std::string>& elements, const char* const separator)
{
switch (elements.size())
{
case 0:
return "";
case 1:
return elements[0];
default:
std::ostringstream os;
std::copy(elements.begin(), elements.end() - 1, std::ostream_iterator<std::string>(os, separator));
os << *elements.rbegin();
return os.str();
}
}
Another simple and good solution is using ranges v3. The current version is C++14 or greater, but there are older versions that are C++11 or greater. Unfortunately, C++20 ranges don't have the intersperse function.
The benefits of this approach are:
Elegant
Easily handle empty strings
Handles the last element of the list
Efficiency. Because ranges are lazily evaluated.
Small and useful library
Functions breakdown(Reference):
accumulate = Similar to std::accumulate but arguments are a range and the initial value. There is an optional third argument that is the operator function.
filter = Like std::filter, filter the elements that don't fit the predicate.
intersperse = The key function! Intersperses a delimiter between range input elements.
#include <iostream>
#include <string>
#include <vector>
#include <range/v3/numeric/accumulate.hpp>
#include <range/v3/view/filter.hpp>
#include <range/v3/view/intersperse.hpp>
int main()
{
using namespace ranges;
// Can be any std container
std::vector<std::string> a{ "Hello", "", "World", "is", "", "a", "program" };
std::string delimiter{", "};
std::string finalString =
accumulate(a | views::filter([](std::string s){return !s.empty();})
| views::intersperse(delimiter)
, std::string());
std::cout << finalString << std::endl; // Hello, World, is, a, program
}
A possible solution with ternary operator ?:.
std::string join(const std::vector<std::string> & v, const std::string & delimiter = ", ") {
std::string result;
for (size_t i = 0; i < v.size(); ++i) {
result += (i ? delimiter : "") + v[i];
}
return result;
}
join({"2", "4", "5"}) will give you 2, 4, 5.
If you are already using a C++ base library (for commonly used tools), string-processing features are typically included. Besides Boost mentioned above, Abseil provides:
std::vector<std::string> names {"Linus", "Dennis", "Ken"};
std::cout << absl::StrJoin(names, ", ") << std::endl;
Folly provides:
std::vector<std::string> names {"Linus", "Dennis", "Ken"};
std::cout << folly::join(", ", names) << std::endl;
Both give the string "Linus, Dennis, Ken".
Slightly long solution, but doesn't use std::ostringstream, and doesn't require a hack to remove the last delimiter.
http://www.ideone.com/hW1M9
And the code:
struct appender
{
appender(char d, std::string& sd, int ic) : delim(d), dest(sd), count(ic)
{
dest.reserve(2048);
}
void operator()(std::string const& copy)
{
dest.append(copy);
if (--count)
dest.append(1, delim);
}
char delim;
mutable std::string& dest;
mutable int count;
};
void implode(const std::vector<std::string>& elems, char delim, std::string& s)
{
std::for_each(elems.begin(), elems.end(), appender(delim, s, elems.size()));
}
This can be solved using boost
#include <boost/range/adaptor/filtered.hpp>
#include <boost/algorithm/string/join.hpp>
#include <boost/algorithm/algorithm.hpp>
std::vector<std::string> win {"Stack", "", "Overflow"};
const std::string Delimitor{","};
const std::string combined_string =
boost::algorithm::join(win |
boost::adaptors::filtered([](const auto &x) {
return x.size() != 0;
}), Delimitor);
Output:
combined_string: "Stack,Overflow"
I'm using the following approach that works fine in C++17. The function starts checking if the given vector is empty, in which case returns an empty string. If that's not the case, it takes the first element from the vector, then iterates from the second one until the end and appends the separator followed by the vector element.
template <typename T>
std::basic_string<T> Join(std::vector<std::basic_string<T>> vValues,
std::basic_string<T> strDelim)
{
std::basic_string<T> strRet;
typename std::vector<std::basic_string<T>>::iterator it(vValues.begin());
if (it != vValues.end()) // The vector is not empty
{
strRet = *it;
while (++it != vValues.end()) strRet += strDelim + *it;
}
return strRet;
}
Usage example:
std::vector<std::string> v1;
std::vector<std::string> v2 { "Hello" };
std::vector<std::string> v3 { "Str1", "Str2" };
std::cout << "(1): " << Join<char>(v1, ",") << std::endl;
std::cout << "(2): " << Join<char>(v2, "; ") << std::endl;
std::cout << "(3): [" << Join<char>(v3, "] [") << "]" << std::endl;
Output:
(1):
(2): Hello
(3): [Str1] [Str2]

Selective iterator

FYI: no boost, yes it has this, I want to reinvent the wheel ;)
Is there some form of a selective iterator (possible) in C++? What I want is to seperate strings like this:
some:word{or other
to a form like this:
some : word { or other
I can do that with two loops and find_first_of(":") and ("{") but this seems (very) inefficient to me. I thought that maybe there would be a way to create/define/write an iterator that would iterate over all these values with for_each. I fear this will have me writing a full-fledged custom way-too-complex iterator class for a std::string.
So I thought maybe this would do:
std::vector<size_t> list;
size_t index = mystring.find(":");
while( index != std::string::npos )
{
list.push_back(index);
index = mystring.find(":", list.back());
}
std::for_each(list.begin(), list.end(), addSpaces(mystring));
This looks messy to me, and I'm quite sure a more elegant way of doing this exists. But I can't think of it. Anyone have a bright idea? Thanks
PS: I did not test the code posted, just a quick write-up of what I would try
UPDATE: after taking all your answers into account, I came up with this, and it works to my liking :). this does assume the last char is a newline or something, otherwise an ending {,}, or : won't get processed.
void tokenize( string &line )
{
char oneBack = ' ';
char twoBack = ' ';
char current = ' ';
size_t length = line.size();
for( size_t index = 0; index<length; ++index )
{
twoBack = oneBack;
oneBack = current;
current = line.at( index );
if( isSpecial(oneBack) )
{
if( !isspace(twoBack) ) // insert before
{
line.insert(index-1, " ");
++index;
++length;
}
if( !isspace(current) ) // insert after
{
line.insert(index, " ");
++index;
++length;
}
}
}
Comments are welcome as always :)
That's relatively easy using the std::istream_iterator.
What you need to do is define your own class (say Term). Then define how to read a single "word" (term) from the stream using the operator >>.
I don't know your exact definition of a word is, so I am using the following definition:
Any consecutive sequence of alpha numeric characters is a term
Any single non white space character that is also not alpha numeric is a word.
Try this:
#include <string>
#include <sstream>
#include <iostream>
#include <iterator>
#include <algorithm>
class Term
{
public:
// This cast operator is not required but makes it easy to use
// a Term anywhere that a string can normally be used.
operator std::string const&() const {return value;}
private:
// A term is just a string
// And we friend the operator >> to make sure we can read it.
friend std::istream& operator>>(std::istream& inStr,Term& dst);
std::string value;
};
Now all we have to do is define an operator >> that reads a word according to the rules:
// This function could be a lot neater using some boost regular expressions.
// I just do it manually to show it can be done without boost (as requested)
std::istream& operator>>(std::istream& inStr,Term& dst)
{
// Note the >> operator drops all proceeding white space.
// So we get the first non white space
char first;
inStr >> first;
// If the stream is in any bad state the stop processing.
if (inStr)
{
if(std::isalnum(first))
{
// Alpha Numeric so read a sequence of characters
dst.value = first;
// This is ugly. And needs re-factoring.
while((first = insStr.get(), inStr) && std::isalnum(first))
{
dst.value += first;
}
// Take into account the special case of EOF.
// And bad stream states.
if (!inStr)
{
if (!inStr.eof())
{
// The last letter read was not EOF and and not part of the word
// So put it back for use by the next call to read from the stream.
inStr.putback(first);
}
// We know that we have a word so clear any errors to make sure it
// is used. Let the next attempt to read a word (term) fail at the outer if.
inStr.clear();
}
}
else
{
// It was not alpha numeric so it is a one character word.
dst.value = first;
}
}
return inStr;
}
So now we can use it in standard algorithms by just employing the istream_iterator
int main()
{
std::string data = "some:word{or other";
std::stringstream dataStream(data);
std::copy( // Read the stream one Term at a time.
std::istream_iterator<Term>(dataStream),
std::istream_iterator<Term>(),
// Note the ostream_iterator is using a std::string
// This works because a Term can be converted into a string.
std::ostream_iterator<std::string>(std::cout, "\n")
);
}
The output:
> ./a.exe
some
:
word
{
or
other
std::string const str = "some:word{or other";
std::string result;
result.reserve(str.size());
for (std::string::const_iterator it = str.begin(), end = str.end();
it != end; ++it)
{
if (isalnum(*it))
{
result.push_back(*it);
}
else
{
result.push_back(' '); result.push_back(*it); result.push_back(' ');
}
}
Insert version for speed-up
std::string str = "some:word{or other";
for (std::string::iterator it = str.begin(), end = str.end(); it != end; ++it)
{
if (!isalnum(*it))
{
it = str.insert(it, ' ') + 2;
it = str.insert(it, ' ');
end = str.end();
}
}
Note that std::string::insert inserts BEFORE the iterator passed and returns an iterator to the newly inserted character. Assigning is important since the buffer may have been reallocated at another memory location (the iterators are invalidated by the insertion). Also note that you can't keep end for the whole loop, each time you insert you need to recompute it.
a more elegant way of doing this exists.
I do not know how BOOST implements that, but traditional way is by feeding input string character by character into a FSM which detects where tokens (words, symbols) start and end.
I can do that with two loops and find_first_of(":") and ("{")
One loop with std::find_first_of() should suffice.
Though I'm still a huge fan of FSMs for such parsing tasks.
P.S. Similar question
How about something like:
std::string::const_iterator it, end = mystring.end();
for(it = mystring.begin(); it != end; ++it) {
if ( !isalnum( *it ))
list.push_back(it);
}
This way, you'll only iterate once through the string, and isalnum from ctype.h seems to do what you want. Of course, the code above is very simplistic and incomplete and only suggests a solution.
Are you looking to tokenize the input string, ala strtok?
If so, here is a tokenizing function that you can use. It takes an input string and a string of delimiters (each char int he string is a possible delimitter), and it returns a vector of tokens. Each token is a tuple with the delimitted string, and the delimiter used in that case:
#include <cstdlib>
#include <vector>
#include <string>
#include <functional>
#include <iostream>
#include <algorithm>
using namespace std;
// FUNCTION : stringtok(char const* Raw, string sToks)
// PARAMATERS : Raw Pointer to NULL-Terminated string containing a string to be tokenized.
// sToks string of individual token characters -- each character in the string is a token
// DESCRIPTION : Tokenizes a string, much in the same was as strtok does. The input string is not modified. The
// function is called once to tokenize a string, and all the tokens are retuned at once.
// RETURNS : Returns a vector of strings. Each element in the vector is one token. The token character is
// not included in the string. The number of elements in the vector is N+1, where N is the number
// of times the Token character is found in the string. If one token is an empty string (as with the
// string "string1##string3", where the token character is '#'), then that element in the vector
// is an empty string.
// NOTES :
//
typedef pair<char,string> token; // first = delimiter, second = data
inline vector<token> tokenize(const string& str, const string& delims, bool bCaseSensitive=false) // tokenizes a string, returns a vector of tokens
{
bCaseSensitive;
// prologue
vector<token> vRet;
// tokenize input string
for( string::const_iterator itA = str.begin(), it=itA; it != str.end(); it = find_first_of(++it,str.end(),delims.begin(),delims.end()) )
{
// prologue
// find end of token
string::const_iterator itEnd = find_first_of(it+1,str.end(),delims.begin(),delims.end());
// add string to output
if( it == itA ) vRet.push_back(make_pair(0,string(it,itEnd)));
else vRet.push_back(make_pair(*it,string(it+1,itEnd)));
// epilogue
}
// epilogue
return vRet;
}
using namespace std;
int main()
{
string input = "some:word{or other";
typedef vector<token> tokens;
tokens toks = tokenize(input.c_str(), " :{");
cout << "Input: '" << input << " # Tokens: " << toks.size() << "'\n";
for( tokens::iterator it = toks.begin(); it != toks.end(); ++it )
{
cout << " Token : '" << it->second << "', Delimiter: '" << it->first << "'\n";
}
return 0;
}

How do I tokenize a string in C++?

Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}
Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);
Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is
really useful.
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""
This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
Try it out live!
I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}
pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");
I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted else where on the stackoverflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20 provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example
Check this example. It might help you..
#include <iostream>
#include <sstream>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the dellimiter is the space");
while (is.good ()) {
is >> tmps;
cout << tmps << "\n";
}
return 0;
}
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
this will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.
MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting Token: First
Resulting Token: Second
Resulting Token: Third
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
unsigned prev_pos = 0;
unsigned pos = 0;
unsigned number_of_tokens = 0;
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, camera's). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}
I thought that was what the >> operator on string streams was for:
string word; sin >> word;
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}
Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<int>(c)];
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
you can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}
I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens withing the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.
Header file:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output:
Item1
Item2
Item3