This question already has answers here:
How do I iterate over the words of a string?
(84 answers)
Closed 4 years ago.
If I have a std::string containing a comma-separated list of numbers, what's the simplest way to parse out the numbers and put them in an integer array?
I don't want to generalise this out into parsing anything else. Just a simple string of comma separated integer numbers such as "1,1,1,1,2,1,1,1,0".
Input one number at a time, and check whether the following character is ,. If so, discard it.
#include <vector>
#include <string>
#include <sstream>
#include <iostream>
int main()
{
std::string str = "1,2,3,4,5,6";
std::vector<int> vect;
std::stringstream ss(str);
for (int i; ss >> i;) {
vect.push_back(i);
if (ss.peek() == ',')
ss.ignore();
}
for (std::size_t i = 0; i < vect.size(); i++)
std::cout << vect[i] << std::endl;
}
Something less verbose, std and takes anything separated by a comma.
stringstream ss( "1,1,1,1, or something else ,1,1,1,0" );
vector<string> result;
while( ss.good() )
{
string substr;
getline( ss, substr, ',' );
result.push_back( substr );
}
Yet another, rather different, approach: use a special locale that treats commas as white space:
#include <locale>
#include <vector>
struct csv_reader: std::ctype<char> {
csv_reader(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());
rc[','] = std::ctype_base::space;
rc['\n'] = std::ctype_base::space;
rc[' '] = std::ctype_base::space;
return &rc[0];
}
};
To use this, you imbue() a stream with a locale that includes this facet. Once you've done that, you can read numbers as if the commas weren't there at all. Just for example, we'll read comma-delimited numbers from input, and write then out one-per line on standard output:
#include <algorithm>
#include <iterator>
#include <iostream>
int main() {
std::cin.imbue(std::locale(std::locale(), new csv_reader()));
std::copy(std::istream_iterator<int>(std::cin),
std::istream_iterator<int>(),
std::ostream_iterator<int>(std::cout, "\n"));
return 0;
}
The C++ String Toolkit Library (Strtk) has the following solution to your problem:
#include <string>
#include <deque>
#include <vector>
#include "strtk.hpp"
int main()
{
std::string int_string = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15";
std::vector<int> int_list;
strtk::parse(int_string,",",int_list);
std::string double_string = "123.456|789.012|345.678|901.234|567.890";
std::deque<double> double_list;
strtk::parse(double_string,"|",double_list);
return 0;
}
More examples can be found Here
Alternative solution using generic algorithms and Boost.Tokenizer:
struct ToInt
{
int operator()(string const &str) { return atoi(str.c_str()); }
};
string values = "1,2,3,4,5,9,8,7,6";
vector<int> ints;
tokenizer<> tok(values);
transform(tok.begin(), tok.end(), back_inserter(ints), ToInt());
Lots of pretty terrible answers here so I'll add mine (including test program):
#include <string>
#include <iostream>
#include <cstddef>
template<typename StringFunction>
void splitString(const std::string &str, char delimiter, StringFunction f) {
std::size_t from = 0;
for (std::size_t i = 0; i < str.size(); ++i) {
if (str[i] == delimiter) {
f(str, from, i);
from = i + 1;
}
}
if (from <= str.size())
f(str, from, str.size());
}
int main(int argc, char* argv[]) {
if (argc != 2)
return 1;
splitString(argv[1], ',', [](const std::string &s, std::size_t from, std::size_t to) {
std::cout << "`" << s.substr(from, to - from) << "`\n";
});
return 0;
}
Nice properties:
No dependencies (e.g. boost)
Not an insane one-liner
Easy to understand (I hope)
Handles spaces perfectly fine
Doesn't allocate splits if you don't want to, e.g. you can process them with a lambda as shown.
Doesn't add characters one at a time - should be fast.
If using C++17 you could change it to use a std::stringview and then it won't do any allocations and should be extremely fast.
Some design choices you may wish to change:
Empty entries are not ignored.
An empty string will call f() once.
Example inputs and outputs:
"" -> {""}
"," -> {"", ""}
"1," -> {"1", ""}
"1" -> {"1"}
" " -> {" "}
"1, 2," -> {"1", " 2", ""}
" ,, " -> {" ", "", " "}
You could also use the following function.
void tokenize(const string& str, vector<string>& tokens, const string& delimiters = ",")
{
// Skip delimiters at beginning.
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find first non-delimiter.
string::size_type pos = str.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the vector.
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters.
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter.
pos = str.find_first_of(delimiters, lastPos);
}
}
std::string input="1,1,1,1,2,1,1,1,0";
std::vector<long> output;
for(std::string::size_type p0=0,p1=input.find(',');
p1!=std::string::npos || p0!=std::string::npos;
(p0=(p1==std::string::npos)?p1:++p1),p1=input.find(',',p0) )
output.push_back( strtol(input.c_str()+p0,NULL,0) );
It would be a good idea to check for conversion errors in strtol(), of course. Maybe the code may benefit from some other error checks as well.
I'm surprised no one has proposed a solution using std::regex yet:
#include <string>
#include <algorithm>
#include <vector>
#include <regex>
void parse_csint( const std::string& str, std::vector<int>& result ) {
typedef std::regex_iterator<std::string::const_iterator> re_iterator;
typedef re_iterator::value_type re_iterated;
std::regex re("(\\d+)");
re_iterator rit( str.begin(), str.end(), re );
re_iterator rend;
std::transform( rit, rend, std::back_inserter(result),
[]( const re_iterated& it ){ return std::stoi(it[1]); } );
}
This function inserts all integers at the back of the input vector. You can tweak the regular expression to include negative integers, or floating point numbers, etc.
#include <sstream>
#include <vector>
const char *input = "1,1,1,1,2,1,1,1,0";
int main() {
std::stringstream ss(input);
std::vector<int> output;
int i;
while (ss >> i) {
output.push_back(i);
ss.ignore(1);
}
}
Bad input (for instance consecutive separators) will mess this up, but you did say simple.
string exp = "token1 token2 token3";
char delimiter = ' ';
vector<string> str;
string acc = "";
for(int i = 0; i < exp.size(); i++)
{
if(exp[i] == delimiter)
{
str.push_back(acc);
acc = "";
}
else
acc += exp[i];
}
bool GetList (const std::string& src, std::vector<int>& res)
{
using boost::lexical_cast;
using boost::bad_lexical_cast;
bool success = true;
typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
boost::char_separator<char> sepa(",");
tokenizer tokens(src, sepa);
for (tokenizer::iterator tok_iter = tokens.begin();
tok_iter != tokens.end(); ++tok_iter) {
try {
res.push_back(lexical_cast<int>(*tok_iter));
}
catch (bad_lexical_cast &) {
success = false;
}
}
return success;
}
I cannot yet comment (getting started on the site) but added a more generic version of Jerry Coffin's fantastic ctype's derived class to his post.
Thanks Jerry for the super idea.
(Because it must be peer-reviewed, adding it here too temporarily)
struct SeparatorReader: std::ctype<char>
{
template<typename T>
SeparatorReader(const T &seps): std::ctype<char>(get_table(seps), true) {}
template<typename T>
std::ctype_base::mask const *get_table(const T &seps) {
auto &&rc = new std::ctype_base::mask[std::ctype<char>::table_size]();
for(auto &&sep: seps)
rc[static_cast<unsigned char>(sep)] = std::ctype_base::space;
return &rc[0];
}
};
This is the simplest way, which I used a lot. It works for any one-character delimiter.
#include<bits/stdc++.h>
using namespace std;
int main() {
string str;
cin >> str;
int temp;
vector<int> result;
char ch;
stringstream ss(str);
do
{
ss>>temp;
result.push_back(temp);
}while(ss>>ch);
for(int i=0 ; i < result.size() ; i++)
cout<<result[i]<<endl;
return 0;
}
simple structure, easily adaptable, easy maintenance.
std::string stringIn = "my,csv,,is 10233478,separated,by commas";
std::vector<std::string> commaSeparated(1);
int commaCounter = 0;
for (int i=0; i<stringIn.size(); i++) {
if (stringIn[i] == ",") {
commaSeparated.push_back("");
commaCounter++;
} else {
commaSeparated.at(commaCounter) += stringIn[i];
}
}
in the end you will have a vector of strings with every element in the sentence separated by spaces. empty strings are saved as separate items.
Simple Copy/Paste function, based on the boost tokenizer.
void strToIntArray(std::string string, int* array, int array_len) {
boost::tokenizer<> tok(string);
int i = 0;
for(boost::tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
if(i < array_len)
array[i] = atoi(beg->c_str());
i++;
}
void ExplodeString( const std::string& string, const char separator, std::list<int>& result ) {
if( string.size() ) {
std::string::const_iterator last = string.begin();
for( std::string::const_iterator i=string.begin(); i!=string.end(); ++i ) {
if( *i == separator ) {
const std::string str(last,i);
int id = atoi(str.c_str());
result.push_back(id);
last = i;
++ last;
}
}
if( last != string.end() ) result.push_back( atoi(&*last) );
}
}
#include <sstream>
#include <vector>
#include <algorithm>
#include <iterator>
const char *input = ",,29870,1,abc,2,1,1,1,0";
int main()
{
std::stringstream ss(input);
std::vector<int> output;
int i;
while ( !ss.eof() )
{
int c = ss.peek() ;
if ( c < '0' || c > '9' )
{
ss.ignore(1);
continue;
}
if (ss >> i)
{
output.push_back(i);
}
}
std::copy(output.begin(), output.end(), std::ostream_iterator<int> (std::cout, " ") );
return 0;
}
Related
I've been looking for ways to count the number of words in a string, but specifically for strings that may contain typos (i.e. "_This_is_a___test" as opposed to "This_is_a_test"). Most of the pages I've looked at only handle single spaces.
This is actually my first time programming in C++, and I don't have much other programming experience to speak of (2 years of college in C and Java). Although what I have is functional, I'm also aware it's complex, and I'm wondering if there is a more efficient way to achieve the same results?
This is what I have currently. Before I run the string through numWords(), I run it through a trim function that removes leading whitespace, then check that there are still characters remaining.
int numWords(string str) {
int count = 1;
for (int i = 0; i < str.size(); i++) {
if (str[i] == ' ' || str[i] == '\t' || str[i] == '\n') {
bool repeat = true;
int j = 1;
while (j < (str.size() - i) && repeat) {
if (str[i + j] != ' ' && str[i + j] != '\t' && str[i + j] != '\n') {
repeat = false;
i = i + j;
count++;
}
else
j++;
}
}
}
return count;
}
Also, I wrote mine to take a string argument, but most of the examples I've seen used (char* str) instead, which I wasn't sure how to use with my input string.
You don't need all those stringstreams to count word boundary
#include <string>
#include <cctype>
int numWords(std::string str)
{
bool space = true; // not in word
int count = 0;
for(auto c:str){
if(std::isspace(c))space=true;
else{
if(space)++count;
space=false;
}
}
return count;
}
One solution is to utilize std::istringstream to count the number of words and to skip over spaces automatically.
#include <sstream>
#include <string>
#include <iostream>
int numWords(std::string str)
{
int count = 0;
std::istringstream strm(str);
std::string word;
while (strm >> word)
++count;
return count;
}
int main()
{
std::cout << numWords(" This is a test ");
}
Output:
4
Albeit as mentioned std::istringstream is more "heavier" in terms of performance than writing your own loop.
Sam's comment made me write a function that does not allocate strings for words. But just creates string_views on the input string.
#include <cassert>
#include <cctype>
#include <vector>
#include <string_view>
#include <iostream>
std::vector<std::string_view> get_words(const std::string& input)
{
std::vector<std::string_view> words;
// the first word begins at an alpha character
auto begin_of_word = std::find_if(input.begin(), input.end(), [](const char c) { return std::isalpha(c); });
auto end_of_word = input.begin();
auto end_of_input = input.end();
// parse the whole string
while (end_of_word != end_of_input)
{
// as long as you see text characters move end_of_word one back
while ((end_of_word != end_of_input) && std::isalpha(*end_of_word)) end_of_word++;
// create a string view from begin of word to end of word.
// no new string memory will be allocated
// std::vector will do some dynamic memory allocation to store string_view (metadata of word positions)
words.emplace_back(begin_of_word, end_of_word);
// then skip all non readable characters.
while ((end_of_word != end_of_input) && !std::isalpha(*end_of_word) ) end_of_word++;
// and if we haven't reached the end then we are at the beginning of a new word.
if ( end_of_word != input.end()) begin_of_word = end_of_word;
}
return words;
}
int main()
{
std::string input{ "This, this is a test!" };
auto words = get_words(input);
for (const auto& word : words)
{
std::cout << word << "\n";
}
return 0;
}
You can use standard function std::distance with std::istringstream the following way
#include <iostream>
#include <sstream>
#include <string>
#include <iterator>
int main()
{
std::string s( " This is a test" );
std::istringstream iss( s );
auto count = std::distance( std::istream_iterator<std::string>( iss ),
std::istream_iterator<std::string>() );
std::cout << count << '\n';
}
The program output is
4
If you want you can place the call of std::distance in a separate function like
#include <iostream>
#include <sstream>
#include <string>
#include <iterator>
size_t numWords( const std::string &s )
{
std::istringstream iss( s );
return std::distance( std::istream_iterator<std::string>( iss ),
std::istream_iterator<std::string>() );
}
int main()
{
std::string s( " This is a test" );
std::cout << numWords( s ) << '\n';
}
If separators can include other characters apart from white space characters as for example punctuations then you should use methods of the class std::string or std::string_view find_first_of and find_first_not_of.
Here is a demonstration program.
#include <iostream>
#include <string>
#include <string_view>
size_t numWords( const std::string_view s, std::string_view delim = " \t" )
{
size_t count = 0;
for ( std::string_view::size_type pos = 0;
( pos = s.find_first_not_of( delim, pos ) ) != std::string_view::npos;
pos = s.find_first_of( delim, pos ) )
{
++count;
}
return count;
}
int main()
{
std::string s( "Is it a test ? Yes ! Now we will run it ..." );
std::cout << numWords( s, " \t!?.," ) << '\n';
}
The program output is
10
you can do it easily with regex
int numWords(std::string str)
{
std::regex re("\\S+"); // or `[^ \t\n]+` to exactly match the question
return std::distance(
std::sregex_iterator(str.begin(), str.end(), re),
std::sregex_iterator()
);
}
I want to split a string by any occurrence of and.
First of all I have to make it clear that I do not intend to use any regex as a delimiter.
I run the following code:
#include <iostream>
#include <regex>
#include <boost/algorithm/string.hpp>
int main()
{
std::vector<std::string> results;
std::string text=
"Alexievich, Svetlana and Lindahl,Tomas and Campbell,William";
boost::split(
results,
text,
boost::is_any_of(" and "),
boost::token_compress_off
);
for(auto result:results)
{
std::cout<<result<<"\n";
}
return 0;
}
and the results are different from what I expect:
Alexievich,
Svetl
Li
hl,Tom
s
C
mpbell,Willi
m
It seems every character in the delimiter acts separately while I need to have the whole and as a delimiter.
Please do not link to this boost example unless you are sure that it will work for my case.
<algorithm> contains search - right tool for this task.
vector<string> results;
const string text{ "Alexievich, Svetlana and Lindahl,Tomas and Campbell,William" };
const string delim{ " and " };
for (auto p = cbegin(text); p != cend(text); ) {
const auto n = search(p, cend(text), cbegin(delim), cend(delim));
results.emplace_back(p, n);
p = n;
if (cend(text) != n) // we found delim, skip over it.
p += delim.length();
}
The old-fashioned way:
#include <iostream>
#include <string>
#include <vector>
int main()
{
std::vector<std::string> results;
std::string text=
"Alexievich, Svetlana and Lindahl,Tomas and Campbell,William";
size_t pos = 0;
for (;;) {
size_t next = text.find("and", pos);
results.push_back(text.substr(pos, next - pos));
if (next == std::string::npos) break;
pos = next + 3;
}
for(auto result:results)
{
std::cout<<result<<"\n";
}
return 0;
}
Packaging into a reusable function is left as an exercise for the reader.
I am trying to spit a string in C++ with the following way:
#include <bitset>
#include <iostream>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/timer.hpp>
using namespace std;
size_t const N = 10000000;
typedef string::const_iterator iter;
typedef boost::iterator_range<iter> string_view;
template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
C output;
bitset<255> delims;
while (*d)
{
unsigned char code = *d++;
delims[code] = true;
}
typedef string::const_iterator iter;
iter beg;
bool in_token = false;
bool go = false;
for (string::const_iterator it = s.begin(), end = s.end(); it != end; ++it)
{
if (delims[*it])
{
if (in_token)
{
output.push_back(typename C::value_type(beg, it));
in_token = false;
}
}
else if (!in_token)
{
beg = it;
in_token = true;
}
else
{
if (!go)
{
cout << typename C::value_type(beg, it);
//outputs the first character
go = true;
}
}
}
if (in_token)
output.push_back(typename C::value_type(beg, s.end()));
output.swap(ret);
}
vector<string_view> split_string(string in, const char* delim = " ")
{
vector<string_view> vsv;
test_custom(in, delim, vsv);
return vsv;
}
int split()
{
string text = "123 456";
vector<string_view> vsv = split_string(text);
for (int i = 0; i < vsv.size(); i++)
cout << endl << vsv.at(i) << "|" << endl;
return 0;
}
The problem here is the fact that the first character is erased for one reason... The returned string are ' 23' and '456' but I want them to be '123' and '456'
So, the first character is ' ' and not '1'
I'm not familiar with boost::iterator_range, but it sure sounds like a pair of iterators.
If so, then in this code:
vector<string_view> split_string(string in, const char* delim = " ")
{
vector<string_view> vsv;
test_custom(in, delim, vsv);
return vsv;
}
you're returning iterators referring to a local string called in, that has ceased to exist when the function returns.
That's Undefined Behavior.
One fix would be to pass that string by reference.
By the way, one inefficient but simple and safe way to split a string on whitespace is to use a istringstream:
istringstream stream( source_string );
string word;
while( stream >> word ) { cout << word; }
Disclaimer: code untouched by compiler's hands.
Since you're already using Boost, you could use this (the "simple" way :P):
#include <boost/algorithm/string.hpp>
std::string text = "this is sample string";
std::vector<std::string> tokens;
boost::split(tokens, text, boost::is_any_of("\t "));
Or use whatever separator(s) you want to use as third argument.
Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}
Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);
Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is
really useful.
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""
This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
Try it out live!
I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}
pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");
I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted else where on the stackoverflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20 provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example
Check this example. It might help you..
#include <iostream>
#include <sstream>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the dellimiter is the space");
while (is.good ()) {
is >> tmps;
cout << tmps << "\n";
}
return 0;
}
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
this will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.
MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting Token: First
Resulting Token: Second
Resulting Token: Third
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
unsigned prev_pos = 0;
unsigned pos = 0;
unsigned number_of_tokens = 0;
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, camera's). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}
I thought that was what the >> operator on string streams was for:
string word; sin >> word;
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}
Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<int>(c)];
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
you can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}
I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens withing the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.
Header file:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output:
Item1
Item2
Item3
Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}
Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);
Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is
really useful.
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""
This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:
#include <string>
#include <vector>
void tokenize(std::string str, std::vector<string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
Try it out live!
I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}
pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");
I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted else where on the stackoverflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
Live Example
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on
c++20 provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Live Example
Check this example. It might help you..
#include <iostream>
#include <sstream>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the dellimiter is the space");
while (is.good ()) {
is >> tmps;
cout << tmps << "\n";
}
return 0;
}
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
this will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.
MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting Token: First
Resulting Token: Second
Resulting Token: Third
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
unsigned prev_pos = 0;
unsigned pos = 0;
unsigned number_of_tokens = 0;
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, camera's). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}
I thought that was what the >> operator on string streams was for:
string word; sin >> word;
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}
Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<int>(c)];
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
you can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}
I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array"-data structure described Jon Bentley's "Programming Perls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.
Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens withing the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.
Header file:
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ',',
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementattion file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */,
)
{
assert( sep_char != '\0' );
ResetContent();
strncpy( mBuff, line, mMaxLineLen );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output:
Item1
Item2
Item3