As far as I understand, strtok() doesn't modify the underlying string, so why does it take a char* rather than a const char*? Also, while tokenizing you wouldn't want your string to change, right?
Updated:
https://godbolt.org/z/3SPvRB
It is clear that strtok() does modify the underlying string. What is the alternative for a non-mutating tokenizer?
But strtok DOES change the string.
Take the following code:
char sz[] = "The quick brown fox";
char* token = strtok(sz, " ");
It's going to alter the contents of the array into:
"The\0quick brown fox";
The first discovered delimiter gets replaced with a null char. Internally (via thread-local storage or a global variable), a pointer to the char just past the discovered delimiter is stored, so that a subsequent call to strtok(NULL, " ") will parse the next token from the original string.
It does modify the underlying string. See: http://www.cplusplus.com/reference/cstring/strtok/
This end of the token is automatically replaced by a null-character, and the beginning
of the token is returned by the function.
Proof:
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-");
}
/* note this line... */
printf ("str = \"%s\"\n",str);
return 0;
}
Prints:
Splitting string "- This, a sample string." into tokens:
This
a
sample
string
str = "- This"
Updated: https://godbolt.org/z/3SPvRB It is clear that strtok() does
modify the underlying string. What is the alternative for an
non-mutating tokenizer?
As mentioned in the comments, you can either:
make a copy of the original string and then tokenize the copy with strtok(); or
write your own implementation that brackets the tokens and copies the tokens to new storage:
using C strspn to scan forward to the first non-delimiter character which will be the beginning of the token, then use strcspn to scan forward to the next delimiter marking the end of the token,
do the same thing manually with a pair of pointers; or
for C++11 or later, you can use .find_first_not_of() to scan forward to the first non-delimiter character, and then .find_first_of() to locate the delimiter that follows the token.
In each case you will then copy the token characters to a new string (using memcpy for C-type implementation -- don't forget to nul-terminate) or for C++11 simply using the .substr() member function.
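As a rough sketch of the strspn/strcspn route described above (the function name tokenize_c is made up for illustration, and std::string is used as the new storage instead of memcpy into a char buffer), something like this leaves the input untouched:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): collect tokens from a
// read-only C string without writing to it.
std::vector<std::string> tokenize_c(const char* s, const char* delims)
{
    std::vector<std::string> tokens;
    while (*s != '\0') {
        s += std::strspn(s, delims);               // skip leading delimiters
        std::size_t len = std::strcspn(s, delims); // length of the token
        if (len == 0)
            break;                                 // only delimiters were left
        tokens.emplace_back(s, len);               // copy len chars into new storage
        s += len;
    }
    return tokens;
}
```

Because the parameter is a const char*, this can be called on a string literal, which strtok() cannot safely do.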
A very-basic C++11 implementation would look similar to:
std::vector<std::string> stringtok (const std::string& s, const std::string& delim)
{
std::vector<std::string> v {}; /* vector of strings for tokens */
size_t beg = 0, end = 0; /* begin and end positions in str */
/* while non-delimiter char found */
while ((beg = s.find_first_not_of (delim, end)) != std::string::npos) {
end = s.find_first_of (delim, beg); /* find delim after non-delim */
v.push_back (s.substr (beg, end - beg)); /* add substr to vector */
if (end == std::string::npos) /* if last delim, break */
break;
}
return v; /* return vector of tokens */
}
If you follow the logic, it tracks exactly what is described above the function definition. Combining it into a short example, you would have:
#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> stringtok (const std::string& s, const std::string& delim)
{
std::vector<std::string> v {}; /* vector of strings for tokens */
size_t beg = 0, end = 0; /* begin and end positions in str */
/* while non-delimiter char found */
while ((beg = s.find_first_not_of (delim, end)) != std::string::npos) {
end = s.find_first_of (delim, beg); /* find delim after non-delim */
v.push_back (s.substr (beg, end - beg)); /* add substr to vector */
if (end == std::string::npos) /* if last delim, break */
break;
}
return v; /* return vector of tokens */
}
int main (void) {
std::string str = " my dog has fleas ",
delim = " ";
std::vector<std::string> tokens;
tokens = stringtok (str, delim);
std::cout << "string: '" << str << "'\ntokens:\n";
for (auto s : tokens)
std::cout << " " << s << '\n';
}
Example Use/Output
$ ./bin/stringtok
string: ' my dog has fleas '
tokens:
my
dog
has
fleas
Note: this is only one of many ways to implement a string tokenization that does not modify the original. Look things over and let me know if you have further questions.
I'm kind of new to C++. I want to make a split function for std::string in C++, like the Java split function in the String class (I don't want to use the Boost library). So I made a custom split function, which is:
using namespace std;
void ofApp::split(string ori, string tokens[], string deli){
// calculate the number of tokens
int length = 0;
int curPos = 0;
do{
curPos = ori.find(deli, curPos) + 1;
length++;
}while(ori.find(deli, curPos) != string::npos);
length++;
// to save tokens, initialize tokens array
tokens = new string[length]; // this is the line I'm suspicious about..
int startPos = 0;
int strLength = 0;
int curIndex = 0;
do{
strLength = ori.find(deli, startPos) - startPos;
tokens[curIndex++] = ori.substr(startPos, strLength);
startPos = ori.find(deli, startPos) + 1;
}while(ori.find(deli, startPos) != string::npos);
tokens[curIndex] = ori.substr(startPos, ori.length() - startPos);
}
First, I thought passing the parameter as string tokens[] is call by reference, so when the function is finished, the tokens[] array will be full of tokens separated by the deli string. But when I call this function like
string str = "abc,def,10.2,dadd,adsf";
string* tokens;
split(str, tokens, ",");
after this, the tokens array is completely empty. My guess is that this happens because of the line
tokens = new string[length];
I think the memory for the tokens array is allocated to a local variable, and when the split function finishes, this memory is freed as the block ends.
When I debug, the split function itself works well: the tokens array is full of tokens, at least inside the split function. I think my guess is right, but how can I solve this problem? Any solution? I think this is not only a matter of std::string arrays; it is a matter of "call by reference".
Requirement
pass a std::string[] type as the function parameter (returning tokens[] is OK too, but I think this will have the same problem)
when the function is finished, the array must be full of tokens
the tokens array length must be calculated in the split function (a function that makes the user calculate the number of tokens is a foolish function). Because of this, memory for the tokens array can't be allocated before the split function call.
Thank you in advance for your great answer!
As @chris suggested, something like the following code should work.
Example Code
#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> split(const std::string& delimiter, const std::string& str)
{
std::vector<std::string> result;
std::size_t prevPos = 0;
while (prevPos != std::string::npos)
{
std::size_t currPos = str.find(delimiter, prevPos);
result.push_back(str.substr(prevPos, currPos - prevPos));
prevPos = currPos;
if (prevPos != std::string::npos)
{
// Skip the delimiter
prevPos += delimiter.size();
}
}
return result;
}
int main()
{
std::string str("this,is,a,test");
std::vector<std::string> splitResult = split(",", str);
for (const auto &s : splitResult)
{
std::cout << "'" << s << "'\n";
}
std::cout << "\n";
str = "this - is - a - test";
splitResult = split(" - ", str);
for (const auto &s : splitResult)
{
std::cout << "'" << s << "'\n";
}
return 0;
}
Example Output
'this'
'is'
'a'
'test'
'this'
'is'
'a'
'test'
I use boost framework, so it could be helpful, but I haven't found a necessary function.
For usual fast splitting I can use:
string str = ...;
vector<string> strs;
boost::split(strs, str, boost::is_any_of("mM"));
but it removes m and M characters.
I also can't simply use a regexp, because it searches the string for the longest match of the given pattern.
P.S. There are a lot of similar questions, but they describe this implementation in other programming languages only.
Untested, but rather than using vector<string>, you could try a vector<boost::iterator_range<std::string::iterator>>, so you get a pair of iterators into the main string for each token. Then iterate from (start of range - 1), as long as the start of the range is not begin() of the main string, to the end of the range.
EDIT: Here is an example:
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>
#include <boost/range/iterator_range.hpp>
int main(void)
{
std::string str = "FooMBarMSFM";
std::vector<boost::iterator_range<std::string::iterator>> tokens;
boost::split(tokens, str, boost::is_any_of("mM"));
for(auto r : tokens)
{
std::string b(r.begin(), r.end());
std::cout << b << std::endl;
if (r.begin() != str.begin())
{
std::string bm(std::prev(r.begin()), r.end());
std::cout << "With token: [" << bm << "]" << std::endl;
}
}
}
What you need goes beyond what split is meant to do. If you want to keep the 'm' or 'M', you could write a special split using strstr, strchr, strtok or a find function, and adapt the code into a more flexible split function.
Here is an example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void split(char *src, const char *separator, char **dest, int *num)
{
char *pNext;
int count = 0;
if (src == NULL || strlen(src) == 0) return;
if (separator == NULL || strlen(separator) == 0) return;
pNext = strtok(src,separator);
while(pNext != NULL)
{
*dest++ = pNext;
++count;
pNext = strtok(NULL,separator);
}
*num = count;
}
Besides, you could try boost::regex.
My current solution is the following (but it is not universal and looks too complex).
I choose one character which can't appear in this string. In my case it is '|'.
string str = ...;
vector<string> strs;
boost::split(strs, str, boost::is_any_of("m"));
str = boost::join(strs, "|m");
boost::split(strs, str, boost::is_any_of("M"));
str = boost::join(strs, "|M");
if (boost::iequals(str.substr(0, 1), "|")) {
str = str.substr(1);
}
boost::split(strs, str, boost::is_any_of("|"));
I add "|" before each m/M symbol, except at the very first position in the string. Then I split the string into substrings on this extra character, which removes it again.
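The same result can probably be had without a placeholder character by cutting the string directly in front of each delimiter. Here is a minimal sketch, assuming plain std::string and no Boost; split_before is a made-up name, not a library function:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: cut `s` in front of every character found in
// `delims`, so each delimiter stays at the start of the piece it begins.
std::vector<std::string> split_before(const std::string& s, const std::string& delims)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (start < s.size()) {
        // Search from start + 1 so a delimiter in the first position of a
        // piece does not produce an empty piece.
        std::size_t next = s.find_first_of(delims, start + 1);
        parts.push_back(s.substr(start, next - start)); // substr clamps the length at the end
        start = next;                                   // npos ends the loop
    }
    return parts;
}
```

For "FooMBarMSFM" with delimiters "mM" this should yield Foo, MBar, MSF and M, and it does not depend on any character being absent from the input.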
I am trying to reverse the order of words in a sentence by maintaining the spaces as below.
[this is my test string] ==> [string test my is this]
I did in a step by step manner as,
[this is my test string] - input string
[gnirts tset ym si siht] - reverse the whole string - in-place
[string test my is this] - reverse the words of the string - in-place
[string test my is this] - string-2 with spaces rearranged
Is there any other method to do this ? Is it also possible to do the last step in-place ?
Your approach is fine. But alternatively you can also do:
Keep scanning the input for words and spaces
If you find a word push it onto stack S
If you find space(s) enqueue the number of spaces into a queue Q
After this is done there will be N words on the stack and N-1 numbers in the queue.
While stack not empty do
print S.pop
if stack is empty break
print Q.dequeue number of spaces
end-while
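The stack-and-queue steps above could be sketched in C++ roughly like this. The function name reverse_words is my own, and the sketch assumes the input neither starts nor ends with a space, so there are N words and N-1 runs of spaces:

```cpp
#include <cstddef>
#include <queue>
#include <stack>
#include <string>

// Sketch of the stack/queue idea: words go on a stack, the length of each
// run of spaces goes into a queue, then both are drained alternately.
// Assumes the input neither starts nor ends with a space.
std::string reverse_words(const std::string& in)
{
    std::stack<std::string> words;
    std::queue<std::size_t> gaps;

    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t j = i;
        if (in[i] == ' ') {
            while (j < in.size() && in[j] == ' ') ++j;
            gaps.push(j - i);                 // remember how many spaces
        } else {
            while (j < in.size() && in[j] != ' ') ++j;
            words.push(in.substr(i, j - i));  // remember the word
        }
        i = j;
    }

    std::string out;
    while (!words.empty()) {
        out += words.top();                   // last word comes out first
        words.pop();
        if (words.empty())
            break;
        out.append(gaps.front(), ' ');        // first gap comes out first
        gaps.pop();
    }
    return out;
}
```

This uses extra storage proportional to the input, so it is not in-place, but it keeps every run of spaces at its original position.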
Here's an approach.
In short, build two lists of tokens you find: one for words, and another for spaces. Then piece together a new string, with the words in reverse order and the spaces in forward order.
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <sstream>
using namespace std;
string test_string = "this is my test string";
int main()
{
// Create 2 vectors of strings. One for words, another for spaces.
typedef vector<string> strings;
strings words, spaces;
// Walk through the input string, and find individual tokens.
// A token is either a word or a contiguous string of spaces.
for( string::size_type pos = 0; pos != string::npos; )
{
// is this a word token or a space token?
bool is_char = test_string[pos] != ' ';
string::size_type pos_end_token = string::npos;
// find the one-past-the-end index for the end of this token
if( is_char )
pos_end_token = test_string.find(' ', pos);
else
pos_end_token = test_string.find_first_not_of(' ', pos);
// pull out this token
string token = test_string.substr(pos, pos_end_token == string::npos ? string::npos : pos_end_token-pos);
// if the token is a word, save it to the list of words.
// if it's a space, save it to the list of spaces
if( is_char )
words.push_back(token);
else
spaces.push_back(token);
// move on to the next token
pos = pos_end_token;
}
// construct the new string using stringstream
stringstream ss;
// walk through both the list of spaces and the list of words,
// keeping in mind that there may be more words than spaces, or vice versa
// construct the new string by first copying the word, then the spaces
strings::const_reverse_iterator it_w = words.rbegin();
strings::const_iterator it_s = spaces.begin();
while( it_w != words.rend() || it_s != spaces.end() )
{
if( it_w != words.rend() )
ss << *it_w++;
if( it_s != spaces.end() )
ss << *it_s++;
}
// pull a `string` out of the results & dump it
string reversed = ss.str();
cout << "Input: '" << test_string << "'" << endl << "Output: '" << reversed << "'" << endl;
}
I would rephrase the problem this way:
The order of the non-space tokens is reversed, but each token keeps its characters in their original order
The 5 non-space tokens ‘this’, ‘is’, ‘my’, ‘test’, ‘string’ get reversed to ‘string’, ‘test’, ‘my’, ‘is’, ‘this’.
Space tokens remain in the original order
The space tokens ‘ ‘, ‘ ‘, ‘ ‘, ‘ ‘ remain in their original order between the reordered non-space tokens.
The following is an O(N) solution [N being the length of the char array]. Unfortunately, it is not in place as the OP wanted, but it does not use an additional stack or queue either -- it uses a separate character array as a working space.
Here is a C-ish pseudo code.
work_array = char array with size of input_array
dst = &work_array[ 0 ]
for( i = 1; ; i++) {
detect i’th non-space token in input_array starting from the back side
if no such token {
break;
}
copy the token starting at dst
advance dst by token_size
detect i’th space-token in input_array starting from the front side
copy the token starting at dst
advance dst by token_size
}
// at this point work_array contains the desired output,
// it can be copied back to input_array and destroyed
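For what it's worth, the pseudocode above might translate to C++ along these lines. This is only a sketch: it uses a std::string as the work array, the function name is invented, and it assumes the input has no leading or trailing spaces:

```cpp
#include <string>

// Sketch of the pseudocode above using std::string as the work array.
// Words are read from the back of the input, runs of spaces from the front.
std::string reverse_words_keep_spaces(const std::string& in)
{
    std::string work;
    work.reserve(in.size());

    std::size_t back = in.size();   // one past the end of the next word to copy
    std::size_t front = 0;          // where the search for the next space run starts

    for (;;) {
        // Detect the next non-space token, scanning from the back.
        while (back > 0 && in[back - 1] == ' ') --back;
        if (back == 0) break;                        // no such token
        std::size_t word_begin = back;
        while (word_begin > 0 && in[word_begin - 1] != ' ') --word_begin;
        work.append(in, word_begin, back - word_begin);
        back = word_begin;

        // Detect the next space token, scanning from the front.
        while (front < in.size() && in[front] != ' ') ++front;
        std::size_t space_end = front;
        while (space_end < in.size() && in[space_end] == ' ') ++space_end;
        work.append(in, front, space_end - front);
        front = space_end;
    }
    return work;
}
```

Each character of the input is visited a constant number of times, which matches the O(N) claim.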
For each word from the first up to the middle one, swap the n-th word with the n-th word from the end.
First use a split function, and then do the swapping.
This pseudocode assumes you don't end the initial string with a blank space, though can be suitably modified for that too.
1. Get string length; allocate equivalent space for final string; set getText=1
2. While pointer doesn't reach position 0 of string,
i.start from end of string, read character by character...
a.if getText=1
...until blank space encountered
b.if getText=0
...until not blank space encountered
ii.back up pointer to previously pointed character
iii.output to final string in reverse
iv.toggle getText
3. Stop
None of the strtok solutions work for your example; see above.
Try this:
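#include <stdlib.h>
#include <string.h>
char *wordrev(char *s)
{
char *y=calloc(1,strlen(s)+1);
char *p=s+strlen(s);
while( p!=s )
if( *--p==' ' )
strcat(y,p+1),strcat(y," "),*p=0;
strcat(y,s); /* s now ends after the first word, which goes last */
strcpy(s,y);
free(y);
return s;
}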
Too bad stl string doesn't implement push_front. Then you could do this with transform().
#include <string>
#include <iostream>
#include <algorithm>
class push_front
{
public:
push_front( std::string& s ) : _s(s) {};
bool operator()(char c) { _s.insert( _s.begin(), c ); return true; };
std::string& _s;
};
int main( int argc, char** argv )
{
std::string s1;
std::string s( "Now is the time for all good men");
for_each( s.begin(), s.end(), push_front(s1) );
std::cout << s << "\n";
std::cout << s1 << "\n";
}
Now is the time for all good men
nem doog lla rof emit eht si woN
Copy each string in the array and print it in reverse order(i--)
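#include <iostream>
#include <string>
using namespace std;
int main()
{
int j=0;
string str;
string copy[80];
int start=0;
int end=0;
cout<<"Enter the String :: ";
getline(cin,str);
cout<<"Entered String is : "<<str<<endl;
for(int i=0;str[i]!='\0';i++)
{
end=str.find(" ",start); // position of the next space, or -1 (npos) if none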
if(end==-1)
{
copy[j]=str.substr(start,(str.length()-start));
break;
}
else
{
copy[j]=str.substr(start,(end-start));
start=end+1;
j++;
i=end;
}
}
for(int s1=j;s1>=0;s1--)
cout<<" "<<copy[s1];
}
I think I'd just tokenize (strtok or CString::Tokenize) the string using the space character. Shove the strings into a vector, then pull them back out in reverse order and concatenate them with a space in between.
FYI: no boost, yes it has this, I want to reinvent the wheel ;)
Is there some form of a selective iterator (possible) in C++? What I want is to separate strings like this:
some:word{or other
to a form like this:
some : word { or other
I can do that with two loops and find_first_of(":") and ("{") but this seems (very) inefficient to me. I thought that maybe there would be a way to create/define/write an iterator that would iterate over all these values with for_each. I fear this will have me writing a full-fledged custom way-too-complex iterator class for a std::string.
So I thought maybe this would do:
std::vector<size_t> list;
size_t index = mystring.find(":");
while( index != std::string::npos )
{
list.push_back(index);
index = mystring.find(":", list.back() + 1); // continue after the last hit
}
std::for_each(list.begin(), list.end(), addSpaces(mystring));
This looks messy to me, and I'm quite sure a more elegant way of doing this exists. But I can't think of it. Anyone have a bright idea? Thanks
PS: I did not test the code posted, just a quick write-up of what I would try
UPDATE: after taking all your answers into account, I came up with this, and it works to my liking :). This does assume the last char is a newline or something; otherwise a trailing {, }, or : won't get processed.
void tokenize( string &line )
{
char oneBack = ' ';
char twoBack = ' ';
char current = ' ';
size_t length = line.size();
for( size_t index = 0; index<length; ++index )
{
twoBack = oneBack;
oneBack = current;
current = line.at( index );
if( isSpecial(oneBack) )
{
if( !isspace(twoBack) ) // insert before
{
line.insert(index-1, " ");
++index;
++length;
}
if( !isspace(current) ) // insert after
{
line.insert(index, " ");
++index;
++length;
}
}
}
}
Comments are welcome as always :)
That's relatively easy using the std::istream_iterator.
What you need to do is define your own class (say Term). Then define how to read a single "word" (term) from the stream using the operator >>.
I don't know what your exact definition of a word is, so I am using the following definition:
Any consecutive sequence of alphanumeric characters is a term
Any single non-whitespace character that is not alphanumeric is also a term.
Try this:
#include <string>
#include <sstream>
#include <iostream>
#include <iterator>
#include <algorithm>
#include <cctype>
class Term
{
public:
// This cast operator is not required but makes it easy to use
// a Term anywhere that a string can normally be used.
operator std::string const&() const {return value;}
private:
// A term is just a string
// And we friend the operator >> to make sure we can read it.
friend std::istream& operator>>(std::istream& inStr,Term& dst);
std::string value;
};
Now all we have to do is define an operator >> that reads a word according to the rules:
// This function could be a lot neater using some boost regular expressions.
// I just do it manually to show it can be done without boost (as requested)
std::istream& operator>>(std::istream& inStr,Term& dst)
{
// Note the >> operator drops all preceding white space.
// So we get the first non white space
char first;
inStr >> first;
// If the stream is in any bad state then stop processing.
if (inStr)
{
if(std::isalnum(static_cast<unsigned char>(first)))
{
// Alpha Numeric so read a sequence of characters
dst.value = first;
// This is ugly. And needs re-factoring.
while((first = inStr.get(), inStr) && std::isalnum(static_cast<unsigned char>(first)))
{
dst.value += first;
}
// Take into account the special case of EOF.
// And bad stream states.
if (inStr)
{
// The last character read was not EOF and not part of the word
// So put it back for use by the next call to read from the stream.
inStr.putback(first);
}
else
{
// We know that we have a word so clear any errors to make sure it
// is used. Let the next attempt to read a word (term) fail at the outer if.
inStr.clear();
}
}
else
{
// It was not alpha numeric so it is a one character word.
dst.value = first;
}
}
return inStr;
}
So now we can use it in standard algorithms by just employing the istream_iterator
int main()
{
std::string data = "some:word{or other";
std::stringstream dataStream(data);
std::copy( // Read the stream one Term at a time.
std::istream_iterator<Term>(dataStream),
std::istream_iterator<Term>(),
// Note the ostream_iterator is using a std::string
// This works because a Term can be converted into a string.
std::ostream_iterator<std::string>(std::cout, "\n")
);
}
The output:
> ./a.exe
some
:
word
{
or
other
std::string const str = "some:word{or other";
std::string result;
result.reserve(str.size());
for (std::string::const_iterator it = str.begin(), end = str.end();
it != end; ++it)
{
if (isalnum(*it))
{
result.push_back(*it);
}
else
{
result.push_back(' '); result.push_back(*it); result.push_back(' ');
}
}
Insert version for speed-up
std::string str = "some:word{or other";
for (std::string::iterator it = str.begin(), end = str.end(); it != end; ++it)
{
if (!isalnum(*it))
{
it = str.insert(it, ' ') + 2;
it = str.insert(it, ' ');
end = str.end();
}
}
Note that std::string::insert inserts BEFORE the iterator passed and returns an iterator to the newly inserted character. Assigning is important since the buffer may have been reallocated at another memory location (the iterators are invalidated by the insertion). Also note that you can't keep end for the whole loop, each time you insert you need to recompute it.
a more elegant way of doing this exists.
I do not know how Boost implements that, but the traditional way is to feed the input string character by character into an FSM which detects where tokens (words, symbols) start and end.
I can do that with two loops and find_first_of(":") and ("{")
One loop with std::find_first_of() should suffice.
Though I'm still a huge fan of FSMs for such parsing tasks.
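To illustrate the single-loop idea mentioned above, here is a minimal sketch. It uses the std::string member find_first_of rather than the std::find_first_of algorithm, and space_out is a made-up name:

```cpp
#include <string>

// Hypothetical helper: put a space on each side of every character from
// `specials`, without doubling a space that is already there.
std::string space_out(const std::string& in, const std::string& specials)
{
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t hit = in.find_first_of(specials, pos);
        out.append(in, pos, hit - pos);          // text before the hit (or the rest)
        if (hit == std::string::npos)
            break;
        if (!out.empty() && out.back() != ' ')
            out += ' ';                          // space before the special char
        out += in[hit];
        if (hit + 1 < in.size() && in[hit + 1] != ' ')
            out += ' ';                          // space after the special char
        pos = hit + 1;
    }
    return out;
}
```

With the input from the question and specials ":{" this should produce some : word { or other. Building a new string avoids the repeated shifting that in-place insert() causes.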
P.S. Similar question
How about something like:
std::string::const_iterator it, end = mystring.end();
for(it = mystring.begin(); it != end; ++it) {
if ( !isalnum( *it ))
list.push_back(it);
}
This way, you'll only iterate once through the string, and isalnum from ctype.h seems to do what you want. Of course, the code above is very simplistic and incomplete and only suggests a solution.
Are you looking to tokenize the input string, ala strtok?
If so, here is a tokenizing function that you can use. It takes an input string and a string of delimiters (each char in the string is a possible delimiter), and it returns a vector of tokens. Each token is a tuple with the delimited string, and the delimiter used in that case:
#include <cstdlib>
#include <vector>
#include <string>
#include <functional>
#include <iostream>
#include <algorithm>
using namespace std;
// FUNCTION : tokenize(const string& str, const string& delims, bool bCaseSensitive)
// PARAMETERS : str string to be tokenized.
// delims string of individual delimiter characters -- each character in the string is a delimiter
// DESCRIPTION : Tokenizes a string, much in the same way as strtok does. The input string is not modified. The
// function is called once to tokenize a string, and all the tokens are returned at once.
// RETURNS : Returns a vector of tokens. Each element is a pair holding the delimiter that preceded the
// token (0 for the first token) and the token text. The delimiter character is not included in
// the text. If two delimiters are adjacent (as with the string "string1##string3", where the
// delimiter is '#'), then the element between them has an empty string.
// bCaseSensitive is currently unused.
// NOTES :
//
typedef pair<char,string> token; // first = delimiter, second = data
inline vector<token> tokenize(const string& str, const string& delims, bool bCaseSensitive=false) // tokenizes a string, returns a vector of tokens
{
bCaseSensitive;
// prologue
vector<token> vRet;
// tokenize input string
for( string::const_iterator itA = str.begin(), it=itA; it != str.end(); it = find_first_of(++it,str.end(),delims.begin(),delims.end()) )
{
// prologue
// find end of token
string::const_iterator itEnd = find_first_of(it+1,str.end(),delims.begin(),delims.end());
// add string to output
if( it == itA ) vRet.push_back(make_pair(0,string(it,itEnd)));
else vRet.push_back(make_pair(*it,string(it+1,itEnd)));
// epilogue
}
// epilogue
return vRet;
}
using namespace std;
int main()
{
string input = "some:word{or other";
typedef vector<token> tokens;
tokens toks = tokenize(input.c_str(), " :{");
cout << "Input: '" << input << " # Tokens: " << toks.size() << "'\n";
for( tokens::iterator it = toks.begin(); it != toks.end(); ++it )
{
cout << " Token : '" << it->second << "', Delimiter: '" << it->first << "'\n";
}
return 0;
}
I have a string that I would like to tokenize.
But the C strtok() function requires my string to be a char*.
How can I do this simply?
I tried:
token = strtok(str.c_str(), " ");
which fails because it turns it into a const char*, not a char*
#include <iostream>
#include <string>
#include <sstream>
int main(){
std::string myText("some-text-to-tokenize");
std::istringstream iss(myText);
std::string token;
while (std::getline(iss, token, '-'))
{
std::cout << token << std::endl;
}
return 0;
}
Or, as mentioned, use boost for more flexibility.
Duplicate the string, tokenize it, then free it.
char *dup = strdup(str.c_str());
token = strtok(dup, " ");
// use token here: it points into dup and becomes invalid after free()
free(dup);
If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.
If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.
And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:
void split(const string& str, const string& delim, vector<string>& parts) {
size_t start, end = 0;
while (end < str.size()) {
start = end;
while (start < str.size() && (delim.find(str[start]) != string::npos)) {
start++; // skip initial whitespace
}
end = start;
while (end < str.size() && (delim.find(str[end]) == string::npos)) {
end++; // skip to end of word
}
if (end-start != 0) { // just ignore zero-length strings.
parts.push_back(string(str, start, end-start));
}
}
}
There is a more elegant solution.
With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.
At this point many fine folks will jump and yell at the screen. But this is the fact. About 2 years ago the library working group decided (meeting at Lillehammer) that just like for std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.
The other concern is whether strtok() increases the size of the string. The MSDN documentation says:
Each call to strtok modifies strToken by inserting a null character after the token returned by that call.
But this is not correct. Actually the function replaces the first occurrence of a separator character with \0. No change in the size of the string. If we have this string:
one-two---three--four
we will end up with
one\0two\0--three\0-four
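To check this on your own machine, a small sketch like the following can help. The helper name show_after_strtok is hypothetical; it tokenizes a private copy and renders each null character as a visible \0:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: run strtok over a private copy of `text` and return
// the copy's bytes afterwards, with every '\0' shown as the two characters \0.
std::string show_after_strtok(const std::string& text, const char* seps)
{
    std::string buf = text;   // strtok writes into buf, not into text
    for (char* t = std::strtok(&buf[0], seps); t != nullptr;
         t = std::strtok(nullptr, seps)) {
        // Tokens are ignored here; only the side effect on buf matters.
    }
    std::string shown;
    for (char c : buf) {      // iterates over all size() bytes, including '\0'
        if (c == '\0')
            shown += "\\0";
        else
            shown += c;
    }
    return shown;
}
```

Called with "one-two---three--four" and "-", this gives one\0two\0--three\0-four: separators are overwritten, nothing is inserted, and the buffer keeps its original size.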
So my solution is very simple:
std::string str("some-text-to-split");
char seps[] = "-";
char *token;
token = strtok( &str[0], seps );
while( token != NULL )
{
/* Do your thing */
token = strtok( NULL, seps );
}
Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer
With C++17, std::string gains a data() overload that returns a pointer to a modifiable buffer, so the string can be used with strtok directly without any hacks:
#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>
int main()
{
::std::string text{"pop dop rop"};
char const * const psz_delimiter{" "};
char * psz_token{::std::strtok(text.data(), psz_delimiter)};
while(nullptr != psz_token)
{
::std::cout << psz_token << ::std::endl;
psz_token = std::strtok(nullptr, psz_delimiter);
}
return EXIT_SUCCESS;
}
output
pop
dop
rop
EDIT: the const_cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().
You should not use strtok(), since it modifies the tokenized string, which may lead to undesired, if not undefined, behaviour, as the C string "belongs" to the string instance.
#include <string>
#include <iostream>
#include <cstring>
int main(int ac, char **av)
{
std::string theString("hello world");
std::cout << theString << " - " << theString.size() << std::endl;
//--- this cast *only* to illustrate the effect of strtok() on std::string
char *token = strtok(const_cast<char *>(theString.c_str()), " ");
std::cout << theString << " - " << theString.size() << std::endl;
return 0;
}
After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.
>./a.out
hello world - 11
helloworld - 11
Therefore you have to resort to a native mechanism, duplication of the string, or a third-party library as previously mentioned.
I suppose the language is C, or C++...
strtok, IIRC, replaces separators with \0. That's why it cannot take a const string.
To work around that "quickly", if the string isn't huge, you can just strdup() it. That is wise if you need to keep the string unaltered (which is what the const suggests...).
On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.
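One possible shape for such a hand-rolled tokenizer, assuming C++17 is available, is to hand back std::string_view slices of the input. This is only a sketch and tokenize_views is an invented name:

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper (C++17): return views into `text` for each token.
// Nothing is copied or modified; the views are valid only while the
// characters behind `text` stay alive and unchanged.
std::vector<std::string_view> tokenize_views(std::string_view text, std::string_view delims)
{
    std::vector<std::string_view> tokens;
    std::size_t beg = 0;
    while ((beg = text.find_first_not_of(delims, beg)) != std::string_view::npos) {
        std::size_t end = text.find_first_of(delims, beg);  // delimiter after the token
        tokens.push_back(text.substr(beg, end - beg));      // substr clamps at the end
        if (end == std::string_view::npos)
            break;                                          // token ran to the end
        beg = end;
    }
    return tokens;
}
```

Since the input is never written to, this works on const strings and string literals. The trade-off is that the tokens are not null-terminated, so they cannot be passed to C functions that expect a char*.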
Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.
First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.
But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.
std::string data("The data I want to tokenize");
// Create a buffer of the correct length:
std::vector<char> buffer(data.size()+1);
// copy the string into the buffer
strcpy(&buffer[0],data.c_str());
// Tokenize
strtok(&buffer[0]," ");
If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.
std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
subbuffer fld = flds.next();
// do something with fld
}
// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');
Typecasting to (char*) got it working for me!
token = strtok((char *)str.c_str(), " ");
Chris's answer is probably fine when using std::string; however in case you want to use std::basic_string<char16_t>, std::getline can't be used. Here is a possible other implementation:
template <class CharT> bool tokenizestring(const std::basic_string<CharT> &input, CharT separator, typename std::basic_string<CharT>::size_type &pos, std::basic_string<CharT> &token) {
if (pos >= input.length()) {
// if input is empty, or ends with a separator, return an empty token when the end has been reached (and return an out-of-bound position so subsequent call won't do it again)
if ((pos == 0) || ((pos > 0) && (pos == input.length()) && (input[pos-1] == separator))) {
token.clear();
pos=input.length()+1;
return true;
}
return false;
}
typename std::basic_string<CharT>::size_type separatorPos=input.find(separator, pos);
if (separatorPos == std::basic_string<CharT>::npos) {
token=input.substr(pos, input.length()-pos);
pos=input.length();
} else {
token=input.substr(pos, separatorPos-pos);
pos=separatorPos+1;
}
return true;
}
Then use it like this:
std::basic_string<char16_t> s;
std::basic_string<char16_t> token;
std::basic_string<char16_t>::size_type tokenPos=0;
while (tokenizestring(s, (char16_t)' ', tokenPos, token)) {
...
}
It fails because str.c_str() returns a pointer to a constant string, but char * strtok (char * str, const char * delimiters ) requires a modifiable (non-const) string. So you need to use const_cast<char *> in order to make it modifiable.
I am giving you a complete but small program to tokenize the string using C strtok() function.
#include <iostream>
#include <string>
#include <string.h>
using namespace std;
int main() {
string s="20#6 5, 3";
// strtok requires a modifiable string as it modifies the supplied string in order to tokenize it
char *str=const_cast< char *>(s.c_str());
char *tok;
tok=strtok(str, "#, " );
int arr[4], i=0;
while(tok!=NULL){
arr[i++]=stoi(tok);
tok=strtok(NULL, "#, " );
}
for(int i=0; i<4; i++) cout<<arr[i]<<endl;
return 0;
}
NOTE: strtok may not be suitable in all situations, as the string passed to the function gets modified by being broken into smaller strings. Please refer to the following for a better understanding of how strtok works.
How strtok works
Added a few print statements to better understand the changes happening to the string in each call to strtok and how it returns tokens.
#include <iostream>
#include <string>
#include <string.h>
using namespace std;
int main() {
string s="20#6 5, 3";
char *str=const_cast< char *>(s.c_str());
char *tok;
cout<<"string: "<<s<<endl;
tok=strtok(str, "#, " );
cout<<"String: "<<s<<"\tToken: "<<tok<<endl;
while(tok!=NULL){
tok=strtok(NULL, "#, " );
// streaming a null char* is undefined behaviour, so print an empty token at the end
cout<<"String: "<<s<<"\t\tToken: "<<(tok ? tok : "")<<endl;
}
return 0;
}
Output:
string: 20#6 5, 3
String: 206 5, 3 Token: 20
String: 2065, 3 Token: 6
String: 2065 3 Token: 5
String: 2065 3 Token: 3
String: 2065 3 Token:
strtok iterates over the string. On the first call it finds the first non-delimiter character (2 in this case) and marks it as the token start, then continues scanning for a delimiter and replaces it with a null character (# gets replaced in the actual string), and returns the start pointer, which points to the token's first character (i.e., it returns the token 20, terminated by the null). On each subsequent call it starts scanning from the next character and returns a token if one is found, else null. It subsequently returns the tokens 6, 5 and 3.