How to replace all occurrences of a character in string? - c++

What is the effective way to replace all occurrences of a character with another character in std::string?

std::string doesn't contain such a function, but you can use the stand-alone std::replace function from the <algorithm> header.
#include <algorithm>
#include <string>
void some_func() {
std::string s = "example string";
std::replace( s.begin(), s.end(), 'x', 'y'); // replace all 'x' to 'y'
}

The question is centered on character replacement, but, as I found this page very useful (especially Konrad's remark), I'd like to share this more generalized implementation, which allows to deal with substrings as well:
std::string ReplaceAll(std::string str, const std::string& from, const std::string& to) {
size_t start_pos = 0;
while((start_pos = str.find(from, start_pos)) != std::string::npos) {
str.replace(start_pos, from.length(), to);
start_pos += to.length(); // Skip the inserted text; handles the case where 'from' is a substring of 'to' (e.g. replacing 'x' with 'yx')
}
return str;
}
Usage:
std::cout << ReplaceAll(std::string("Number Of Beans"), std::string(" "), std::string("_")) << std::endl;
std::cout << ReplaceAll(std::string("ghghjghugtghty"), std::string("gh"), std::string("X")) << std::endl;
std::cout << ReplaceAll(std::string("ghghjghugtghty"), std::string("gh"), std::string("h")) << std::endl;
Outputs:
Number_Of_Beans
XXjXugtXty
hhjhugthty
EDIT:
If performance is a concern, the above can be implemented more suitably by returning nothing (void) and performing the changes in place; that is, by directly modifying the string argument str, passed by reference instead of by value. This avoids an extra, costly copy of the original string.
Code:
static inline void ReplaceAll2(std::string &str, const std::string& from, const std::string& to)
{
// Same inner code...
// No return statement
}
Hope this will be helpful for some others...
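The in-place variant elides its body, so here is a minimal sketch of what it could look like when written out in full. The name ReplaceAllInPlace and the guard against an empty `from` are assumptions added for illustration and are not part of the original answer.

```cpp
#include <string>

// In-place variant: modifies `str` directly instead of returning a copy.
// The empty-`from` guard is an addition; without it the loop would never end,
// because find("") always succeeds.
inline void ReplaceAllInPlace(std::string& str, const std::string& from, const std::string& to) {
    if (from.empty()) {
        return; // nothing sensible to search for
    }
    std::string::size_type start_pos = 0;
    while ((start_pos = str.find(from, start_pos)) != std::string::npos) {
        str.replace(start_pos, from.length(), to);
        start_pos += to.length(); // continue after the inserted text
    }
}
```

Called as ReplaceAllInPlace(s, " ", "_") on a std::string s, it changes s itself, so the caller keeps a copy only if one is actually needed.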

I thought I'd toss in the boost solution as well:
#include <boost/algorithm/string/replace.hpp>
// in place
std::string in_place = "blah#blah";
boost::replace_all(in_place, "#", "@");
// copy
const std::string input = "blah#blah";
std::string output = boost::replace_all_copy(input, "#", "@");

Imagine a large binary blob where all 0x00 bytes shall be replaced by "\1\x30" and all 0x01 bytes by "\1\x31" because the transport protocol allows no \0-bytes.
In cases where:
the replacement and the string to be replaced have different lengths,
there are many occurrences of the string to be replaced within the source string, and
the source string is large,
the provided solutions either cannot be applied (because they replace only single characters) or have a performance problem: they call string::replace repeatedly, and each call may move or copy data on the order of the blob's size.
(I do not know the Boost solution; it may be fine from that perspective.)
This one walks along all occurrences in the source string and builds the new string piece by piece once:
void replaceAll(std::string& source, const std::string& from, const std::string& to)
{
std::string newString;
newString.reserve(source.length()); // avoids a few memory allocations
std::string::size_type lastPos = 0;
std::string::size_type findPos;
while(std::string::npos != (findPos = source.find(from, lastPos)))
{
newString.append(source, lastPos, findPos - lastPos);
newString += to;
lastPos = findPos + from.length();
}
// Care for the rest after last occurrence
newString += source.substr(lastPos);
source.swap(newString);
}

A simple find and replace for a single character would go something like:
s.replace(s.find("x"), 1, "y")
To do this for the whole string, the easy thing to do is to loop until s.find returns npos. I suppose you could also catch the std::out_of_range exception that replace throws when it is handed npos to exit the loop, but that's kind of ugly.
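The loop described above can be sketched as follows. This is only an illustration: the function name replace_each is hypothetical, and it handles the single-character case only.

```cpp
#include <string>

// Replaces every occurrence of the character `from` with `to`,
// calling find repeatedly until it reports npos.
inline void replace_each(std::string& s, char from, char to) {
    std::string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, 1, 1, to); // replace(pos, count, n, ch): one char becomes one copy of `to`
        ++pos;                    // resume the search after the replaced character
    }
}
```

Checking the result of find before calling replace means replace never sees npos, so no exception handling is needed.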

For completeness, here's how to do it with std::regex.
#include <regex>
#include <string>
int main()
{
const std::string s = "example string";
const std::string r = std::regex_replace(s, std::regex("x"), "y");
}

If you're looking to replace more than a single character, and are dealing only with std::string, then this snippet works: it replaces sNeedle in sHaystack with sReplace, and sNeedle and sReplace do not need to be the same size. The while loop replaces all occurrences, from left to right, rather than just the first one found. The search resumes after each replacement, so the loop also terminates when sReplace contains sNeedle; sNeedle must not be empty.
size_t pos = 0;
while ((pos = sHaystack.find(sNeedle, pos)) != std::string::npos) {
sHaystack.replace(pos, sNeedle.size(), sReplace);
pos += sReplace.size(); // skip the replacement so it is not scanned again
}

As Kirill suggested, either use the replace method or iterate along the string replacing each char independently.
Alternatively you can use the find method or find_first_of depending on what you need to do. None of these solutions will do the job in one go, but with a few extra lines of code you ought to make them work for you. :-)
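As a rough sketch of the find_first_of route mentioned above, the following replaces every character that belongs to a given set. The function name replace_any_of and its parameters are made up for illustration.

```cpp
#include <string>

// Replaces every character of `s` that appears in `targets` with `replacement`.
inline void replace_any_of(std::string& s, const std::string& targets, char replacement) {
    std::string::size_type pos = s.find_first_of(targets);
    while (pos != std::string::npos) {
        s[pos] = replacement;
        pos = s.find_first_of(targets, pos + 1); // continue after the character just replaced
    }
}
```

For a single target character, std::replace from <algorithm> remains the shorter option; this form is only useful when several different characters should map to the same replacement.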

What about Abseil StrReplaceAll? From the header file:
// This file defines `absl::StrReplaceAll()`, a general-purpose string
// replacement function designed for large, arbitrary text substitutions,
// especially on strings which you are receiving from some other system for
// further processing (e.g. processing regular expressions, escaping HTML
// entities, etc.). `StrReplaceAll` is designed to be efficient even when only
// one substitution is being performed, or when substitution is rare.
//
// If the string being modified is known at compile-time, and the substitutions
// vary, `absl::Substitute()` may be a better choice.
//
// Example:
//
// std::string html_escaped = absl::StrReplaceAll(user_input, {
// {"&", "&amp;"},
// {"<", "&lt;"},
// {">", "&gt;"},
// {"\"", "&quot;"},
// {"'", "&#39;"}});

#include <iostream>
#include <string>
using namespace std;
// Replace function..
string replace(string word, string target, string replacement){
int len, loop=0;
string nword="", let;
len=word.length();
len--;
while(loop<=len){
let=word.substr(loop, 1);
if(let==target){
nword=nword+replacement;
}else{
nword=nword+let;
}
loop++;
}
return nword;
}
//Main..
int main() {
string word;
cout<<"Enter Word: ";
cin>>word;
cout<<replace(word, "x", "y")<<endl;
return 0;
}

Old School :-)
std::string str = "H:/recursos/audio/youtube/libre/falta/";
for (std::size_t i = 0; i < str.size(); i++) {
if (str[i] == '/') {
str[i] = '\\';
}
}
std::cout << str;
Result:
H:\recursos\audio\youtube\libre\falta\

For simple situations this works pretty well without using any library other than std::string (which is already in use).
Replace all occurrences of character a with character b in some_string:
for (size_t i = 0; i < some_string.size(); ++i) {
if (some_string[i] == 'a') {
some_string.replace(i, 1, "b");
}
}
If the string is large or multiple calls to replace is an issue, you can apply the technique mentioned in this answer: https://stackoverflow.com/a/29752943/3622300

Here's a solution I rolled, in a maximal DRI spirit.
It searches for sNeedle in sHaystack and replaces it with sReplace,
nTimes if non-zero, else all occurrences of sNeedle.
It does not search again in the replaced text.
std::string str_replace(
std::string sHaystack, std::string sNeedle, std::string sReplace,
size_t nTimes=0)
{
size_t found = 0, pos = 0, c = 0;
size_t len = sNeedle.size();
size_t replen = sReplace.size();
std::string input(sHaystack);
do {
found = input.find(sNeedle, pos);
if (found == std::string::npos) {
break;
}
input.replace(found, len, sReplace);
pos = found + replen;
++c;
} while(!nTimes || c < nTimes);
return input;
}

I think I'd use std::replace_if()
A simple character-replacer (requested by OP) can be written by using standard library functions.
For an in-place version:
#include <string>
#include <algorithm>
void replace_char(std::string& in,
std::string::value_type srch,
std::string::value_type repl)
{
std::replace_if(std::begin(in), std::end(in),
[&srch](std::string::value_type v) { return v==srch; },
repl);
return;
}
and an overload that returns a copy if the input is a const string:
std::string replace_char(std::string const& in,
std::string::value_type srch,
std::string::value_type repl)
{
std::string result{ in };
replace_char(result, srch, repl);
return result;
}

This works! I used something similar for a bookstore app, where the inventory was stored as CSV (in a .dat file). Note that even when the replacement is a single character, e.g. '|', it must be written in double quotes ("|"): this overload of replace expects a string, and passing a char literal fails with an "invalid conversion" error.
#include <iostream>
#include <string>
using namespace std;
int main()
{
int count = 0; // for the number of occurrences.
// final hold variable of corrected word up to the npos=j
string holdWord = "";
// a temp var in order to replace 0 to new npos
string holdTemp = "";
// a CSV line for an entry in a book store
string holdLetter = "Big Java 7th Ed,Horstman,978-1118431115,99.85";
// j = npos
for (int j = 0; j < holdLetter.length(); j++) {
if (holdLetter[j] == ',') {
if ( count == 0 )
{
holdWord = holdLetter.replace(j, 1, " | ");
}
else {
string holdTemp1 = holdLetter.replace(j, 1, " | ");
// since replacement is three positions in length,
// must replace new replacement's 0 to npos-3, with
// the 0 to npos - 3 of the old replacement
holdTemp = holdTemp1.replace(0, j-3, holdWord, 0, j-3);
holdWord = "";
holdWord = holdTemp;
}
holdTemp = "";
count++;
}
}
cout << holdWord << endl;
return 0;
}
// result:
Big Java 7th Ed | Horstman | 978-1118431115 | 99.85
Unusually for me, I am currently using CentOS, so my compiler version is listed below (g++, which defaults to C++98):
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

This is not the only method missing from the standard library; it was intended to be low level.
This use case and many others are covered by general libraries such as:
POCO
Abseil
Boost
QtCore
QtCore with QString has my preference: it supports UTF-8 and uses fewer templates, which means understandable errors and faster compilation. It uses the "q" prefix, which makes namespaces unnecessary and simplifies headers.
Boost often generates hideous error messages and slow compile time.
POCO seems to be a reasonable compromise.

How about replacing delimiter characters with any string, using only good old C string functions?
#include <cstring>
char original[256]="First Line\nNext Line\n", dest[256]="";
const char* replace_this = "\n"; // strtok treats this as a set of delimiter characters, not as a substring
const char* with_this = "\r\n"; // this is 2 characters but could be of any length
/* get the first token */
char* token = strtok(original, replace_this);
/* walk through other tokens */
while (token != NULL) {
strcat(dest, token);
strcat(dest, with_this);
token = strtok(NULL, replace_this);
}
dest should now hold what we are looking for. Note that strtok modifies original and collapses runs of consecutive delimiters, that this loop appends with_this after every token (including the last one), and that dest must be large enough for the expanded text.

Related

How can you split string in C++ and store them in variables? [duplicate]

Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
The Boost tokenizer class can make this sort of thing quite simple:
#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
BOOST_FOREACH (const string& t, tokens) {
cout << t << "." << endl;
}
}
Updated for C++11:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int, char**)
{
string text = "token, test string";
char_separator<char> sep(", ");
tokenizer<char_separator<char>> tokens(text, sep);
for (const auto& t : tokens) {
cout << t << "." << endl;
}
}
Here's a real simple one:
#include <vector>
#include <string>
using namespace std;
vector<string> split(const char *str, char c = ' ')
{
vector<string> result;
do
{
const char *begin = str;
while(*str != c && *str)
str++;
result.push_back(string(begin, str));
} while (0 != *str++);
return result;
}
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody disputes that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
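The find/substr loop described above can be sketched like this. It is a minimal illustration, not a definitive implementation: the name split_on is hypothetical, and treating an empty delimiter as "do not split" is an assumption made here.

```cpp
#include <string>
#include <vector>

// Splits `text` on every occurrence of `delimiter`; empty fields are kept.
inline std::vector<std::string> split_on(const std::string& text, const std::string& delimiter) {
    std::vector<std::string> parts;
    if (delimiter.empty()) { // assumption: an empty delimiter means "do not split"
        parts.push_back(text);
        return parts;
    }
    std::string::size_type start = 0;
    std::string::size_type end = text.find(delimiter);
    while (end != std::string::npos) {
        parts.push_back(text.substr(start, end - start));
        start = end + delimiter.length();
        end = text.find(delimiter, start);
    }
    parts.push_back(text.substr(start)); // remainder after the last delimiter
    return parts;
}
```

Returning a std::vector<std::string> is exactly the allocation-heavy choice discussed above; it is used here only because it keeps the example short.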
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting requires regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
std::sregex_token_iterator{begin(str), end(str), re, -1},
std::sregex_token_iterator{}
);
Another quick way is to use getline. Something like:
stringstream ss("bla bla");
string s;
while (getline(ss, s, ' ')) {
cout << s << endl;
}
If you want, you can make a simple split() method returning a vector<string>, which is really useful.
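Such a split() helper might look like the sketch below, which wraps the getline loop from above. The name split_by_char is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits `text` at every `delimiter` character using std::getline.
inline std::vector<std::string> split_by_char(const std::string& text, char delimiter) {
    std::vector<std::string> parts;
    std::istringstream stream(text);
    std::string part;
    while (std::getline(stream, part, delimiter)) {
        parts.push_back(part);
    }
    return parts;
}
```

One property of this approach worth knowing: empty fields between two adjacent delimiters are kept, but a delimiter at the very end of the input does not produce a trailing empty field.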
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example:
char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
printf ("Token: %s\n", p);
p = strtok(NULL, " ");
}
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimiter spots. Correct usage might require you to make a non-const copy of the string. You can also change the list of delimiters mid-parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout, "\n"); // newline delimiter so the tokens are not run together
std::copy(results.begin(), results.end(), oit);
}
A solution using regex_token_iterators:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main()
{
string str("The quick brown fox");
regex reg("\\s+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
sregex_token_iterator end;
vector<string> vec(iter, end);
for (auto a : vec)
{
cout << a << endl;
}
}
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
void
split( vector<string> & theStringVector, /* Altered/returned value */
const string & theString,
const string & theDelimiter)
{
UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.
size_t start = 0, end = 0;
while ( end != string::npos)
{
end = theString.find( theDelimiter, start);
// If at end, use length=maxLength. Else use length=end-start.
theStringVector.push_back( theString.substr( start,
(end == string::npos) ? string::npos : end - start));
// If at end, use start=maxSize. Else use start=end+delimiter.
start = ( ( end > (string::npos - theDelimiter.size()) )
? string::npos : end + theDelimiter.size());
}
}
For example (for Doug's case),
#define SHOW(I,X) cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl
int
main()
{
vector<string> v;
split( v, "A:PEP:909:Inventory Item", ":" );
for (unsigned int i = 0; i < v.size(); i++)
SHOW( i, v[i] );
}
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
int main() {
auto s = "a,b, c ,,e,f,";
std::vector<std::string> fields;
boost::split(fields, s, boost::is_any_of(","));
for (const auto& field : fields)
std::cout << "\"" << field << "\"\n";
return 0;
}
Output:
"a"
"b"
" c "
""
"e"
"f"
""
This is a simple STL-only solution (~5 lines!) using std::string::find and std::string::find_first_not_of that handles repetitions of the delimiter (like spaces or periods, for instance), as well as leading and trailing delimiters:
#include <string>
#include <vector>
// DELIMITER is not defined in this snippet; a single character such as
// const char DELIMITER = ' '; is assumed.
void tokenize(std::string str, std::vector<std::string> &token_v){
size_t start = str.find_first_not_of(DELIMITER), end=start;
while (start != std::string::npos){
// Find next occurrence of delimiter
end = str.find(DELIMITER, start);
// Push back the token found into vector
token_v.push_back(str.substr(start, end-start));
// Skip all occurrences of the delimiter to find new start
start = str.find_first_not_of(DELIMITER, end);
}
}
I know you asked for a C++ solution, but you might consider this helpful:
Qt
#include <QString>
...
QString str = "The quick brown fox";
QStringList results = str.split(" ");
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
//Header file
class Tokenizer
{
public:
static const std::string DELIMITERS;
Tokenizer(const std::string& str);
Tokenizer(const std::string& str, const std::string& delimiters);
bool NextToken();
bool NextToken(const std::string& delimiters);
const std::string GetToken() const;
void Reset();
protected:
size_t m_offset;
const std::string m_string;
std::string m_token;
std::string m_delimiters;
};
//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");
Tokenizer::Tokenizer(const std::string& s) :
m_string(s),
m_offset(0),
m_delimiters(DELIMITERS) {}
Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
m_string(s),
m_offset(0),
m_delimiters(delimiters) {}
bool Tokenizer::NextToken()
{
return NextToken(m_delimiters);
}
bool Tokenizer::NextToken(const std::string& delimiters)
{
size_t i = m_string.find_first_not_of(delimiters, m_offset);
if (std::string::npos == i)
{
m_offset = m_string.length();
return false;
}
size_t j = m_string.find_first_of(delimiters, i);
if (std::string::npos == j)
{
m_token = m_string.substr(i);
m_offset = m_string.length();
return true;
}
m_token = m_string.substr(i, j - i);
m_offset = j;
return true;
}
Example:
std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
v.push_back(s.GetToken());
}
pystring is a small library which implements a bunch of Python's string functions, including the split method:
#include <string>
#include <vector>
#include "pystring.h"
std::vector<std::string> chunks;
pystring::split("this string", chunks);
// also can specify a separator
pystring::split("this-string", chunks, "-");
I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.
Here is an example of how to use it that I've posted elsewhere on Stack Overflow.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
Adam Pierce's answer provides a hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:
auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };
while (start != cend(str)) {
const auto finish = find(++start, cend(str), ' ');
tokens.push_back(string(start, finish));
start = finish;
}
If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:
vector<string> tokens;
for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);
If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:
strtok cannot be used on multiple strings at the same time: either a nullptr must be passed to continue tokenizing the current string, or a new char* to tokenize must be passed (some non-standard implementations, such as strtok_s, do support this, however).
For the same reason, strtok cannot be used on multiple threads simultaneously (this may, however, be implementation-defined; for example, Visual Studio's implementation is thread-safe).
Calling strtok modifies the string it operates on, so it cannot be used on const strings, const char*s, or string literals. To tokenize any of these with strtok, or to operate on a string whose contents need to be preserved, str has to be copied first and the copy tokenized.
C++20 provides us with split_view to tokenize strings in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:
istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };
The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.
If none of the above options are flexible enough for your tokenization needs, the most flexible option is a regex_token_iterator. Of course, this flexibility comes at greater expense, but again this is likely hidden in the string allocation cost. Say, for example, we want to tokenize based on non-escaped commas, also eating white-space; given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:
const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };
Check this example. It might help you..
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main ()
{
string tmps;
istringstream is ("the delimiter is the space");
// extracting in the loop condition stops cleanly at the end of the input
while (is >> tmps) {
cout << tmps << "\n";
}
return 0;
}
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
auto results = str | ranges::views::tokenize(" ",1);
... and this is lazily-evaluated. You can alternatively set a vector to this range:
auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();
This will take O(m) space and O(n) time if str has n characters making up m words.
See also the library's own tokenization example, here.
MFC/ATL has a very nice tokenizer. From MSDN:
CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;
resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
printf("Resulting token: %s\n", resToken);
resToken= str.Tokenize("% #",curPos);
};
Output
Resulting token: First
Resulting token: Second
Resulting token: Third
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
unsigned TokenizeString(const std::string& i_source,
const std::string& i_seperators,
bool i_discard_empty_tokens,
std::vector<std::string>& o_tokens)
{
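std::string::size_type prev_pos = 0;
std::string::size_type pos = 0; // size_type, so the comparison with npos also works where unsigned is narrower than size_t
unsigned number_of_tokens = 0;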
o_tokens.clear();
pos = i_source.find_first_of(i_seperators, pos);
while (pos != std::string::npos)
{
std::string token = i_source.substr(prev_pos, pos - prev_pos);
if (!i_discard_empty_tokens || token != "")
{
o_tokens.push_back(i_source.substr(prev_pos, pos - prev_pos));
number_of_tokens++;
}
pos++;
prev_pos = pos;
pos = i_source.find_first_of(i_seperators, pos);
}
if (prev_pos < i_source.length())
{
o_tokens.push_back(i_source.substr(prev_pos));
number_of_tokens++;
}
return number_of_tokens;
}
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, cameras). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
Many overly complicated suggestions here. Try this simple std::string solution:
using namespace std;
string someText = ...
string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
sepOff = someText.find(' ', sepOff);
string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
string token = someText.substr(tokenOff, tokenLen);
if (!token.empty())
/* do something with token */;
tokenOff = sepOff;
}
I thought that was what the >> operator on string streams was for:
string word; sin >> word;
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
#include <string.h> // for strchr and strlen
#include <string>
#include <vector>
/*
* want_empty_tokens==true : include empty tokens, like strsep()
* want_empty_tokens==false : exclude empty tokens, like strtok()
*/
std::vector<std::string> tokenize(const char* src,
char delim,
bool want_empty_tokens)
{
std::vector<std::string> tokens;
if (src and *src != '\0') // defensive
while( true ) {
const char* d = strchr(src, delim);
size_t len = (d)? d-src : strlen(src);
if (len or want_empty_tokens)
tokens.push_back( std::string(src, len) ); // capture token
if (d) src += len+1; else break;
}
return tokens;
}
It seems odd to me that, with all us speed-conscious nerds here on SO, no one has presented a version that uses a compile-time generated lookup table for the delimiter (example implementation further down). Using a lookup table and iterators should beat std::regex in efficiency. If you don't need to beat regex, just use it: it's standard as of C++11 and super flexible.
Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:
std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
std::smatch m{};
std::vector<std::string> ret{};
while (std::regex_search (it,end,m,e)) {
ret.emplace_back(m.str());
std::advance(it, m.position() + m.length()); //next start position = match position + match length
}
return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){ //comfort version calls flexible version
return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
auto v = split(str);
for(const auto&s:v){
std::cout << s << std::endl;
}
std::cout << "crazy version:" << std::endl;
v = split(str, std::regex{"[^e]+"}); //using e as delim shows flexibility
for(const auto&s:v){
std::cout << s << std::endl;
}
return 0;
}
If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:
template<bool...> struct BoolSequence{}; //just here to hold bools
template<char...> struct CharSequence{}; //just here to hold chars
template<typename T, char C> struct Contains; //generic
template<char First, char... Cs, char Match> //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
Contains<CharSequence<Cs...>, Match>{}; //strip first and increase index
template<char First, char... Cs> //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {};
template<char Match> //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};
template<int I, typename T, typename U>
struct MakeSequence; //generic
template<int I, bool... Bs, typename U>
struct MakeSequence<I,BoolSequence<Bs...>, U>: //not last
MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U>
struct MakeSequence<0,BoolSequence<Bs...>,U>{ //last
using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
/* could be made constexpr but not yet supported by MSVC */
static bool isDelim(const char c){
static const bool table[256] = {Bs...};
return table[static_cast<unsigned char>(c)]; // index as unsigned char: plain char may be negative
}
};
using Delims = CharSequence<'.',',',' ',':','\n'>; //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;
With that in place making a getNextToken function is easy:
template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
auto second = std::find_if(begin,end,Table{}); //find first delim or end
return std::make_pair(begin,second);
}
Using it is also easy:
int main() {
std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
auto it = std::begin(s);
auto end = std::end(s);
while(it != std::end(s)){
auto token = getNextToken(it,end);
std::cout << std::string(token.first,token.second) << std::endl;
it = token.second;
}
return 0;
}
Here is a live example: http://ideone.com/GKtkLQ
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
vector<string> get_words(string const& text, string const& separator)
{
vector<string> result;
string tmp = text;
size_t first_pos = 0;
size_t second_pos = tmp.find(separator);
while (second_pos != string::npos)
{
if (first_pos != second_pos)
{
string word = tmp.substr(first_pos, second_pos - first_pos);
result.push_back(word);
}
tmp = tmp.substr(second_pos + separator.length());
second_pos = tmp.find(separator);
}
result.push_back(tmp);
return result;
}
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
You can take advantage of boost::make_find_iterator. Something similar to this:
template<typename CH>
inline vector< basic_string<CH> > tokenize(
const basic_string<CH> &Input,
const basic_string<CH> &Delimiter,
bool remove_empty_token
) {
typedef typename basic_string<CH>::const_iterator string_iterator_t;
typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;
vector< basic_string<CH> > Result;
string_iterator_t it = Input.begin();
string_iterator_t it_end = Input.end();
for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
i != string_find_iterator_t();
++i) {
if(remove_empty_token){
if(it != i->begin())
Result.push_back(basic_string<CH>(it,i->begin()));
}
else
Result.push_back(basic_string<CH>(it,i->begin()));
it = i->end();
}
if(it != it_end)
Result.push_back(basic_string<CH>(it,it_end));
return Result;
}
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
#include <string>
#include <vector>
#include <locale>
#include <regex>
std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
std::vector<std::wstring> tokens;
std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);
std::wsregex_iterator next( string_to_tokenize.begin(),
string_to_tokenize.end(),
re,
std::regex_constants::match_not_null );
std::wsregex_iterator end;
const wchar_t single_quote = L'\'';
const wchar_t double_quote = L'\"';
while ( next != end ) {
std::wsmatch match = *next;
const std::wstring token = match.str( 0 );
next++;
if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
else
tokens.emplace_back(token);
}
return tokens;
}
I wrote a simplified (and perhaps slightly more efficient) version of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it helps.
void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{
size_t new_index = 0;
size_t old_index = 0;
while (new_index != std::string::npos)
{
new_index = source.find(delimiter, old_index);
Tokens.emplace_back(source.substr(old_index, new_index-old_index));
if (new_index != std::string::npos)
old_index = ++new_index;
}
}
If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array" data structure described in Jon Bentley's "Programming Pearls", 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can easily be extended to remove leading and trailing whitespace characters from the tokens.
Basically, one can replace the separator characters with string-terminating '\0' characters and set pointers to the tokens within the modified string. In the extreme case where the string consists only of separators, one gets string-length plus 1 empty tokens. It is practical to duplicate the string to be modified.
Header file:
#include <cassert>
#include <cstddef>
#include <cstring>
class TextLineSplitter
{
public:
TextLineSplitter( const size_t max_line_len );
~TextLineSplitter();
void SplitLine( const char *line,
const char sep_char = ','
);
inline size_t NumTokens( void ) const
{
return mNumTokens;
}
const char * GetToken( const size_t token_idx ) const
{
assert( token_idx < mNumTokens );
return mTokens[ token_idx ];
}
private:
const size_t mStorageSize;
char *mBuff;
char **mTokens;
size_t mNumTokens;
inline void ResetContent( void )
{
memset( mBuff, 0, mStorageSize );
// mark all items as empty:
memset( mTokens, 0, mStorageSize * sizeof( char* ) );
// reset counter for found items:
mNumTokens = 0L;
}
};
Implementation file:
TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
mStorageSize ( max_line_len + 1L )
{
// allocate memory
mBuff = new char [ mStorageSize ];
mTokens = new char* [ mStorageSize ];
ResetContent();
}
TextLineSplitter::~TextLineSplitter()
{
delete [] mBuff;
delete [] mTokens;
}
void TextLineSplitter::SplitLine( const char *line,
const char sep_char /* = ',' */
)
{
assert( sep_char != '\0' );
ResetContent();
// copy at most mStorageSize - 1 chars so that mBuff always stays 0-terminated
strncpy( mBuff, line, mStorageSize - 1 );
size_t idx = 0L; // running index for characters
do
{
assert( idx < mStorageSize );
const char chr = line[ idx ]; // retrieve current character
if( mTokens[ mNumTokens ] == NULL )
{
mTokens[ mNumTokens ] = &mBuff[ idx ];
} // if
if( chr == sep_char || chr == '\0' )
{ // item or line finished
// overwrite separator with a 0-terminating character:
mBuff[ idx ] = '\0';
// count-up items:
mNumTokens ++;
} // if
} while( line[ idx++ ] );
}
A scenario of usage would be:
// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
printf( "%s\n", spl.GetToken( i ) );
}
output (the second token is empty, so a blank line is printed):
Item1

Item2
Item3

Recognize string formatting Debug Assertion

I have a runtime problem with the code below.
The purpose is to "recognize" the formats (%s %d etc) within the input string.
To do this, it returns an integer that matches the data type.
Then the extracted types are manipulated/handled in other functions.
I want to clarify that my purpose isn't to write formatted types in a string (snprintf etc.) but only to recognize/extract them.
The problem is the crash of my application with error:
Debug Assertion Failed!
Program:
...ers\Alex\source\repos\TestProgram\Debug\test.exe
File: minkernel\crts\ucrt\appcrt\convert\isctype.cpp
Line: 36
Expression: c >= -1 && c <= 255
My code:
#include <iostream>
#include <string>
#include <cctype>
#include <cstring>
enum Formats
{
TYPE_INT,
TYPE_FLOAT,
TYPE_STRING,
TYPE_NUM
};
typedef struct Format
{
Formats Type;
char Name[5 + 1];
} SFormat;
SFormat FormatsInfo[TYPE_NUM] =
{
{TYPE_INT, "d"},
{TYPE_FLOAT, "f"},
{TYPE_STRING, "s"},
};
int GetFormatType(const char* formatName)
{
for (const auto& format : FormatsInfo)
{
if (strcmp(format.Name, formatName) == 0)
return format.Type;
}
return -1;
}
bool isValidFormat(const char* formatName)
{
for (const auto& format : FormatsInfo)
{
if (strcmp(format.Name, formatName) == 0)
return true;
}
return false;
}
bool isFindFormat(const char* strBufFormat, size_t stringSize, int& typeFormat)
{
bool foundFormat = false;
std::string stringFormat = "";
for (size_t pos = 0; pos < stringSize; pos++)
{
if (!isalpha(strBufFormat[pos]))
continue;
if (!isdigit(strBufFormat[pos]))
{
stringFormat += strBufFormat[pos];
if (isValidFormat(stringFormat.c_str()))
{
typeFormat = GetFormatType(stringFormat.c_str());
foundFormat = true;
}
}
}
return foundFormat;
}
int main()
{
std::string testString = "some test string with %d arguments"; // crash application
// std::string testString = "%d some test string with arguments"; // not crash application
size_t stringSize = testString.size();
char buf[1024 + 1];
memcpy(buf, testString.c_str(), stringSize);
buf[stringSize] = '\0';
for (size_t pos = 0; pos < stringSize; pos++)
{
if (buf[pos] == '%')
{
if (buf[pos + 1] == '%')
{
pos++;
continue;
}
else
{
char bufFormat[1024 + 1];
memcpy(bufFormat, buf + pos, stringSize);
bufFormat[stringSize] = '\0';
int typeFormat;
if (isFindFormat(bufFormat, stringSize, typeFormat))
{
std::cout << "type = " << typeFormat << "\n";
// ...
}
}
}
}
}
As I commented in the code, everything works with the string that starts with %d, while with the string that has %d in the middle the application crashes.
I also wanted to ask: is there a better or faster way to recognize types such as "%d", "%s", etc. within a string (not necessarily by returning an int)?
Thanks.
Let's take a look at this else clause:
char bufFormat[1024 + 1];
memcpy(bufFormat, buf + pos, stringSize);
bufFormat[stringSize] = '\0';
The variable stringSize was initialized with the size of the original format string. Let's say it's 30 in this case.
Let's say you found the %d code at offset 20. You're going to copy 30 characters, starting at offset 20, into bufFormat. That means you're copying 20 characters past the end of the original string. You could possibly read off the end of the original buf, but that doesn't happen here because buf is large. The third line sets a NUL into the buffer at position 30, again past the end of the data, but your memcpy copied the NUL from buf into bufFormat, so that's where the string in bufFormat will end.
Now bufFormat contains the string "%d arguments". Inside isFindFormat you search for the first isalpha character. Possibly you meant isalnum here? We can only reach the isdigit line if the isalpha check passes, and a character that is isalpha is never isdigit.
In any case, after isalpha passes, isdigit will definitely return false, so we enter that if block. Your code will find the right type here. But the loop doesn't terminate. Instead, it continues scanning up to stringSize characters, which is the stringSize from main, that is, the size of the original format string. But the string you're passing to isFindFormat only contains the part starting at '%'. So you're going to scan past the end of the string and read whatever is in the uninitialized part of the buffer. Those bytes can be negative when char is signed, and isalpha requires a value representable as unsigned char (or EOF), which is exactly what the assertion c >= -1 && c <= 255 checks.
There's a lot more going on here. You're mixing and matching std::string and C strings; see if you can use std::string::substr instead of copying. You can use std::string::find to find characters in a string. If you have to use C strings, use strcpy instead of memcpy followed by the addition of a NUL.
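As a rough sketch of that advice, the scan can be done directly on a std::string with find, without any intermediate buffers. The name findFormats is hypothetical, and the recognized conversion characters (d, f, s) are simply taken from the question's table:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the conversion characters ('d', 'f' or 's')
// that directly follow a '%' in text. "%%" is treated as an escaped percent.
std::vector<char> findFormats(const std::string& text) {
    std::vector<char> found;
    std::size_t pos = 0;
    while ((pos = text.find('%', pos)) != std::string::npos) {
        if (pos + 1 >= text.size()) {
            break;  // a trailing '%' has nothing after it
        }
        const char next = text[pos + 1];
        if (next == 'd' || next == 'f' || next == 's') {
            found.push_back(next);
        }
        pos += 2;  // skip the '%' and the character after it (covers "%%")
    }
    return found;
}
```

Because every index is checked against text.size(), the loop never reads past the end of the string, which is what triggered the assertion in the original code.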
You could just delegate this to a regex engine, which is built for searching through strings.
Since C++11 there is direct support for this. All you have to do is
#include <regex>
Then you can match against strings using various functions, for instance regex_match, which together with an smatch lets you find your target in just a few lines of code using only the standard library:
std::smatch sm;
std::regex_match ( testString.cbegin(), testString.cend(), sm, str_exp);
where str_exp is the regex describing what you want to find.
sm now holds every sub-match of your regex, which you can print this way:
for (std::size_t i = 0; i < sm.size(); ++i)
{
std::cout << "Match:" << sm[i] << std::endl;
}
EDIT:
To better illustrate the result you would get, I'll include a simple sample below:
// target string to be searched against
string target_string = "simple example no.%d is: %s";
// pattern to look for
regex str_exp("(%[sd])");
// match object
smatch sm;
// iteratively search your pattern on the string, excluding parts of the string already matched
cout << "My format strings extracted:" << endl;
while (regex_search(target_string, sm, str_exp))
{
std::cout << sm[0] << std::endl;
target_string = sm.suffix();
}
You can easily add any other format specifier you want by modifying the str_exp regular expression.
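If the target string should stay untouched, a variant built on std::sregex_iterator avoids reassigning it after every match. This is only a sketch: extractFormats is a made-up name, and adding f to the pattern is an assumption based on the types listed in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every "%s", "%d" or "%f" found in text,
// in order of appearance, without modifying text.
std::vector<std::string> extractFormats(const std::string& text) {
    // Building a std::regex is comparatively costly, so it is built once.
    static const std::regex pattern("%[sdf]");
    std::vector<std::string> result;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        result.push_back(it->str());
    }
    return result;
}
```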

C++ Extract number from the middle of a string

I have a vector containing strings that follow the format of text_number-number
Eg: Example_45-3
I only want the first number (45 in the example) and nothing else which I am able to do with my current code:
std::vector<std::string> imgNumStrVec;
for(size_t i = 0; i < StrVec.size(); i++){
std::vector<std::string> seglist;
std::stringstream ss(StrVec[i]);
std::string seg, seg2;
while(std::getline(ss, seg, '_')) seglist.push_back(seg);
std::stringstream ss2(seglist[1]);
std::getline(ss2, seg2, '-');
imgNumStrVec.push_back(seg2);
}
Are there more streamlined and simpler ways of doing this? and if so what are they?
I ask purely out of desire to learn how to code better as at the end of the day, the code above does successfully extract just the first number, but it seems long winded and round-about.
You can also use the built in find_first_of and find_first_not_of to find the first "numberstring" in any string.
std::string first_numberstring(std::string const & str)
{
char const* digits = "0123456789";
std::size_t const n = str.find_first_of(digits);
if (n != std::string::npos)
{
std::size_t const m = str.find_first_not_of(digits, n);
return str.substr(n, m != std::string::npos ? m-n : m);
}
return std::string();
}
This should be more efficient than Ashot Khachatryan's solution. Note the use of '_' and '-' instead of "_" and "-". And also, the starting position of the search for '-'.
inline std::string mid_num_str(const std::string& s) {
std::string::size_type p = s.find('_');
std::string::size_type pp = s.find('-', p + 2);
return s.substr(p + 1, pp - p - 1);
}
If you need a number instead of a string, like what Alexandr Lapenkov's solution has done, you may also want to try the following:
inline long mid_num(const std::string& s) {
return std::strtol(&s[s.find('_') + 1], nullptr, 10);
}
Updated for C++11
(Important note on compiler regex support: for gcc, you need version 4.9 or later. I tested this on g++ versions 4.9[1] and 9.2, using the in-browser compiler on cppreference.com.)
Thanks to user @2b-t, who found a bug in the C++11 code!
Here is the C++11 code:
#include <iostream>
#include <string>
#include <regex>
using std::cout;
using std::endl;
int main() {
std::string input = "Example_45-3";
std::string output = std::regex_replace(
input,
std::regex("[^0-9]*([0-9]+).*"),
std::string("$1")
);
cout << input << endl;
cout << output << endl;
}
boost solution that only requires C++98
Minimal implementation example that works on many strings (not just strings of the form "text_45-text"):
#include <iostream>
#include <string>
using namespace std;
#include <boost/regex.hpp>
int main() {
string input = "Example_45-3";
string output = boost::regex_replace(
input,
boost::regex("[^0-9]*([0-9]+).*"),
string("\\1")
);
cout << input << endl;
cout << output << endl;
}
console output:
Example_45-3
45
Other example strings that this would work on:
"asdfasdf 45 sdfsdf"
"X = 45, sdfsdf"
For this example I used g++ on Linux with #include <boost/regex.hpp> and -lboost_regex. You could also use the C++11 <regex> header.
Feel free to edit my solution if you have a better regex.
Commentary:
If there are no performance constraints, regex is ideal for this sort of thing because you aren't reinventing the wheel (by writing a bunch of string-parsing code, which takes time to write and test fully).
Additionally, if or when your strings become more complex or have more varied patterns, regex easily accommodates the complexity. (The question's example pattern is easy enough, but often a more complex pattern would take 10-100+ lines of code where a one-line regex would do the same.)
[1] Apparently full support for C++11 <regex> was implemented and released in g++ version 4.9.x, on Jun 26, 2015. Hat tip to SO questions #1 and #2 for figuring out that the compiler version needs to be 4.9.x.
Check this out
std::string ex = "Example_45-3";
int num;
sscanf( ex.c_str(), "%*[^_]_%d", &num );
I can think of two ways of doing it:
Use regular expressions
Use an iterator to step through the string, and copy each consecutive digit to a temporary buffer. Break when it reaches an unreasonable length or on the first non-digit after a string of consecutive digits. Then you have a string of digits that you can easily convert.
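The second option can be sketched with standard algorithms instead of a hand-written loop and a temporary buffer. The function name firstDigitRun is hypothetical, and the "unreasonable length" check mentioned above is left out for brevity:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the first run of consecutive decimal digits
// in s, or an empty string if s contains no digit.
std::string firstDigitRun(const std::string& s) {
    // Taking unsigned char keeps std::isdigit well-defined for any input byte.
    auto isDigit = [](unsigned char c) { return std::isdigit(c) != 0; };
    auto first = std::find_if(s.begin(), s.end(), isDigit);   // first digit
    auto last = std::find_if_not(first, s.end(), isDigit);    // first non-digit after it
    return std::string(first, last);
}
```

The resulting string can then be converted with std::stoi or std::stoul if a number is needed.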
std::string s = "Example_45-3";
std::size_t p1 = s.find("_");
std::size_t p2 = s.find("-");
std::string number = s.substr(p1 + 1, p2 - p1 - 1);
The 'best' way to do this in C++11 and later is probably using regular expressions, which combine high expressiveness and high performance when the test is repeated often enough.
The following code demonstrates the basics. You should #include <regex> for it to work.
// The example inputs
std::vector<std::string> inputs {
"Example_0-0", "Example_0-1", "Example_0-2", "Example_0-3", "Example_0-4",
"Example_1-0", "Example_1-1", "Example_1-2", "Example_1-3", "Example_1-4"
};
// The regular expression. A lot of the cost is incurred when building the
// std::regex object, but when it's reused a lot that cost is amortised.
std::regex imgNumRegex { "^[^_]+_([[:digit:]]+)-([[:digit:]]+)$" };
for (const auto &input: inputs){
// This will contain the match results. Parts of the regular expression
// enclosed in parentheses will be stored here, so in this case: both numbers
std::smatch matchResults;
if (!std::regex_match(input, matchResults, imgNumRegex)) {
// Handle failure to match
abort();
}
// Note that the first match is in str(1). str(0) contains the whole string
std::string theFirstNumber = matchResults.str(1);
std::string theSecondNumber = matchResults.str(2);
std::cout << "The input had numbers " << theFirstNumber;
std::cout << " and " << theSecondNumber << std::endl;
}
Using @Pixelchemist's answer and e.g. std::stoul:
bool getFirstNumber(std::string const & a_str, unsigned long & a_outVal)
{
auto pos = a_str.find_first_of("0123456789");
try
{
if (std::string::npos != pos)
{
a_outVal = std::stoul(a_str.substr(pos));
return true;
}
}
catch (...)
{
// handle conversion failure
// ...
}
return false;
}

Replace all occurrences of the search string after specific position

I'm looking for a replace-all algorithm that replaces all occurrences of a substring after a specific position. So far I have the replace_all_copy approach below. What is the most convenient way to do this without allocating any new string other than this one? Is there a convenient way to do it with boost?
#include <iostream>
#include <iterator>
#include <string>
#include <boost/algorithm/string/replace.hpp>
int main() {
std::string str = "1234 abc1234 marker 1234 1234 123 1 12 134 12341234";
const std::string marker("marker");
size_t pos = str.find(marker);
if (pos == std::string::npos) {
return 0;
}
pos += marker.length();
std::string str_out(str, 0, pos);
boost::algorithm::replace_all_copy(std::back_inserter(str_out), str.substr(pos, std::string::npos), "12", "XXXX");
std::cout << str << std::endl;
std::cout << str_out << std::endl;
}
If you want to do an in-place find and replace operation, you'll have to be aware of the performance implications. In order to do such an operation, you will likely have to read the string backwards which can lead to bad cache behavior, or do a lot of memory shuffling, which can also be bad for performance. In general, it's best to do a replace-in-copy operation since any strings you'll be operating on will likely be relatively small, and most memory caches will handle things quite easily.
If you must have an in-place find and replace algorithm, use the following code if you're just looking for a drop-in function. I benchmarked it and it's very fast.
std::string& find_replace_in_place( std::string &haystack, const std::string &needle, const std::string &replacement, size_t start = 0 ){
if( needle.empty() ) return haystack; // an empty needle would match forever
size_t position = haystack.find( needle, start );
while( position != std::string::npos ){
haystack.replace( position, needle.length(), replacement );
position = haystack.find( needle, position + replacement.length() );
}
return haystack;
}
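Applied to the question's scenario, the same loop only needs a starting offset just past the marker. The following is a minimal sketch with the marker lookup folded in; replace_after_marker is a hypothetical name, not a boost or standard function:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every occurrence of from with to, in place,
// but only in the part of text that follows the first occurrence of marker.
void replace_after_marker(std::string& text, const std::string& marker,
                          const std::string& from, const std::string& to) {
    if (from.empty()) {
        return;  // an empty search string would match at every position
    }
    std::size_t pos = text.find(marker);
    if (pos == std::string::npos) {
        return;  // no marker, nothing to do
    }
    pos += marker.length();  // start searching right after the marker
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.length(), to);
        pos += to.length();  // continue after the inserted replacement
    }
}
```

With the question's input, marker "marker", from "12" and to "XXXX", the part before and including the marker is left as it was and only the tail is rewritten.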

Using strtok with a std::string

I have a string that I would like to tokenize.
But the C strtok() function requires my string to be a char*.
How can I do this simply?
I tried:
token = strtok(str.c_str(), " ");
which fails because it turns it into a const char*, not a char*
#include <iostream>
#include <string>
#include <sstream>
int main(){
std::string myText("some-text-to-tokenize");
std::istringstream iss(myText);
std::string token;
while (std::getline(iss, token, '-'))
{
std::cout << token << std::endl;
}
return 0;
}
Or, as mentioned, use boost for more flexibility.
Duplicate the string, tokenize it, then free it.
char *dup = strdup(str.c_str());
token = strtok(dup, " ");
// ... use the tokens here: they point into dup, so finish with them before freeing it
free(dup);
If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.
If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.
And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:
void split(const string& str, const string& delim, vector<string>& parts) {
size_t start, end = 0;
while (end < str.size()) {
start = end;
while (start < str.size() && (delim.find(str[start]) != string::npos)) {
start++; // skip initial whitespace
}
end = start;
while (end < str.size() && (delim.find(str[end]) == string::npos)) {
end++; // skip to end of word
}
if (end-start != 0) { // just ignore zero-length strings.
parts.push_back(string(str, start, end-start));
}
}
}
There is a more elegant solution.
With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.
At this point many fine folks will jump and yell at the screen. But this is a fact: about two years ago the library working group decided (at the Lillehammer meeting) that, just like std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.
The other concern is whether strtok() increases the size of the string. The MSDN documentation says:
Each call to strtok modifies strToken by inserting a null character after the token returned by that call.
But this is not correct. Actually, the function replaces the first separator character after each token with \0. The size of the string does not change. If we have this string:
one-two---three--four
we will end up with
one\0two\0--three\0-four
So my solution is very simple:
std::string str("some-text-to-split");
char seps[] = "-";
char *token;
token = strtok( &str[0], seps );
while( token != NULL )
{
/* Do your thing */
token = strtok( NULL, seps );
}
Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer
With C++17, std::string gains a data() overload that returns a pointer to a modifiable buffer, so the string can be passed to strtok directly without any hacks:
#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>
int main()
{
::std::string text{"pop dop rop"};
char const * const psz_delimiter{" "};
char * psz_token{::std::strtok(text.data(), psz_delimiter)};
while(nullptr != psz_token)
{
::std::cout << psz_token << ::std::endl;
psz_token = std::strtok(nullptr, psz_delimiter);
}
return EXIT_SUCCESS;
}
output
pop
dop
rop
EDIT: the const cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().
You should not use strtok() since it modifies the tokenized string, which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.
#include <cstring>
#include <string>
#include <iostream>
int main(int ac, char **av)
{
std::string theString("hello world");
std::cout << theString << " - " << theString.size() << std::endl;
//--- this cast *only* to illustrate the effect of strtok() on std::string
char *token = strtok(const_cast<char *>(theString.c_str()), " ");
std::cout << theString << " - " << theString.size() << std::endl;
return 0;
}
After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.
>./a.out
hello world - 11
helloworld - 11
Therefore you have to resort to a native mechanism, duplication of the string, or a third-party library as previously mentioned.
I suppose the language is C, or C++...
strtok, IIRC, replaces separators with \0. That's why it cannot work on a const string.
To work around that "quickly", if the string isn't huge, you can just strdup() it. That is wise if you need to keep the string unaltered (which the const suggests...).
On the other hand, you might want to use another tokenizer, perhaps hand-rolled, that is less violent towards the given argument.
Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.
First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.
But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.
std::string data("The data I want to tokenize");
// Create a buffer of the correct length:
std::vector<char> buffer(data.size()+1);
// copy the string into the buffer
strcpy(&buffer[0],data.c_str());
// Tokenize
strtok(&buffer[0]," ");
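The snippet above stops after the first strtok call. A complete version of the same copy-then-tokenize idea might look like the sketch below; the function name tokenizeCopy and the choice to return a std::vector<std::string> are assumptions made for illustration:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a copy of data with strtok, so the original
// std::string is never modified.
std::vector<std::string> tokenizeCopy(const std::string& data, const char* delims) {
    // Writable, null-terminated copy of the input.
    std::vector<char> buffer(data.begin(), data.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);  // copy each token out before buffer goes away
    }
    return tokens;
}
```

Copying the tokens into std::string objects matters here, because the pointers strtok returns refer to the local buffer and become invalid when the function returns.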
If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.
std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
subbuffer fld = flds.next();
// do something with fld
}
// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');
Typecasting to (char*) got it working for me!
token = strtok((char *)str.c_str(), " ");
Chris's answer is probably fine when using std::string; however in case you want to use std::basic_string<char16_t>, std::getline can't be used. Here is a possible other implementation:
template <class CharT> bool tokenizestring(const std::basic_string<CharT> &input, CharT separator, typename std::basic_string<CharT>::size_type &pos, std::basic_string<CharT> &token) {
if (pos >= input.length()) {
// if input is empty, or ends with a separator, return an empty token when the end has been reached (and return an out-of-bound position so subsequent call won't do it again)
if ((pos == 0) || ((pos > 0) && (pos == input.length()) && (input[pos-1] == separator))) {
token.clear();
pos=input.length()+1;
return true;
}
return false;
}
typename std::basic_string<CharT>::size_type separatorPos=input.find(separator, pos);
if (separatorPos == std::basic_string<CharT>::npos) {
token=input.substr(pos, input.length()-pos);
pos=input.length();
} else {
token=input.substr(pos, separatorPos-pos);
pos=separatorPos+1;
}
return true;
}
Then use it like this:
std::basic_string<char16_t> s;
std::basic_string<char16_t> token;
std::basic_string<char16_t>::size_type tokenPos=0;
while (tokenizestring(s, (char16_t)' ', tokenPos, token)) {
...
}
It fails because str.c_str() returns a pointer to a constant string, but char * strtok(char * str, const char * delimiters) requires a modifiable (non-const) string. So you need to use const_cast<char *> in order to make it modifiable.
I am giving you a complete but small program to tokenize the string using C strtok() function.
#include <iostream>
#include <string>
#include <string.h>
using namespace std;
int main() {
string s="20#6 5, 3";
// strtok requires a modifiable (non-const) string, as it modifies the supplied string in order to tokenize it
char *str=const_cast< char *>(s.c_str());
char *tok;
tok=strtok(str, "#, " );
int arr[4], i=0;
while(tok!=NULL){
arr[i++]=stoi(tok);
tok=strtok(NULL, "#, " );
}
for(int i=0; i<4; i++) cout<<arr[i]<<endl;
return 0;
}
NOTE: strtok may not be suitable in all situations, as the string passed to the function gets modified by being broken into smaller strings. See the section below for a better understanding of how strtok works.
How strtok works
I added a few print statements to better show the changes happening to the string on each call to strtok and how it returns each token.
#include <iostream>
#include <string>
#include <string.h>
using namespace std;
int main() {
string s="20#6 5, 3";
char *str=const_cast< char *>(s.c_str());
char *tok;
cout<<"string: "<<s<<endl;
tok=strtok(str, "#, " );
cout<<"String: "<<s<<"\tToken: "<<tok<<endl;
while(tok!=NULL){
tok=strtok(NULL, "#, " );
cout<<"String: "<<s<<"\t\tToken: "<<(tok ? tok : "")<<endl; // streaming a null char* is undefined behaviour
}
return 0;
}
Output:
string: 20#6 5, 3
String: 206 5, 3 Token: 20
String: 2065, 3 Token: 6
String: 2065 3 Token: 5
String: 2065 3 Token: 3
String: 2065 3 Token:
strtok iterates over the string. The first call finds the first non-delimiter character (2 in this case) and marks it as the token start, then continues scanning for a delimiter, replaces it with a null character (# gets replaced in the actual string), and returns the start pointer, which points to the first character of the token (i.e., it returns the token 20, which is terminated by the null). Each subsequent call starts scanning from the next character and returns a token if one is found, else null. The subsequent calls thus return the tokens 6, 5 and 3.
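The behaviour described above can be mimicked with a short function. This is a simplified illustration only, not the actual library implementation; my_strtok is a made-up name, and like the real strtok it keeps its position in a static variable, so it is not reentrant:

```cpp
#include <cstring>

// Simplified illustration of how strtok behaves (not the real implementation).
char* my_strtok(char* str, const char* delims) {
    static char* next = nullptr;        // where the following call resumes
    if (str != nullptr) {
        next = str;                     // a non-null str starts a new scan
    }
    if (next == nullptr) {
        return nullptr;                 // nothing left from a previous scan
    }
    next += std::strspn(next, delims);  // skip leading delimiters
    if (*next == '\0') {
        next = nullptr;                 // only delimiters were left
        return nullptr;
    }
    char* token = next;                 // token start
    next += std::strcspn(next, delims); // advance to the next delimiter or the end
    if (*next != '\0') {
        *next = '\0';                   // overwrite the delimiter, terminating the token
        ++next;                         // resume after it on the next call
    } else {
        next = nullptr;                 // reached the end of the string
    }
    return token;
}
```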