Fast String tokenization in C/C++

I'm working on a C/C++ app (in Visual Studio 2010) where I need to tokenize a comma-delimited string, and I would like this to be as fast as possible. Currently I'm using strtok_s. I ran some tests of strtok_s versus sscanf, and strtok_s seemed faster (unless I wrote a terrible implementation :) ), but I was wondering if anyone could suggest a faster alternative.

For sheer runtime speed, boost.spirit.qi is an excellent candidate.

The best you can do is to make sure you go through the string only once, and build the output on the fly. Start pulling off chars into a temp buffer, and when you encounter the delimiter, save the temp buffer to the output collection, clear the temp buffer, rinse and repeat.
Here's a basic implementation that does this.
template<class C=char>
struct basic_token
{
typedef std::basic_string<C> token_string;
typedef unsigned long size_type;
token_string token_, delim_;
basic_token(const token_string& token, const token_string& delim = token_string());
};
template<class C>
basic_token<C>::basic_token(const token_string& token, const token_string& delim)
: token_(token),
delim_(delim)
{
}
typedef basic_token<char> token;
template<class Char, class Iter> void tokenize(const std::basic_string<Char>& line, const Char* delims, Iter itx)
{
typedef basic_token<Char> Token;
typedef std::basic_string<Char> TString;
for( typename TString::size_type tok_begin = 0, tok_end = line.find_first_of(delims, tok_begin); // typename is required for the dependent type
tok_begin != TString::npos; tok_end = line.find_first_of(delims, tok_begin) )
{
if( tok_end == TString::npos )
{
(*itx++) = Token(TString(&line[tok_begin]));
tok_begin = tok_end;
}
else
{
(*itx++) = Token(TString(&line[tok_begin], &line[tok_end]), TString(1, line[tok_end]));
tok_begin = tok_end + 1;
}
}
}
template<class Char, class Iter> void tokenize(const Char* line, const Char* delim, Iter itx)
{
tokenize(std::basic_string<Char>(line), delim, itx);
}
template<class Stream, class Token> Stream& operator<<(Stream& os, const Token& tok)
{
os << tok.token_ << "\t[" << tok.delim_ << "]";
return os;
}
...which you would use like this:
string raw = "35=BW|49=TEST|1346=REQ22|1355=2|1182=88500|1183=88505|10=087^";
vector<token> tokens; // token is the basic_token<char> typedef defined above
tokenize(raw, "|", back_inserter(tokens));
copy(tokens.begin(), tokens.end(), ostream_iterator<token>(cout, "\n"));
Output is:
35=BW [|]
49=TEST [|]
1346=REQ22 [|]
1355=2 [|]
1182=88500 [|]
1183=88505 [|]
10=087^ []

I would remind you that there is a risk with strtok and its ilk: you can get back a different number of tokens than you expect, because consecutive delimiters are treated as a single delimiter and empty fields are skipped.
one|two|three would yield 3 tokens
while
one|||three would yield 2.
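The difference is easy to check. Here is a minimal sketch that counts tokens both ways; the helper names count_strtok_tokens and count_fields are made up for this illustration:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Counts the tokens std::strtok produces. strtok writes into its input,
// so the function works on a private copy of the text.
std::size_t count_strtok_tokens(std::string text, const char* delims)
{
    std::size_t count = 0;
    // &text[0] is a mutable, null-terminated buffer (C++11 and later).
    for (char* tok = std::strtok(&text[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        ++count;
    }
    return count;
}

// Counts fields the way a CSV reader would: every delimiter ends a field,
// so two adjacent delimiters enclose an empty field.
std::size_t count_fields(const std::string& text, char delim)
{
    std::size_t count = 1;
    for (char c : text)
        if (c == delim)
            ++count;
    return count;
}
```

With "one|||three", count_strtok_tokens reports 2 while count_fields reports 4, so positional parsing (field 3 is always the bid, say) is only safe with strtok when no field can be empty.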

mmhmm's test doesn't use Spirit correctly; the grammar is flawed.
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/io.hpp>
#include <boost/spirit/include/qi.hpp>
/****************************strtok_r************************/
typedef struct sTokenDataC {
char *time;
char *symb;
float bid;
float ask;
int bidSize;
int askSize;
} tokenDataC;
tokenDataC parseTick( char *line, char *parseBuffer )
{
tokenDataC tokenDataOut;
tokenDataOut.time = strtok_r( line,",", &parseBuffer );
tokenDataOut.symb = strtok_r( nullptr,",", &parseBuffer );
tokenDataOut.bid = atof(strtok_r( nullptr,",", &parseBuffer ));
tokenDataOut.ask = atof(strtok_r( nullptr , ",", &parseBuffer ));
tokenDataOut.bidSize = atoi(strtok_r( nullptr,",", &parseBuffer ));
tokenDataOut.askSize = atoi(strtok_r( nullptr, ",", &parseBuffer ));
return tokenDataOut;
}
void test_strcpy_s(int iteration)
{
char *testStringC = new char[64];
char *lineBuffer = new char[64];
printf("test_strcpy_s....\n");
strcpy(testStringC,"09:30:00,TEST,13.24,15.32,10,14");
{
timeEstimate<> es;
tokenDataC tokenData2;
for(int i = 0; i < iteration; i++)
{
strcpy(lineBuffer, testStringC);//this is more realistic since this has to happen because I want to preserve the line
tokenData2 = parseTick(lineBuffer, testStringC);
//std::cout<<*tokenData2.time<<", "<<*tokenData2.symb<<",";
//std::cout<<tokenData2.bid<<", "<<tokenData2.ask<<", "<<tokenData2.bidSize<<", "<<tokenData2.askSize<<std::endl;
}
}
delete[] lineBuffer;
delete[] testStringC;
}
/****************************strtok_r************************/
/****************************spirit::qi*********************/
namespace qi = boost::spirit::qi;
struct tokenDataCPP
{
std::string time;
std::string symb;
float bid;
float ask;
int bidSize;
int askSize;
void clearTimeSymb(){
time.clear();
symb.clear();
}
};
BOOST_FUSION_ADAPT_STRUCT(
tokenDataCPP,
(std::string, time)
(std::string, symb)
(float, bid)
(float, ask)
(int, bidSize)
(int, askSize)
)
void test_spirit_qi(int iteration)
{
std::string const strs("09:30:00,TEST,13.24,15.32,10,14");
tokenDataCPP data;
auto myString = *~qi::char_(",");
auto parser = myString >> "," >> myString >> "," >> qi::float_ >> "," >> qi::float_ >> "," >> qi::int_ >> "," >> qi::int_;
{
std::cout<<("test_spirit_qi....\n");
timeEstimate<> es;
for(int i = 0; i < iteration; ++i){
qi::parse(std::begin(strs), std::end(strs), parser, data);
//std::cout<<data.time<<", "<<data.symb<<", ";
//std::cout<<data.bid<<", "<<data.ask<<", "<<data.bidSize<<", "<<data.askSize<<std::endl;
data.clearTimeSymb();
}
}
}
/****************************spirit::qi*********************/
int main()
{
int const ITERATIONS = 500 * 10000;
test_strcpy_s(ITERATIONS);
test_spirit_qi(ITERATIONS);
}
Since clang++ doesn't provide strtok_s, I used strtok_r instead.
Over 500 * 10k iterations, the times are:
test_strcpy_s : 1.40951
test_spirit_qi : 1.34277
The two times are almost the same.
Compiler: clang++ 3.2, -O2
timeEstimate is a small timing helper whose code is not reproduced here.
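Since the timing helper is missing, here is a hypothetical stand-in that would make the benchmark above compile. It is only a guess at the interface, based on how timeEstimate<> es; is used: start a clock on construction and print the elapsed time on destruction. The elapsed() member and the output format are assumptions.

```cpp
#include <chrono>
#include <iostream>

// Hypothetical stand-in for the timeEstimate helper used in the benchmark.
// It starts a clock on construction and prints the elapsed time when it
// goes out of scope. The default Duration reports seconds as a double.
template <typename Duration = std::chrono::duration<double>>
class timeEstimate
{
public:
    timeEstimate() : start_(std::chrono::steady_clock::now()) {}

    // Elapsed time since construction, expressed in Duration's units.
    typename Duration::rep elapsed() const
    {
        auto now = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<Duration>(now - start_).count();
    }

    ~timeEstimate() { std::cout << "elapsed: " << elapsed() << '\n'; }

private:
    std::chrono::steady_clock::time_point start_;
};
```

steady_clock is used rather than system_clock because it never jumps backwards, which matters for short measurements.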

This should be quite fast: it uses no temp buffers, and it keeps empty tokens too.
template <class char_t, class char_traits_t,
class char_allocator_t, class string_allocator_t>
inline void _tokenize(
const std::basic_string<char_t, char_traits_t, char_allocator_t>& _Str,
const char_t& _Tok,
std::vector<std::basic_string<char_t, char_traits_t, char_allocator_t>,
string_allocator_t>& _Tokens,
const size_t& _HintSz=10)
{
_Tokens.reserve(_HintSz);
const char_t* _Beg(&_Str[0]), *_End(&_Str[_Str.size()]);
for (const char_t* _Ptr=_Beg; _Ptr<_End; ++_Ptr)
{
if (*_Ptr == _Tok)
{
_Tokens.push_back(
std::basic_string<char_t, char_traits_t,
char_allocator_t>(_Beg, _Ptr));
_Beg = 1+_Ptr;
}
}
_Tokens.push_back(
std::basic_string<char_t, char_traits_t,
char_allocator_t>(_Beg, _End));
}

After testing and timing each suggested candidate, the result is that strtok is clearly the fastest. Although I'm not surprised, my love of testing dictated that it was worth exploring other options. [Note: the code was thrown together; edits welcome :) ]
Given:
typedef struct sTokenDataC {
char *time;
char *symb;
float bid;
float ask;
int bidSize;
int askSize;
} tokenDataC;
tokenDataC parseTick( char *line, char *parseBuffer )
{
tokenDataC tokenDataOut;
tokenDataOut.time = strtok_s( line,",", &parseBuffer );
tokenDataOut.symb = strtok_s( NULL,",", &parseBuffer );
tokenDataOut.bid = atof(strtok_s( NULL,",", &parseBuffer ));
tokenDataOut.ask = atof(strtok_s( NULL , ",", &parseBuffer ));
tokenDataOut.bidSize = atoi(strtok_s( NULL,",", &parseBuffer ));
tokenDataOut.askSize = atoi(strtok_s( NULL, ",", &parseBuffer ));
return tokenDataOut;
}
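int _tmain(int argc, _TCHAR* argv[])
{
const int ITERATIONS = 500000; // 500k, as in the timings below
char *testStringC = new char[64];
char *lineBuffer = new char[64];
char *parseBuffer = NULL; // context pointer used by strtok_s
tokenDataC tokenData2;
strcpy(testStringC,"09:30:00,TEST,13.24,15.32,10,14");
printf("Testing method2....\n");
for(int i = 0; i < ITERATIONS; i++)
{
strcpy(lineBuffer,testStringC); // copy each time, because strtok_s modifies the line and I want to preserve the original
tokenData2 = parseTick(lineBuffer,parseBuffer);
}
delete[] lineBuffer;
delete[] testStringC;
return 0;
}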
versus calling John Dibling's implementation via:
struct sTokenDataCPP
{
std::basic_string<char> time;
std::basic_string<char> symb;
float bid;
float ask;
int bidSize;
int askSize;
};
std::vector<token> tokens1; // token is the basic_token<char> typedef from John Dibling's answer
sTokenDataCPP tokenData;
printf("Testing method1....\n");
for(int i = 0; i < ITERATIONS; i++)
{
tokens1.clear();
tokenize(raw, ",", std::back_inserter(tokens1));
tokenData.time.assign(tokens1.at(0).token_);
tokenData.symb.assign(tokens1.at(1).token_);
tokenData.bid = atof(tokens1.at(2).token_.c_str()); // field order matches parseTick: bid, ask, bidSize, askSize
tokenData.ask = atof(tokens1.at(3).token_.c_str());
tokenData.bidSize = atoi(tokens1.at(4).token_.c_str());
tokenData.askSize = atoi(tokens1.at(5).token_.c_str());
}
versus a simple boost.spirit.qi implementation defining the grammar as follows:
template <typename Iterator>
struct tick_parser : grammar<Iterator, sTokenDataCPP(), boost::spirit::ascii::space_type>
{
tick_parser() : tick_parser::base_type(start)
{
my_string %= lexeme[+(boost::spirit::ascii::char_ ) ];
start %=
my_string >> ','
>> my_string >> ','
>> float_ >> ','
>> float_ >> ','
>> int_ >> ','
>> int_
;
}
rule<Iterator, std::string(), boost::spirit::ascii::space_type> my_string;
rule<Iterator, sTokenDataCPP(), boost::spirit::ascii::space_type> start;
};
with ITERATIONS set to 500k:
strtok version: 2s
john's version: 115s
boost: 172s
I can post the full code if people want it; I just didn't want to take up a huge amount of space.

Related

Separate text on char [duplicate]

How do I iterate over the words of a string composed of words separated by whitespace?
Note that I'm not interested in C string functions or that kind of character manipulation/access. I prefer elegance over efficiency. My current solution:
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main() {
string s = "Somewhere down the road";
istringstream iss(s);
do {
string subs;
iss >> subs;
cout << "Substring: " << subs << endl;
} while (iss);
}

I use this to split string by a delimiter. The first puts the results in a pre-constructed vector, the second returns a new vector.
#include <string>
#include <sstream>
#include <vector>
#include <iterator>
template <typename Out>
void split(const std::string &s, char delim, Out result) {
std::istringstream iss(s);
std::string item;
while (std::getline(iss, item, delim)) {
*result++ = item;
}
}
std::vector<std::string> split(const std::string &s, char delim) {
std::vector<std::string> elems;
split(s, delim, std::back_inserter(elems));
return elems;
}
Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:
std::vector<std::string> x = split("one:two::three", ':');
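One edge case of the getline approach is worth knowing: an empty field in the middle is kept, but an empty field after a trailing delimiter is not, because getline hits end-of-stream before extracting anything. A minimal sketch to check this (split_getline is just a renamed copy of the second overload above, so the example stands alone):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Same getline-based approach as above, collecting into a vector.
std::vector<std::string> split_getline(const std::string& s, char delim)
{
    std::vector<std::string> elems;
    std::istringstream iss(s);
    std::string item;
    // getline returns a failed stream once nothing is left to extract.
    while (std::getline(iss, item, delim))
        elems.push_back(item);
    return elems;
}
```

So "one:two::three" gives 4 items, "one:two:" gives 2 rather than 3, and an empty input gives an empty vector. If trailing empty fields matter, one of the find-based splitters is a better fit.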

For what it's worth, here's another way to extract tokens from an input string, relying only on standard library facilities. It's an example of the power and elegance behind the design of the STL.
#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>
int main() {
using namespace std;
string sentence = "And I feel fine...";
istringstream iss(sentence);
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
ostream_iterator<string>(cout, "\n"));
}
Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.
vector<string> tokens;
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
back_inserter(tokens));
... or create the vector directly:
vector<string> tokens{istream_iterator<string>{iss},
istream_iterator<string>{}};

A possible solution using Boost might be:
#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));
This approach might be even faster than the stringstream approach. And since this is a generic template function it can be used to split other types of strings (wchar, etc. or UTF-8) using all kinds of delimiters.
See the documentation for details.

#include <vector>
#include <string>
#include <sstream>
int main()
{
std::string str("Split me by whitespaces");
std::string buf; // Have a buffer string
std::stringstream ss(str); // Insert the string into a stream
std::vector<std::string> tokens; // Create vector to hold our words
while (ss >> buf)
tokens.push_back(buf);
return 0;
}

For those with whom it does not sit well to sacrifice all efficiency for code size and see "efficient" as a type of elegance, the following should hit a sweet spot (and I think the template container class is an awesomely elegant addition.):
template < class ContainerT >
void tokenize(const std::string& str, ContainerT& tokens,
const std::string& delimiters = " ", bool trimEmpty = false)
{
std::string::size_type pos, lastPos = 0, length = str.length();
using value_type = typename ContainerT::value_type;
using size_type = typename ContainerT::size_type;
while(lastPos < length + 1)
{
pos = str.find_first_of(delimiters, lastPos);
if(pos == std::string::npos)
{
pos = length;
}
if(pos != lastPos || !trimEmpty)
tokens.push_back(value_type(str.data()+lastPos,
(size_type)pos-lastPos ));
lastPos = pos + 1;
}
}
I usually choose to use std::vector<std::string> types as my second parameter (ContainerT)... but list<> is way faster than vector<> for when direct access is not needed, and you can even create your own string class and use something like std::list<subString> where subString does not do any copies for incredible speed increases.
It's more than twice as fast as the fastest tokenize on this page and almost 5 times faster than some others. Also, with the right parameter types you can eliminate all string and list copies for additional speed increases.
Additionally it does not do the (extremely inefficient) return of result, but rather it passes the tokens as a reference, thus also allowing you to build up tokens using multiple calls if you so wished.
Lastly it allows you to specify whether to trim empty tokens from the results via a last optional parameter.
All it needs is std::string... the rest are optional. It does not use streams or the boost library, but is flexible enough to be able to accept some of these foreign types naturally.
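The zero-copy idea mentioned above (a substring type that does not copy) can be sketched with C++17's std::string_view, which postdates this answer. The name tokenize_views and the use of std::string_view are assumptions made for this illustration, not part of the original code; the loop itself follows the same logic as tokenize.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Splits `str` on any character in `delimiters` without copying characters:
// each token is a std::string_view into `str`, so the underlying buffer
// must outlive the tokens.
void tokenize_views(std::string_view str,
                    std::vector<std::string_view>& tokens,
                    std::string_view delimiters = " ",
                    bool trimEmpty = false)
{
    std::size_t lastPos = 0;
    while (lastPos <= str.size())
    {
        std::size_t pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string_view::npos)
            pos = str.size();               // last token runs to the end
        if (pos != lastPos || !trimEmpty)
            tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = pos + 1;
    }
}
```

The trade-off is lifetime: the views dangle as soon as the source string is destroyed or modified, so this only suits cases where the line buffer stays alive while the tokens are in use.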

Here's another solution. It's compact and reasonably efficient:
std::vector<std::string> split(const std::string &text, char sep) {
std::vector<std::string> tokens;
std::size_t start = 0, end = 0;
while ((end = text.find(sep, start)) != std::string::npos) {
tokens.push_back(text.substr(start, end - start));
start = end + 1;
}
tokens.push_back(text.substr(start));
return tokens;
}
It can easily be templatised to handle string separators, wide strings, etc.
Note that splitting "" results in a single empty string and splitting "," (ie. sep) results in two empty strings.
It can also be easily expanded to skip empty tokens:
std::vector<std::string> split(const std::string &text, char sep) {
std::vector<std::string> tokens;
std::size_t start = 0, end = 0;
while ((end = text.find(sep, start)) != std::string::npos) {
if (end != start) {
tokens.push_back(text.substr(start, end - start));
}
start = end + 1;
}
if (start < text.size()) { // end is npos here, so test start directly to skip an empty trailing token
tokens.push_back(text.substr(start));
}
return tokens;
}
If splitting a string at multiple delimiters while skipping empty tokens is desired, this version may be used:
std::vector<std::string> split(const std::string& text, const std::string& delims)
{
std::vector<std::string> tokens;
std::size_t start = text.find_first_not_of(delims), end = 0;
while((end = text.find_first_of(delims, start)) != std::string::npos)
{
tokens.push_back(text.substr(start, end - start));
start = text.find_first_not_of(delims, end);
}
if(start != std::string::npos)
tokens.push_back(text.substr(start));
return tokens;
}
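As for templatising the first version, one possible sketch is below. The name split_generic and the handling of an empty separator are choices made for this example rather than part of the answer; it keeps empty tokens, like the first version above.

```cpp
#include <string>
#include <vector>

// Splits `text` on every occurrence of the (possibly multi-character) string
// `sep`. Works for any std::basic_string specialisation, e.g. std::wstring.
template <class CharT, class Traits, class Alloc>
std::vector<std::basic_string<CharT, Traits, Alloc>>
split_generic(const std::basic_string<CharT, Traits, Alloc>& text,
              const std::basic_string<CharT, Traits, Alloc>& sep)
{
    using String = std::basic_string<CharT, Traits, Alloc>;
    std::vector<String> tokens;
    if (sep.empty()) {            // an empty separator would never advance
        tokens.push_back(text);
        return tokens;
    }
    typename String::size_type start = 0, end = 0;
    while ((end = text.find(sep, start)) != String::npos) {
        tokens.push_back(text.substr(start, end - start));
        start = end + sep.size(); // skip the whole separator, not one char
    }
    tokens.push_back(text.substr(start));
    return tokens;
}
```

Both arguments must already be the same string type for deduction to work, so a call looks like split_generic(std::string("a--b"), std::string("--")) or the std::wstring equivalent.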

This is my favorite way to iterate through a string. You can do whatever you want per word.
string line = "a line of text to iterate through";
string word;
istringstream iss(line, istringstream::in);
while( iss >> word )
{
// Do something on `word` here...
}

This is similar to the Stack Overflow question "How do I tokenize a string in C++?". It requires the external Boost library.
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int argc, char** argv)
{
string text = "token test\tstring";
char_separator<char> sep(" \t");
tokenizer<char_separator<char>> tokens(text, sep);
for (const string& t : tokens)
{
cout << t << "." << endl;
}
}

I like the following because it puts the results into a vector, supports a string as a delimiter, and gives control over keeping empty values. The price is that it doesn't look as neat.
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;
vector<string> split(const string& s, const string& delim, const bool keep_empty = true) {
vector<string> result;
if (delim.empty()) {
result.push_back(s);
return result;
}
string::const_iterator substart = s.begin(), subend;
while (true) {
subend = search(substart, s.end(), delim.begin(), delim.end());
string temp(substart, subend);
if (keep_empty || !temp.empty()) {
result.push_back(temp);
}
if (subend == s.end()) {
break;
}
substart = subend + delim.size();
}
return result;
}
int main() {
const vector<string> words = split("So close no matter how far", " ");
copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
}
Of course, Boost has a split() that works partially like that. And, if by 'white-space', you really do mean any type of white-space, using Boost's split with is_any_of() works great.

The STL does not have such a method available already.
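However, you can either use C's strtok() function on a writable copy of the string (strtok modifies its input, so the const buffer returned by std::string::c_str() cannot be passed to it directly), or you can write your own. Here is a code sample I found after a quick Google search ("STL string split"):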
void Tokenize(const string& str,
vector<string>& tokens,
const string& delimiters = " ")
{
// Skip delimiters at beginning.
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find the first delimiter after that, i.e. the end of the first token.
string::size_type pos = str.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos)
{
// Found a token, add it to the vector.
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters. Note the "not_of"
lastPos = str.find_first_not_of(delimiters, pos);
// Find the next delimiter, i.e. the end of the next token.
pos = str.find_first_of(delimiters, lastPos);
}
}
Taken from: http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html
If you have questions about the code sample, leave a comment and I will explain.
And just because it does not implement a typedef called iterator or overload the << operator does not mean it is bad code. I use C functions quite frequently. For example, printf and scanf both are faster than std::cin and std::cout (significantly), the fopen syntax is a lot more friendly for binary types, and they also tend to produce smaller EXEs.
Don't get sold on this "Elegance over performance" deal.

Here is a split function that:
is generic
uses standard C++ (no boost)
accepts multiple delimiters
ignores empty tokens (can easily be changed)
template<typename T>
vector<T>
split(const T & str, const T & delimiters) {
vector<T> v;
typename T::size_type start = 0;
auto pos = str.find_first_of(delimiters, start);
while(pos != T::npos) {
if(pos != start) // ignore empty tokens
v.emplace_back(str, start, pos - start);
start = pos + 1;
pos = str.find_first_of(delimiters, start);
}
if(start < str.length()) // ignore trailing delimiter
v.emplace_back(str, start, str.length() - start); // add what's left of the string
return v;
}
Example usage:
vector<string> v = split<string>("Hello, there; World", ";,");
vector<wstring> v = split<wstring>(L"Hello, there; World", L";,");

I have a two-line solution to this problem:
char sep = ' ';
std::string s="1 This is an example";
for(size_t p=0, q=0; p!=s.npos; p=q)
std::cout << s.substr(p+(p!=0), (q=s.find(sep, p+1))-p-(p!=0)) << std::endl;
Then instead of printing you can put it in a vector.
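For instance, a sketch of the vector variant might look like this (the function name split_two_line is made up here; the loop is unchanged apart from push_back replacing the output):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same loop as above, but each token is appended to a vector.
// p is the position of the previous separator (or 0 at the start) and
// q the position of the next one; (p != 0) skips the separator itself.
std::vector<std::string> split_two_line(const std::string& s, char sep)
{
    std::vector<std::string> out;
    for (std::size_t p = 0, q = 0; p != s.npos; p = q)
        out.push_back(s.substr(p + (p != 0),
                               (q = s.find(sep, p + 1)) - p - (p != 0)));
    return out;
}
```

Note that the search always starts at p + 1, so a separator at index 0 is not recognised and ends up inside the first token, exactly as in the printing version.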

Yet another flexible and fast way
template<typename Operator>
void tokenize(Operator& op, const char* input, const char* delimiters) {
const char* s = input;
const char* e = s;
while (*e != 0) {
e = s;
while (*e != 0 && strchr(delimiters, *e) == 0) ++e;
if (e - s > 0) {
op(s, e - s);
}
s = e + 1;
}
}
To use it with a vector of strings (Edit: Since someone pointed out not to inherit STL classes... hrmf ;) ) :
template<class ContainerType>
class Appender {
public:
Appender(ContainerType& container) : container_(container) {;}
void operator() (const char* s, unsigned length) {
container_.push_back(std::string(s,length));
}
private:
ContainerType& container_;
};
std::vector<std::string> strVector;
Appender<std::vector<std::string> > v(strVector); // explicit template argument; before C++17 it cannot be deduced
tokenize(v, "A number of words to be tokenized", " \t");
That's it! And that's just one way to use the tokenizer; another is simply to
count words:
class WordCounter {
public:
WordCounter() : noOfWords(0) {}
void operator() (const char*, unsigned) {
++noOfWords;
}
unsigned noOfWords;
};
WordCounter wc;
tokenize(wc, "A number of words to be counted", " \t");
ASSERT( wc.noOfWords == 7 );
Limited by imagination ;)
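With C++11 or later, a lambda can play the role of the functor class. A sketch under that assumption follows; tokenize is reproduced from above so the example is self-contained, and collect_tokens is a name invented for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// The tokenizer from the answer above, reproduced unchanged in logic.
template <typename Operator>
void tokenize(Operator& op, const char* input, const char* delimiters)
{
    const char* s = input;
    const char* e = s;
    while (*e != 0) {
        e = s;
        while (*e != 0 && std::strchr(delimiters, *e) == 0) ++e;
        if (e - s > 0) {
            op(s, e - s);   // non-empty token [s, e)
        }
        s = e + 1;
    }
}

// Collects the tokens by passing a lambda as the callback.
std::vector<std::string> collect_tokens(const char* input, const char* delimiters)
{
    std::vector<std::string> out;
    auto append = [&out](const char* s, std::size_t length) {
        out.emplace_back(s, length);
    };
    tokenize(append, input, delimiters); // `append` is an lvalue, as Operator& requires
    return out;
}
```

Because tokenize takes the callback by non-const reference, the lambda has to be stored in a named variable first; passing it inline as a temporary would not compile.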

Here's a simple solution that uses only the standard regex library
#include <regex>
#include <string>
#include <vector>
std::vector<std::string> Tokenize( const std::string str, const std::regex regex )
{
using namespace std;
std::vector<string> result;
sregex_token_iterator it( str.begin(), str.end(), regex, -1 );
sregex_token_iterator reg_end;
for ( ; it != reg_end; ++it ) {
if ( !it->str().empty() ) //token could be empty:check
result.emplace_back( it->str() );
}
return result;
}
The regex argument allows checking for multiple delimiters (spaces, commas, etc.)
I usually only check to split on spaces and commas, so I also have this default function:
std::vector<std::string> TokenizeDefault( const std::string str )
{
using namespace std;
regex re( "[\\s,]+" );
return Tokenize( str, re );
}
The "[\\s,]+" checks for spaces (\\s) and commas (,).
Note, if you want to split wstring instead of string,
change all std::regex to std::wregex
change all sregex_token_iterator to wsregex_token_iterator
Note, you might also want to take the string argument by reference, depending on your compiler.
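Following the two substitutions listed above, a wide-string variant could look like the sketch below. The name TokenizeW is made up for this example, and the arguments are taken by const reference here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Wide-character counterpart of Tokenize: std::wregex and
// std::wsregex_token_iterator replace the narrow types.
std::vector<std::wstring> TokenizeW(const std::wstring& str, const std::wregex& regex)
{
    std::vector<std::wstring> result;
    // Submatch index -1 selects the text between matches, i.e. the tokens.
    std::wsregex_token_iterator it(str.begin(), str.end(), regex, -1);
    std::wsregex_token_iterator reg_end;
    for (; it != reg_end; ++it) {
        if (!it->str().empty())   // a leading delimiter yields an empty token
            result.push_back(it->str());
    }
    return result;
}
```

A call then mirrors the narrow version, for example TokenizeW(text, std::wregex(L"[\\s,]+")) for a std::wstring named text.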

Using std::stringstream as you have works perfectly fine and does exactly what you wanted. If you're just looking for a different way of doing things, though, you can use std::find()/std::find_first_of() and std::string::substr().
Here's an example:
#include <iostream>
#include <string>
int main()
{
std::string s("Somewhere down the road");
std::string::size_type prev_pos = 0, pos = 0;
while( (pos = s.find(' ', pos)) != std::string::npos )
{
std::string substring( s.substr(prev_pos, pos-prev_pos) );
std::cout << substring << '\n';
prev_pos = ++pos;
}
std::string substring( s.substr(prev_pos, pos-prev_pos) ); // Last word
std::cout << substring << '\n';
return 0;
}

If you like to use Boost, but want to use a whole string as the delimiter (instead of single characters as in most of the previously proposed solutions), you can use boost::algorithm::split_iterator.
Example code including a convenient template:
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
template<typename _OutputIterator>
inline void split(
const std::string& str,
const std::string& delim,
_OutputIterator result)
{
using namespace boost::algorithm;
typedef split_iterator<std::string::const_iterator> It;
for(It iter=make_split_iterator(str, first_finder(delim, is_equal()));
iter!=It();
++iter)
{
*(result++) = boost::copy_range<std::string>(*iter);
}
}
int main(int argc, char* argv[])
{
using namespace std;
vector<string> splitted;
split("HelloFOOworldFOO!", "FOO", back_inserter(splitted));
// or directly to console, for example
split("HelloFOOworldFOO!", "FOO", ostream_iterator<string>(cout, "\n"));
return 0;
}

Here's a regex solution that only uses the standard regex library. (I'm a little rusty, so there may be a few syntax errors, but this is at least the general idea.)
#include <regex>
#include <string>
#include <vector>
using namespace std;
vector<string> split(const string& s){
regex r ("\\w+"); // matches whole words (greedy, so no word fragments)
sregex_iterator rit ( s.begin(), s.end(), r );
sregex_iterator rend; // a default-constructed iterator marks the end
vector<string> result;
for ( ; rit != rend; ++rit )
result.push_back( rit->str() ); // copy each match into the vector
return result;
}

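There is a function named strtok; the example below uses its reentrant POSIX variant, strtok_r. Note that it modifies the buffer passed in.
#include <string.h>
#include <string>
#include <vector>
using namespace std;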
vector<string> split(char* str,const char* delim)
{
char* saveptr;
char* token = strtok_r(str,delim,&saveptr);
vector<string> result;
while(token != NULL)
{
result.push_back(token);
token = strtok_r(NULL,delim,&saveptr);
}
return result;
}

C++20 finally blesses us with a split function. Or rather, a range adapter.
#include <iostream>
#include <ranges>
#include <string_view>
namespace ranges = std::ranges;
namespace views = std::views;
using str = std::string_view;
constexpr auto view =
"Multiple words"
| views::split(' ')
| views::transform([](auto &&r) -> str {
return {
&*r.begin(),
static_cast<str::size_type>(ranges::distance(r))
};
});
auto main() -> int {
for (str &&sv : view) {
std::cout << sv << '\n';
}
}

Using std::string_view and Eric Niebler's range-v3 library:
https://wandbox.org/permlink/kW5lwRCL1pxjp2pW
#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"
#include "range/v3/algorithm.hpp"
int main() {
std::string s = "Somewhere down the range v3 library";
ranges::for_each(s
| ranges::view::split(' ')
| ranges::view::transform([](auto &&sub) {
return std::string_view(&*sub.begin(), ranges::distance(sub));
}),
[](auto s) {std::cout << "Substring: " << s << "\n";}
);
}
By using a range for loop instead of ranges::for_each algorithm:
#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"
int main()
{
std::string str = "Somewhere down the range v3 library";
for (auto s : str | ranges::view::split(' ')
| ranges::view::transform([](auto&& sub) { return std::string_view(&*sub.begin(), ranges::distance(sub)); }
))
{
std::cout << "Substring: " << s << "\n";
}
}

The stringstream can be convenient if you need to parse the string by non-space symbols:
string s = "Name:JAck; Spouse:Susan; ...";
string dummy, name, spouse;
istringstream iss(s);
getline(iss, dummy, ':');
getline(iss, name, ';');
getline(iss, dummy, ':');
getline(iss, spouse, ';');

So far I have used the one in Boost, but I needed something that doesn't depend on it, so I came up with this:
static void Split(std::vector<std::string>& lst, const std::string& input, const std::string& separators, bool remove_empty = true)
{
std::ostringstream word;
for (size_t n = 0; n < input.size(); ++n)
{
if (std::string::npos == separators.find(input[n]))
word << input[n];
else
{
if (!word.str().empty() || !remove_empty)
lst.push_back(word.str());
word.str("");
}
}
if (!word.str().empty() || !remove_empty)
lst.push_back(word.str());
}
A good point is that in separators you can pass more than one character.

Short and elegant
#include <vector>
#include <string>
using namespace std;
vector<string> split(string data, string token)
{
vector<string> output;
size_t pos = string::npos; // size_t to avoid improbable overflow
do
{
pos = data.find(token);
output.push_back(data.substr(0, pos));
if (string::npos != pos)
data = data.substr(pos + token.size());
} while (string::npos != pos);
return output;
}
It can use any string as the delimiter, and it also works with binary data (std::string supports binary data, including nulls).
Usage:
auto a = split("this!!is!!!example!string", "!!");
Output:
this
is
!example!string

I've rolled my own using strtok and used boost to split a string. The best method I have found is the C++ String Toolkit Library. It is incredibly flexible and fast.
#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>
const char *whitespace = " \t\r\n\f";
const char *whitespace_and_punctuation = " \t\r\n\f;,=";
int main()
{
{ // normal parsing of a string into a vector of strings
std::string s("Somewhere down the road");
std::vector<std::string> result;
if( strtk::parse( s, whitespace, result ) )
{
for(size_t i = 0; i < result.size(); ++i )
std::cout << result[i] << std::endl;
}
}
{ // parsing a string into a vector of floats with other separators
// besides spaces
std::string s("3.0, 3.14; 4.0");
std::vector<float> values;
if( strtk::parse( s, whitespace_and_punctuation, values ) )
{
for(size_t i = 0; i < values.size(); ++i )
std::cout << values[i] << std::endl;
}
}
{ // parsing a string into specific variables
std::string s("angle = 45; radius = 9.9");
std::string w1, w2;
float v1, v2;
if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
{
std::cout << "word " << w1 << ", value " << v1 << std::endl;
std::cout << "word " << w2 << ", value " << v2 << std::endl;
}
}
return 0;
}
The toolkit has much more flexibility than this simple example shows but its utility in parsing a string into useful elements is incredible.

I made this because I needed an easy way to split both std::string objects and C-style strings. Hopefully someone else finds it useful as well. It also doesn't rely on tokens, and you can use whole fields as delimiters, which is another feature I needed.
I'm sure improvements can be made to make it more elegant; by all means, please make them.
StringSplitter.hpp:
#include <vector>
#include <iostream>
#include <string.h>
using namespace std;
class StringSplit
{
private:
void copy_fragment(char*, char*, char*);
void copy_fragment(char*, char*, char);
bool match_fragment(char*, char*, int);
int untilnextdelim(char*, char);
int untilnextdelim(char*, char*);
void assimilate(char*, char);
void assimilate(char*, char*);
bool string_contains(char*, char*);
long calc_string_size(char*);
void copy_string(char*, char*);
public:
vector<char*> split_cstr(char);
vector<char*> split_cstr(char*);
vector<string> split_string(char);
vector<string> split_string(char*);
char* String;
bool do_string;
bool keep_empty;
vector<char*> Container;
vector<string> ContainerS;
StringSplit(char * in)
{
String = in;
}
StringSplit(string in)
{
size_t len = calc_string_size((char*)in.c_str());
String = new char[len + 1];
memset(String, 0, len + 1);
copy_string(String, (char*)in.c_str());
do_string = true;
}
~StringSplit()
{
for (int i = 0; i < Container.size(); i++)
{
if (Container[i] != NULL)
{
delete[] Container[i];
}
}
if (do_string)
{
delete[] String;
}
}
};
StringSplitter.cpp:
#include <string.h>
#include <iostream>
#include <vector>
#include "StringSplitter.hpp"
using namespace std;
void StringSplit::assimilate(char*src, char delim)
{
int until = untilnextdelim(src, delim);
if (until > 0)
{
char * temp = new char[until + 1];
memset(temp, 0, until + 1);
copy_fragment(temp, src, delim);
if (keep_empty || *temp != 0)
{
if (!do_string)
{
Container.push_back(temp);
}
else
{
string x = temp;
ContainerS.push_back(x);
}
}
else
{
delete[] temp;
}
}
}
void StringSplit::assimilate(char*src, char* delim)
{
int until = untilnextdelim(src, delim);
if (until > 0)
{
char * temp = new char[until + 1];
memset(temp, 0, until + 1);
copy_fragment(temp, src, delim);
if (keep_empty || *temp != 0)
{
if (!do_string)
{
Container.push_back(temp);
}
else
{
string x = temp;
ContainerS.push_back(x);
}
}
else
{
delete[] temp;
}
}
}
long StringSplit::calc_string_size(char* _in)
{
long i = 0;
while (*_in++)
{
i++;
}
return i;
}
bool StringSplit::string_contains(char* haystack, char* needle)
{
size_t len = calc_string_size(needle);
size_t lenh = calc_string_size(haystack);
while (lenh--)
{
if (match_fragment(haystack + lenh, needle, len))
{
return true;
}
}
return false;
}
bool StringSplit::match_fragment(char* _src, char* cmp, int len)
{
while (len--)
{
if (*(_src + len) != *(cmp + len))
{
return false;
}
}
return true;
}
int StringSplit::untilnextdelim(char* _in, char delim)
{
size_t len = calc_string_size(_in);
if (*_in == delim)
{
_in += 1;
return len - 1;
}
int c = 0;
while (*(_in + c) != delim && c < len)
{
c++;
}
return c;
}
int StringSplit::untilnextdelim(char* _in, char* delim)
{
int s = calc_string_size(delim);
int c = 1 + s;
if (!string_contains(_in, delim))
{
return calc_string_size(_in);
}
else if (match_fragment(_in, delim, s))
{
_in += s;
return calc_string_size(_in);
}
while (!match_fragment(_in + c, delim, s))
{
c++;
}
return c;
}
void StringSplit::copy_fragment(char* dest, char* src, char delim)
{
if (*src == delim)
{
src++;
}
int c = 0;
while (*(src + c) != delim && *(src + c))
{
*(dest + c) = *(src + c);
c++;
}
*(dest + c) = 0;
}
void StringSplit::copy_string(char* dest, char* src)
{
int i = 0;
while (*(src + i))
{
*(dest + i) = *(src + i);
i++;
}
}
void StringSplit::copy_fragment(char* dest, char* src, char* delim)
{
size_t len = calc_string_size(delim);
size_t lens = calc_string_size(src);
if (match_fragment(src, delim, len))
{
src += len;
lens -= len;
}
int c = 0;
while (!match_fragment(src + c, delim, len) && (c < lens))
{
*(dest + c) = *(src + c);
c++;
}
*(dest + c) = 0;
}
vector<char*> StringSplit::split_cstr(char Delimiter)
{
int i = 0;
while (*String)
{
if (*String != Delimiter && i == 0)
{
assimilate(String, Delimiter);
}
if (*String == Delimiter)
{
assimilate(String, Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return Container;
}
vector<string> StringSplit::split_string(char Delimiter)
{
do_string = true;
int i = 0;
while (*String)
{
if (*String != Delimiter && i == 0)
{
assimilate(String, Delimiter);
}
if (*String == Delimiter)
{
assimilate(String, Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return ContainerS;
}
vector<char*> StringSplit::split_cstr(char* Delimiter)
{
int i = 0;
size_t LenDelim = calc_string_size(Delimiter);
while(*String)
{
if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
{
assimilate(String, Delimiter);
}
if (match_fragment(String, Delimiter, LenDelim))
{
assimilate(String,Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return Container;
}
vector<string> StringSplit::split_string(char* Delimiter)
{
do_string = true;
int i = 0;
size_t LenDelim = calc_string_size(Delimiter);
while (*String)
{
if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
{
assimilate(String, Delimiter);
}
if (match_fragment(String, Delimiter, LenDelim))
{
assimilate(String, Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return ContainerS;
}
Examples:
int main(int argc, char*argv[])
{
StringSplit ss = "This:CUT:is:CUT:an:CUT:example:CUT:cstring";
vector<char*> Split = ss.split_cstr(":CUT:");
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
Will output:
This
is
an
example
cstring
int main(int argc, char*argv[])
{
StringSplit ss = "This:is:an:example:cstring";
vector<char*> Split = ss.split_cstr(':');
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
int main(int argc, char*argv[])
{
string mystring = "This[SPLIT]is[SPLIT]an[SPLIT]example[SPLIT]string";
StringSplit ss = mystring;
vector<string> Split = ss.split_string("[SPLIT]");
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
int main(int argc, char*argv[])
{
string mystring = "This|is|an|example|string";
StringSplit ss = mystring;
vector<string> Split = ss.split_string('|');
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
To keep empty entries (by default empties will be excluded):
StringSplit ss = mystring;
ss.keep_empty = true;
vector<string> Split = ss.split_string(":DELIM:");
The goal was to make it similar to C#'s Split() method where splitting a string is as easy as:
String[] Split =
"Hey:cut:what's:cut:your:cut:name?".Split(new[]{":cut:"}, StringSplitOptions.None);
foreach(String X in Split)
{
Console.Write(X);
}
I hope someone else can find this as useful as I do.
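For comparison, the same behaviour (multi-character delimiter, optional empty entries) can be sketched with std::string::find alone. This is a minimal sketch, not the StringSplit class above: the name split_on and its keep_empty parameter are made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the StringSplit class above):
// splits `text` on every occurrence of `delim`.
// When keep_empty is false, empty tokens are dropped.
std::vector<std::string> split_on(const std::string& text,
                                  const std::string& delim,
                                  bool keep_empty = false)
{
    std::vector<std::string> out;
    if (delim.empty()) {                 // nothing to split on
        out.push_back(text);
        return out;
    }
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = text.find(delim, start);
        // When no delimiter is left, take everything up to the end.
        const std::string token =
            (pos == std::string::npos) ? text.substr(start)
                                       : text.substr(start, pos - start);
        if (keep_empty || !token.empty())
            out.push_back(token);
        if (pos == std::string::npos)
            break;
        start = pos + delim.size();      // continue after the delimiter
    }
    return out;
}
```

Calling split_on("This:CUT:is:CUT:an", ":CUT:") should give the three tokens This, is and an. Passing true as the third argument keeps empty tokens between adjacent delimiters.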

This answer takes the string and puts it into a vector of strings. It uses the boost library.
#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

Here's another way of doing it:
void split_string(string text,vector<string>& words)
{
int i=0;
char ch;
string word;
while(ch=text[i++])
{
if (isspace(ch))
{
if (!word.empty())
{
words.push_back(word);
}
word = "";
}
else
{
word += ch;
}
}
if (!word.empty())
{
words.push_back(word);
}
}

What about this:
#include <string>
#include <vector>
using namespace std;
vector<string> split(const string& str, const char delim) {
vector<string> v;
string tmp;
// walk the string once; never dereference end()
for(string::const_iterator i = str.begin(); i != str.end(); ++i) {
if(*i != delim) {
tmp += *i;
} else {
v.push_back(tmp);
tmp.clear();
}
}
v.push_back(tmp); // the last token has no trailing delimiter
return v;
}

I like to use the boost/regex methods for this task since they provide maximum flexibility for specifying the splitting criteria.
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main() {
std::string line("A:::line::to:split");
const boost::regex re(":+"); // one or more colons
// -1 means find inverse matches aka split
boost::sregex_token_iterator tokens(line.begin(),line.end(),re,-1);
boost::sregex_token_iterator end;
for (; tokens != end; ++tokens)
std::cout << *tokens << std::endl;
}
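If Boost is not an option, the C++11 standard library offers the same token iterator. The following is a sketch under that assumption; regex_split is a made-up helper name, not a library function.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` wherever `pattern` matches.
std::vector<std::string> regex_split(const std::string& line,
                                     const std::string& pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> tokens;
    // Submatch index -1 selects the text between matches, i.e. the tokens.
    std::sregex_token_iterator it(line.begin(), line.end(), re, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}
```

With the line and pattern from the Boost example, regex_split("A:::line::to:split", ":+") should return A, line, to and split.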

Recently I had to split a camel-cased word into subwords. There are no delimiters, just upper characters.
#include <string>
#include <list>
#include <locale> // std::isupper
template<class String>
const std::list<String> split_camel_case_string(const String &s)
{
std::list<String> R;
String w;
for (typename String::const_iterator i = s.begin(); i != s.end(); ++i) {
if (std::isupper(*i, std::locale())) { // <locale> overload, works for char and wchar_t
if (w.length()) {
R.push_back(w);
w.clear();
}
}
w += *i;
}
if (w.length())
R.push_back(w);
return R;
}
For example, this splits "AQueryTrades" into "A", "Query" and "Trades". The function works with narrow and wide strings. Because it respects the current locale it splits "RaumfahrtÜberwachungsVerordnung" into "Raumfahrt", "Überwachungs" and "Verordnung".
Note that std::isupper should really be passed as a function template argument. The more generalized form of this function can then split at delimiters like ",", ";" or " " too.
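A sketch of that generalization, with the test passed in as a template argument. The names split_before and is_ascii_upper are assumptions made for this example. Note that it still splits in front of the matching character, so a delimiter such as "," would stay attached to the start of the following token and would have to be dropped separately.

```cpp
#include <list>
#include <string>

// Hypothetical generalization: starts a new token in front of every
// character for which `starts_token` returns true.
template <class String, class Pred>
std::list<String> split_before(const String& s, Pred starts_token)
{
    std::list<String> result;
    String word;
    for (typename String::const_iterator i = s.begin(); i != s.end(); ++i) {
        if (starts_token(*i) && !word.empty()) {
            result.push_back(word);   // close the token collected so far
            word.clear();
        }
        word += *i;
    }
    if (!word.empty())
        result.push_back(word);
    return result;
}

// Example predicate for ASCII upper-case letters only (an assumption;
// a locale-aware test could be passed instead).
inline bool is_ascii_upper(char c) { return c >= 'A' && c <= 'Z'; }
```

For example, split_before(std::string("AQueryTrades"), is_ascii_upper) should yield "A", "Query" and "Trades".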

Replace a set of characters in C++ with the least number of code lines

I am using the following function in c++ to replace a set of ASCII characters.
std::string debug::convertStringToEdiFormat(const char *ediBuffer) {
std::string local(ediBuffer);
std::replace(local.begin(), local.end(), '\037', ':');
std::replace(local.begin(), local.end(), '\031', '*');
std::replace(local.begin(), local.end(), '\035', '+');
std::replace(local.begin(), local.end(), '\034', '\'');
return std::string(local);
}
The problem is that it is too long. If I want to replace, say, 100 characters, it will take 100 lines of code. Is there another function that takes less code and lets me do the same?

This is what you are looking for:
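array< char, 256 > m;
// fill m with the identity mapping so unlisted characters stay unchanged
for (size_t i = 0; i < m.size(); ++i)
m[i] = static_cast<char>(i);
// override the characters to replace
m['\037'] = ':';
m['\031'] = '*';
m['\035'] = '+';
m['\034'] = '\'';
//...
string s{ "Hello world!" };
for (auto& c : s)
c = m[static_cast<unsigned char>(c)]; // char may be signed; never index with a negative value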
If all you need is change just a couple of characters, you may use std::transform:
auto my_transform = [](const char c) -> char // explicit return type: C++11 cannot deduce it for a multi-statement lambda
{
switch (c)
{
case '\037': return ':';
case '\031': return '*';
case '\035': return '+';
case '\034': return '\'';
default: return c;
}
};
std::string s{ "\037\031\035\034" };
std::transform(s.begin(), s.end(), s.begin(), my_transform);
See the live example.
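Put together, the lookup-table idea might look like the sketch below. The function names make_edi_table and to_edi_format are invented for this example; only the four mappings come from the question.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a 256-entry table that maps every byte to
// itself, then overrides the four control characters from the question.
std::array<char, 256> make_edi_table()
{
    std::array<char, 256> table;
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = static_cast<char>(i);
    table[static_cast<unsigned char>('\037')] = ':';
    table[static_cast<unsigned char>('\031')] = '*';
    table[static_cast<unsigned char>('\035')] = '+';
    table[static_cast<unsigned char>('\034')] = '\'';
    return table;
}

// Replaces every character through the table in a single pass.
std::string to_edi_format(std::string text)
{
    static const std::array<char, 256> table = make_edi_table();  // built once
    for (char& c : text)
        c = table[static_cast<unsigned char>(c)];
    return text;
}
```

Adding a replacement then costs one line in make_edi_table, and the conversion loop stays the same regardless of how many characters are mapped.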

The best approach is to extract a function or functor that does the character conversion.
Here is a fully functional functor that can perform the conversion in both directions.
class ReplaceEncoder {
public:
ReplaceEncoder() {
initArray(m_encode);
initArray(m_decode);
}
void updateEncoding(const std::string &from, const std::string &to) {
assert(from.size() == to.size());
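for (size_t i = 0; i < from.size(); ++i) {
// index with unsigned char so characters above 127 cannot give a negative index
m_encode[static_cast<unsigned char>(from[i])] = to[i];
m_decode[static_cast<unsigned char>(to[i])] = from[i];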
}
}
char encode(char ch) const {
return m_encode[static_cast<unsigned char>(ch)];
}
char decode(char ch) const {
return m_decode[static_cast<unsigned char>(ch)];
}
char operator()(char ch) const {
return encode(ch);
}
private:
void initArray(std::array<char, 0x100> &arr) {
for (size_t i = 0; i < arr.size(); ++i) {
arr[i] = static_cast<char>(i);
}
}
private:
std::array<char, 0x100> m_encode;
std::array<char, 0x100> m_decode;
};
ReplaceEncoder encrypt;
encrypt.updateEncoding("absdet", "gmbstp");
string s{ "Hello world!" };
std::transform(s.begin(), s.end(), s.begin(), encrypt);

Why not use an unordered_map and one loop?
static const std::unordered_map<char, char> a = {{'\037', ':'}, {'\031', '*'}, {'\035', '+'}};
void convertStringToEdiFormat(std::string &ediBuffer) {
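for (auto& c : ediBuffer)
{
// at() would throw for characters that are not in the map, so look them up first
auto it = a.find(c);
if (it != a.end())
{
c = it->second;
}
}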
}

How to read a csv file data into an array?

I have a CSV file formatted as below:
1,50,a,46,50,b
2, 20,s,56,30,f
3,35,b,5,67,s
...
How can I turn that to a 2D array so that I could do some calculations?
void Core::parseCSV(){
std::ifstream data("test.csv");
std::string line;
while(std::getline(data,line))
{
std::stringstream lineStream(line);
std::string cell;
while(std::getline(lineStream,cell,','))
{
//not sure how to create the 2d array here
}
}
};

Try this:
void Core::parseCSV()
{
std::ifstream data("test.csv");
std::string line;
std::vector<std::vector<std::string> > parsedCsv;
while(std::getline(data,line))
{
std::stringstream lineStream(line);
std::string cell;
std::vector<std::string> parsedRow;
while(std::getline(lineStream,cell,','))
{
parsedRow.push_back(cell);
}
parsedCsv.push_back(parsedRow);
}
};
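Since the goal is to do calculations, the cells still have to be converted from text. A possible sketch follows, assuming the table is returned instead of being kept local and that non-numeric cells (the letter columns) are simply reported as such. The names CsvTable, read_csv and cell_to_int are made up for this example.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::vector<std::string> > CsvTable;

// Hypothetical helper: reads comma-separated rows from any input stream.
CsvTable read_csv(std::istream& in)
{
    CsvTable table;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> row;
        std::stringstream lineStream(line);
        std::string cell;
        while (std::getline(lineStream, cell, ','))
            row.push_back(cell);
        table.push_back(row);
    }
    return table;
}

// Hypothetical helper: converts one cell to int. Returns false when the
// cell does not start with a number (for example the letter columns).
bool cell_to_int(const std::string& cell, int& value)
{
    std::istringstream iss(cell);
    // operator>> skips leading whitespace, so " 20" converts as well
    return !(iss >> value).fail();
}
```

With a std::ifstream opened on test.csv, read_csv(file) gives table[row][column] access, and cell_to_int(table[1][1], n) would handle the " 20" cell from the sample data despite its leading space.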

The following code works; I hope it helps. It defines an istream_iterator-like class for CSV files. It is a template, so it can read strings, ints, doubles, etc. Its constructors accept a char delimiter, so it can be used for more than strictly comma-delimited files. It also has a specialization for strings so that whitespace characters are retained.
#include <algorithm> // std::copy
#include <iostream>
#include <sstream>
#include <fstream>
#include <iterator>
#include <string>
using namespace std;
template <class T>
class csv_istream_iterator: public iterator<input_iterator_tag, T>
{
istream * _input;
char _delim;
string _value;
public:
csv_istream_iterator( char delim = ',' ): _input( 0 ), _delim( delim ) {}
csv_istream_iterator( istream & in, char delim = ',' ): _input( &in ), _delim( delim ) { ++*this; }
const T operator *() const {
istringstream ss( _value );
T value;
ss >> value;
return value;
}
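csv_istream_iterator & operator ++() {
if( !( getline( *_input, _value, _delim ) ) )
{
_input = 0; // end of input: now equal to a default-constructed iterator
}
return *this; // returning *_input here would dereference a null pointer
}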
bool operator !=( const csv_istream_iterator & rhs ) const {
return _input != rhs._input;
}
};
template <>
const string csv_istream_iterator<string>::operator *() const {
return _value;
}
int main( int argc, char * args[] )
{
{ // test for integers
ifstream fin( "data.csv" );
if( fin )
{
copy( csv_istream_iterator<int>( fin ),
csv_istream_iterator<int>(),
ostream_iterator<int>( cout, " " ) );
fin.close();
}
}
cout << endl << "----" << endl;
{ // test for strings
ifstream fin( "data.csv" );
if( fin )
{
copy( csv_istream_iterator<string>( fin ),
csv_istream_iterator<string>(),
ostream_iterator<string>( cout, "|" ) );
fin.close();
}
}
return 0;
}

I would go for something like this (untested, incomplete) and eventually refine operator >>, if you have strings instead of chars, or floats instead of ints.
struct data_t
{
int a ;
int b ;
char c ;
int d ;
int e ;
char f ;
} ;
std::istream &operator>>(std::istream &ist, data_t &data)
{
char comma ;
ist >> data.a >> comma
>> data.b >> comma
>> data.c >> comma
>> data.d >> comma
>> data.e >> comma
>> data.f
;
return ist ;
}
void Core::parseCSV(){
std::ifstream data("test.csv");
std::string line;
std::vector<data_t> datavect ;
while(std::getline(data,line))
{
data_t data ;
std::stringstream lineStream(line);
lineStream >> data ;
datavect.push_back(data) ;
}
};

Convert Comma Separated Hex (String) Array to either integer or float

The array is set up like so:
string * str = new string[11];
Where the content of the string looks like:
str[0]=AAAAAAAA,BBBBBBBB,CCCCCCCC,DDDDDDDD,EEEE,FFFFFFFF,GGGGGGGG,HHHH,IIII,JJJJ,KKKK
str[1]=AAAAAAAA,BBBBBBBB,CCCCCCCC,DDDDDDDD,EEEE,FFFFFFFF,GGGGGGGG,HHHH,IIII,JJJJ,KKKK
str[2]=AAAAAAAA,BBBBBBBB,CCCCCCCC,DDDDDDDD,EEEE,FFFFFFFF,GGGGGGGG,HHHH,IIII,JJJJ,KKKK
...
str[12]=AAAAAAAA,BBBBBBBB,CCCCCCCC,DDDDDDDD,EEEE,FFFFFFFF,GGGGGGGG,HHHH,IIII,JJJJ,KKKK
Another array looks like:
string * type = new string[11];
Where the content is:
type[0]="1";
type[1]="1";
type[2]="1";
type[3]="1";
type[4]="2";
type[5]="1";
type[6]="1";
type[7]="2";
type[8]="2";
type[9]="2";
type[10]="2";
These types correspond to each value in the string, so, for the first string:
1=float , 2=integer
AAAAAAAA would be 1; or a float
BBBBBBBB would be 1; or a float
CCCCCCCC would be 1; or a float
DDDDDDDD would be 1; or a float
EEEE would be 2; or an integer
FFFFFFFF would be 1; or a float
GGGGGGGG would be 1; or a float
HHHH would be 2; or an integer
IIII would be 2; or an integer
JJJJ would be 2; or an integer
KKKK would be 2; or an integer
In addition the single type array works for all strings in the str array.
Now for my question:
How do I use the above information to extract each individual value from the string and convert it to an integer or a float, based on the value in the type array?
BE AWARE:
Boost is not available to me
The conversion functions look like: (The other is formatted similarly except for an integer)
unsigned int BinaryParser::hexToFloat(std::string hexInput)
{
std::stringstream ss (hexInput);
unsigned int floatOutput;
ss >> hex >> floatOutput;
return reinterpret_cast<float&>(floatOutput);
}

OK, first part: extract the comma-separated strings. One way would be:
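std::vector<std::string> split( const std::string & s ){
std::vector<std::string> vec;
std::string::size_type start = 0, pos;
// search from the start of the current token; s itself is never modified
while( std::string::npos != (pos = s.find( ',', start ) ) ){
vec.push_back( s.substr( start, pos - start ) );
start = pos + 1;
}
vec.push_back( s.substr( start ) );
return vec;
}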
This depends on the input string being "well-behaved".
This converts an int from hex digits:
int convInt( std::string hexInput ){
std::istringstream iss (hexInput);
uint16_t intOutput;
iss >> std::hex >> intOutput;
return intOutput;
}
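A float cannot be read using std::hex, so we assume the eight hex digits are a float's bytes interpreted as a uint32_t.
float convFloat( const std::string & hexInput ){
std::istringstream iss (hexInput);
uint32_t intOutput = 0;
iss >> std::hex >> intOutput;
float floatOutput;
// copy the bits (needs <cstring>); casting the reference would break strict aliasing
std::memcpy( &floatOutput, &intOutput, sizeof floatOutput );
return floatOutput;
}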
For storing the results we can use:
enum TypeTag { eInt, eFloat };
class IntOrFloat {
public:
IntOrFloat( int i ) : typeTag(eInt),integer(i),floating(0) { }
IntOrFloat( float f ) : typeTag(eFloat),integer(0),floating(f) { }
virtual ~IntOrFloat(){}
int getInt() const { return integer; }
float getFloat() const { return floating; }
TypeTag getTypeTag() const { return typeTag; }
private:
TypeTag typeTag;
int integer;
float floating;
};
std::ostream& operator<< (std::ostream& os, const IntOrFloat& iof){
switch( iof.getTypeTag() ){
case eInt:
os << iof.getInt();
break;
case eFloat:
os << iof.getFloat();
break;
}
return os;
}
To convert one comma-separated string according to the type vector:
std::vector<IntOrFloat> convert( const std::vector<std::string> t, const std::string s ){
std::vector<IntOrFloat> results;
std::vector<std::string> hexes = split( s );
for( int i = 0; i < hexes.size(); i++ ){
if( t[i] == "1" ){
results.push_back( IntOrFloat( convFloat( hexes[i] ) ) );
} else {
results.push_back( IntOrFloat( convInt( hexes[i] ) ) );
}
}
return results;
}
That's it, then. I've been using vectors instead of arrays. You can easily convert, e.g.:
std::vector<std::string> fromArray( std::string strs[], int n ){
std::vector<std::string> strings;
for( int i = 0; i < n; i++ ) strings.push_back( std::string( strs[i] ) );
return strings;
}
#define fromArray(a) fromArray( a, (sizeof(a)/sizeof(a[0])) )
And here is my test program:
#define LENGTH(a) (sizeof(a)/sizeof(a[0]))
int main(){
std::string t[] = {"2","1","1","2"};
std::string s[] = {
"8000,4048f5c3,bf000000,FFFF",
"0001,42f6e979,c44271ba,7FFF",
"1234,00000000,447a0000,5678"
};
std::vector<std::string> types = fromArray( t );
std::vector<std::string> strings = fromArray( s );
for( std::vector<std::string>::iterator it = strings.begin() ; it != strings.end(); ++it ){
std::vector<IntOrFloat> results = convert( types, *it );
std::cout << "converting string " << *it << ", " << results.size() << " values:" << std::endl;
for( std::vector<IntOrFloat>::iterator iof = results.begin() ; iof != results.end(); ++iof ){
std::cout << " " << *iof << std::endl;
}
}
}

parsing a string to a structure of c-style character arrays

I have a Visual Studio 2008 C++ project where I need to parse a string to a structure of c-style character arrays. What is the most elegant/efficient way of doing this?
Here is my current (functioning) solution:
struct Foo {
char a[ MAX_A ];
char b[ MAX_B ];
char c[ MAX_C ];
char d[ MAX_D ];
};
Func( const Foo& foo );
std::string input = "abcd#efgh#ijkl#mnop";
std::vector< std::string > parsed;
boost::split( parsed, input, boost::is_any_of( "#" ) );
Foo foo = { 0 };
parsed[ 0 ].copy( foo.a, MAX_A );
parsed[ 1 ].copy( foo.b, MAX_B );
parsed[ 2 ].copy( foo.c, MAX_C );
parsed[ 3 ].copy( foo.d, MAX_D );
Func( foo );

Here is my (now tested) idea:
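#include <algorithm> // std::min
#include <cstddef> // std::ptrdiff_t
#include <iostream> // std::cout in the test code
#include <vector>
#include <string>
#include <cstring>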
#define MAX_A 40
#define MAX_B 3
#define MAX_C 40
#define MAX_D 4
struct Foo {
char a[ MAX_A ];
char b[ MAX_B ];
char c[ MAX_C ];
char d[ MAX_D ];
};
template <std::ptrdiff_t N>
const char* extractToken(const char* inIt, char (&buf)[N])
{
if (!inIt || !*inIt)
return NULL;
const char* end = strchr(inIt, '#');
if (end)
{
strncpy(buf, inIt, std::min(N, end-inIt));
return end + 1;
}
strncpy(buf, inIt, N);
return NULL;
}
int main(int argc, const char *argv[])
{
std::string input = "abcd#efgh#ijkl#mnop";
Foo foo = { 0 };
const char* cursor = input.c_str();
cursor = extractToken(cursor, foo.a);
cursor = extractToken(cursor, foo.b);
cursor = extractToken(cursor, foo.c);
cursor = extractToken(cursor, foo.d);
}
[Edit] Tests
Adding a little test code
template <std::ptrdiff_t N>
std::string display(const char (&buf)[N])
{
std::string result;
for(size_t i=0; i<N && buf[i]; ++i)
result += buf[i];
return result;
}
int main(int argc, const char *argv[])
{
std::string input = "abcd#efgh#ijkl#mnop";
Foo foo = { 0 };
const char* cursor = input.c_str();
cursor = extractToken(cursor, foo.a);
cursor = extractToken(cursor, foo.b);
cursor = extractToken(cursor, foo.c);
cursor = extractToken(cursor, foo.d);
std::cout << "foo.a: '" << display(foo.a) << "'\n";
std::cout << "foo.b: '" << display(foo.b) << "'\n";
std::cout << "foo.c: '" << display(foo.c) << "'\n";
std::cout << "foo.d: '" << display(foo.d) << "'\n";
}
Outputs
foo.a: 'abcd'
foo.b: 'efg'
foo.c: 'ijkl'
foo.d: 'mnop'
See it Live on http://ideone.com/KdAhO

What about redesigning Foo?
struct Foo {
std::array<std::string, 4> abcd;
std::string a() const { return abcd[0]; }
std::string b() const { return abcd[1]; }
std::string c() const { return abcd[2]; }
std::string d() const { return abcd[3]; }
};
boost::algorithm::split_iterator<std::string::iterator> end,
it = boost::make_split_iterator(input, boost::algorithm::first_finder("#"));
std::transform(it, end, foo.abcd.begin(),
boost::copy_range<std::string, decltype(*it)>);

Using a regex would look like this (in C++11; you can translate this to Boost or TR1 for VS2008):
// Assuming MAX_A...MAX_D are all 10 in our regex
std::cmatch res;
if(std::regex_match(input.data(),input.data()+input.size(),
res,
std::regex("([^#]{0,10})#([^#]{0,10})#([^#]{0,10})#([^#]{0,10})"))) // fields are separated by '#'
{
Foo foo = {};
std::copy(res[1].first,res[1].second,foo.a);
std::copy(res[2].first,res[2].second,foo.b);
std::copy(res[3].first,res[3].second,foo.c);
std::copy(res[4].first,res[4].second,foo.d);
}
You should probably create the pattern using a format string and the actual MAX_* variables rather than hard coding the values in the regex like I did here, and you might also want to compile the regex once and save it instead of recreating it every time.
Otherwise, this method avoids making any extra copies of the string data. The char pointers held in each submatch in res point directly into the input string's buffer, so the only copy is directly from the input string to the final foo object.
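Both suggestions (building the pattern from the MAX_* values and compiling the regex once) could be sketched as follows. The field sizes of 10 are the same assumption as above, the '#' separators are written explicitly, and the helper names field_pattern and parse_foo are made up for illustration. Each field is limited to one character less than its array so a terminating '\0' always fits.

```cpp
#include <algorithm>
#include <regex>
#include <sstream>
#include <string>

// Assumed field sizes; the question does not give the real values.
enum { MAX_A = 10, MAX_B = 10, MAX_C = 10, MAX_D = 10 };

struct Foo {
    char a[MAX_A];
    char b[MAX_B];
    char c[MAX_C];
    char d[MAX_D];
};

// Builds "([^#]{0,N-1})" for one field, leaving room for the '\0'.
std::string field_pattern(int capacity)
{
    std::ostringstream os;
    os << "([^#]{0," << (capacity - 1) << "})";
    return os.str();
}

// Hypothetical helper: returns false when the input does not consist of
// exactly four '#'-separated fields that fit into Foo.
bool parse_foo(const std::string& input, Foo& foo)
{
    // Compiled once, on first use.
    static const std::regex re(field_pattern(MAX_A) + "#" + field_pattern(MAX_B) + "#" +
                               field_pattern(MAX_C) + "#" + field_pattern(MAX_D));
    std::smatch res;
    if (!std::regex_match(input, res, re))
        return false;
    foo = Foo();  // zero every array so each field stays null-terminated
    std::copy(res[1].first, res[1].second, foo.a);
    std::copy(res[2].first, res[2].second, foo.b);
    std::copy(res[3].first, res[3].second, foo.c);
    std::copy(res[4].first, res[4].second, foo.d);
    return true;
}
```

Unlike the truncating extractToken approach, this sketch rejects input whose fields are too long or whose field count is wrong, which may or may not be the behaviour you want.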

Fast String tokenization in C/C++ - c++

I would remind you that there is a risk with strtok and its ilk that you can get back a different number of tokens than you might want. one|two|three would yield 3 tokens while one|||three would yield 2.
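The difference is easy to reproduce. The sketch below counts the tokens strtok reports and compares them with a manual split that keeps empty fields; both function names are made up for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Counts tokens the way strtok sees them: runs of delimiters collapse.
std::size_t count_strtok_tokens(const std::string& text, const char* delims)
{
    // strtok modifies its input, so work on a private, null-terminated copy.
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');
    std::size_t count = 0;
    for (char* tok = std::strtok(&buf[0], delims); tok != NULL;
         tok = std::strtok(NULL, delims))
        ++count;
    return count;
}

// Splits on a single delimiter and keeps empty fields.
std::vector<std::string> split_keep_empty(const std::string& text, char delim)
{
    std::vector<std::string> out;
    std::string::size_type start = 0, pos;
    while ((pos = text.find(delim, start)) != std::string::npos) {
        out.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    out.push_back(text.substr(start));  // last field, possibly empty
    return out;
}
```

For "one|||three", count_strtok_tokens reports 2 tokens while split_keep_empty returns 4 fields, two of them empty. That matters when field positions carry meaning, as in CSV data.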
