I need to create a string parser in C++. I tried using:
vector<string> Tokenize(const string& strInput, const string& strDelims)
{
    vector<string> vS;
    string strOne = strInput;
    string delimiters = strDelims;
    int startpos = 0;
    int pos = strOne.find_first_of(delimiters, startpos);
    while (string::npos != pos || string::npos != startpos)
    {
        if (strOne.substr(startpos, pos - startpos) != "")
            vS.push_back(strOne.substr(startpos, pos - startpos));
        // if delimiter is a new line (\n) then add new line
        if (strOne.substr(pos, 1) == "\n")
            vS.push_back("\\n");
        // else if the delimiter is not a space
        else if (strOne.substr(pos, 1) != " ")
            vS.push_back(strOne.substr(pos, 1));
        if (string::npos == strOne.find_first_not_of(delimiters, pos))
            startpos = strOne.find_first_not_of(delimiters, pos);
        else
            startpos = pos + 1;
        pos = strOne.find_first_of(delimiters, startpos);
    }
    return vS;
}
This works for 2X+7cos(3Y)
(called as Tokenize("2X+7cos(3Y)", "+-/^() \t");)
but it gives a runtime error for 2X.
I need a non-Boost solution.
I tried using the C++ String Toolkit (StrTk) tokenizer:
std::vector<std::string> results;
strtk::split(delimiter, source,
             strtk::range_to_type_back_inserter(results),
             strtk::tokenize_options::include_all_delimiters);
return results;
but it doesn't give the token and the delimiter as separate strings.
For example, if I give the input as 2X+3Y,
the output vector contains
2X+
3Y
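I suppose I could post-process each chunk and peel a trailing delimiter off into its own token, roughly like the hypothetical splitChunks helper below (it assumes each chunk ends in at most one delimiter character), but I would rather have the tokenizer produce the tokens directly.
#include <string>
#include <vector>

// hypothetical post-processing helper: turns chunks such as "2X+" into "2X" and "+"
std::vector<std::string> splitChunks(const std::vector<std::string>& chunks,
                                     const std::string& delims)
{
    std::vector<std::string> tokens;
    for (const std::string& c : chunks)
    {
        if (!c.empty() && delims.find(c.back()) != std::string::npos)
        {
            if (c.size() > 1)
                tokens.push_back(c.substr(0, c.size() - 1)); // the token itself
            tokens.push_back(std::string(1, c.back()));      // the trailing delimiter
        }
        else
        {
            tokens.push_back(c); // no trailing delimiter (e.g. the last chunk)
        }
    }
    return tokens;
}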
What's probably happening is that the loop keeps running after pos has already reached npos, and the next calls that use pos blow up, e.g. around here:
lastPos = str.find_first_not_of(delimiters, pos);
Just add breaks to your loop instead of relying on the while clause to break out of it:
if (pos == string::npos)
    break;
lastPos = str.find_first_not_of(delimiters, pos);
if (lastPos == string::npos)
    break;
pos = str.find_first_of(delimiters, lastPos);
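Applied to the Tokenize function from the question, that could look something like the sketch below. It keeps the original behaviour for the inputs mentioned above, uses string::size_type instead of int, and exits via explicit breaks; treat it as untested beyond those inputs.
#include <string>
#include <vector>
using namespace std;

vector<string> Tokenize(const string& strInput, const string& strDelims)
{
    vector<string> vS;
    string::size_type startpos = 0;
    string::size_type pos = strInput.find_first_of(strDelims, startpos);

    while (true)
    {
        // token before the delimiter (skip empty pieces between adjacent delimiters)
        if (strInput.substr(startpos, pos - startpos) != "")
            vS.push_back(strInput.substr(startpos, pos - startpos));

        if (pos == string::npos)   // no delimiter left: stop before pos gets used
            break;

        // keep the delimiter: a newline becomes "\n", spaces are dropped
        if (strInput[pos] == '\n')
            vS.push_back("\\n");
        else if (strInput[pos] != ' ')
            vS.push_back(strInput.substr(pos, 1));

        startpos = pos + 1;
        pos = strInput.find_first_of(strDelims, startpos);
    }
    return vS;
}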
The loop exit condition is broken:
while (string::npos != pos || string::npos != startpos)
It lets the body run with pos = npos as long as startpos is not npos. For the input 2X the very first find_first_of already returns npos while startpos is still 0, so the loop is entered anyway.
In that case
strOne.substr(startpos, pos - startpos)
survives, because substr clamps its count argument to the end of the string, but
strOne.substr(pos, 1) == "\n"
becomes
strOne.substr(npos, 1) == "\n"
which throws std::out_of_range, since substr's position argument may not be greater than the string's size. So does
strOne.substr(pos, 1) != " "
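You can see the throw in isolation with a minimal sketch (same delimiter set as in the question):
#include <iostream>
#include <stdexcept>
#include <string>

int main()
{
    std::string strOne = "2X";
    // "2X" contains no delimiter, so find_first_of returns npos
    std::string::size_type pos = strOne.find_first_of("+-/^() \t");
    try
    {
        std::string piece = strOne.substr(pos, 1); // substr(npos, 1) throws
        std::cout << piece << '\n';                // never reached
    }
    catch (const std::out_of_range& e)
    {
        std::cout << "out_of_range: " << e.what() << '\n';
    }
}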
Sadly I'm out of time and can't solve this right now, but QuestionC's got the right idea. Better filtering. Something along the lines of:
if (string::npos != pos)
{
    if (strOne.substr(pos, 1) == "\n") // can possibly simplify this with strOne[pos] == '\n'
        vS.push_back("\\n");
    // else if the delimiter is not a space
    else if (strOne[pos] != ' ')
        vS.push_back(strOne.substr(pos, 1));
}
Would be great if you could share some info on your environment. Your program ran fine with an input value of 2X on my Fedora 20 using g++.
I created a little function that splits a string into substrings (which are stored in a vector), and it allows you to set which characters you want to treat as whitespace. Normal whitespace will still be treated as whitespace, so you don't have to define that. Actually, all it does is turn the characters you defined as whitespace into actual whitespace (the space char ' '). Then it runs the result through a stringstream to separate the substrings and store them in a vector. This may not be what you need for this particular problem, but maybe it can give you some ideas.
// split a string into its whitespace-separated substrings and store
// each substring in a vector<string>. Whitespace can be defined in argument
// w as a string (e.g. ".;,?-'")
vector<string> split(const string& s, const string& w)
{
    string temp{ s };
    // go through each char in temp (a copy of s)
    for (char& ch : temp) {
        // check if any characters in temp are whitespace defined in w
        for (char white : w) {
            if (ch == white)
                ch = ' '; // if so, replace them with a space char (' ')
        }
    }
    vector<string> substrings;
    stringstream ss{ temp };
    for (string buffer; ss >> buffer;) {
        substrings.push_back(buffer);
    }
    return substrings;
}
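For example, it could be called like this (assuming the split function above is in scope):
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    // treat ',' and ';' as whitespace in addition to real whitespace
    for (const string& part : split("one,two;;three four", ",;"))
        cout << part << '\n';   // prints: one two three four (one per line)
}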
Related
I have a function that tokenizes a string into a vector and returns the tokens without the delimiters. But I want to return the delimiters as well.
Desired output:
tokenize("<ab><>cd<", "<>")
should display: "<", "ab", ">", "<", ">", "cd", "<"
Here's my function:
vector<string> tokenize1(const string& s, const string& delim) {
    vector<string> tokens;
    string::size_type lastPos = s.find_first_not_of(delim, 0);
    string::size_type pos = s.find_first_of(delim, lastPos);
    while (string::npos != pos || string::npos != lastPos) {
        tokens.push_back(s.substr(lastPos, pos - lastPos));
        lastPos = s.find_first_not_of(delim, pos);
        pos = s.find_first_of(delim, lastPos);
    }
    return tokens;
}
I'm not quite sure why you'd want to do this, but you've got almost all the code to do it already. Here's a minor modification that alternates between pushing individual delimiter characters into the token vector (it would be simpler still if you were happy to return delimiter chunks, e.g. "><>" instead of "<", ">", "<") and adding whole non-delimiter chunks to it.
std::vector<std::string> tokenize2(const std::string& s, const std::string& delim)
{
    std::vector<std::string> tokens;
    auto nextDelimiter = s.find_first_of(delim, 0);
    auto nextNonDelimiter = s.find_first_not_of(delim, 0);
    while (std::string::npos != nextDelimiter || std::string::npos != nextNonDelimiter)
    {
        if (nextNonDelimiter > nextDelimiter)
        {
            for (auto d = nextDelimiter; d < nextNonDelimiter && d < s.size(); d++)
                tokens.push_back(s.substr(d, 1));
            nextDelimiter = s.find_first_of(delim, nextNonDelimiter);
        }
        else
        {
            tokens.push_back(s.substr(nextNonDelimiter, nextDelimiter - nextNonDelimiter));
            nextNonDelimiter = s.find_first_not_of(delim, nextDelimiter);
        }
    }
    return tokens;
}
Note the use of auto (because we're living in the future now) and std:: (because using namespace std; is considered bad practice, for good reason).
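A quick check against the output you asked for (assuming tokenize2 above is in scope):
#include <iostream>
#include <string>
#include <vector>

int main()
{
    for (const auto& t : tokenize2("<ab><>cd<", "<>"))
        std::cout << '"' << t << "\" ";
    std::cout << '\n';
    // prints: "<" "ab" ">" "<" ">" "cd" "<"
}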
I have this string: System->ONDRASHEK: Nick aaasssddd není v žádné místnosti and I need aaasssddd as the output. The output is not the same each time, so it has to be extracted from between two whitespaces. I tried substr and split, but my knowledge of C++ is very poor.
I found this code:
#include <string>
#include <iostream>
int main()
{
    const std::string str = "System->ONDRASHEK: Nick aaasssddd není v žádné místnosti";
    size_t pos = str.find(" ");
    if (pos == std::string::npos)
        return -1;
    pos = str.find(" ", pos + 1);
    if (pos == std::string::npos)
        return -1;
    std::cout << str.substr(pos, std::string::npos);
}
But it is not what I need.
I assume you want the third word from the given string.
You have found the second space, but your output is the sub-string from the second space to the end of the string.
Instead, you need to find the third space, and output the sub-string between the two spaces.
So here is the modification.
#include <string>
#include <iostream>
int main()
{
    const std::string str = "System->ONDRASHEK: Nick aaasssddd není v žádné místnosti";
    size_t pos = str.find(" ");
    size_t start;
    size_t end;
    if (pos == std::string::npos)
        return -1;
    pos = str.find(" ", pos + 1);
    if (pos == std::string::npos)
        return -1;
    start = pos + 1;
    pos = str.find(" ", pos + 1);
    if (pos == std::string::npos)
        return -1;
    end = pos;
    std::cout << str.substr(start, end - start) << std::endl;
}
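If it is always the third whitespace-separated word you are after, an alternative sketch is to let a stream do the splitting (just another option, not a fix of the code above):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    const std::string str = "System->ONDRASHEK: Nick aaasssddd není v žádné místnosti";
    std::istringstream iss(str);
    std::string word;
    // read three whitespace-separated words; the third one is the wanted nick
    for (int i = 0; i < 3 && (iss >> word); ++i)
        ;
    std::cout << word << '\n';   // prints: aaasssddd
}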
Please elaborate your question. You need the substring between two whitespaces? If I am right, find the first whitespace and then print the string until you find another whitespace. You can iterate over the characters for that.
I am trying to split a string and put the pieces into a vector;
however, I also want to keep an empty token whenever there are consecutive delimiters.
For example:
string mystring = "::aa;;bb;cc;;c";
I would like to tokenize this string on the : and ; delimiters,
but between consecutive delimiters such as :: and ;;
I would like to push an empty string into my vector;
so my desired output for this string is:
"" (empty)
aa
"" (empty)
bb
cc
"" (empty)
c
Also, my requirement is not to use the Boost library.
If anyone could lend me an idea, thanks.
Here is code that tokenizes a string but does not include the empty tokens:
void Tokenize(const string& str, vector<string>& tokens, const string& delimiters)
{
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter (end of the first token).
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of".
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter.
        pos = str.find_first_of(delimiters, lastPos);
    }
}
You can make your algorithm work with some simple changes. First, don't skip delimiters at the beginning, then instead of skipping delimiters in the middle of the string, just increment the position by one. Also, your npos check should ensure that both positions are not npos so it should be && instead of ||.
void Tokenize(const string& str, vector<string>& tokens, const string& delimiters)
{
    // Start at the beginning
    string::size_type lastPos = 0;
    // Find position of the first delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    // While we still have string to read
    while (string::npos != pos && string::npos != lastPos)
    {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Look at the next token instead of skipping delimiters
        lastPos = pos + 1;
        // Find the position of the next delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
    // Push the last token
    tokens.push_back(str.substr(lastPos, pos - lastPos));
}
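A quick check with your example string (assuming the Tokenize above is in scope). Note that because the string starts with two delimiters you actually get two leading empty tokens, not the single one in your desired list:
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    vector<string> tokens;
    Tokenize("::aa;;bb;cc;;c", tokens, ":;");
    for (const string& t : tokens)
        cout << (t.empty() ? "\"\" (empty)" : t) << '\n';
    // prints: "" (empty), "" (empty), aa, "" (empty), bb, cc, "" (empty), c
}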
I have a version using iterators:
std::vector<std::string> split_from(const std::string& s
    , const std::string& d, unsigned r = 20)
{
    std::vector<std::string> v;
    v.reserve(r);
    auto pos = s.begin();
    auto end = pos;
    while (end != s.end())
    {
        end = std::find_first_of(pos, s.end(), d.begin(), d.end());
        v.emplace_back(pos, end);
        if (end == s.end())   // don't step past the end iterator
            break;
        pos = end + 1;
    }
    return v;
}
Using your interface:
void Tokenize(const std::string& s, std::vector<std::string>& tokens
    , const std::string& delims)
{
    auto pos = s.begin();
    auto end = pos;
    while (end != s.end())
    {
        end = std::find_first_of(pos, s.end(), delims.begin(), delims.end());
        tokens.emplace_back(pos, end);
        if (end == s.end())   // don't step past the end iterator
            break;
        pos = end + 1;
    }
}
I'm tokenizing with the following, but I'm unsure how to include the delimiters with it.
void Tokenize(const string str, vector<string>& tokens, const string& delimiters)
{
    int startpos = 0;
    int pos = str.find_first_of(delimiters, startpos);
    string strTemp;
    while (string::npos != pos || string::npos != startpos)
    {
        strTemp = str.substr(startpos, pos - startpos);
        tokens.push_back(strTemp.substr(0, strTemp.length()));
        startpos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, startpos);
    }
}
The C++ String Toolkit Library (StrTk) has the following solution:
std::string str = "abc,123 xyz";
std::vector<std::string> token_list;
strtk::split(";., ",
             str,
             strtk::range_to_type_back_inserter(token_list),
             strtk::include_delimiters);
It should result in token_list having the following elements:
Token0 = "abc,"
Token1 = "123 "
Token2 = "xyz"
More examples can be found Here
I know this is a little sloppy, but this is what I ended up with. I did not want to use Boost since this is a school assignment and my instructor wanted me to use find_first_of to accomplish this.
Thanks for everyone's help.
vector<string> Tokenize(const string& strInput, const string& strDelims)
{
    vector<string> vS;
    string strOne = strInput;
    string delimiters = strDelims;
    int startpos = 0;
    int pos = strOne.find_first_of(delimiters, startpos);
    while (string::npos != pos || string::npos != startpos)
    {
        if (strOne.substr(startpos, pos - startpos) != "")
            vS.push_back(strOne.substr(startpos, pos - startpos));
        // if delimiter is a new line (\n) then add a new line
        if (strOne.substr(pos, 1) == "\n")
            vS.push_back("\\n");
        // else if the delimiter is not a space
        else if (strOne.substr(pos, 1) != " ")
            vS.push_back(strOne.substr(pos, 1));
        if (string::npos == strOne.find_first_not_of(delimiters, pos))
            startpos = strOne.find_first_not_of(delimiters, pos);
        else
            startpos = pos + 1;
        pos = strOne.find_first_of(delimiters, startpos);
    }
    return vS;
}
I can't really follow your code; could you post a working program?
Anyway, this is a simple tokenizer, without testing edge cases:
#include <iostream>
#include <string>
#include <vector>
using namespace std;

void tokenize(vector<string>& tokens, const string& text, const string& del)
{
    string::size_type startpos = 0,
                      currentpos = text.find(del, startpos);
    do
    {
        tokens.push_back(text.substr(startpos, currentpos - startpos + del.size()));
        startpos = currentpos + del.size();
        currentpos = text.find(del, startpos);
    } while (currentpos != string::npos);
    tokens.push_back(text.substr(startpos, currentpos - startpos + del.size()));
}
Example input, delimiter = $$:
Hello$$Stack$$Over$$$Flow$$$$!
Tokens:
Hello$$
Stack$$
Over$$
$Flow$$
$$
!
Note: I would never use a tokenizer I wrote without testing it! Please use boost::tokenizer!
If the delimiters are characters and not strings, then you can use strtok.
It depends on whether you want the preceding delimiters, the following delimiters, or both, and what you want to do with strings at the beginning and end of the string that may not have delimiters before/after them.
I'm going to assume you want each word, with its preceding and following delimiters, but NOT any strings of delimiters by themselves (e.g. if there's a delimiter following the last string).
template <class iter>
void tokenize(std::string const &str, std::string const &delims, iter out) {
    std::string::size_type pos = 0;
    do {
        std::string::size_type beg_word = str.find_first_not_of(delims, pos);
        if (beg_word == std::string::npos)
            break;
        std::string::size_type end_word = str.find_first_of(delims, beg_word);
        std::string::size_type beg_next_word = str.find_first_not_of(delims, end_word);
        *out++ = std::string(str, pos, beg_next_word - pos);
        pos = end_word;
    } while (pos != std::string::npos);
}
For the moment, I've written it more like an STL algorithm, taking an iterator for its output instead of assuming it's always pushing onto a collection. Since it depends (for the moment) on the input being a string, it doesn't use iterators for the input.
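Usage would look something like this (a sketch that assumes the template above is in scope):
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;
    tokenize("2X+7cos(3Y)", "+-/^() \t", std::back_inserter(words));
    for (const std::string& w : words)
        std::cout << w << '\n';
    // prints: 2X+   +7cos(   (3Y)
    // (each word keeps both its preceding and its following delimiters)
}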
Well, is there a built-in way to split a string in C++? By string I mean std::string.
Here's a Perl-style split function I use:
void split(const string& str, const string& delimiters, vector<string>& tokens)
{
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter (end of the first token).
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of".
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter.
        pos = str.find_first_of(delimiters, lastPos);
    }
}
There's no built-in way to split a string in C++, but Boost provides the String Algorithms library to do all sorts of string manipulation, including string splitting.
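For example, roughly like this with Boost String Algorithms (a sketch; check the Boost documentation for your version):
#include <boost/algorithm/string.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string str = "abc,123 xyz";
    std::vector<std::string> parts;
    // every character given to is_any_of() is treated as a delimiter
    boost::split(parts, str, boost::is_any_of(", "));
    for (const std::string& p : parts)
        std::cout << p << '\n';   // prints: abc 123 xyz (one per line)
}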
Yup, stringstream.
std::istringstream oss(std::string("This is a test string"));
std::string word;
while (oss >> word) {
    std::cout << "[" << word << "] ";
}
STL strings
You can use string iterators to do your dirty work.
std::string str = "hello world";
std::string::const_iterator pos = std::find(str.begin(), str.end(), ' '); // Split at ' ' (needs <algorithm>).
std::string left(str.begin(), pos);
std::string right(pos + 1, str.end());
// Echoes "hello|world".
std::cout << left << "|" << right << std::endl;
void split(string StringToSplit, string Separators)
{
    size_t EndPart1 = StringToSplit.find_first_of(Separators);
    string Part1 = StringToSplit.substr(0, EndPart1);
    string Part2 = StringToSplit.substr(EndPart1 + 1);
}
The answer is no. You have to break them up using one of the library functions.
Something I use:
std::vector<std::string> parse(std::string l, char delim)
{
    std::replace(l.begin(), l.end(), delim, ' ');
    std::istringstream stm(l);
    std::vector<std::string> tokens;
    for (;;) {
        std::string word;
        if (!(stm >> word)) break;
        tokens.push_back(word);
    }
    return tokens;
}
You can also take a look at the basic_streambuf<T>::underflow() method and write a filter.
What the heck... Here's my version...
Note: splitting ("XZaaaXZ", "XZ") will give you 3 strings; 2 of those strings will be empty and won't be added to theStringVector if theIncludeEmptyStrings is false.
The delimiter is not a set of characters here; it matches that exact string.
inline void
StringSplit( vector<string> * theStringVector, /* Altered/returned value */
             const string   & theString,
             const string   & theDelimiter,
             bool             theIncludeEmptyStrings = false )
{
    UASSERT( theStringVector, !=, (vector<string> *) NULL );
    UASSERT( theDelimiter.size(), >, 0 );

    size_t start = 0, end = 0, length = 0;
    while ( end != string::npos )
    {
        end = theString.find( theDelimiter, start );
        // If at end, use length=maxLength. Else use length=end-start.
        length = (end == string::npos) ? string::npos : end - start;
        if ( theIncludeEmptyStrings
             || ( ( length > 0 ) /* At end, end == length == string::npos */
                  && ( start < theString.size() ) ) )
            theStringVector->push_back( theString.substr( start, length ) );
        // If at end, use start=maxSize. Else use start=end+delimiter.
        start = ( ( end > (string::npos - theDelimiter.size()) )
                  ? string::npos : end + theDelimiter.size() );
    }
}

inline vector<string>
StringSplit( const string & theString,
             const string & theDelimiter,
             bool           theIncludeEmptyStrings = false )
{
    vector<string> v;
    StringSplit( &v, theString, theDelimiter, theIncludeEmptyStrings );
    return v;
}
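A small usage check of the two cases described in the note above (assuming the StringSplit overloads as defined here, and that UASSERT is your own assertion macro):
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    vector<string> a = StringSplit( "XZaaaXZ", "XZ" );       // empty strings dropped
    vector<string> b = StringSplit( "XZaaaXZ", "XZ", true ); // empty strings kept

    cout << a.size() << '\n';   // 1  -> "aaa"
    cout << b.size() << '\n';   // 3  -> "", "aaa", ""
}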
There is no common way of doing this.
I prefer boost::tokenizer; it's header-only and easy to use.
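Roughly like this (a sketch; see the Boost.Tokenizer documentation for the exact options):
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main()
{
    std::string str = "2X+7cos(3Y)";
    // every character listed in char_separator is a dropped delimiter
    boost::char_separator<char> sep("+-/^() \t");
    boost::tokenizer<boost::char_separator<char>> tok(str, sep);
    for (const std::string& t : tok)
        std::cout << t << '\n';   // prints: 2X 7cos 3Y (one per line)
}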
C strings
Simply insert a \0 where you wish to split. This is about as built-in as you can get with standard C functions.
This function splits on the first occurrence of a char separator, returning the second string.
char *split_string(char *str, char separator) {
    char *second = strchr(str, separator);
    if (second == NULL)
        return NULL;
    *second = '\0';
    ++second;
    return second;
}
A fairly simple method would be to copy the std::string into a writable C-style character buffer (strtok() modifies the string it scans, and c_str() only gives you a const pointer), then use strtok() to tokenize it. Not quite as eloquent as some of the other solutions listed here, but it's easy and it works.
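A sketch of that approach (the buffer copy and the delimiter set here are only illustrative):
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string input = "abc,123 xyz";
    // strtok() needs a writable, NUL-terminated buffer, so copy the string first
    std::vector<char> buf(input.begin(), input.end());
    buf.push_back('\0');

    for (char* tok = std::strtok(buf.data(), ", "); tok != nullptr;
         tok = std::strtok(nullptr, ", "))
    {
        std::cout << tok << '\n';   // prints: abc 123 xyz (one per line)
    }
}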