How to get a word vector from a string?

I want to store words separated by spaces as individual string elements in a vector.
The input is a string that may or may not end in a symbol (comma, period, etc.).
All symbols will be separated by spaces too.
I wrote this function, but it doesn't return a vector of words.
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t character = 0; character < sentence.size(); ++character)
{
if (sentence[character] == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
What did I do wrong?

Your problem has already been resolved by the other answers and comments.
I would like to add that this functionality already exists in C++.
You can take advantage of the fact that the extraction operator (>>) extracts whitespace-separated tokens from a stream. Because a std::string is not a stream, we first put the string into a std::istringstream and then extract from this stream via the std::istream_iterator.
We can make life even easier.
For roughly 10 years (since C++11) we have had dedicated C++ functionality for splitting strings into tokens, explicitly designed for this purpose: the std::sregex_token_iterator. And because we have such a dedicated facility, we should simply use it.
The idea behind it is the iterator concept. In C++ we have many containers, and always iterators to iterate over the similar elements in these containers. A string with similar elements (tokens) separated by a delimiter can also be seen as such a container. With the std::sregex_token_iterator, we can iterate over the elements/tokens/substrings of the string, effectively splitting it up.
This iterator is very powerful and you can do much more fancy stuff with it, but that is too much for here. The important point is that splitting a string into tokens is a one-liner, for example a variable definition that uses a range constructor to iterate over the tokens.
See some examples below:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <regex>
const std::regex delimiter{ " " };
const std::regex reWord{ "(\\w+)" };
int main() {
// Some debug print function
auto print = [](const std::vector<std::string>& sv) -> void {
std::copy(sv.begin(), sv.end(), std::ostream_iterator<std::string>(std::cout, "\n")); std::cout << "\n"; };
// The test string
std::string test{ "word1 word2 word3 word4." };
//-----------------------------------------------------------------------------------------
// Solution 1: use istringstream and then extract from there
std::istringstream iss1(test);
// Define a vector (CTAD), use its range constructor and, the std::istream_iterator as iterator
std::vector words1(std::istream_iterator<std::string>(iss1), {});
print(words1); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 2: directly use dedicated function sregex_token iterator
std::vector<std::string> words2(std::sregex_token_iterator(test.begin(), test.end(), delimiter, -1), {});
print(words2); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 3: directly use dedicated function sregex_token iterator and look for words only
std::vector<std::string> words3(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {});
print(words3); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 4: Use such iterator in an algorithm, to copy data to a vector
std::vector<std::string> words4{};
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::back_inserter(words4));
print(words4); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 5: Use such iterator in an algorithm for direct output
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}

You added the index instead of the character:
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t i = 0; i < sentence.size(); ++i)
{
char character = sentence[i];
if (character == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
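if (!result_word.empty()) // the last word has no trailing space, so push it here
word_vector.push_back(result_word);
return word_vector;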
}

Your mistake came down to naming the loop variable character even though it is not a character but an index, so I would suggest using a range-based for loop here, since it avoids this kind of confusion. The clean solution is obviously to do what @ArminMontigny said, but I assume you are not allowed to use stringstreams. The code would look like this:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (char& character: sentence) // Now `character` is actually a character.
{
if (character==' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
word_vector.push_back(result_word); // In your solution, you forgot to push the last word into the vector.
return word_vector;
}
int main() {
string sentence="Maybe try range based loops";
vector<string> result= single_words(sentence);
for(string& word: result)
cout<<word<<" ";
return 0;
}

Related

Split text with array of delimiters

I want a function that splits text by an array of delimiters. I have a demo that works perfectly, but it is really, really slow. Here is an example of the parameters.
text:
"pop-pap-bab bob"
vector of delimiters:
"-"," "
the result:
"pop", "-", "pap", "-", "bab", "bob"
So the function loops through the string and tries to find delimiters. If it finds one, it pushes the text and the delimiter that was found to the result array; if the text contains only spaces or is empty, the text is not pushed.
std::string replace(std::string str,std::string old,std::string new_str){
size_t pos = 0;
while ((pos = str.find(old)) != std::string::npos) {
str.replace(pos, old.length(), new_str);
}
return str;
}
std::vector<std::string> split_with_delimeter(std::string str,std::vector<std::string> delimeters){
std::vector<std::string> result;
std::string token;
int flag = 0;
for(int i=0;i<(int)str.size();i++){
for(int j=0;j<(int)delimeters.size();j++){
if(str.substr(i,delimeters.at(j).size()) == delimeters.at(j)){
if(token != ""){
result.push_back(token);
token = "";
}
if(replace(delimeters.at(j)," ","") != ""){
result.push_back(delimeters.at(j));
}
i += delimeters.at(j).size()-1;
flag = 1;
break;
}
}
if(flag == 0){token += str.at(i);}
flag = 0;
}
if(token != ""){
result.push_back(token);
}
return result;
}
My issue is that the function is really slow, since it has 3 loops. I am wondering if anyone knows how to make the function faster. I am sorry if I wasn't clear enough; my English isn't the best.

It might be a good idea to use Boost.Xpressive. It is a powerful tool for various string operations, and it is more convenient than struggling with string::find_xx, hand-written for loops, or regex.
Concise explanation:
+as_xpr(" ") matches one or more repetitions, like + in a regex, and the prefix "-" makes the match non-greedy (shortest match).
If you define a regex parser as sregex rex = "(" >> (+_w | +"_") >> ":" >> +_d >> ")", it would match (port_num:8080). In this case, ">>" means concatenation of parsers and (+_w | +"_") means that it matches word characters or "_" repeatedly.
#include <vector>
#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
using namespace std;
using namespace boost::xpressive;
int main() {
string source = "Nigeria is a multi&&national state in--habited by more than 2;;50 ethnic groups speak###ing 500 distinct languages";
vector<string> delimiters{ " ", " ", "&&", "-", ";;", "###"};
vector<sregex> pss{ -+as_xpr(delimiters.front()) };
for (const auto& d : delimiters) pss.push_back(pss.back() | -+as_xpr(d));
vector<string> ret;
size_t pos = 0;
auto push = [&](auto s, auto e) { ret.push_back(source.substr(s, e)); };
for_each(sregex_iterator(source.begin(), source.end(), pss.back()), {}, [&](smatch const& m) {
if (m.position() - pos) push(pos, m.position() - pos);
pos = m.position() + m.str().size();
}
);
push(pos, source.size() - pos);
for (auto& s : ret) printf("%s\n", s.c_str());
}
The output is split by the multiple string delimiters.
Nigeria
is
a
multi
national
state
in
habited
by
more
than
2
50
ethnic
groups
speak
ing
500
distinct
languages

Maybe, as an alternative, you could use a regex? But maybe also too slow for you . . .
With a regex life would be very simple.
Please see the following example:
#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <iterator>
const std::regex re(R"((\w+|[\- ]))");
int main() {
std::string s{"pop-pap-bab bob"};
std::vector<std::string> part{std::sregex_token_iterator(s.begin(),s.end(),re),{}};
for (const std::string& p : part) std::cout << p << '\n';
}
We use the std::sregex_token_iterator in combination with the std::vector's range constructor to extract everything specified in the regex and put it all into the std::vector.
The regex itself is also simple. It specifies words or delimiters.
Maybe it's worth a try . . .

NOTE: You've said that your code is slow, but keep in mind that most answers only offer options that might speed up the program. Even if the author of an option measured a speed-up, that option may be slower on your machine, so do not forget to measure the execution speed yourself.
If I were you, I would create a separate function that receives an array of strings and outputs an array of delimited strings. The problem with this approach is that if one delimiter contains another delimiter, the result may not be what you expect, but it makes it easier to try different string-splitting options and find the best one.
My solution would look like this (though it requires C++20):
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>
#include <vector>
std::vector<std::string> split_elems_of_array(const std::vector<std::string>& array, const std::string& delim)
{
std::vector<std::string> result;
for(const auto str: array)
{
for (const auto word : std::views::split(str, delim))
{
std::string chunk(word.begin(), word.end());
if(!chunk.empty() && chunk != " ")
result.push_back(chunk + delim);
}
}
return result;
}
std::vector<std::string> split_string(std::string str, std::vector<std::string> delims)
{
std::vector<std::string> result = {std::string(str)};
for(const auto&delim: delims)
result = split_elems_of_array(result, delim);
return {result.begin(), result.end()};
}
On my machine, my approach is 56 times faster: 67 ms versus 5112 ms. The string length is 1000000, and there are 100 delimiters of length 100.

Here is the standard splitting algorithm. If you split pop-pap-bab bob by {'-', ' '}, it gives you ["pop", "pap", "bab", "bob"]. It doesn't store delimiters and doesn't check for empty text. You can change it to do those things too.
Define a vector of strings named result.
Define a string variable named buffer.
Loop over your string; if the current character is not a delimiter, append it to buffer.
If the current character is a delimiter, append buffer to result and clear buffer.
After the loop, append any remaining buffer to result, then return result.
std::vector<std::string> split(std::string str, std::vector<char> delimiters)
{
std::vector<std::string> result;
std::string buffer;
for (const auto ch : str)
{
if (std::find(delimiters.begin(), delimiters.end(), ch) == delimiters.end())
buffer += ch;
else
{
result.insert(result.end(), buffer);
buffer.clear();
}
}
if (buffer.length())
result.insert(result.end(), buffer);
return result;
}
Its time complexity is O(n·m), where n is the length of the string and m is the number of delimiters.

How to delete part of a string c++ [duplicate]

I have a string and I want to remove all the punctuation from it. How do I do that? I did some research and found that people use the ispunct() function (I tried that), but I can't seem to get it to work in my code. Does anyone have any ideas?
#include <string>
int main() {
string text = "this. is my string. it's here."
if (ispunct(text))
text.erase();
return 0;
}

Using algorithm remove_copy_if :-
string text,result;
std::remove_copy_if(text.begin(), text.end(),
std::back_inserter(result), //Store output
std::ptr_fun<int, int>(&std::ispunct)
);

POW already has a good answer if you need the result as a new string. This answer is how to handle it if you want an in-place update.
The first part of the recipe is std::remove_if, which can remove the punctuation efficiently, packing all the non-punctuation as it goes.
std::remove_if (text.begin (), text.end (), ispunct)
Unfortunately, std::remove_if doesn't shrink the string to the new size. It can't, because it has no access to the container itself. Therefore, there are junk characters left in the string after the packed result.
To handle this, std::remove_if returns an iterator that marks the end of the part of the string that's still needed. This can be used with the string's erase method, leading to the following idiom...
text.erase (std::remove_if (text.begin (), text.end (), ispunct), text.end ());
I call this an idiom because it's a common technique that works in many situations. Other types than string provide suitable erase methods, and std::remove (and probably some other algorithm library functions I've forgotten for the moment) take this approach of closing the gaps for items they remove, but leaving the container-resizing to the caller.

#include <string>
#include <iostream>
#include <cctype>
int main() {
std::string text = "this. is my string. it's here.";
for (int i = 0, len = text.size(); i < len; i++)
{
if (ispunct(text[i]))
{
text.erase(i--, 1);
len = text.size();
}
}
std::cout << text;
return 0;
}
Output
this is my string its here
When you delete a character, the size of the string changes. It has to be updated whenever deletion occurs. And, you deleted the current character, so the next character becomes the current character. If you don't decrement the loop counter, the character next to the punctuation character will not be checked.

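ispunct takes a char value, not a string.
You can do something like this:
std::string copy = text; // iterate over a copy so that erasing from text is safe
for (char c : copy)
if (ispunct(static_cast<unsigned char>(c))) text.erase(text.find(c), 1); // erase exactly one occurrence of c
This will work but it is a slow algorithm.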

Pretty good answer by Steve314.
I would like to add a small change :
text.erase (std::remove_if (text.begin (), text.end (), ::ispunct), text.end ());
Adding :: before ispunct selects the global C function, which avoids the ambiguity with the overloaded std::ispunct.

The problem here is that ispunct() takes one argument being a character, while you are trying to send a string. You should loop over the elements of the string and erase each character if it is a punctuation like here:
for(size_t i = 0; i<text.length(); ++i)
if(ispunct(text[i]))
text.erase(i--, 1);

#include <iostream>
#include <string>
#include <algorithm>
using namespace std;
int main() {
string str = "this. is my string. it's here.";
transform(str.begin(), str.end(), str.begin(), [](char ch)
{
if( ispunct(ch) )
return '\0';
return ch;
});
}

#include <iostream>
#include <string>
using namespace std;
int main()
{
string s;//string is defined here.
cout << "Please enter a string with punctuation's: " << endl;//Asking for users input
getline(cin, s);//reads in a single string one line at a time
/* ERROR Check: The loop didn't run at first because a semi-colon was placed at the end
of the statement. Remember not to add it for loops. */
for(auto &c : s) //loop checks every character
{
if (ispunct(c)) //to see if its a punctuation
{
c=' '; //if so it replaces it with a blank space.(delete)
}
}
cout << s << endl;
system("pause");
return 0;
}

Another way you could do this would be as follows:
#include <ctype.h> //needed for ispunct()
string onlyLetters(string str){
string retStr = "";
for(int i = 0; i < str.length(); i++){
if(!ispunct(str[i])){
retStr += str[i];
}
}
return retStr;
}
This ends up creating a new string instead of actually erasing the characters from the old string, but it is a little easier to wrap your head around than using some of the more complex built in functions.

I tried to apply @Steve314's answer but couldn't get it to work until I came across this note on cppreference.com:
Notes
Like all other functions from <cctype>, the behavior of std::ispunct
is undefined if the argument's value is neither representable as
unsigned char nor equal to EOF. To use these functions safely with
plain chars (or signed chars), the argument should first be converted
to unsigned char.
By studying the example it provides, I am able to make it work like this:
#include <string>
#include <iostream>
#include <cctype>
#include <algorithm>
int main()
{
std::string text = "this. is my string. it's here.";
std::string result;
text.erase(std::remove_if(text.begin(),
text.end(),
[](unsigned char c) { return std::ispunct(c); }),
text.end());
std::cout << text << std::endl;
}

Try this one; it will remove all the punctuation from the string:
str.erase(remove_if(str.begin(), str.end(), ::ispunct), str.end());

i got it.
size_t found = text.find('.');
text.erase(found, 1);

remove chars from string in c++

I was implementing a method to remove certain characters from a string txt, in place. The following is my code. The expected result is "bdeg"; however, the actual result is "bdegfg", so it seems the null terminator is not set. The weird thing is that when I use gdb to debug, after setting the null terminator
(gdb) p txt
$5 = (std::string &) @0xbffff248: {static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x804b014 "bdeg"}}
it looks right to me. So what is the problem here?
#include <iostream>
#include <string>
using namespace std;
void censorString(string &txt, string rem)
{
// create look-up table
bool lut[256]={false};
for (int i=0; i<rem.size(); i++)
{
lut[rem[i]] = true;
}
int i=0;
int j=0;
// iterate txt to remove chars
for (i=0, j=0; i<txt.size(); i++)
{
if (!lut[txt[i]]){
txt[j]=txt[i];
j++;
}
}
// set null-terminator
txt[j]='\0';
}
int main(){
string txt="abcdefg";
censorString(txt, "acf");
// expect: "bdeg"
std::cout << txt <<endl;
}
Follow-up question:
If the string is not truncated like a C string, what happens with txt[j]='\0',
and why is the result "bdegfg" and not 'bdeg'\0'g' or some corrupted string?
Another follow-up:
If I use txt.erase(txt.begin()+j, txt.end());
it works fine, so I'd better use the string-related API. The point is that I do not know the time complexity of the underlying code of this API.

std::string is not null-terminated in the way you think, so you have to do this another way.
Modify the function to:
void censorString(string &txt, string rem)
{
// create look-up table
bool lut[256]={false};
for (int i=0; i<rem.size(); i++)
{
lut[rem[i]] = true;
}
// iterate txt to remove chars
for (std::string::iterator it=txt.begin();it!=txt.end();)
{
if(lut[*it]){
it=txt.erase(it);//erase the character pointed by it and returns the iterator to next character
continue;
}
//increment iterator here to avoid increment after erasing the character
it++;
}
}
Basically, you have to use the std::string::erase function to erase a character in the string; it takes an iterator as input and returns an iterator to the next character.
http://en.cppreference.com/w/cpp/string/basic_string/erase
http://www.cplusplus.com/reference/string/string/erase/
The complexity of the erase function is O(n), so the whole function has a complexity of O(n^2). The space complexity for a very long string (i.e. more than 256 characters) would be O(n).
Well, there is another way that has only O(n) time complexity.
Create another string and, while iterating over txt, append the characters that are not censored.
The new function would be:
void censorString(string &txt, string rem)
{
// create look-up set (requires #include <unordered_set>)
std::unordered_set<char> lookUpSet(rem.begin(),rem.end());
std::string newString;
// iterate txt to remove chars
for (std::string::iterator it=txt.begin();it!=txt.end();it++)
{
if(lookUpSet.find(*it)==lookUpSet.end()){
newString.push_back(*it);
}
}
txt=std::move(newString);
}
Now this function has a complexity of O(n), since std::unordered_set::find has an average complexity of O(1) and std::string::push_back has an amortized complexity of O(1).
If you use a normal std::set, whose find has a complexity of O(log m) in the size m of the set, then the complexity of the whole function would become O(n log m).

Embedding null characters inside a std::string is completely valid and does not change the length of the string. It will give you unexpected results if you, for example, try to output it using stream insertion, though.
The goal you are trying to reach can be achieved much more easily:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
int main()
{
std::string txt="abcdefg";
std::string filter = "acf";
txt.erase(std::remove_if(txt.begin(), txt.end(), [&](char c)
{
return std::find(filter.begin(), filter.end(), c) != filter.end();
}), txt.end());
// expect: "bdeg"
std::cout << txt << std::endl;
}
In the same vein as Himanshu's answer, you can accomplish an O(N) complexity (using additional memory) like so:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_set>
int main()
{
std::string txt="abcdefg";
std::string filter = "acf";
std::unordered_set<char> filter_set(filter.begin(), filter.end());
std::string output;
std::copy_if(txt.begin(), txt.end(), std::back_inserter(output), [&](char c)
{
return filter_set.find(c) == filter_set.end();
});
// expect: "bdeg"
std::cout << output << std::endl;
}

You have not told the string that you have changed its size. You need to use the resize method to update the size if you remove any characters from the string.

The problem is that you can't treat a C++ string like a C-style string, i.e. you can't just insert a 0 like in C. To convince yourself of this, add "cout << txt.length() << endl;" to your code; you'll get 7. You want to use the erase() method:
Removes specified characters from the string.
1) Removes min(count, size() - index) characters starting at index.
2) Removes the character at position.
3) Removes the characters in the range [first; last).

Text is a string not a character array.
This code
// set null-terminator
txt[j]='\0';
Will not truncate the string at the j-th position.

sequence of delimiters in function strtok

I'm trying to obtain tokens with the strtok() function in C++. It is very simple when you use just a set of single-character delimiters, like:
token = strtok(auxiliar,"[,]");. This will cut auxiliar every time the function finds [, , or ].
What I want is to obtain tokens separated by a sequence of delimiters like: [,]
Is it possible to do that with the strtok function? I cannot find a way.
Thank you!

If you want strtok to treat [,] as a single token, this cannot be done. strtok always treats whatever you pass in the delimiters string as individual, 1-character delimiters.
Beyond this, it's best to not use strtok in C++ anyway. It is not re-entrant (eg, you can't nest calls), not type-safe, and very easy to use in a way that creates nasty bugs.
The simplest solution is to search within a std::string for the particular delimiter you want, in a loop. If you need more sophisticated functionality, there are tokenizers in the Boost library, and I've also posted code to do more comprehensive tokenizing using only the Standard Library, here.
The code I've linked above also treats delimiters as single characters, but I think the code could be extended in the way you desire.

If this is really C++, you should use std::string and not C strings.
Here's an example that uses only the STL to split a std::string into a std::vector:
#include <cstddef>
#include <string>
#include <vector>
std::vector<std::string> split(std::string str, std::string sep) {
std::vector<std::string> vec;
size_t i = 0, j = 0;
do {
i = str.find(sep, j);
vec.push_back( str.substr(j, i-j) );
j = i + sep.size();
} while (i != str.npos);
return vec;
}
int main() {
std::vector<std::string> vec = split("This[,]is[[,]your, string", "[,]");
// vec contains "This", "is[", "your, string"
return 0;
}

If you can use the new C++11 features, you can do it with regex and token iterators. For example:
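regex reg("\\[,\\]"); // backslashes must be doubled inside a normal string literal
const sregex_token_iterator end;
string aux(auxiliar);
// -1 selects the text between matches, i.e. the tokens
for(sregex_token_iterator iter(aux.begin(), aux.end(), reg, -1); iter != end; ++iter) {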
cout << *iter << endl;
}
This example is from the Wrox book Professional C++.

If you can use the boost library I think this will do what you want it to do - not totally sure though as your question is a little unclear
#include <iostream>
#include <vector>
#include <string>
#include <boost/tokenizer.hpp>
int main(int argc, char *argv[])
{
std::string data("[this],[is],[some],[weird],[fields],[data],[I],[want],[to],[split]");
boost::tokenizer<boost::char_separator<char> > tokens(data, boost::char_separator<char>("],["));
std::vector<std::string> words(tokens.begin(), tokens.end());
for(std::vector<std::string>::const_iterator i=words.begin(),end=words.end(); i!=end; ++i)
{
std::cout << '\'' << *i << "'\n";
}
return 0;
}
This produces the following output
'this'
'is'
'some'
'weird'
'fields'
'data'
'I'
'want'
'to'
'split'

Selective iterator

FYI: no boost, yes it has this, I want to reinvent the wheel ;)
Is there some form of a selective iterator (possible) in C++? What I want is to separate strings like this:
some:word{or other
to a form like this:
some : word { or other
I can do that with two loops and find_first_of(":") and ("{") but this seems (very) inefficient to me. I thought that maybe there would be a way to create/define/write an iterator that would iterate over all these values with for_each. I fear this will have me writing a full-fledged custom way-too-complex iterator class for a std::string.
So I thought maybe this would do:
std::vector<size_t> list;
size_t index = mystring.find(":");
while( index != std::string::npos )
{
list.push_back(index);
index = mystring.find(":", list.back() + 1); // continue after the last match, otherwise the loop never ends
}
std::for_each(list.begin(), list.end(), addSpaces(mystring));
This looks messy to me, and I'm quite sure a more elegant way of doing this exists, but I can't think of it. Does anyone have a bright idea? Thanks.
PS: I did not test the code posted; it is just a quick write-up of what I would try.
UPDATE: After taking all your answers into account, I came up with this, and it works to my liking :). It does assume the last char is a newline or something similar; otherwise an ending {, }, or : won't get processed.
void tokenize( string &line )
{
char oneBack = ' ';
char twoBack = ' ';
char current = ' ';
size_t length = line.size();
for( size_t index = 0; index<length; ++index )
{
twoBack = oneBack;
oneBack = current;
current = line.at( index );
if( isSpecial(oneBack) )
{
if( !isspace(twoBack) ) // insert before
{
line.insert(index-1, " ");
++index;
++length;
}
if( !isspace(current) ) // insert after
{
line.insert(index, " ");
++index;
++length;
}
}
}
}
Comments are welcome as always :)

That's relatively easy using the std::istream_iterator.
What you need to do is define your own class (say Term). Then define how to read a single "word" (term) from the stream using the operator >>.
I don't know what your exact definition of a word is, so I am using the following definition:
Any consecutive sequence of alphanumeric characters is a term.
Any single non-whitespace character that is not alphanumeric is also a term.
Try this:
#include <string>
#include <sstream>
#include <iostream>
#include <iterator>
#include <algorithm>
class Term
{
public:
// This cast operator is not required but makes it easy to use
// a Term anywhere that a string can normally be used.
operator std::string const&() const {return value;}
private:
// A term is just a string
// And we friend the operator >> to make sure we can read it.
friend std::istream& operator>>(std::istream& inStr,Term& dst);
std::string value;
};
Now all we have to do is define an operator >> that reads a word according to the rules:
// This function could be a lot neater using some boost regular expressions.
// I just do it manually to show it can be done without boost (as requested)
std::istream& operator>>(std::istream& inStr,Term& dst)
{
// Note the >> operator drops all leading white space.
// So we get the first non white space
char first;
inStr >> first;
// If the stream is in any bad state then stop processing.
if (inStr)
{
if(std::isalnum(first))
{
// Alpha Numeric so read a sequence of characters
dst.value = first;
// This is ugly. And needs re-factoring.
while((first = inStr.get(), inStr) && std::isalnum(static_cast<unsigned char>(first)))
{
dst.value += first;
}
if (inStr)
{
// The last character read was not EOF and not part of the word.
// So put it back for use by the next call to read from the stream.
inStr.putback(first);
}
else
{
// We hit EOF (or an error) after reading at least one character.
// We know that we have a word so clear any errors to make sure it
// is used. Let the next attempt to read a word (term) fail at the outer if.
inStr.clear();
}
}
else
{
// It was not alpha numeric so it is a one character word.
dst.value = first;
}
}
return inStr;
}
So now we can use it in standard algorithms by just employing the istream_iterator
int main()
{
std::string data = "some:word{or other";
std::stringstream dataStream(data);
std::copy( // Read the stream one Term at a time.
std::istream_iterator<Term>(dataStream),
std::istream_iterator<Term>(),
// Note the ostream_iterator is using a std::string
// This works because a Term can be converted into a string.
std::ostream_iterator<std::string>(std::cout, "\n")
);
}
The output:
> ./a.exe
some
:
word
{
or
other

std::string const str = "some:word{or other";
std::string result;
result.reserve(str.size());
for (std::string::const_iterator it = str.begin(), end = str.end();
it != end; ++it)
{
if (isalnum(*it))
{
result.push_back(*it);
}
else
{
result.push_back(' '); result.push_back(*it); result.push_back(' ');
}
}
Insert version for speed-up
std::string str = "some:word{or other";
for (std::string::iterator it = str.begin(), end = str.end(); it != end; ++it)
{
if (!isalnum(*it))
{
it = str.insert(it, ' ') + 2;
it = str.insert(it, ' ');
end = str.end();
}
}
Note that std::string::insert inserts BEFORE the iterator passed and returns an iterator to the newly inserted character. Assigning is important since the buffer may have been reallocated at another memory location (the iterators are invalidated by the insertion). Also note that you can't keep end for the whole loop, each time you insert you need to recompute it.

a more elegant way of doing this exists.
I do not know how Boost implements that, but the traditional way is to feed the input string character by character into an FSM (finite state machine) that detects where tokens (words, symbols) start and end.
I can do that with two loops and find_first_of(":") and ("{")
One loop with std::find_first_of() should suffice.
Though I'm still a huge fan of FSMs for such parsing tasks.
P.S. Similar question

How about something like:
std::string::const_iterator it, end = mystring.end();
for(it = mystring.begin(); it != end; ++it) {
if ( !isalnum( *it ))
list.push_back(it);
}
This way, you'll only iterate once through the string, and isalnum from ctype.h seems to do what you want. Of course, the code above is very simplistic and incomplete and only suggests a solution.

Are you looking to tokenize the input string, à la strtok?
If so, here is a tokenizing function that you can use. It takes an input string and a string of delimiters (each char in the string is a possible delimiter), and it returns a vector of tokens. Each token is a pair holding the delimiter used in that case and the delimited string:
#include <cstdlib>
#include <vector>
#include <string>
#include <functional>
#include <iostream>
#include <algorithm>
using namespace std;
// FUNCTION : stringtok(char const* Raw, string sToks)
// PARAMATERS : Raw Pointer to NULL-Terminated string containing a string to be tokenized.
// sToks string of individual token characters -- each character in the string is a token
// DESCRIPTION : Tokenizes a string, much in the same was as strtok does. The input string is not modified. The
// function is called once to tokenize a string, and all the tokens are retuned at once.
// RETURNS : Returns a vector of strings. Each element in the vector is one token. The token character is
// not included in the string. The number of elements in the vector is N+1, where N is the number
// of times the Token character is found in the string. If one token is an empty string (as with the
// string "string1##string3", where the token character is '#'), then that element in the vector
// is an empty string.
// NOTES :
//
typedef pair<char,string> token; // first = delimiter, second = data
inline vector<token> tokenize(const string& str, const string& delims, bool bCaseSensitive=false) // tokenizes a string, returns a vector of tokens
{
bCaseSensitive;
// prologue
vector<token> vRet;
// tokenize input string
for( string::const_iterator itA = str.begin(), it=itA; it != str.end(); it = find_first_of(++it,str.end(),delims.begin(),delims.end()) )
{
// prologue
// find end of token
string::const_iterator itEnd = find_first_of(it+1,str.end(),delims.begin(),delims.end());
// add string to output
if( it == itA ) vRet.push_back(make_pair(0,string(it,itEnd)));
else vRet.push_back(make_pair(*it,string(it+1,itEnd)));
// epilogue
}
// epilogue
return vRet;
}
using namespace std;
int main()
{
string input = "some:word{or other";
typedef vector<token> tokens;
tokens toks = tokenize(input.c_str(), " :{");
cout << "Input: '" << input << " # Tokens: " << toks.size() << "'\n";
for( tokens::iterator it = toks.begin(); it != toks.end(); ++it )
{
cout << " Token : '" << it->second << "', Delimiter: '" << it->first << "'\n";
}
return 0;
}
