How to remove stopwords from a vector of sentences?

How to remove stopwords from a vector of sentences? - c++

I am working on some code that requires stopwords to be removed from sentences. My current solution does not work.
I have a vector of two test sentences:
std::vector<std::string> sentences = {"this is a test", "another a test"};
I have an unordered set of strings containing stopwords:
std::unordered_set<std::string> stopwords;
Now I tried to loop over the sentences in the vector, check and compare each word with the stopwords, and if it is a stopword is should get removed.
sentences.erase(std::remove_if(sentences.begin(), sentences.end(),
[](const std::string &s){return stopwords.find(s) != stopwords.end();}),
sentences.end());
The idea is that my vector -after removing the stopwords- contains the sentences without the stopwords, but for now, I get the exact same sentences back. Any idea why?
My unordered set is filled with the following function:
void load() {
std::ifstream file;
file.open ("stopwords.txt");
if(!file.is_open()) {return;}
std::string stopword;
while (file >> stopword) {
stopwords.insert(stopword);
}
}

Your current code cannot work, since you are not deleting words from each individual string. Your erase/remove_if call takes an entire string and tries to match the word in the set with the entire string.
First, you should write a simple function that when given a std::string and a map of words to delete, return the string with the deleted words.
Here is a small function using std::istringstream that can do this:
#include <unordered_set>
#include <sstream>
#include <string>
#include <iostream>
std::string remove_stop_words(const std::string& src, const std::unordered_set<std::string>& stops)
{
std::string retval;
std::istringstream strm(src);
std::string word;
while (strm >> word)
{
if ( !stops.count(word) )
retval += word + " ";
}
if ( !retval.empty())
retval.pop_back();
return retval;
}
int main()
{
std::string test = "this is a test";
std::unordered_set<std::string> stops = {"is", "test"};
std::cout << "Changed word:\n" << remove_stop_words(test, stops) << "\n";
}
Output:
Changed word:
this a
So once you have this working correctly, the std::vector version is nothing more than looping through each item in the vector and calling the remove_stop_words function:
int main()
{
std::vector<std::string> test = {"this is a test", "another a test"};
std::unordered_set<std::string> stops = {"is", "test"};
for (size_t i = 0; i < test.size(); ++i)
test[i] = remove_stop_words(test[i], stops);
std::cout << "Changed words:\n";
for ( auto& s : test )
std::cout << s << "\n";
}
Output:
Changed words:
this a
another a
Note that you can utilize the std::transform function to remove the hand-rolled loop in the above example:
#include <algorithm>
//...
int main()
{
std::vector<std::string> test = {"this is a test", "another a test"};
std::unordered_set<std::string> stops = {"is", "test"};
// Use std::transform
std::transform(test.begin(), test.end(), test.begin(),
[&](const std::string& s){return remove_stop_words(s, stops);});
std::cout << "Changed words:\n";
for ( auto& s : test )
std::cout << s << "\n";
}

Related

i.m trying to split string by whitespace using c++, where the data from database [duplicate]

What would be easiest method to split a string using c++11?
I've seen the method used by this post, but I feel that there ought to be a less verbose way of doing it using the new standard.
Edit: I would like to have a vector<string> as a result and be able to delimitate on a single character.

std::regex_token_iterator performs generic tokenization based on a regex. It may or may not be overkill for doing simple splitting on a single character, but it works and is not too verbose:
std::vector<std::string> split(const string& input, const string& regex) {
// passing -1 as the submatch index parameter performs splitting
std::regex re(regex);
std::sregex_token_iterator
first{input.begin(), input.end(), re, -1},
last;
return {first, last};
}

Here is a (maybe less verbose) way to split string (based on the post you mentioned).
#include <string>
#include <sstream>
#include <vector>
std::vector<std::string> split(const std::string &s, char delim) {
std::stringstream ss(s);
std::string item;
std::vector<std::string> elems;
while (std::getline(ss, item, delim)) {
elems.push_back(item);
// elems.push_back(std::move(item)); // if C++11 (based on comment from #mchiasson)
}
return elems;
}

Here's an example of splitting a string and populating a vector with the extracted elements using boost.
#include <boost/algorithm/string.hpp>
std::string my_input("A,B,EE");
std::vector<std::string> results;
boost::algorithm::split(results, my_input, boost::is_any_of(","));
assert(results[0] == "A");
assert(results[1] == "B");
assert(results[2] == "EE");

Another regex solution inspired by other answers but hopefully shorter and easier to read:
std::string s{"String to split here, and here, and here,..."};
std::regex regex{R"([\s,]+)"}; // split on space and comma
std::sregex_token_iterator it{s.begin(), s.end(), regex, -1};
std::vector<std::string> words{it, {}};

I don't know if this is less verbose, but it might be easier to grok for those more seasoned in dynamic languages such as javascript. The only C++11 features it uses is auto and range-based for loop.
#include <string>
#include <cctype>
#include <iostream>
#include <vector>
using namespace std;
int main()
{
string s = "hello how are you won't you tell me your name";
vector<string> tokens;
string token;
for (const auto& c: s) {
if (!isspace(c))
token += c;
else {
if (token.length()) tokens.push_back(token);
token.clear();
}
}
if (token.length()) tokens.push_back(token);
return 0;
}

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
using namespace std;
vector<string> split(const string& str, int delimiter(int) = ::isspace){
vector<string> result;
auto e=str.end();
auto i=str.begin();
while(i!=e){
i=find_if_not(i,e, delimiter);
if(i==e) break;
auto j=find_if(i,e, delimiter);
result.push_back(string(i,j));
i=j;
}
return result;
}
int main(){
string line;
getline(cin,line);
vector<string> result = split(line);
for(auto s: result){
cout<<s<<endl;
}
}

My choice is boost::tokenizer but I didn't have any heavy tasks and test with huge data.
Example from boost doc with lambda modification:
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>
#include <vector>
int main()
{
using namespace std;
using namespace boost;
string s = "This is, a test";
vector<string> v;
tokenizer<> tok(s);
for_each (tok.begin(), tok.end(), [&v](const string & s) { v.push_back(s); } );
// result 4 items: 1)This 2)is 3)a 4)test
return 0;
}

This is my answer. Verbose, readable and efficient.
std::vector<std::string> tokenize(const std::string& s, char c) {
auto end = s.cend();
auto start = end;
std::vector<std::string> v;
for( auto it = s.cbegin(); it != end; ++it ) {
if( *it != c ) {
if( start == end )
start = it;
continue;
}
if( start != end ) {
v.emplace_back(start, it);
start = end;
}
}
if( start != end )
v.emplace_back(start, end);
return v;
}

#include <string>
#include <vector>
#include <sstream>
inline vector<string> split(const string& s) {
vector<string> result;
istringstream iss(s);
for (string w; iss >> w; )
result.push_back(w);
return result;
}

Here is a C++11 solution that uses only std::string::find(). The delimiter can be any number of characters long. Parsed tokens are output via an output iterator, which is typically a std::back_inserter in my code.
I have not tested this with UTF-8, but I expect it should work as long as the input and delimiter are both valid UTF-8 strings.
#include <string>
template<class Iter>
Iter splitStrings(const std::string &s, const std::string &delim, Iter out)
{
if (delim.empty()) {
*out++ = s;
return out;
}
size_t a = 0, b = s.find(delim);
for ( ; b != std::string::npos;
a = b + delim.length(), b = s.find(delim, a))
{
*out++ = std::move(s.substr(a, b - a));
}
*out++ = std::move(s.substr(a, s.length() - a));
return out;
}
Some test cases:
void test()
{
std::vector<std::string> out;
size_t counter;
std::cout << "Empty input:" << std::endl;
out.clear();
splitStrings("", ",", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, empty delimiter:" << std::endl;
out.clear();
splitStrings("Hello, world!", "", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", no delimiter in string:" << std::endl;
out.clear();
splitStrings("abxycdxyxydefxya", "xyz", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", delimiter exists string:" << std::endl;
out.clear();
splitStrings("abxycdxy!!xydefxya", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", delimiter exists string"
", input contains blank token:" << std::endl;
out.clear();
splitStrings("abxycdxyxydefxya", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", delimiter exists string"
", nothing after last delimiter:" << std::endl;
out.clear();
splitStrings("abxycdxyxydefxy", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", only delimiter exists string:" << std::endl;
out.clear();
splitStrings("xy", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
}
Expected output:
Empty input:
0:
Non-empty input, empty delimiter:
0: Hello, world!
Non-empty input, non-empty delimiter, no delimiter in string:
0: abxycdxyxydefxya
Non-empty input, non-empty delimiter, delimiter exists string:
0: ab
1: cd
2: !!
3: def
4: a
Non-empty input, non-empty delimiter, delimiter exists string, input contains blank token:
0: ab
1: cd
2:
3: def
4: a
Non-empty input, non-empty delimiter, delimiter exists string, nothing after last delimiter:
0: ab
1: cd
2:
3: def
4:
Non-empty input, non-empty delimiter, only delimiter exists string:
0:
1:

One possible way of doing this is finding all occurrences of the split string and storing locations to a list. Then count input string characters and when you get to a position where there is a 'search hit' in the position list then you jump forward by 'length of the split string'. This approach takes a split string of any length. Here is my tested and working solution.
#include <iostream>
#include <string>
#include <list>
#include <vector>
using namespace std;
vector<string> Split(string input_string, string search_string)
{
list<int> search_hit_list;
vector<string> word_list;
size_t search_position, search_start = 0;
// Find start positions of every substring occurence and store positions to a hit list.
while ( (search_position = input_string.find(search_string, search_start) ) != string::npos) {
search_hit_list.push_back(search_position);
search_start = search_position + search_string.size();
}
// Iterate through hit list and reconstruct substring start and length positions
int character_counter = 0;
int start, length;
for (auto hit_position : search_hit_list) {
// Skip over substrings we are splitting with. This also skips over repeating substrings.
if (character_counter == hit_position) {
character_counter = character_counter + search_string.size();
continue;
}
start = character_counter;
character_counter = hit_position;
length = character_counter - start;
word_list.push_back(input_string.substr(start, length));
character_counter = character_counter + search_string.size();
}
// If the search string is not found in the input string, then return the whole input_string.
if (word_list.size() == 0) {
word_list.push_back(input_string);
return word_list;
}
// The last substring might be still be unprocessed, get it.
if (character_counter < input_string.size()) {
word_list.push_back(input_string.substr(character_counter, input_string.size() - character_counter));
}
return word_list;
}
int main() {
vector<string> word_list;
string search_string = " ";
// search_string = "the";
string text = "thetheThis is some text to test with the split-thethe function.";
word_list = Split(text, search_string);
for (auto item : word_list) {
cout << "'" << item << "'" << endl;
}
cout << endl;
}

How to run a string search algorithm through whole body of text

I am using the brute force string search algorithm to search through a small sentence, however I want the algorithm to return every time it finds the certain string instead of finding it once and then stopping
//Declare and initialise variables
string pat, text;
text = "This is a test sentence, find test within this string";
cout << text << endl;
//User input for pat
cout << "Please enter the string you want to search for" << endl;
cin >> pat;
//Set the length of the pat and text
int patLength = pat.size();
int textLength = text.size();
//Algorithm
for (int i = 0; i < textLength - patLength; ++i)
{
//Do while loop to run through the whole text
do
{
int j;
for (j = 0; j < patLength; j++)
{
if (text[i + j] != pat[j])
break; // Doesn't match here.
}
if (j == patLength)
{
finds.push(i); // Matched here.
}
} while (i < textLength);
}
//Print output
cout << "String: " << pat << " was found at positions: " << finds.top();
The program stores each find in a queue. When I run this program, it asks for the 'pat', then does nothing. I have done a bit of debugging and found that it is probably the do while loop. However I can't find a fix

You could use the std::string::find function combined with a function that you call for each find.
#include <iostream>
#include <functional>
#include <vector>
#include <sstream>
void Algorithm(
const std::string& text, const std::string& pat,
std::function<void(const std::string&,size_t)> f, std::vector<size_t>& positions)
{
size_t pos=0;
while((pos=text.find(pat, pos)) != std::string::npos) {
// store the position
positions.push_back(pos);
// call the supplied function
f(text, pos++);
}
}
// function to call for each position in which the pattern is found
void gotit(const std::string& found_in, size_t pos) {
std::cout << "Found in \"" << found_in << "\" # " << pos << "\n";
}
int main(int argc, char* argv[]) {
std::vector<std::string> args(argv+1, argv+argc);
if(args.size()==0)
args.push_back("This is a test sentence, find test within this string");
for(const auto& text : args) {
std::vector<size_t> found_at;
std::cout << "Please enter the string you want to search for: ";
std::string pat;
std::cin >> pat;
Algorithm(text, pat, gotit, found_at);
std::cout << "collected positions:\n";
for(size_t pos : found_at) {
std::cout << pos << "\n";
}
}
}

My first bit of advice would be to structure your code into separate functions.
Let's say you have a function that returns the position of the pattern's first occurrence in a sequence of characters:
using position = typename std::string::const_iterator;
position first_occurrence(position text_begin, position text_end, const std::string& pattern);
If there is no more occurrence of the pattern, it returns text_end.
You can now write a very simple loop:
auto occurrence = first_occurrence(text_begin, pattern);
while (occurrence != text_end) {
occurrences.push_back(occurrence);
occurrence = first_occurence(occurrence + 1, text_end, pattern);
}
to accumulate all the occurrences of the pattern.
The first_occurrence function already exists in the standard library under the name of std::search. Since C++17, you can customize this function with pattern-searching specialized searchers, such as std::boyer_moore_searcher: it pre-processes the pattern to make it faster to look for in the string. Here's an example application to your problem:
#include <algorithm>
#include <string>
#include <vector>
#include <functional>
using occurrence = typename std::string::const_iterator;
std::vector<occurrence> find_occurrences(const std::string& input, const std::string& pattern) {
auto engine = std::boyer_moore_searcher(pattern.begin(), pattern.end());
std::vector<occurrence> occurrences;
auto it = std::search(input.begin(), input.end(), engine);
while (it != input.end()) {
occurrences.push_back(it);
it = std::search(std::next(it), input.end(), engine);
}
return occurrences;
}
#include <iostream>
int main() {
std::string text = "This is a test sentence, find test within this string";
std::string pattern = "st";
auto occs = find_occurrences(text, pattern);
for (auto occ: occs) std::cout << std::string(occ, std::next(occ, pattern.size())) << std::endl;
}

C++ alternative of Java's split(str, -1) [duplicate]

What would be easiest method to split a string using c++11?
I've seen the method used by this post, but I feel that there ought to be a less verbose way of doing it using the new standard.
Edit: I would like to have a vector<string> as a result and be able to delimitate on a single character.

std::regex_token_iterator performs generic tokenization based on a regex. It may or may not be overkill for doing simple splitting on a single character, but it works and is not too verbose:
std::vector<std::string> split(const string& input, const string& regex) {
// passing -1 as the submatch index parameter performs splitting
std::regex re(regex);
std::sregex_token_iterator
first{input.begin(), input.end(), re, -1},
last;
return {first, last};
}

Here is a (maybe less verbose) way to split string (based on the post you mentioned).
#include <string>
#include <sstream>
#include <vector>
std::vector<std::string> split(const std::string &s, char delim) {
std::stringstream ss(s);
std::string item;
std::vector<std::string> elems;
while (std::getline(ss, item, delim)) {
elems.push_back(item);
// elems.push_back(std::move(item)); // if C++11 (based on comment from #mchiasson)
}
return elems;
}

Here's an example of splitting a string and populating a vector with the extracted elements using boost.
#include <boost/algorithm/string.hpp>
std::string my_input("A,B,EE");
std::vector<std::string> results;
boost::algorithm::split(results, my_input, boost::is_any_of(","));
assert(results[0] == "A");
assert(results[1] == "B");
assert(results[2] == "EE");

Another regex solution inspired by other answers but hopefully shorter and easier to read:
std::string s{"String to split here, and here, and here,..."};
std::regex regex{R"([\s,]+)"}; // split on space and comma
std::sregex_token_iterator it{s.begin(), s.end(), regex, -1};
std::vector<std::string> words{it, {}};

I don't know if this is less verbose, but it might be easier to grok for those more seasoned in dynamic languages such as javascript. The only C++11 features it uses is auto and range-based for loop.
#include <string>
#include <cctype>
#include <iostream>
#include <vector>
using namespace std;
int main()
{
string s = "hello how are you won't you tell me your name";
vector<string> tokens;
string token;
for (const auto& c: s) {
if (!isspace(c))
token += c;
else {
if (token.length()) tokens.push_back(token);
token.clear();
}
}
if (token.length()) tokens.push_back(token);
return 0;
}

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
using namespace std;
vector<string> split(const string& str, int delimiter(int) = ::isspace){
vector<string> result;
auto e=str.end();
auto i=str.begin();
while(i!=e){
i=find_if_not(i,e, delimiter);
if(i==e) break;
auto j=find_if(i,e, delimiter);
result.push_back(string(i,j));
i=j;
}
return result;
}
int main(){
string line;
getline(cin,line);
vector<string> result = split(line);
for(auto s: result){
cout<<s<<endl;
}
}

My choice is boost::tokenizer but I didn't have any heavy tasks and test with huge data.
Example from boost doc with lambda modification:
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>
#include <vector>
int main()
{
using namespace std;
using namespace boost;
string s = "This is, a test";
vector<string> v;
tokenizer<> tok(s);
for_each (tok.begin(), tok.end(), [&v](const string & s) { v.push_back(s); } );
// result 4 items: 1)This 2)is 3)a 4)test
return 0;
}

This is my answer. Verbose, readable and efficient.
std::vector<std::string> tokenize(const std::string& s, char c) {
auto end = s.cend();
auto start = end;
std::vector<std::string> v;
for( auto it = s.cbegin(); it != end; ++it ) {
if( *it != c ) {
if( start == end )
start = it;
continue;
}
if( start != end ) {
v.emplace_back(start, it);
start = end;
}
}
if( start != end )
v.emplace_back(start, end);
return v;
}

#include <string>
#include <vector>
#include <sstream>
inline vector<string> split(const string& s) {
vector<string> result;
istringstream iss(s);
for (string w; iss >> w; )
result.push_back(w);
return result;
}

Here is a C++11 solution that uses only std::string::find(). The delimiter can be any number of characters long. Parsed tokens are output via an output iterator, which is typically a std::back_inserter in my code.
I have not tested this with UTF-8, but I expect it should work as long as the input and delimiter are both valid UTF-8 strings.
#include <string>
template<class Iter>
Iter splitStrings(const std::string &s, const std::string &delim, Iter out)
{
if (delim.empty()) {
*out++ = s;
return out;
}
size_t a = 0, b = s.find(delim);
for ( ; b != std::string::npos;
a = b + delim.length(), b = s.find(delim, a))
{
*out++ = std::move(s.substr(a, b - a));
}
*out++ = std::move(s.substr(a, s.length() - a));
return out;
}
Some test cases:
void test()
{
std::vector<std::string> out;
size_t counter;
std::cout << "Empty input:" << std::endl;
out.clear();
splitStrings("", ",", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, empty delimiter:" << std::endl;
out.clear();
splitStrings("Hello, world!", "", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", no delimiter in string:" << std::endl;
out.clear();
splitStrings("abxycdxyxydefxya", "xyz", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", delimiter exists string:" << std::endl;
out.clear();
splitStrings("abxycdxy!!xydefxya", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", delimiter exists string"
", input contains blank token:" << std::endl;
out.clear();
splitStrings("abxycdxyxydefxya", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", delimiter exists string"
", nothing after last delimiter:" << std::endl;
out.clear();
splitStrings("abxycdxyxydefxy", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
std::cout << "Non-empty input, non-empty delimiter"
", only delimiter exists string:" << std::endl;
out.clear();
splitStrings("xy", "xy", std::back_inserter(out));
counter = 0;
for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
std::cout << counter << ": " << *i << std::endl;
}
}
Expected output:
Empty input:
0:
Non-empty input, empty delimiter:
0: Hello, world!
Non-empty input, non-empty delimiter, no delimiter in string:
0: abxycdxyxydefxya
Non-empty input, non-empty delimiter, delimiter exists string:
0: ab
1: cd
2: !!
3: def
4: a
Non-empty input, non-empty delimiter, delimiter exists string, input contains blank token:
0: ab
1: cd
2:
3: def
4: a
Non-empty input, non-empty delimiter, delimiter exists string, nothing after last delimiter:
0: ab
1: cd
2:
3: def
4:
Non-empty input, non-empty delimiter, only delimiter exists string:
0:
1:

One possible way of doing this is finding all occurrences of the split string and storing locations to a list. Then count input string characters and when you get to a position where there is a 'search hit' in the position list then you jump forward by 'length of the split string'. This approach takes a split string of any length. Here is my tested and working solution.
#include <iostream>
#include <string>
#include <list>
#include <vector>
using namespace std;
vector<string> Split(string input_string, string search_string)
{
list<int> search_hit_list;
vector<string> word_list;
size_t search_position, search_start = 0;
// Find start positions of every substring occurence and store positions to a hit list.
while ( (search_position = input_string.find(search_string, search_start) ) != string::npos) {
search_hit_list.push_back(search_position);
search_start = search_position + search_string.size();
}
// Iterate through hit list and reconstruct substring start and length positions
int character_counter = 0;
int start, length;
for (auto hit_position : search_hit_list) {
// Skip over substrings we are splitting with. This also skips over repeating substrings.
if (character_counter == hit_position) {
character_counter = character_counter + search_string.size();
continue;
}
start = character_counter;
character_counter = hit_position;
length = character_counter - start;
word_list.push_back(input_string.substr(start, length));
character_counter = character_counter + search_string.size();
}
// If the search string is not found in the input string, then return the whole input_string.
if (word_list.size() == 0) {
word_list.push_back(input_string);
return word_list;
}
// The last substring might be still be unprocessed, get it.
if (character_counter < input_string.size()) {
word_list.push_back(input_string.substr(character_counter, input_string.size() - character_counter));
}
return word_list;
}
int main() {
vector<string> word_list;
string search_string = " ";
// search_string = "the";
string text = "thetheThis is some text to test with the split-thethe function.";
word_list = Split(text, search_string);
for (auto item : word_list) {
cout << "'" << item << "'" << endl;
}
cout << endl;
}

String.erase giving out_of_range exception

I was meant to write some program which will read text from text file and erase given words.
Unfortunately, something's wrong with this particular part of code, I get the following exception notification:
This text is just a sample, based on other textterminate called after throwing
an instance of 'std::out_of_range' what<>: Basic_string_erase
I guess that there is something wrong with the way I use erase, I'm trying to to use do while loop, determine the beginning of word which is meant to be erased every time the loop is done and eventually erase text which begins at the beginning of word which is meant to be erased and the end of it - I'm using its length.
#include <iostream>
#include <string>
using namespace std;
void eraseString(string &str1, string &str2) // str1 - text, str2 - phrase
{
size_t positionOfPhrase = str1.find(str2);
if(positionOfPhrase == string::npos)
{
cout <<"Phrase hasn't been found... at all"<< endl;
}
else
{
do{
positionOfPhrase = str1.find(str2, positionOfPhrase + str2.size());
str1.erase(positionOfPhrase, str2.size());//**IT's PROBABLY THE SOURCE OF PROBLEM**
}while(positionOfPhrase != string::npos);
}
}
int main(void)
{
string str("This text is just a sample text, based on other text");
string str0("text");
cout << str;
eraseString(str, str0);
cout << str;
}

Your function is wrong. It is entirely unclear why you call method find twice after each other.
Try the following code.
#include <iostream>
#include <string>
std::string & eraseString( std::string &s1, const std::string &s2 )
{
std::string::size_type pos = 0;
while ( ( pos = s1.find( s2, pos ) ) != std::string::npos )
{
s1.erase( pos, s2.size() );
}
return s1;
}
int main()
{
std::string s1( "This text is just a sample text, based on other text" );
std::string s2( "text" );
std::cout << s1 << std::endl;
std::cout << eraseString( s1, s2 ) << std::endl;
return 0;
}
The program output is
This text is just a sample text, based on other text
This is just a sample , based on other

I think your trouble is that positionOfPhrase inside do loop can be string::npos, in which case erase will throw an exception. This can be fixed by changing logic to:
while (true) {
positionOfPhrase = str1.find(str2, positionOfPhrase + str2.size());
if (positionOfPhrase == string::npos) break;
str1.erase(positionOfPhrase, str2.size());
}

C++ split string using a list of words as separators

I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one bellow
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[is][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration

Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}

I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("\(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}

vector<string> strs;
boost::split(strs,line,boost::is_space());

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to remove stopwords from a vector of sentences? - c++

Related

i.m trying to split string by whitespace using c++, where the data from database [duplicate]

How to run a string search algorithm through whole body of text

C++ alternative of Java's split(str, -1) [duplicate]

String.erase giving out_of_range exception

C++ split string using a list of words as separators

Categories

Resources