Having issues finding multiple substrings within a string - c++

I am trying to write a program that compares two strings (a string and a substring) and increments a counter each time the substring is found within the string. However, using the standard:
if(str.find(substr) != string::npos)
{
count++;
}
I run into the problem that if the substring appears multiple times in the string it only increments once. So if the string is "test test test test" and the substring is "test" count only ends up being 1 instead of 4.
What would be the best way to fix this?
*Notes for context:
1) At one point I was checking the string character by character to see if they matched, but had to scrap that when I ran into issues with some words containing smaller words.
Example: 'is' would get picked up inside the word 'this', etc.
2) The larger program that this is for accepts two vectors. The first vector has a string for each element, each being a sentence the user types in (acting as the main string in the example above). The second vector has each word from all the sentences entered into the first vector (acting as the substring in the example above). Not sure if that bit matters or not, but figured I would throw it in there.
Example:
vector<string> str {"this is line one", "this is line two", "this is line three"};
vector<string> substr {"is", "line", "one", "this", "three", "two"};
3) I'm thinking some way of doing the opposite of != string::npos would work, but I'm not sure if that even exists.

You need a loop to find all of the occurrences of a substring in a given string.
However, since you want to differentiate substrings that are whole words from substrings in larger words, you need to parse the string to determine the whole words before you compare them.
You can use std::string::find_first_of() and std::string::find_first_not_of() to find the beginning and ending indexes of each whole word between desired delimiters (whitespace, punctuation, etc). You can use std::string::compare() to compare a substring between those two indexes to your desired substring. For example:
#include <string>
const std::string delims = ",. ";
size_t countWord(const std::string &str, const std::string &word)
{
std::string::size_type start = 0, end;
size_t count = 0;
while ((start = str.find_first_not_of(delims, start)) != std::string::npos)
{
end = str.find_first_of(delims, start+1);
if (end == std::string::npos)
{
if (str.compare(start, str.size()-start, word) == 0)
++count;
break;
}
if (str.compare(start, end-start, word) == 0)
++count;
start = end + 1;
}
return count;
}
Alternatively, you can extract the whole words into a std::vector and then use std::count() to count how many elements match the substring. For example:
#include <string>
#include <vector>
#include <algorithm>
const std::string delims = ",. ";
size_t countWord(const std::string &str, const std::string &word)
{
std::vector<std::string> vec;
std::string::size_type start = 0, end;
while ((start = str.find_first_not_of(delims, start)) != std::string::npos)
{
end = str.find_first_of(delims, start+1);
if (end == std::string::npos)
{
vec.push_back(str.substr(start));
break;
}
vec.push_back(str.substr(start, end-start));
start = end + 1;
}
return std::count(vec.begin(), vec.end(), word);
}
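For the simpler case where matches inside larger words should also be counted, the plain loop mentioned at the start of this answer might look like the sketch below. The name countOccurrences is hypothetical, and the sketch assumes non-overlapping matches are wanted, so the search resumes just past each match.

```cpp
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of needle in haystack.
// An empty needle is treated as "no matches" to avoid an endless loop.
std::size_t countOccurrences(const std::string& haystack, const std::string& needle)
{
    if (needle.empty())
        return 0;

    std::size_t count = 0;
    std::string::size_type pos = 0;
    // find() takes a start position, so each search resumes after the last match.
    while ((pos = haystack.find(needle, pos)) != std::string::npos)
    {
        ++count;
        pos += needle.size();
    }
    return count;
}
```

With the string from the question, countOccurrences("test test test test", "test") gives 4. Advancing pos by 1 instead of needle.size() would count overlapping matches as well.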

Related

Split text with array of delimiters

I want a function that splits text by an array of delimiters. I have a demo that works perfectly, but it is really, really slow. Here is an example of the parameters.
text:
"pop-pap-bab bob"
vector of delimiters:
"-"," "
the result:
"pop", "-", "pap", "-", "bab", "bob"
So the function loops through the string and tries to find delimiters. If it finds one, it pushes the text and the delimiter that was found to the result array. If the text contains only spaces or is empty, it does not push the text.
std::string replace(std::string str,std::string old,std::string new_str){
size_t pos = 0;
while ((pos = str.find(old)) != std::string::npos) {
str.replace(pos, old.length(), new_str);
}
return str;
}
std::vector<std::string> split_with_delimeter(std::string str,std::vector<std::string> delimeters){
std::vector<std::string> result;
std::string token;
int flag = 0;
for(int i=0;i<(int)str.size();i++){
for(int j=0;j<(int)delimeters.size();j++){
if(str.substr(i,delimeters.at(j).size()) == delimeters.at(j)){
if(token != ""){
result.push_back(token);
token = "";
}
if(replace(delimeters.at(j)," ","") != ""){
result.push_back(delimeters.at(j));
}
i += delimeters.at(j).size()-1;
flag = 1;
break;
}
}
if(flag == 0){token += str.at(i);}
flag = 0;
}
if(token != ""){
result.push_back(token);
}
return result;
}
My issue is that the function is really slow since it has 3 loops. I am wondering if anyone knows how to make the function faster. I am sorry if I wasn't clear enough; my English isn't the best.
It might be a good idea to use Boost.Xpressive. It is a powerful tool for string operations and can be easier than struggling with string::find_xx, hand-written for loops, or regex.
Concise explanation:
+as_xpr(" ") matches one or more repetitions, like + in a regex, and the prefix "-" makes the match non-greedy (shortest match).
If you define a regex parser as sregex rex = "(" >> (+_w | +"_") >> ":" >> +_d >> ")", it would match (port_num:8080). In this case, ">>" concatenates parsers and (+_w | +"_") matches word characters or "_" repeatedly.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
using namespace std;
using namespace boost::xpressive;
int main() {
string source = "Nigeria is a multi&&national state in--habited by more than 2;;50 ethnic groups speak###ing 500 distinct languages";
vector<string> delimiters{ " ", " ", "&&", "-", ";;", "###"};
vector<sregex> pss{ -+as_xpr(delimiters.front()) };
for (const auto& d : delimiters) pss.push_back(pss.back() | -+as_xpr(d));
vector<string> ret;
size_t pos = 0;
auto push = [&](auto s, auto e) { ret.push_back(source.substr(s, e)); };
for_each(sregex_iterator(source.begin(), source.end(), pss.back()), {}, [&](smatch const& m) {
if (m.position() - pos) push(pos, m.position() - pos);
pos = m.position() + m.str().size();
}
);
push(pos, source.size() - pos);
for (auto& s : ret) printf("%s\n", s.c_str());
}
The output is split on the multiple string delimiters:
Nigeria
is
a
multi
national
state
in
habited
by
more
than
2
50
ethnic
groups
speak
ing
500
distinct
languages
Maybe, as an alternative, you could use a regex? But maybe also too slow for you . . .
With a regex life would be very simple.
Please see the following example:
#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <iterator>
const std::regex re(R"((\w+|[\- ]))");
int main() {
std::string s{"pop-pap-bab bob"};
std::vector<std::string> part{std::sregex_token_iterator(s.begin(),s.end(),re),{}};
for (const std::string& p : part) std::cout << p << '\n';
}
We use std::sregex_token_iterator in combination with std::vector's range constructor to extract everything specified in the regex and put it all into the std::vector.
The regex itself is also simple. It specifies words or delimiters.
Maybe it's worth a try . . .
NOTE: You've complained that your code is slow, but it's important to understand that most of the answers offer options that may or may not speed up the program. Even if the author of an option measured a speedup, that option may be slower on your machine, so do not forget to measure the execution speed yourself.
If I were you, I would create a separate function that receives an array of strings and outputs an array of delimited strings. The problem with this approach may be that if one delimiter includes another delimiter, the result may not be what you expect, but it makes it easier to iterate through different options for string splitting to find the best one.
My solution would look like this (though it requires C++20):
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>
#include <vector>
std::vector<std::string> split_elems_of_array(const std::vector<std::string>& array, const std::string& delim)
{
std::vector<std::string> result;
for(const auto str: array)
{
for (const auto word : std::views::split(str, delim))
{
std::string chunk(word.begin(), word.end());
if(!chunk.empty() && chunk != " ")
result.push_back(chunk + delim);
}
}
return result;
}
std::vector<std::string> split_string(std::string str, std::vector<std::string> delims)
{
std::vector<std::string> result = {std::string(str)};
for(const auto&delim: delims)
result = split_elems_of_array(result, delim);
return {result.begin(), result.end()};
}
For my machine, my approach is 56 times faster: 67 ms versus 5112 ms. Length of string is 1000000, there are 100 delims with length 100
Here is the standard splitting algorithm. If you split pop-pap-bab bob by {'-', ' '}, it gives you ["pop", "pap", "bab", "bob"]. It does not store delimiters and does not check for empty text, but you can change it to do those things too.
Define a vector of strings named result.
Define a string variable named buffer.
Loop over your string; if the current character is not a delimiter, append it to buffer.
If the current character is a delimiter, append buffer to result.
Return result at the end.
std::vector<std::string> split(std::string str, std::vector<char> delimiters)
{
std::vector<std::string> result;
std::string buffer;
for (const auto ch : str)
{
if (std::find(delimiters.begin(), delimiters.end(), ch) == delimiters.end())
buffer += ch;
else
{
result.insert(result.end(), buffer);
buffer.clear();
}
}
if (buffer.length())
result.insert(result.end(), buffer);
return result;
}
Its time complexity is O(n·m), where n is the length of the string and m is the number of delimiters.
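Since the question wanted multi-character delimiters and wanted the delimiters kept in the result, here is a rough sketch of how the single-pass loop above could be adapted. The name splitKeepDelims is made up for illustration. It assumes delimiters are tried in the order given (first match wins) and that a delimiter consisting only of spaces is dropped from the output, as in the question's own code. It uses std::string::compare on the original string, so no temporary substring is built for each comparison.

```cpp
#include <string>
#include <vector>

// Splits text on any of the given string delimiters, in one pass.
// Non-empty tokens are kept; delimiters are kept unless they are all spaces.
std::vector<std::string> splitKeepDelims(const std::string& text,
                                         const std::vector<std::string>& delims)
{
    std::vector<std::string> result;
    std::string::size_type tokenStart = 0;
    std::string::size_type i = 0;

    // Pushes the pending token [tokenStart, end) if it is not empty.
    auto flush = [&](std::string::size_type end) {
        if (end > tokenStart)
            result.push_back(text.substr(tokenStart, end - tokenStart));
    };

    while (i < text.size())
    {
        const std::string* hit = nullptr;
        for (const auto& d : delims)
        {
            // compare() checks the delimiter in place, without copying.
            if (!d.empty() && text.compare(i, d.size(), d) == 0)
            {
                hit = &d;
                break;
            }
        }
        if (hit)
        {
            flush(i);
            if (hit->find_first_not_of(' ') != std::string::npos)
                result.push_back(*hit);
            i += hit->size();
            tokenStart = i;
        }
        else
        {
            ++i;
        }
    }
    flush(text.size());
    return result;
}
```

For the question's input "pop-pap-bab bob" with delimiters "-" and " ", this yields "pop", "-", "pap", "-", "bab", "bob". Whether it is actually faster than the original should be measured on real data.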

Function to separate each word from a string and put them into a vector, without using auto keyword?

I'm really stuck here. I can't edit the main function, and inside it there is a function call whose only parameter is the string. How can I make this function put each word from the string into a vector, without using the auto keyword? I realize that this code is probably really wrong, but it's my best attempt at what it should look like.
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
vector<string> extract_words(const char * sentence[])
{
string word = "";
vector<string> list;
for (int i = 0; i < sentence.size(); ++i)
{
while (sentence[i] != ' ')
{
word = word + sentence[i];
}
list.push_back(word);
}
}
int main()
{
sentence = "Help me please" /*In the actual code a function call is here that gets input sentence.*/
if (sentence.length() > 0)
{
words = extract_words(sentence);
}
}
Do you know how to read "words" from std::cin?
Then you can put that string in a std::istringstream which works like std::cin but for "reading" strings instead.
Use the stream extract operator >> in a loop to get all the words one by one, and add them to the vector.
Perhaps something like:
std::vector<std::string> get_all_words(std::string const& string)
{
std::vector<std::string> words;
std::istringstream in(string);
std::string word;
while (in >> word)
{
words.push_back(word);
}
return words;
}
With a little more knowledge of C++ and its standard classes and functions, you can actually make the function a lot shorter:
std::vector<std::string> get_all_words(std::string const& string)
{
std::istringstream in(string);
return std::vector<std::string>(std::istream_iterator<std::string>(in),
std::istream_iterator<std::string>());
}
I recommend making the argument to the function a const std::string& instead of const char * sentence[]. A std::string has many member functions, like find_first_of, find_first_not_of and substr and more that could help a lot.
Here's an example using those mentioned:
std::vector<std::string> extract_words(const std::string& sentence)
{
/* Control chars, "whitespaces", that we don't want in our words:
\a audible bell
\b backspace
\f form feed
\n line feed
\r carriage return
\t horizontal tab
\v vertical tab
*/
static const char whitespaces[] = " \t\n\r\a\b\f\v";
std::vector<std::string> list;
std::size_t begin = 0;
while(true)
{
// Skip whitespaces by finding the first non-whitespace, starting at
// "begin":
begin = sentence.find_first_not_of(whitespaces, begin);
// If no non-whitespace char was found, break out:
if(begin == std::string::npos) break;
// Search for a whitespace starting at "begin + 1":
std::size_t end = sentence.find_first_of(whitespaces, begin + 1);
// Store the result by creating a substring from "begin" with the
// length "end - begin":
list.push_back(sentence.substr(begin, end - begin));
// If no whitespace was found, break out:
if(end == std::string::npos) break;
// Set "begin" to the char after the found whitespace before the loop
// makes another lap:
begin = end + 1;
}
return list;
}
With the added restriction "no breaks", this could be a variant. It does exactly the same as the above, but without using break:
std::vector<std::string> extract_words(const std::string& sentence)
{
static const char whitespaces[] = " \t\n\r\a\b\f\v";
std::vector<std::string> list;
std::size_t begin = 0;
bool loop = true;
while(loop)
{
begin = sentence.find_first_not_of(whitespaces, begin);
if(begin == std::string::npos) {
loop = false;
} else {
std::size_t end = sentence.find_first_of(whitespaces, begin + 1);
list.push_back(sentence.substr(begin, end - begin));
if(end == std::string::npos) {
loop = false;
} else {
begin = end + 1;
}
}
}
return list;
}

how do you split a string embedded in a delimiter in C++?

I understand how to split a string by a delimiter in C++, but how do you split a string embedded in a delimiter, e.g. split "~!hello~! random junk... ~!world~!" by the string "~!" into an array of ["hello", " random junk...", "world"]? Are there any C++ standard library functions for this, or if not, any algorithm which could achieve this?
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> split(string s,string delimiter){
vector<string> res;
s+=delimiter; //adding delimiter at end of string
string word;
string::size_type pos = s.find(delimiter); // size_type rather than int, so the comparison with npos is exact
while (pos != string::npos) {
word = s.substr(0, pos); // The Word that comes before the delimiter
res.push_back(word); // Push the Word to our Final vector
s.erase(0, pos + delimiter.length()); // Delete the Delimiter and repeat till end of String to find all words
pos = s.find(delimiter); // Update pos to hold position of next Delimiter in our String
}
res.push_back(s); //push the last word that comes after the delimiter
return res;
}
int main() {
string s="~!hello~!random junk... ~!world~!";
vector<string>words = split(s,"~!");
int n=words.size();
for(int i=0;i<n;i++)
std::cout<<words[i]<<std::endl;
return 0;
}
The above program will find all the words that occur before, in between and after the delimiter that you specify. With minor changes, you can make the function suit your needs (for example, if you don't need the word that occurs before the first delimiter or after the last delimiter).
But for your need, the given function does the word splitting in the right way according to the delimiter you provide.
I hope this answers your question!
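Note that for the example input, the function above also returns empty strings for the text before the first delimiter and after the last one. If those are unwanted, a variant along the following lines might do. The name splitSkipEmpty is hypothetical, and the sketch searches with a start position instead of erasing from the front of the string.

```cpp
#include <string>
#include <vector>

// Splits s on delimiter and keeps only the non-empty pieces.
std::vector<std::string> splitSkipEmpty(const std::string& s, const std::string& delimiter)
{
    std::vector<std::string> parts;
    if (delimiter.empty())
    {
        // Nothing to split on: return the whole string if there is one.
        if (!s.empty())
            parts.push_back(s);
        return parts;
    }

    std::string::size_type start = 0;
    while (start <= s.size())
    {
        std::string::size_type end = s.find(delimiter, start);
        if (end == std::string::npos)
            end = s.size();              // last piece runs to the end of s
        if (end > start)
            parts.push_back(s.substr(start, end - start));
        start = end + delimiter.size();  // continue after the delimiter
    }
    return parts;
}
```

Called with "~!hello~! random junk... ~!world~!" and "~!", this returns "hello", " random junk... " and "world", with the surrounding spaces of the middle piece preserved.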

Separating alphabetic characters in C++ STL

I've been practicing C++ for a competition next week. The sample problem I've been working on requires splitting paragraphs into words. Of course, that's easy. But this problem is so weird that words like isn't should be separated as well: isn and t. I know it's weird, but I have to follow this.
I have a function split() that takes a constant char delimiter as one of its parameters. It's what I use to separate words at spaces. But I can't figure out this one. Even words with numbers, like phil67bs, should be separated into phil and bs.
And no, I'm not asking for full code. Pseudocode will do, or something that will help me understand what to do. Thanks!
PS: Please no recommendations for external libs. Just the STL. :)
Filter out numbers, spaces and anything else that isn't a letter by using a proper locale. See this SO thread about treating everything but numbers as a whitespace. So use a mask and do something similar to what Jerry Coffin suggests but only for letters:
struct alphabet_only: std::ctype<char>
{
alphabet_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
std::fill(&rc['A'], &rc['['], std::ctype_base::upper);
std::fill(&rc['a'], &rc['{'], std::ctype_base::lower);
return &rc[0];
}
};
And, boom! You're golden.
Or... you could just do a transform:
char changeToLetters(char input){ return isalpha(static_cast<unsigned char>(input)) ? input : ' '; }
vector<char> output;
output.reserve( myVector.size() );
// back_inserter appends to output; the plain function can be passed directly, so ptr_fun is not needed
transform( myVector.begin(), myVector.end(), back_inserter(output), changeToLetters );
Which, um, is much easier to grok, just not as efficient as Jerry's idea.
Edit:
Changed 'Z' to '[' so that the value 'Z' is filled. Likewise with 'z' to '{'.
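To show how the alphabet_only facet above might be put to use, here is a small sketch. The struct is repeated so the example is self-contained, and extractAlphaWords is a hypothetical helper name. It assumes the usual behavior that operator>> on a stream skips whatever the stream's locale classifies as whitespace.

```cpp
#include <algorithm>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Same facet as above: everything except A-Z and a-z is classified as space.
struct alphabet_only : std::ctype<char>
{
    alphabet_only() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&rc['A'], &rc['['], std::ctype_base::upper);
        std::fill(&rc['a'], &rc['{'], std::ctype_base::lower);
        return &rc[0];
    }
};

// Reads "words" from text using a stream whose locale carries the facet.
std::vector<std::string> extractAlphaWords(const std::string& text)
{
    std::istringstream in(text);
    // The locale takes ownership of the facet allocated with new.
    in.imbue(std::locale(std::locale(), new alphabet_only));
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

With this, extractAlphaWords("isn't phil67bs") should produce isn, t, phil and bs, because the apostrophe and the digits are skipped like spaces.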
This sounds like a perfect job for the find_first_of function which finds the first occurrence of a set of characters. You can use this to look for arbitrary stop characters and generate words from the spaces between such stop characters.
Roughly:
size_t previous = 0;
for (; ;) {
size_t next = str.find_first_of(" '1234567890", previous);
// Do processing
if (next == string::npos)
break;
previous = next + 1;
};
Just change your function to delimit on anything that isn't an alphabetic character. Is there anything in particular that you are having trouble with?
Break down the problem: First, write a function that gets the first "word" from the sentence. This is easy; just look for the first non-alphabetic character. The next step is to remove all leading non-alphabetic characters from the remaining string. From there, just repeat.
You can do something like this:
vector<string> split(const string& str)
{
vector<string> splits;
string cur;
for(int i = 0; i < str.size(); ++i)
{
if(str[i] >= '0' && str[i] <= '9')
{
if(!cur.empty())
{
splits.push_back(cur);
}
cur="";
}
else
{
cur += str[i];
}
}
if(! cur.empty())
{
splits.push_back(cur);
}
return splits;
}
Let's assume that the input is in a std::string (use std::getline(cin, line), for example, to read a full line from cin):
std::vector<std::string> split(std::string const& input)
{
std::string::const_iterator it(input.begin()), end(input.end());
std::string current;
std::vector<std::string> words;
for(; it != end; ++it)
{
if (isalpha(static_cast<unsigned char>(*it)))
{
current.push_back(*it); // add this char to the current word
}
else if (!current.empty())
{
// push the current word in to the result list
words.push_back(current);
current.clear(); // next word
}
}
// push the last word if the input ends with a letter
if (!current.empty())
words.push_back(current);
return words;
}
I've not tested it, but I guess it ought to work...

String reversal in C++

I am trying to reverse the order of words in a sentence while maintaining the spaces, as below.
[this is my test string] ==> [string test my is this]
I did it step by step, as follows:
[this is my test string] - input string
[gnirts tset ym si siht] - reverse the whole string - in-place
[string test my is this] - reverse the words of the string - in-place
[string test my is this] - string-2 with spaces rearranged
Is there any other method to do this? Is it also possible to do the last step in-place?
Your approach is fine. But alternatively you can also do:
Keep scanning the input for words and spaces
If you find a word push it onto stack S
If you find space(s) enqueue the number of spaces into a queue Q
After this is done there will be N words on the stack and N-1 numbers in the queue.
While stack not empty do
print S.pop
if stack is empty break
print Q.dequeue number of spaces
end-while
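The pseudocode above could be sketched in C++ roughly as follows. The function name reverseWordsKeepSpaces is made up for illustration, and the sketch assumes the input has no leading or trailing spaces, so that there are N words and N-1 runs of spaces.

```cpp
#include <cstddef>
#include <queue>
#include <stack>
#include <string>

// Reverses the order of the words while keeping the runs of spaces
// in their original left-to-right order.
std::string reverseWordsKeepSpaces(const std::string& input)
{
    std::stack<std::string> words;  // S: words, last one on top
    std::queue<std::size_t> gaps;   // Q: lengths of the space runs, in order

    std::size_t i = 0;
    while (i < input.size())
    {
        std::size_t j = i;
        if (input[i] == ' ')
        {
            while (j < input.size() && input[j] == ' ') ++j;
            gaps.push(j - i);
        }
        else
        {
            while (j < input.size() && input[j] != ' ') ++j;
            words.push(input.substr(i, j - i));
        }
        i = j;
    }

    std::string out;
    while (!words.empty())
    {
        out += words.top();
        words.pop();
        if (words.empty())
            break;
        out.append(gaps.front(), ' ');  // emit the next run of spaces
        gaps.pop();
    }
    return out;
}
```

For "this  is my   test string" (two spaces after "this", three after "my") the result is "string  test my   is this".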
Here's an approach.
In short, build two lists of tokens you find: one for words, and another for spaces. Then piece together a new string, with the words in reverse order and the spaces in forward order.
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <sstream>
using namespace std;
string test_string = "this is my test string";
int main()
{
// Create 2 vectors of strings. One for words, another for spaces.
typedef vector<string> strings;
strings words, spaces;
// Walk through the input string, and find individual tokens.
// A token is either a word or a contiguous string of spaces.
for( string::size_type pos = 0; pos != string::npos; )
{
// is this a word token or a space token?
bool is_char = test_string[pos] != ' ';
string::size_type pos_end_token = string::npos;
// find the one-past-the-end index for the end of this token
if( is_char )
pos_end_token = test_string.find(' ', pos);
else
pos_end_token = test_string.find_first_not_of(' ', pos);
// pull out this token
string token = test_string.substr(pos, pos_end_token == string::npos ? string::npos : pos_end_token-pos);
// if the token is a word, save it to the list of words.
// if it's a space, save it to the list of spaces
if( is_char )
words.push_back(token);
else
spaces.push_back(token);
// move on to the next token
pos = pos_end_token;
}
// construct the new string using stringstream
stringstream ss;
// walk through both the list of spaces and the list of words,
// keeping in mind that there may be more words than spaces, or vice versa
// construct the new string by first copying the word, then the spaces
strings::const_reverse_iterator it_w = words.rbegin();
strings::const_iterator it_s = spaces.begin();
while( it_w != words.rend() || it_s != spaces.end() )
{
if( it_w != words.rend() )
ss << *it_w++;
if( it_s != spaces.end() )
ss << *it_s++;
}
// pull a `string` out of the results & dump it
string reversed = ss.str();
cout << "Input: '" << test_string << "'" << endl << "Output: '" << reversed << "'" << endl;
}
I would rephrase the problem this way:
Non-space tokens are reversed in order, but each token keeps its original characters
The 5 non-space tokens 'this', 'is', 'my', 'test', 'string' get reversed to 'string', 'test', 'my', 'is', 'this'.
Space tokens remain in the original order
The space tokens ' ', ' ', ' ', ' ' remain in their original order between the reordered non-space tokens.
The following is an O(N) solution [N being the length of the char array]. Unfortunately, it is not in place as the OP wanted, but it does not use an additional stack or queue either: it uses a separate character array as a working space.
Here is a C-ish pseudo code.
work_array = char array with size of input_array
dst = &work_array[ 0 ]
for( i = 1; ; i++) {
detect i’th non-space token in input_array starting from the back side
if no such token {
break;
}
copy the token starting at dst
advance dst by token_size
detect i’th space-token in input_array starting from the front side
copy the token starting at dst
advance dst by token_size
}
// at this point work_array contains the desired output,
// it can be copied back to input_array and destroyed
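A possible C++ rendering of this pseudocode is sketched below. The name reverseWordsWorkBuffer is hypothetical, a std::string stands in for the working char array, and the input is assumed to have no leading or trailing spaces. One index walks backwards to find words, the other walks forwards to find runs of spaces.

```cpp
#include <cstddef>
#include <string>

// Builds the result in a separate buffer: the i-th word from the back,
// followed by the i-th run of spaces from the front.
std::string reverseWordsWorkBuffer(const std::string& in)
{
    std::string work;
    work.reserve(in.size());

    std::size_t back = in.size();  // one past the unread part, moving left
    std::size_t front = 0;         // start of the unread part, moving right

    for (;;)
    {
        // Find the next word, scanning from the back.
        while (back > 0 && in[back - 1] == ' ') --back;
        if (back == 0)
            break;                 // no more words
        std::size_t wordEnd = back;
        while (back > 0 && in[back - 1] != ' ') --back;
        work.append(in, back, wordEnd - back);

        // Find the next run of spaces, scanning from the front.
        while (front < in.size() && in[front] != ' ') ++front;
        std::size_t spaceStart = front;
        while (front < in.size() && in[front] == ' ') ++front;
        work.append(in, spaceStart, front - spaceStart);
    }
    return work;
}
```

Each character of the input is visited a constant number of times, which matches the O(N) claim above. The buffer can be assigned back to the input string afterwards if the caller wants the result in the original variable.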
For each word from the first to the middle one, swap word n with word (count - n).
First use a split function, then do the swapping.
This pseudocode assumes you don't end the initial string with a blank space, though can be suitably modified for that too.
1. Get string length; allocate equivalent space for final string; set getText=1
2. While pointer doesn't reach position 0 of string,
i.start from end of string, read character by character...
a.if getText=1
...until blank space encountered
b.if getText=0
...until not blank space encountered
ii.back up pointer to previously pointed character
iii.output to final string in reverse
iv.toggle getText
3. Stop
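None of the strtok-based solutions work for your example; see above.
Try this:
#include <stdlib.h>
#include <string.h>
char *wordrev(char *s)
{
char *y=calloc(1,strlen(s)+1);
char *p=s+strlen(s);
while( p!=s )
if( *--p==' ' )
strcat(y,p+1),strcat(y," "),*p=0;
strcat(y,s); /* append the first word, which the loop never copies */
strcpy(s,y);
free(y);
return s;
}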
Too bad stl string doesn't implement push_front. Then you could do this with transform().
#include <string>
#include <iostream>
#include <algorithm>
class push_front
{
public:
push_front( std::string& s ) : _s(s) {};
bool operator()(char c) { _s.insert( _s.begin(), c ); return true; };
std::string& _s;
};
int main( int argc, char** argv )
{
std::string s1;
std::string s( "Now is the time for all good men");
for_each( s.begin(), s.end(), push_front(s1) );
std::cout << s << "\n";
std::cout << s1 << "\n";
}
Now is the time for all good men
nem doog lla rof emit eht si woN
Copy each word into an array and print the array in reverse order (i--):
int main()
{
int j=0;
string str;
string copy[80];
int start=0;
int end=0;
cout<<"Enter the String :: ";
getline(cin,str);
cout<<"Entered String is : "<<str<<endl;
for(int i=0;str[i]!='\0';i++)
{
end=str.find(" ",start);
if(end==-1)
{
copy[j]=str.substr(start,(str.length()-start));
break;
}
else
{
copy[j]=str.substr(start,(end-start));
start=end+1;
j++;
i=end;
}
}
for(int s1=j;s1>=0;s1--)
cout<<" "<<copy[s1];
}
I think I'd just tokenize (strtok or CString::Tokenize) the string using the space character. Shove the strings into a vector, then pull them back out in reverse order and concatenate them with a space in between.
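A minimal sketch of that idea follows, with std::istringstream swapped in for strtok so it stays within the standard C++ library. The name reverseWordsSingleSpaced is hypothetical. Note that this version joins the words with a single space, so it does not preserve the original spacing the question asked about.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Tokenizes on whitespace, then joins the words in reverse order
// with exactly one space between them.
std::string reverseWordsSingleSpaced(const std::string& sentence)
{
    std::istringstream in(sentence);
    std::vector<std::string> words;
    for (std::string w; in >> w;)
        words.push_back(w);

    std::string out;
    for (auto it = words.rbegin(); it != words.rend(); ++it)
    {
        if (!out.empty())
            out += ' ';
        out += *it;
    }
    return out;
}
```

For "this is my test string" it returns "string test my is this".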