Which data structure and algorithm is appropriate for this? - c++

I have 1000's of string. Given a pattern that need to be searched in all the string, and return all the string which contains that pattern.
Presently i am using vector for to store the original strings. searching for a pattern and if matches add it into new vector and finally return the vector.
int main() {
vector <string> v;
v.push_back ("maggi");
v.push_back ("Active Baby Pants Large 9-14 Kg ");
v.push_back ("Premium Kachi Ghani Pure Mustard Oil ");
v.push_back ("maggi soup");
v.push_back ("maggi sauce");
v.push_back ("Superlite Advanced Jar");
v.push_back ("Superlite Advanced");
v.push_back ("Goldlite Advanced");
v.push_back ("Active Losorb Oil Jar");
vector <string> result;
string str = "Advanced";
for (unsigned i=0; i<v.size(); ++i)
{
size_t found = v[i].find(str);
if (found!=string::npos)
result.push_back(v[i]);
}
for (unsigned j=0; j<result.size(); ++j)
{
cout << result[j] << endl;
}
// your code goes here
return 0;
}
Is there any optimum way to achieve the same with lesser complexity and higher performance ??

The containers I think are appropriate for your application.
However instead of std::string::find, if you implement your own KMP algorithm, then you can guarantee the time complexity to be linear in terms of the length of string + search string.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
As such the complexity of std::string::find is unspecified.
http://www.cplusplus.com/reference/string/string/find/
EDIT: As pointed out by this link, if the length of your strings is not large (more than 1000), then probably using std::string::find would be good enough since here tabulation etc is not needed.
C++ string::find complexity

If the result is used in the same block of code as the input string vector (it is so in your example) or even if you have a guarantee that everyone uses the result only while input exists, you don't need actually to copy strings. It could be an expensive operation, which considerably slows total algorithm.
Instead you could have a vector of pointers as the result:
vector <string*> result;

If the list of strings is "fixed" for many searches then you can do some simple preprocessing to speed up things quite considerably by using an inverted index.
Build a map of all chars present in the strings, in other words for each possible char store a list of all strings containing that char:
std::map< char, std::vector<int> > index;
std::vector<std::string> strings;
void add_string(const std::string& s) {
int new_pos = strings.size();
strings.push_back(s);
for (int i=0,n=s.size(); i<n; i++) {
index[s[i]].push_back(new_pos);
}
}
Then when asked to search for a substring you first check for all chars in the inverted index and iterate only on the list in the index with the smallest number of entries:
std::vector<std::string *> matching(const std::string& text) {
std::vector<int> *best_ix = NULL;
for (int i=0,n=text.size(); i<n; i++) {
std::vector<int> *ix = &index[text[i]];
if (best_ix == NULL || best_ix->size() > ix->size()) {
best_ix = ix;
}
}
std::vector<std::string *> result;
if (best_ix) {
for (int i=0,n=best_ix->size(); i<n; i++) {
std::string& cand = strings[(*best_ix)[i]];
if (cand.find(text) != std::string::npos) {
result.push_back(&cand);
}
}
} else {
// Empty text as input, just return the whole list
for (int i=0,n=strings.size(); i<n; i++) {
result.push_back(&strings[i]);
}
}
return result;
}
Many improvements are possible:
use a bigger index (e.g. using pairs of consecutive chars)
avoid considering very common chars (stop lists)
use hashes computed from triplets or longer sequences
search the intersection instead of searching the shorter list. Given the elements are added in order the vectors are anyway already sorted and intersection could be computed efficently even using vectors (see std::set_intersection).
All of them may make sense or not depending on the parameters of the problem (how many strings, how long, how long is the text being searched ...).

If the source text is large and static (e.g. crawled webpages), then you can save search time by pre-building a suffix tree or a trie data structure. The search pattern can than traverse the tree to find matches.
If the source text is small and changes frequently, then your original approach is appropriate. The STL functions are generally very well optimized and have stood the test of time.

Related

C++ - checking a string for all values in an array

I have some parsed text from the Vision API, and I'm filtering it using keywords, like so:
if (finalTextRaw.find("File") != finalTextRaw.npos)
{
LogMsg("Found Menubar");
}
E.g., if the keyword "File" is found anywhere within the string finalTextRaw, then the function is interrupted and a log message is printed.
This method is very reliable. But I've inefficiently just made a bunch of if-else-if statements in this fashion, and as I'm finding more words that need filtering, I'd rather be a little more efficient. Instead, I'm now getting a string from a config file, and then parsing that string into an array:
string filterWords = GetApp()->GetFilter();
std::replace(filterWords.begin(), filterWords.end(), ',', ' '); ///replace ',' with ' '
vector<int> array;
stringstream ss(filterWords);
int temp;
while (ss >> temp)
array.push_back(temp); ///create an array of filtered words
And I'd like to have just one if statement for checking that string against the array, instead of many of them for checking the string against each keyword I'm having to manually specify in the code. Something like this:
if (finalTextRaw.find(array) != finalTextRaw.npos)
{
LogMsg("Found filtered word");
}
Of course, that syntax doesn't work, and it's surely more complicated than that, but hopefully you get the idea: if any words from my array appear anywhere in that string, that string should be ignored and a log message printed instead.
Any ideas how I might fashion such a function? I'm guessing it's going to necessitate some kind of loop.
Borrowing from Thomas's answer, a ranged for loop offers a neat solution:
for (const auto &word : words)
{
if (finalTextRaw.find(word) != std::string::npos)
{
// word is found.
// do stuff here or call a function.
break; // stop the loop.
}
}
As pointed out by Thomas, the most efficient way is to split both texts into a list of words. Then use std::set_intersection to find occurrences in both lists. You can use std::vector as long it is sorted. You end up with O(n*log(n)) (with n = max words), rather than O(n*m).
Split sentences to words:
auto split(std::string_view sentence) {
auto result = std::vector<std::string>{};
auto stream = std::istringstream{sentence.data()};
std::copy(std::istream_iterator<std::string>(stream),
std::istream_iterator<std::string>(), std::back_inserter(result));
return result;
}
Find words existing in both lists. This only works for sorted lists (like sets or manually sorted vectors).
auto intersect(std::vector<std::string> a, std::vector<std::string> b) {
std::sort(a.begin(), a.end());
std::sort(b.begin(), b.end());
auto result = std::vector<std::string>{};
std::set_intersection(std::move_iterator{a.begin()},
std::move_iterator{a.end()},
b.cbegin(), b.cend(),
std::back_inserter(result));
return result;
}
Example of how to use.
int main() {
const auto result = intersect(split("hello my name is mister raw"),
split("this is the final raw text"));
for (const auto& word: result) {
// do something with word
}
}
Note that this makes sense when working with large or undefined number of words. If you know the limits, you might want to use easier solutions (provided by other answers).
You could use a fundamental, brute force, loop:
unsigned int quantity_words = array.size();
for (unsigned int i = 0; i < quantity_words; ++i)
{
std::string word = array[i];
if (finalTextRaw.find(word) != std::string::npos)
{
// word is found.
// do stuff here or call a function.
break; // stop the loop.
}
}
The above loop takes each word in the array and searches the finalTextRaw for the word.
There are better methods using some std algorithms. I'll leave that for other answers.
Edit 1: maps and association
The above code is bothering me because there are too many passes through the finalTextRaw string.
Here's another idea:
Create a std::set using the words in finalTextRaw.
For each word in your array, check for existence in the set.
This reduces the quantity of searches (it's like searching a tree).
You should also investigate creating a set of the words in array and finding the intersection between the two sets.

Sorting referenced substrings using C++'s sort?

I have two long strings (a million or so characters) that I want to generate suffixes from and sort in order to find the longest shared substrings as this will be much faster than brute-forcing all possible substrings. I'm most familiar with Python, but my quick calculation estimated 40 Tb of suffixes so I'm hoping it's possible to use C++ (suggested) to sort references to the substrings in each main, unchanging string.
I'll need to retain the index of each substring to find the value as well as the origin string later, so any advice on the type of data structure I could use that would 1) allow sorting of reference strings and 2) keep track of the original index would be super helpful!
Current pseudocode:
//Function to make vector? of structures that contain the reference to the string and the original index
int main() {
//Declare strings
string str1="This is a very long string with some repeats of strings."
string str2="This is another string with some repeats that is very long."
//Call function to make array
//Pass vector to sort(v.begin(), v.end), somehow telling it to deference?
//Process the output in multilayer loop to find the longest exact match
// "string with some repeats"
return 0;}
First of all, you should use a suffix tree for this. But I'll answer your original question.
C++17 :
NOTE: uses experimental features
You may use std::string_view to reference the strings without copying. Here is an example code:
//Declare string
char* str1 = "This is a very long string with some repeats of strings."
int main() {
//Call function to make array
vector<string_view> substrings;
//example of adding substring [5,19) into vector
substrings.push_back(string_view(str1 + 5, 19 - 5));
//Pass vector to sort(v.begin(), v.end)
sort(substrings.begin(), substrings.end());
return 0;
}
Everything before C++17:
You could use a custom predicate with the sort function. Instead of making your vector store the actual strings, make it store pair which contains the index.
Here is an example of code needed to make it work:
//Declare string
string str1="This is a very long string with some repeats of strings."
bool pred(pair<int,int> a, pair<int,int> b){
int substring1start=a.first,
substring1end=a.second;
int substring2start=b.first,
substring2end=b.second;
//use a for loop to manually compare substring1 and substring 2
...
//return true if substring1 should go before substring2 in vector
//otherwise return false
}
int main() {
//Call function to make array
vector<pair<int,int>> substrings;
//example of adding substring [1,19) into vector
substrings.push_back({1,19});
//Pass vector to sort(v.begin(), v.end), passing custom predicate
sort(substrings.begin(), substrings.end(), pred);
return 0;
}
Even if you reduce your memory usage, your program will still take 40T iterations to run anyways (since you need to compare the strings). Unless you use some sort of hashing string comparison algorithm.
You could use a combination of std::string_view, std::hash and std::set.
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>
std::string str1="This is a very long string with some repeats of strings.";
std::string str2="This is another string with some repeats that is very long.";
std::set<std::size_t> substringhashes;
std::vector<std::string_view> matches;
bool makeSubHashes(std::string& str, std::size_t lenght) {
for (int pos=0; pos+lenght <= str.size(); ++pos) {
std::string_view sv(str.data()+pos, lenght);
auto hash = std::hash<std::string_view>()(sv);
if (!substringhashes.insert(hash).second) {
matches.push_back(sv);
if (matches.size() > 99) // Optional break after finding the 100 longest matches
return true;
}
}
return false;
}
int main() {
for (int lenght=std::min(str1.size(), str2.size()); lenght>0; --lenght) {
if (makeSubHashes(str1, lenght) || makeSubHashes(str2, lenght))
break;
}
for (auto& sv : matches) {
std::cout << sv << std::endl;
}
return 0;
}
If the amount of suffixes are extremely high, there is a chance for false positives with the std::set. It has std::size_ts max value number of different hashes, which is normally a uint64.
It also starts searching for matches at the maximum lenght of the strings, maybe a more reasonable approach is to set some sort of maximum lenght for the suffixes.
std::sort sorts data in main memory.
If you can fit the data in main memory, then you can sort it with std::sort.
Otherwise not.

how to find set of distinct strings from a given string after cyclic shifts?

I am solving a [QUESTION][1] in Codeforces where the problem statement asks me to find the set of all distinct strings from a given string after cyclic shifts.
like for example :
Given string :"abcd"
the output should be 4 ("dabc","cdab", "bcda", "abcd")[note:"abcd" is also counted]
So
t=s[l-1];
for(i=l-1;i>0;i--)
{
s[i]=s[i-1];
}
s[0]=t;
I applied above method for length - 1 times for all possible strings but I am unable to find the distinct ones,
is there any STL function to do this?
You may use the following:
std::set<std::string>
retrieve_unique_rotations(std::string s)
{
std::set<std::string> res;
res.insert(s);
if (s.empty()) {
return res;
}
for (std::size_t i = 0, size = s.size() - 1; i != size; ++i) {
std::rotate(s.begin(), s.begin() + 1, s.end());
res.insert(s);
}
return res;
}
Demo
Not sure about STL specific functions, however a general solution could be to have all shifted strings in a list. Then you sort the list and then you iterate over the list elements. When the current element is different to the last, increment the counter.
There is probably a solution that is less memory intensive. For short strings this solution should be sufficient.
You can use vector for making a list after rotating by using vector.push_back("string"). Before each push, You can check if it already exists by using something like:
if (std::find(vector.begin(), vector.end(), "string") != v.end())
{
increment++;
vector.push_back("string");
}
Or else you can count the elements in the end by vector.size(); and remove increment++.
Hope this helps

Faster way to sort an array of structs c++

I have a struct called count declared called count with two things in it, an int called frequency and a string called word. To simplify, my program takes in a book as a text file and I count how many times each word appears. I have an array of structs and I have my program to where it will count each time a word appears and now I want a faster way to sort the array by top frequency than the way I have below. I used the bubble sorting algorithm below but it is taking my code way too long to run using this method. Any other suggestions or help would be welcome!! I have looked up sort from the algorithm library but don't understand how I would use it here. I am new to c++ so lots of explanation on how to use sort would help a lot.
void sortArray(struct count array[],int size)
{
int cur_pos = 0;
string the_word;
bool flag= true;
for(int i=0; i<(size); i++)
{
flag = false;
for(int j=0; j< (size); j++)
{
if((array[j+1].frequency)>(array[j].frequency))
{
cur_pos = array[j].frequency;
the_word = array[j].word;
array[j].frequency = array[j+1].frequency;
array[j].word = array[j+1].word;
array[j+1].frequency = cur_pos;
array[j+1].word = the_word;
flag = true;
}
}
}
};
You just need to define operator less for your structures,
and use std::sort, see example:
http://en.wikipedia.org/wiki/Sort_%28C%2B%2B%29
After you created a pair of for the data set, you can use std::map as container and insert the pairs into it. If you want to sort according to frequency define std:map as follows
std::map myMap;
myMap.insert(std::make_pair(frequency,word));
std::map is internally using a binary tree so you will get a sorted data when you retrieve it.

Removing specified characters from a string - Efficient methods (time and space complexity)

Here is the problem: Remove specified characters from a given string.
Input: The string is "Hello World!" and characters to be deleted are "lor"
Output: "He Wd!"
Solving this involves two sub-parts:
Determining if the given character is to be deleted
If so, then deleting the character
To solve the first part, I am reading the characters to be deleted into a std::unordered_map, i.e. I parse the string "lor" and insert each character into the hashmap. Later, when I am parsing the main string, I will look into this hashmap with each character as the key and if the returned value is non-zero, then I delete the character from the string.
Question 1: Is this the best approach?
Question 2: Which would be better for this problem? std::map or std::unordered_map? Since I am not interested in ordering, I used an unordered_map. But is there a higher overhead for creating the hash table? What to do in such situations? Use a map (balanced tree) or a unordered_map (hash table)?
Now coming to the next part, i.e. deleting the characters from the string. One approach is to delete the character and shift the data from that point on, back by one position. In the worst case, where we have to delete all the characters, this would take O(n^2).
The second approach would be to copy only the required characters to another buffer. This would involve allocating enough memory to hold the original string and copy over character by character leaving out the ones that are to be deleted. Although this requires additional memory, this would be a O(n) operation.
The third approach, would be to start reading and writing from the 0th position, increment the source pointer when every time I read and increment the destination pointer only when I write. Since source pointer will always be same or ahead of destination pointer, I can write over the same buffer. This saves memory and is also an O(n) operation. I am doing the same and calling resize in the end to remove the additional unnecessary characters?
Here is the function I have written:
// str contains the string (Hello World!)
// chars contains the characters to be deleted (lor)
void remove_chars(string& str, const string& chars)
{
unordered_map<char, int> chars_map;
for(string::size_type i = 0; i < chars.size(); ++i)
chars_map[chars[i]] = 1;
string::size_type i = 0; // source
string::size_type j = 0; // destination
while(i < str.size())
{
if(chars_map[str[i]] != 0)
++i;
else
{
str[j] = str[i];
++i;
++j;
}
}
str.resize(j);
}
Question 3: What are the different ways by which I can improve this function. Or is this best we can do?
Thanks!
Good job, now learn about the standard library algorithms and boost:
str.erase(std::remove_if(str.begin(), str.end(), boost::is_any_of("lor")), str.end());
Assuming that you're studying algorithms, and not interested in library solutions:
Hash tables are most valuable when the number of possible keys is large, but you only need to store a few of them. Your hash table would make sense if you were deleting specific 32-bit integers from digit sequences. But with ASCII characters, it's overkill.
Just make an array of 256 bools and set a flag for the characters you want to delete. It only uses one table lookup instruction per input character. Hash map involves at least a few more instructions to compute the hash function. Space-wise, they are probably no more compact once you add up all the auxiliary data.
void remove_chars(string& str, const string& chars)
{
// set up the look-up table
std::vector<bool> discard(256, false);
for (int i = 0; i < chars.size(); ++i)
{
discard[chars[i]] = true;
}
for (int j = 0; j < str.size(); ++j)
{
if (discard[str[j]])
{
// do something, depending on your storage choice
}
}
}
Regarding your storage choices: Choose between options 2 and 3 depending on whether you need to preserve the input data or not. 3 is obviously most efficient, but you don't always want an in-place procedure.
Here is a KISS solution with many advantages:
void remove_chars (char *dest, const char *src, const char *excludes)
{
do {
if (!strchr (excludes, *src))
*dest++ = *src;
} while (*src++);
*dest = '\000';
}
You can ping pong between strcspn and strspn to avoid the need for a hash table:
void remove_chars(
const char *input,
char *output,
const char *characters)
{
const char *next_input= input;
char *next_output= output;
while (*next_input!='\0')
{
int copy_length= strspn(next_input, characters);
memcpy(next_output, next_input, copy_length);
next_output+= copy_length;
next_input+= copy_length;
next_input+= strcspn(next_input, characters);
}
}