Sorting referenced substrings using C++'s sort?

I have two long strings (a million or so characters each) that I want to generate suffixes from and sort, in order to find the longest shared substrings; this will be much faster than brute-forcing all possible substrings. I'm most familiar with Python, but my quick calculation estimated 40 Tb of suffixes, so I'm hoping it's possible to use C++ (as suggested) to sort references to the substrings in each main, unchanging string.
I'll need to retain the index of each substring to recover its value as well as its origin string later, so any advice on the type of data structure I could use that would 1) allow sorting of referenced strings and 2) keep track of the original index would be super helpful!
Current pseudocode:
//Function to make a vector? of structures that contain the reference to the string and the original index
int main() {
    //Declare strings
    string str1 = "This is a very long string with some repeats of strings.";
    string str2 = "This is another string with some repeats that is very long.";
    //Call function to make the vector
    //Pass vector to sort(v.begin(), v.end()), somehow telling it to dereference?
    //Process the output in a multilayer loop to find the longest exact match
    // "string with some repeats"
    return 0;
}

First of all, you should use a suffix tree for this. But I'll answer your original question.
C++17:
NOTE: before C++17 this was only available as an experimental feature (std::experimental::string_view).
You may use std::string_view to reference the strings without copying them. Here is some example code:
#include <algorithm>
#include <string_view>
#include <vector>
using namespace std;

//Declare string
const char* str1 = "This is a very long string with some repeats of strings.";

int main() {
    //Call function to make array
    vector<string_view> substrings;
    //example of adding substring [5,19) into vector
    substrings.push_back(string_view(str1 + 5, 19 - 5));
    //Pass vector to sort(v.begin(), v.end())
    sort(substrings.begin(), substrings.end());
    return 0;
}
Everything before C++17:
You could use a custom predicate with the sort function. Instead of making your vector store the actual strings, make it store a pair of indices marking where each substring starts and ends.
Here is an example of code needed to make it work:
//Declare string
string str1 = "This is a very long string with some repeats of strings.";

bool pred(pair<int,int> a, pair<int,int> b){
    int substring1start = a.first,
        substring1end   = a.second;
    int substring2start = b.first,
        substring2end   = b.second;
    //use a for loop to manually compare substring1 and substring2
    ...
    //return true if substring1 should go before substring2 in the vector
    //otherwise return false
}
int main() {
    //Call function to make array
    vector<pair<int,int>> substrings;
    //example of adding substring [1,19) into vector
    substrings.push_back({1,19});
    //Pass vector to sort(v.begin(), v.end()), passing the custom predicate
    sort(substrings.begin(), substrings.end(), pred);
    return 0;
}
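For completeness, the elided comparison might look something like the sketch below: it lexicographically compares the two index ranges [a.first, a.second) and [b.first, b.second) of the global str1, character by character, without copying anything (this is my own illustration, not part of the original answer):
#include <string>
#include <utility>

extern std::string str1; // the long string declared in the snippet above

bool pred(std::pair<int,int> a, std::pair<int,int> b) {
    int i = a.first, j = b.first;
    // Walk both ranges in lockstep until they differ or one runs out.
    while (i < a.second && j < b.second) {
        if (str1[i] != str1[j])
            return str1[i] < str1[j];
        ++i;
        ++j;
    }
    // One range is a prefix of the other: the shorter one sorts first.
    return (a.second - a.first) < (b.second - b.first);
}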
Even if you reduce the memory usage, your program will still take on the order of 40T character comparisons to run anyway (since the sort has to compare the strings), unless you use some sort of hashing-based string comparison.
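Separately, to keep track of which original string and offset each suffix came from (the second part of the question), you can sort a small struct that carries the std::string_view together with its origin. A minimal sketch, with illustrative names not taken from the answer:
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

struct SuffixRef {
    std::string_view view; // the suffix itself, no copy
    int which;             // 0 = str1, 1 = str2
    std::size_t pos;       // starting offset in the original string
};

int main() {
    std::string str1 = "This is a very long string with some repeats of strings.";
    std::string str2 = "This is another string with some repeats that is very long.";

    std::vector<SuffixRef> suffixes;
    for (std::size_t i = 0; i < str1.size(); ++i)
        suffixes.push_back({std::string_view(str1).substr(i), 0, i});
    for (std::size_t i = 0; i < str2.size(); ++i)
        suffixes.push_back({std::string_view(str2).substr(i), 1, i});

    // Sort by the referenced characters; the bookkeeping fields tag along.
    std::sort(suffixes.begin(), suffixes.end(),
              [](const SuffixRef& a, const SuffixRef& b) { return a.view < b.view; });

    // Adjacent entries that come from different origin strings now share the
    // longest prefixes, so scanning neighbours finds the longest common substring.
    return 0;
}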

You could use a combination of std::string_view, std::hash and std::set.
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>

std::string str1 = "This is a very long string with some repeats of strings.";
std::string str2 = "This is another string with some repeats that is very long.";
std::set<std::size_t> substringhashes;
std::vector<std::string_view> matches;

bool makeSubHashes(std::string& str, std::size_t length) {
    for (std::size_t pos = 0; pos + length <= str.size(); ++pos) {
        std::string_view sv(str.data() + pos, length);
        auto hash = std::hash<std::string_view>()(sv);
        if (!substringhashes.insert(hash).second) {
            matches.push_back(sv);
            if (matches.size() > 99) // Optional break after finding the 100 longest matches
                return true;
        }
    }
    return false;
}

int main() {
    for (std::size_t length = std::min(str1.size(), str2.size()); length > 0; --length) {
        if (makeSubHashes(str1, length) || makeSubHashes(str2, length))
            break;
    }
    for (auto& sv : matches) {
        std::cout << sv << std::endl;
    }
    return 0;
}
If the number of substrings is extremely high, there is a chance of false positives with the std::set: it can only distinguish as many different hashes as std::size_t has values, which is normally 64 bits.
It also starts searching for matches at the maximum length of the strings; a more reasonable approach may be to set some maximum length for the substrings.

std::sort sorts data in main memory.
If you can fit the data in main memory, then you can sort it with std::sort.
Otherwise not.

C++ - checking a string for all values in an array

I have some parsed text from the Vision API, and I'm filtering it using keywords, like so:
if (finalTextRaw.find("File") != finalTextRaw.npos)
{
    LogMsg("Found Menubar");
}
E.g., if the keyword "File" is found anywhere within the string finalTextRaw, then the function is interrupted and a log message is printed.
This method is very reliable. But I've inefficiently just made a bunch of if-else-if statements in this fashion, and as I'm finding more words that need filtering, I'd rather be a little more efficient. Instead, I'm now getting a string from a config file, and then parsing that string into an array:
string filterWords = GetApp()->GetFilter();
std::replace(filterWords.begin(), filterWords.end(), ',', ' '); ///replace ',' with ' '
vector<string> array;
stringstream ss(filterWords);
string temp;
while (ss >> temp)
    array.push_back(temp); ///create an array of filtered words
And I'd like to have just one if statement for checking that string against the array, instead of many of them for checking the string against each keyword I'm having to manually specify in the code. Something like this:
if (finalTextRaw.find(array) != finalTextRaw.npos)
{
    LogMsg("Found filtered word");
}
Of course, that syntax doesn't work, and it's surely more complicated than that, but hopefully you get the idea: if any words from my array appear anywhere in that string, that string should be ignored and a log message printed instead.
Any ideas how I might fashion such a function? I'm guessing it's going to necessitate some kind of loop.
Borrowing from Thomas's answer, a ranged for loop offers a neat solution:
for (const auto &word : words)
{
    if (finalTextRaw.find(word) != std::string::npos)
    {
        // word is found.
        // do stuff here or call a function.
        break; // stop the loop.
    }
}
As pointed out by Thomas, the most efficient way is to split both texts into lists of words, then use std::set_intersection to find the words occurring in both lists. You can use std::vector as long as it is sorted. You end up with O(n*log(n)) (with n = max number of words), rather than O(n*m).
Split sentences to words:
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

auto split(std::string_view sentence) {
    auto result = std::vector<std::string>{};
    auto stream = std::istringstream{std::string{sentence}};
    std::copy(std::istream_iterator<std::string>(stream),
              std::istream_iterator<std::string>(), std::back_inserter(result));
    return result;
}
Find words existing in both lists. This only works for sorted lists (like sets or manually sorted vectors).
auto intersect(std::vector<std::string> a, std::vector<std::string> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    auto result = std::vector<std::string>{};
    std::set_intersection(std::move_iterator{a.begin()},
                          std::move_iterator{a.end()},
                          b.cbegin(), b.cend(),
                          std::back_inserter(result));
    return result;
}
Example of how to use.
int main() {
    const auto result = intersect(split("hello my name is mister raw"),
                                  split("this is the final raw text"));
    for (const auto& word : result) {
        // do something with word
    }
}
Note that this makes sense when working with a large or unknown number of words. If you know the limits, you might want to use the simpler solutions provided by other answers.
You could use a fundamental, brute force, loop:
unsigned int quantity_words = array.size();
for (unsigned int i = 0; i < quantity_words; ++i)
{
    std::string word = array[i];
    if (finalTextRaw.find(word) != std::string::npos)
    {
        // word is found.
        // do stuff here or call a function.
        break; // stop the loop.
    }
}
The above loop takes each word in the array and searches the finalTextRaw for the word.
There are better methods using some std algorithms. I'll leave that for other answers.
Edit 1: maps and association
The above code is bothering me because there are too many passes through the finalTextRaw string.
Here's another idea:
Create a std::set using the words in finalTextRaw.
For each word in your array, check for existence in the set.
This reduces the quantity of searches (it's like searching a tree).
You should also investigate creating a set of the words in array and finding the intersection between the two sets.
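A minimal sketch of that set-based idea (my own illustration; note that, unlike find(), it matches whole whitespace-separated words rather than arbitrary substrings):
#include <set>
#include <sstream>
#include <string>
#include <vector>

bool containsFilteredWord(const std::string& finalTextRaw,
                          const std::vector<std::string>& words) {
    // Split the text once into a set of words.
    std::set<std::string> textWords;
    std::istringstream ss(finalTextRaw);
    std::string w;
    while (ss >> w)
        textWords.insert(w);

    // Each filter word is then a cheap set lookup instead of a full scan of the text.
    for (const auto& word : words)
        if (textWords.count(word) != 0)
            return true;
    return false;
}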

why doesn't user defined function sort the elements of same length in the order given?

My task is to sort the words of a string in increasing order of their length, and for words of the same length, I have to keep them in the order given.
For example: "to be or not to be" will become "to be or to be not".
I am first making a vector 'v' of all the words in the string and then trying to sort the vector using a user-defined comparison function passed to C++'s sort().
Here is my code:
#include <bits/stdc++.h>
using namespace std;

static bool comparelength(string first, string second){ //function to compare length
    return first.size() < second.size();
}

int main() {
    string text = "Jlhvvd wfwnphmxoa qcuucx qsvqskq cqwfypww dyphntfz hkbwx xmwohi qvzegb ubogo sbdfmnyeim tuqppyipb llwzeug hrsaebveez aszqnvruhr xqpqd ipwbapd mlghuuwvec xpefyglstj dkvhhgecd kry";
    vector<string> v;
    string cur = "";
    text += " ";
    for(int i = 0; i < text.size(); i++){
        if(text[i] == 32){ //if space is encountered then the word is inserted in the vector
            v.push_back(cur);
            cur = "";
        }
        else{
            cur += text[i]; //if not space then text[i] is added to the current word
        }
    }
    sort(v.begin(), v.end(), comparelength); //sort the vector
    for(int i = 0; i < v.size(); i++)
        cout << v[i] << " ";
}
Now it gives this output:
"Kry xqpqd ubogo hkbwx qvzegb jlhvvd xmwohi qcuucx qsvqskq llwzeug ipwbapd dyphntfz cqwfypww tuqppyipb dkvhhgecd sbdfmnyeim xpefyglstj mlghuuwvec aszqnvruhr hrsaebveez wfwnphmxoa"
But the correct output should be:
"Kry hkbwx ubogo xqpqd jlhvvd qcuucx xmwohi qvzegb qsvqskq llwzeug ipwbapd cqwfypww dyphntfz tuqppyipb dkvhhgecd wfwnphmxoa sbdfmnyeim hrsaebveez aszqnvruhr mlghuuwvec xpefyglstj"
See positions 1, 2 and 3 (using 0-indexing):
it should give: hkbwx ubogo xqpqd.
but it gives: xqpqd ubogo hkbwx.
This makes me think that it is not sorting the words of the same length in the order given. You can find many other positions where this happens (for example 4, 5, 6 and 7).
But for the string "leetcode plus try suck geaser is cool best" it gives the correct output, which is: "is try plus suck cool best geaser leetcode".
Can anyone make it clear why it is not working for the former string but working for the latter?
I've tried doing
static bool comparelength(string first, string second){
    if(first.size() == second.size())
        return true;
    if(first.size() < second.size())
        return true;
    else
        return false;
}
But this throws a runtime error.
Sorry for making the question messy, but I really want to understand this.
std::sort is not stable, i.e., the order of elements that compare equivalent is not necessarily preserved. If you get a stable result from std::sort, that is just by chance. Stable sorting is more expensive (O(N·log(N)^2) vs O(N·log(N))), hence you have to ask for it explicitly. It can be done with std::stable_sort.
(Your modified comparator fails for a different reason: returning true when both lengths are equal violates the strict weak ordering that std::sort requires, which is undefined behaviour and here shows up as a runtime error.)
You could use std::sort with a custom comparator if you populated a container of std::pair<std::string,size_t> where second is the index in the original container, and broke ties on that index. However, I suppose using std::stable_sort is simpler.
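For illustration, a minimal sketch using std::stable_sort with the same length comparison (the word splitting here uses an istringstream instead of the manual loop in the question):
#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::string text = "to be or not to be";

    // Split into words.
    std::vector<std::string> v;
    std::istringstream in(text);
    for (std::string w; in >> w; )
        v.push_back(w);

    // stable_sort keeps equal-length words in their original relative order.
    std::stable_sort(v.begin(), v.end(),
                     [](const std::string& a, const std::string& b) {
                         return a.size() < b.size();
                     });

    for (const auto& w : v)
        std::cout << w << ' ';   // prints: to be or to be not
    std::cout << '\n';
}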

reduce time complexity in checking if a substring of a string is palindromic

This is a simple program to check for a substring which is a palindrome.
It works fine for a string of length 1000 but gives a TLE error on SPOJ for a length of 100000. How shall I optimize this code? Saving all the substrings will not work for such large inputs. The time limit is 1 sec, so we can do at most 10^6-10^7 iterations. Is there any other way I can do it?
#include<bits/stdc++.h>
int main()
{
    int t;
    std::cin>>t;
    if(t<1||t>10)
        return 0;
    while(t--)
    {
        std::string s;
        std::cin>>s;
        //std::cout<<s.substr(0,1);
        //std::vector<std::string>s1;
        int n=s.length();
        if(n<1||n>100000)
            return 0;
        int len,mid,k=0,i=0;
        for(i=0;i<n-1;i++)
        {
            for(int j=2;j<=n-i;j++)
            {
                std::string ss=s.substr(i,j);
                //s1.push_back(ss);
                len=ss.length();
                mid=len/2;
                while(k<=mid&&(len-1-k)>=mid&&len>1)
                {
                    if(ss[k]!=ss[len-1-k])
                        break;
                    k++;
                }
                if(k>mid||(len-1-k)<mid)
                {
                    std::cout<<"YES"<<std::endl;
                    break;
                }
            }
            if(k>mid||(len-1-k)<mid)
                break;
        }
        if(i==n-1)
            std::cout<<"NO"<<std::endl;
        //for(i=0;i<m;i++)
        //  std::cout<<s1[i]<<std::endl
    }
    return 0;
}
Saving all substrings in another vector and checking them later with the same O(N^2) approach will not reduce the time complexity of your algorithm; instead, it also increases your memory usage, because holding all possible substrings takes a lot of memory.
Since the size of the string can be up to 10^5, the check for a palindromic substring should be done in O(NlogN) or O(N) time to pass within the time limit. For this, I suggest two algorithms (a sketch of the second follows below):
1.) Suffix array : Link here
2.) Manacher's Algorithm: Link here
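A minimal sketch of Manacher's algorithm (my own illustration, not taken from the links): it returns the length of the longest palindromic substring in O(N), which also answers whether any palindromic substring of length greater than 1 exists.
#include <algorithm>
#include <string>
#include <vector>

int longest_palindromic_substring(const std::string& s) {
    // Transform "abc" -> "^#a#b#c#$" so odd and even palindromes are handled uniformly;
    // '^' and '$' are sentinels that never match anything.
    std::string t = "^";
    for (char c : s) { t += '#'; t += c; }
    t += "#$";

    const int n = static_cast<int>(t.size());
    std::vector<int> p(n, 0);   // p[i] = palindrome radius around centre i in t
    int centre = 0, right = 0;  // centre and right edge of the rightmost palindrome found

    for (int i = 1; i < n - 1; ++i) {
        if (i < right)
            p[i] = std::min(right - i, p[2 * centre - i]); // reuse mirrored information
        while (t[i + p[i] + 1] == t[i - p[i] - 1])
            ++p[i];                                        // expand around centre i
        if (i + p[i] > right) { centre = i; right = i + p[i]; }
    }
    // The radius in t equals the palindrome length in s.
    return *std::max_element(p.begin(), p.end());
}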
I'm not completely sure what your function is trying to accomplish... are you finding t palindromic substrings?
To save on memory, rather than storing every substring in a vector and then iterating over the vector to check for palindromes, why not just check whether each substring is a palindrome as you generate it?
std::string ss = s.substr(i,j);
// s1.push_back(ss); // Don't store the substrings
if (palindromic(ss)) {
    std::cout << "YES" << std::endl;
    break;
}
This saves some time in most cases, as you no longer always generate every possible substring. However, it is not guaranteed to be much faster in the worst case.
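For reference, a minimal sketch of the palindromic helper assumed above (a simple two-pointer check):
#include <string>

bool palindromic(const std::string& s) {
    std::size_t i = 0, j = s.size();
    while (i < j) {
        if (s[i] != s[j - 1])
            return false;   // mismatch between the two ends
        ++i;
        --j;
    }
    return true;
}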

How to cut off parts of a string, which every string in a collection has

My current problem is the following:
I have a std::vector of full path names to files.
Now I want to cut off the common prefix of all strings.
Example
If I have these 3 strings in the vector:
/home/user/foo.txt
/home/user/bar.txt
/home/baz.txt
I would like to cut off /home/ from every string in the vector.
Question
Is there any method to achieve this in general?
I want an algorithm that drops the common prefix of all strings.
I currently only have an idea which solves this problem in O(n m), with n strings and m the length of the longest string, by just going through every string with every other string char by char.
Is there a faster or more elegant way of solving this?
This can be done entirely with std:: algorithms.
Synopsis:
sort the input range if not already sorted. The first and last paths in the sorted range will be the most dissimilar. Best case is O(N), worst case O(N + N.logN).
use std::mismatch to determine the largest common prefix of the two most dissimilar paths [insignificant]
run through each path erasing the first COUNT characters, where COUNT is the number of characters in the longest common prefix. O(N)
Best case time complexity: O(2N), worst case O(2N + N.logN) (can someone check that?)
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>

std::string common_substring(const std::string& l, const std::string& r)
{
    return std::string(l.begin(),
                       std::mismatch(l.begin(), l.end(),
                                     r.begin(), r.end()).first);
}

std::string mutating_common_substring(std::vector<std::string>& range)
{
    if (range.empty())
        return std::string();
    else
    {
        if (not std::is_sorted(range.begin(), range.end()))
            std::sort(range.begin(), range.end());
        return common_substring(range.front(), range.back());
    }
}

std::vector<std::string> chop(std::vector<std::string> samples)
{
    auto str = mutating_common_substring(samples);
    for (auto& s : samples)
    {
        s.erase(s.begin(), std::next(s.begin(), str.size()));
    }
    return samples;
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };

    samples = chop(std::move(samples));

    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }
}
expected:
baz.txt
user/bar.txt
user/foo.txt
Here's an alternate common_substring which does not require a sort. Its time complexity is in theory O(N), but whether it's faster in practice you'd have to check:
#include <numeric>  // for std::accumulate

std::string common_substring(const std::vector<std::string>& range)
{
    if (range.empty())
    {
        return {};
    }
    return std::accumulate(std::next(range.begin(), 1), range.end(), range.front(),
                           [](auto const& best, const auto& sample)
                           {
                               return common_substring(best, sample);
                           });
}
update:
Elegance aside, this is probably the fastest way since it avoids any memory allocations, performing all transformations in-place. For most architectures and sample sizes, this will matter more than any other performance consideration.
#include <algorithm>
#include <iostream>
#include <vector>
#include <string>

void reduce_to_common(std::string& best, const std::string& sample)
{
    best.erase(std::mismatch(best.begin(), best.end(),
                             sample.begin(), sample.end()).first,
               best.end());
}

void remove_common_prefix(std::vector<std::string>& range)
{
    if (range.size())
    {
        auto iter = range.begin();
        auto best = *iter;
        for ( ; ++iter != range.end() ; )
        {
            reduce_to_common(best, *iter);
        }
        auto prefix_length = best.size();
        for (auto& s : range)
        {
            s.erase(s.begin(), std::next(s.begin(), prefix_length));
        }
    }
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };

    remove_common_prefix(samples);

    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }
}
You have to search every string in the list. However, you don't need to compare all the characters in every string. The common prefix can only get shorter, so you only need to compare with "the common prefix so far". I don't think this changes the big-O complexity, but it will make quite a difference to the actual speed.
Also, these look like file names. Are they sorted (bearing in mind that many filesystems tend to return things in sorted order)? If so, you only need to consider the first and last elements. If they are probably or mostly ordered, then consider the common prefix of the first and last, and then iterate through all the other strings, shortening the prefix further as necessary.
You just have to iterate over every string. You can only avoid needlessly iterating over the full length of each string by exploiting the fact that the prefix can only get shorter:
#include <iostream>
#include <string>
#include <vector>

std::string common_prefix(const std::vector<std::string> &ss) {
    if (ss.empty())
        // no prefix
        return "";

    std::string prefix = ss[0];

    for (size_t i = 1; i < ss.size(); i++) {
        size_t c = 0; // index after which the strings differ
        for (; c < prefix.length(); c++) {
            if (prefix[c] != ss[i][c]) {
                // strings differ from character c on
                break;
            }
        }
        if (c == 0)
            // no common prefix
            return "";
        // the prefix is only up to character c-1, so resize prefix
        prefix.resize(c);
    }

    return prefix;
}

void strip_common_prefix(std::vector<std::string> &ss) {
    std::string prefix = common_prefix(ss);
    if (prefix.empty())
        // no common prefix, nothing to do
        return;
    // drop the common part, which is always the first prefix.length() characters
    for (std::string &s : ss) {
        s = s.substr(prefix.length());
    }
}

int main()
{
    std::vector<std::string> ss { "/home/user/foo.txt", "/home/user/bar.txt", "/home/baz.txt"};
    strip_common_prefix(ss);
    for (std::string &s : ss)
        std::cout << s << "\n";
}
Drawing from the hints of Martin Bonner's answer, you may implement a more efficient algorithm if you have more prior knowledge of your input.
In particular, if you know your input is sorted, it suffices to compare the first and last strings (see Richard's answer).
i - Find the file which has the least folder depth (i.e. baz.txt); its root path is /home.
ii - Then go through the other strings to see if they start with that root.
iii - If so, then remove the root from all the strings.
Start with std::size_t index=0;. Scan the list to see if the characters at that index match (note: past the end does not match). If they do, advance index and repeat.
When done, index will have the value of the length of the prefix.
At this point, I'd advise you to write or find a string_view type. If you do, simply create a string_view for each of your strings str with start/end of index, str.size().
Overall cost: O(|prefix|*N+N), which is also the cost to confirm that your answer is correct.
If you don't want to write a string_view, simply call str.erase(str.begin(), str.begin()+index) on each str in your vector.
Overall cost is O(|total string length|+N). The prefix has to be visited in order to confirm it, then the tail of the string has to be rewritten.
Now the cost of this breadth-first scan is locality: you are touching memory all over the place. It will probably be more efficient in practice to do it in chunks, where you scan the first K strings up to length Q and find the common prefix, then chain that common prefix plus the next block. This won't change the O-notation, but will improve locality of memory reference.
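A minimal sketch of this column-scanning idea, using the erase variant so no string_view is needed (the function name is mine):
#include <string>
#include <vector>

void erase_common_prefix(std::vector<std::string>& v) {
    if (v.empty())
        return;

    std::size_t index = 0;
    bool same = true;
    while (same) {
        if (index >= v.front().size())
            break;                        // "past the end does not match"
        char c = v.front()[index];
        for (const auto& s : v)
            if (index >= s.size() || s[index] != c) { same = false; break; }
        if (same)
            ++index;                      // every string agrees at this column
    }

    // index is now the length of the common prefix.
    for (auto& s : v)
        s.erase(s.begin(), s.begin() + index);
}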
// erase the first 6 characters (the length of "/home/") from every string
for (vector<string>::iterator itr = V.begin(); itr != V.end(); ++itr)
    itr->erase(0, 6);

Which data structure and algorithm is appropriate for this?

I have thousands of strings. Given a pattern that needs to be searched for in all the strings, return all the strings which contain that pattern.
Presently I am using a vector to store the original strings. I search each one for the pattern and, if it matches, add it to a new vector, which I finally return.
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> v;
    v.push_back("maggi");
    v.push_back("Active Baby Pants Large 9-14 Kg ");
    v.push_back("Premium Kachi Ghani Pure Mustard Oil ");
    v.push_back("maggi soup");
    v.push_back("maggi sauce");
    v.push_back("Superlite Advanced Jar");
    v.push_back("Superlite Advanced");
    v.push_back("Goldlite Advanced");
    v.push_back("Active Losorb Oil Jar");

    vector<string> result;
    string str = "Advanced";

    for (unsigned i = 0; i < v.size(); ++i)
    {
        size_t found = v[i].find(str);
        if (found != string::npos)
            result.push_back(v[i]);
    }

    for (unsigned j = 0; j < result.size(); ++j)
    {
        cout << result[j] << endl;
    }

    return 0;
}
Is there a more optimal way to achieve the same thing, with lower complexity and higher performance?
The containers you are using are, I think, appropriate for your application.
However, instead of std::string::find, if you implement your own KMP algorithm, then you can guarantee the time complexity to be linear in the length of the string plus the search string.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
As such, the complexity of std::string::find is unspecified.
http://www.cplusplus.com/reference/string/string/find/
EDIT: As pointed out by the link below, if the length of your strings is not large (on the order of 1000 characters), then using std::string::find would probably be good enough, since the tabulation KMP requires is not worth it here.
C++ string::find complexity
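For illustration, here is a minimal self-contained KMP sketch (my own, not from the links above); kmp_contains(v[i], str) could stand in for v[i].find(str) != string::npos in the question's loop:
#include <string>
#include <vector>

// Returns true if `pattern` occurs in `text`, in O(|text| + |pattern|).
bool kmp_contains(const std::string& text, const std::string& pattern) {
    if (pattern.empty())
        return true;

    // Failure table: fail[i] = length of the longest proper prefix of
    // pattern[0..i] that is also a suffix of it.
    std::vector<std::size_t> fail(pattern.size(), 0);
    for (std::size_t i = 1, k = 0; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }

    // Scan the text, reusing the table to avoid re-examining characters.
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) return true;
    }
    return false;
}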
If the result is used in the same block of code as the input string vector (as it is in your example), or if you have a guarantee that the result is only used while the input exists, you don't actually need to copy the strings. Copying can be an expensive operation which considerably slows down the whole algorithm.
Instead, you could have a vector of pointers as the result:
vector<string*> result;
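A minimal sketch of that idea, reusing the question's variable names:
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> v = {"maggi", "Superlite Advanced Jar",
                                  "Goldlite Advanced", "Active Losorb Oil Jar"};
    std::string str = "Advanced";

    // Collect pointers into v instead of copying the matching strings.
    std::vector<const std::string*> result;
    for (const auto& s : v)
        if (s.find(str) != std::string::npos)
            result.push_back(&s);     // valid for as long as v is alive and unmodified

    for (const std::string* p : result)
        std::cout << *p << std::endl;
    return 0;
}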
If the list of strings is "fixed" across many searches, then you can do some simple preprocessing to speed things up quite considerably by using an inverted index.
Build a map of all chars present in the strings; in other words, for each possible char, store a list of all strings containing that char:
#include <map>
#include <string>
#include <vector>

std::map< char, std::vector<int> > index;
std::vector<std::string> strings;

void add_string(const std::string& s) {
    int new_pos = strings.size();
    strings.push_back(s);
    for (int i=0, n=s.size(); i<n; i++) {
        index[s[i]].push_back(new_pos);
    }
}
Then, when asked to search for a substring, you first look up all the chars of the query in the inverted index and iterate only over the index list with the smallest number of entries:
std::vector<std::string *> matching(const std::string& text) {
    std::vector<int> *best_ix = NULL;
    for (int i=0, n=text.size(); i<n; i++) {
        std::vector<int> *ix = &index[text[i]];
        if (best_ix == NULL || best_ix->size() > ix->size()) {
            best_ix = ix;
        }
    }

    std::vector<std::string *> result;
    if (best_ix) {
        for (int i=0, n=best_ix->size(); i<n; i++) {
            std::string& cand = strings[(*best_ix)[i]];
            if (cand.find(text) != std::string::npos) {
                result.push_back(&cand);
            }
        }
    } else {
        // Empty text as input, just return the whole list
        for (int i=0, n=strings.size(); i<n; i++) {
            result.push_back(&strings[i]);
        }
    }
    return result;
}
Many improvements are possible:
use a bigger index (e.g. using pairs of consecutive chars)
avoid considering very common chars (stop lists)
use hashes computed from triplets or longer sequences
search the intersection instead of scanning the shortest list; given that the elements are added in order, the vectors are already sorted, and the intersection can be computed efficiently even using vectors (see std::set_intersection and the sketch below)
All of them may or may not make sense depending on the parameters of the problem (how many strings there are, how long they are, how long the text being searched is ...).
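A minimal sketch of that last point: intersecting two already-sorted posting lists with std::set_intersection (my own illustration):
#include <algorithm>
#include <iterator>
#include <vector>

// Positions present in both posting lists; both inputs must be sorted,
// which they are here because strings are added in increasing index order.
std::vector<int> intersect_postings(const std::vector<int>& a,
                                    const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}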
If the source text is large and static (e.g. crawled web pages), then you can save search time by pre-building a suffix tree or a trie data structure. The search pattern can then traverse the tree to find matches.
If the source text is small and changes frequently, then your original approach is appropriate. The STL functions are generally very well optimized and have stood the test of time.