Efficient removal of a set of integers from another set - c++

I have a (large) set of integers S, and I want to run the following pseudocode:
set result = {};
while (S isn't empty)
{
    int i = S.getArbitraryElement();
    result.insert(i);
    set T = elementsToDelete(i);
    S = S \ T; // set difference
}
The function elementsToDelete is efficient (sublinear in the initial size of S) and the size of T is small (assume it's constant). T may contain integers no longer in S.
Is there a way of implementing the above that is faster than O(|S|^2)? I suspect I should be able to get O(|S| k), where k is the time complexity of elementsToDelete. I can of course implement the above in a straightforward way using std::set_difference but my understanding is that set_difference is O(|S|).

Using a std::set<int> S, you can do:
for (auto k : elementsToDelete(i)) {
    S.erase(k); // erasing a value that isn't in S is simply a no-op
}
Of course the lookup for erase is O(log(S.size())), not the O(1) you're asking for. That can be achieved with std::unordered_set, assuming there aren't too many collisions (a big assumption in general, but very often true in practice).
Despite the name, the std::set_difference algorithm doesn't have much to do with std::set. It works on anything you can iterate in order. Anyway it's not for in-place modification of a container. Since T.size() is small in this case, you really don't want to create a new container each time you remove a batch of elements. In another example where the result set is small enough, it would be more efficient than repeated erase.
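For completeness, here is a minimal sketch of the whole loop using std::unordered_set. The return type of elementsToDelete, the wrapper name, and the use of *S.begin() as the "arbitrary element" are assumptions, and i is erased explicitly in case T doesn't contain it:

#include <unordered_set>
#include <vector>

// Dummy stand-in for the question's elementsToDelete(); here it just returns i itself.
std::vector<int> elementsToDelete(int i) { return {i}; }

std::unordered_set<int> process(std::unordered_set<int> S) {
    std::unordered_set<int> result;
    while (!S.empty()) {
        int i = *S.begin();              // getArbitraryElement()
        result.insert(i);
        for (int t : elementsToDelete(i))
            S.erase(t);                  // average O(1) each; erasing an absent key is a no-op
        S.erase(i);                      // make sure i itself leaves S so the loop terminates
    }
    return result;
}

Each outer iteration does |T| average-constant-time erases plus one call to elementsToDelete, which gives the O(|S| k) you were hoping for.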

std::set_difference in the C++ standard library has O(|S|) time complexity, so it is not a good fit for your purposes. I advise using S.erase() instead, which deletes an element from the set in O(log N), since std::set is implemented as a balanced BST. Your overall time complexity then reduces to O(N log N).

Related

efficient extraction of elements from a C++ unordered set

In C++ suppose you have an unordered set (https://en.cppreference.com/w/cpp/container/unordered_set) of strings - is there a way to efficiently extract all strings from that set that meet a certain criterion (e.g. find all strings in the set that begin with the letter "a") using a method other than iterating through the entire set with a for loop and checking the first character of every single string?
For an arbitrary criterion this is not possible; see this answer for some more details.
Depending on your other needs, a sorted std::vector is highly likely to be the most efficient for the extraction part alone. Use algorithms like std::lower_bound to work with a sorted std::vector. In the end, your actual use case is what determines which container is best suited overall performance-wise, although std::vector comes close to a one-size-fits-all answer for performance (because of all the internal optimizations that contiguous storage allows).
That being said, in general it's advisable to use the container that seems best suited for the problem at hand and only do clever optimizations if there's an actual performance bottleneck.
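To make the sorted-vector suggestion concrete, here is a minimal sketch of extracting the strings that start with 'a' using std::lower_bound; the function name and the hard-coded "a"/"b" bounds are purely illustrative:

#include <algorithm>
#include <string>
#include <vector>

// Assumes `words` is kept sorted; returns the contiguous range of strings
// starting with 'a'. "a" and "b" bound that range lexicographically.
std::vector<std::string> startingWithA(const std::vector<std::string>& words) {
    auto first = std::lower_bound(words.begin(), words.end(), std::string("a")); // O(log N)
    auto last  = std::lower_bound(words.begin(), words.end(), std::string("b")); // O(log N)
    return std::vector<std::string>(first, last); // O(number of matches) to copy them out
}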
For the general case of any criteria, you can't do better than iterating over every element.
Each container has specific criteria that it can do better with, e.g.
std::set<std::string> strings = /* something */;
auto first = strings.lower_bound("a"); // O(log(strings)), "a" is the least string that starts with 'a'
auto last = strings.lower_bound("b"); // O(log(strings)), "b" is the first string after those that start with 'a'
strings.erase(first, last); // O(log(strings) + distance(first, last)), all the strings starting with 'a' are removed
Here we remove elements starting with 'a', with a complexity of O(log(strings) + distance(first, last)) which is a O(alphabet) improvement over iterating all elements.
Or the more contrived
std::unordered_set<std::string> strings = /* something */;
auto bucket = strings.bucket("Any collision will do"); // O(1): the bucket that key falls into
// Local (bucket) iterators can't be passed to erase(), so copy the bucket's contents out first
std::vector<std::string> bucketContents(strings.begin(bucket), strings.end(bucket));
for (const auto& s : bucketContents) strings.erase(s); // O(bucket size) on average
Here we remove the elements that land in the same bucket as "Any collision will do", with a complexity proportional to the size of that bucket.
Instead of using an unordered set, adapt your data structure to something like a trie.
In this case it might be more useful to you.
For more details please check: https://en.wikipedia.org/wiki/Trie
Implementation: https://www.geeksforgeeks.org/trie-insert-and-search/.
Depending on your needs you might think of some other algorithms like Aho-Corasick/Suffix-Arrays etc. You might need to do some research on the data-structure you need based on the amount of data that you have, the recomputing that you need and the amount of queries that you do.
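To give a rough idea of the shape of such a structure, here is a minimal, illustrative trie sketch restricted to lowercase a-z words (all names and the alphabet restriction are assumptions for the example, not a production design); insert is O(word length), and collecting everything under a prefix only visits the matching subtree:

#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::unique_ptr<TrieNode> child[26];
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->isWord = true;
}

// Depth-first walk collecting every word below `node`, prefixed by `prefix`.
void collect(const TrieNode* node, std::string& prefix, std::vector<std::string>& out) {
    if (node->isWord) out.push_back(prefix);
    for (int c = 0; c < 26; ++c) {
        if (node->child[c]) {
            prefix.push_back(static_cast<char>('a' + c));
            collect(node->child[c].get(), prefix, out);
            prefix.pop_back();
        }
    }
}

// All stored words starting with `prefix`: O(|prefix| + size of matching subtree),
// instead of scanning the whole set.
std::vector<std::string> withPrefix(const TrieNode& root, std::string prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        node = node->child[c - 'a'].get();
        if (!node) return {};
    }
    std::vector<std::string> out;
    collect(node, prefix, out);
    return out;
}

A real implementation would also need deletion and a wider alphabet (e.g. a map of children per node), but the prefix-query idea stays the same.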
I hope this helps.

Time complexity difference between two containsDuplicates algorithms

I completed two version of a leetcode algorithm and am wondering if my complexity analysis is correct, even though the online submission time in ms does not show it accurately. The goal is to take a vector of numbers as a reference and return true if it contains duplicate values and false if it does not.
The two most intuitive approaches are:
1.) Sort the vector, sweep it up to the second-to-last element, and return true if any neighboring elements are identical.
2.) Use a hashtable and insert the values and if a key already exists in the table, return true.
I completed the first version first, and it was quick, but seeing as how the sort routine would take O(nlog(n)) and the hash table inserts & map.count()s would make the second version O(log(n) + N) = O(N) I would think the hashing version would be faster with very large data sets.
In the online judging I was proven wrong, however I assumed they weren't using large enough data sets to offset the std::map overhead. So I ran a lot of tests repeatedly filling vectors up to a size between 0 and 10000 incrementing by 2, adding random values in between 0 and 20000. I piped the output to a csv file and plotted it on linux and here's the image I got.
Is the provided image truly showing me the difference here, between an O(N) and an O(nlog(n)) algorithm? I just want to make sure my complexity analysis is correct on these?
Here are the algorithms run:
bool containsDuplicate(vector<int>& nums) {
    if(nums.size() < 2) return false;
    sort(nums.begin(), nums.end());
    for(int i = 0; i < nums.size()-1; ++i) {
        if(nums[i] == nums[i+1]) return true;
    }
    return false;
}

// Slightly slower in small cases because of data structure overhead I presume
bool containsDuplicateWithHashing(vector<int>& nums) {
    map<int, int> map;
    for (int i = 0; i < nums.size(); ++i) {
        if(map.count(nums[i])) return true;
        map.insert({nums[i], i});
    }
    return false;
}
std::map is sorted, and involves O(log n) cost for each insertion and lookup, so the total cost in the "no duplicates" case (or in the "first duplicate near the end of the vector" case) would have similar big-O to sorting and scanning: O(n log n); it's typically fragmented in memory, so overhead could easily be higher than that of an optimized std::sort.
It would appear much faster if duplicates were common though; if you usually find a duplicate in the first 10 elements, it doesn't matter if the input has 10,000 elements, because the map doesn't have time to grow before you hit a duplicate and duck out. It's just that a test that only works well when it succeeds is not a very good test for general usage (if duplicates are that common, the test seems a bit silly); you want good performance in both the contains duplicate and doesn't contain duplicate cases.
If you're looking to compare approaches with meaningfully different algorithmic complexity, try replacing your map-based solution with std::unordered_set (insert also returns whether the key already existed, so you reduce the work from one lookup followed by one insert to a single combined insert-and-lookup per loop iteration). That gives average case O(1) insertion and lookup, for O(n) duplicate-checking complexity overall.
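A minimal sketch of that unordered_set version, mirroring the signatures of the functions in the question (the reserve call is optional and just avoids rehashing as the set grows):

#include <unordered_set>
#include <vector>
using namespace std;

bool containsDuplicateWithUnorderedSet(vector<int>& nums) {
    unordered_set<int> seen;
    seen.reserve(nums.size());
    for (int n : nums) {
        // insert() returns {iterator, bool}; the bool is false if the value was already present
        if (!seen.insert(n).second) return true;
    }
    return false;
}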
FYI, another approach that would be O(n log n) but use a sort-like strategy that shortcuts when a duplicate is found early, would be to make a heap with std::make_heap (O(n) work), then repeatedly pop_heap (O(log n) per pop) from the heap and compare to the heap's .front(); if the value you just popped and the front are the same, you've got a duplicate and can exit immediately. You could also use the priority_queue adapter to simplify this into a single container, instead of manually using the utility functions on a std::vector or the like.
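A sketch of that heap-based early-exit check; the function name is illustrative, and like the sort-based version it reorders nums:

#include <algorithm>
#include <vector>
using namespace std;

bool containsDuplicateWithHeap(vector<int>& nums) {
    if (nums.size() < 2) return false;
    make_heap(nums.begin(), nums.end());                 // O(n): max element ends up at nums.front()
    for (auto last = nums.end(); last - nums.begin() > 1; --last) {
        pop_heap(nums.begin(), last);                    // O(log n): moves the current max to *(last - 1)
        if (*(last - 1) == nums.front()) return true;    // popped value equals the new max -> duplicate
    }
    return false;
}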

Sort a vector in which the n first elements have been already sorted?

Consider a std::vector v of N elements, and consider that the first n elements have already been sorted, with n < N and where (N-n)/N is very small:
Is there a clever way using the STL algorithms to sort this vector more rapidly than with a complete std::sort(std::begin(v), std::end(v)) ?
EDIT: a clarification: the (N-n) unsorted elements should be inserted at the right position within the n first elements already sorted.
EDIT2: bonus question: and how to find n ? (which corresponds to the first unsorted element)
Sort the unsorted tail only, and then use std::inplace_merge:
void foo( std::vector<int> & tab, int n ) {
    std::sort( begin(tab)+n, end(tab));
    std::inplace_merge(begin(tab), begin(tab)+n, end(tab));
}
For EDIT 2:
auto it = std::adjacent_find(begin(tab), end(tab), std::greater<int>());
if (it != end(tab)) {
    it++;
    std::sort(it, end(tab));
    std::inplace_merge(begin(tab), it, end(tab));
}
The optimal solution would be to sort the tail portion independently and then perform in-place merge, as described here
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.5750
The algorithm is quite convoluted and is usually regarded as "not worth the effort".
Of course, with C++ you can use readily available std::inplace_merge. However, the name of that algorithm is highly misleading. Firstly, there's no guarantee that std::inplace_merge actually works in-place. And when it is actually in-place, there's no guarantee that it is not implemented as a full-blown sort. In practice it boils down to trying it and seeing whether it is good enough for your purposes.
But if you really want to make it in-place and formally more efficient than a full sort, then you will have to implement it manually. STL might help with a few utility algorithms, but it does not offer any solid solutions of "just a few calls to standard functions" kind.
Using insertion sort on the last N - n elements:
template <typename IT>
void mysort(IT begin, IT end) {
    for (IT it = std::is_sorted_until(begin, end); it != end; ++it) {
        IT insertPos = std::lower_bound(begin, it, *it);
        IT endRotate = it;
        std::rotate(insertPos, it, ++endRotate); // moves *it into place, shifting [insertPos, it) right by one
    }
}
The Timsort sorting algorithm is a hybrid algorithm developed by Pythonista Tim Peters. It makes optimal use of already-sorted subsegments anywhere inside the array, including in the beginning. Although you may find a faster algorithm if you know for sure that in particular the first n elements are already sorted, this algorithm should be useful for the overall class of problems involved. Wikipedia describes it as:
The algorithm finds subsets of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently.
In Tim Peters' own words,
It has supernatural performance on many kinds of partially ordered arrays (less than lg(N!) comparisons needed, and as few as N-1), yet as fast as Python's previous highly tuned samplesort hybrid on random arrays.
Full details are described in this undated text document by Tim Peters. The examples are in Python, but Python should be quite readable even to people not familiar with its syntax.
Use std::partition_point (or std::is_sorted_until) to find n. Then, if N - n is small, do an insertion sort (linear search + std::rotate).
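A small sketch of that strategy; the function name and the 10% cutoff are arbitrary choices for illustration:

#include <algorithm>
#include <cstddef>
#include <vector>

void sortMostlySorted(std::vector<int>& v) {
    auto firstUnsorted = std::is_sorted_until(v.begin(), v.end());          // this iterator is "n"
    std::size_t tail = static_cast<std::size_t>(v.end() - firstUnsorted);   // N - n unsorted elements
    if (tail < v.size() / 10) {
        // Few stragglers: insert each one into the sorted prefix (lower_bound + rotate).
        for (auto it = firstUnsorted; it != v.end(); ++it)
            std::rotate(std::lower_bound(v.begin(), it, *it), it, std::next(it));
    } else {
        std::sort(v.begin(), v.end()); // too many unsorted elements: a full sort is simpler
    }
}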
I assume that your question has two aims:
improve runtime (using a clever way)
with little effort (restricting to the STL)
Considering these aims, I'd strongly recommend against this specific optimization, unless you are sure that the effort is worth the benefit.
As far as I remember, std::sort() implements a quicksort-based algorithm (introsort), which on presorted input is almost as fast as determining whether, and how much of, the input is already sorted.
Instead of meddling with std::sort you can try changing the data structure to a sorted/prioritized queue.

Is it possible to generically sort in linear time?

I'm trying to solve a problem in O(n) time where, given two forward iterators to the front of a container and the back of a container, I want to remove all elements in the container that don't appear at least < this number > of times. For example, given a vector of strings such as ("john", "hello", "one", "yes", "hello", "one") and I wanted to remove all elements that appear less than 2 times, my final vector would then contain just ("hello", "one").
I was thinking that if I could generically sort in O(n) time I can accomplish this result (in O(n) time), but I'm having a hard time doing that with strings, ints, chars, or whatever else may be used (generically). Am I thinking about this correctly, or is there a simpler way to solve the problem?
Yes, there is a simpler way: you are not actually sorting, just removing elements.
1) Store each word into a hash set.
2) Look each word up, and only add it if it is not already in the hash set.
Short answer: no. Comparison based sorting takes O(n log n) time. (This can be formally proved.) If you know something about your input (e.g. the input is distributed uniformly at random within a known range) then you can use well known algorithms such as bucket sort or radix sort in O(n) time. Contrary to @Mooing Duck, there is no such thing as sorting in O(1) time (this should be obvious -- you must visit each element at least once for any sorting algorithm).
However, as several other posters have noted, your problem does not require a sorting algorithm ...
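To illustrate the known-range case mentioned above, a minimal counting sort sketch for non-negative integers in a known range [0, maxValue] (the names are illustrative):

#include <vector>

// O(n + maxValue) counting sort; only valid when values are known to lie in [0, maxValue].
std::vector<int> countingSort(const std::vector<int>& v, int maxValue) {
    std::vector<int> counts(maxValue + 1, 0);
    for (int x : v) ++counts[x];                              // tally each value
    std::vector<int> sorted;
    sorted.reserve(v.size());
    for (int value = 0; value <= maxValue; ++value)
        sorted.insert(sorted.end(), counts[value], value);    // emit each value counts[value] times
    return sorted;
}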
There is no need to sort:
1) Populate std::unordered_map<string, vector<int>> indexOfStrings; - O(N)
2) For each string whose vector's size() < 2, delete the element - O(number of deletions) <= O(N)
indexOfStrings stores the index of each occurrence of the string. This allows for quick deletion from the vector without the need for a search.
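A minimal sketch of one way to realize this, marking survivors and compacting in a single pass rather than erasing mid-vector (the function name and the minCount parameter are illustrative):

#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// minCount is the "appears at least this many times" threshold from the question (2 in the example).
vector<string> keepFrequent(const vector<string>& v, size_t minCount) {
    unordered_map<string, vector<size_t>> indexOfStrings;   // word -> positions, O(N) to build
    for (size_t i = 0; i < v.size(); ++i)
        indexOfStrings[v[i]].push_back(i);

    vector<bool> keep(v.size(), false);                     // mark survivors instead of erasing
    for (const auto& entry : indexOfStrings)
        if (entry.second.size() >= minCount)
            for (size_t i : entry.second)
                keep[i] = true;

    vector<string> result;                                  // single O(N) compaction pass
    for (size_t i = 0; i < v.size(); ++i)
        if (keep[i])
            result.push_back(v[i]);
    return result;
}

Note that this keeps every occurrence of a frequent word; collapse duplicates afterwards if, as in the question's example output, each word should appear only once.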
You don't need a sort, you just need an unordered_map:
unordered_map<string, int> counter;
vector<string> newvec;
for (string &s : v) {
    if ((++counter[s]) == 2) {
        newvec.push_back(s);
    }
}
Note that this is C++11 code. (Thanks @jogojapan for the code improvement suggestion).

Unordered_set questions

Could anyone explain how an unordered set works? I am also not sure how a set works. My main question is what is the efficiency of its find function.
For example, what is the total big O run time of this?
vector<int> theFirst;
vector<int> theSecond;
vector<int> theMatch;

theFirst.push_back( -2147483648 );
theFirst.push_back(2);
theFirst.push_back(44);

theSecond.push_back(2);
theSecond.push_back( -2147483648 );
theSecond.push_back( 33 );

//1) Place the contents into a unordered set that is O(m).
//2) O(n) look up so thats O(m + n).
//3) Add them to third structure so that's O(t)
//4) All together it becomes O(m + n + t)
unordered_set<int> theUnorderedSet(theFirst.begin(), theFirst.end());
for (int i = 0; i < theSecond.size(); i++)
{
    if (theUnorderedSet.find(theSecond[i]) != theUnorderedSet.end())
    {
        theMatch.push_back( theSecond[i] );
        cout << theSecond[i];
    }
}
unordered_set and all the other unordered_ data structures use hashing, as mentioned by @Sean. Hashing involves amortized constant time for insertion, and close to constant time for lookup. A hash function essentially takes some information and produces a number from it. It is a function in the sense that the same input has to produce the same output; however, different inputs can result in the same output, resulting in what is termed a collision. Lookup would be guaranteed to be constant time for a "perfect hash function", that is, one with no collisions.

In practice, the input number comes from the element you store in the structure (say its value, if it is a primitive type) and maps it to a location in the data structure. Hence, for a given key, the function takes you to the place where the element is stored without the need for any traversals or searches (ignoring collisions here for simplicity), hence constant time. There are different implementations of these structures (open addressing, chaining, etc.). See hash table, hash function. I also recommend section 3.7 of The Algorithm Design Manual by Skiena.

Now, concerning big-O complexity, you are right that you have O(m) + O(n) + O(size of overlap). Since the overlap cannot be bigger than the smaller of m and n, the overall complexity can be expressed as O(kN), where N is the larger of m and n. So, O(N). Again, this is the best case, without collisions and with perfect hashing.
std::set and std::multiset, on the other hand, use binary search trees, so insertions and look-ups are typically O(log N). The actual performance of a hashed structure vs. a tree-based one will depend on N, so it is best to try the two approaches and profile them in a realistic running scenario.
All of the std::unordered_* containers make use of a hash to perform lookups. Look at Boost's documentation on the subject and I think you'll gain an understanding very quickly.
http://www.boost.org/doc/libs/1_46_1/doc/html/unordered.html