Time complexity difference between two containsDuplicates algorithms - c++

I completed two versions of a LeetCode problem and am wondering if my complexity analysis is correct, even though the online submission times in milliseconds don't show it clearly. The goal is to take a vector of numbers by reference and return true if it contains duplicate values and false if it does not.
The two most intuitive approaches are:
1.) Sort the vector, then sweep from the first element to the second-to-last, returning true if any pair of neighboring elements is identical.
2.) Use a hash table: insert the values one by one, and return true as soon as a key already exists in the table.
I wrote the first version first, and it was quick. But since the sort routine takes O(nlog(n)), while I figured the hash table inserts and map.count() calls would make the second version O(log(n) + N) = O(N), I expected the hashing version to be faster with very large data sets.
The online judge proved me wrong; however, I assumed it simply wasn't using data sets large enough to offset the std::map overhead. So I ran a lot of tests myself, repeatedly filling vectors with sizes from 0 up to 10000 in increments of 2, adding random values between 0 and 20000. I piped the output to a CSV file and plotted it on Linux, and here's the image I got.
Is the plot really showing the difference between an O(N) and an O(nlog(n)) algorithm here? I just want to make sure my complexity analysis on these is correct.
Here are the algorithms run:
bool containsDuplicate(vector<int>& nums) {
    if(nums.size() < 2) return false;
    sort(nums.begin(), nums.end());
    for(int i = 0; i < nums.size()-1; ++i) {
        if(nums[i] == nums[i+1]) return true;
    }
    return false;
}
// Slightly slower in small cases because of data structure overhead I presume
bool containsDuplicateWithHashing(vector<int>& nums) {
    map<int, int> map;
    for (int i = 0; i < nums.size(); ++i) {
        if(map.count(nums[i])) return true;
        map.insert({nums[i], i});
    }
    return false;
}

std::map is sorted, and involves O(log n) cost for each insertion and lookup, so the total cost in the "no duplicates" case (or in the "first duplicate near the end of the vector" case) would have similar big-O to sorting and scanning: O(n log n); it's typically fragmented in memory, so overhead could easily be higher than that of an optimized std::sort.
It would appear much faster if duplicates were common though; if you usually find a duplicate in the first 10 elements, it doesn't matter if the input has 10,000 elements, because the map doesn't have time to grow before you hit a duplicate and duck out. It's just that a test that only works well when it succeeds is not a very good test for general usage (if duplicates are that common, the test seems a bit silly); you want good performance in both the contains duplicate and doesn't contain duplicate cases.
If you're looking to compare approaches with meaningfully different algorithmic complexity, try replacing your map-based solution with std::unordered_set (insert also reports whether the key already existed, so you reduce the work from one lookup followed by one insert to a single combined insert-and-lookup per loop iteration). It has average-case O(1) insertion and lookup, for O(n) duplicate-checking complexity overall.
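A minimal sketch of that unordered_set version (the name containsDuplicateWithHashSet and the reserve() call are illustrative additions, not from the question):

#include <unordered_set>
#include <vector>
using namespace std;

// Average-case O(n): one combined lookup-and-insert per element.
bool containsDuplicateWithHashSet(vector<int>& nums) {
    unordered_set<int> seen;
    seen.reserve(nums.size());        // avoid rehashing as the set grows (illustrative tweak)
    for (int x : nums) {
        if (!seen.insert(x).second)   // .second is false if x was already in the set
            return true;
    }
    return false;
}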
FYI, another approach that would be O(n log n) but uses a sort-like strategy that can shortcut when a duplicate is found early: make a heap with std::make_heap (O(n) work), then repeatedly pop_heap (O(log n) per pop) from the heap and compare the value you just popped to the heap's .front(); if they are the same, you've got a duplicate and can exit immediately. You could also use the priority_queue adapter to simplify this into a single container, instead of manually using the utility functions on a std::vector or the like.
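A hedged sketch of that heap-based early exit, assuming it's fine to take the vector by value so the caller's ordering isn't disturbed (the function name is made up):

#include <algorithm>
#include <vector>
using namespace std;

bool containsDuplicateWithHeap(vector<int> nums) {   // copy: make_heap rearranges the data
    if (nums.size() < 2) return false;
    make_heap(nums.begin(), nums.end());             // build a max-heap, O(n)
    while (nums.size() > 1) {
        pop_heap(nums.begin(), nums.end());          // current max moves to nums.back(), O(log n)
        int popped = nums.back();
        nums.pop_back();
        if (popped == nums.front()) return true;     // next max equals the one just popped
    }
    return false;
}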

Related

More efficient data structure

I'm developing a project and I need to do a lot of comparisons between objects and insertions in lists.
Basically I have an object of type Board and I do the following:
if(!(seenStates.contains(children[i])))
{
    statesToExpand.addToListOrderly(children[i]);
    seenStates.insertHead(children[i]);
}
where statesToExpand and seenStates are two lists that I defined this way:
typedef struct t_Node
{
    Board *board;
    int distanceToGoal;
    t_Node *next;
} m_Node;
typedef m_Node* m_List;
class ListOfStates {
    ...
Everything works fine, but I did some profiling and discovered that almost 99% of the time is spent operating on these lists, since I have to expand, compare, insert, etc. almost 20000 states.
My question is: is there a more efficient data structure that I could use in order to reduce the execution time of that portion of code?
Update
So I tried using std::vector and it is a bit worse (15 seconds instead of 13 with my old list). Probably I'm doing something wrong... With some more profiling I discovered that approximately 13.5 seconds are spent searching for elements in the vector. This is the code I am using:
bool Game::vectorContains(Board &b)
{
    clock_t stop;
    clock_t start = clock();
    if (seenStates.size() == 0)
    {
        stop = clock();
        clock_counter += (stop - start);
        return false;
    }
    for (vector<m_Node>::iterator it = seenStates.begin(); it != seenStates.end(); it++)
    {
        if ( /* condition */ )
        {
            stop = clock();
            clock_counter += (stop - start);
            return true;
        }
    }
    stop = clock();
    clock_counter += (stop - start);
    return false;
}
Can I do something better here or should I move on to another data structure (maybe an unordered_set as suggested below)?
One more update
I tried the exact same code in release mode and the whole algorithm executes in just 1.2 seconds.
I didn't know there could be such a big difference between Debug and Release. I know that Release does some optimization but this is some difference!
This part:
if(!(seenStates.contains(children[i])))
for a linked list is going to be very slow. While the algorithmic time is O(n), same as it would be for a std::vector<Node>, the memory that you're walking over is going to be all over the place... so you're going to incur lots of cache misses as your container gets larger. After a while, your time is just going to be dominated by those cache misses. So std::vector will likely perform much better.
That said, if you're doing a lot of find()-type operations, you should consider using a container that is set up to do finds very quickly... maybe a std::unordered_set?
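As a self-contained sketch of that idea -- the Board below is only a stand-in with a made-up key field and operator==, since the real Board's members aren't shown in the question:

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Stand-in for the question's Board type; assume it can expose some hashable state.
struct Board {
    long long key;                                    // hypothetical packed representation
    bool operator==(const Board& other) const { return key == other.key; }
};

struct BoardHash {
    std::size_t operator()(const Board& b) const { return std::hash<long long>()(b.key); }
};

int main() {
    std::unordered_set<Board, BoardHash> seenStates;  // average O(1) insert/lookup
    std::vector<Board> children = {{1}, {2}, {1}};

    for (const Board& child : children) {
        // insert() returns {iterator, bool}; the bool is false if the board was already seen,
        // so one call replaces the contains() check plus insertHead().
        if (seenStates.insert(child).second) {
            // ... expanding the state (addToListOrderly) would go here ...
        }
    }
}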
Using a list ends up with O(n) time to search for elements. You could consider data structures with more efficient lookup, e.g. std::map, std::unordered_map, a sorted vector, or other tree structures. There are many data structures; which one is best depends on your algorithm design.
Indeed, you don't want to use a linked list in your case. Looking for a specific value (i.e. contains()) is very slow in a linked list, O(n).
Thus, using a sorted array list (for example a std::vector kept sorted, so it can be binary-searched) or a binary search tree would be smarter; the complexity of contains() would then become O(log n) on average.
However, if you are worried about your array list reallocating very often, you can reserve a lot of space for it when you create it (for example 20 000 elements).
Don't forget to consider using two different data structures for your two lists.
If I understand it correctly, your data structure resembles a singly linked list. So, instead of using your own implementation, you can try to work with a
std::forward_list<Board*>
or probably better with a
std::forward_list<std::unique_ptr<Board>>
If you also need a reference to the previous element, then use a standard std::list. Both will give you constant-time insertion, but only linear lookup (at least if you don't know where to search).
Alternatively, you can consider using a std::set<std::unique_ptr<Board>> (with a comparator that compares the Boards themselves rather than the pointer values), which will give you logarithmic insertion and lookup, but without further effort you lose the information on the successor.
EDIT: std::vector seems no good choice for your kind of requirements. As far as I understood, you need fast search and fast insertion. For an unsorted vector, search is O(n), and so is insertion anywhere but at the end. Use a std::map instead, where both are O(log n). [But note that using the latter doesn't mean you will directly get faster execution times, as that depends on the number of elements.]

Efficient removal of a set of integers from another set

I have a (large) set of integers S, and I want to run the following pseudocode:
set result = {};
while (S isn't empty)
{
    int i = S.getArbitraryElement();
    result.insert(i);
    set T = elementsToDelete(i);
    S = S \ T; // set difference
}
The function elementsToDelete is efficient (sublinear in the initial size of S) and the size of T is small (assume it's constant). T may contain integers no longer in S.
Is there a way of implementing the above that is faster than O(|S|^2)? I suspect I should be able to get O(|S| k), where k is the time complexity of elementsToDelete. I can of course implement the above in a straightforward way using std::set_difference but my understanding is that set_difference is O(|S|).
Using std::set<int> S;, you can do:
for (auto k : elementsToDelete(i)) {
    S.erase(k);
}
Of course the lookup for erase is O(log(S.size())), not the O(1) you're asking for. That can be achieved with std::unordered_set, assuming not too many collisions (which is a big assumption in general but very often true in particular).
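A minimal sketch of that unordered_set variant (removeBatch is just an illustrative name):

#include <unordered_set>
#include <vector>

// Remove a small batch T from S with average-case O(1) per erase.
void removeBatch(std::unordered_set<int>& S, const std::vector<int>& T) {
    for (int k : T) {
        S.erase(k);   // erasing a value that isn't in S is a harmless no-op
    }
}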
Despite the name, the std::set_difference algorithm doesn't have much to do with std::set. It works on anything you can iterate in order. Anyway it's not for in-place modification of a container. Since T.size() is small in this case, you really don't want to create a new container each time you remove a batch of elements. In another example where the result set is small enough, it would be more efficient than repeated erase.
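For reference, this is roughly what set_difference into a fresh container looks like; both input ranges must already be sorted, and the helper name difference is made up:

#include <algorithm>
#include <iterator>
#include <vector>

// Computes S \ T into a new vector; requires S and T to be sorted.
std::vector<int> difference(const std::vector<int>& S, const std::vector<int>& T) {
    std::vector<int> result;
    std::set_difference(S.begin(), S.end(),
                        T.begin(), T.end(),
                        std::back_inserter(result));
    return result;
}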
std::set_difference in the C++ library has time complexity O(|S|), hence it is not good for your purposes, so I advise you to use S.erase() to delete elements from S. Since std::set is implemented as a balanced BST, each erase is O(log N), so your time complexity reduces to O(N log N).

Performance of vector sort/unique/erase vs. copy to unordered_set

I have a function that gets all neighbours of a list of points in a grid out to a certain distance, which involves a lot of duplicates (my neighbour's neighbour == me again).
I've been experimenting with a couple of different solutions, but I have no idea which is the more efficient. Below is some code demonstrating two solutions running side by side, one using std::vector sort-unique-erase, the other using std::copy into a std::unordered_set.
I also tried another solution, which is to pass the vector containing the neighbours so far to the neighbour function, which will use std::find to ensure a neighbour doesn't already exist before adding it.
So three solutions, but I can't quite wrap my head around which is gonna be faster. Any ideas anyone?
Code snippet follows:
// Vector of all neighbours of all modified phi points, which may initially include duplicates.
std::vector<VecDi> aneighs;
// Hash function, mapping points to their norm distance.
auto hasher = [&] (const VecDi& a) {
    return std::hash<UINT>()(a.squaredNorm() >> 2);
};
// Unordered set for storing neighbours without duplication.
// (The set's hasher type must match the capturing lambda, hence decltype(hasher).)
std::unordered_set<VecDi, decltype(hasher)> sneighs(phi.dims().squaredNorm() >> 2, hasher);

... compute big long list of points including many duplicates ...

// Insert neighbours into unordered_set to remove duplicates.
std::copy(aneighs.begin(), aneighs.end(), std::inserter(sneighs, sneighs.end()));

// De-dupe neighbours list.
// TODO: is this method faster or slower than unordered_set?
std::sort(aneighs.begin(), aneighs.end(), [&] (const VecDi& a, const VecDi& b) {
    const UINT aidx = Grid<VecDi, D>::index(a, phi.dims(), phi.offset());
    const UINT bidx = Grid<VecDi, D>::index(b, phi.dims(), phi.offset());
    return aidx < bidx;
});
aneighs.erase(std::unique(aneighs.begin(), aneighs.end()), aneighs.end());
A great deal here is likely to depend on the size of the output set (which, in turn, will depend on how distant of neighbors you sample).
If it's small (no more than a few dozen items or so), your hand-rolled set implementation using std::vector and std::find will probably remain fairly competitive. Its problem is that it's an O(N^2) algorithm -- each time you insert an item, you have to search all the existing items, so each insertion is linear in the number of items already in the set. Therefore, as the set grows larger, its time to insert items grows roughly quadratically.
With std::set, each insertion has to do only approximately log2(N) comparisons instead of N comparisons. That reduces the overall complexity from O(N^2) to O(N log N). The major shortcoming is that it's (at least normally) implemented as a tree built up of individually allocated nodes. That typically reduces its locality of reference -- i.e., each item you insert will consist of the data itself plus some pointers, and traversing the tree means following pointers around. Since they're allocated individually, chances are pretty good that nodes that are (currently) adjacent in the tree won't be adjacent in memory, so you'll see a fair number of cache misses. Bottom line: while its speed grows fairly slowly as the number of items increases, the constants involved are fairly large -- for a small number of items, it'll start out fairly slow (typically quite a bit slower than your hand-rolled version).
Using a vector/sort/unique combines some of the advantages of each of the preceding. Storing the items in a vector (without extra pointers for each) typically leads to better cache usage -- items at adjacent indexes are also at adjacent memory locations, so when you insert a new item, chances are that the location for the new item will already be in the cache. The major disadvantage is that if you're dealing with a really large set, this could use quite a bit more memory. Where a set eliminates duplicates as you insert each item (i.e., an item will only be inserted if it's different from anything already in the set), this will insert all the items and then, at the end, delete all the duplicates. Given current memory availability and the number of neighbours I'd guess you're probably visiting, I doubt this is a major disadvantage in practice, but under the wrong circumstances, it could lead to a serious problem -- nearly any use of virtual memory would almost certainly make it a net loss.
Looking at the last from a complexity viewpoint, it's going to be O(N log N), sort of like the set. The difference is that with the set it's really more like O(N log M), where N is the total number of neighbours and M is the number of unique neighbours. With the vector, it's really O(N log N), where N is (again) the total number of neighbours. As such, if the number of duplicates is extremely large, a set could have a significant algorithmic advantage.
It's also possible to implement a set-like structure in purely linear sequences. This retains the set's advantage of only storing unique items, but also the vector's locality of reference advantage. The idea is to keep most of the current set sorted, so you can search it in log(N) complexity. When you insert a new item, however, you just put it in the separate vector (or an unsorted portion of the existing vector). When you do a new insertion you also do a linear search on those unsorted items.
When that unsorted part gets too large (for some definition of "too large") you sort those items and merge them into the main group, then start the same sequence again. If you define "too large" in terms of "log N" (where N is the number of items in the sorted group) you can retain O(N log N) complexity for the data structure as a whole. When I've played with it, I've found that the unsorted portion can be larger than I'd have expected before it starts to cause a problem though.
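A hedged sketch of that hybrid idea, for int keys rather than the question's VecDi (the class name, the threshold rule, and the rest are illustrative, not a drop-in implementation):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

class HybridSet {
public:
    // Returns true if value was newly inserted, false if it was already present.
    bool insert(int value) {
        if (std::binary_search(sorted_.begin(), sorted_.end(), value)) return false;
        if (std::find(unsorted_.begin(), unsorted_.end(), value) != unsorted_.end()) return false;
        unsorted_.push_back(value);
        if (unsorted_.size() > threshold()) merge();   // keep the unsorted tail small
        return true;
    }
    std::size_t size() const { return sorted_.size() + unsorted_.size(); }

private:
    std::size_t threshold() const {
        // "too large" defined roughly as log2 of the sorted portion, as described above
        std::size_t t = 1;
        for (std::size_t n = sorted_.size(); n > 1; n >>= 1) ++t;
        return t;
    }
    void merge() {
        std::sort(unsorted_.begin(), unsorted_.end());
        std::vector<int> merged;
        merged.reserve(sorted_.size() + unsorted_.size());
        std::merge(sorted_.begin(), sorted_.end(),
                   unsorted_.begin(), unsorted_.end(),
                   std::back_inserter(merged));
        sorted_.swap(merged);
        unsorted_.clear();
    }
    std::vector<int> sorted_;    // bulk of the set, kept sorted for binary search
    std::vector<int> unsorted_;  // small recently-inserted tail, searched linearly
};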
An unordered set has constant time complexity, O(1), for insertion (on average), so the whole operation is O(n), where n is the number of elements before removal.
Sorting a list of n elements is O(n log n), and going over the list to remove duplicates is O(n). O(n log n) + O(n) = O(n log n).
The unordered set (which performs much like a hash table) is better.
Data on unordered_set operations:
http://en.cppreference.com/w/cpp/container/unordered_set

Is it possible to generically sort in linear time?

I'm trying to solve a problem in O(n) time where, given two forward iterators to the front of a container and the back of a container, I want to remove all elements in the container that don't appear at least < this number > of times. For example, given a vector of strings such as ("john", "hello", "one", "yes", "hello", "one") and I wanted to remove all elements that appear less than 2 times, my final vector would then contain just ("hello", "one").
I was thinking that if I could generically sort in O(n) time I can accomplish this result (in O(n) time), but I'm having a hard time doing that with strings, ints, chars, or whatever else may be used (generically). Am I thinking about this correctly, or is there a simpler way to solve the problem?
Yes -- you're not actually sorting, just removing elements.
1) Store each word in a hash set.
2) Look each word up first, and only add it if it's not already in the hash set.
Short answer: no. Comparison-based sorting takes O(n log n) time. (This can be formally proved.) If you know something about your input (e.g. the input is distributed uniformly at random within a known range) then you can use well-known algorithms such as bucket sort or radix sort in O(n) time. Contrary to @Mooing Duck, there is no such thing as sorting in O(1) time (this should be obvious -- you must visit each element at least once for any sorting algorithm).
However, as several other posters have noted, your problem does not require a sorting algorithm ...
There is no need to sort
1) Populate std::unordered_map<string, vector<int>> indexOfStrings; - O(N)
2) For each string whose index vector's size() < 2, delete its elements - O(number of deletions) <= O(N)
indexOfStrings stores the index of each occurrence of the string. This allows for quick deletion from the vector without the need for a search.
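A hedged sketch of that index-map idea (it keeps every copy of a frequent string, as described above, hard-codes the threshold of 2 from the example, and compacts the vector in one pass rather than erasing element by element):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

void removeRare(vector<string>& v) {
    unordered_map<string, vector<size_t>> indexOfStrings;   // O(N) to populate
    for (size_t i = 0; i < v.size(); ++i)
        indexOfStrings[v[i]].push_back(i);

    vector<bool> keep(v.size(), true);
    for (const auto& entry : indexOfStrings)
        if (entry.second.size() < 2)
            for (size_t i : entry.second)   // the stored indices say exactly what to drop
                keep[i] = false;

    // One compaction pass instead of repeated O(N) erases.
    size_t out = 0;
    for (size_t i = 0; i < v.size(); ++i)
        if (keep[i]) v[out++] = std::move(v[i]);
    v.resize(out);
}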
You don't need a sort, you just need an unordered_map:
unordered_map<string, int> counter;
vector<string> newvec;
for (string &s : v) {
    if ((++counter[s]) == 2) {
        newvec.push_back(s);
    }
}
Note that this is C++11 code. (Thanks @jogojapan for the code improvement suggestion.)

Unordered_set questions

Could anyone explain how an unordered set works? I am also not sure how a set works. My main question is what is the efficiency of its find function.
For example, what is the total big O run time of this?
vector<int> theFirst;
vector<int> theSecond;
vector<int> theMatch;
theFirst.push_back( -2147483648 );
theFirst.push_back(2);
theFirst.push_back(44);
theSecond.push_back(2);
theSecond.push_back( -2147483648 );
theSecond.push_back( 33 );
//1) Place the contents into a unordered set that is O(m).
//2) O(n) look up so thats O(m + n).
//3) Add them to third structure so that's O(t)
//4) All together it becomes O(m + n + t)
unordered_set<int> theUnorderedSet(theFirst.begin(), theFirst.end());
for (int i = 0; i < theSecond.size(); i++)
{
    if (theUnorderedSet.find(theSecond[i]) != theUnorderedSet.end())
    {
        theMatch.push_back( theSecond[i] );
        cout << theSecond[i];
    }
}
unordered_set and all the other unordered_ data structures use hashing, as mentioned by @Sean. Hashing involves amortized constant time for insertion, and close to constant time for lookup. A hash function essentially takes some information and produces a number from it. It is a function in the sense that the same input has to produce the same output. However, different inputs can result in the same output, resulting in what is termed a collision. Lookup would be guaranteed to be constant time for a "perfect hash function", that is, one with no collisions.
In practice, the input number comes from the element you store in the structure (say its value, if it is a primitive type) and maps it to a location in the data structure. Hence, for a given key, the function takes you to the place where the element is stored without the need for any traversals or searches (ignoring collisions here for simplicity), hence constant time. There are different implementations of these structures (open addressing, chaining, etc.). See hash table, hash function. I also recommend section 3.7 of The Algorithm Design Manual by Skiena.
Now, concerning big-O complexity, you are right that you have O(m) + O(n) + O(size of overlap). Since the overlap cannot be bigger than the smaller of m and n, the overall complexity can be expressed as O(kN), where N is the larger of m and n. So, O(N). Again, this is the best case, without collisions, and with perfect hashing.
set and multiset, on the other hand, use binary trees, so insertions and lookups are typically O(log N). The actual performance of a hashed structure vs. a tree-based one will depend on N, so it is best to try the two approaches and profile them in a realistic running scenario.
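A tiny illustration of the difference, using only standard calls (the values are arbitrary):

#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::unordered_set<int> us = { -2147483647, 2, 44 };

    // Each key is hashed and mapped to one of bucket_count() buckets;
    // find() only has to look inside that single bucket, so it's average O(1).
    std::cout << "buckets: " << us.bucket_count() << "\n";
    std::cout << "2 lives in bucket " << us.bucket(2) << "\n";
    std::cout << "found 2? " << (us.find(2) != us.end()) << "\n";

    // The tree-based set keeps keys ordered; find() walks the tree in O(log N).
    std::set<int> s = { -2147483647, 2, 44 };
    std::cout << "found 44? " << (s.find(44) != s.end()) << "\n";
}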
All of the std::unordered_* containers make use of a hash to perform lookups. Look at Boost's documentation on the subject and I think you'll gain an understanding very quickly.
http://www.boost.org/doc/libs/1_46_1/doc/html/unordered.html