Could anyone explain how an unordered set works? I am also not sure how a set works. My main question is: what is the efficiency of its find function?
For example, what is the total big-O run time of this?
vector<int> theFirst;
vector<int> theSecond;
vector<int> theMatch;
theFirst.push_back( -2147483648 );
theFirst.push_back(2);
theFirst.push_back(44);
theSecond.push_back(2);
theSecond.push_back( -2147483648 );
theSecond.push_back( 33 );
//1) Place the contents into an unordered set: that is O(m).
//2) O(n) lookups, so that's O(m + n).
//3) Add the matches to a third structure, so that's O(t).
//4) All together it becomes O(m + n + t).
unordered_set<int> theUnorderedSet(theFirst.begin(), theFirst.end());
for(int i = 0; i < theSecond.size(); i++)
{
    if(theUnorderedSet.find(theSecond[i]) != theUnorderedSet.end())
    {
        theMatch.push_back( theSecond[i] );
        cout << theSecond[i];
    }
}
unordered_set and all the other unordered_ data structures use hashing, as mentioned by @Sean. Hashing involves amortized constant time for insertion, and close to constant time for lookup. A hash function essentially takes some information and produces a number from it. It is a function in the sense that the same input has to produce the same output; however, different inputs can produce the same output, resulting in what is termed a collision. Lookup would be guaranteed to be constant time for a "perfect hash function", that is, one with no collisions.

In practice, the input number comes from the element you store in the structure (say its value, if it is a primitive type) and is mapped to a location in the data structure. Hence, for a given key, the function takes you to the place where the element is stored without the need for any traversal or search (ignoring collisions here for simplicity), hence constant time. There are different implementations of these structures (open addressing, chaining, etc.). See hash table and hash function. I also recommend section 3.7 of The Algorithm Design Manual by Skiena.

Now, concerning big-O complexity, you are right that you have O(m) + O(n) + O(size of the overlap). Since the overlap cannot be bigger than the smaller of m and n, the overall complexity can be expressed as O(kN) for a constant k, where N is the larger of m and n. So, O(N). Again, this is the best case: without collisions, with perfect hashing.
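For a concrete picture of that key-to-bucket mapping, here is a small sketch; the exact hash values and the bucket count are implementation details and will differ between standard libraries.

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s{-2147483648, 2, 44};

    // std::hash<int> maps each key to a size_t; the container then reduces that
    // value to a bucket index (typically something like hash % bucket_count()).
    for (int key : {2, 44, 33}) {
        std::cout << "key " << key
                  << "  hash " << std::hash<int>{}(key)
                  << "  bucket " << s.bucket(key)             // where the key would live
                  << "  found " << (s.find(key) != s.end())   // 1, 1, 0
                  << '\n';
    }
}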
set and multiset, on the other hand, use binary trees, so insertions and lookups are typically O(log N). The actual performance of a hashed structure vs. a binary tree one will depend on N, so it is best to try the two approaches and profile them in a realistic running scenario.
All of the std::unordered_* containers use a hash to perform lookups. Look at Boost's documentation on the subject and I think you'll gain an understanding very quickly.
http://www.boost.org/doc/libs/1_46_1/doc/html/unordered.html
Related
In a hash map data structure such as unordered_map in C++:
unordered_map<char, int> mp = { {'a', 10}, {'b', 20} };
if (mp.find('a') != mp.end())
cout << "found you";
we know the find() method takes constant time, but what if I have composite data as the key:
unordered_map<tuple<char, string, int>, int> mp = { { {'a', "apple", 10}, 100 } };
if (mp.find( {'a', "apple", 10} ) != mp.end())
cout << "found you";
Will the find() method still take constant time? How do we evaluate the time complexity now?
In general, the more bytes of data in the key, the longer the hash function will take to generate a value (though some hash functions do not look at every byte, and can therefore have reduced big-O complexity). There might be more or less bytes because the tuple has more values, or some element in the tuple is variable sized (like a std::string). Similarly, with more bytes, it generally takes longer to test two keys for equality, which is another crucial operation for hash tables.
So, you could say your table's operations scale linearly with the size of the keys - O(K) - all other things being equal.
But, more often, you're interested in comparing how the performance of any given insert/erase/find compares with how long it would take in another type of container, and in many other types of containers the performance tends to degrade as you add more and more keys. That's where people describe hash tables as generally having amortised average-case O(1) operational complexity, whereas e.g. balanced binary trees may be O(logN) where N is the number of elements stored.
There are some other considerations, such as that operations in a balanced binary tree tend to involve comparisons (i.e. key1 < key2), which may be short-circuited at the first differing byte, whereas hash functions tend to have to process all bytes in the key.
Now, if in your problem domain, the size of keys may vary widely, then it's meaningful to think in terms of O(K) complexity, but if the size of keys tends to hover around the same typical range - regardless of the number of keys you're storing, then the table property is reasonably expressed as O(1) - removing the near-constant multiplicative factor.
I think it helps to consider a familiar analogy. If you have 100 friends' names stored in your phone address book, or you have millions of names from a big city's phone book, the average length of names is probably pretty similar, so you could very reasonably talk about the big-O efficiency of your data structure in terms of "N" while ignoring the way it shrinks or grows with name length "K".
On the other hand, if you're thinking about storing arbitrary-length keys in a hash table, and some people might try to put XML versions of encyclopaedias, while others store novels, poems, or individual words, then there's enough variety in key length that it makes sense to describe the varying performance in terms of K.
The same would be true if you were storing, say, binary video data and someone considered using the raw binary video data as the hash table key: some keys 8K HDR and hours long, others tiny animated GIFs. (A better approach would be to generate a 64+ bit hash of the video data and use that as the key, which for most practical purposes will be reliably unique; if dealing with billions of videos, use 128 bits.)
The theoretical running time is not in fact constant. The running time is constant only on average, given reasonable use cases.
A hash function is used in the implementation. If you implement a (good) hash function for your tuple that runs in constant time, the asymptotic running time of find is unaffected.
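As a rough sketch of such a hash (std::unordered_map has no built-in hash for std::tuple, so you must supply one), the functor below combines the member hashes in the style of boost::hash_combine; the mixing constants are conventional choices, not anything mandated by the standard.

#include <functional>
#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>

// Hand-rolled hash for the tuple key; it runs in time proportional to the key
// size (the std::string member dominates), which is the O(K) factor discussed above.
struct TupleKeyHash {
    std::size_t operator()(const std::tuple<char, std::string, int>& t) const {
        std::size_t seed = std::hash<char>{}(std::get<0>(t));
        auto mix = [&seed](std::size_t h) {
            seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);  // hash_combine-style mixing
        };
        mix(std::hash<std::string>{}(std::get<1>(t)));
        mix(std::hash<int>{}(std::get<2>(t)));
        return seed;
    }
};

int main() {
    std::unordered_map<std::tuple<char, std::string, int>, int, TupleKeyHash> mp;
    mp[std::make_tuple('a', std::string("apple"), 10)] = 100;
    if (mp.find(std::make_tuple('a', std::string("apple"), 10)) != mp.end())
        std::cout << "found you\n";
}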
#include <iostream>
#include <unordered_set>
using namespace std;

int main()
{
    auto hash = [](int i) { return i; };
    unordered_set<int, decltype(hash)> s(4000, hash);
    for (int i = 0; i < 4000; i++)
        s.emplace(i * 4027);
    cout << s.bucket_size(0) << endl;  // 4000 here: all the keys fell into the same bucket
    return 0;
}
http://ideone.com/U1Vs1P
I found out that the ideone compiler uses the prime 4027 (which is the first prime number after 4000, the bucket count requested in the unordered_set's constructor) as the divisor for the hash value, and uses the remainder to determine which bucket the key should fall in, which is 0 in this case.
I also ran this piece of code on Visual Studio 2015, just changing 4027 to 4096, and it returned 4000 to me too. It seems VS uses the first power of 2 after 4000 as the divisor.
My problem is, I have several unique integers (maybe hundreds), all in the [0, 4000) interval.
I want to store them in a hash table, so that I can insert and erase these keys really fast.
And I don't want to waste memory; I don't want to keep a 4000-long vector for just a few ints.
I tried the default unordered_set, but its hash function is too slow.
So I think I can use [](int i){ return i; } as my hash function, as long as I know how my keys will be distributed (my keys are likely to be quite compact, like 301, 303, 304, 306, 308).
But is this good practice? I'm afraid this would cause collision issues on other compilers.
And I don't want to waste memory; I don't want to keep a 4000-long vector for just a few ints.
That's what a hash table is. It's a memory-for-performance tradeoff. If you want a container which can provide O(1) performance for search, insertion, and removal, then the price is high memory costs.
The node-based set has lower memory costs, but O(log(n)) search, lots of dynamic allocations, and relatively fast insertion and removal (ignoring the search time). The array-based flat_set (i.e. a sorted vector) gives you the smallest possible memory footprint (and very fast start-to-end iteration), but O(log(n)) search and insert/removal operations that can be excruciatingly slow for large sets.
There is no free lunch when it comes to these things.
The only way to deal with this sort of thing is to make sure that the number of buckets is sufficiently large relative to the number of elements. That will help minimize collisions.
If you know a hash table's implementation and the hash function you use, you can always construct a series of numbers that represents the worst-case scenario. But hash tables are not optimized for the worst case; they're optimized for the average case, where most elements don't collide.
That being said, you can always have your hash function perform some arbitrary math on the numbers. Adding an arbitrary fixed constant, doing some bitshifts, or whatever else you feel works. But again, that won't stop someone from constructing the worst-case scenario. So you should only bother with such a thing if your actual code frequently gets collisions and you can't eliminate them without removing something important.
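As a sketch of that kind of "arbitrary math" (the multiplier and shift below are arbitrary illustrative choices, not anything the standard requires), you can mix the bits so that nearby keys do not all share a bucket pattern for every choice of bucket count:

#include <cstdint>
#include <iostream>
#include <unordered_set>

// Instead of the identity function, scramble the bits a little so that keys
// that differ only in their low bits (301, 303, 304, ...) spread out better.
struct MixedIntHash {
    std::size_t operator()(int i) const {
        std::uint32_t x = static_cast<std::uint32_t>(i);
        x *= 0x9E3779B1u;   // arbitrary odd constant (golden-ratio style)
        x ^= x >> 16;       // fold the high bits into the low bits
        return x;
    }
};

int main() {
    std::unordered_set<int, MixedIntHash> s;
    for (int key : {301, 303, 304, 306, 308})
        s.insert(key);
    std::cout << "buckets: " << s.bucket_count()
              << ", load factor: " << s.load_factor() << '\n';
}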
I have a (large) set of integers S, and I want to run the following pseudocode:
set result = {};
while(S isn't empty)
{
    int i = S.getArbitraryElement();
    result.insert(i);
    set T = elementsToDelete(i);
    S = S \ T; // set difference
}
The function elementsToDelete is efficient (sublinear in the initial size of S) and the size of T is small (assume it's constant). T may contain integers no longer in S.
Is there a way of implementing the above that is faster than O(|S|^2)? I suspect I should be able to get O(|S| k), where k is the time complexity of elementsToDelete. I can of course implement the above in a straightforward way using std::set_difference but my understanding is that set_difference is O(|S|).
Using a std::set<int> S, you can do:
for (auto k : elementsToDelete(i)) {
    S.erase(k);
}
Of course the lookup for erase is O(log(S.size())), not the O(1) you're asking for. That can be achieved with std::unordered_set, assuming not too many collisions (which is a big assumption in general but very often true in particular).
Despite the name, the std::set_difference algorithm doesn't have much to do with std::set. It works on anything you can iterate in order. Anyway it's not for in-place modification of a container. Since T.size() is small in this case, you really don't want to create a new container each time you remove a batch of elements. In another example where the result set is small enough, it would be more efficient than repeated erase.
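Putting that together, a minimal sketch of the whole loop with std::unordered_set might look like this; elementsToDelete here is just a placeholder standing in for the real (sublinear) function from the question:

#include <iostream>
#include <unordered_set>
#include <vector>

// Placeholder for the question's elementsToDelete(i); the real one is assumed
// to be sublinear in |S| and to return a small batch T.
std::vector<int> elementsToDelete(int i) { return {i + 1, i + 2}; }

int main() {
    std::unordered_set<int> S{1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<int> result;

    while (!S.empty()) {
        int i = *S.begin();              // an arbitrary element
        result.push_back(i);
        S.erase(S.begin());              // remove the chosen element so the loop always progresses
        for (int k : elementsToDelete(i))
            S.erase(k);                  // average O(1) per erase; keys not in S are simply ignored
    }

    std::cout << "picked " << result.size() << " elements\n";
}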
std::set_difference in the C++ library has time complexity O(|S|), hence it is not good for your purposes, so I advise you to use S.erase() to delete an element from S in O(log N) (std::set is implemented as a BST). Hence your overall time complexity reduces to O(N log N).
I'm currently working on an embedded device project where I'm running into performance problems. Profiling has located an O(N) operation that I'd like to eliminate.
I basically have two arrays int A[N] and short B[N]. Entries in A are unique and ordered by external constraints. The most common operation is to check if a particular value a appears in A[]. Less frequently, but still common is a change to an element of A[]. The new value is unrelated to the previous value.
Since the most common operation is the find, that's where B[] comes in. It's a sorted array of indices in A[], such that A[B[i]] < A[B[j]] if and only if i<j. That means that I can find values in A using a binary search.
Of course, when I update A[k], I have to find k in B and move it to a new position, to maintain the search order. Since I know the old and new values of A[k], that's just a memmove() of a subset of B[] between the old and new position of k. This is the O(N) operation that I need to fix; since the old and new values of A[k] are essentially random, I'm moving on average about N/3 elements.
I looked into std::make_heap using [](int i, int j) { return A[i] < A[j]; } as the predicate. In that case I can easily make B[0] point to the smallest element of A, and updating B is now a cheap O(log N) rebalancing operation. However, I generally don't need the smallest value of A; I need to find whether any given value is present. And that's now an O(N log N) search in B (half of my N elements are at heap depth log N, a quarter at (log N)-1, etc.), which is no improvement over a dumb O(N) search directly in A.
Considering that std::set has O(log N) insert and find, I'd say that it should be possible to get the same performance here for update and find. But how do I do that? Do I need another order for B? A different type?
B is currently a short [N] because A and B together are about the size of my CPU cache, and my main memory is a lot slower. Going from 6*N to 8*N bytes would not be nice, but still acceptable if my find and update go to O(log N) both.
If the only operations are (1) check if value 'a' belongs to A and (2) update values in A, why don't you use a hash table in place of the sorted array B? Especially if A does not grow or shrink in size and the values only change this would be a much better solution. A hash table does not require significantly more memory than an array. (Alternatively, B should be changed not to a heap but to a binary search tree, that could be self-balancing, e.g. a splay tree or a red-black tree. However, trees require extra memory because of the left- and right-pointers.)
A practical solution that grows memory use from 6N to 8N bytes is to aim for exactly 50% filled hash table, i.e. use a hash table that consists of an array of 2N shorts. I would recommend implementing the Cuckoo Hashing mechanism (see http://en.wikipedia.org/wiki/Cuckoo_hashing). Read the article further and you find that you can get load factors above 50% (i.e. push memory consumption down from 8N towards, say, 7N) by using more hash functions. "Using just three hash functions increases the load to 91%."
From Wikipedia:
A study by Zukowski et al. has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis, with its performance compared against alternative hashing schemes.
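As an illustrative sketch of the cuckoo-hashing idea for this use case: one array of shorts, two hash functions, and -1 as the "empty" sentinel (the keys here are known to be non-negative). The hash constants and the eviction limit are arbitrary choices, and a real implementation would rehash when an insert fails.

#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

class CuckooSet {
    std::vector<short> table_;  // -1 marks an empty slot

    std::size_t h1(int k) const { return (static_cast<std::uint32_t>(k) * 2654435761u) % table_.size(); }
    std::size_t h2(int k) const { return (static_cast<std::uint32_t>(k) * 40503u + 1u) % table_.size(); }

public:
    explicit CuckooSet(std::size_t slots) : table_(slots, -1) {}

    bool contains(int k) const { return table_[h1(k)] == k || table_[h2(k)] == k; }

    bool erase(int k) {
        if (table_[h1(k)] == k) { table_[h1(k)] = -1; return true; }
        if (table_[h2(k)] == k) { table_[h2(k)] = -1; return true; }
        return false;
    }

    bool insert(int k) {
        if (contains(k)) return true;
        short cur = static_cast<short>(k);
        std::size_t pos = h1(cur);
        for (int kicks = 0; kicks < 32; ++kicks) {
            if (table_[pos] == -1) { table_[pos] = cur; return true; }
            std::swap(cur, table_[pos]);                 // evict the current occupant
            pos = (pos == h1(cur)) ? h2(cur) : h1(cur);  // send it to its other slot
        }
        return false;  // too many evictions: a full implementation would rehash and re-insert cur
    }
};

int main() {
    CuckooSet s(800);  // roughly 2x a few hundred expected keys, all in [0, 4000)
    for (int k : {301, 303, 304, 306, 308}) s.insert(k);
    std::cout << s.contains(304) << ' ' << s.contains(305) << '\n';  // expected: 1 0
}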
std::set usually provides the O(log(n)) insert and delete by using a binary search tree. This unfortunately uses 3*N space for most pointer-based implementations: assuming word-sized data, 1 word for the data and 2 for the pointers to the left and right child of each node.
If you have some constant N and can guarantee that ceil(log2(N)) is less than half the word size, you can use a fixed-length array of tree nodes occupying 2*N words in total: 1 word for the data and 1 word for the indexes of the two child nodes, stored as the upper and lower half of the word (sketched below). Whether this would let you use a self-balancing binary search tree of some kind depends on your N and word size. For a 16-bit system you only get N = 256, but for 32 bits it's 65k.
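A sketch of that packed layout, assuming 32-bit words and 16-bit child indices; 0xFFFF is used here, arbitrarily, to mean "no child":

#include <cstdint>
#include <iostream>

struct PackedNode {
    std::int32_t  data;      // one word for the stored value
    std::uint32_t children;  // upper 16 bits: left-child index, lower 16 bits: right-child index

    std::uint16_t left()  const { return static_cast<std::uint16_t>(children >> 16); }
    std::uint16_t right() const { return static_cast<std::uint16_t>(children & 0xFFFFu); }
    void set_children(std::uint16_t l, std::uint16_t r) {
        children = (static_cast<std::uint32_t>(l) << 16) | r;
    }
};

int main() {
    PackedNode n{42, 0};
    n.set_children(3, 0xFFFF);                     // left child at index 3, no right child
    std::cout << sizeof(PackedNode) << " bytes: "  // 8 bytes = 2 words per node
              << n.data << ' ' << n.left() << ' ' << n.right() << '\n';
}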
Since you have limited N, can't you use std::set<short, cmp, pool_allocator> B with Boost's pool_allocator?
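A minimal sketch of that suggestion; Boost's pool allocators live in <boost/pool/pool_alloc.hpp>, and fast_pool_allocator is the variant Boost documents for node-based containers such as std::set:

#include <iostream>
#include <set>
#include <boost/pool/pool_alloc.hpp>

int main() {
    // Node allocations come from a pool instead of individual heap allocations.
    std::set<short, std::less<short>, boost::fast_pool_allocator<short>> B;
    for (short v : {301, 303, 304, 306, 308})
        B.insert(v);
    std::cout << B.count(304) << '\n';  // 1
}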
I know that the individual map queries take a maximum of log(N) time. However I was wondering, I have seen a lot of examples that use strings as map keys. What is the performance cost of associating a std::string as a key to a map instead of an int for example ?
std::map<std::string, aClass*> someMap; vs std::map<int, aClass*> someMap;
Thanks!
Analyzing an algorithm's asymptotic performance means working out which operations must be performed and the cost they add to the equation. For that you first need to know which operations are performed and then evaluate their costs.
Searching for a key in a balanced binary tree (which is what maps happen to be) requires O( log N ) complex operations. Each of those operations implies comparing the key for a match and following the appropriate pointer (child) if the key did not match. This means that the overall cost is proportional to log N times the cost of those two operations. Following pointers is a constant-time operation, O(1), and comparing keys depends on the key. For an integer key, comparisons are fast, O(1). Comparing two strings is another story: it takes time proportional to the sizes of the strings involved, O(L) (where I have intentionally used L as the length of the string parameter instead of the more common N).
When you sum all the costs up, you get that using integers as keys the total cost is O( log N )*( O(1) + O(1) ), which is equivalent to O( log N ) (the O(1) gets hidden in the constant that the O notation silently drops).
If you use strings as keys, the total cost is O( log N )*( O(L) + O(1) ) where the constant time operation gets hidden by the more costly linear operation O(L) and can be converted into O( L * log N ). That is, the cost of locating an element in a map keyed by strings is proportional to the logarithm of the number of elements stored in the map times the average length of the strings used as keys.
Note that the big-O notation is most appropriate to use as an analysis tool to determine how the algorithm will behave when the size of the problem grows, but it hides many facts underneath that are important for raw performance.
As the simplest example, if you change the key from a generic string to an array of 1000 characters you can hide that cost within the constant dropped out of the notation. Comparing arrays of 1000 chars is a constant operation that just happens to take quite a bit of time. With the asymptotic notation that would just be a O( log N ) operation, as with integers.
The same happens with many other hidden costs. For example, the cost of creating the elements is usually considered a constant-time operation just because it does not depend on the parameters of your problem; yet the cost of locating a block of memory in each allocation depends on memory fragmentation, and the cost of acquiring the lock inside malloc (to guarantee that no two threads get the same block of memory) depends on lock contention, which itself depends on the number of processors, processes, and how many memory requests they perform. All of that is outside the scope of algorithm analysis, so when reading costs in big-O notation you must be conscious of what it really means.
In addition to the time complexity from comparing strings already mentioned, a string key will also cause an additional memory allocation each time an item is added to the container. In certain cases, e.g. highly parallel systems, a global allocator mutex can be a source of performance problems.
In general, you should choose the alternative that makes the most sense in your situation, and only optimize based on actual performance testing. It's notoriously hard to judge what will be a bottleneck.
The cost difference will be linked to the difference in cost between comparing two ints versus comparing two strings.
When comparing two strings, you have to dereference a pointer to get to the first chars, and compare them. If they are identical, you have to compare the second chars, and so on. If your strings have a long common prefix, this can slow down the process a bit. It is very unlikely to be as fast as comparing ints, though.
The cost is of course that ints can be compared in real O(1) time whereas strings are compared in O(n) time (n being the length of the longest shared prefix). Also, the storage of strings consumes more space than that of integers.
Other than these apparent differences, there's no major performance cost.
First of all, I doubt that in a real application, whether you have string keys or int keys makes any noticeable difference. Profiling your application will tell you if it matters.
If it does matter, you could change your key to be something like this (untested):
#include <string>

class Key {
public:
    unsigned hash;
    std::string s;

    int cmp(const Key& other) const {
        if (hash != other.hash)                 // compare the cached hashes first
            return hash < other.hash ? -1 : 1;  // avoids unsigned wrap-around from subtraction
        return s.compare(other.s);              // hashes equal: fall back to the full string compare
    }
};

// So the Key can be used directly as a std::map key:
bool operator<(const Key& a, const Key& b) { return a.cmp(b) < 0; }
Now you're doing an int comparison on the hashes of two strings. If the hashes are different, the strings are certainly different. If the hashes are the same, you still have to compare the strings because of the Pigeonhole Principle.
A simple example: just accessing values in two maps with an equal number of keys, one with int keys and the other with the same values stored as string keys, takes about 8 times longer with the string keys.
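The figure will vary with key length, compiler, and standard library, but here is a rough sketch of the kind of micro-benchmark behind such a claim (note that the string version also pays for building the key on each iteration):

#include <chrono>
#include <iostream>
#include <map>
#include <string>

int main() {
    const int N = 100000;
    std::map<int, int> byInt;
    std::map<std::string, int> byString;
    for (int i = 0; i < N; ++i) {
        byInt[i] = i;
        byString[std::to_string(i)] = i;
    }

    // Small timing helper returning elapsed seconds.
    auto timeIt = [](auto&& body) {
        auto start = std::chrono::steady_clock::now();
        body();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    };

    long long sum = 0;
    double tInt = timeIt([&] { for (int i = 0; i < N; ++i) sum += byInt.find(i)->second; });
    double tStr = timeIt([&] { for (int i = 0; i < N; ++i) sum += byString.find(std::to_string(i))->second; });

    std::cout << "int keys: " << tInt << " s, string keys: " << tStr << " s (sum " << sum << ")\n";
}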