Data structure for O(log N) find and update, considering small L1 cache - c++

I'm currently working on an embedded device project where I'm running into performance problems. Profiling has located an O(N) operation that I'd like to eliminate.
I basically have two arrays int A[N] and short B[N]. Entries in A are unique and ordered by external constraints. The most common operation is to check if a particular value a appears in A[]. Less frequently, but still common is a change to an element of A[]. The new value is unrelated to the previous value.
Since the most common operation is the find, that's where B[] comes in. It's a sorted array of indices in A[], such that A[B[i]] < A[B[j]] if and only if i<j. That means that I can find values in A using a binary search.
Of course, when I update A[k], I have to find k in B and move it to a new position, to maintain the search order. Since I know the old and new values of A[k], that's just a memmove() of a subset of B[] between the old and new position of k. This is the O(N) operation that I need to fix; since the old and new values of A[k] are essentially random, I'm moving about N/3 elements on average.
I looked into std::make_heap using [](int i, int j) { return A[i] < A[j]; } as the predicate. In that case I can easily make B[0] point to the smallest element of A, and updating B is now a cheap O(log N) rebalancing operation. However, I generally don't need the smallest value of A, I need to find whether any given value is present. And that's now an O(N) search in B (half of my N elements are at heap depth log N, a quarter at depth (log N)-1, etc., and the heap order gives no guidance on which subtree holds a value), which is no improvement over a dumb O(N) search directly in A.
Considering that std::set has O(log N) insert and find, I'd say that it should be possible to get the same performance here for update and find. But how do I do that? Do I need another order for B? A different type?
B is currently a short [N] because A and B together are about the size of my CPU cache, and my main memory is a lot slower. Going from 6*N to 8*N bytes would not be nice, but still acceptable if my find and update go to O(log N) both.
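To make the setup concrete, here is a minimal sketch of the current scheme (sizes and names here are illustrative, not my exact code):

#include <algorithm>

const int N = 1000;
int   A[N];        // unique values, order fixed by external constraints
short B[N];        // indices into A, kept sorted so that A[B[i]] is ascending

// The O(log N) find: binary search over B, comparing through A.
bool contains(int a) {
    auto it = std::lower_bound(B, B + N, a,
                               [](short idx, int value) { return A[idx] < value; });
    return it != B + N && A[*it] == a;
}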

If the only operations are (1) check if value 'a' belongs to A and (2) update values in A, why don't you use a hash table in place of the sorted array B? Especially since A does not grow or shrink in size and only the values change, this would be a much better solution. A hash table does not require significantly more memory than an array. (Alternatively, B could be changed not to a heap but to a binary search tree, which could be self-balancing, e.g. a splay tree or a red-black tree. However, trees require extra memory because of the left and right pointers.)
A practical solution that grows memory use from 6N to 8N bytes is to aim for an exactly 50%-filled hash table, i.e. use a hash table that consists of an array of 2N shorts. I would recommend implementing the Cuckoo Hashing mechanism (see http://en.wikipedia.org/wiki/Cuckoo_hashing). Read the article further and you'll find that you can get load factors above 50% (i.e. push memory consumption down from 8N towards, say, 7N) by using more hash functions: "Using just three hash functions increases the load to 91%."
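A hedged sketch of what that could look like, with H holding short indices into A, keyed by the value A[index]. The sizes, hash functions and names here are illustrative assumptions, not a drop-in implementation:

#include <cstdint>
#include <cstdlib>
#include <cstring>

const int N     = 1000;
const int TABLE = 2048;                 // ~2N slots, power of two -> ~50% load factor

int   A[N];                             // the value array from the question
short H[TABLE];                         // hash table of indices into A, -1 = empty slot

// Two cheap multiplicative hashes; 21 = 32 - log2(TABLE).
unsigned h1(int key) { return (std::uint32_t)key * 2654435761u >> 21; }
unsigned h2(int key) { return ((std::uint32_t)key ^ 0x9e3779b9u) * 2246822519u >> 21; }

void init() { std::memset(H, -1, sizeof H); }   // all bytes 0xFF -> every short is -1

bool contains(int a) {
    short i = H[h1(a)];
    if (i >= 0 && A[i] == a) return true;
    i = H[h2(a)];
    return i >= 0 && A[i] == a;
}

void erase(int a) {                     // remove the entry whose key is a (if present)
    unsigned p = h1(a);
    if (H[p] >= 0 && A[H[p]] == a) { H[p] = -1; return; }
    p = h2(a);
    if (H[p] >= 0 && A[H[p]] == a) H[p] = -1;
}

void insert(short idx) {                // assumes A[idx] is not already present
    short cur = idx;
    unsigned pos = h1(A[cur]);
    for (int n = 0; n < 64; ++n) {      // bounded displacement chain
        short evicted = H[pos];
        H[pos] = cur;
        if (evicted < 0) return;        // landed in an empty slot, done
        cur = evicted;                  // move the evicted entry to its alternate slot
        pos = (pos == h1(A[cur])) ? h2(A[cur]) : h1(A[cur]);
    }
    std::abort();                       // real code would rehash with new hash functions here
}

// Updating A[k] then becomes: erase(A[k]); A[k] = new_value; insert(k);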
From Wikipedia:
A study by Zukowski et al. has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis, with its performance compared against alternative hashing schemes.

std::set usually provides the O(log(n)) insert and delete by using a binary search tree. This unfortunately uses 3*N space for most pointer-based implementations: assuming word-sized data, that's 1 word for the data and 2 for the pointers to the left and right child of each node.
If you have some constant N and can guarantee that ceil(log2(N)) is less than half the word size, you can use a fixed-length array of 2*N words holding the tree nodes: each node uses 1 word for the data and 1 word for the indexes of its two children, stored as the upper and lower half of that word. Whether this would let you use a self-balancing binary search tree of some manner depends on your N and word size. For a 16-bit word you only get N = 256, but for 32 bits it's 65k.
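A rough sketch of that node layout, assuming 32-bit words and N <= 65535 so each child index fits in half a word (Node, NIL and the accessors are made-up names):

#include <cstdint>

struct Node {
    std::int32_t  value;      // 1 word of data
    std::uint32_t children;   // left child index in the high half, right child in the low half
};

const std::uint16_t NIL = 0xFFFF;                 // sentinel: "no child"

std::uint16_t left(const Node& n)  { return std::uint16_t(n.children >> 16); }
std::uint16_t right(const Node& n) { return std::uint16_t(n.children & 0xFFFFu); }
void set_left(Node& n, std::uint16_t i)  { n.children = (n.children & 0x0000FFFFu) | (std::uint32_t(i) << 16); }
void set_right(Node& n, std::uint16_t i) { n.children = (n.children & 0xFFFF0000u) | i; }

// Fixed-length node pool: 2*N words total, no per-node heap allocation.
Node pool[4096];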

Since you have limited N, can't you use std::set<short, cmp, pool_allocator> B with Boost's pool_allocator?
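Something like this, assuming C++14 heterogeneous lookup so you can find by value rather than by index (the comparator and names are illustrative; boost::fast_pool_allocator comes from <boost/pool/pool_alloc.hpp>):

#include <set>
#include <boost/pool/pool_alloc.hpp>

int A[1000];   // stands in for the question's value array

// Transparent comparator: orders indices by A[index], and also compares an index
// against a raw value so B.find(value) works directly (C++14 heterogeneous lookup).
struct CmpByA {
    using is_transparent = void;
    bool operator()(short i, short j) const { return A[i] < A[j]; }
    bool operator()(short i, int a)   const { return A[i] < a; }
    bool operator()(int a,   short i) const { return a < A[i]; }
};

using IndexSet = std::set<short, CmpByA, boost::fast_pool_allocator<short>>;

bool contains(const IndexSet& b, int a) { return b.find(a) != b.end(); }

void update(IndexSet& b, short k, int new_value) {
    b.erase(k);          // remove the index while A[k] still has its old value
    A[k] = new_value;
    b.insert(k);         // re-insert at the position implied by the new value
}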

Related

How to efficiently look up elements in a large vector

I have a vector<unsigned> of size (90,000 * 9,000). I need to check many times whether an element exists in this vector or not.
For doing so, I stored the vector in a sorted form using std::sort() and then looked up elements in the vector using std::binary_search(). However on profiling using perf I find that looking up elements in vector<unsigned> is the slowest operation.
Can someone suggest some data-structure in C/C++ which I can use to efficiently look up elements in a vector of (90,000 * 9,000) elements.
I perform insertion (bulk-insertion) only once. The rest of the time I perform only lookups, so the main overhead here is because of lookups.
You've got 810 million values out of 4 billion possible values (assuming 32-bit unsigned). That's 1/5th of the total range, and uses 3.2 GB. This means you're in fact better off with a std::vector<bool> with 4 billion bits. This gives you O(1) lookup in less space (0.5 GB).
(In theory, unsigned could be 16 bits. unsigned long is at least 32 bits, std::uint32_t might be what you want)
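A minimal sketch of that bitmap approach, assuming a 64-bit build so the half-gigabyte vector<bool> can be allocated (function names are illustrative):

#include <cstdint>
#include <vector>

// One bit per possible 32-bit value (~0.5 GB), giving O(1) membership tests
// after a single O(N) build pass.
std::vector<bool> build_bitmap(const std::vector<std::uint32_t>& data) {
    std::vector<bool> present(std::uint64_t(1) << 32, false);
    for (std::uint32_t v : data) present[v] = true;
    return present;
}

bool contains(const std::vector<bool>& present, std::uint32_t v) {
    return present[v];
}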
Depending on the actual data structure behind the container, a contains operation may take O(n) or O(1). For a plain array or a linked list it's O(N), since contains is a full scan in the worst case. You have mitigated the full scan by ordering and using binary search, which is O(log N). Log N is pretty good complexity, with only O(1) being better. So your choices are:
Cache lookup results for the items; this might be a good compromise if you have many repetitions of the same element
Replace the vector with another data structure with an efficient contains operation, such as one based on a hash table or a set. Note you may lose other features, such as the ordering of items
Use two data structures, one for contains operations and the original vector for whatever else you use it for (see the sketch after this list)
Use a third data structure that offers a compromise, for example a Bloom filter (compact and fast, at the cost of occasional false positives)
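A minimal sketch of the two-data-structures option, keeping the original vector and adding a std::unordered_set built once after the bulk insertion (names and sample values are illustrative):

#include <unordered_set>
#include <vector>

bool contains(const std::unordered_set<unsigned>& index, unsigned v) {
    return index.find(v) != index.end();
}

int main() {
    std::vector<unsigned> values = {42u, 7u, 1000000u};   // stands in for the real bulk-inserted data
    // Built once after the bulk insertion; costs roughly one extra copy of the data in memory.
    std::unordered_set<unsigned> index(values.begin(), values.end());
    return contains(index, 42u) ? 0 : 1;
}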
However on profiling using perf I find that looking up elements in vector is the slowest operation.
That is half of the information you need, the other half being "how fast is it compared to other algorithms/containers?" Maybe using std::vector<> is actually the fastest, or maybe it's the slowest. To find out you'll have to benchmark/profile a few different designs.
For example, the following are very naive benchmarks using random integers on 1000x9000 sized containers (I would get seg-faults on larger sizes for the maps, presumably a limit of 32-bit memory).
If you need a count of non-unique integers:
std::vector<unsigned> = 500 ms
std::map<unsigned, unsigned> = 1700 ms
std::unordered_map<unsigned, unsigned> = 3700 ms
If you just need to test for the presence of unique integers:
std::vector<bool> = 15 ms
std::bitset<> = 50 ms
std::set<unsigned> = 350 ms
Note that we're not too interested in the exact values but rather the relative comparisons between containers. std::map<> is relatively slow, which is not surprising given the number of dynamic allocations and the non-locality of the data involved. The bitsets are by far the fastest but don't work if you need the counts of non-unique integers.
I would suggest doing a similar benchmark using your exact container sizes and contents, both of which may well affect the benchmark results. It may turn out that std::vector<> may be the best solution after all but now you have some data to back up that design choice.
If you do not need to iterate through the collection (in a sorted manner), then since C++11 you can use std::unordered_set<yourtype>. All you need to do is provide the collection with a way of hashing and comparing yourtype for equality. Accessing an element of the collection is then O(1) on average, unlike a sorted vector where it's O(log(n)).
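For example, a sketch with a hypothetical Point standing in for yourtype, supplying the hash and equality functors yourself:

#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point { int x, y; };   // hypothetical stand-in for "yourtype"

struct PointHash {
    std::size_t operator()(const Point& p) const {
        return std::hash<int>()(p.x) * 31u + std::hash<int>()(p.y);
    }
};
struct PointEq {
    bool operator()(const Point& a, const Point& b) const { return a.x == b.x && a.y == b.y; }
};

std::unordered_set<Point, PointHash, PointEq> points;   // average O(1) insert and find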

Performance of vector sort/unique/erase vs. copy to unordered_set

I have a function that gets all neighbours of a list of points in a grid out to a certain distance, which involves a lot of duplicates (my neighbour's neighbour == me again).
I've been experimenting with a couple of different solutions, but I have no idea which is the more efficient. Below is some code demonstrating two solutions running side by side, one using std::vector sort-unique-erase, the other using std::copy into a std::unordered_set.
I also tried another solution, which is to pass the vector containing the neighbours so far to the neighbour function, which will use std::find to ensure a neighbour doesn't already exist before adding it.
So three solutions, but I can't quite wrap my head around which is gonna be faster. Any ideas anyone?
Code snippet follows:
// Vector of all neighbours of all modified phi points, which may initially include duplicates.
std::vector<VecDi> aneighs;
// Hash function, mapping points to their norm distance.
auto hasher = [&] (const VecDi& a) {
    return std::hash<UINT>()(a.squaredNorm() >> 2);
};
// Unordered set for storing neighbours without duplication.
// Note: a capturing lambda cannot convert to a function pointer, so use decltype(hasher) as the hash type.
std::unordered_set<VecDi, decltype(hasher)> sneighs(phi.dims().squaredNorm() >> 2, hasher);
... compute big long list of points including many duplicates ...
// Insert neighbours into unordered_set to remove duplicates.
std::copy(aneighs.begin(), aneighs.end(), std::inserter(sneighs, sneighs.end()));
// De-dupe neighbours list.
// TODO: is this method faster or slower than unordered_set?
std::sort(aneighs.begin(), aneighs.end(), [&] (const VecDi& a, const VecDi& b) {
    const UINT aidx = Grid<VecDi, D>::index(a, phi.dims(), phi.offset());
    const UINT bidx = Grid<VecDi, D>::index(b, phi.dims(), phi.offset());
    return aidx < bidx;
});
aneighs.erase(std::unique(aneighs.begin(), aneighs.end()), aneighs.end());
A great deal here is likely to depend on the size of the output set (which, in turn, will depend on how distant the neighbours you sample are).
If it's small (no more than a few dozen items or so), your hand-rolled set implementation using std::vector and std::find will probably remain fairly competitive. Its problem is that it's an O(N^2) algorithm -- each time you insert an item, you have to search all the existing items, so each insertion is linear in the number of items already in the set. Therefore, as the set grows larger, its time to insert items grows roughly quadratically.
Using std::set, each insertion has to do only approximately log2(N) comparisons instead of N comparisons. That reduces the overall complexity from O(N^2) to O(N log N). The major shortcoming is that it's (at least normally) implemented as a tree built up of individually allocated nodes. That typically reduces its locality of reference -- i.e., each item you insert will consist of the data itself plus some pointers, and traversing the tree means following pointers around. Since they're allocated individually, chances are pretty good that nodes that are (currently) adjacent in the tree won't be adjacent in memory, so you'll see a fair number of cache misses. Bottom line: while its speed grows fairly slowly as the number of items increases, the constants involved are fairly large -- for a small number of items, it'll start out fairly slow (typically quite a bit slower than your hand-rolled version).
Using vector/sort/unique combines some of the advantages of each of the preceding. Storing the items in a vector (without extra pointers for each) typically leads to better cache usage -- items at adjacent indexes are also at adjacent memory locations, so when you insert a new item, chances are that the location for the new item will already be in the cache. The major disadvantage is that if you're dealing with a really large set, this could use quite a bit more memory. Whereas a set eliminates duplicates as you insert each item (i.e., an item will only be inserted if it's different from anything already in the set), this will insert all the items and only delete the duplicates at the end. Given current memory availability and the number of neighbors I'd guess you're probably visiting, I doubt this is a major disadvantage in practice, but under the wrong circumstances it could lead to a serious problem -- nearly any use of virtual memory would almost certainly make it a net loss.
Looking at the last from a complexity viewpoint, it's going to be O(N log N), sort of like the set. The difference is that with the set it's really more like O(N log M), where N is the total number of neighbors and M is the number of unique neighbors. With the vector, it's really O(N log N), where N is (again) the total number of neighbors. As such, if the number of duplicates is extremely large, a set could have a significant algorithmic advantage.
It's also possible to implement a set-like structure in purely linear sequences. This retains the set's advantage of only storing unique items, but also the vector's locality-of-reference advantage. The idea is to keep most of the current set sorted, so you can search it in log(N) complexity. When you insert a new item, however, you just put it in a separate vector (or an unsorted portion of the existing vector). When you do a lookup, you search the sorted part with a binary search and also do a linear search over those unsorted items.
When that unsorted part gets too large (for some definition of "too large") you sort those items and merge them into the main group, then start the same sequence again. If you define "too large" in terms of "log N" (where N is the number of items in the sorted group) you can retain O(N log N) complexity for the data structure as a whole. When I've played with it, I've found that the unsorted portion can be larger than I'd have expected before it starts to cause a problem though.
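A rough sketch of that hybrid structure, under the assumptions above (HybridSet and the log-based threshold are illustrative choices, not a tuned implementation):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

template <typename T>
class HybridSet {
    std::vector<T> sorted_;    // main part, kept sorted; binary-searched
    std::vector<T> recent_;    // small unsorted tail; searched linearly
public:
    bool contains(const T& v) const {
        return std::binary_search(sorted_.begin(), sorted_.end(), v)
            || std::find(recent_.begin(), recent_.end(), v) != recent_.end();
    }
    void insert(const T& v) {
        if (contains(v)) return;                 // keep elements unique
        recent_.push_back(v);
        if (recent_.size() > threshold()) {      // tail too large: merge it into the sorted part
            std::sort(recent_.begin(), recent_.end());
            std::vector<T> merged;
            merged.reserve(sorted_.size() + recent_.size());
            std::merge(sorted_.begin(), sorted_.end(),
                       recent_.begin(), recent_.end(), std::back_inserter(merged));
            sorted_.swap(merged);
            recent_.clear();
        }
    }
private:
    std::size_t threshold() const {              // roughly log2(size of sorted part), with a small floor
        std::size_t t = 1, n = sorted_.size();
        while (n > 1) { n >>= 1; ++t; }
        return t < 8 ? 8 : t;
    }
};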
The unordered set has constant O(1) time complexity for insertion (on average), so inserting all the elements is O(n), where n is the number of elements before duplicate removal.
Sorting a list of n elements is O(n log n), and going over the list to remove duplicates is O(n); O(n log n) + O(n) = O(n log n).
The unordered set (which performs like a hash table) is better.
Data about unordered set timings:
http://en.cppreference.com/w/cpp/container/unordered_set

What is the fastest data structure to search and update list of integer values?

I have to maintain a list of unordered integers, where the number of integers is unknown. It may increase or decrease over time. I need to update this list of integers frequently. I have tried using a vector, but it is really slow. An array appears to be faster, but since the length of the list is not fixed, it takes a significant amount of time to resize it. Please suggest any other option.
Use a hash table, if the order of the values is unimportant. Time is O(1). I'm pretty sure you'll find an implementation in the standard library.
Failing that, a splay tree is extremely fast, especially if you want to keep the list ordered: amortized cost of O(ln n) per operation, with a very low constant factor. C++'s std::map is a balanced binary search tree (typically a red-black tree), which gives similar O(log n) guarantees.
Know thy data structures.
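A minimal sketch of the hash-table option using the standard library (names are illustrative):

#include <unordered_set>

std::unordered_set<int> values;    // grows and shrinks freely, no manual resizing

void add(int v)      { values.insert(v); }
void remove(int v)   { values.erase(v); }
bool contains(int v) { return values.count(v) != 0; }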
If you are interested in dynamically growing the array's size you can do this:
int current = 0;
int **x = (int**)malloc(1024 * sizeof(int*));   /* room for up to 1024 chunks */
x[current] = (int*)malloc(RequiredLength * sizeof(int));
So add elements to the array, and when the elements in x[current] are filled up, you can add space for more elements by doing
x[++current] = (int*)malloc(RequiredLength * sizeof(int));
Doing this you can accommodate RequiredLength more elements. You can repeat this up to 1024 times, which means 1024*RequiredLength elements can be accommodated; this gives you the chance to increase the size of the array whenever you want.
You can always access the n-th element as x[n / RequiredLength][n % RequiredLength];
Considering your comments, it looks like std::set or std::unordered_set fits your needs better than std::vector.
If sequential data structures fail to meet your requirements, you could try looking at trees (binary, AVL, m-way, red-black, etc.). I would suggest you try to implement an AVL tree, since it yields a balanced or near-balanced binary search tree, which would optimize your operations. For more on AVL trees: http://en.wikipedia.org/wiki/AVL_tree
Well, deque has no resize cost, but if it's unordered its search time is linear, and its delete and insert operations in the middle are even worse than vector's.
If you don't need to search by the value of the number, a hash map or map may be your choice, with no resize cost: set the key of the map to the number's index and the value to the number's value. Search and insert operations are then better than linear.
std::list is definitely created for such problems; adding and deleting elements in a list does not necessitate memory reallocations like in a vector. However, due to the non-contiguous memory allocation of the list, searching for elements may prove to be a painful experience, of course, but if you do not search its entries frequently, it can be used.

Unordered_set questions

Could anyone explain how an unordered set works? I am also not sure how a set works. My main question is what is the efficiency of its find function.
For example, what is the total big O run time of this?
vector<int> theFirst;
vector<int> theSecond;
vector<int> theMatch;
theFirst.push_back( -2147483648 );
theFirst.push_back(2);
theFirst.push_back(44);
theSecond.push_back(2);
theSecond.push_back( -2147483648 );
theSecond.push_back( 33 );
//1) Place the contents into an unordered set; that is O(m).
//2) O(n) lookup, so that's O(m + n).
//3) Add them to a third structure, so that's O(t).
//4) All together it becomes O(m + n + t)
unordered_set<int> theUnorderedSet(theFirst.begin(), theFirst.end());
for(int i = 0; i < theSecond.size(); i++)
{
    if(theUnorderedSet.find(theSecond[i]) != theUnorderedSet.end())
    {
        theMatch.push_back( theSecond[i] );
        cout << theSecond[i];
    }
}
unordered_set and all the other unordered_ data structures use hashing, as mentioned by @Sean. Hashing involves amortized constant time for insertion, and close to constant time for lookup. A hash function essentially takes some information and produces a number from it. It is a function in the sense that the same input has to produce the same output. However, different inputs can result in the same output, resulting in what is termed a collision. Lookup would be guaranteed to be constant time for a "perfect hash function", that is, one with no collisions.
In practice, the input number comes from the element you store in the structure (say its value, if it is a primitive type) and maps it to a location in the data structure. Hence, for a given key, the function takes you to the place where the element is stored without need for any traversals or searches (ignoring collisions here for simplicity), hence constant time. There are different implementations of these structures (open addressing, chaining, etc.). See hash table, hash function. I also recommend section 3.7 of The Algorithm Design Manual by Skiena.
Now, concerning big-O complexity, you are right that you have O(m) + O(n) + O(size of overlap). Since the overlap cannot be bigger than the smaller of m and n, the overall complexity can be expressed as O(kN), where N is the larger of m and n. So, O(N). Again, this is the "best case", without collisions, and with perfect hashing.
set and multiset, on the other hand, use binary trees, so insertions and lookups are typically O(log N). The actual performance of a hashed structure vs. a binary-tree one will depend on N, so it is best to try the two approaches and profile them in a realistic running scenario.
All of the std::unordered_*() data types make use of a hash to perform lookups. Look at Boost's documentation on the subject and I think you'll gain an understanding very quickly.
http://www.boost.org/doc/libs/1_46_1/doc/html/unordered.html

Fast Algorithm for finding largest values in 2d array

I have a 2D array (an image actually) that is size N x N. I need to find the indices of the M largest values in the array ( M << N x N) . Linearized index or the 2D coords are both fine. The array must remain intact (since it's an image). I can make a copy for scratch, but sorting the array will bugger up the indices.
I'm fine with doing a full pass over the array (ie. O(N^2) is fine). Anyone have a good algorithm for doing this as efficiently as possible?
Selection is sorting's austere sister (repeat this ten times in a row). Selection algorithms are less known than sort algorithms, but nonetheless useful.
You can't do better than O(N^2) (in N) here, since nothing lets you rule out having to visit each element of the array.
A good approach is to keep a priority queue made of the M largest elements. This makes something O(N x N x log M).
You traverse the array, enqueuing (element, index) pairs as you go. The queue keeps its elements sorted by the first component.
Once the queue has M elements, instead of simply enqueuing you now (see the sketch after this list):
Query the min element of the queue
If the current element of the array is greater, insert it into the queue and discard the min element of the queue
Else do nothing.
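Here is a sketch of that loop using std::priority_queue as a min-heap; img is a hypothetical row-major N*N image and the function returns linearized indices (all names are illustrative):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> largest_m(const std::vector<int>& img, std::size_t m) {
    // Min-heap ordered by pixel value: the top is the smallest of the current "best M".
    using Entry = std::pair<int, int>;                       // (value, linear index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (int i = 0; i < (int)img.size(); ++i) {
        if (heap.size() < m) {
            heap.push({img[i], i});
        } else if (img[i] > heap.top().first) {
            heap.pop();                                      // discard the current minimum
            heap.push({img[i], i});
        }
    }

    std::vector<int> idx;
    idx.reserve(heap.size());
    while (!heap.empty()) { idx.push_back(heap.top().second); heap.pop(); }
    return idx;                                              // indices of the M largest values
}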
If M is bigger, sorting the array is preferable.
NOTE: @Andy Finkenstadt makes a good point (in the comments to your question): you definitely should traverse your array in the "direction of data locality", i.e. make sure that you read memory contiguously.
Also, this is trivially parallelizable; the only non-parallelizable part is merging the queues when joining the sub-processes.
You could copy the array into a single-dimensional array of tuples (value, original X, original Y) and build a basic heap out of it in O(n) time, provided you implement the heap as an array.
You could then retrieve the M largest tuples in O(M lg n) time and reference their original x and y from the tuple.
If you are going to make a copy of the input array in order to do a sort, that's way worse than just walking linearly through the whole thing to pick out numbers.
So the question is how big is your M? If it is small, you can store results (i.e. structs with 2D indexes and values) in a simple array or a vector. That'll minimize heap operations but when you find a larger value than what's in your vector, you'll have to shift things around.
If you expect M to get really large, then you may need a better data structure like a binary tree (std::set) or a sorted std::deque. std::set will reduce the number of times elements must be shifted in memory, while a std::deque will do some shifting but will significantly reduce the number of times you have to go to the heap, which may give you better performance.
Your problem doesn't use the 2 dimensions in any interesting way; it is easier to consider the equivalent problem in a 1D array.
There are 2 main ways to solve this problem:
Maintain a set of the M largest elements, and iterate through the array. (Using a heap allows you to do this efficiently.)
This is simple and is probably better in your case (M << N)
Use selection (the following algorithm is an adaptation of quicksort):
Create an auxiliary array, containing the indexes [1..N].
Choose an arbitrary index (and its corresponding value), and partition the index array so that indexes corresponding to smaller elements go to the left and bigger elements go to the right.
Repeat the process, binary search style until you narrow down the M largest elements.
This is good for cases with large M (a sketch using std::nth_element follows below). If you want to avoid the worst-case issues (the same ones quicksort has), then look at more advanced algorithms, like median-of-medians selection.
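A minimal sketch of the selection approach, letting std::nth_element do the quickselect-style partitioning over an auxiliary index array (names are illustrative; assumes m <= img.size()):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> largest_m_select(const std::vector<int>& img, std::size_t m) {
    std::vector<int> idx(img.size());
    std::iota(idx.begin(), idx.end(), 0);          // linearized indexes 0..N*N-1

    // After this call the first m indexes refer to the m largest values (in no particular order).
    std::nth_element(idx.begin(), idx.begin() + m, idx.end(),
                     [&](int a, int b) { return img[a] > img[b]; });

    idx.resize(m);
    return idx;
}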
How many times do you search for the largest value from the array?
If you only search 1 time, then just scan through it keeping the M largest ones.
If you do it many times, just insert the values into a sorted list (probably best implemented as a balanced tree).