Least expensive equality search for small unsorted arrays - C++

What is the most efficient method, in terms of execution time, to search a small array of about 4 to 16 elements for a matching element in C++? The element being searched for is a pointer in this case, so it's relatively small.
(My purpose is to prevent points in a point cloud from creating edges with points that already share an edge with them. The edge array is small for each point, but there can be a massive number of points. I'm also just curious!)

Your best bet is to profile your specific application with a variety of mechanisms and see which performs best.
I suspect that, given the array is unsorted, a straight linear search will be best for you. If you're able to pre-sort the array once and it updates infrequently or never, you could pre-sort it and then use a binary search.
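A minimal sketch of the linear-scan approach, assuming the edge list is a small contiguous array of pointers (Point and the function name are illustrative, not from the question):

    #include <algorithm>
    #include <cstddef>

    struct Point;  // assumed element type; any pointer type works the same way

    // Returns true if target already appears among the first `count` edges.
    bool contains(Point* const* edges, std::size_t count, const Point* target) {
        // std::find compiles down to a plain linear scan, which is very
        // cache-friendly for 4-16 contiguous pointers.
        return std::find(edges, edges + count, target) != edges + count;
    }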

Try a linear search; try starting with one or more binary chop stages. The former involves more comparisons on average; the latter has more scope for cache misses and branch mispredictions, and requires that the array is pre-sorted.
Only by measuring can you tell which is faster, and then only on the platform you measured on.

If you have to do this search more than once, and the array doesn't change often/at all, sort it and then use a binary search.
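A rough sketch of that sort-once, search-many variant, under the assumption that the array rarely changes (names are illustrative; std::less is used because raw < is not a guaranteed total order over unrelated pointers):

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Pay the sort cost once, e.g. after the edge list is built.
    void prepare(std::vector<const void*>& edges) {
        std::sort(edges.begin(), edges.end(), std::less<const void*>());
    }

    // O(log n) per lookup afterwards; only worthwhile if the sort cost is
    // amortized over many searches.
    bool contains(const std::vector<const void*>& edges, const void* target) {
        return std::binary_search(edges.begin(), edges.end(), target,
                                  std::less<const void*>());
    }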

Related

Deciding when to use a hash table

I was solving a competitive programming problem with the following requirements:
I had to maintain a list of unique 2D points (x, y); the number of unique points would be less than 500.
My idea was to store them in a hash table (a C++ unordered_set, to be specific), and each time a node turned up I would look it up in the table and insert it if it was not already there.
I also know for a fact that I wouldn't be doing more than 500 lookups.
I then saw some solutions simply searching through an (unsorted) array and checking whether the node was already there before inserting.
My question is: is there any reasonable way to guess when I should use a hash table over a manual search over keys, without having to benchmark them?
I am guessing you are familiar with basic algorithmics and time complexity, and with the C++ standard containers, and that you know that, with luck, hash table access is O(1).
If the hash table code (or some balanced tree code, e.g. using std::map, assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.
Otherwise, you might make some guess taking into account the approximate timing of various operations on a PC. BTW, the entire http://norvig.com/21-days.html page is worth reading.
Basically, memory accesses are much slower than everything else in the CPU, and the CPU cache is extremely important. A typical memory access that misses the cache and has to fetch data from the DRAM modules is several hundred times slower than an elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).
In practice, it does not matter that much, as long as your data is tiny (e.g. less than a thousand elements), since in that case it is likely to sit in L2 cache.
Searching (linearly) in an array is really fast (since it is very cache-friendly), up to several thousand (small) elements.
IIRC, Herb Sutter mentions in some video that even inserting an element into the middle of a vector is practically (but unintuitively) faster, taking into account the time needed to shift the tail of the array, than inserting it into some balanced tree (or perhaps some other container, e.g. a hash table), up to a container size of several thousand small elements. This is on a typical tablet, desktop, or server microprocessor with a multi-megabyte cache. YMMV.
If you really care that much, you cannot avoid benchmarking.
Notice that 500 pairs of integers probably fit into the L1 cache!
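To make the trade-off concrete, here is a minimal sketch of both alternatives for up to 500 unique 2D points (the names and the 32-bit coordinate assumption are mine, not from the question):

    #include <cstdint>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    using Point = std::pair<int, int>;

    // Linear scan over a contiguous array: O(n) per lookup, but very
    // cache-friendly for a few hundred small elements.
    bool insert_if_new(std::vector<Point>& pts, const Point& p) {
        for (const Point& q : pts)
            if (q == p) return false;      // already present
        pts.push_back(p);
        return true;
    }

    // Hash-table version: pack (x, y) into one 64-bit key so that
    // std::unordered_set works without a custom hash.
    bool insert_if_new(std::unordered_set<std::uint64_t>& seen, int x, int y) {
        std::uint64_t key =
            (std::uint64_t(std::uint32_t(x)) << 32) | std::uint32_t(y);
        return seen.insert(key).second;    // true if newly inserted
    }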
My rule of thumb is to assume the processor can deal with 10^9 operations per second.
In your case there are only 500 entries, so an algorithm up to O(N^2) should be safe: 500^2 is only 250,000 operations, far below that budget. By using a contiguous data structure like a vector you can also leverage fast cache hits, and a hash function can be costly in terms of constant factors. However, if you had a data size of 10^6, the safe total complexity might be only O(N); in that case you would need to consider an O(1) hashmap for each lookup.
You can use big-O complexity to roughly estimate the performance. For a hash table, searching for an element is O(1) on average and O(n) in the worst case. That means that in the typical case your access time is independent of the number of elements in your map, but in the worst case it is linearly dependent on the size of your hash table.
A balanced binary tree has a guaranteed search complexity of O(log(n)). That means that searching for an element always depends on the size of the container, but in the worst case it's faster than a hash table.
You can look up some Big O Complexities at this handy website here: http://bigocheatsheet.com/

Is it possible to implement median of medians introselect with no swaps or heap allocations?

So I've run into a problem in some code I'm writing in C++. I need to find the median of an array of points with an offset and step size (example).
This code will be executed millions of times, as it's part of one of my core data structures, so I'm trying to make it as fast as possible.
Research has led me to believe that, for the best worst-case time complexity, introselect is the fastest way to find a median in a set of unordered values. I have some additional limitations that have to do with optimization:
I can't swap any values in the array. The values in the array are all exactly where they need to be based on that context in the program. But I still need the median.
I can't make any "new" allocations or call anything that allocates on the heap. Or if I have to, they need to be kept to a minimum, as they are costly.
I've tried implementing the following in C++:
First Second Third
Is this possible? Or are there alternatives that are just as fast at finding the median and fit those requirements?
You could consider using the same heap allocation for all operations, and avoid freeing it until you're done. This way, rather than creating millions of arrays you just create one.
Of course this approach is more complex if you're doing these find-median operations in parallel. You'd need one array per thread.
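A minimal sketch of that idea: copy the strided elements into a scratch buffer that is allocated once and reused, then let std::nth_element (typically implemented as introselect) partition the copy, leaving the original array untouched. The names and the double element type are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // `scratch` is created once (per thread) and reused across calls;
    // reserve it to the maximum count up front so push_back never allocates.
    double strided_median(const double* data, std::size_t offset,
                          std::size_t step, std::size_t count,
                          std::vector<double>& scratch) {
        scratch.clear();                   // keeps the existing capacity
        for (std::size_t i = 0; i < count; ++i)
            scratch.push_back(data[offset + i * step]);
        auto mid = scratch.begin() + scratch.size() / 2;
        std::nth_element(scratch.begin(), mid, scratch.end());
        return *mid;                       // the original array is untouched
    }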

Being cache efficient with data, mainly arrays

I have recently started to look into being cache-efficient by trying to avoid cache misses in C++. So far I have taken away the following:
Try to avoid linked-list-like objects where possible when processing. Instead, use them to point to contiguous data that you can store in cache and perform operations on.
Be careful about holding state in classes, as it can make the above more difficult.
Use structs when allocating on the heap, as this helps in localising data.
Try to use 1D arrays when possible for lists of data.
So my question is broken into two parts:
Is the above correct? Have I made any fundamental misunderstandings?
When dealing with 2D arrays I have seen other users recommend the use of Hilbert curves. I do not understand how this provides a speed increase over using division and modulus operators on an index to simulate a 2D array, as the latter is surely fewer instructions, which is good for speed and instruction-cache usage?
Thanks for reading.
P.S. I do not have a CompSci background therefore, if you notice anything that I have said that is incorrect I would appreciate it if you could alert me so that I can read around that topic.
Your approach is flawed for at least one reason: you are willing to sacrifice everything to avoid cache misses. How do you know that cache misses are the major performance factor in your code?
For example, there are MANY cases where the use of a linked list is better than a contiguous array, specifically where you frequently insert/delete items. You would pay greatly for compacting or expanding an array.
So the answer to your first question is: yes, you will improve data locality using those four principles, but possibly at a cost greater than the savings.
For the second question, I suggest you read about Hilbert curves. You don't need them if you are processing your 2D array in order, row-by-row. They will help a lot (with data locality) if you process some area of your 2D array, because the distance between elements in the same column/different rows is much smaller that way.
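As a small illustration of why access order (rather than the index arithmetic) dominates, consider a 1D array used as a 2D grid; the row-major loop touches consecutive addresses, while the column-major loop strides by `width` elements and misses the cache far more often:

    #include <vector>

    // Fast: consecutive addresses, so cache lines and the prefetcher help.
    void scale_row_major(std::vector<float>& grid, int width, int height, float k) {
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                grid[y * width + x] *= k;
    }

    // Slow on large grids: each access jumps `width` floats ahead.
    void scale_column_major(std::vector<float>& grid, int width, int height, float k) {
        for (int x = 0; x < width; ++x)
            for (int y = 0; y < height; ++y)
                grid[y * width + x] *= k;
    }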

How to implement Radix sort on multi-GPU?

How to implement radix sort on multi-GPU – the same way as on a single GPU, i.e. by splitting the data, building histograms on separate GPUs, and then merging the data back (like a bunch of cards)?
That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
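A rough sketch of that split-sort-merge strategy using Thrust (which dispatches to a radix sort for primitive integer keys); this assumes exactly two GPUs and, for brevity, runs the two sorts back to back, whereas a real implementation would drive each GPU from its own host thread so they overlap:

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::vector<int> two_gpu_sort(const std::vector<int>& data) {
        std::size_t half = data.size() / 2;
        std::vector<int> lo(data.begin(), data.begin() + half);
        std::vector<int> hi(data.begin() + half, data.end());

        cudaSetDevice(0);                  // sort the first half on GPU 0
        {
            thrust::device_vector<int> d(lo.begin(), lo.end());
            thrust::sort(d.begin(), d.end());
            thrust::copy(d.begin(), d.end(), lo.begin());
        }
        cudaSetDevice(1);                  // sort the second half on GPU 1
        {
            thrust::device_vector<int> d(hi.begin(), hi.end());
            thrust::sort(d.begin(), d.end());
            thrust::copy(d.begin(), d.end(), hi.begin());
        }

        std::vector<int> out(data.size()); // single merge at the end
        std::merge(lo.begin(), lo.end(), hi.begin(), hi.end(), out.begin());
        return out;
    }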
Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can deduplicate elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
The general principle is that you want to make as few passes over the data as possible, and that your throughput will almost always be bound by the bandwidth constraints of your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2 GB of memory and 60 GB/s of write bandwidth (that's what my mid-range card is showing). A three-pass radix sort (11-bit histograms) requires 6 GB of writes (likely your rate-limiting factor), or about 100 ms to sort a 2 GB list of 32-bit integers. Great, they're sorted; now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
In any case, I just compiled my first example programs today. There's still a lot to learn. My target application is permutation-intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in the future.

Correct data structure to use for (this specific) expiring cache?

I need to read from a dataset which is very large and highly interlinked; the data is fairly localized, and reads are fairly expensive. Specifically:
The data sets are 2 GB to 30 GB in size, so I have to map sections of the file into memory to read. This is very expensive compared to the rest of the work I do in the algorithm. From profiling I've found that roughly 60% of the time is spent reading the memory, so this is the right place to start optimizing.
When operating on a piece of this dataset, I have to follow links inside of it (imagine it as something like a linked list), and while those reads aren't guaranteed to be anywhere near sequential, they are fairly localized. This means:
Let's say, for example, we operate on 2 megs of memory at a time. If you read 2 megs of data into memory, roughly 40% of the reads I subsequently have to do will be in that same 2 megs of memory. Roughly 20% of the reads will be purely random access in the rest of the data, and the other 40% will very likely link back into the 2 meg segment which pointed to this one.
From knowledge of the problem and from profiling, I believe that introducing a cache to the program will help greatly. What I want to do is create a cache which holds N chunks of X megs of memory (N and X configurable so I can tune it) which I can check first, before having to map another section of memory. Additionally, the longer something has been in the cache, the less likely it is that we will request that memory in the short term, and so the oldest data will need to be expired.
After all that, my question is very simple: What data structure would be best to implement a cache of this nature?
I need to have very fast lookups to see if a given address is in the cache. With every "miss" of the cache, I'll want to expire the oldest member of it, and add a new member. However, I plan to try to tune it (by changing the amount that's cached) such that 70% or more of reads are hits.
My current thinking is that an AVL tree (O(log n) search/insert/delete) would be the safest option (no degenerate cases). My other option is a sparse hashtable, so that lookups would be O(1) in the best case. In theory this could degenerate to O(n), but in practice I could keep collisions low. The concern there would be how long it takes to find and remove the oldest entry in the hashtable.
Does anyone have any thoughts or suggestions on what data structure would be best here, and why?
Put the cache into two sorted trees (AVL or any other reasonably balanced tree implementation is fine; you're better off using one from a library than creating your own).
One tree should sort by position in the file. This lets you do log(n) lookups to see whether a block is already in your cache.
The other tree should sort by time used (which can be represented by a number that increments by one on each use). When you use a cached block, you remove it, update the time, and insert it again. This will take log(n) also. When you miss, remove the smallest element of the tree, and add the new block as the largest. (Don't forget to also remove/add that block to the by-position-in-file tree.)
If your cache doesn't have very many items in it, you'll be better off still by just storing everything in a sorted array (using insertion sort to add new elements). Moving 16 items down one spot in an array is incredibly fast.
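A rough sketch of the two-tree scheme, using std::map (a red-black tree, equally balanced for this purpose) in place of AVL; Block and the field names are illustrative:

    #include <cstdint>
    #include <map>

    struct Block { std::uint64_t pos; std::uint64_t last_used; /* cached bytes... */ };

    std::map<std::uint64_t, Block*> by_pos;    // keyed by position in the file
    std::map<std::uint64_t, Block*> by_time;   // keyed by a logical use counter
    std::uint64_t use_clock = 0;

    // Hit path: find the block, then re-key it as most recently used. log(n).
    Block* lookup(std::uint64_t pos) {
        auto it = by_pos.find(pos);
        if (it == by_pos.end()) return nullptr;    // miss: caller evicts, then inserts
        Block* b = it->second;
        by_time.erase(b->last_used);
        b->last_used = ++use_clock;
        by_time[b->last_used] = b;
        return b;
    }

    // Miss path when the cache is full: the smallest use time is the oldest.
    Block* evict_oldest() {
        auto it = by_time.begin();
        Block* victim = it->second;
        by_time.erase(it);
        by_pos.erase(victim->pos);                 // keep both trees in sync
        return victim;                             // reuse its storage for the new block
    }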
Seems like you are looking for an LRU (Least Recently Used) cache: LRU cache design
If 60% of your algorithm's time is I/O, I suggest that the actual cache design doesn't really matter that much; any sort of cache could be a substantial speed-up.
However, the design depends a lot on what data you're using to access your chunks: string, int, etc. If you had an int key, you could use a hashmap pointing into a linked list: on a cache miss, erase the back entry and push the new one onto the front; on a cache hit, erase the entry and push it back onto the front.
Hashmaps are provided under varying names (most commonly, unordered_map) in many implementations. Boost has one, there's one in TR1, etc. A big advantage of a hashmap is less performance loss as the number of elements grows, and more flexibility about key values.
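For completeness, a minimal sketch of that hashmap-plus-linked-list LRU design, with chunk addresses as keys (Chunk and the class name are illustrative, not from the question):

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Chunk { std::vector<char> bytes; };

    class LruCache {
        using Entry = std::pair<std::uint64_t, Chunk>;
        std::size_t capacity_;
        std::list<Entry> order_;   // most recently used at the front
        std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;

    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        // O(1) average lookup; a hit moves the entry to the front.
        Chunk* find(std::uint64_t addr) {
            auto it = index_.find(addr);
            if (it == index_.end()) return nullptr;            // miss
            order_.splice(order_.begin(), order_, it->second); // relink, no copy
            return &it->second->second;
        }

        // Assumes addr is not already cached (call find first).
        void insert(std::uint64_t addr, Chunk chunk) {
            if (order_.size() >= capacity_) {                  // expire the oldest
                index_.erase(order_.back().first);
                order_.pop_back();
            }
            order_.emplace_front(addr, std::move(chunk));
            index_[addr] = order_.begin();
        }
    };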