What is the complexity of a 2-phase multi-way external sort that uses quicksort (O(n log n)) as the internal sort?
Not an expert here but...
If I understand correctly, what you describe as phases are the number of passes your algorithm will make over the input, right? In this case, a running time approximation would be the number of passes (2 in your case) * the time necessary to read and write the whole input to the external device.
When evaluating the complexity of such algorithms, it is hard to put it in the usual running-time terms. There are many aspects that could influence the result (sequential/non-sequential access, technology, etc.). The common approach is to express the complexity in terms of passes, which accounts for the number of devices used, the number of items in the input, and the number of items that can fit in memory.
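For reference, in the standard external merge sort accounting: if the input occupies N blocks and M blocks fit in memory, the initial run-creation pass produces about ceil(N/M) sorted runs, and each k-way merge pass reduces the number of runs by a factor of k, so the total number of passes is roughly 1 + ceil(log_k(N/M)), for a total of about 2 * N * passes block transfers (each pass reads and writes the whole input once). A 2-phase sort like yours implicitly assumes ceil(N/M) <= k, so that a single merge pass is enough.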
The point is that the sorting algorithm is dominated by the IO operations. An internal quicksort should be fine (although keep its quadratic worst case in mind).
Also, I'm not sure if you counted the initial distribution. This is also a pass.
Related question:
Given an array A of size 10^5.
Then m operations are given (m is very large, m >> the size of A); each operation specifies a position p and an increment t:
A[p]+=t
Finally, I output the value at every position of the array.
Is there any constant-factor optimization to speed up the intermediate update operations?
For example, if I sort the operations by position, I can apply them sequentially and avoid random access. However, this incurs an additional sorting cost. Is there any other way to speed it up?
Re-executing all the operations after sorting them can be an order of magnitude faster than executing them directly, but the cost of the sort itself is too high.
On architectures with many cores, the best solution is certainly to perform atomic updates of A[p] in parallel. This assumes the number of cores is big enough for the parallelism not only to mitigate the overhead of the atomic operations but also to beat the serial implementation. It can be done fairly easily with OpenMP or with native C++ threads/atomics. The number of cores must not be too large either; otherwise the number of conflicts may grow significantly, causing contention and thus lower performance. This should be fine here since the number of items is pretty big. This solution also assumes the accesses are fairly uniformly random. If they are not (e.g. a normal distribution), the contention can be too high for the method to be efficient.
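A rough OpenMP sketch of this atomic-update approach (the Op record and the function name are made up for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical operation record: position p and increment t.
struct Op { std::size_t p; std::int64_t t; };

// Apply all operations in parallel; atomic updates prevent races on A[p].
void apply_atomic(std::vector<std::int64_t>& A, const std::vector<Op>& ops) {
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(ops.size()); ++i) {
        #pragma omp atomic
        A[ops[i].p] += ops[i].t;
    }
}
```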
An alternative solution is to split the accesses between N threads spatially. The array range can be statically split into N (roughly equal) parts. All the threads read the input, but only the thread owning the target range of the output array writes into it. The array parts can then be combined afterwards. This method works well with few threads and when the data distribution is uniform. When the distribution is not uniform at all (e.g. a normal distribution), a pre-computation step may be needed to adjust the array ranges owned by the threads. For example, one can compute the median, or even the quartiles, to better balance the work between threads. Computing quartiles can be done with a selection algorithm like Floyd-Rivest (std::nth_element should not be too bad either, although I expect it to use a kind of introselect algorithm that is often a bit slower). The pre-computation may be expensive, but it should still be significantly faster than a full sort. Using OpenMP is certainly a good idea to implement this.
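A minimal sketch of this range-ownership scheme (again with illustrative names; static equal-sized ranges, so it assumes a roughly uniform distribution):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <omp.h>

struct Op { std::size_t p; std::int64_t t; };

// Every thread scans all operations but only applies the ones whose position
// falls in the contiguous chunk of A it owns, so no atomics are needed.
void apply_range_split(std::vector<std::int64_t>& A, const std::vector<Op>& ops) {
    const std::size_t n = A.size();
    #pragma omp parallel
    {
        const int nt  = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        const std::size_t lo = n * tid / nt;        // inclusive start of owned range
        const std::size_t hi = n * (tid + 1) / nt;  // exclusive end of owned range
        for (const Op& op : ops)
            if (op.p >= lo && op.p < hi)
                A[op.p] += op.t;
    }
}
```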
Another alternative implementation is simply to perform the reduction separately in each thread and then sum the per-thread arrays into a global array at the end. This solution works well in your case (since "m >> the size of A"), assuming the number of cores is not too big; if it is, one needs to mix this method with the first one. This last method is probably the simplest efficient one.
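A sketch of this last method, with per-thread private copies of A that are summed into the shared array at the end (names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Op { std::size_t p; std::int64_t t; };

// Each thread accumulates into its own private copy of A (no synchronization on
// the hot path); the private copies are then combined, once per thread.
void apply_private_copies(std::vector<std::int64_t>& A, const std::vector<Op>& ops) {
    #pragma omp parallel
    {
        std::vector<std::int64_t> local(A.size(), 0);
        #pragma omp for schedule(static) nowait
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(ops.size()); ++i)
            local[ops[i].p] += ops[i].t;

        // The critical section is cheap: it runs once per thread, not once per op.
        #pragma omp critical
        for (std::size_t j = 0; j < A.size(); ++j)
            A[j] += local[j];
    }
}
```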
Besides, @Jérôme Richard's answer targets parallel multi-threaded computation.
I would suggest an idea along the lines of a partial sort, something like "merge sort, but only a few iterations" or "bucket sort, but only into buckets" (note, they are different). Preferably, set the bucket size to the page size for better overall performance at the OS level, especially considering that m is extraordinarily big. The cost of the partial sort would be amortized by the savings in cache misses and page swaps.
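A minimal single-threaded sketch of the "bucket only into buckets" idea (bucket width, names and the in-memory grouping are all illustrative; the point is that each bucket's updates then touch only a page-sized slice of A):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Op { std::size_t p; std::int64_t t; };

// Group operations into buckets of contiguous positions (roughly one page wide),
// then apply them bucket by bucket so each burst of writes stays within one page.
void apply_bucketed(std::vector<std::int64_t>& A, const std::vector<Op>& ops) {
    const std::size_t bucket_width = 4096 / sizeof(std::int64_t);   // ~one page of A
    const std::size_t num_buckets  = (A.size() + bucket_width - 1) / bucket_width;

    std::vector<std::vector<Op>> buckets(num_buckets);
    for (const Op& op : ops)
        buckets[op.p / bucket_width].push_back(op);   // a scatter, not a full sort

    for (const std::vector<Op>& bucket : buckets)
        for (const Op& op : bucket)
            A[op.p] += op.t;
}
```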
And if this is an interview question, I would ask for more details about m, p, t, the data sparsity, distribution, hardware, CPU, memory, power consumption, latency, etc., and for each new condition customize a more detailed design accordingly.
I was solving a competitive programming problem with the following requirements:
I had to maintain a list of unique 2D points (x, y); the number of unique points would be less than 500.
My idea was to store them in a hash table (a C++ unordered_set, to be specific) and, each time a node turned up, look it up in the table and insert it if it was not already there.
I also know for a fact that I wouldn't be doing more than 500 lookups.
Then I saw some solutions that simply search through an (unsorted) array and check whether the node is already there before inserting.
My question is: is there any reasonable way to guess when I should use a hash table over a manual search over the keys, without having to benchmark them?
I am guessing you are familiar with basic algorithmics & time complexity and with the C++ standard containers, and know that, with luck, hash table access is O(1).
If the hash table code (or some balanced tree code, e.g. using std::map - assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.
Otherwise, you might make some guess taking into account the approximate timing for various operations on a PC. BTW, the entire http://norvig.com/21-days.html page is worth reading.
Basically, memory accesses are much slower than everything else in the CPU. The CPU cache is extremely important. A typical memory access that misses the cache and has to fetch data from the DRAM modules is several hundred times slower than an elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).
In practice, it does not matter that much, as long as your data is tiny (e.g. less than a thousand elements), since in that case it is likely to sit in L2 cache.
Searching (linearly) in an array is really fast (since it is very cache friendly), up to several thousand (small) elements.
IIRC, Herb Sutter mentions in some video that even inserting an element into the middle of a vector is in practice (but unintuitively) faster, taking into account the time needed to shift the following elements, than inserting it into some balanced tree (or perhaps some other container, e.g. a hash table), up to a container size of several thousand small elements. This is on a typical tablet, desktop or server microprocessor with a multi-megabyte cache. YMMV.
If you really care that much, you cannot avoid benchmarking.
Notice that 500 pairs of integers probably fit into the L1 cache!
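For a concrete feel of the two options for ~500 points (purely illustrative types and names; for sizes this small both are fast, so readability usually wins):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Point {
    int x, y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
};

// Hash for the unordered_set variant (any reasonable mix of x and y works here).
struct PointHash {
    std::size_t operator()(const Point& p) const {
        const std::uint64_t k = (std::uint64_t(std::uint32_t(p.x)) << 32) | std::uint32_t(p.y);
        return std::hash<std::uint64_t>{}(k);
    }
};

// Variant 1: hash table, O(1) expected lookup; returns true if the point was new.
bool insert_if_new_hash(std::unordered_set<Point, PointHash>& seen, Point p) {
    return seen.insert(p).second;
}

// Variant 2: linear scan over a small contiguous array, O(n) but very cache friendly.
bool insert_if_new_linear(std::vector<Point>& seen, Point p) {
    if (std::find(seen.begin(), seen.end(), p) != seen.end()) return false;
    seen.push_back(p);
    return true;
}
```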
My rule of thumb is to assume the processor can deal with 10^9 operations per second.
In your case there are only 500 entries, so an algorithm of up to O(N^2) could be safe. By using a contiguous data structure like std::vector you can leverage fast cache hits; also, a hash function can sometimes be costly in terms of its constant factor. However, if you had a data size of around 10^6, the safe complexity might be only O(N) in total, and in that case you might need to consider an O(1) hash map lookup per element.
You can use big-O complexity to roughly estimate the performance. For the hash table, searching for an element is O(1) on average and O(n) in the worst case. That means that in the best case your access time is independent of the number of elements in your map, but in the worst case it is linearly dependent on the size of your hash table.
A balanced binary search tree has a guaranteed search complexity of O(log n). That means that searching for an element always depends on the size of the container, but in the worst case it is faster than a hash table.
You can look up some big-O complexities at this handy website: http://bigocheatsheet.com/
I'm trying to optimize a program which needs to compute a hash for a constant-size window in a data stream at every position (byte) of the stream. It is needed to look up repetitions in disk files much larger than the available RAM. Currently I compute a separate MD5 hash for every window, but it costs a lot of time (the window size is a few kilobytes, so each byte of data is processed a few thousand times). I wonder if there is a way to compute every subsequent hash in constant (window-size-independent) time, like the addition and subtraction of one element in a moving average. The hash function may be anything as long as it doesn't give too long hashes (50-100 bits is OK) and its computation is reasonably fast. It also must give virtually no collisions on up to trillions of not-so-random windows (TBs of data) - every collision means a disk access in my case (CRC32 is much too weak, MD5 is OK in this respect).
I'll be thankful if you point me to an existing library function available on Linux, if there is one.
This is my first question here, so please be tolerant if I did something wrong.
regards,
bartosz
The Wikipedia article on rolling hashes has a link to ngramhashing which implements a few different techniques in C++, including:
Randomized Karp-Rabin (sometimes called Rabin-Karp)
Hashing by Cyclic Polynomials (also known as Buzhash)
Hashing by Irreducible Polynomials
(Also available on GitHub)
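If you just want to see the shape of the idea, here is a minimal polynomial rolling hash sketch (not the ngramhashing API; the base and the implicit mod-2^64 arithmetic are illustrative, and its collision resistance is far weaker than what trillions of windows require, so in practice you would use one of the randomized schemes above with a wider state):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal polynomial rolling hash over a fixed-size window:
//   h(c_0..c_{w-1}) = sum_i c_i * B^(w-1-i)   (mod 2^64 via unsigned wraparound)
struct RollingHash {
    std::uint64_t base, hash = 0, top_power = 1;  // top_power = base^(window-1)
    std::size_t window;

    explicit RollingHash(std::size_t w, std::uint64_t b = 1000000007ULL)
        : base(b), window(w) {
        for (std::size_t i = 1; i < w; ++i) top_power *= base;
    }

    // Initialize with the first full window of bytes.
    void init(const std::uint8_t* data) {
        hash = 0;
        for (std::size_t i = 0; i < window; ++i) hash = hash * base + data[i];
    }

    // Slide by one byte in O(1): drop the oldest byte, append the newest one.
    void roll(std::uint8_t out, std::uint8_t in) {
        hash = (hash - out * top_power) * base + in;
    }
};
```

Sliding over a buffer then just means calling init once and roll for every subsequent byte, using hash at each position for the lookup.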
What you describe is pretty close to the basic approach used in data deduplication storage.
In data deduplication systems, we usually use Rabin's fingerprinting method as a fast, rolling hash function.
However, while Rabin fingerprints have good and well-understood collision properties, they are not cryptographically secure, i.e., there will be collisions. Check, e.g., how Bentley et al. used such a method in their compression scheme. The question is if and how many collisions you can tolerate. If you can tolerate an occasional collision, a good Rabin fingerprint implementation might be a good idea. Good implementations can process more than 200 MB per second per core.
I am not aware of any approach with virtually no collisions (i.e. cryptographically secure) that is rolling at the same time. Like PlasmaHH, I have serious doubts that this is actually possible.
Think about whether you can relax your restrictions. Maybe you can afford to miss some duplicates; in that case, faster approaches are possible.
I was reading about sorting methods, including bubble sort, selection sort, merge sort, heap sort, bucket sort, etc. They also come with time complexities, which help us know which sort is efficient. So I have a basic question: given some data, how do we choose a sorting method? Time complexity is one parameter that helps us decide on a sorting method, but are there other parameters to consider?
Just trying to figure out sorting for better understanding.
I also have some queries about heap sort:
Where do we use heap sort?
What is the biggest advantage of heap sort (apart from its O(n log n) time complexity)?
What is the disadvantage of heap sort?
What is the build time for a heap? (I heard O(n), but I'm not sure.)
Are there scenarios where we have to use heap sort, or where heap sort is the better option (apart from priority queues)?
Before applying heap sort to data, what properties of the data should we look at?
The two main theoretical features of sorting algorithms are time complexity and space complexity.
In general, time complexity lets us know how the performance of the algorithm changes as the size of the data set increases. Things to consider:
How much data are you expecting to sort? This will help you know whether you need to look for an algorithm with a very low time complexity.
How sorted will your data be already? Will it be partly sorted? Randomly sorted? This can affect the time complexity of the sorting algorithm. Most algorithms will have worst and best cases - you want to make sure you're not using an algorithm on a worst-case data set.
Time complexity is not the same as running time. Remember that time complexity only describes how the performance of an algorithm varies as the size of the data set increases. An algorithm that always does one pass over all the input is O(n) - its performance is linearly correlated with the size of the input. But an algorithm that always does two passes over the data set is also O(n) - the correlation is still linear, even if the constant (and the actual running time) is different.
Similarly, space complexity describes how much space an algorithm needs to run. For example, a simple sort such as insertion sort needs an additional fixed amount of space to store the value of the element currently being inserted. This is an auxiliary space complexity of O(1) - it doesn't change with the size of the input. However, merge sort creates extra arrays in memory while it runs, with an auxiliary space complexity of O(n). This means the amount of extra space it requires is linearly correlated with the size of the input.
Of course, algorithm design is often a trade-off between time and space - algorithms with a low space complexity may require more time, and algorithms with a low time complexity may require more space.
For more information, you may find this tutorial useful.
To answer your updated question, you may find the Wikipedia page on heap sort useful.
If you mean criteria for what type of sort to choose, here are some other items to consider.
The amount of data you have: Do you have ten, one hundred, a thousand or millions of items to be sorted?
Complexity of the algorithm: The more complex it is, the more testing will need to be done to make sure it is correct. For small amounts, a bubble sort or quicksort is easy to code and test, versus other sorts which may be overkill for the amount of data you have to sort.
How much time it will take to sort: If you have a large set, bubble sort/quicksort will take a lot of time, but if you have a lot of time, that may not be an issue. However, using a more complex algorithm will cut down the time to sort, at the cost of more effort in coding and testing, which may be worth it if sorting goes from a long time (hours/days) to a shorter one.
The data itself: Is the data nearly the same for everything? For some sorts you may end up with a degenerate case (effectively a linear list), so if you know something about the composition of the data, it may help in determining which algorithm is worth the effort.
The amount of resources available: Do you have lots of memory in which you can store all the items, or do you need to spill items to disk? If everything cannot fit in memory, merge sort may be best, whereas others may be better if you can work with everything in memory.
How would one implement radix sort on multiple GPUs - the same way as on a single GPU, i.e. by splitting the data, building histograms on the separate GPUs, and then merging the data back (like a bunch of cards)?
That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
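As a CPU-side sketch of that strategy (std::thread standing in for the per-GPU sorts, and a single merge step at the end; purely illustrative, not actual multi-GPU code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sort each chunk "on its own device" in parallel, then do one merge at the end.
std::vector<std::uint32_t> sort_in_parts(std::vector<std::uint32_t> keys, int parts) {
    const std::size_t chunk = (keys.size() + parts - 1) / parts;

    std::vector<std::thread> workers;
    for (int i = 0; i < parts; ++i) {
        auto first = keys.begin() + std::min(keys.size(), std::size_t(i) * chunk);
        auto last  = keys.begin() + std::min(keys.size(), std::size_t(i + 1) * chunk);
        workers.emplace_back([first, last] { std::sort(first, last); });  // per-"GPU" sort
    }
    for (std::thread& w : workers) w.join();

    // One merge pass over the sorted parts (the analogue of the single inter-GPU exchange).
    for (int i = 1; i < parts; ++i) {
        auto mid = keys.begin() + std::min(keys.size(), std::size_t(i) * chunk);
        auto end = keys.begin() + std::min(keys.size(), std::size_t(i + 1) * chunk);
        std::inplace_merge(keys.begin(), mid, end);
    }
    return keys;
}
```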
Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can unique elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
The general principle is that you want to make as few passes over the data as possible, and that your throughput will almost always correspond to the bandwidth constraints associated with your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2GB of memory and 60GB/s write bandwidth (that's what my mid-range card is showing). A three pass radix sort (11-bit histograms) requires 6GB of write bandwidth (likely your rate limiting factor), or about 100ms to sort a 2GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
In any case, I just compiled my first example programs today. There's still a lot to learn. My target application is permutation-intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in the future.