What are the criteria for choosing a sorting algorithm? - c++

I was reading about sorting methods, including bubble sort, selection sort, merge sort, heap sort, bucket sort, etc. Each comes with a time complexity, which helps us know which sort is efficient. So I have a basic question: given some data, how do we choose a sorting method? Time complexity is one parameter that helps us decide, but are there other parameters for choosing a sorting method?
Just trying to figure out sorting for better understanding.
I also have some questions about heap sort:
Where is heap sort used?
What is the biggest advantage of heap sort (other than its O(n log n) time complexity)?
What are the disadvantages of heap sort?
What is the build time for a heap? (I heard O(n), but I'm not sure.)
Are there scenarios where we have to use heap sort, or where heap sort is the better option (other than a priority queue)?
Before applying heap sort to data, what properties of the data should we look at?

The two main theoretical features of sorting algorithms are time complexity and space complexity.
In general, time complexity lets us know how the performance of the algorithm changes as the size of the data set increases. Things to consider:
How much data are you expecting to sort? This will help you know whether you need to look for an algorithm with a very low time complexity.
How sorted will your data be already? Will it be partly sorted? Randomly sorted? This can affect the time complexity of the sorting algorithm. Most algorithms will have worst and best cases - you want to make sure you're not using an algorithm on a worst-case data set.
Time complexity is not the same as running time. Remember that time complexity only describes how the performance of an algorithm varies as the size of the data set increases. An algorithm that always does one pass over all the input is O(n) - its performance is linearly correlated with the size of the input. But an algorithm that always does two passes over the data set is also O(n) - the correlation is still linear, even though the constant (and the actual running time) is different.
Similarly, space complexity describes how much space an algorithm needs to run. For example, a simple sort such as insertion sort needs an additional fixed amount of space to store the value of the element currently being inserted. This is an auxiliary space complexity of O(1) - it doesn't change with the size of the input. However, merge sort creates extra arrays in memory while it runs, with an auxiliary space complexity of O(n). This means the amount of extra space it requires is linearly correlated with the size of the input.
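As a concrete sketch of what O(1) auxiliary space means, here is a minimal insertion sort in C++: the only extra storage is the single key temporary, no matter how large the input is, whereas a typical merge sort would allocate a buffer proportional to the input.

```cpp
#include <cstddef>
#include <vector>

// Insertion sort: the only extra storage is the single `key` temporary,
// regardless of how large `a` is - i.e. O(1) auxiliary space.
void insertion_sort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];   // shift larger elements one slot to the right
            --j;
        }
        a[j] = key;            // drop the saved value into its final position
    }
}
```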
Of course, algorithm design is often a trade-off between time and space - algorithms with a low space complexity may require more time, and algorithms with a low time complexity may require more space.
For more information, you may find this tutorial useful.
To answer your updated question, you may find the wikipedia page on Heap Sort useful.
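If you want to experiment with heap sort itself, the C++ standard library already exposes its two phases; here is a minimal sketch (std::make_heap builds the heap in O(n), which answers the build-time question, and std::sort_heap does the O(n log n) extraction, all in place).

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{5, 1, 9, 3, 7};

    std::make_heap(v.begin(), v.end());  // build a max-heap in O(n)
    std::sort_heap(v.begin(), v.end());  // repeatedly pop the maximum: O(n log n), in place

    for (int x : v)
        std::cout << x << ' ';           // prints: 1 3 5 7 9
    std::cout << '\n';
}
```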

If you mean criteria for what type of sort to choose, here are some other items to consider.
The amount of data you have: Do you have ten, one hundred, a thousand, or millions of items to be sorted?
Complexity of the algorithm: The more complex the algorithm, the more testing is needed to make sure it is correct. For small amounts of data, a bubble sort or quicksort is easy to code and test, versus other sorts which may be overkill for the amount of data you have to sort.
How much time will it take to sort: If you have a large set, bubble/quick sort will take a lot of time, but if you have a lot of time, that may not be an issue. However, using a more complex algorithm will cut down the time to sort, but at the cost of more effort in coding and testing, which may be worth it if sorting goes from long (hours/days) to a shorter amount of time.
The data itself: Is the data mostly the same value, or already close to sorted? Some sorts hit their worst case on inputs like that (a quicksort with a naive pivot choice on already-sorted data is the classic example), so if you know something about the composition of the data, it can help in deciding which algorithm is worth the effort.
The amount of resources available: Do you have enough memory to hold all the items, or do you need to spill items to disk? If everything cannot fit in memory, an external merge sort may be best, whereas other algorithms may be better if you can work entirely in memory.

Related

Deciding when to use a hash table

I was solving a competitive programming problem with the following requirements:
I had to maintain a list of unique 2D points (x, y); the number of unique points would be less than 500.
My idea was to store them in a hash table (a C++ unordered_set, to be specific), and each time a node turned up I would look it up in the table and, if it was not already there, insert it.
I also know for a fact that I wouldn't be doing more than 500 lookups.
Then I saw some solutions that simply search through an (unsorted) array and check whether the node is already there before inserting.
My question is: is there any reasonable way to guess when I should use a hash table over a manual search over the keys, without having to benchmark them?
I am guessing you are familiar with basic algorithmics and time complexity, know the C++ standard containers, and know that, with luck, hash table access is O(1).
If the hash table code (or some balanced tree code, e.g. using std::map - assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.
Otherwise, you might make an educated guess by taking into account the approximate timings of various operations on a PC. BTW, the entire http://norvig.com/21-days.html page is worth reading.
Basically, memory accesses are much slower than everything else in the CPU, and the CPU cache is extremely important. A typical memory access that misses the cache and has to fetch data from the DRAM modules is several hundred times slower than an elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).
In practice, it does not matter that much, as long as your data is tiny (e.g. less than a thousand elements), since in that case it is likely to sit in L2 cache.
Searching (linearly) in an array is really fast (since it is very cache friendly), up to several thousand (small) elements.
IIRC, Herb Sutter mentions in some video that even inserting an element into the middle of a vector is practically - but unintuitively - faster (taking into account the time needed to shift the elements after it) than inserting it into some balanced tree (or perhaps some other container, e.g. a hash table), up to a container size of several thousand small elements. This is on a typical tablet, desktop, or server microprocessor with a multi-megabyte cache. YMMV.
If you really care that much, you cannot avoid benchmarking.
Notice that 500 pairs of integers is probably fitting into the L1 cache!
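For the concrete 500-point case, a sketch of the two options might look like this (Point, PointHash, and the helper names are just illustrative, and the hash shown is only one reasonable way to combine the two coordinates):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

struct Point {
    int x, y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
};

struct PointHash {
    std::size_t operator()(const Point& p) const {
        // Pack both 32-bit coordinates into one 64-bit value and hash that.
        std::uint64_t k = (static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.x)) << 32)
                        | static_cast<std::uint32_t>(p.y);
        return std::hash<std::uint64_t>{}(k);
    }
};

// Hash-table version: amortized O(1) insert-if-absent.
bool insert_unique_hashed(std::unordered_set<Point, PointHash>& seen, Point p) {
    return seen.insert(p).second;   // false if the point was already there
}

// Linear-scan version: O(n) per insert, but very cache friendly for n <= 500.
bool insert_unique_linear(std::vector<Point>& seen, Point p) {
    if (std::find(seen.begin(), seen.end(), p) != seen.end())
        return false;
    seen.push_back(p);
    return true;
}
```

For a few hundred small elements either version is more than fast enough; the linear one is simpler to get right.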
My rule of thumb is to assume the processor can deal with 10^9 operations per second.
In your case there are only 500 entries, so an algorithm up to O(N^2) should be safe. By using a contiguous data structure like std::vector you can also take advantage of the cache. Also, a hash function can be costly in terms of constant factors. However, if you had a data size of around 10^6, the safe total complexity might be only O(N); in that case you would want the O(1) hash map lookup for each query.
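To put rough numbers on it: at most 500 insert-if-absent operations, each scanning at most 500 stored points, is about 500 * 500 = 250,000 comparisons, which at roughly 10^9 simple operations per second is on the order of a quarter of a millisecond - far too small to matter in a contest.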
You can use Big O complexity to roughly estimate the performance. For a hash table, searching for an element is O(1) on average and O(n) in the worst case. That means that in the typical case your access time is independent of the number of elements in your map, but in the worst case it is linearly dependent on the number of elements stored.
A balanced binary tree has a guaranteed search complexity of O(log(n)). That means that searching always depends on the number of elements, but in the worst case it is faster than a hash table.
You can look up some Big O Complexities at this handy website here: http://bigocheatsheet.com/

Is it possible to implement median of medians introselect with no swaps or heap allocations?

So I've run into a problem in some code I'm writing in C++. I need to find the median of an array of points with an offset and step size (example).
This code will be executed millions of times, as it's part of one of my core data structures, so I'm trying to make it as fast as possible.
Research has led me to believe that for the best worst-case time complexity, introselect is the fastest way to find a median in a set of unordered values. I have some additional limitations that have to do with optimization:
I can't swap any values in the array. The values in the array are all exactly where they need to be based on that context in the program. But I still need the median.
I can't make any "new" allocations or call anything that does heap allocation. Or if I have to, then they need to be at a minimum as they are costly.
I've tried implementing the following in C++:
First Second Third
Is this possible? Or are there alternatives that are just as fast at finding the median and fit those requirements?
You could consider using the same heap allocation for all operations, and avoid freeing it until you're done. This way, rather than creating millions of arrays you just create one.
Of course this approach is more complex if you're doing these find-median operations in parallel. You'd need one array per thread.
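A minimal sketch of that idea, assuming the only hard constraint is that the original array must not be modified: a thread_local scratch buffer is allocated once per thread and reused by every subsequent call, and std::nth_element (typically an introselect under the hood) does the selection.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a[offset], a[offset + step], a[offset + 2*step], ...
// Requires step > 0 and a non-empty slice.  The original array `a` is never
// modified; the scratch vector is thread_local, so its single allocation is
// reused across calls (it only grows, it never shrinks).
double strided_median(const std::vector<double>& a, std::size_t offset, std::size_t step) {
    thread_local std::vector<double> scratch;
    scratch.clear();
    for (std::size_t i = offset; i < a.size(); i += step)
        scratch.push_back(a[i]);

    std::size_t mid = scratch.size() / 2;            // upper median for even sizes
    std::nth_element(scratch.begin(), scratch.begin() + mid, scratch.end());
    return scratch[mid];
}
```

The copy into the scratch buffer is the price paid for not being allowed to swap elements in place; for slices like these it is usually far cheaper than a fresh allocation per call.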

Least expensive equality search for small unsorted arrays

What is the most efficient method, in terms of execution time, to search a small array of about 4 to 16 elements, for an element that is equal to what you're searching for, in C++? The element being searched for is a pointer in this case, so it's relatively small.
(My purpose is to prevent points in a point cloud from creating edges with points that already share an edge with them. The edge array is small for each point, but there can be a massive number of points. Also, I'm just curious too!)
Your best bet is to profile your specific application with a variety of mechanisms and see which performs best.
I suspect that given it's unsorted a straight linear search will be best for you. If you're able to pre-sort the array once and it updates infrequently or never, you could pre-sort and then use a binary search.
Try a linear search; try starting with one or more binary chop stages. The former involves more comparisons on average; the latter has more scope for cache misses and branch mispredictions, and requires that the arrays are pre-sorted.
Only by measuring can you tell which is faster, and then only on the platform you measured on.
If you have to do this search more than once, and the array doesn't change often/at all, sort it and then use a binary search.
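A sketch of the two candidates being compared, using a pointer element type as in the question (std::less is used for the sorted case because it guarantees a total order on pointers):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Unsorted, 4-16 elements: a plain linear scan, likely the one to beat.
bool contains_linear(const std::vector<const void*>& edges, const void* p) {
    return std::find(edges.begin(), edges.end(), p) != edges.end();
}

// Alternative if the array can be kept sorted with std::less and changes
// rarely or never: binary search.
bool contains_sorted(const std::vector<const void*>& edges, const void* p) {
    return std::binary_search(edges.begin(), edges.end(), p, std::less<const void*>());
}
```

Only measuring on your own data will tell you which wins at these sizes.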

Fast container for consistent performance

I am looking for a container that supports frequent adds/removals. I have no idea how large the container may grow, but I don't want to get stalls due to huge reallocations. I need a good balance between performance and consistent behavior.
Initially, I considered std::tr1::unordered_map but since I don't know the upper bound of the data set, collisions could really slow down the unordered_map's performance. It's not a matter of a good hashing function because no matter how good it is, if the occupancy of the map is more than half the bucket count, collisions will likely be a problem.
Now I'm considering std::map because it doesn't suffer from the issue of collisions but it only has log(n) performance.
Is there a way to intelligently handle collisions when you don't know the target size of an unordered_map? Any other ideas for handling this situation, which I imagine is not uncommon?
Thanks
This is a run-time container, right?
Are you adding at the end (as in push_back) or in the front or the middle?
Are you removing at random locations, or what?
How are you referencing information in it?
Randomly, or from the front or back, or what?
If you need random access, something based on array or hash is preferred.
If reallocation is a big problem, you want something more like a tree or list.
Even so, if you are constantly new-ing (and delete-ing) the objects that you're putting in the container, that alone is likely to consume a large fraction of time,
in which case you might find it makes sense to save used objects in junk lists, so you can recycle them.
My suggestion is, rather than agonize over the choice of container, just pick one, write the program, and then tune it, as in this example.
No matter what you choose, you will probably want to change it, maybe more than once.
What I found in that example was that any pre-existing container class is justifying its existence by ease of programming, not by fastest-possible speed.
I know it's counter-intuitive, but
unless some other activity in your program ends up being the dominant cost, and you can't shrink it, your final burst of speed will require hand-coding the data structure.
What kind of access do you need? Sequential, random access, lookup by key? Also, you can rehash an unordered_map manually (the rehash member function) and set its maximum load factor (max_load_factor). In any case the table will rebuild itself when the chains get too long (i.e., when the load factor is exceeded). Additionally, the slow-down point of a hash table is typically around 80% occupancy, not 50%.
You should really have read the documentation, for example here.
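For example, with a C++11 std::unordered_map (TR1's version has rehash and max_load_factor but, as far as I recall, not reserve), you can keep rehashing under control like this; the sizes are placeholders:

```cpp
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> m;

    // Rehash less often as the map grows: lower the threshold at which the
    // table rebuilds itself (default is 1.0), then pre-allocate buckets for
    // a rough guess of the final size.
    m.max_load_factor(0.7f);
    m.reserve(100000);   // enough buckets for ~100k elements at the current max load factor

    for (int i = 0; i < 100000; ++i)
        m.emplace(i, "value");   // no rehash stalls on the way up to the reserved size
}
```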

Correct data structure to use for (this specific) expiring cache?

I need to read from a dataset which is very large, highly interlinked, the data is fairly localized, and reads are fairly expensive. Specifically:
The data sets are 2gigs - 30gigs in size, so I have to map sections of the file into memory to read. This is very expensive compared to the rest of the work I do in the algorithm. From profiling I've found roughly 60% of the time is spent reading the memory, so this is the right place to start optimizing.
When operating on a piece of this dataset, I have to follow links inside of it (imagine it like being similar to a linked list), and while those reads aren't guaranteed to anywhere near sequential, they are fairly localized. This means:
Let's say, for example, we operate on 2 megs of memory at a time. If you read 2 megs of data into memory, roughly 40% of the reads I will have to subsequently do will be in that same 2 megs of memory. Roughly 20% of the reads will be purely random access in the rest of the data, and the other 40% very likely links back into the 2meg segment which pointed to this one.
From knowledge of the problem and from profiling, I believe that introducing a cache to the program will help greatly. What I want to do is create a cache which holds N chunks of X megs of memory (N and X configurable so I can tune it) which I can check first, before having to map another section of memory. Additionally, the longer something has been in the cache, the less likely it is that we will request that memory in the short term, and so the oldest data will need to be expired.
After all that, my question is very simple: What data structure would be best to implement a cache of this nature?
I need to have very fast lookups to see if a given address is in the cache. With every "miss" of the cache, I'll want to expire the oldest member of it, and add a new member. However, I plan to try to tune it (by changing the amount that's cached) such that 70% or more of reads are hits.
My current thinking is that an AVL tree (O(log n) search/insert/delete) would be the safest option (no degenerate cases). My other option is a sparse hash table, where lookups would be O(1) in the best case. In theory this could degenerate into O(n), but in practice I could keep collisions low. The concern there is how long it takes to find and remove the oldest entry in the hash table.
Does anyone have any thoughts or suggestions on what data structure would be best here, and why?
Put the cache into two sorted trees (AVL or any other reasonably balanced tree implementation is fine--you're better off using one from a library than creating your own).
One tree should sort by position in the file. This lets you do log(n) lookups to see if your cache is there.
The other tree should sort by time used (which can be represented by a number that increments by one on each use). When you use a cached block, you remove it, update the time, and insert it again. This will take log(n) also. When you miss, remove the smallest element of the tree, and add the new block as the largest. (Don't forget to also remove/add that block to the by-position-in-file tree.)
If your cache doesn't have very many items in it, you'll be better off still by just storing everything in a sorted array (using insertion sort to add new elements). Moving 16 items down one spot in an array is incredibly fast.
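A rough sketch of the two-tree scheme, with std::map standing in for the balanced trees; the chunk payload, key types, and capacity handling are placeholders to adapt to the real file-mapping code:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Two balanced trees: by_offset_ answers "is this chunk cached?" in O(log n),
// by_time_ orders entries by last use so the oldest can be evicted in
// O(log n).  A plain counter stands in for "time used".
class TreeCache {
    struct Entry { std::vector<char> data; std::uint64_t last_used; };

public:
    explicit TreeCache(std::size_t capacity) : capacity_(capacity) {}   // capacity must be >= 1

    // Returns the cached chunk or nullptr, refreshing its recency on a hit.
    std::vector<char>* get(std::uint64_t offset) {
        auto it = by_offset_.find(offset);
        if (it == by_offset_.end()) return nullptr;
        by_time_.erase(it->second.last_used);      // re-insert under a fresh timestamp
        it->second.last_used = ++clock_;
        by_time_[it->second.last_used] = offset;
        return &it->second.data;
    }

    // Inserts (or replaces) a chunk, evicting the least recently used one if full.
    void put(std::uint64_t offset, std::vector<char> data) {
        if (std::vector<char>* hit = get(offset)) { *hit = std::move(data); return; }
        if (by_offset_.size() == capacity_) {
            auto oldest = by_time_.begin();        // smallest timestamp = least recently used
            by_offset_.erase(oldest->second);
            by_time_.erase(oldest);
        }
        by_offset_[offset] = Entry{std::move(data), ++clock_};
        by_time_[clock_] = offset;
    }

private:
    std::size_t capacity_;
    std::uint64_t clock_ = 0;
    std::map<std::uint64_t, Entry> by_offset_;            // file offset -> cached chunk
    std::map<std::uint64_t, std::uint64_t> by_time_;      // last-used tick -> file offset
};
```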
Seems like you are looking for an LRU (Least Recently Used) cache: LRU cache design
If 60% of your algorithm's time is I/O, I suggest that the exact cache design doesn't really matter that much - any sort of cache could be a substantial speed-up.
However, the design depends a lot on what data you're using to address your chunks: a string, an int, etc. If you had an int key, you could use a hash map pointing into a linked list: on a cache miss, erase the entry at the back of the list and push the new chunk on top; on a cache hit, erase the entry from its current position and push it back on top.
Hash maps are provided under varying names (most commonly, unordered map) in many implementations: Boost has one, there's one in TR1, etc. A big advantage of a hash map is less performance loss as the number of elements grows, and more flexibility about key types.
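A minimal sketch of that hashmap-into-linked-list idea (in other words, an LRU cache); the 64-bit chunk id and the byte-vector payload are placeholders, but the point is that lookup, recency update, and eviction are all O(1):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// LRU cache: O(1) lookup via the hash map, O(1) recency update and eviction
// via the doubly linked list (front = most recently used).
class LruCache {
    struct Entry { std::uint64_t id; std::vector<char> data; };

public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}   // capacity must be >= 1

    // Returns the cached chunk or nullptr, moving it to the front on a hit.
    std::vector<char>* get(std::uint64_t chunk_id) {
        auto it = index_.find(chunk_id);
        if (it == index_.end()) return nullptr;
        entries_.splice(entries_.begin(), entries_, it->second);   // mark as most recent
        return &it->second->data;
    }

    // Inserts a chunk, evicting the least recently used entry when full.
    void put(std::uint64_t chunk_id, std::vector<char> data) {
        if (std::vector<char>* hit = get(chunk_id)) { *hit = std::move(data); return; }
        if (entries_.size() == capacity_) {
            index_.erase(entries_.back().id);                      // drop the oldest entry
            entries_.pop_back();
        }
        entries_.push_front(Entry{chunk_id, std::move(data)});
        index_[chunk_id] = entries_.begin();
    }

private:
    std::size_t capacity_;
    std::list<Entry> entries_;
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;
};
```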