Can we know if a collection is almost sorted without applying a sort algorithm? - bubble-sort

In the Wikipedia article on sorting algorithms,
http://en.wikipedia.org/wiki/Sorting_algorithm#Summaries_of_popular_sorting_algorithms
under Bubble sort it says: "Bubble sort can also be used efficiently on a list of any length that is nearly sorted (that is, the elements are not significantly out of place)."
So my question is: without first sorting the list using a sorting algorithm, how can one know whether it is nearly sorted or not?

Are you familiar with the general sorting lower bound? You can prove that any comparison-based sorting algorithm must make Ω(n log n) comparisons in the average case. The way you prove this is through an information-theoretic argument. The basic idea is that there are n! possible permutations of the input array, and since the only way you can learn which permutation you got is to make comparisons, you have to make at least lg(n!) ≈ n lg n comparisons before you can be certain you know the structure of your input permutation.
I haven't worked out the math on this, but I suspect that you could make similar arguments to show that it's difficult to learn how sorted a particular array is. Essentially, if you don't do a large number of comparisons, then you wouldn't be able to tell apart an array that's mostly sorted from an array that is actually quite far from sorted. As a result, all the algorithms I'm aware of that measure "sortedness" take a decent amount of time to do so.
For example, one measure of the level of "sortedness" in an array is the number of inversions in that array. You can count the number of inversions in an array in time O(n log n) using a divide-and-conquer algorithm based on mergesort, but with that runtime you could just sort the array instead.
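For illustration, here is a minimal sketch (in C++, assuming plain ints) of that divide-and-conquer inversion count: whenever an element from the right half is placed before remaining elements of the left half, each of those remaining elements forms one inversion with it.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Counts inversions in a[lo, hi) while sorting that range, O(n log n).
    long long countInversions(std::vector<int>& a, std::size_t lo, std::size_t hi)
    {
        if (hi - lo < 2) return 0;
        std::size_t mid = lo + (hi - lo) / 2;
        long long inv = countInversions(a, lo, mid) + countInversions(a, mid, hi);

        std::vector<int> merged;
        merged.reserve(hi - lo);
        std::size_t i = lo, j = mid;
        while (i < mid && j < hi) {
            if (a[i] <= a[j]) merged.push_back(a[i++]);
            else { inv += mid - i; merged.push_back(a[j++]); }  // a[j] is smaller than all of a[i..mid)
        }
        while (i < mid) merged.push_back(a[i++]);
        while (j < hi)  merged.push_back(a[j++]);
        std::copy(merged.begin(), merged.end(), a.begin() + lo);
        return inv;
    }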
Typically, the way you'd know that your array is mostly sorted is to know something a priori about how it was generated. For example, if you're looking at temperature data gathered from 8 AM to 12 PM, it's very likely that the data is already mostly sorted (modulo some variance in the quality of the sensor readings). If your data is a stock price over time, it's also likely to be mostly sorted unless the company has a really wonky trajectory. Some other algorithms also partially sort arrays; for example, it's not uncommon for quicksort implementations to stop sorting when the size of the subarray left to sort is small and to follow everything up with a final insertion sort pass, since at that point no element is very far from its final position.

I don't believe there exists any standardized measure of how sorted or random an array is.
You can come up with your own measure, such as counting the number of adjacent pairs that are out of order (suggested in a comment), or counting the number of larger numbers that occur before smaller numbers in the array (i.e., inversions; this is trickier than a simple single pass).
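A minimal sketch of the first measure, counting out-of-order adjacent pairs in a single O(n) pass (0 means the array is already sorted):

    #include <cstddef>
    #include <vector>

    // Counts adjacent pairs (a[i-1], a[i]) that are out of order.
    std::size_t adjacentDisorder(const std::vector<int>& a)
    {
        std::size_t count = 0;
        for (std::size_t i = 1; i < a.size(); ++i)
            if (a[i - 1] > a[i]) ++count;
        return count;
    }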

Related

Benefit of printing values from an array in ascending order by selecting?

I read the tutorial on arranging the numbers in an array in ascending order and understood the idea: https://www.includehelp.com/cpp-programs/sort-an-array-in-ascending-order.aspx . However, now I'm thinking of another way to perform the operation. I wonder whether the idea below would work?
The method uses a while loop that checks (while the count of remaining numbers in the array is not zero): find the smallest number in the array, print it, and remove it from the array. Repeat the same process until the count of remaining numbers reaches zero. My numbers will then be printed in ascending order, and the number of elements in the array will decrease on each loop until it reaches zero.
I started learning programming just a few weeks ago and have trouble writing out the code now. However, I'm interested to know whether this method would work. If not, please explain why.
What you've described is a variant of what's normally called a "selection sort". It's pretty well known. It does work, but many sorting algorithms work, and while a few are generally even less efficient, it's still one of the least efficient around.
Selection sort is typically faster than Bubble sort and a few of its variants like Shaker sort. Depending on the precise situation, it can also be faster than insertion sort, though that's pretty unusual. Those three (bubble sort, insertion sort, and selection sort) are the best known of the simple sorting algorithms. Of the three, bubble sort is most often the slowest, and insertion sort most often the fastest. But all three take time proportional to the square of the number of items being sorted, which means they get much slower in a hurry as you try to sort more items. If you have very many items, more advanced algorithms (e.g., Shell-Metzner, Quicksort, heap sort and merge sort) will almost always be substantially faster.
Ignoring execution speed for a moment, selection sort does have one extremely good property: it's easy to understand, easy to code up correctly and easy to prove that it works. If you only need to sort a few items, and need to type in the sorting code yourself (especially if you're in a hurry) it's my experience that it's probably the easiest sorting algorithm to be certain you've implemented correctly.
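For reference, here is a minimal selection sort sketch matching the idea in the question: repeatedly find the smallest remaining element, except it swaps the element into place instead of printing and removing it.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Selection sort: O(n^2) comparisons, at most n-1 swaps.
    void selectionSort(std::vector<int>& a)
    {
        for (std::size_t i = 0; i + 1 < a.size(); ++i) {
            std::size_t smallest = i;
            for (std::size_t j = i + 1; j < a.size(); ++j)
                if (a[j] < a[smallest]) smallest = j;   // find smallest remaining element
            std::swap(a[i], a[smallest]);               // move it to the front of the unsorted part
        }
    }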

Merging Two Sorted Arrays with O(log(n+m)) Worst Case

What kind of algorithm can I use to merge two sorted arrays into one sorted array with worst-case time complexity of O(log(m+n)), where m and n are the lengths of the arrays? I have very little experience with algorithms, but I checked out merge sort, and it seems that the time complexity of the merging step is O(n). Is there a different approach that can merge in O(log(n))?
Edit: I hadn't considered this initially, but maybe it's not possible to merge two sorted arrays in O(log(n))? The actual goal is to find the median of two sorted arrays. Is there a way to do this without merging them?
The only idea I've had is that merging two binomial heaps is O(log(n)), but turning an array into a binomial heap is O(n), I think, so that won't work.
Edit 2: I'm going to post a new question, because I've realized that merging will never be fast enough. I think instead I need to perform a binary search on each array to find the median in log(n).
I don't think there is an algorithm that can merge two arrays in O(log(n+m)) time.
And it makes sense when you think about it: if you're trying to create a new sorted array of n+m elements, you will need to do at least n+m copies. There is no way around that.
I think the best way would be to iterate through each array simultaneously and, at each step, compare the two current elements. Whichever is smaller (assuming you want the array sorted in ascending order) gets copied into the output array, and you increment the index pointer for that array, and vice versa. If the two elements are equal, you can add them both to the newly sorted array and increment both pointers.
Continue until one of the pointers has reached the end of its respective array, and then copy in the rest of the other array.
That should be O(m+n).
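A sketch of that two-pointer merge (assuming ascending int arrays):

    #include <cstddef>
    #include <vector>

    // Merges two ascending arrays into one ascending array in O(m + n).
    std::vector<int> mergeSorted(const std::vector<int>& a, const std::vector<int>& b)
    {
        std::vector<int> out;
        out.reserve(a.size() + b.size());
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] <= b[j]) out.push_back(a[i++]);
            else              out.push_back(b[j++]);
        }
        while (i < a.size()) out.push_back(a[i++]);   // copy whatever is left of a
        while (j < b.size()) out.push_back(b[j++]);   // copy whatever is left of b
        return out;
    }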
Regarding your edit: there is a way to find the median of two separate sorted arrays in O(log(n + m)) time.
You can first compare the medians of the two sorted arrays (their middle elements). If they are equal, that value is the overall median. If the first array's median is greater than the second's, you know the overall median has to be in either the first half of the first array or the second half of the second array, and vice versa if the first array's median is less than the second's.
This method cuts your search space in half each iteration and is thus O(log(n + m)).
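A hedged sketch of that idea, written as the usual "k-th smallest of two sorted arrays" helper (the function names are mine, and it assumes at least one element in total): each call discards roughly k/2 elements from one array, giving O(log(n + m)) overall.

    #include <algorithm>
    #include <climits>
    #include <cstddef>
    #include <vector>

    // Returns the k-th smallest (1-indexed) element of a[aStart..] and b[bStart..] combined.
    int findKth(const std::vector<int>& a, std::size_t aStart,
                const std::vector<int>& b, std::size_t bStart, std::size_t k)
    {
        if (aStart >= a.size()) return b[bStart + k - 1];   // a exhausted
        if (bStart >= b.size()) return a[aStart + k - 1];   // b exhausted
        if (k == 1) return std::min(a[aStart], b[bStart]);

        // Probe k/2 positions into each remaining array (or +infinity if it is too short)
        // and discard the half that cannot contain the k-th element.
        int aMid = (aStart + k / 2 - 1 < a.size()) ? a[aStart + k / 2 - 1] : INT_MAX;
        int bMid = (bStart + k / 2 - 1 < b.size()) ? b[bStart + k / 2 - 1] : INT_MAX;
        if (aMid < bMid)
            return findKth(a, aStart + k / 2, b, bStart, k - k / 2);
        return findKth(a, aStart, b, bStart + k / 2, k - k / 2);
    }

    // Median of two sorted arrays; assumes a.size() + b.size() >= 1.
    double medianOfSorted(const std::vector<int>& a, const std::vector<int>& b)
    {
        std::size_t total = a.size() + b.size();
        if (total % 2 == 1)
            return findKth(a, 0, b, 0, total / 2 + 1);
        return (findKth(a, 0, b, 0, total / 2) +
                findKth(a, 0, b, 0, total / 2 + 1)) / 2.0;
    }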
You're probably thinking of The Selection Algorithm.
For a sorted data structure, finding the median is O(1). For an unsorted data structure (or a data structure where the data is sorted into two logical partitions) the runtime is O(n).
You could probably pull it off with a massively parallel reduction algorithm, but I think that's cheating in runtime-analysis terms.
So I don't believe there's an algorithm that reduces it below O(n) (or, in your case, O(n+m)).
You need to merge the arrays, so no matter what, you need to traverse both arrays at least once; the complexity can't be less than O(m+n).

Fastest way to search and sort vectors

I'm doing a project in which I need to insert data into vectors, sort it, and search it.
I need the fastest possible algorithms for sorting and searching. I've been searching and found out that std::sort is basically quicksort, which is one of the fastest sorts, but I can't figure out which search algorithm is best. Binary search? Can you help me with it? Thanks. So I've got 3 methods:
void addToVector(Obj o)
{
    fvector.push_back(o);
}

void sortVector()
{
    sort(fvector.begin(), fvector.end());
}

Obj* search(string& bla)
{
    // i would write binary search here
    return binarysearch(..);
}
I've been searching and found out that std::sort is basically quicksort.
Answer: Not quite. Most implementations use a hybrid algorithm like introsort, which combines quick-sort, heap-sort and insertion sort.
Quick-sort is one of the fastest sorting methods.
Answer: Not quite. In general it holds that quick-sort runs in O(n log n) time in the average case. However, quick-sort has quadratic, O(n^2), worst-case performance. Furthermore, for a small number of inputs (e.g., if you have a std::vector with a small number of elements), sorting with quick-sort tends to achieve worse performance than other sorting algorithms that are considered "slower".
I can't figure out which searching algorithm is the best. Is it binary-search?
Answer: Binary search has the same average and worst-case performance, O(log n). Also keep in mind that binary search requires the container to be arranged in ascending or descending order. However, whether it is better than other searching methods (e.g., linear search, which has O(n) time complexity) depends on a number of factors. Some of them are:
The number of elements/objects.
The type of elements/objects.
Bottom Line:
Usually, looking for the "fastest" algorithm is a sign of premature optimization; as one of the "great ones" put it, "Premature optimization is the root of all evil" (Donald Knuth). The "fastest", as I hope has been clearly shown, depends on quite a number of factors.
Use std::sort to sort your std::vector.
After sorting your std::vector use std::binary_search to find out whether a certain element exists in your std::vector or use std::lower_bound or std::upper_bound to find and get an element from your std::vector.
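A sketch of that combination, assuming an Obj type with a name field (placeholder names, not from the original question):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Obj {
        std::string name;
    };

    // Sort once by name, O(n log n).
    void sortVector(std::vector<Obj>& fvector)
    {
        std::sort(fvector.begin(), fvector.end(),
                  [](const Obj& a, const Obj& b) { return a.name < b.name; });
    }

    // Then look up an element in the sorted vector, O(log n).
    Obj* search(std::vector<Obj>& fvector, const std::string& bla)
    {
        auto it = std::lower_bound(fvector.begin(), fvector.end(), bla,
                                   [](const Obj& o, const std::string& s) { return o.name < s; });
        if (it != fvector.end() && it->name == bla)
            return &*it;
        return nullptr;   // not present
    }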
For amortised O(1) access times, use a std::unordered_map, maybe with a custom hash for best effect.
Sorting seems to be unnecessary extra work.
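A minimal sketch of that alternative, again with a placeholder Obj keyed by name:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    struct Obj { std::string name; int value; };

    int main()
    {
        std::unordered_map<std::string, Obj> byName;
        byName.emplace("alpha", Obj{"alpha", 42});   // insert, average O(1)
        auto it = byName.find("alpha");              // lookup, average O(1), no sorting needed
        if (it != byName.end())
            std::cout << it->second.value << '\n';
    }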
Searching and Sorting efficiency is highly dependent on the type of data, the ordering of the raw data, and the quantity of the data.
For example, for small sorted data sets, a linear search may be faster than a binary search; or the time differences between the two is negligible.
Some sort algorithms will perform horribly on inversely ordered data, such as binary tree sort. Data that does not have much variation may cause a high degree of collisions with hash algorithms.
Perhaps you need to answer the bigger question: Is search or sorting the execution bottleneck in my program? Profile and find out.
If you need the fastest or the best sorting algorithm... there is no such thing. At least it hasn't been found yet. There are algorithms that provide better results for particular kinds of data, and there are algorithms that provide good results for most data. You either need to analyze your data and find the best one for your case, or use a generic algorithm like std::sort and expect it to provide good results, but not the best.
If your elements are integers, you could use a bucket sort algorithm, which runs in O(n) time instead of the O(n log n) average case of quicksort.
http://en.wikipedia.org/wiki/Bucket_sort
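As a minimal sketch of the idea (assuming non-negative ints with a known, reasonably small maximum), here is a counting sort, which is effectively bucket sort with one bucket per value; it runs in O(n + maxValue):

    #include <cstddef>
    #include <vector>

    // Sorts a in place, assuming every element is in [0, maxValue].
    void countingSort(std::vector<int>& a, int maxValue)
    {
        std::vector<int> count(maxValue + 1, 0);
        for (int x : a) ++count[x];                  // tally each value into its bucket
        std::size_t out = 0;
        for (int v = 0; v <= maxValue; ++v)          // write values back in ascending order
            while (count[v]-- > 0)
                a[out++] = v;
    }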
Sorting
If you want to know about the fastest sorting technique for integer values in a vector, I would suggest referring to the following link:
https://github.com/fenilgmehta/Fastest-Integer-Sort
It uses radix sort and counting sort for large arrays and merge sort along with insertion sort for small arrays.
According to statistics, this sorting algorithm is way faster than C++ std::sort for integral values.
It is 6 times faster than C++ STL std::sort for "int64_t array[10000000]"
Searching
If you want to know whether a particular value is present in the vector, use binary_search(...).
If you want to know the exact location of an element, use lower_bound(...) and upper_bound(...).

How to sort faster than n log n (given a strong condition on the list)?

I was asked the following question (and didn't know at all how to approach it):
Given an array arr of n ints, we need to sort it. We already know that k of these ints are placed in the original arr at the same positions as in the sorted array (we just don't know which ones).
They said that such an array can be sorted much faster than n log n; I have no clue how...
Any advice?
http://en.wikipedia.org/wiki/Radix_sort
The key fact is that you're working with integers and you know the largest key, which is exactly the situation radix sort is used for, and its complexity is linear.
A second approach: since k of the elements are already sorted, you could use some version of Shellsort with a gap sequence that yields the best result.
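A hedged sketch of the radix sort suggestion for non-negative 32-bit keys (least-significant-digit, one byte at a time): each pass is a counting sort, so the total work is linear in n for fixed-width keys.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    void radixSort(std::vector<std::uint32_t>& a)
    {
        std::vector<std::uint32_t> buf(a.size());
        for (int shift = 0; shift < 32; shift += 8) {          // one pass per byte
            std::array<std::size_t, 257> count{};
            for (std::uint32_t x : a) ++count[((x >> shift) & 0xFF) + 1];
            for (std::size_t i = 1; i < count.size(); ++i)     // prefix sums -> start offsets
                count[i] += count[i - 1];
            for (std::uint32_t x : a)                          // stable scatter by current byte
                buf[count[(x >> shift) & 0xFF]++] = x;
            a.swap(buf);
        }
    }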
If we do not know:
how k and n are related to each other
and how exactly the k elements are located in the array
There is simply no way we can do much better than Θ(n log(n)) in the worst case.
Why:
Take k = 1, and good luck...
Say k = 0.9n and the k sorted elements are at the front. Even if we knew they were at the front, we would still have to sort an array of size 0.1n, so in the worst case we need 0.1*n*log(0.1*n) = 0.1*n*(log(n) + log(0.1)) = 0.1*n*log(n) - 0.1*n*log(10) comparisons, which is Θ(n*log(n)).
Of course this is just a theoretical result for the worst case. In practice, the information that there are exactly k elements in their proper places can significantly limit the amount of work to be done. But for sure we need to know a bit more about k and n (or at least assume something).
Selection sort is a good choice when your array is already mostly sorted; it should perform only O(n(n-k)) swaps. If the sorted elements tend to be contiguous, then Timsort might also perform well. In neither case will you do better than O(n log n) for sufficiently small k, of course.
Adaptive sorts are a class of sorting algorithms that take advantage of existing order in their input. Insertion sort is one of the adaptive sorts, and it works well when the array is almost sorted. Of course, its worst case is O(n^2).
There are other adaptive sorts as well, such as:
Adaptive heap sort, which uses a treap to take advantage of ordered elements when building the heap; adaptive merge sort (natural merge sort); and smoothsort.
The theoretical complexity is still O(n log n), but they may perform more effectively when the data is partly sorted.
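For reference, a minimal insertion sort sketch; it runs in roughly O(n + d) time, where d is the number of inversions, so it is fast on almost-sorted input and O(n^2) in the worst case.

    #include <cstddef>
    #include <vector>

    void insertionSort(std::vector<int>& a)
    {
        for (std::size_t i = 1; i < a.size(); ++i) {
            int key = a[i];
            std::size_t j = i;
            while (j > 0 && a[j - 1] > key) {   // shift larger elements one slot right
                a[j] = a[j - 1];
                --j;
            }
            a[j] = key;                         // drop the element into its place
        }
    }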
Algorithm:
Locate the run of k contiguous elements already in order
Sort the other n-k elements
Merge the two sorted lists
Example with n=8, k=4.
['echo', 'cat', 'bat', 'board', 'hand', 'hotel', 'kilo', 'hit']
Locate the 4 contiguous elements already in order.
['echo', 'cat', 'bat', 'board', 'hand', 'hotel', 'kilo', 'hit']
(As it happens, we found 5 already in order. All the better.)
Sort the other elements
[cat, echo, hit]
Merge the two sorted lists
['bat', 'board', 'cat', 'echo', 'hand', 'hit', 'hotel', 'kilo']
Done.
The time complexities of the three steps are
O(n)
O((n-k)log(n-k))
O(n)
For any fixed ratio k/n, the second step dominates (for large enough n).
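A sketch of those three steps in C++, using the example strings above; the run-finding step looks for the longest contiguous non-decreasing run, which is one way to interpret step 1.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>

    std::vector<std::string> sortWithRun(const std::vector<std::string>& arr)
    {
        if (arr.empty()) return {};

        // Step 1: locate the longest contiguous run already in order, O(n).
        std::size_t bestStart = 0, bestLen = 1, start = 0;
        for (std::size_t i = 1; i < arr.size(); ++i) {
            if (arr[i] < arr[i - 1]) start = i;
            if (i - start + 1 > bestLen) { bestStart = start; bestLen = i - start + 1; }
        }
        std::vector<std::string> run(arr.begin() + bestStart,
                                     arr.begin() + bestStart + bestLen);

        // Step 2: sort the other n-k elements, O((n-k) log(n-k)).
        std::vector<std::string> rest(arr.begin(), arr.begin() + bestStart);
        rest.insert(rest.end(), arr.begin() + bestStart + bestLen, arr.end());
        std::sort(rest.begin(), rest.end());

        // Step 3: merge the two sorted lists, O(n).
        std::vector<std::string> result;
        std::merge(run.begin(), run.end(), rest.begin(), rest.end(),
                   std::back_inserter(result));
        return result;
    }

    int main()
    {
        std::vector<std::string> arr =
            {"echo", "cat", "bat", "board", "hand", "hotel", "kilo", "hit"};
        for (const auto& s : sortWithRun(arr)) std::cout << s << ' ';
        std::cout << '\n';   // bat board cat echo hand hit hotel kilo
    }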

An efficient sorting algorithm for almost sorted list containing time data?

The name says it all, really. I suspect that insertion sort is best, since it's the best sort for mostly-sorted data in general. However, since I know more about the data, there is a chance other sorts are worth looking at. The other relevant pieces of information are:
1) This is time data, which means I presumably could create an effective hash for ordering the data.
2) The data won't all exist at one time. Instead, I'll be reading in records which may contain a single vector, or dozens or hundreds of vectors. I want to output all times within a 5-second window, so it's possible that a sort that does the sorting as I insert the data would be a better option.
3) Memory is not a big issue, but CPU speed is, as this may be a bottleneck of the system.
Given these conditions, can anyone suggest an algorithm that may be worth considering in addition to insertion sort? Also, how does one define 'mostly sorted' when deciding what a good sort option is? What I mean is: how do I look at my data and decide 'this isn't as sorted as I thought it was; maybe insertion sort is no longer the best option'? A link to an article that defines complexity relative to the degree to which the data is sorted would be appreciated.
Thanks
Edit:
Thank you everyone for your information. I will be going with an easy insertion or merge sort (whichever I have already pre-written) for now. However, I'll be trying some of the other methods once we're closer to the optimization phase (since they take more effort to implement). I appreciate the help.
You could adopt option (2) you suggested: sort the data as you insert the elements.
Use a skip list, sorted by time in ascending order, to maintain your data.
Once a new entry arrives, check whether it is larger than the last element (easy and quick). If it is, simply append it (easy to do in a skip list); the skip list will need to add 2 nodes on average for these cases, and the append is O(1) on average.
If the element is not larger than the last element, add it to the skip list as a standard insert operation, which is O(log n).
This approach yields an O(n + k*log n) algorithm, where k is the number of elements inserted out of order.
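The standard library has no skip list, but a std::multiset with a hinted insert gives similar behaviour for this use case: appending at the end is amortized O(1), and an out-of-order insert is O(log n). A sketch, assuming timestamps stored as doubles (my choice, not from the question):

    #include <set>

    class TimeBuffer {
    public:
        void add(double t)
        {
            if (data_.empty() || t >= *data_.rbegin())
                data_.emplace_hint(data_.end(), t);   // arrived in order: amortized O(1)
            else
                data_.insert(t);                      // arrived out of order: O(log n)
        }
        const std::multiset<double>& sorted() const { return data_; }

    private:
        std::multiset<double> data_;
    };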
I would throw in merge sort: if you implement the natural version, you get a best case of O(N), with a typical and worst case of O(N log N) if you do run into any problems. With insertion sort you get a worst case of O(N^2) and a best case of O(N).
You can sort a list of size n with k elements out of place in O(n + k lg k) time.
See: http://www.quora.com/How-can-I-quickly-sort-an-array-of-elements-that-is-already-sorted-except-for-a-small-number-of-elements-say-up-to-1-4-of-the-total-whose-positions-are-known/answer/Mark-Gordon-6?share=1
The basic idea is this:
Iterate over the elements of the array, building an increasing subsequence (if the current element is greater than or equal to the last element of the subsequence, append it to the end of the subsequence. Otherwise, discard both the current element and the last element of the subsequence). This takes O(n) time.
You will have discarded no more than 2k elements since k elements are out of place.
Sort the 2k elements that were discarded using an O(k lg k) sorting algorithm like merge sort or heapsort.
You now have two sorted lists. Merge the lists in O(n) time like you would in the merge step of merge sort.
Overall time complexity = O(n + k lg k)
Overall space complexity = O(n)
(this can be modified to run in O(1) space if you can merge in O(1) space, but it's by no means trivial)
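A sketch of those steps for a vector of ints; std::sort stands in for the O(k lg k) sort of the discarded elements (it is an introsort rather than merge sort or heapsort, but the bound is the same).

    #include <algorithm>
    #include <iterator>
    #include <vector>

    std::vector<int> sortAlmostSorted(const std::vector<int>& arr)
    {
        std::vector<int> inc;        // increasing subsequence kept in order
        std::vector<int> discarded;  // at most ~2k elements set aside

        // Step 1: build the increasing subsequence, O(n).
        for (int x : arr) {
            if (inc.empty() || x >= inc.back()) {
                inc.push_back(x);
            } else {
                discarded.push_back(x);           // discard the current element...
                discarded.push_back(inc.back());  // ...and the last element of the subsequence
                inc.pop_back();
            }
        }

        // Step 2: sort the discarded elements, O(k lg k).
        std::sort(discarded.begin(), discarded.end());

        // Step 3: merge the two sorted lists, O(n).
        std::vector<int> result;
        std::merge(inc.begin(), inc.end(),
                   discarded.begin(), discarded.end(),
                   std::back_inserter(result));
        return result;
    }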
Without fully understanding the problem, Timsort may fit the bill as you're alleging that your data is mostly sorted already.
There are many adaptive sorting algorithms out there that are specifically designed to sort mostly-sorted data. Ignoring the fact that you're storing dates, you might want to look at smoothsort or Cartesian tree sort as algorithms that can sort reasonably-sorted data in worst-case O(n log n) time and best-case O(n) time. Smoothsort also has the advantage of requiring only O(1) space, like insertion sort.
Using the fact that everything is a date and therefore can be converted into an integer, you might want to look at binary quicksort (MSD radix sort) using a median-of-three pivot selection. This algorithm has best-case O(n log n) performance, but has a very low constant factor that makes it pretty competitive. Its worst case is O(n log U), where U is the number of bits in each date (probably 64), which isn't too bad.
Hope this helps!
If your OS or C library provides a mergesort function, it is very likely that it already handles the case where the given data is partially ordered (in any direction), running in O(N) time.
Otherwise, you can just copy the mergesort available from your favorite BSD operating system.