fastest integer sort implementation for 200-300 bit integers? - c++

What is the fastest integer sort implementation for integers of 200-300 bits? The exact integer size is fixed; I have up to 2 gigabytes of such integers (all in RAM).
I hear that it is possible to sort such a set in O(n log log M) or even O(n sqrt(log log M)) time on average, where n is the number of integers and M is the largest integer. Memory usage is limited (I may use up to 0.5-1 GB additionally). Sorting can be done in-place; it can be unstable (reorder duplicates).
Is there C/C++ implementation of such sort method, e.g. of Han & Thorup (2002)?

A Radix Sort can be used to sort data with fixed size keys. As this condition is not often met the technique isn't discussed much, but it can be O(n) when the key size is factored out.

If memory usage is truly limited, I would split each integer into bytes and store them in a trie data structure, from most significant to least significant byte. If the children at each node are kept in sorted order, you can then iterate the trie and get all your data sorted.

Signature sort works well with large word sizes, with O(n lg lg n) expected time complexity, but with small word sizes you can get the same complexity with van Emde Boas sort. More recently, Han and Thorup published an even faster sorting algorithm with O(n sqrt(lg lg n)) expected time complexity. I'm not sure whether you can find implementations of these algorithms online, but there are probably some good articles and lectures from MIT and Harvard.

I think the most reasonable thing to do is to create an array of pointers to the bigints, and sort the array of pointers. I would suggest some sort of templated quicksort, with a smart compare function.
The compare function should be able to decide most of the time by looking at the most significant 4 bytes. If they don't match, then the compare is decided. If they do match then you look at the next 4 bytes until end of int.
I am guessing that the data range is probably large enough that a radix sort would be impractical. Quicksort is generally fast enough if your data is random, and it has cache performance that beats most non-radix sorts.
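The pointer-sort idea above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the `BigKey` layout (256 bits as four 64-bit words, most significant word first) and the names `big_less` and `sort_by_pointer` are assumptions made for the example, the comparison works on 8 bytes at a time rather than 4, and `std::sort` stands in for a hand-written quicksort.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size key: 256 bits stored as four 64-bit words,
// most significant word first (words[0] holds the top 64 bits).
using BigKey = std::array<std::uint64_t, 4>;

// Compare word by word from the most significant end. For random data
// the first word almost always decides the result.
inline bool big_less(const BigKey& a, const BigKey& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) return a[i] < b[i];
    }
    return false;  // all words equal
}

// Sort an array of pointers to the keys; the keys themselves do not move.
inline void sort_by_pointer(std::vector<const BigKey*>& ptrs) {
    std::sort(ptrs.begin(), ptrs.end(),
              [](const BigKey* a, const BigKey* b) {
                  assert(a != nullptr && b != nullptr);
                  return big_less(*a, *b);
              });
}
```

Note that the pointer array costs 8 bytes per key on a 64-bit system, which matters under the stated memory limit; sorting the keys directly with the same comparison avoids that overhead at the price of moving 32 bytes per swap.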

Related

What's the most performant way to create a sorted copy of a C++ vector?

Given a C++ vector (let's say it's of doubles, and let's call it unsorted), what is the most performant way to create a new vector sorted which contains a sorted copy of unsorted?
Consider the following naïve solution:
std::vector<double> sorted = unsorted;
std::sort(sorted.begin(), sorted.end());
This solution has two steps:
Create an entire copy of unsorted.
Sort it.
However, there is potentially a lot of wasted effort in the initial copy of step 1, particularly for a large vector that is (for example) already mostly sorted.
Were I writing this code by hand, I could combine the first pass of my sorting algorithm with step 1, by having the first pass read the values from the unsorted vector while writing them, partially sorted as necessary, into sorted. Depending on the algorithm, subsequent steps might then work just with the data in sorted.
Is there a way to do such a thing with the C++ standard library, Boost, or a third-party cross-platform library?
One important point would be to ensure that the memory for the sorted C++ vector isn't needlessly initialized to zeroes before the sorting begins. Many sorting algorithms would require immediate random-write access to the sorted vector, so using reserve() and push_back() won't work for that first pass, yet resize() would waste time initializing the vector.
Edit: As the answers and comments don't necessarily see why the "naïve solution" is inefficient, consider the case where the unsorted array is in fact already in sorted order (or just needs a single swap to become sorted). In that case, regardless of the sort algorithm, with the naïve solution every value will need to be read at least twice--once while copying, and once while sorting. But with a copy-while-sorting solution, the number of reads could potentially be halved, and thus the performance approximately doubled. A similar situation arises, regardless of the data in unsorted, when using sorting algorithms that are more performant than std::sort (which may be O(n) rather than O(n log n)).
The standard library - on purpose - doesn't have a sort-while-copying function, because the copy is O(n) while std::sort is O(n log n).
So the sort will totally dominate the cost for any larger values of n. (And if n is small, it doesn't matter anyway).
Assuming the vector of doubles doesn't contain special values like NaN or infinity, the doubles can be treated as 64-bit sign + magnitude integers and sorted with a radix sort, which is fastest. These sign + magnitude integers first need to be converted into 64-bit unsigned integers. The following macros convert back and forth; SM stands for sign + magnitude and ULL for unsigned long long (uint64_t). They assume the bits of each double have been reinterpreted as an unsigned long long (a bit-for-bit copy, not a numeric conversion):
#define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
#define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))
Note that using these macros will treat negative zero as less than positive zero, but this is normally not an issue.
Since radix sort needs an initial read pass to generate a matrix of counts (which are then converted into the starting or ending indices of logical bucket boundaries), then in this case, the initial read pass would be a copy pass that also generates the matrix of counts. A base 256 sort would use a matrix of size [8][256], and after the copy, 8 radix sort passes would be performed. If the vector is much larger than cache size, then the dominant time factor will be the random access writes during each radix sort pass.
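A minimal sketch of that copy-plus-count pass followed by 8 base-256 passes is given here. The function names (`to_key`, `from_key`, `sorted_copy`) are made up for the example, and it assumes a 64-bit IEEE-754 `double`. For simplicity the two working buffers are ordinary zero-initialized vectors, so it does not address the uninitialized-memory concern raised in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes 64-bit double");

// Same mapping as the SM2ULL / ULL2SM macros, written as functions.
inline std::uint64_t to_key(double d) {
    std::uint64_t x;
    std::memcpy(&x, &d, sizeof x);  // reinterpret the bits, no numeric conversion
    return x ^ (((~x >> 63) - 1) | 0x8000000000000000ull);
}

inline double from_key(std::uint64_t x) {
    x ^= ((x >> 63) - 1) | 0x8000000000000000ull;
    double d;
    std::memcpy(&d, &x, sizeof d);
    return d;
}

std::vector<double> sorted_copy(const std::vector<double>& unsorted) {
    const std::size_t n = unsorted.size();
    std::vector<std::uint64_t> a(n), b(n);
    std::size_t count[8][256] = {};

    // Copy pass: convert each value and count all 8 byte positions at once.
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t k = to_key(unsorted[i]);
        a[i] = k;
        for (int p = 0; p < 8; ++p) ++count[p][(k >> (8 * p)) & 0xFF];
    }

    // Turn the counts into starting indices of each bucket.
    for (int p = 0; p < 8; ++p) {
        std::size_t sum = 0;
        for (int v = 0; v < 256; ++v) {
            const std::size_t c = count[p][v];
            count[p][v] = sum;
            sum += c;
        }
        assert(sum == n);
    }

    // Eight stable passes, least significant byte first.
    for (int p = 0; p < 8; ++p) {
        for (std::size_t i = 0; i < n; ++i) {
            const std::uint64_t k = a[i];
            b[count[p][(k >> (8 * p)) & 0xFF]++] = k;
        }
        a.swap(b);
    }

    // After an even number of passes the sorted keys are back in `a`.
    std::vector<double> sorted(n);
    for (std::size_t i = 0; i < n; ++i) sorted[i] = from_key(a[i]);
    return sorted;
}
```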

How to efficiently look up elements in a large vector

I have a vector<unsigned> of size (90,000 * 9,000). I need to check many times whether an element exists in this vector.
To do so, I stored the vector in sorted form using std::sort() and then looked up elements using std::binary_search(). However, on profiling using perf I find that looking up elements in vector<unsigned> is the slowest operation.
Can someone suggest a data structure in C/C++ that I can use to efficiently look up elements in a vector of (90,000 * 9,000) elements?
I perform insertion (bulk-insertion) only once. The rest of the times I perform only lookups, so the main overhead here is because of lookups.
You've got 810 million values out of 4 billion possible values (assuming 32-bit unsigned). That's 1/5th of the total range, and uses 3.2 GB. This means you're in fact better off with a std::vector<bool> with 4 billion bits. This gives you O(1) lookup in less space (0.5 GB).
(In theory, unsigned could be 16 bits. unsigned long is at least 32 bits, std::uint32_t might be what you want)
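A small sketch of the bitmap approach, assuming the values fit in `std::uint32_t`. The class name `PresenceBitmap` and its interface are hypothetical; the universe size is a constructor parameter so the example can be tried on a small range first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per possible value: bit v is set if v is present.
class PresenceBitmap {
public:
    explicit PresenceBitmap(std::size_t universe_size)
        : bits_(universe_size, false) {}

    void insert(std::uint32_t v) {
        assert(v < bits_.size());
        bits_[v] = true;
    }

    // O(1) lookup; values outside the universe are reported as absent.
    bool contains(std::uint32_t v) const {
        return v < bits_.size() && bits_[v];
    }

private:
    std::vector<bool> bits_;
};
```

For the full 32-bit range one would construct it as `PresenceBitmap seen(std::size_t{1} << 32);`, which needs a 64-bit `std::size_t` and about 0.5 GB of memory.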
Depending on the underlying data structure, the contains operation may take O(N) or O(1). It is O(N) if the container is backed by an unsorted array or a linked list, in which case contains is a full scan in the worst case. You have mitigated the full scan by sorting and using binary search, which is O(log N). O(log N) is pretty good complexity, with only O(1) being better. So your choices are:
Cache lookup results for the items; this might be a good compromise if you have many repetitions of the same element
Replace the vector with another data structure with an efficient contains operation, such as one based on a hash table or set. Note you may lose other features, such as ordering of items
Use two data structures, one for contains operations and the original vector for whatever else you use it for
Use a third data structure that offers a compromise, for example a data structure that works well with a Bloom filter
"However on profiling using perf I find that looking up elements in vector<unsigned> is the slowest operation."
That is half of the information you need, the other half being "how fast is it compared to other algorithms/containers"? Maybe using std::vector<> is actually the fastest, or maybe it's the slowest. To find out, you'll have to benchmark/profile a few different designs.
For example, the following are very naive benchmarks using random integers on 1000x9000 sized containers (I would get seg-faults on larger sizes for the maps, presumably a limit of 32-bit memory).
If you need a count of non-unique integers:
std::vector<unsigned> = 500 ms
std::map<unsigned, unsigned> = 1700 ms
std::unordered_map<unsigned, unsigned> = 3700 ms
If you just need to test for the presence of unique integers:
std::vector<bool> = 15 ms
std::bitset<> = 50 ms
std::set<unsigned> = 350 ms
Note that we're not too interested in the exact values but rather in the relative comparisons between containers. std::map<> is relatively slow, which is not surprising given the number of dynamic allocations and the non-locality of the data involved. The bitsets are by far the fastest but don't work if you need the counts of non-unique integers.
I would suggest doing a similar benchmark using your exact container sizes and contents, both of which may well affect the benchmark results. It may turn out that std::vector<> may be the best solution after all but now you have some data to back up that design choice.
If you do not need to iterate through the collection in sorted order, since C++11 you can use std::unordered_set<yourtype>. All you need to do is provide the collection with a way of getting hashing and equality information for yourtype. Element lookup is then O(1) on average, unlike a sorted vector where it's O(log(n)).

Can we know if a collection is almost sorted without applying a sort algorithm?

In the wikipedia article on sorting algorithms,
http://en.wikipedia.org/wiki/Sorting_algorithm#Summaries_of_popular_sorting_algorithms
under Bubble sort it says: "Bubble sort can also be used efficiently on a list of any length that is nearly sorted (that is, the elements are not significantly out of place)"
So my question is: without sorting the list using a sorting algorithm first, how can one know whether it is nearly sorted or not?
Are you familiar with the general sorting lower bound? You can prove that in a comparison-based sorting algorithm, any sorting algorithm must make Ω(n log n) comparisons in the average case. The way you prove this is through an information-theoretic argument. The basic idea is that there are n! possible permutations of the input array, and since the only way you can learn about which permutation you got is to make comparisons, you have to make at least lg n! comparisons in order to be certain that you know the structure of your input permutation.
I haven't worked out the math on this, but I suspect that you could make similar arguments to show that it's difficult to learn how sorted a particular array is. Essentially, if you don't do a large number of comparisons, then you wouldn't be able to tell apart an array that's mostly sorted from an array that is actually quite far from sorted. As a result, all the algorithms I'm aware of that measure "sortedness" take a decent amount of time to do so.
For example, one measure of the level of "sortedness" in an array is the number of inversions in that array. You can count the number of inversions in an array in time O(n log n) using a divide-and-conquer algorithm based on mergesort, but with that runtime you could just sort the array instead.
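As an illustration of that divide-and-conquer count, here is a minimal sketch built on merge sort. The names `count_inversions` and `merge_count` are made up for the example, and it works on `int` for brevity. The input is taken by value so the caller's array is left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts a[lo, hi) and returns the number of inversions inside it.
inline std::uint64_t merge_count(std::vector<int>& a, std::vector<int>& tmp,
                                 std::size_t lo, std::size_t hi) {
    assert(lo <= hi);
    if (hi - lo < 2) return 0;
    const std::size_t mid = lo + (hi - lo) / 2;
    std::uint64_t inv = merge_count(a, tmp, lo, mid) + merge_count(a, tmp, mid, hi);

    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[j] < a[i]) {
            // a[j] is smaller than every remaining element of the left half.
            inv += mid - i;
            tmp[k++] = a[j++];
        } else {
            tmp[k++] = a[i++];
        }
    }
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi) tmp[k++] = a[j++];
    for (k = lo; k < hi; ++k) a[k] = tmp[k];
    return inv;
}

// Number of pairs (i, j) with i < j and a[i] > a[j], in O(n log n).
inline std::uint64_t count_inversions(std::vector<int> a) {
    std::vector<int> tmp(a.size());
    return merge_count(a, tmp, 0, a.size());
}
```

A sorted array gives 0 and a strictly descending array of length n gives n(n-1)/2, so the result can be compared against those two extremes.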
Typically, the way that you'd know that your array was mostly sorted was to know something a priori about how it was generated. For example, if you're looking at temperature data gathered from 8AM - 12PM, it's very likely that the data is already mostly sorted (modulo some variance in the quality of the sensor readings). If your data looks at a stock price over time, it's also likely to be mostly sorted unless the company has a really wonky trajectory. Some other algorithms also partially sort arrays; for example, it's not uncommon for quicksort implementations to stop sorting when the size of the array left to sort is small and to follow everything up with a final insertion sort pass, since every element won't be very far from its final position then.
I don't believe there exists any standardized measure of how sorted or random an array is.
You can come up with your own measure - like count the number of adjacent pairs which are out of order (suggested in comment), or count the number of larger numbers which occur before smaller numbers in the array (this is trickier than a simple single pass).
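The first of those measures, counting adjacent pairs that are out of order, takes a single pass. A minimal sketch, with the hypothetical name `count_descents`:

```cpp
#include <cstddef>
#include <vector>

// Counts positions i where a[i] < a[i - 1]. Zero means the array is sorted.
inline std::size_t count_descents(const std::vector<int>& a) {
    std::size_t descents = 0;
    for (std::size_t i = 1; i < a.size(); ++i) {
        if (a[i] < a[i - 1]) ++descents;
    }
    return descents;
}
```

This is a coarse measure: a sorted array with its smallest element moved to the end has only one descent, even though that element is as far from its final position as it can be.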

An efficient sorting algorithm for almost sorted list containing time data?

The name says it all really. I suspect that insertion sort is best, since it's the best sort for mostly-sorted data in general. However, since I know more about the data, there is a chance there are other sorts worth looking at. So the other relevant pieces of information are:
1) This is time data, which means I presumably could create an effective hash for ordering the data.
2) The data won't all exist at one time. Instead I'll be reading in records which may contain a single vector, or dozens or hundreds of vectors. I want to output all times within a 5 second window. So it's possible that a sort that does the sorting as I insert the data would be a better option.
3) Memory is not a big issue, but CPU speed is, as this may be a bottleneck of the system.
Given these conditions, can anyone suggest an algorithm that may be worth considering in addition to insertion sort? Also, how does one define 'mostly sorted' to decide what is a good sort option? What I mean by that is: how do I look at my data and decide 'this isn't as sorted as I thought it was, maybe insertion sort is no longer the best option'? Any link to an article that better defines the complexity relative to the degree to which the data is sorted would be appreciated.
Thanks
Edit:
Thank you everyone for your information. I will be going with an easy insertion or merge sort (whichever I have already pre-written) for now. However, I'll be trying some of the other methods once we're closer to the optimization phase (since they take more effort to implement). I appreciate the help.
You could adopt option (2) you suggested - sort the data while you insert elements.
Use a skip list, sorted by time in ascending order, to maintain your data.
When a new entry arrives, check whether it is larger than the last element (easy and quick). If it is, simply append it (easy to do in a skip list). The skip list will need to add 2 nodes on average in these cases, so the append is O(1) on average.
If the element is not larger than the last element, add it to the skip list with a standard insert operation, which is O(log n).
This approach gives an O(n + k log n) algorithm, where k is the number of elements inserted out of order.
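The C++ standard library has no skip list, so the sketch here swaps in `std::multiset` (typically a red-black tree) to show the same insert pattern: passing `end()` as the insertion hint makes in-order arrivals amortized O(1), while out-of-order arrivals fall back to an O(log n) insert. The class name `TimeWindowBuffer`, the `drain_until` method, and the use of `std::int64_t` timestamps are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

class TimeWindowBuffer {
public:
    // Hinting at end() is the fast path when timestamp >= current maximum;
    // the hint is only advisory, so other values are still placed correctly.
    void insert(std::int64_t timestamp) {
        items_.insert(items_.end(), timestamp);
    }

    // Removes and returns, in ascending order, every timestamp <= cutoff.
    std::vector<std::int64_t> drain_until(std::int64_t cutoff) {
        const auto last = items_.upper_bound(cutoff);
        std::vector<std::int64_t> out(items_.begin(), last);
        items_.erase(items_.begin(), last);
        return out;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::multiset<std::int64_t> items_;
};
```

Calling `drain_until` with the end of each 5 second window would emit that window's data already sorted.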
I would throw in merge sort. If you implement the natural version, you get a best case of O(N), with a typical and worst case of O(N log N) if the data turns out to be less sorted than expected. With insertion sort you get a worst case of O(N^2) and a best case of O(N).
You can sort a list of size n with k elements out of place in O(n + k lg k) time.
See: http://www.quora.com/How-can-I-quickly-sort-an-array-of-elements-that-is-already-sorted-except-for-a-small-number-of-elements-say-up-to-1-4-of-the-total-whose-positions-are-known/answer/Mark-Gordon-6?share=1
The basic idea is this:
Iterate over the elements of the array, building an increasing subsequence (if the current element is greater than or equal to the last element of the subsequence, append it to the end of the subsequence. Otherwise, discard both the current element and the last element of the subsequence). This takes O(n) time.
You will have discarded no more than 2k elements since k elements are out of place.
Sort the 2k elements that were discarded using an O(k lg k) sorting algorithm like merge sort or heapsort.
You now have two sorted lists. Merge the lists in O(n) time like you would in the merge step of merge sort.
Overall time complexity = O(n + k lg k)
Overall space complexity = O(n)
(this can be modified to run in O(1) space if you can merge in O(1) space, but it's by no means trivial)
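The steps above can be sketched as follows. This is an illustration of the O(n) space variant only; the function name `sort_nearly_sorted` is made up, and `std::sort` (O(k lg k) in the worst case) stands in for the merge sort or heapsort mentioned in the description.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

inline std::vector<int> sort_nearly_sorted(const std::vector<int>& input) {
    std::vector<int> kept;       // invariant: always non-decreasing
    std::vector<int> discarded;  // at most 2k elements
    kept.reserve(input.size());

    for (int x : input) {
        if (kept.empty() || x >= kept.back()) {
            kept.push_back(x);
        } else {
            // Out of order: discard both the current element and the
            // last element of the subsequence.
            discarded.push_back(x);
            discarded.push_back(kept.back());
            kept.pop_back();
        }
    }

    std::sort(discarded.begin(), discarded.end());

    // Merge the two sorted lists in O(n).
    std::vector<int> result;
    result.reserve(input.size());
    std::merge(kept.begin(), kept.end(), discarded.begin(), discarded.end(),
               std::back_inserter(result));
    return result;
}
```

For the input 1 2 9 3 4 5 0 6, the pass keeps 1 2 4 6 and discards 3 9 0 5; sorting the discards and merging gives 0 1 2 3 4 5 6 9.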
Without fully understanding the problem, Timsort may fit the bill as you're alleging that your data is mostly sorted already.
There are many adaptive sorting algorithms out there that are specifically designed to sort mostly-sorted data. Ignoring the fact that you're storing dates, you might want to look at smoothsort or Cartesian tree sort as algorithms that can sort data that is reasonably sorted in worst-case O(n log n) time and best-case O(n) time. Smoothsort also has the advantage of requiring only O(1) space, like insertion sort.
Using the fact that everything is a date and therefore can be converted into an integer, you might want to look at binary quicksort (MSD radix sort) using a median-of-three pivot selection. This algorithm has best-case O(n log n) performance, but has a very low constant factor that makes it pretty competitive. Its worst case is O(n log U), where U is the number of bits in each date (probably 64), which isn't too bad.
Hope this helps!
If your OS or C library provides a mergesort function, it is very likely that it already handles the case where the data given is partially ordered (in any direction) running in O(N) time.
Otherwise, you can just copy the mergesort available from your favorite BSD operating system.

Algorithm to find a duplicate entry in constant space and O(n) time

Given an array of N integers such that only one integer is repeated, find the repeated integer in O(n) time and constant space. There is no bound on the value of the integers or on the value of N.
For example given an array of 6 integers as 23 45 67 87 23 47. The answer is 23
(I hope this covers ambiguous and vague part)
I searched on the net but was unable to find any such question in which range of integers was not fixed.
Also, here is an example that answers a question similar to mine, but there he created a hash table sized by the highest integer value in C++. But C++ does not allow creating an array with 2^64 elements (on a 64-bit computer).
I am sorry I didn't mention it before: the array is immutable.
Jun Tarui has shown that any duplicate finder using O(log n) space requires at least Ω(log n / log log n) passes, which exceeds linear time. I.e. your question is provably unsolvable even if you allow logarithmic space.
There is an interesting algorithm by Gopalan and Radhakrishnan that finds duplicates in one pass over the input and O((log n)^3) space, which sounds like your best bet a priori.
Radix sort has time complexity O(kn) where k > log_2 n often gets viewed as a constant, albeit a large one. You cannot implement a radix sort in constant space obviously, but you could perhaps reuse your input data's space.
There are numerical tricks if you assume features about the numbers themselves. If almost all numbers between 1 and n are present, then simply add them up and subtract n(n+1)/2. If all the numbers are primes, you could cheat by ignoring the running time of division.
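The summation trick can be sketched under a stricter assumption than the one stated: the array has n + 1 entries and contains every number from 1 to n exactly once, plus one repeated value. The function name `find_duplicate_by_sum` is made up, and the sketch also assumes n is small enough that n(n+1) fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumes a holds each of 1..n once plus one repeated value (n + 1 entries).
// One pass, O(1) extra space, and the array is not modified.
inline std::uint64_t find_duplicate_by_sum(const std::vector<std::uint64_t>& a) {
    assert(a.size() >= 2);
    const std::uint64_t n = a.size() - 1;
    std::uint64_t sum = 0;
    for (std::uint64_t x : a) sum += x;
    return sum - n * (n + 1) / 2;  // what is left over is the repeated value
}
```

For example, with 1 2 3 4 2 the sum is 12, n(n+1)/2 is 10, and the difference is the duplicate, 2.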
As an aside, there is a well-known lower bound of Ω(log_2(n!)) on comparison sorting, which suggests that google might help you find lower bounds on simple problems like finding duplicates as well.
If the array isn't sorted, you can only do it in O(nlogn).
Some approaches can be found here.
If the range of the integers is bounded, you can perform a counting sort variant in O(n) time. The space complexity is O(k) where k is the upper bound on the integers(*), but that's a constant, so it's O(1).
If the range of the integers is unbounded, then I don't think there's any way to do this, but I'm not an expert at complexity puzzles.
(*) It's O(k) since there's also a constant upper bound on the number of occurrences of each integer, namely 2.
In the case where the entries are bounded by the length of the array, then you can check out Find any one of multiple possible repeated integers in a list and the O(N) time and O(1) space solution.
The generalization you mention is discussed in this follow up question: Algorithm to find a repeated number in a list that may contain any number of repeats and the O(n log^2 n) time and O(1) space solution.
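For the bounded case, one well-known technique that runs in O(N) time and O(1) space without modifying the array is Floyd's cycle detection; whether it is the solution the linked question refers to is an assumption on my part. The sketch assumes the array has n + 1 entries, each in the range 1 to n, and treats each value as a "next index" pointer. The function name `find_repeated` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumes a.size() == n + 1 and every entry is in [1, n].
// Following i -> a[i] from index 0 must enter a cycle, and the entry
// point of that cycle is a repeated value.
inline std::size_t find_repeated(const std::vector<std::size_t>& a) {
    assert(a.size() >= 2);
    std::size_t slow = 0, fast = 0;
    do {  // phase 1: find a meeting point inside the cycle
        slow = a[slow];
        fast = a[a[fast]];
    } while (slow != fast);

    slow = 0;  // phase 2: walk both at the same speed to the cycle entry
    while (slow != fast) {
        slow = a[slow];
        fast = a[fast];
    }
    return slow;
}
```

This does not help with the original question, where the values are unbounded and so cannot be used as indices.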
The approach that would come closest to O(N) in time is probably a conventional hash table, where the hash entries are simply the numbers, used as keys. You'd walk through the list, inserting each entry in the hash table, after first checking whether it was already in the table.
Not strictly O(N), however, since hash search/insertion gets slower as the table fills up. And in terms of storage it would be expensive for large lists -- at least 3x and possibly 10-20x the size of the array of numbers.
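A minimal sketch of the hash table approach using `std::unordered_set` (requires C++17 for `std::optional`). The function name `first_duplicate` is made up. It runs in expected O(N) time but uses O(N) extra space, so it does not meet the constant-space requirement.

```cpp
#include <optional>
#include <unordered_set>
#include <vector>

// Returns the first value that is seen a second time, if any.
inline std::optional<long long> first_duplicate(const std::vector<long long>& values) {
    std::unordered_set<long long> seen;
    seen.reserve(values.size());  // avoid rehashing as the table fills up
    for (long long v : values) {
        // insert() reports via .second whether v was newly added.
        if (!seen.insert(v).second) return v;
    }
    return std::nullopt;
}
```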
As was already mentioned by others, I don't see any way to do it in O(n).
However, you can try a probabilistic approach by using a Bloom Filter. It will give you O(n) if you are lucky.
Since extra space is not allowed, this can't be done without comparisons. The lower bound on the time complexity of comparison sorting can be applied here to prove that the problem in its original form can't be solved in O(n) in the worst case.
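A simple alternative is to sort first and then scan adjacent pairs. The scan is O(n), but the sort makes the whole thing O(n log n), and it modifies the array, so it does not meet the constraints in the question:
import java.util.Arrays;

public class DuplicateInOnePass {
    public static void duplicate() {
        int[] ar = {6, 7, 8, 8, 7, 9, 9, 10};
        Arrays.sort(ar); // O(n log n); equal values become adjacent
        for (int i = 0; i < ar.length - 1; i++) {
            // After sorting, any duplicate sits next to its twin.
            if (ar[i] == ar[i + 1]) {
                System.out.println("Duplicate element: " + ar[i]);
            }
        }
    }

    public static void main(String[] args) {
        duplicate();
    }
}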