Best sorting algorithm for case where many objects have "do-not-care" relationships to each other - c++

I have an unusual sorting case that my googling has turned up little on. Here are the parameters:
1) Random access container. (C++ vector)
2) Generally small vector size (less than 32 objects)
3) Many objects have "do-not-care" relationships relative to each other, but they are not equal. (i.e. they don't care which of them appears first in the final sorted vector, but they may compare differently to other objects.) To put it a third way (if it's still unclear), the comparison function for 2 objects can return 3 results: "order is correct," "order needs to be flipped," or "do not care."
4) Equalities are possible, but will be very rare. (But this would probably just be treated like any other "do-not-care.")
5) Comparison operator is far more expensive than object movement.
6) There is no comparison speed difference for determining that objects care or don't care about each other. (i.e. I don't know of a way to make a quicker comparison that simply says whether the 2 objects care about each other or not.)
7) Random starting order.

Whatever you're going to do, given your conditions I'd make sure you draw up a big pile of test cases (e.g. get a few datasets and shuffle them a few thousand times), as I suspect it'd be easy to choose a sort that fails to meet your requirements.
The "do not care" is tricky as most sort algorithms depend on a strict ordering of the sort value - if A is 'less than or equal to' B, and B is 'less than or equal to' C, then it assumes that A is less than or equal to C -- in your case if A 'doesn't care' about B but does care about C, but B is less than C, then what do you return for the A-B comparison to ensure A will be compared to C?
For this reason, and it being small vectors, I'd recommend NOT using any of the built in methods as I think you'll get the wrong answers, instead I'd build a custom insertion sort.
Start with an empty target vector, insert the first item, then for each subsequent item scan the target vector looking for the bounds of where it can be inserted (i.e. ignoring the 'do not cares', find the last item it must go after and the first it must go before) and insert it in the middle of that gap, moving everything else along the target vector (i.e. it grows by one entry each time).
[If the comparison operation is particularly expensive, you might do better to start in the middle and scan in one direction until you hit one bound, then choose whether the other bound is found moving from that bound, or the mid point... this would probably reduce the number of comparisons, but from reading what you say about your requirements you couldn't, say, use a binary search to find the right place to insert each entry]
Yes, this is basically O(n^2), but for a small array this shouldn't matter, and you can verify that the answers are right. You can then see if any other sorts do better, but unless you can return a proper ordering for any given pair you'll get weird results...
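Here is a minimal sketch of that insertion approach, assuming a hypothetical three-way comparator; the Order enum and the names partial_insertion_sort / comp are placeholders I've introduced, not anything from the question:
#include <cstddef>
#include <vector>

// Three-way result of the expensive comparison described in the question.
enum class Order { Before, After, DontCare };

// Insertion sort respecting "don't care" pairs: each element is placed in the gap
// between the last already-placed element it must follow and the first it must
// precede. comp(a, b) returns how `a` must be ordered relative to `b`.
template <class T, class Comp>
std::vector<T> partial_insertion_sort(const std::vector<T>& input, Comp comp) {
    std::vector<T> out;
    out.reserve(input.size());
    for (const T& item : input) {
        std::size_t lower = 0;            // one past the last element `item` must go after
        std::size_t upper = out.size();   // index of the first element `item` must go before
        for (std::size_t i = 0; i < out.size(); ++i) {
            Order o = comp(item, out[i]);
            if (o == Order::After) lower = i + 1;
            else if (o == Order::Before && upper == out.size()) upper = i;
        }
        // If lower > upper the simple gap-based placement has broken down (the
        // transitivity issue discussed above); here we just fall back to `lower`.
        std::size_t pos = (lower <= upper) ? (lower + upper) / 2 : lower;
        out.insert(out.begin() + pos, item);
    }
    return out;
}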

You can't sort with a "don't care" result; it is likely to mess up the order of elements. Example:
list = {A, B, C};
where:
A doesn't care about B
B > C
A < C
So even with the don't care between A and B, B still has to end up after A, or one of these will be violated: B > C or A < C. If that situation can never happen, then you need to treat those pairs as equal instead of as don't care.

What you have there is a "partial order".
If you have an easy way to figure out the objects whose order is not "don't care" for a given object, you can tackle this with basic topological sorting.
If you have a lot of "don't care"s (i.e. if you only have a sub-quadratic number of edges in your partial ordering graph), this will be a lot faster than ordinary sorting - however, if you don't, the algorithm will be quadratic!
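For illustration, a sketch of that approach using Kahn's algorithm, assuming the "must come before" edges have already been extracted from the pairwise comparisons into an adjacency list (the function name topo_order is mine):
#include <cstddef>
#include <queue>
#include <vector>

// Kahn's algorithm: nodes are element indices, edges[u] lists the elements that
// must come after element u. Returns a valid linear extension of the partial
// order, or an empty vector if the constraints contain a cycle.
std::vector<std::size_t> topo_order(const std::vector<std::vector<std::size_t>>& edges) {
    std::size_t n = edges.size();
    std::vector<std::size_t> indegree(n, 0);
    for (const auto& adj : edges)
        for (std::size_t v : adj)
            ++indegree[v];

    std::queue<std::size_t> ready;
    for (std::size_t u = 0; u < n; ++u)
        if (indegree[u] == 0) ready.push(u);

    std::vector<std::size_t> order;
    order.reserve(n);
    while (!ready.empty()) {
        std::size_t u = ready.front();
        ready.pop();
        order.push_back(u);
        for (std::size_t v : edges[u])
            if (--indegree[v] == 0) ready.push(v);
    }
    if (order.size() != n) order.clear();   // cycle: the constraints are inconsistent
    return order;
}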

I believe a selection sort will work without modification, if you treat the "do-not-care" result as equal. Of course, the performance leaves something to be desired.

Related

What are the best sorting algorithms when 'n' is very small?

In the critical path of my program, I need to sort an array (specifically, a C++ std::vector<int64_t>, using the GNU C++ standard library). I am using the standard library's sorting algorithm (std::sort), which in this case is introsort.
I was curious about how well this algorithm performs, and when doing some research on the various sorting algorithms that different standard and third-party libraries use, I found that almost all of them care about cases where 'n' tends to be the dominant factor.
In my specific case though, 'n' is going to be on the order of 2-20 elements. So the constant factors could actually be dominant. And things like cache effects might be very different when the entire array we are sorting fits into a couple of cache lines.
What are the best sorting algorithms for cases like this where the constant factors likely overwhelm the asymptotic factors? And do there exist any vetted C++ implementations of these algorithms?
Introsort takes your concern into account, and switches to an insertion sort implementation for short sequences.
Since your STL already provides it, you should probably use that.
Insertion sort or selection sort are both typically faster for small arrays (i.e., fewer than 10-20 elements).
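For reference, a plain insertion sort of the kind introsort falls back to for short ranges might look like this (a minimal sketch, not a tuned implementation):
#include <cstdint>
#include <vector>

// Straightforward insertion sort; typically very fast for arrays of ~2-20 elements.
void insertion_sort(std::vector<std::int64_t>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        std::int64_t key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];   // shift larger elements right
            --j;
        }
        a[j] = key;
    }
}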
Watch https://www.youtube.com/watch?v=FJJTYQYB1JQ
A simple linear insertion sort is really fast. Making a heap first can improve it a bit.
Sadly the talk doesn't compare that against the hardcoded solutions for <= 15 elements.
It's impossible to know the fastest way to do anything without knowing exactly what the "anything" is.
Here is one possible set of assumptions:
We don't have any knowledge of the element structure except that elements are comparable. We have no useful way to group them into bins (for radix sort), we must implement a comparison-based sort, and comparison takes place in an opaque manner.
We have no information about the initial state of the input; any input order is equally likely.
We don't have to care about whether the sort is stable.
The input sequence is a simple array. Accessing elements is constant-time, as is swapping them. Furthermore, we will benchmark the function purely according to the expected number of comparisons - not number of swaps, wall-clock time or anything else.
With that set of assumptions (and possibly some other sets), the best algorithms for small numbers of elements will be hand-crafted sorting networks, tailored to the exact length of the input array. (These always perform the same number of comparisons; it isn't feasible to "short-circuit" these algorithms conditionally because the "conditions" would depend on detecting data that is already partially sorted, which still requires comparisons.)
For a network sorting four elements (in the known-optimal five comparisons), this might look like (I did not test this):
template<class RandomIt, class Compare>
void _compare_and_swap(RandomIt first, Compare comp, int x, int y) {
    // After this call, first[x] does not compare greater than first[y].
    if (comp(first[y], first[x])) {
        auto tmp = first[x];
        first[x] = first[y];
        first[y] = tmp;
    }
}
// Assume there are exactly four elements available at the `first` iterator.
template<class RandomIt, class Compare>
void network_sort_4(RandomIt first, Compare comp) {
    _compare_and_swap(first, comp, 0, 2);
    _compare_and_swap(first, comp, 1, 3);
    _compare_and_swap(first, comp, 0, 1);
    _compare_and_swap(first, comp, 2, 3);
    _compare_and_swap(first, comp, 1, 2);
}
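For example, a hypothetical call of the sketch above on a four-element vector:
#include <functional>
#include <vector>

int main() {
    std::vector<int> v{42, 7, 19, 3};
    // network_sort_4 as sketched above; std::less gives ascending order.
    network_sort_4(v.begin(), std::less<int>{});   // v is now {3, 7, 19, 42}
}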
In real-world environments, of course, we will have different assumptions. For small numbers of elements, with real data (but still assuming we must do comparison-based sorts) it will be difficult to beat naive implementations of insertion sort (or bubble sort, which is effectively the same thing) that have been compiled with good optimizations. It's really not feasible to reason about these things by hand, considering both the complexity of the hardware level (e.g. the steps it takes to pipeline instructions and then compensate for branch mis-predictions) and the software level (e.g. the relative cost of performing the swap vs. performing the comparison, and the effect that has on the constant-factor analysis of performance).

What's the most efficient way to store a subset of column indices of a big matrix in C++?

I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes like following:
Scan the columns of X one by one, based on some filtering rules, to identify only a subset of columns that are needed. Denote the subset of column indices by S. Its size depends on the filter, so it is unknown before computation and will change if the filtering rules are different.
Loop over S, doing some computation with a column x_i if i is in S. This step needs to be parallelized with OpenMP.
Repeat 1 and 2 for 100 times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (with length 1,000,000) to indicate needed columns for Step 1 above; then in Step 2 loop over 1 to 1,000,000, use if-else to check indicator and do computation if indicator is 1 for that column;
(b) Use std::vector for S and push_back the column index if it is identified as needed; then loop only over S, each time extracting the column index from S and then doing the computation. (I thought about this approach, but I've read that push_back is expensive if you're just storing integers.)
Since my algorithm is very time-consuming, I assume a little time saving in the basic step would mean a lot overall. So my question is, should I try (a) or (b) or other even better way for better performance (and for working with openMP)?
Any suggestions/comments for achieving better speedup are very appreciated. Thank you very much!
To me, it seems that "step #1 really does not matter much." (At the end of the day, you're going to wind up with: "a set of columns, however represented.")
To me, what's really going to matter is: "just what's gonna happen when you unleash ('parallelized ...') step #2."
"An array of 'ones and zeros,'" however large, should be fairly simple for parallelization, while a more-'advanced' data structure might well, in this case, "just get in the way."
"One thousand mega-bits, these days?" Sure. Done. No problem. ("And if not, a simple array of bit-sets.") However-many simultaneously executing entities should be able to navigate such a data structure, in parallel, with a minimum of conflict . . . Therefore, to my gut, "big bit-sets win."
I think you will find std::vector easier to use. Regarding push_back, the cost is when the vector reallocates (and maybe copies) the data. To avoid that (if it matters), you could reserve capacity for 1,000,000 entries up front (std::vector::reserve). Your vector is then 8 MB, insignificant compared to your problem size. It's only one order of magnitude bigger than a bitmap would be, and a lot simpler to deal with: if we call your index vector S, then accessing the i-th interesting column is just x[S[i]].
(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: Measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.
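A minimal sketch of option (b) combined with OpenMP; passes_filter and process_column are hypothetical stand-ins for the real filtering rule and per-column computation, not part of the question:
#include <cstddef>
#include <vector>

// Hypothetical placeholders for the real filtering rule and per-column work.
bool passes_filter(std::size_t col) { return col % 3 == 0; }
void process_column(std::size_t /*col*/) { /* expensive computation here */ }

void run_pass(std::size_t num_cols) {
    std::vector<std::size_t> S;
    S.reserve(num_cols);   // avoid reallocation during push_back

    // Step 1: collect the indices of the needed columns.
    for (std::size_t col = 0; col < num_cols; ++col)
        if (passes_filter(col))
            S.push_back(col);

    // Step 2: loop only over the selected columns, in parallel.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(S.size()); ++i)
        process_column(S[static_cast<std::size_t>(i)]);
}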

Finding the index position of the nearest value in a Fortran array

I have two sorted arrays, one containing factors (array a) that when multiplied with values from another array (array b), yields the desired value:
a(idx1) * b(idx2) = value
With idx2 known, I would like to find the idx1 of a that provides the factor necessary to get as close to value as possible.
I have looked at some different algorithms (like this one, for example), but I feel like they would all be subject to potential problems with floating point arithmetic in my particular case.
Could anyone suggest a method that would avoid this?
If I understand correctly, this expression
minloc(abs(a-value/b(idx2)))
will return the index into a of the first occurrence of the value in a which minimises the difference. I expect that the compiler will write code to scan all the elements in a, so this may not be faster in execution than a search which takes advantage of the knowledge that a and b are both sorted. In compensation, this is much quicker to write and, I expect, to debug.

Preferred way to test against many discrete values?

I have the following scenario:
variable in {12, 4, 999, ... }:
Where there are about 100 discrete values in the list. I am writing a parser to convert this to C++, and the only ways I can think of to do it are 100 case statements or 100 if == comparisons.
Is one preferred to the other, or is there an all round better way to do this?
I should clarify, the values are constant integers. Thanks
If the maximum value of any one of your discrete values is small enough, a std::vector<bool> of flags set true or false depending on whether that entry is in the list should be pretty optimal - assuming the values occur with approximately equal probability.
One way is to arrange the values in order and use binary search to check whether a value is contained in your collection.
You can either put your values in a vector in sorted order using std::lower_bound for the insertion point and then use std::binary_search to test for membership, or you can put your values in an std::set and get that feature for free (using std::set::find() for membership testing).
There are minor performance considerations that may make either option preferable; profile and decide for yourself.
A second approach is to put your values in a hash table such as std::unordered_set (or some kind of static equivalent if your values are known statically).
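For illustration, a sketch of both options with placeholder values (only 12, 4 and 999 are taken from the question; the rest is assumed):
#include <algorithm>
#include <unordered_set>
#include <vector>

// Option 1: keep the constants in a sorted vector and use binary search.
bool in_sorted_list(const std::vector<int>& sorted_values, int v) {
    return std::binary_search(sorted_values.begin(), sorted_values.end(), v);
}

int main() {
    std::vector<int> values{12, 4, 999};   // placeholder for the ~100 constants
    std::sort(values.begin(), values.end());

    // Option 2: a hash set gives average O(1) membership tests.
    std::unordered_set<int> value_set(values.begin(), values.end());

    bool a = in_sorted_list(values, 999);   // true
    bool b = value_set.count(4) != 0;       // true
    (void)a; (void)b;
}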
Assuming the values are constants, you can certainly use a switch statement. The compiler will do this pretty efficiently, using either a binary search type approach or a table [or a combination of table and binary search]. A long list of if-statements will not be as efficient, unless you sort the numbers and make a binary search type approach - a switch-statement is much easier to generate, as the compiler will sort out the best approach to decide what numbers are in the list and which ones aren't.
If the values are not constants, then a switch statement is obviously not a solution. A bitmap may work - again, depending on the actual range: if the values span a large range, then that's not a good solution, since it will use a lot of memory [but it probably is one of the fastest methods, since a lookup is just a case of dividing/modulo with a 2^n number, which can be done with simple >> and & operators, followed by one memory read].
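If the value range really is small, the bitmap idea might look like this (the range bound of 1024 is an arbitrary assumption for the sketch):
#include <bitset>
#include <cstddef>
#include <vector>

// One bit per possible value; assumes all values fall in [0, 1024).
std::bitset<1024> make_membership_bitmap(const std::vector<int>& values) {
    std::bitset<1024> present;
    for (int v : values)
        present.set(static_cast<std::size_t>(v));
    return present;
}

// Membership test is a single bit lookup.
bool contains(const std::bitset<1024>& present, int v) {
    return v >= 0 && v < 1024 && present.test(static_cast<std::size_t>(v));
}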

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
//Return if iTest appears in a set of numbers.
}
The number list is known at app load up (but is not always the same between two instances of the same application) and will not change (or be added to) throughout the whole of the program. The integers themselves may be large and have a large range, so it is not efficient to have a vector<bool>. Performance is an issue as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like it if the solution isn't a third party library, because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure that specializes in answering the question "Is this element part of the set?", with an incredibly tight memory requirement, and it is quite fast too.
The slight catch is that the answer is... special. Is this element part of the set?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such, a Bloom Filter is ideal as a doorman for a binary tree or a hash map. Carefully tuned, it will only let very few false positives pass. For example, gcc uses one.
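A minimal sketch of the idea, with a std::set as the exact fallback; the bit count, the two hash mixes, and the IntBloomFilter name are arbitrary, untuned choices of mine:
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>

class IntBloomFilter {
    static constexpr std::size_t kBits = 8192;   // arbitrary size; tune to your set
    std::bitset<kBits> bits_;

    // Two cheap, arbitrary hash mixes of the key (placeholder choices).
    static std::size_t h1(std::int64_t v) {
        return std::hash<std::uint64_t>{}(static_cast<std::uint64_t>(v)) % kBits;
    }
    static std::size_t h2(std::int64_t v) {
        return std::hash<std::uint64_t>{}(static_cast<std::uint64_t>(v) * 2654435761u + 1) % kBits;
    }

public:
    void insert(std::int64_t v) { bits_.set(h1(v)); bits_.set(h2(v)); }
    // false -> definitely not in the set; true -> maybe, so ask the exact container.
    bool maybe_contains(std::int64_t v) const { return bits_.test(h1(v)) && bits_.test(h2(v)); }
};

bool IsInList(const IntBloomFilter& bloom, const std::set<std::int64_t>& exact, std::int64_t iTest) {
    return bloom.maybe_contains(iTest) && exact.count(iTest) != 0;
}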
What comes to my mind is gperf. However, it is based on strings and not on numbers. Still, part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
At 54:30, the professor draws a picture of a standard way of doing a perfect hash. The perfect minimal hash is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot sees the same iTest value(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary (e.g. by changing the prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out, or you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or you reach the no-point-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
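A sketch of that trial-and-error process, here varying a random multiplier in a multiplicative hash until the keys land in a table with no collisions; the table size and attempt limit stand in for the configurable ratios described above, and the function name is mine:
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Try random odd multipliers until one maps every key to a distinct slot of a
// table of size `table_size`. Returns 0 if no collision-free multiplier is found,
// in which case the caller falls back to a slower but reliable structure.
std::uint64_t find_collision_free_multiplier(const std::vector<std::uint64_t>& keys,
                                             std::size_t table_size,
                                             int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::uint64_t mult = ((static_cast<std::uint64_t>(std::rand()) << 32) | std::rand()) | 1;
        std::vector<char> used(table_size, 0);
        bool ok = true;
        for (std::uint64_t k : keys) {
            std::size_t slot = static_cast<std::size_t>((k * mult) >> 32) % table_size;
            if (used[slot]) { ok = false; break; }
            used[slot] = 1;
        }
        if (ok) return mult;
    }
    return 0;
}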
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in unique - i.e. perfect - hashing.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position (binary search, or a map, or something like that) so that you can guarantee that the container will work in all cases.
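A sketch of that modulus search, here scanning candidate moduli from the smallest upward rather than from H downward as described above (the search limit and function name are assumptions of mine):
#include <cstddef>
#include <vector>

// Find the smallest modulus m (up to `limit`) such that all values stay distinct
// modulo m; values % m is then a perfect (though not minimal) hash. Returns 0 if
// none is found, signalling the fallback path described above.
std::size_t find_collision_free_modulus(const std::vector<int>& values, std::size_t limit) {
    std::size_t start = values.empty() ? 1 : values.size();
    for (std::size_t m = start; m <= limit; ++m) {
        std::vector<char> seen(m, 0);
        bool ok = true;
        for (int v : values) {
            std::size_t slot = static_cast<std::size_t>(v) % m;
            if (seen[slot]) { ok = false; break; }
            seen[slot] = 1;
        }
        if (ok) return m;
    }
    return 0;
}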
A trie or perhaps a van Emde Boas tree might be a better bet for creating a space-efficient set of integers, with lookup time being constant against the number of objects in the data structure, assuming that even std::bitset would be too large.