Map from integer ranges to arbitrary single integers - c++

Working in C++ in a Linux environment, I have a situation where a number of integer ranges are defined, and integer inputs map to different arbitrary integers based on which range they fall into. None of the ranges overlap, and they aren't always contiguous.
The "simplest" way to solve this problem is with a bunch of if-statements for each range, but the number of ranges, their bounds, and the target values can all vary, so if-statements aren't maintainable.
For example, the ranges might be [0, 70], called r_a, [101, 150], call it r_b, and [201, 400], call it r_c. Inputs in r_a map to 1, in r_b map to 2, and r_c map to 3. Anything not in r_a, r_b, r_c maps to 0.
I can come up with a data structure & algorithm that stores tuples of (bounds, map target) and iterates through them, so finding the target value takes linear time in the number of bounds pairs. I can also imagine a scheme that keeps the pairs ordered and uses a binary sort-ish algorithm against all the lower bounds (or upper bounds), finds the closest to the input, then compares against the opposing bound.
Is there a better way to accomplish the mapping than a binary-search based algorithm? Even better, is there some C++ library out that does this already?

The best approach here is indeed a binary search, but any efficient order-based search will do perfectly well. You don't really have to implement the search and the data structure explicitly. You can use it indirectly by employing a standard associative container instead.
Since your ranges don't overlap, the solution is very simple. You can immediately use a std::map for this problem to solve it in just a few lines of code.
For example, this is one possible approach. Let's assume that we are mapping an [ int, int ] range to an int value. Let's represent our ranges as closed-open ranges, i.e. if the original range is [0, 70], let's consider a [0, 71) range instead. Also, let's use the value of 0 as a "reserved" value that means "no mapping" (as you requested in your question)
const int EMPTY = 0;
All you need to do is to declare a map from int to int:
typedef std::map<int, int> Map;
Map map;
and fill it with each end of your closed-open ranges. The left (closed) end should be mapped to the desired value the entire range is mapped to, while the right (open) end should be mapped to our EMPTY value. For your example, it will look as follows
map[0] = r_a;
map[71] = EMPTY;
map[101] = r_b;
map[251] = EMPTY;
map[260] = r_c; // 260 adjusted from 201
map[401] = EMPTY;
(I adjusted your last range, since in your original example it overlapped the previous range, and you said that your ranges don't overlap).
That's it for initialization.
Now, in order to determine where a given value of i maps to all you need to do is
Map::iterator it = map.upper_bound(i);
If it == map.begin(), then i is not in any range. Otherwise, do
--it;
If the it->second (for the decremented it) is EMPTY, then i is not in any range.
The combined "miss" check might look as follows
Map::iterator it = map.upper_bound(i);
if (it == map.begin() || (--it)->second == EMPTY)
/* Missed all ranges */;
Otherwise, it->second (for the decremented it) is your mapped value
int mapped_to = it->second;
Note that if the original ranges were "touching", as in [40, 60] and [61, 100], then the closed-open ranges will look as [40, 61) and [61, 101) meaning that the value of 61 will be mapped twice during map initialization. In this case it is important to make sure that the value of 61 is mapped to the proper destination value and not to the value of EMPTY. If you map the ranges as shown above in the left-to-right (i.e. increasing) order it will work correctly by itself.
Note, that only the endpoints of the ranges are inserted into the map, meaning that the memory consumption and the performance of the search depends only on the total number of ranges and completely independent of their total length.
If you wish, you can add a "guard" element to the map during the initialization
map[INT_MIN] = EMPTY;
(it corresponds to "negative infinity") and the "miss" check will become simpler
Map::iterator it = map.upper_bound(i);
assert(it != map.begin());
if ((--it)->second == EMPTY)
/* Missed all ranges */;
but that's just a matter of personal preference.
Of course, if you just want to return 0 for non-mapped values, you don't need to carry out any checking at all. Just take the it->second from the decremented iterator and you are done.

I would use a very simple thing: a std::map.
class Range
{
public:
explicit Range(int item); // [item,item]
Range(int low, int high); // [low,high]
bool operator<(const Range& rhs) const
{
if (mLow < rhs.mLow)
{
assert(mHigh < rhs.mLow); // sanity check
return true;
}
return false;
} // operator<
int low() const { return mLow; }
int high() const { return mHigh; }
private:
int mLow;
int mHigh;
}; // class Range
Then, let's have a map:
typedef std::map<Range, int> ranges_type;
And write a function that search in this map:
int find(int item, const ranges_type& ranges)
{
ranges_type::const_iterator it = ranges.lower_bound(Range(item));
if (it != ranges.end() && it->first.low() <= item)
return it->second;
else
return 0; // No mapping ?
}
Main benefits:
Will check that the ranges effectively don't overlap during insertion in the set (you can make it so that it's only in debug mode)
Supports edition of the Ranges on the fly
Finding is fast (binary search)
If the ranges are frozen (even if their values are not), you may wish to use Loki::AssocVector to reduce the memory overhead and improve performance a bit (basically, it's a sorted vector with the interface of a map).

Wouldn't a simple array be enough? You're not saying how many items you have, but by far the fastest data structure is a simple array.
If the ranges are:
0..9 --> 25
10..19 --> 42
Then the array would simply be like this:
[25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42]

You can have two sorted arrays: one for lower bounds, one for upper bounds. Use std::lower_bound(lower_bound_array, value) and std::upper_bound(upper_bound_array, value). If the index of both results is the same, return index + 1. Otherwise, return 0.
If the indices returned match, it means that the value is >= the lower bound and < the upper bound. If they don't, then you are in between ranges.

The ideal is an interval tree (specialized binary tree). Wikipedia describes the method completely. Better than I. You won't get much more optimal than this, without sacrificing space for performance.

Your example ranges overlap, but the question says they wont. I'll assume the range is a typo. You could, could, store the destinations in an array and use the indices as the ranges. It's pretty easy, but ugly and not very maintainable. You'd need to initialize the array to 0, then for each range, iterate over those indices and set each index to the destination value. Very ugly, but constant lookup time so maybe useful if the numbers don't get too high and the ranges don't change very often.

Record the limits into a set (or map). When you call insert you will have a return value which is a pair. An iterator and a boolean. If the boolean is true then a new element is created which you have to remove later. After that step one with the iterator and look at what you have found.
http://www.cplusplus.com/reference/stl/set/insert/ See Return value

It's 1-dimensional spatial index. Quadtree-style binary tree will do, for example - and there are several other widely used methods.

A simple Linked List containing the range entries should be quick enough, even for say 50-100 ranges. Moreover, you could implement a Skip List, on say the upper bounds, to speed up these range queries. Yet another possibility is an Interval Tree.
Ultimately I'd choose the simplest: binary search.

You may find Minimal Perfect Hashing Function useful, http://cmph.sourceforge.net/.

Related

CodeSignal: Execution time limit exceeded with c++

I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
int firstDuplicate(vector<int> a) {
vector<int> seen;
for (size_t i = 0; i < a.size(); ++i){
if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
return a[i];
}else{
seen.push_back(a[i]);
}
}
if (seen == a){
return -1;
}
}
Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be use a hash table. In C++, there are the unordered_map and unordered_set classes that are for such types of coding exercises. The unordered_set is effectively a map of keys to bools.
Also, pass you vector by reference, not value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison seems costly and unnecessary at the end.
This is probably closer to what you want:
#include <unordered_set>
int firstDuplicate(const vector<int>& a) {
std::unordered_set<int> seen;
for (int i : a) {
auto result_pair = seen.insert(i);
bool duplicate = (result_pair.second == false);
if (duplicate) {
return (i);
}
}
return -1;
}
std::find is linear time complexity in terms of distance between first and last element (or until the number is found) in the container, thus having a worst-case complexity of O(N), so your algorithm would be O(N^2).
Instead of storing your numbers in a vector and searching for it every time, Yyu should do something like hashing with std::map to store the numbers encountered and return a number if while iterating, it is already present in the map.
std::map<int, int> hash;
for(const auto &i: a) {
if(hash[i])
return i;
else
hash[i] = 1;
}
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since insertion time complexity is constant in average case as compared to logarithmic insertion complexity for std::map.
It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 25620. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
std::vector<bool> seen(input.size()+1);
for (auto i : input) {
if (seen[i])
return i;
seen[i] = true;
}
return -1;
}
The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization to store bools in only one bit apiece. This way we're storing only one bit for each number of input, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...

Most efficient way to find index of matching values in two sorted arrays using C++

I currently have a solution but I feel it's not as efficient as it could be to this problem, so I want to see if there is a faster method to this.
I have two arrays (std::vectors for example). Both arrays contain only unique integer values that are sorted but are sparse in value, ie: 1,4,12,13... What I want to ask is there fast way I can find the INDEX to one of the arrays where the values are the same. For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value index is 1 in array2. The index into the array is what is important as I have other arrays that contain data that will use this index that "matches".
I am not confined to using arrays, maps are possible to. I am only comparing the two arrays once. They will not be reused again after the first matching pass. There can be small to large number of values (300,000+) in either array, but DO NOT always have the same number of values (that would make things much easier)
Worse case is a linear search O(N^2). Using map would get me better O(log N) but I would still have convert an array to into a map of value, index pairs.
What I currently have to not do any container type conversions is this. Loop over the smaller of the two arrays. Compare current element of small array (array1) with the current element of large array (array2). If array1 element value is larger than array2 element value, increment the index for array2 until is it no longer larger than array1 element value (while loop). Then, if array1 element value is smaller than array2 element, go to next loop iteration and begin again. Otherwise they must be equal and I have my index to either arrays of the matching value.
So in this loop, I am at best O(N) if all values have matches and at worse O(2N) if none match. So I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would way I would lean more toward most of the arrays will mostly have matches than not.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};
for(unsigned int i=0, z=0; i < array1.size(); i++)
{
int value1 = array1[i];
while(value1 > array2[z] && z < array2.size())
z++;
if (z >= array2.size())
break; // reached end of array2
if (value1 < array2[z])
continue;
// we have a match, i and z indices have same value
}
Result will be matching indexes for array1 = [1,3] and for array2= [2,3]
I wrote an implementation of this function using an algorithm that performs better with sparse distributions, than the trivial linear merge.
For distributions, that are similar†, it has O(n) complexity but ranges where the distributions are greatly different, it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case isn't better than O(n log n). On the other hand, I haven't been able to find that worst case either.
I templated it so that any type of ranges can be used, such as sub-ranges or raw arrays. Technically it works with non-random access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distribution, I mean that the pair of arrays have many crossings. By crossing, I mean a point where you would switch from one array to another if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>
// helper structure for the search
template<class Range, class Out>
struct search_data {
// is any there clearer way to get iterator that might be either
// a Range::const_iterator or const T*?
using iterator = decltype(std::cbegin(std::declval<Range&>()));
iterator curr;
const iterator begin, end;
Out out;
};
template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
return search_data<Range, Out>{
std::begin(range),
std::begin(range),
std::end(range),
out,
};
}
template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
auto search_data1 = init_search_data(in1, out1);
auto search_data2 = init_search_data(in2, out2);
// initial order is arbitrary
auto lesser = &search_data1;
auto greater = &search_data2;
// if either range is exhausted, we are finished
while(lesser->curr != lesser->end
&& greater->curr != greater->end) {
// difference of first values in each range
auto delta = *greater->curr - *lesser->curr;
if(!delta) { // matching value was found
// store both results and increment the iterators
*lesser->out++ = std::distance(lesser->begin, lesser->curr++);
*greater->out++ = std::distance(greater->begin, greater->curr++);
continue; // then start a new iteraton
}
if(delta < 0) { // set the order of ranges by their first value
std::swap(lesser, greater);
delta = -delta; // delta is always positive after this
}
// next crossing cannot be farther than the delta
// this assumption has following pre-requisites:
// range is sorted, values are integers, values in the range are unique
auto range_left = std::distance(lesser->curr, lesser->end);
auto upper_limit =
std::min(range_left, static_cast<decltype(range_left)>(delta));
// exponential search for a sub range where the value at upper bound
// is greater than target, and value at lower bound is lesser
auto target = *greater->curr;
auto lower = lesser->curr;
auto upper = std::next(lower, upper_limit);
for(int i = 1; i < upper_limit; i *= 2) {
auto guess = std::next(lower, i);
if(*guess >= target) {
upper = guess;
break;
}
lower = guess;
}
// skip all values in lesser,
// that are less than the least value in greater
lesser->curr = std::lower_bound(lower, upper, target);
}
}
#include <iostream>
#include <vector>
int main() {
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34};
std::vector<std::size_t> indices1;
std::vector<std::size_t> indices2;
match_indices(array1, array2,
std::back_inserter(indices1),
std::back_inserter(indices2));
std::cout << "indices in array1: ";
for(std::vector<int>::size_type i : indices1)
std::cout << i << ' ';
std::cout << "\nindices in array2: ";
for(std::vector<int>::size_type i : indices2)
std::cout << i << ' ';
std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distubtions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better than O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14 an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u, (say, u=500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, its less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0-2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other simply than the range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to be able to retrieve the index quickly or if it helps just be able to tell if the index exists in the other array quickly. Whether or not a value exists at a particular index requires only one bit, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off 5 LSBs and use that as a bit position, then shift the remaining 27 bits 5 places right and use that as an array index).
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers) then use that as a lookup-table to into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use 16 LSBs as indices into a lookup-table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of year ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.

Is std::sort the best choice to do in-place sort for a huge array with limited integer value?

I want to sort an array with huge(millions or even billions) elements, while the values are integers within a small range(1 to 100 or 1 to 1000), in such a case, is std::sort and the parallelized version __gnu_parallel::sort the best choice for me?
actually I want to sort a vecotor of my own class with an integer member representing the processor index.
as there are other member inside the class, so, even if two data have same integer member that is used for comparing, they might not be regarded as same data.
Counting sort would be the right choice if you know that your range is so limited. If the range is [0,m) the most efficient way to do so it have a vector in which the index represent the element and the value the count. For example:
vector<int> to_sort;
vector<int> counts;
for (int i : to_sort) {
if (counts.size() < i) {
counts.resize(i+1, 0);
}
counts[i]++;
}
Note that the count at i is lazily initialized but you can resize once if you know m.
If you are sorting objects by some field and they are all distinct, you can modify the above as:
vector<T> to_sort;
vector<vector<const T*>> count_sorted;
for (const T& t : to_sort) {
const int i = t.sort_field()
if (count_sorted.size() < i) {
count_sorted.resize(i+1, {});
}
count_sorted[i].push_back(&t);
}
Now the main difference is that your space requirements grow substantially because you need to store the vectors of pointers. The space complexity went from O(m) to O(n). Time complexity is the same. Note that the algorithm is stable. The code above assumes that to_sort is in scope during the life cycle of count_sorted. If your Ts implement move semantics you can store the object themselves and move them in. If you need count_sorted to outlive to_sort you will need to do so or make copies.
If you have a range of type [-l, m), the substance does not change much, but your index now represents the value i + l and you need to know l beforehand.
Finally, it should be trivial to simulate an iteration through the sorted array by iterating through the counts array taking into account the value of the count. If you want stl like iterators you might need a custom data structure that encapsulates that behavior.
Note: in the previous version of this answer I mentioned multiset as a way to use a data structure to count sort. This would be efficient in some java implementations (I believe the Guava implementation would be efficient) but not in C++ where the keys in the RB tree are just repeated many times.
You say "in-place", I therefore assume that you don't want to use O(n) extra memory.
First, count the number of objects with each value (as in Gionvanni's and ronaldo's answers). You still need to get the objects into the right locations in-place. I think the following works, but I haven't implemented or tested it:
Create a cumulative sum from your counts, so that you know what index each object needs to go to. For example, if the counts are 1: 3, 2: 5, 3: 7, then the cumulative sums are 1: 0, 2: 3, 3: 8, 4: 15, meaning that the first object with value 1 in the final array will be at index 0, the first object with value 2 will be at index 3, and so on.
The basic idea now is to go through the vector, starting from the beginning. Get the element's processor index, and look up the corresponding cumulative sum. This is where you want it to be. If it's already in that location, move on to the next element of the vector and increment the cumulative sum (so that the next object with that value goes in the next position along). If it's not already in the right location, swap it with the correct location, increment the cumulative sum, and then continue the process for the element you swapped into this position in the vector.
There's a potential problem when you reach the start of a block of elements that have already been moved into place. You can solve that by remembering the original cumulative sums, "noticing" when you reach one, and jump ahead to the current cumulative sum for that value, so that you don't revisit any elements that you've already swapped into place. There might be a cleverer way to deal with this, but I don't know it.
Finally, compare the performance (and correctness!) of your code against std::sort. This has better time complexity than std::sort, but that doesn't mean it's necessarily faster for your actual data.
You definitely want to use counting sort. But not the one you're thinking of. Its main selling point is that its time complexity is O(N+X) where X is the maximum value you allow the sorting of.
Regular old counting sort (as seen on some other answers) can only sort integers, or has to be implemented with a multiset or some other data structure (becoming O(Nlog(N))). But a more general version of counting sort can be used to sort (in place) anything that can provide an integer key, which is perfectly suited to your use case.
The algorithm is somewhat different though, and it's also known as American Flag Sort. Just like regular counting sort, it starts off by calculating the counts.
After that, it builds a prefix sums array of the counts. This is so that we can know how many elements should be placed behind a particular item, thus allowing us to index into the right place in constant time.
since we know the correct final position of the items, we can just swap them into place. And doing just that would work if there weren't any repetitions but, since it's almost certain that there will be repetitions, we have to be more careful.
First: when we put something into its place we have to increment the value in the prefix sum so that the next element with same value doesn't remove the previous element from its place.
Second: either
keep track of how many elements of each value we have already put into place so that we dont keep moving elements of values that have already reached their place, this requires a second copy of the counts array (prior to calculating the prefix sum), as well as a "move count" array.
keep a copy of the prefix sums shifted over by one so that we stop moving elements once the stored position of the latest element
reaches the first position of the next value.
Even though the first approach is somewhat more intuitive, I chose the second method (because it's faster and uses less memory).
template<class It, class KeyOf>
void countsort (It begin, It end, KeyOf key_of) {
constexpr int max_value = 1000;
int final_destination[max_value] = {}; // zero initialized
int destination[max_value] = {}; // zero initialized
// Record counts
for (It it = begin; it != end; ++it)
final_destination[key_of(*it)]++;
// Build prefix sum of counts
for (int i = 1; i < max_value; ++i) {
final_destination[i] += final_destination[i-1];
destination[i] = final_destination[i-1];
}
for (auto it = begin; it != end; ++it) {
auto key = key_of(*it);
// while item is not in the correct position
while ( std::distance(begin, it) != destination[key] &&
// and not all items of this value have reached their final position
final_destination[key] != destination[key] ) {
// swap into the right place
std::iter_swap(it, begin + destination[key]);
// tidy up for next iteration
++destination[key];
key = key_of(*it);
}
}
}
Usage:
vector<Person> records = populateRecords();
countsort(records.begin(), records.end(), [](Person const &){
return Person.id()-1; // map [1, 1000] -> [0, 1000)
});
This can be further generalized to become MSD Radix Sort,
here's a talk by Malte Skarupke about it: https://www.youtube.com/watch?v=zqs87a_7zxw
Here's a neat visualization of the algorithm: https://www.youtube.com/watch?v=k1XkZ5ANO64
The answer given by Giovanni Botta is perfect, and Counting Sort is definitely the way to go. However, I personally prefer not to go resizing the vector progressively, but I'd rather do it this way (assuming your range is [0-1000]):
vector<int> to_sort;
vector<int> counts(1001);
int maxvalue=0;
for (int i : to_sort) {
if(i > maxvalue) maxvalue = i;
counts[i]++;
}
counts.resize(maxvalue+1);
It is essentially the same, but no need to be constantly managing the size of the counts vector. Depending on your memory constraints, you could use one solution or the other.

An fast algorithm for sorting and shuffling equal valued entries (preferably by STL's)

I'm currently developing stochastic optimization algorithms and have encountered the following issue (which I imagine appears also in other places): It could be called totally unstable partial sort:
Given a container of size n and a comparator, such that entries may be equally valued.
Return the best k entries, but if values are equal, it should be (nearly) equally probable to receive any of them.
(output order is irrelevant to me, i.e. equal values completely among the best k need not be shuffled. To even have all equal values shuffled is however a related, interesting question and would suffice!)
A very (!) inefficient way would be to use shuffle_randomly and then partial_sort, but one actually only needs to shuffle the block of equally valued entries "at the selection border" (resp. all blocks of equally valued entries, both is much faster). Maybe that Observation is where to start...
I would very much prefer, if someone could provide a solution with STL algorithms (or at least to a large portion), both because they're usually very fast, well encapsulated and OMP-parallelized.
Thanx in advance for any ideas!
You want to partial_sort first. Then, while elements are not equal, return them. If you meet a sequence of equal elements which is larger than the remaining k, shuffle and return first k. Else return all and continue.
Not fully understanding your issue, but if you it were me solving this issue (if I am reading it correctly) ...
Since it appears you will have to traverse the given object anyway, you might as well build a copy of it for your results, sort it upon insert, and randomize your "equal" items as you insert.
In other words, copy the items from the given container into an STL list but overload the comparison operator to create a B-Tree, and if two items are equal on insert randomly choose to insert it before or after the current item.
This way it's optimally traversed (since it's a tree) and you get the random order of the items that are equal each time the list is built.
It's double the memory, but I was reading this as you didn't want to alter the original list. If you don't care about losing the original, delete each item from the original as you insert into your new list. The worst traversal will be the first time you call your function since the passed in list might be unsorted. But since you are replacing the list with your sorted copy, future runs should be much faster and you can pick a better pivot point for your tree by assigning the root node as the element at length() / 2.
Hope this is helpful, sounds like a neat project. :)
If you really mean that output order is irrelevant, then you want std::nth_element, rather than std::partial_sort, since it is generally somewhat faster. Note that std::nth_element puts the nth element in the right position, so you can do the following, which is 100% standard algorithm invocations (warning: not tested very well; fencepost error possibilities abound):
template<typename RandomIterator, typename Compare>
void best_n(RandomIterator first,
RandomIterator nth,
RandomIterator limit,
Compare cmp) {
using ref = typename std::iterator_traits<RandomIterator>::reference;
std::nth_element(first, nth, limit, cmp);
auto p = std::partition(first, nth, [&](ref a){return cmp(a, *nth);});
auto q = std::partition(nth + 1, limit, [&](ref a){return !cmp(*nth, a);});
std::random_shuffle(p, q); // See note
}
The function takes three iterators, like nth_element, where nth is an iterator to the nth element, which means that it is begin() + (n - 1)).
Edit: Note that this is different from most STL algorithms, in that it is effectively an inclusive range. In particular, it is UB if nth == limit, since it is required that *nth be valid. Furthermore, there is no way to request the best 0 elements, just as there is no way to ask for the 0th element with std::nth_element. You might prefer it with a different interface; do feel free to do so.
Or you might call it like this, after requiring that 0 < k <= n:
best_n(container.begin(), container.begin()+(k-1), container.end(), cmp);
It first uses nth_element to put the "best" k elements in positions 0..k-1, guaranteeing that the kth element (or one of them, anyway) is at position k-1. It then repartitions the elements preceding position k-1 so that the equal elements are at the end, and the elements following position k-1 so that the equal elements are at the beginning. Finally, it shuffles the equal elements.
nth_element is O(n); the two partition operations sum up to O(n); and random_shuffle is O(r) where r is the number of equal elements shuffled. I think that all sums up to O(n) so it's optimally scalable, but it may or may not be the fastest solution.
Note: You should use std::shuffle instead of std::random_shuffle, passing a uniform random number generator through to best_n. But I was too lazy to write all the boilerplate to do that and test it. Sorry.
If you don't mind sorting the whole list, there is a simple answer. Randomize the result in your comparator for equivalent elements.
std::sort(validLocations.begin(), validLocations.end(),
[&](const Point& i_point1, const Point& i_point2)
{
if (i_point1.mX == i_point2.mX)
{
return Rand(1.0f) < 0.5;
}
else
{
return i_point1.mX < i_point2.mX;
}
});

Hashing an unordered sequence of small integers

Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
it is of length 12;
the order of the sequence elements does not matter;
no element appears twice in the same sequence;
all elements are smaller than about 300.
Note that the properties 2. and 3. imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/
Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to get combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
std::size_t array_hash(int (&array)[12])
{
int a[12];
std::copy(array, array + 12, a);
std::sort(a, a + 12);
std::size_t result = 0;
for (int * p = a; p != a + 12; ++p)
{
std::size_t const h = *p; // the "identity hash"
result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
}
return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you an extra 12!, which is about 29 bits, so it doesn't really make a difference.
You can either look for a 128 bit unsigned integral type and store the sorted, packed set of integers in that directly. Or you can split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
I would just use the sum function as the hash and see how far you come with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it’s blazingly fast.
std::size_t hash(int (&arr)[12]) {
return std::accumulate(arr, arr + 12, 0);
}
Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea how a sorting network for 12 values would look like or even if it’s practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
struct pow2 {
int operator ()(int n) const { return n * n; }
};
std::size_t hash(int (&arr)[12]) {
int raised[12];
std::transform(arr, arr + 12, raised, pow2());
return std::accumulate(raised, raised + 12, 0);
}
You could toggle bits, corresponding to each of the 12 integers, in the bitset of size 300. Then use formula from boost::hash_combine to combine ten 32-bit integers, implementing this bitset.
This gives commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
This approach may be generalized if we choose arbitrary bitset size and if we set or toggle arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or using a pre-computed lookup table). Which results in a Bloom filter or related structures.
We can choose Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of large bit vector into a single hash value. In case of classical implementation of Bloom filter with size 32, optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of "or" operation of classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get a solution, mentioned by Jim Balter.
If, instead of "or" operation, we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution, similar to one, suggested by Konrad Rudolph.
I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.
import itertools
import random
import sys
def random_combination(iterable, r):
"Random selection from itertools.combinations(iterable, r)"
pool = tuple(iterable)
n = len(pool)
indices = sorted(random.sample(xrange(n), r))
return tuple(pool[i] for i in indices)
mask_size = 64
mask_size_over_2 = mask_size/2
nmasks = 300
suffix='UL'
print 'HashType mask[' + str(nmasks) + '] = {'
for i in range(nmasks):
combo = random_combination(xrange(mask_size),mask_size_over_2)
mask = 0;
for j in combo:
mask |= (1<<j);
if(i<nmasks-1):
print '\t' + str(mask) + suffix + ','
else:
print '\t' + str(mask) + suffix + ' };'
The C++ array generated by the script is used as follows:
typedef int_least64_t HashType;
const int maxTableSize = 300;
HashType mask[maxTableSize] = {
// generated array goes here
};
inline HashType xorrer(HashType const &l, HashType const &r) {
return l^mask[r];
}
HashType hashConfig(HashType *sequence, int n) {
return std::accumulate(sequence, sequence+n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (this, this with cubes and this with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.