Customizing compare in bsearch() - C++

I have an array of addresses that point to integers (these integers
are sorted in ascending order). The array contains duplicate values, e.g.: 1,
2, 2, 3, 3, 3, 3, 4, 4...
I am trying to get hold of all the values that are greater than a
certain value (the key). Currently I am trying to implement it using the
binary search function -
void *bsearch(
const void *key,
const void *base,
size_t num,
size_t width,
int ( __cdecl *compare ) ( const void *, const void *)
);
I am not able to achieve this completely; I only get some of the values.
Would there be any other way to get hold of all the matching values of the
array, without changing the algorithm I am using?

As Klatchko and GMan have noted, the STL gives you exactly what you're asking for: std::upper_bound.
If you need to stick with bsearch, though, the simplest solution may be to iterate forwards until you reach a new value.
// bsearch returns void*, which cannot be advanced directly;
// treat the array as bytes and step by width instead
char* p = static_cast<char*>(bsearch(key, base, num, width, compare));
const char* end = static_cast<const char*>(base) + num * width; // one past the last element
while (p != NULL && p != end      // stop at the end of the array
       && compare(key, p) == 0) { // while p points to an element matching the key
    p += width;                   // advance p by one element
}
If you want to get the first p that matches key, rather than the first one that's larger, just use p -= width instead of p += width (and compare against the start of the array rather than the end).
Whether you prefer this or a repeated binary search, as Michael suggests, depends on the size of the array and how many repetitions you expect.
Now, your question title refers to customizing the compare function, but as I understand the question that won't help you here - the compare function must compare any two equivalent objects as being equivalent, so it's no good for recognizing which of several equivalent objects is the first/last of its type in an array. Unless you had a different problem, specifically concerning the compare function?

You should look into std::upper_bound
For example, to find the address of the first value > 3:
#include <algorithm> // std::upper_bound

const int data[] = { 1, 2, 2, 3, 3, 3, 3, 4, 4, ... };
size_t data_count = sizeof(data) / sizeof(*data);
const int *ptr = std::upper_bound(data, data + data_count, 3);
// ptr now points to the first element greater than 3 (the first 4)
A related function is std::lower_bound.
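If you need both ends of the run of matching elements, here is a minimal sketch using std::equal_range, which combines the two calls:
#include <algorithm>
#include <iostream>

int main() {
    const int data[] = {1, 2, 2, 3, 3, 3, 3, 4, 4};
    // equal_range returns the pair {lower_bound, upper_bound}
    auto range = std::equal_range(data, data + 9, 3);
    std::cout << "matching elements: " << (range.second - range.first) << '\n'; // prints 4
    // everything from range.second onward is strictly greater than 3
}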

Yes, you can use a binary search. The trick is what you do when you find an element with the given key... unless your lower and upper indices are the same, you need to continue binary searching in the left part of your interval... that is, you should move the upper bound to be the current midpoint. That way, when your binary search terminates, you will have found the first such element. Then just iterate over the rest.
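A minimal sketch of that idea for a plain int array (the names are illustrative, not from the answer):
#include <cstddef>

// Leftmost binary search: returns a pointer to the first element equal
// to key, or nullptr if key is not present. Search interval is [lo, hi).
const int* first_equal(const int* a, std::size_t n, int key) {
    std::size_t lo = 0, hi = n;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;  // first match must be to the right of mid
        else
            hi = mid;      // a[mid] >= key: move the upper bound down to mid
    }
    return (lo < n && a[lo] == key) ? a + lo : nullptr;
}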

If you have a binary search tree implemented, you can use tree traversal algorithms to do this: reach the required 'upper-bound' node and simply traverse in-order from there. Traversal is cheaper than searching the tree multiple times, i.e., traversing a tree of n nodes takes at most O(n) operations, whereas searching n times would take O(n log n) operations.
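As a sketch of that idea (Node is a hypothetical tree type, not from the answer), visiting every value >= key in ascending order:
struct Node { int value; Node* left; Node* right; };

// Visit all values >= key in ascending order; cost is proportional to
// the number of visited nodes plus the height of the tree.
template<class Visit>
void visit_from(const Node* n, int key, Visit visit) {
    if (!n) return;
    if (n->value >= key) {
        visit_from(n->left, key, visit);  // smaller values first
        visit(n->value);
        visit_from(n->right, key, visit);
    } else {
        // the whole left subtree is smaller still; skip it
        visit_from(n->right, key, visit);
    }
}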

Related

Sort merged array consisting of sorted arrays

I have two sorted arrays, e.g. (3,4,5) and (1,3,7,8), and I have the combined array (3,4,5,1,3,7,8).
Now I would like to sort the already-combined array, without splitting it, but by overwriting it, making use of the fact that it consists of 2 arrays which had already been sorted. Is there any way of doing this efficiently? I know there are a lot of threads about how to do this by iterating through the sorted arrays and then putting the values into a new array accordingly, but I haven't seen this type of question anywhere yet. I would like to do this in C, but any help / pseudocode would be kindly appreciated. Thanks!
Edit: The function which would do the sorting would only be given the combined array and (maybe) the lengths of the other two arrays if needed.
If you already have the original sorted arrays, the combined array (note it is not sorted) doesn't really help, except in that your destination storage is already allocated.
There's a well-known and very simple algorithm for merging two sorted ranges, but you can just use std::merge instead of coding it yourself.
Note that std::merge only works for non-overlapping input and output ranges; for your amended question, use std::inplace_merge, with the middle iterator set to the first element from your second sequence:
#include <algorithm> // std::inplace_merge

void sort_combined(int *array, size_t total, size_t first) {
    std::inplace_merge(array, array + first, array + total);
}

// and use it like
int combined[] = {3, 4, 5, 1, 3, 7, 8};
const size_t first = 3;   // length of the first sorted run
const size_t second = 4;  // length of the second run (== total - first, unused)
const size_t total = 7;   // == sizeof(combined)/sizeof(*combined)
sort_combined(combined, total, first);

Most efficient way to find index of matching values in two sorted arrays using C++

I currently have a solution to this problem, but I feel it's not as efficient as it could be, so I want to see if there is a faster method.
I have two arrays (std::vectors, for example). Both arrays contain only unique integer values that are sorted but sparse in value, e.g. 1, 4, 12, 13... What I want to ask is: is there a fast way to find the INDEX into one of the arrays where the values are the same? For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value's index is 1 in array2. The index into the array is what is important, as I have other arrays that contain data that will use this matching index.
I am not confined to using arrays; maps are possible too. I am only comparing the two arrays once; they will not be reused again after the first matching pass. There can be a small to large number of values (300,000+) in either array, but the arrays DO NOT always have the same number of values (that would make things much easier).
The worst case is a linear search per element, which is O(N^2) overall. Using a map would get me O(log N) per lookup, but I would still have to convert an array into a map of (value, index) pairs first.
What I currently have, which avoids any container-type conversions, is this: loop over the smaller of the two arrays. Compare the current element of the small array (array1) with the current element of the large array (array2). If array1's element value is larger than array2's element value, increment the index for array2 until it is no longer larger than array1's element value (a while loop). Then, if array1's element value is smaller than array2's element, go to the next loop iteration and begin again. Otherwise they must be equal, and I have my index into either array of the matching value.
So in this loop I am at best O(N) if all values have matches, and at worst O(2N) if none match. So I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would lean toward the arrays mostly having matches.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};
for (unsigned int i = 0, z = 0; i < array1.size(); i++)
{
    int value1 = array1[i];
    // check the bounds before reading array2[z]
    while (z < array2.size() && value1 > array2[z])
        z++;
    if (z >= array2.size())
        break; // reached end of array2
    if (value1 < array2[z])
        continue;
    // we have a match: indices i and z hold the same value
}
Result will be the matching indexes for array1 = [1,3] and for array2 = [2,3].
I wrote an implementation of this function using an algorithm that performs better on sparse distributions than the trivial linear merge.
For distributions that are similar†, it has O(n) complexity, but for ranges where the distributions are greatly different, it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case isn't better than O(n log n). On the other hand, I haven't been able to find that worst case either.
I templated it so that any type of ranges can be used, such as sub-ranges or raw arrays. Technically it works with non-random-access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distributions, I mean that the pair of arrays has many crossings. A crossing is a point where you would switch from one array to the other if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>

// helper structure for the search
template<class Range, class Out>
struct search_data {
    // is there any clearer way to get an iterator that might be either
    // a Range::const_iterator or const T*?
    using iterator = decltype(std::cbegin(std::declval<Range&>()));
    iterator curr;
    const iterator begin, end;
    Out out;
};

template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
    return search_data<Range, Out>{
        std::begin(range),
        std::begin(range),
        std::end(range),
        out,
    };
}

template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
    auto search_data1 = init_search_data(in1, out1);
    auto search_data2 = init_search_data(in2, out2);

    // initial order is arbitrary
    auto lesser = &search_data1;
    auto greater = &search_data2;

    // if either range is exhausted, we are finished
    while(lesser->curr != lesser->end
          && greater->curr != greater->end) {
        // difference of first values in each range
        auto delta = *greater->curr - *lesser->curr;
        if(!delta) { // matching value was found
            // store both results and increment the iterators
            *lesser->out++ = std::distance(lesser->begin, lesser->curr++);
            *greater->out++ = std::distance(greater->begin, greater->curr++);
            continue; // then start a new iteration
        }
        if(delta < 0) { // set the order of ranges by their first value
            std::swap(lesser, greater);
            delta = -delta; // delta is always positive after this
        }
        // next crossing cannot be farther than the delta
        // this assumption has the following pre-requisites:
        // range is sorted, values are integers, values in the range are unique
        auto range_left = std::distance(lesser->curr, lesser->end);
        auto upper_limit =
            std::min(range_left, static_cast<decltype(range_left)>(delta));

        // exponential search for a sub range where the value at upper bound
        // is greater than target, and value at lower bound is lesser
        auto target = *greater->curr;
        auto lower = lesser->curr;
        auto upper = std::next(lower, upper_limit);
        // each guess is kept strictly inside the remaining range so that
        // the dereference below is always valid
        for(auto i = static_cast<decltype(upper_limit)>(1);
            std::distance(lesser->curr, lower) + i < upper_limit; i *= 2) {
            auto guess = std::next(lower, i);
            if(*guess >= target) {
                upper = guess;
                break;
            }
            lower = guess;
        }

        // skip all values in lesser
        // that are less than the least value in greater
        lesser->curr = std::lower_bound(lower, upper, target);
    }
}

#include <iostream>
#include <vector>

int main() {
    std::vector<int> array1 = {4,6,12,34};
    std::vector<int> array2 = {1,3,6,34};
    std::vector<std::size_t> indices1;
    std::vector<std::size_t> indices2;

    match_indices(array1, array2,
                  std::back_inserter(indices1),
                  std::back_inserter(indices2));

    std::cout << "indices in array1: ";
    for(std::size_t i : indices1)
        std::cout << i << ' ';
    std::cout << "\nindices in array2: ";
    for(std::size_t i : indices2)
        std::cout << i << ' ';
    std::cout << std::endl;
}
#include <iostream>
#include <vector>
int main() {
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34};
std::vector<std::size_t> indices1;
std::vector<std::size_t> indices2;
match_indices(array1, array2,
std::back_inserter(indices1),
std::back_inserter(indices2));
std::cout << "indices in array1: ";
for(std::vector<int>::size_type i : indices1)
std::cout << i << ' ';
std::cout << "\nindices in array2: ";
for(std::vector<int>::size_type i : indices2)
std::cout << i << ' ';
std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distributions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better-than-O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14, an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
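A minimal sketch of that merge step (illustrative names), returning the first pair of matching indices or {-1, -1}:
#include <cstddef>
#include <utility>
#include <vector>

std::pair<int, int> first_match(const std::vector<int>& a,
                                const std::vector<int>& b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j])
            return { (int)i, (int)j };   // match found
        if (a[i] < b[j]) ++i; else ++j;  // discard the lower head element
    }
    return { -1, -1 };                   // one array exhausted: no match
}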
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u, (say, u=500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, its less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0 to 2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other than simply their range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to retrieve the index quickly, or whether it helps just to be able to tell quickly whether the value exists in the other array. Recording whether a value exists requires only one bit per possible value, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off the 5 LSBs and use them as a bit position, then shift the value 5 places right and use the remaining 27 bits as an array index).
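A sketch of that bitmap idea, assuming non-negative values with a known upper limit (the names are illustrative):
#include <cstdint>
#include <vector>

struct Bitmap {
    std::vector<std::uint32_t> words;
    explicit Bitmap(std::uint32_t max_value) : words(max_value / 32 + 1, 0) {}
    // 5 LSBs select the bit, the remaining bits select the word
    void set(std::uint32_t v)            { words[v >> 5] |= 1u << (v & 31); }
    bool contains(std::uint32_t v) const { return (words[v >> 5] >> (v & 31)) & 1u; }
};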
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers), then use that as a lookup table into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use 16 LSBs as indices into a lookup table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of years ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit, or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.

Find the element in one array, which has the minimal absolute value

I am trying to find the element in an array which has the minimal absolute value. For example, in the array [5.1, -2.2, 8.2, -1, 4, 3, -5, 6], I want to get the value -1. I use the following code (myarray is a 1D array and is not sorted):
for (int i = 1; i < 8; ++i)
{
    if (fabsf(myarray[i]) < fabsf(myarray[0])) myarray[0] = myarray[i];
}
Then, the target value is in myarray[0].
Because I have to repeat this procedure many times, this piece of code becomes the bottleneck in my program. Does anyone know how to improve this code? Thanks in advance!
BTW, the size of the array is always eight. Could this be used to optimize this code?
Update: so far, the following code works slightly better on my machine:
float absMin = fabsf(myarray[0]); int index = 0;
for (int i = 1; i < 8; ++i)
{
    if (fabsf(myarray[i]) < absMin) { absMin = fabsf(myarray[i]); index = i; }
}
float result = myarray[index];
I am wondering how to avoid fabsf, because I just want to compare the absolute values instead of computing them. Does anyone have an idea?
There are some urban myths like inlining, loop unrolling by hand and similar which are supposed to make your code faster. Good news is you don't have to do it, at least if you use -O3 compiler optimization.
Bad news is, if you already use -O3 there is nothing you can do to speed up this function: the compiler will optimize the hell out of your code! For example it will surely do the caching of fabsf(myarray[0]) as some suggested. The only thing you can achieve with this "refactoring" is to build bugs into your program and make it less readable.
My advice is to look somewhere else for improvements:
try to reduce the number of invocations of this code
if this code is the bottleneck, then my guess would be that you recalculate the minimal value over and over again (otherwise filling the values into the array would take approximately the same time) - so cache the result of the search
shift costs to changing the elements of the array, for example by using some fancy data structures (heaps, priority_queue) or by tracking the minimum as elements change (see the sketch after this list). Let's say your array has only two elements, values [1,2], so the minimum is 1. Now if you change
2 to 3, you don't have to do anything
2 to 0, you can easily update your minimum to 0
1 to 3, you have to loop through all elements. But maybe this case is not that often.
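A sketch of that tracking idea for the fixed-size case (an illustrative class, not from the answer):
#include <cmath>

struct TrackedArray {
    float data[8] = {};
    int min_idx = 0; // index of the current minimum |value|

    void set(int i, float v) {
        data[i] = v;
        if (fabsf(v) <= fabsf(data[min_idx]))
            min_idx = i;   // cheap case: the new value is the new minimum
        else if (i == min_idx)
            rescan();      // expensive case: the old minimum was overwritten
    }
    void rescan() {
        min_idx = 0;
        for (int i = 1; i < 8; ++i)
            if (fabsf(data[i]) < fabsf(data[min_idx])) min_idx = i;
    }
    float min_abs_value() const { return data[min_idx]; }
};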
Can you store the values with fabsf already applied ("pre-fabbed")?
Also, as @Gerstrong mentions, storing the minimum outside the loop and only recalculating it when the array changes will give you a boost.
Calling std::partial_sort or std::nth_element will rearrange the array only as far as needed so that the correct value ends up in the right location:
#include <algorithm> // std::nth_element
#include <cmath>     // fabsf

std::nth_element(v.begin(), v.begin(), v.end(), [](float lhs, float rhs){
    return fabsf(lhs) < fabsf(rhs);
});
// v[0] now holds the element with the minimal absolute value
Let me give some ideas that could help:
float minVal = fabsf(myarray[0]);
int minIdx = 0;
for (int i = 1; i < 8; ++i)
{
    float absVal = fabsf(myarray[i]);
    if (absVal < minVal) { minVal = absVal; minIdx = i; }
}
// keep the signed element (e.g. -1 rather than 1)
myarray[0] = myarray[minIdx];
But compilers nowadays are very smart and you might not get any more speed, as you already get optimized code. It depends on how your mentioned piece of code is called.
Another way to optimize this may be to use C++ and the STL, with the typical binary search tree std::set:
#include <cmath> // fabsf
#include <set>

// Absolute-value comparator for std::set
struct absless_compare
{
    bool operator()(const float a, const float b) const
    {
        return fabsf(a) < fabsf(b);
    }
};

std::set<float, absless_compare> mySet = {5.1, -2.2, 8.2, -1, 4, 3, -5, 6};
const float minVal = *(mySet.begin());
With this approach, by inserting your numbers they are kept sorted in ascending order of absolute value. std::less is the default comparator for std::set, but you can change it to something different like in this example. This might help on larger datasets, but you mentioned you only have eight values to compare, so it really will not help.
Eight elements is a very small number, which might be kept on the stack with, for example, a declaration of std::array<float,8> myarray close to your searching function before filling it with data. You should test these variants on your full codebase and observe what helps. Of course, whether you declare std::array<float,8> myarray or float myarray[8], you should get the same results.
What you could also check is whether fabsf really takes float as its parameter and does not convert your variable to double, which would degrade the performance. There is also std::abs(), which to my understanding deduces the data type through its overloads and templates.
If you don't want to use fabsf, an obvious alternative is a call like this:
float myAbs(const float val)
{
    return (val < 0) ? -val : val;
}
or you can mask the sign bit to zero, which is what makes the number negative. Either way, I'm pretty sure fabsf is fully aware of these tricks, and I don't think code like that will make it faster.
So I would check whether the argument is converted to double. If you have the C99 standard on your system, though, you should not have that issue.
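For reference, the sign-bit trick might look like this (a sketch that assumes IEEE-754 float; whether it beats fabsf is doubtful for the reasons above):
#include <cstdint>
#include <cstring>

float abs_by_mask(float val) {
    std::uint32_t bits;
    std::memcpy(&bits, &val, sizeof bits); // type-pun without undefined behavior
    bits &= 0x7FFFFFFFu;                   // clear the sign bit
    std::memcpy(&val, &bits, sizeof val);
    return val;
}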
One thought would be to do your comparisons "tournament" style, instead of linearly. In other words, you first compare 1 with 2, 3 with 4, etc. Then you take those 4 elements and do the same thing, and then again, until you only have one element left.
This does not change the number of comparisons. Since each comparison eliminates one element from the running, you will have exactly 7 comparisons no matter what. So why do I suggest this? Because it removes data dependencies from your code. Modern processors have multiple pipelines and can retire multiple instructions simultaneously. However, when you do the comparisons in a loop, each loop iteration depends on the previous one. When you do it tournament style, the first four comparisons are completely independent, so the processor may be able to do them all at once.
In addition to doing that, you can compute all the fabs at once in a trivial loop and put it in a new array. Since the fabs computations are independent, this can get sped up pretty easily. You would do this first, and then the tournament style comparisons to get the index. It should be exactly the same number of operations, it's just changing the order around so that the compiler can more easily see larger blocks that lack data dependencies.
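A sketch of the tournament for exactly eight elements, with the fabs values computed up front as suggested (illustrative code, not from the answer):
#include <cmath>

int min_abs_index(const float a[8]) {
    float m[8];
    for (int i = 0; i < 8; ++i) m[i] = fabsf(a[i]); // independent computations
    // round 1: four independent comparisons
    int i01 = m[0] < m[1] ? 0 : 1;
    int i23 = m[2] < m[3] ? 2 : 3;
    int i45 = m[4] < m[5] ? 4 : 5;
    int i67 = m[6] < m[7] ? 6 : 7;
    // round 2: two independent comparisons
    int i03 = m[i01] < m[i23] ? i01 : i23;
    int i47 = m[i45] < m[i67] ? i45 : i67;
    // still 7 comparisons in total, but only 3 sequential steps
    return m[i03] < m[i47] ? i03 : i47;
}
The signed result is then myarray[min_abs_index(myarray)].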
The element of an array with minimal absolute value
Let the array be A (an Eigen vector):
A = [5.1, -2.2, 8.2, -1, 4, 3, -5, 6]
The minimal absolute value of A is
double miniAbsValue = A.array().abs().minCoeff();
int i_minimum = 0; // to find the position of the minimum absolute value
for (int i = 0; i < 8; i++)
{
    double ftn = A(i);
    if (fabs(ftn) == miniAbsValue)
    {
        i_minimum = i;
    }
}
Now the element of A with minimal absolute value is
A(i_minimum)

Is std::sort the best choice to do in-place sort for a huge array with limited integer value?

I want to sort an array with a huge number (millions or even billions) of elements, while the values are integers within a small range (1 to 100 or 1 to 1000). In such a case, are std::sort and the parallelized version __gnu_parallel::sort the best choice for me?
Actually, I want to sort a vector of my own class with an integer member representing the processor index.
As there are other members inside the class, even if two items have the same integer member that is used for comparing, they might not be regarded as the same data.
Counting sort would be the right choice if you know that your range is so limited. If the range is [0,m) the most efficient way to do so it have a vector in which the index represent the element and the value the count. For example:
vector<int> to_sort;
vector<int> counts;
for (int i : to_sort) {
    if (counts.size() <= static_cast<size_t>(i)) {
        counts.resize(i + 1, 0);
    }
    counts[i]++;
}
Note that the count at i is lazily initialized but you can resize once if you know m.
If you are sorting objects by some field and they are all distinct, you can modify the above as:
vector<T> to_sort;
vector<vector<const T*>> count_sorted;
for (const T& t : to_sort) {
    const int i = t.sort_field();
    if (count_sorted.size() <= static_cast<size_t>(i)) {
        count_sorted.resize(i + 1, {});
    }
    count_sorted[i].push_back(&t);
}
Now the main difference is that your space requirements grow substantially because you need to store the vectors of pointers. The space complexity went from O(m) to O(n). Time complexity is the same. Note that the algorithm is stable. The code above assumes that to_sort is in scope during the life cycle of count_sorted. If your Ts implement move semantics you can store the object themselves and move them in. If you need count_sorted to outlive to_sort you will need to do so or make copies.
If you have a range of type [-l, m), the substance does not change much, but your index now represents the value i + l and you need to know l beforehand.
Finally, it should be trivial to simulate an iteration through the sorted array by iterating through the counts array taking into account the value of the count. If you want stl like iterators you might need a custom data structure that encapsulates that behavior.
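For instance, a minimal sketch of simulating that iteration directly over the counts vector built above:
#include <cstddef>
#include <iostream>
#include <vector>

// Print the sorted sequence straight from the counts
// (counts[value] holds how many times value occurred).
void print_sorted(const std::vector<int>& counts) {
    for (std::size_t value = 0; value < counts.size(); ++value)
        for (int k = 0; k < counts[value]; ++k)
            std::cout << value << ' ';
}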
Note: in the previous version of this answer I mentioned multiset as a way to use a data structure to count sort. This would be efficient in some java implementations (I believe the Guava implementation would be efficient) but not in C++ where the keys in the RB tree are just repeated many times.
You say "in-place", I therefore assume that you don't want to use O(n) extra memory.
First, count the number of objects with each value (as in Giovanni's and ronaldo's answers). You still need to get the objects into the right locations in-place. I think the following works, but I haven't implemented or tested it:
Create a cumulative sum from your counts, so that you know what index each object needs to go to. For example, if the counts are 1: 3, 2: 5, 3: 7, then the cumulative sums are 1: 0, 2: 3, 3: 8, 4: 15, meaning that the first object with value 1 in the final array will be at index 0, the first object with value 2 will be at index 3, and so on.
The basic idea now is to go through the vector, starting from the beginning. Get the element's processor index, and look up the corresponding cumulative sum. This is where you want it to be. If it's already in that location, move on to the next element of the vector and increment the cumulative sum (so that the next object with that value goes in the next position along). If it's not already in the right location, swap it with the correct location, increment the cumulative sum, and then continue the process for the element you swapped into this position in the vector.
There's a potential problem when you reach the start of a block of elements that have already been moved into place. You can solve that by remembering the original cumulative sums, "noticing" when you reach one, and jump ahead to the current cumulative sum for that value, so that you don't revisit any elements that you've already swapped into place. There might be a cleverer way to deal with this, but I don't know it.
Finally, compare the performance (and correctness!) of your code against std::sort. This has better time complexity than std::sort, but that doesn't mean it's necessarily faster for your actual data.
You definitely want to use counting sort. But not the one you're thinking of. Its main selling point is that its time complexity is O(N+X) where X is the maximum value you allow the sorting of.
Regular old counting sort (as seen in some other answers) can only sort integers, or has to be implemented with a multiset or some other data structure (becoming O(N log N)). But a more general version of counting sort can be used to sort (in place) anything that can provide an integer key, which is perfectly suited to your use case.
The algorithm is somewhat different though, and it's also known as American Flag Sort. Just like regular counting sort, it starts off by calculating the counts.
After that, it builds a prefix sums array of the counts. This is so that we can know how many elements should be placed behind a particular item, thus allowing us to index into the right place in constant time.
Since we know the correct final position of each item, we can just swap it into place. Doing just that would work if there weren't any repetitions, but since it's almost certain that there will be repetitions, we have to be more careful.
First: when we put something into its place we have to increment the value in the prefix sum so that the next element with same value doesn't remove the previous element from its place.
Second: either
keep track of how many elements of each value we have already put into place so that we dont keep moving elements of values that have already reached their place, this requires a second copy of the counts array (prior to calculating the prefix sum), as well as a "move count" array.
keep a copy of the prefix sums, shifted over by one, so that we stop moving elements once the stored position for the current value reaches the first position of the next value.
Even though the first approach is somewhat more intuitive, I chose the second method (because it's faster and uses less memory).
#include <algorithm> // std::iter_swap
#include <iterator>  // std::distance

template<class It, class KeyOf>
void countsort(It begin, It end, KeyOf key_of) {
    constexpr int max_value = 1000;
    int final_destination[max_value] = {}; // zero initialized
    int destination[max_value] = {};       // zero initialized
    // Record counts
    for (It it = begin; it != end; ++it)
        final_destination[key_of(*it)]++;
    // Build prefix sum of counts
    for (int i = 1; i < max_value; ++i) {
        final_destination[i] += final_destination[i-1];
        destination[i] = final_destination[i-1];
    }
    for (auto it = begin; it != end; ++it) {
        auto key = key_of(*it);
        // while item is not in the correct position
        while (std::distance(begin, it) != destination[key] &&
               // and not all items of this value have reached their final position
               final_destination[key] != destination[key]) {
            // swap into the right place
            std::iter_swap(it, begin + destination[key]);
            // tidy up for next iteration
            ++destination[key];
            key = key_of(*it);
        }
    }
}
Usage:
std::vector<Person> records = populateRecords();
countsort(records.begin(), records.end(), [](Person const& p){
    return p.id() - 1; // map [1, 1000] -> [0, 1000)
});
This can be further generalized to become MSD Radix Sort,
here's a talk by Malte Skarupke about it: https://www.youtube.com/watch?v=zqs87a_7zxw
Here's a neat visualization of the algorithm: https://www.youtube.com/watch?v=k1XkZ5ANO64
The answer given by Giovanni Botta is perfect, and Counting Sort is definitely the way to go. However, I personally prefer not to go resizing the vector progressively, but I'd rather do it this way (assuming your range is [0-1000]):
vector<int> to_sort;
vector<int> counts(1001);
int maxvalue = 0;
for (int i : to_sort) {
    if (i > maxvalue) maxvalue = i;
    counts[i]++;
}
counts.resize(maxvalue + 1);
It is essentially the same, but no need to be constantly managing the size of the counts vector. Depending on your memory constraints, you could use one solution or the other.

Map from integer ranges to arbitrary single integers

Working in C++ in a Linux environment, I have a situation where a number of integer ranges are defined, and integer inputs map to different arbitrary integers based on which range they fall into. None of the ranges overlap, and they aren't always contiguous.
The "simplest" way to solve this problem is with a bunch of if-statements for each range, but the number of ranges, their bounds, and the target values can all vary, so if-statements aren't maintainable.
For example, the ranges might be [0, 70], called r_a, [101, 150], call it r_b, and [201, 400], call it r_c. Inputs in r_a map to 1, in r_b map to 2, and r_c map to 3. Anything not in r_a, r_b, r_c maps to 0.
I can come up with a data structure & algorithm that stores tuples of (bounds, map target) and iterates through them, so finding the target value takes linear time in the number of bounds pairs. I can also imagine a scheme that keeps the pairs ordered and uses a binary-search-style algorithm against all the lower bounds (or upper bounds), finds the closest to the input, then compares against the opposing bound.
Is there a better way to accomplish the mapping than a binary-search based algorithm? Even better, is there some C++ library out that does this already?
The best approach here is indeed a binary search, but any efficient order-based search will do perfectly well. You don't really have to implement the search and the data structure explicitly. You can use it indirectly by employing a standard associative container instead.
Since your ranges don't overlap, the solution is very simple. You can immediately use a std::map for this problem to solve it in just a few lines of code.
For example, this is one possible approach. Let's assume that we are mapping an [ int, int ] range to an int value. Let's represent our ranges as closed-open ranges, i.e. if the original range is [0, 70], let's consider a [0, 71) range instead. Also, let's use the value of 0 as a "reserved" value that means "no mapping" (as you requested in your question)
const int EMPTY = 0;
All you need to do is to declare a map from int to int:
typedef std::map<int, int> Map;
Map map;
and fill it with each end of your closed-open ranges. The left (closed) end should be mapped to the desired value the entire range is mapped to, while the right (open) end should be mapped to our EMPTY value. For your example, it will look as follows
map[0] = 1;      // r_a = [0, 70] maps to 1
map[71] = EMPTY;
map[101] = 2;    // r_b = [101, 150] maps to 2
map[151] = EMPTY;
map[201] = 3;    // r_c = [201, 400] maps to 3
map[401] = EMPTY;
That's it for initialization.
Now, in order to determine where a given value of i maps to all you need to do is
Map::iterator it = map.upper_bound(i);
If it == map.begin(), then i is not in any range. Otherwise, do
--it;
If the it->second (for the decremented it) is EMPTY, then i is not in any range.
The combined "miss" check might look as follows
Map::iterator it = map.upper_bound(i);
if (it == map.begin() || (--it)->second == EMPTY)
/* Missed all ranges */;
Otherwise, it->second (for the decremented it) is your mapped value
int mapped_to = it->second;
Note that if the original ranges were "touching", as in [40, 60] and [61, 100], then the closed-open ranges will look as [40, 61) and [61, 101) meaning that the value of 61 will be mapped twice during map initialization. In this case it is important to make sure that the value of 61 is mapped to the proper destination value and not to the value of EMPTY. If you map the ranges as shown above in the left-to-right (i.e. increasing) order it will work correctly by itself.
Note, that only the endpoints of the ranges are inserted into the map, meaning that the memory consumption and the performance of the search depends only on the total number of ranges and completely independent of their total length.
If you wish, you can add a "guard" element to the map during the initialization
map[INT_MIN] = EMPTY;
(it corresponds to "negative infinity") and the "miss" check will become simpler
Map::iterator it = map.upper_bound(i);
assert(it != map.begin());
if ((--it)->second == EMPTY)
/* Missed all ranges */;
but that's just a matter of personal preference.
Of course, if you just want to return 0 for non-mapped values, you don't need to carry out any checking at all. Just take the it->second from the decremented iterator and you are done.
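Putting it all together, a minimal sketch of the whole approach for the ranges in the question:
#include <map>

const int EMPTY = 0;

int lookup(const std::map<int, int>& map, int i) {
    std::map<int, int>::const_iterator it = map.upper_bound(i);
    if (it == map.begin())
        return EMPTY;      // below the lowest range
    return (--it)->second; // EMPTY when i falls between ranges
}

int main() {
    std::map<int, int> map;
    map[0] = 1;   map[71] = EMPTY;  // [0, 70]    -> 1
    map[101] = 2; map[151] = EMPTY; // [101, 150] -> 2
    map[201] = 3; map[401] = EMPTY; // [201, 400] -> 3
    return lookup(map, 120); // returns 2
}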
I would use a very simple thing: a std::map.
class Range
{
public:
    explicit Range(int item) : mLow(item), mHigh(item) {} // [item,item]
    Range(int low, int high) : mLow(low), mHigh(high) {}  // [low,high]
    bool operator<(const Range& rhs) const
    {
        // One range is "less" than another only if it ends before the
        // other begins; since the ranges don't overlap this is a strict
        // weak ordering, and a one-element probe Range(item) compares
        // equivalent to the stored range that contains item.
        return mHigh < rhs.mLow;
    }
    int low() const { return mLow; }
    int high() const { return mHigh; }
private:
    int mLow;
    int mHigh;
}; // class Range
Then, let's have a map:
typedef std::map<Range, int> ranges_type;
And write a function that search in this map:
int find(int item, const ranges_type& ranges)
{
    // with the comparator above, this returns the first range
    // that does not end before item
    ranges_type::const_iterator it = ranges.lower_bound(Range(item));
    if (it != ranges.end() && it->first.low() <= item)
        return it->second;
    else
        return 0; // no mapping
}
Main benefits:
Will detect overlapping ranges during insertion into the map: an overlapping Range compares equivalent to an existing entry, so insert fails and returns false (you can assert on that, in debug mode only if you like)
Supports editing the Ranges on the fly
Finding is fast (binary search)
If the ranges are frozen (even if their values are not), you may wish to use Loki::AssocVector to reduce the memory overhead and improve performance a bit (basically, it's a sorted vector with the interface of a map).
Wouldn't a simple array be enough? You're not saying how many items you have, but by far the fastest data structure is a simple array.
If the ranges are:
0..9 --> 25
10..19 --> 42
Then the array would simply be like this:
[25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42]
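A sketch of building such a table for the ranges in the question:
#include <vector>

std::vector<int> build_table() {
    std::vector<int> table(401, 0);                 // anything unlisted maps to 0
    for (int v = 0;   v <= 70;  ++v) table[v] = 1;  // r_a
    for (int v = 101; v <= 150; ++v) table[v] = 2;  // r_b
    for (int v = 201; v <= 400; ++v) table[v] = 3;  // r_c
    return table;
}
// lookup: target = (0 <= v && v <= 400) ? table[v] : 0;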
You can have two sorted arrays: one for the lower bounds, one for the upper bounds. For a value v, take std::upper_bound on the lower-bounds array (the first range starting after v) and std::lower_bound on the upper-bounds array (the first range ending at or after v).
If the first result is exactly one position past the second, v is >= that range's lower bound and <= its upper bound, so return the range's index + 1. Otherwise, v is in between ranges, so return 0.
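A sketch of that lookup with the question's ranges (illustrative code):
#include <algorithm>
#include <cstddef>
#include <vector>

int find_target(int v) {
    static const std::vector<int> lows  = {0, 101, 201};
    static const std::vector<int> highs = {70, 150, 400};
    // first range starting after v, and first range ending at or after v
    std::ptrdiff_t a = std::upper_bound(lows.begin(),  lows.end(),  v) - lows.begin();
    std::ptrdiff_t b = std::lower_bound(highs.begin(), highs.end(), v) - highs.begin();
    // v is inside a range exactly when these land one position apart
    return (a == b + 1) ? (int)a : 0; // a is the range's index + 1
}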
The ideal is an interval tree (a specialized binary tree). Wikipedia describes the method completely, and better than I could. You won't get much more optimal than this without sacrificing space for performance.
You could, just about, store the destinations in an array and use the indices as the ranges. It's pretty easy, but ugly and not very maintainable. You'd need to initialize the array to 0, then for each range iterate over its indices and set each one to the destination value. Very ugly, but constant lookup time, so maybe useful if the numbers don't get too high and the ranges don't change very often.
Record the limits into a set (or map). When you call insert, the return value is a pair: an iterator and a boolean. If the boolean is true, then a new element was created, which you will have to remove later. After that, step from the iterator and look at what you have found.
See "Return value" at http://www.cplusplus.com/reference/stl/set/insert/
This is a 1-dimensional spatial index. A quadtree-style binary tree will do, for example - and there are several other widely used methods.
A simple Linked List containing the range entries should be quick enough, even for say 50-100 ranges. Moreover, you could implement a Skip List, on say the upper bounds, to speed up these range queries. Yet another possibility is an Interval Tree.
Ultimately I'd choose the simplest: binary search.
You may find a Minimal Perfect Hashing Function useful: http://cmph.sourceforge.net/.