Binary Search avoid unreadable entry (hole in list) - c++

I have implemented a binary search function, but I have an issue with list entries that may become unreadable. It's implemented in C++, but I'll just use some pseudo code to make it easier. Please do not focus on the unreadable or string implementation; it's just pseudo code. What matters is that there are unreadable entries in the list that have to be navigated around.
int i = 0;
int imin = 0;
int imax = 99;
string search = "test";
while(imin <= imax)
{
i = imin + (imax - imin) / 2;
string text = vector.at(i);
if(text.isUnreadable())
{
continue;
}
if(compare(text, search) == 0)
{
break;
}
else if(compare(text, search) < 0)
{
imin = i + 1;
}
else if(compare(text, search) > 0)
{
imax = i - 1;
}
}
The searching itself is working pretty well, but the problem I have is how to avoid getting an endless loop if the text is unreadable. Does anyone have a time-tested approach for this? The loop should not just exit when an entry is unreadable but rather navigate around the hole.

I had a similar task in one of my projects: a lookup on a sequence where some of the items are non-comparable.
I am not sure whether this is the best possible implementation, but in my case it looks like this:
int low = first_comparable(0,env);
int high = last_comparable(env.total() - 1,env);
while (low < high)
{
int mid = low + ((high - low) / 2);
int tmid = last_comparable(mid,env);
if( tmid < low )
{
tmid = first_comparable(mid,env);
if( tmid == high )
return high;
if( tmid > high )
return -1;
}
mid = tmid;
...
}
If the item at vector.at(mid) is non-comparable, it looks in the neighborhood of mid to find the closest comparable item.
The first_comparable() and last_comparable() functions return the index of the first comparable element found when scanning from the given index. They differ only in the direction of the scan.
inline int first_comparable( int n, E& env)
{
int n_elements = env.total();
for( ; n < n_elements; ++n )
if( env.is_comparable(n) )
return n;
return n;
}
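For illustration, here is a minimal self-contained sketch of the same idea: when the midpoint is a hole, probe towards the nearest readable neighbour and narrow the range from there. The Entry struct, its readable flag, and the name search_with_holes are assumptions made for this example, not part of the code above, and the readable entries are assumed to be sorted in ascending order.

```cpp
#include <string>
#include <vector>

// Hypothetical entry type: `readable == false` marks a hole.
struct Entry {
    std::string text;
    bool readable;
};

// Returns the index of `key`, or -1 if no readable entry matches.
// Assumes the readable entries are sorted in ascending order.
int search_with_holes(const std::vector<Entry>& items, const std::string& key)
{
    int low = 0;
    int high = static_cast<int>(items.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;

        // Probe to the right of mid for the nearest readable entry in [mid, high].
        int probe = mid;
        while (probe <= high && !items[probe].readable) ++probe;
        if (probe > high) {
            // [mid, high] contains only holes, so continue in the left part.
            high = mid - 1;
            continue;
        }

        int cmp = items[probe].text.compare(key);
        if (cmp == 0) return probe;
        if (cmp < 0) low = probe + 1;   // key can only be right of probe
        else         high = mid - 1;    // [mid, probe) holds only holes
    }
    return -1;
}
```

Every iteration either returns or strictly shrinks [low, high], so the loop cannot spin on an unreadable entry. With long runs of holes the probing degrades towards a linear scan.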

Create a list of pointers to your data items. Do not add "unreadable" ones. Search the resulting list of pointers.
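A rough sketch of this suggestion, using indices instead of pointers so that the position in the original list can be reported. The Record struct and the names readable_indices and find_via_index are hypothetical, and the index has to be rebuilt whenever an entry changes its readable state.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: `readable == false` marks an unreadable entry.
struct Record {
    std::string text;
    bool readable;
};

// Collects the positions of all readable records, in their original order.
std::vector<std::size_t> readable_indices(const std::vector<Record>& items)
{
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (items[i].readable) idx.push_back(i);
    return idx;
}

// Binary search over the filtered index. Returns the position in `items`,
// or -1 if the key is absent. Assumes readable records are sorted ascending.
int find_via_index(const std::vector<Record>& items,
                   const std::vector<std::size_t>& idx,
                   const std::string& key)
{
    auto it = std::lower_bound(
        idx.begin(), idx.end(), key,
        [&items](std::size_t i, const std::string& k) { return items[i].text < k; });
    if (it != idx.end() && items[*it].text == key)
        return static_cast<int>(*it);
    return -1;
}
```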

the problem I have is how to avoid getting an endless loop if the text is unreadable.
Seems like that continue should be break instead, so that you break out of the loop. You'd probably want to set a flag or something to indicate the error to whatever code follows the loop.
Another option is to throw an exception.
Really, you should do almost anything other than what you're doing. Currently, when you read one of these 'unreadable' states, you simply continue the loop. But imin and imax still have the same values, so you end up reading the same string from the same place in the vector, and find that it's unreadable again, and so on. You need to decide how you want to respond to one of these 'unreadable' states. I guessed above that you'd want to stop the search, in which case either setting a flag and breaking out of the loop or throwing an exception to accomplish the same thing would be reasonable choices.

Related

Optimizing code with a lookup table

This portion of my code takes too long to run and I was looking for a way to optimize it. I think a lookup table would be the fastest way but I could be wrong. My program has a main for loop, and for each iteration of the main for loop, a nested loop goes through 1,233,487 iterations and enters the if statements when the conditions are met. The main for loop goes through 898,281 iterations, so it must go through 898,281 * 1,233,487 calculations. How would I go about creating a lookup table to optimize these calculations, or is there a better way to optimize my code?
for (int i = 0; i < all_neutrinos.size(); i++)
{ //all_neutrinos.size() = 898281
int MC_count = 0; //counts per window in the Monte Carlo simulation
int count = 0; //count per window for real data
if (cosmic_ray_events.size() == MC_cosmic_ray_events.size())
{
for (int j = 0; j < cosmic_ray_events.size(); j++)
{ //cosmic_ray_events.size() = 1233487
if ((MC_cosmic_ray_events[j][1] >= (all_neutrinos[i][3] - band_width))
&& (MC_cosmic_ray_events[j][1] <= (all_neutrinos[i][3] + band_width)))
{
if ((earth_radius * fabs(all_neutrinos[i][2] - MC_cosmic_ray_events[j][0]))
<= test_arc_length)
{
MC_count++;
}
}
if ((cosmic_ray_events[j][7] >= (all_neutrinos[i][3] - band_width))
&& (cosmic_ray_events[j][7] <= (all_neutrinos[i][3] + band_width)))
{
if(earth_radius * fabs(all_neutrinos[i][2] - cosmic_ray_events[j][6])
<= test_arc_length)
{
count++;
}
}
}
MCcount_out << i << " " << MC_count << endl;
count_out << i << " " << count << endl;
}
}
First, cosmic_ray_events and MC_cosmic_ray_events are utterly unrelated. Make it two loops.
Sort MC_cosmic_ray_events by [1]. Sort cosmic_ray_events by [7]. Sort all_neutrinos by [3].
This doesn't have to be in-place sorting -- you can sort an array of pointers or indexes into them if you want.
Start with a highwater and lowwater index into your cosmic ray events set to 0.
Now, walk over all_neutrinos. For each one, advance highwater until
MC_cosmic_ray_events[highwater][1] > all_neutrinos[i][3] + band_width. Then advance lowwater until MC_cosmic_ray_events[lowwater][1] >= all_neutrinos[i][3] - band_width.
On the half-open range from j = lowwater up to but not including highwater, run:
if (
(earth_radius * fabs(all_neutrinos[i][2] - MC_cosmic_ray_events[j][0]))
<= test_arc_length
) {
MC_count++;
}
Now repeat until i reaches the end of all_neutrinos.
Then repeat this process, using cosmic_ray_events and [7].
Your code takes O(NM) time. This code takes O(N lg N + M lg M + N * (average bandwidth intersect rate)) time. If relatively few events pass the bandwidth test, this is going to be dramatically faster.
Assuming you get an average of 0.5 intersects per entry in all_neutrinos, this will be on the order of 100000x faster.
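The sweep described above can be sketched roughly as follows. The Event struct, its field names, and count_in_band are hypothetical stand-ins for the array columns in the question, and max_diff is assumed to equal test_arc_length / earth_radius. Queries are visited in sorted order through an index array so that the counts can still be reported in the original order.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical layout: `band` is the column used for the band_width test,
// `angle` is the column used for the arc-length test.
struct Event {
    double band;
    double angle;
};

// counts[i] = number of events whose band lies within band_width of
// queries[i].band and whose angle differs by at most max_diff.
std::vector<int> count_in_band(const std::vector<Event>& queries,
                               std::vector<Event> events,  // copied, then sorted
                               double band_width,
                               double max_diff)
{
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.band < b.band; });

    // Visit the queries in ascending band order without reordering them.
    std::vector<std::size_t> order(queries.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&queries](std::size_t a, std::size_t b) {
                  return queries[a].band < queries[b].band;
              });

    std::vector<int> counts(queries.size(), 0);
    std::size_t low = 0, high = 0;  // window is the half-open range [low, high)
    for (std::size_t q : order) {
        const Event& n = queries[q];
        while (high < events.size() && events[high].band <= n.band + band_width) ++high;
        while (low < high && events[low].band < n.band - band_width) ++low;
        for (std::size_t j = low; j < high; ++j)
            if (std::fabs(n.angle - events[j].angle) <= max_diff) ++counts[q];
    }
    return counts;
}
```

Both water marks only ever move forward, which is what removes the full inner scan. The function would be called once for the Monte Carlo events and once for the real events.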
There is not much to optimize. The counts are really high, and there is not much hard computation going on. There are some obvious optimizations you could do, such as storing (all_neutrinos[i][3] +/- band_width) in local variables before entering the j-loop. Your compiler probably already does this, but doing it by hand would certainly improve performance in debug mode.
Have you tried separating the two halves of the j-loop into two j-loops, as in:
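auto all_neutrinos_2 = all_neutrinos[i][2];
// precompute bandwidth limits
auto all_neutrinos_lower = all_neutrinos[i][3] - band_width;
auto all_neutrinos_higher = all_neutrinos[i][3] + band_width;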
for (int j = 0; j < cosmic_ray_events.size(); j++)
{ //cosmic_ray_events.size() = 1233487
auto MC_events = MC_cosmic_ray_events[j][1];
if ((all_neutrinos_lower <= MC_events) && (MC_events <= all_neutrinos_higher))
{
if ((earth_radius * fabs(all_neutrinos_2 - MC_cosmic_ray_events[j][0]))
<= test_arc_length)
{
MC_count++;
}
}
}
for (int j = 0; j < cosmic_ray_events.size(); j++)
{ //cosmic_ray_events.size() = 1233487
auto events = cosmic_ray_events[j][7];
if ((all_neutrinos_lower <= events) && (events <= all_neutrinos_higher))
{
if(earth_radius * fabs(all_neutrinos_2 - cosmic_ray_events[j][6])
<= test_arc_length)
{
count++;
}
}
}
I have the feeling you could get some improvement from better memory cache hit rates this way.
Any improvement beyond that would involve packing the input data to reduce memory cache misses, and would involve modifying the structure and the code generating the MC_cosmic_ray_events and cosmic_ray_events arrays.
Slicing the counts into several smaller tasks running on different threads is also a route I would look at seriously at this point. Data access is read-only, and each thread can have its own counter; the counters can all be summed at the end.

Buggy simple function for binary search (C++)

I wrote a simple function for binary search, but it's not working as expected. I have a vector with 4000000 32-bit ints. Usually, when I search for a number, if it's there, it's found and the index is returned, if it's not, -1 is returned (the index always corresponds to the value, but that's not the point).
While messing around with the program I found out that it can't find 93 (even though it's there), so there are presumably more values it can't find.
I use CLion, which uses GDB as the debugger and G++ as the compiler.
template<typename T>
int BinarySearch(vector<T>& vec, T& request)
{
int low = 0;
int high = vec.size() - 1;
while (low < high)
{
int mid = (low / 2 ) + (high / 2); // Styled it this way to avoid overflows.
// This looks like where the bug happens, basically low and high both
// become 93 while mid becomes 92,
// it then exits the loop and returns -1 because low is not lower than
// high anymore.
if (vec[mid] == request)
{
return mid;
}
else if (vec[mid] < request)
{
low = mid + 1;
}
else if (vec[mid] > request)
{
high = mid - 1;
}
}
return - 1;
}
I'm pretty confused by this, what's wrong?
The condition should be while (low <= high).
If you keep it as while (low < high), then when low == high (meaning one candidate element is left), the while loop exits and the function returns -1, so your program never checks that element.
You should also use mid = low + (high - low) / 2; to prevent overflow and still reach all values. The problem with (low / 2) + (high / 2) is that each integer division truncates: when low = high = 1, it gives mid = 0, which is wrong.
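Putting both points of this answer together, a corrected version might look like the sketch below. Taking the parameters by const reference is an extra change made here for convenience, not something the answer asked for.

```cpp
#include <vector>

// Returns the index of `request` in the sorted vector, or -1 if absent.
template <typename T>
int BinarySearch(const std::vector<T>& vec, const T& request)
{
    int low = 0;
    int high = static_cast<int>(vec.size()) - 1;
    while (low <= high)  // <= so the last remaining candidate is still checked
    {
        int mid = low + (high - low) / 2;  // no overflow, no double truncation
        if (vec[mid] == request)
            return mid;
        if (vec[mid] < request)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1;
}
```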

Increment the value of a map

I need your help, ideally quickly. It is a very trivial problem, but I still can't work out exactly what I need to put in one line.
I have the following code:
for (busRequest = apointCollection.begin(); busRequest != apointCollection.end(); busRequest++)
{
double Min = DBL_MAX;
int station = 0;
for (int i = 0; i < newStations; i++)
{
distance = sqrt(pow((apointCollection2[i].x - busRequest->x1), 2) + pow((apointCollection2[i].y - busRequest->y1), 2));
if (distance < Min)
{
Min = distance;
station = i;
}
}
if (people.find(station) == people.end())
{
people.insert(pair<int, int>(station, i));
}
else
{
// How can I increment "i" if the key of my station is already in the map?
}
}
Briefly, what I do: I take the first bus request, go through the second loop over the stations, and find the station with the minimum distance. After the second loop, I add that station to my map. As I proceed through all the requests, if the same station comes up again I need to increment its value, which means that the station is being used two times, and so on.
I just need a hint, or the line that I need to add.
Thank you in advance.
I think you meant the minimum distance instead of i? Check and let me know.
for (busRequest = apointCollection.begin(); busRequest != apointCollection.end(); busRequest++)
{
double Min = DBL_MAX;
int station = 0;
for (int i = 0; i < newStations; i++)
{
distance = sqrt(pow((apointCollection2[i].x - busRequest->x1), 2) + pow((apointCollection2[i].y - busRequest->y1), 2));
if (distance < Min)
{
Min = distance;
station = i;
}
}
if (people.find(station) == people.end())
{
people.insert(pair<int, int>(station, i)); // here???
}
else
{
// This routine will increment the value if the key already exists. If it doesn't exist it will create it for you
YourMap[YourKey]++;
}
}
In C++ you can access a map key through operator[] without inserting it first; the map automatically creates the entry with a default value.
In your case, if a station is not present in the people map and you access people[station], then people[station] is automatically set to 0 (a value-initialized int is 0).
So you can just do this:
if (people[station] == 0)
{
// Do something
people[station] = station; // NOTE: i is not accessible here! Check your logic.
}
else
{
people[station]++;
}
Also: in your code, i cannot be accessed inside the if block to insert into the people map, because it is declared in the inner for loop.
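If the goal is simply to count how many requests end up at each station, the if/else is not needed at all. The sketch below assumes the nearest station for each request has already been computed; the function name count_per_station and its input vector are hypothetical.

```cpp
#include <map>
#include <vector>

// Counts how many requests were assigned to each station index.
// `nearest_station` holds one station index per request (assumed input).
std::map<int, int> count_per_station(const std::vector<int>& nearest_station)
{
    std::map<int, int> people;
    for (int station : nearest_station)
        ++people[station];  // operator[] inserts the key with value 0 if missing
    return people;
}
```

For the input {2, 0, 2, 2, 5}, station 2 ends up with a count of 3 and stations 0 and 5 with a count of 1 each.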

Random choices of two values

In my algorithm I have two values that I need to choose at random but each one has to be chosen a predetermined number of times.
So far my solution is to put the choices into a vector the correct number of times and then shuffle it. In C++:
// Example choices (can be any positive int)
int choice1 = 3;
int choice2 = 4;
int number_of_choice1s = 5;
int number_of_choice2s = 1;
std::vector<int> choices;
for(int i = 0; i < number_of_choice1s; ++i) choices.push_back(choice1);
for(int i = 0; i < number_of_choice2s; ++i) choices.push_back(choice2);
std::random_shuffle(choices.begin(), choices.end());
Then I keep an iterator to choices and whenever I need a new one I increase the iterator and grab that value.
This works but it seems like there might be a more efficient way. Since I always know how many of each value I'll use I'm wondering if there is a more algorithmic way to go about doing this, rather than just storing the values.
You are unnecessarily using so much memory. You have two variables:
int number_of_choice1s = 5;
int number_of_choice2s = 1;
Now simply randomize:
int result = rand() % (number_of_choice1s + number_of_choice2s);
if(result < number_of_choice1s) {
--number_of_choice1s;
return choice1;
} else {
--number_of_choice2s;
return choice2;
}
This scales very well to millions of random invocations.
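A possible way to package this approach, using the <random> header instead of rand(). The class name TwoValuePicker and its interface are assumptions for illustration. Because each draw is weighted by the remaining counts, every ordering of the values should be as likely as with the shuffle, while only two counters are stored.

```cpp
#include <random>

// Hands out value1 exactly count1 times and value2 exactly count2 times,
// in random order, without storing the whole sequence.
class TwoValuePicker {
public:
    TwoValuePicker(int value1, int count1, int value2, int count2, unsigned seed)
        : value1_(value1), value2_(value2),
          left1_(count1), left2_(count2), rng_(seed) {}

    bool empty() const { return left1_ + left2_ == 0; }

    // Precondition: !empty().
    int next()
    {
        // Pick value1 with probability left1 / (left1 + left2).
        std::uniform_int_distribution<int> dist(0, left1_ + left2_ - 1);
        if (dist(rng_) < left1_) {
            --left1_;
            return value1_;
        }
        --left2_;
        return value2_;
    }

private:
    int value1_, value2_;
    int left1_, left2_;
    std::mt19937 rng_;
};
```

A caller would construct the picker once, for example TwoValuePicker picker(3, 5, 4, 1, seed), and call next() until empty() returns true.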
You could write this a bit more simply:
std::vector<int> choices(number_of_choice1s, choice1);
choices.resize(number_of_choice1s + number_of_choice2s, choice2);
std::random_shuffle(choices.begin(), choices.end());
A biased random distribution will keep some kind of order over the resulting set (the choice that has been picked the most has less and less chance of being picked next), which gives a biased result. Especially if the number of times you have to pick the first value is large compared to the second value, you'll end up with something like {1,1,1,2,1,1,1,1,2}.
Here's the code, which looks a lot like the one written by @Tomasz Nurkiewicz but uses a simple even/odd test, which should give about a 50/50 chance of picking either value.
int result = rand();
if ( result & 1 && number_of_choice1s > 0)
{
number_of_choice1s--;
return choice1;
}else if (number_of_choice2s>0)
{
number_of_choice2s--;
return choice2;
}
else
{
return -1;
}

Custom sorting, always force 0 to back of ascending order?

Premise
This problem has a known solution (shown below actually), I'm just wondering if anyone has a more elegant algorithm or any other ideas/suggestions on how to make this more readable, efficient, or robust.
Background
I have a list of sports competitions that I need to sort in an array. Due to the nature of this array's population, 95% of the time the list will be pre sorted, so I use an improved bubble sort algorithm to sort it (since it approaches O(n) with nearly sorted lists).
The bubble sort has a helper function called CompareCompetitions that compares two competitions and returns >0 if comp1 is greater, <0 if comp2 is greater, 0 if the two are equal. The competitions are compared first by a priority field, then by game start time, and then by Home Team Name.
The priority field is the trick to this problem. It is an int that holds a positive value or 0. They are sorted with 1 being first, 2 being second, and so on, with the exception that 0 or invalid values are always last.
e.g. the list of priorities
0, 0, 0, 2, 3, 1, 3, 0
would be sorted as
1, 2, 3, 3, 0, 0, 0, 0
The other little quirk, and this is important to the question, is that 95% of the time, priority will be its default 0, because it is only changed if the user wants to manually change the sort order, which is rare. So the most frequent case in the compare function is that priorities are equal and 0.
The Code
This is my existing compare algorithm.
int CompareCompetitions(const SWI_COMPETITION &comp1,const SWI_COMPETITION &comp2)
{
if(comp1.nPriority == comp2.nPriority)
{
//Priorities equal
//Compare start time
int ret = comp1.sStartTime24Hrs.CompareNoCase(comp2.sStartTime24Hrs);
if(ret != 0)
{
return ret; //return compare result
}else
{
//Equal so far
//Compare Home team Name
ret = comp1.sHLongName.CompareNoCase(comp2.sHLongName);
return ret;//Home team name is last field to sort by, return that value
}
}
else if(comp1.nPriority > comp2.nPriority)
{
if(comp2.nPriority <= 0)
return -1;
else
return 1;//comp1 has lower priority
}else /*(comp1.nPriority < comp2.nPriority)*/
{
if(comp1.nPriority <= 0)
return 1;
else
return -1;//comp1 one has higher priority
}
}
Question
How can this algorithm be improved?
And more importantly...
Is there a better way to force 0 to the back of the sort order?
I want to emphasize that this code seems to work just fine, but I am wondering if there is a more elegant or efficient algorithm that anyone can suggest. Remember that nPriority will almost always be 0, and the competitions will usually sort by start time or home team name, but priority must always override the other two.
Isn't it just this?
if (a==b) return other_data_compare(a, b);
if (a==0) return 1;
if (b==0) return -1;
return a - b;
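Spelled out as a standalone helper, the priority part of this idea might look like the sketch below. The names ComparePriority and SortPriorities are hypothetical. Unlike the four lines above, this version treats every non-positive value as the same "last" priority, which is an assumption based on the question's remark that 0 or invalid values always go last.

```cpp
#include <algorithm>
#include <vector>

// Three-way comparison of priorities: 1 sorts first, then 2, and so on;
// 0 or invalid (negative) values always sort last and compare equal.
int ComparePriority(int a, int b)
{
    const bool a_last = a <= 0;
    const bool b_last = b <= 0;
    if (a_last || b_last)
        return static_cast<int>(a_last) - static_cast<int>(b_last);
    return (a > b) - (a < b);  // -1, 0 or 1 without risking overflow
}

// Example use: sorts a list of bare priorities with the rule above.
void SortPriorities(std::vector<int>& priorities)
{
    std::sort(priorities.begin(), priorities.end(),
              [](int a, int b) { return ComparePriority(a, b) < 0; });
}
```

Applied to the list 0, 0, 0, 2, 3, 1, 3, 0 from the question, this yields 1, 2, 3, 3, 0, 0, 0, 0. When ComparePriority returns 0, the caller falls through to the start time and home team comparisons.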
You can also reduce some of the code verbosity using the ternary operator like this:
int CompareCompetitions(const SWI_COMPETITION &comp1,const SWI_COMPETITION &comp2)
{
if(comp1.nPriority == comp2.nPriority)
{
//Priorities equal
//Compare start time
int ret = comp1.sStartTime24Hrs.CompareNoCase(comp2.sStartTime24Hrs);
return ret != 0 ? ret : comp1.sHLongName.CompareNoCase(comp2.sHLongName);
}
else if(comp1.nPriority > comp2.nPriority)
return comp2.nPriority <= 0 ? -1 : 1;
else /*(comp1.nPriority < comp2.nPriority)*/
return comp1.nPriority <= 0 ? 1 : -1;
}
See?
This is much shorter and, in my opinion, easy to read.
I know it's not what you asked for, but it's also important.
Is it intended that if the case nPriority1 < 0 and nPriority2 < 0 but nPriority1 != nPriority2 the other data aren't compared?
If it isn't, I'd use something like
int nPriority1 = comp1.nPriority <= 0 ? INT_MAX : comp1.nPriority;
int nPriority2 = comp2.nPriority <= 0 ? INT_MAX : comp2.nPriority;
if (nPriority1 == nPriority2) {
// current code
} else {
return nPriority1 - nPriority2;
}
which treats values less than or equal to 0 the same as the maximum possible value.
(Note that optimizing for performance is probably not worthwhile, given that there are case-insensitive comparisons in the most common path.)
If you can, it seems like modifying the priority scheme would be the most elegant, so that you could just sort normally. For example, instead of storing a default priority as 0, store it as 999, and cap user defined priorities at 998. Then you won't have to deal with the special case anymore, and your compare function can have a more straightforward structure, with no nesting of if's:
(pseudocode)
if (priority1 < priority2) return -1;
if (priority1 > priority2) return 1;
if (startTime1 < startTime2) return -1;
if (startTime1 > startTime2) return 1;
if (teamName1 < teamName2) return -1;
if (teamName1 > teamName2) return 1;
return 0; // exact match!
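In C++, once the default priority is stored as a large value, a field-by-field comparison like this can be delegated to std::tie. The sketch below uses a hypothetical Competition struct with std::string fields; note that it compares case-sensitively, unlike the CompareNoCase calls in the question.

```cpp
#include <string>
#include <tuple>

// Hypothetical record: priority uses 999 as the "unset" default so that
// plain ascending order already puts default entries last.
struct Competition {
    int priority;
    std::string startTime;
    std::string homeTeam;
};

// True if `a` should be placed before `b`: priority first, then start time,
// then home team name, each compared in ascending order.
bool ComesBefore(const Competition& a, const Competition& b)
{
    return std::tie(a.priority, a.startTime, a.homeTeam)
         < std::tie(b.priority, b.startTime, b.homeTeam);
}
```

A predicate of this shape can be passed directly to std::sort or std::stable_sort.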
I think the inelegance you feel about your solution comes from duplicate code for the zero priority exception. The Pragmatic Programmer explains that each piece of information in your source should be defined in "one true" place. To the naive programmer reading your function, you want the exception to stand out, separate from the other logic, in one place, so that it is readily understandable. How about this?
if(comp1.nPriority == comp2.nPriority)
{
// unchanged
}
else
{
int result, lowerPriority;
if(comp1.nPriority > comp2.nPriority)
{
result = 1;
lowerPriority = comp2.nPriority;
}
else
{
result = -1;
lowerPriority = comp1.nPriority;
}
// zero is an exception: always goes last
if(lowerPriority == 0)
result = -result;
return result;
}
I Java-ized it, but the approach will work fine in C++:
int CompareCompetitions(Competition comp1, Competition comp2) {
int n = comparePriorities(comp1.nPriority, comp2.nPriority);
if (n != 0)
return n;
n = comp1.sStartTime24Hrs.compareToIgnoreCase(comp2.sStartTime24Hrs);
if (n != 0)
return n;
n = comp1.sHLongName.compareToIgnoreCase(comp2.sHLongName);
return n;
}
private int comparePriorities(int a, int b) {
if (a == b)
return 0;
if (a <= 0)
return 1; // zero or invalid priority always sorts last
if (b <= 0)
return -1;
return a - b;
}
Basically, just extract the special-handling-for-zero behavior into its own function, and iterate along the fields in sort-priority order, returning as soon as you have a nonzero.
As long as the highest priority is not larger than INT_MAX/2, you could do
#include <climits>
const int bound = INT_MAX/2;
int pri1 = (comp1.nPriority + bound) % (bound + 1);
int pri2 = (comp2.nPriority + bound) % (bound + 1);
This will turn priority 0 into bound and shift all other priorities down by 1. The advantage is that you avoid comparisons and make the remainder of the code look more natural.
In response to your comment, here is a complete solution that avoids the translation in the 95% case where priorities are equal. Note, however, that your concern over this is misplaced since this tiny overhead is negligible with respect to the overall complexity of this case, since the equal-priorities case involves at the very least a function call to the time comparison method and at worst an additional call to the name comparator, which is surely at least an order of magnitude slower than whatever you do to compare the priorities. If you are really concerned about efficiency, go ahead and experiment. I predict that the difference between the worst-performing and best-performing suggestions made in this thread won't be more than 2%.
#include <climits>
int CompareCompetitions(const SWI_COMPETITION &comp1,const SWI_COMPETITION &comp2)
{
if(comp1.nPriority == comp2.nPriority)
if(int ret = comp1.sStartTime24Hrs.CompareNoCase(comp2.sStartTime24Hrs))
return ret;
else
return comp1.sHLongName.CompareNoCase(comp2.sHLongName);
const int bound = INT_MAX/2;
int pri1 = (comp1.nPriority + bound) % (bound + 1);
int pri2 = (comp2.nPriority + bound) % (bound + 1);
return pri1 > pri2 ? 1 : -1;
}
Depending on your compiler/hardware, you might be able to squeeze out a few more cycles by replacing the last line with
return (pri1 > pri2) * 2 - 1;
or
return (pri1-pri2 > 0) * 2 - 1;
or (assuming 2's complement)
return ((pri1-pri2) >> (CHAR_BIT*sizeof(int) - 1)) | 1;
Final comment: Do you really want CompareCompetitions to return 1,-1,0 ? If all you need it for is bubble sort, you would be better off with a function returning a bool (true if comp1 is ">=" comp2 and false otherwise). This would simplify (albeit slightly) the code of CompareCompetitions as well as the code of the bubble sorter. On the other hand, it would make CompareCompetitions less general-purpose.