I am trying to find one element in one array, which has the minimum absolute value. For example, in array [5.1, -2.2, 8.2, -1, 4, 3, -5, 6], I want get the value -1. I use following code (myarray is 1D array and not sorted)
for (int i = 1; i < 8; ++i)
{
if(fabsf(myarray[i])<fabsf(myarray[0])) myarray[0] = myarray[i];
}
Then, the target value is in myarray[0].
Because I have to repeat this procedure many times, this piece of code becomes the bottleneck in my program. Does anyone know how to improve this code? Thanks in advance!
BTW, the size of the array is always eight. Could this be used to optimize this code?
Update: so far, following code works slightly better on my machine:
float absMin = fabsf(myarray[0]); int index = 0;
for (int i = 1; i < 8; ++i)
{
if(fabsf(myarray[i])<absMin) {absMin = fabsf(myarray[i]); index=i;}
}
float result = myarray[index];
I am wandering how to avoid fabsf, because I just want to compare the absolute values instead of computing them. Does anyone have any idea?
There are some urban myths like inlining, loop unrolling by hand and similar which are supposed to make your code faster. Good news is you don't have to do it, at least if you use -O3 compiler optimization.
Bad news is, if you already use -O3 there is nothing you can do to speed up this function: the compiler will optimize the hell out of your code! For example it will surely do the caching of fabsf(myarray[0]) as some suggested. The only thing you can achieve with this "refactoring" is to build bugs into your program and make it less readable.
My advice is to look somewhere else for improvements:
try to reduce the number of invocations of this code
if this code is the bottle neck, than my guess would be that you recalculate the minimal value over and over again (otherwise filling the values into the array would take approximately the same time) - so cache the results of the search
shift costs to changing the elements of the array, for example by using some fancy data structures (heaps, priority_queue) or by tracking the minimum of elements. Lets say your array has only two elements values [1,2] so minimum is 1. Now if you change
2 to 3, you don't have to do anything
2 to 0, you can easily update your minimum to 0
1 to 3, you have to loop through all elements. But maybe this case is not that often.
Can you store the values pre fabbed?
Also as #Gerstrong mentions, storing the number outside the loop and only calculating it when array changes will give you a boost.
Calling partial_sort or nth_element will sort the array only so that the correct value is in the right location.
std::nth_element(v.begin(), v.begin(), v.end(), [](float& lhs, float& rhs){
return fabsf(lhs)<fabsf(rhs);
});
Let me give some ideas that could help:
float minVal = fabsf(myarray[0]);
for (int i = 1; i < 8; ++i)
{
if(fabsf(myarray[i])<minVal) minVal = fabsf(myarray[i]);
}
myarray[0] = minVal;
But compilers nowadays are very smart and you might not get any more speed, as you already get optimized code. It depends on how your mentioned piece of code is called.
Another way to optimize this maybe is using C++ and STL, so you can do the following using the typical binary search tree std::set:
// Absolute comparator for std::set
bool absless_compare(const int64_t &a, const int64_t &b)
{
return (fabsf(a) < fabsf(b));
}
std::set<float, absless_compare> mySet = {5.1, -2.2, 8.2, -1, 4, 3, -5, 6};
const float minVal = *(mySet.begin());
With this approach by inserting your numbers they are already sorted in ascending order. The less-Comparator is usually a set for the std::set, but you can change it to use something different like in this example. This might help on larger datasets, but you mentioned you only have eight values to compare, so it really will not help.
Eight elements is a very small number, which might be kept in stack with for example the declaration of std::array<float,8> myarray close to your sorting function before filling it with data. You should that variants on your full codeset and observe what helps. Of course if you declare std::array<float,8> myarray or float[8] myarray runtime you should get the same results.
What you also could check is if fabsf really uses float as parameter and does not convert your variable to double which would degrade the performance. There is also std::abs() which for my understanding deduces the data type, because in C++ you can use templates etc.
If don't want to use fabs obviously a call like this
float myAbs(const float val)
{
return (val<0) ? -val : val;
}
or you hack the bit to zero which make your number negative. Either way, I'm pretty sure, that fabsf is fully aware of that, and I don't think a code like that will make it faster.
So I would check if the argument is converted to double. If you have C99 Standard in your system though, you should not have that issue.
One thought would be to do your comparisons "tournament" style, instead of linearly. In other words, you first compare 1 with 2, 3 with 4, etc. Then you take those 4 elements and do the same thing, and then again, until you only have one element left.
This does not change the number of comparisons. Since each comparison eliminates one element from the running, you will have exactly 7 comparisons no matter what. So why do I suggest this? Because it removes data dependencies from your code. Modern processors have multiple pipelines and can retire multiple instructions simultaneously. However, when you do the comparisons in a loop, each loop iteration depends on the previous one. When you do it tournament style, the first four comparisons are completely independent, so the processor may be able to do them all at once.
In addition to doing that, you can compute all the fabs at once in a trivial loop and put it in a new array. Since the fabs computations are independent, this can get sped up pretty easily. You would do this first, and then the tournament style comparisons to get the index. It should be exactly the same number of operations, it's just changing the order around so that the compiler can more easily see larger blocks that lack data dependencies.
The element of an array with minimal absolute value
Let the array, A
A = [5.1, -2.2, 8.2, -1, 4, 3, -5, 6]
The minimal absolute value of A is,
double miniAbsValue = A.array().abs().minCoeff();
int i_minimum = 0; // to find the position of minimum absolute value
for(int i = 0; i < 8; i++)
{
double ftn = evalsH(i);
if( fabs(ftn) == miniAbsValue )
{
i_minimum = i;
}
}
Now the element of A with minimal absolute value is
A(i_minimum)
Related
I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
int firstDuplicate(vector<int> a) {
vector<int> seen;
for (size_t i = 0; i < a.size(); ++i){
if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
return a[i];
}else{
seen.push_back(a[i]);
}
}
if (seen == a){
return -1;
}
}
Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be use a hash table. In C++, there are the unordered_map and unordered_set classes that are for such types of coding exercises. The unordered_set is effectively a map of keys to bools.
Also, pass you vector by reference, not value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison seems costly and unnecessary at the end.
This is probably closer to what you want:
#include <unordered_set>
int firstDuplicate(const vector<int>& a) {
std::unordered_set<int> seen;
for (int i : a) {
auto result_pair = seen.insert(i);
bool duplicate = (result_pair.second == false);
if (duplicate) {
return (i);
}
}
return -1;
}
std::find is linear time complexity in terms of distance between first and last element (or until the number is found) in the container, thus having a worst-case complexity of O(N), so your algorithm would be O(N^2).
Instead of storing your numbers in a vector and searching for it every time, Yyu should do something like hashing with std::map to store the numbers encountered and return a number if while iterating, it is already present in the map.
std::map<int, int> hash;
for(const auto &i: a) {
if(hash[i])
return i;
else
hash[i] = 1;
}
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since insertion time complexity is constant in average case as compared to logarithmic insertion complexity for std::map.
It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 25620. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
std::vector<bool> seen(input.size()+1);
for (auto i : input) {
if (seen[i])
return i;
seen[i] = true;
}
return -1;
}
The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization to store bools in only one bit apiece. This way we're storing only one bit for each number of input, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...
I've seen some similar questions on the topic, but I'm relatively new to programming and couldn't make sense of some of the language used in the solutions.
Suppose I have 2 finite sets A,B represented as arrays where:
int A[2] = {1, 3};
int B[2] = {1, 2};
I want the bitsets (column vectors V) that represent A and B.
v1 v2
(1) 1, 1
(2) 0, 1
(3) 1, 0
This way I can easily sum row (k) and get the count of appearances for the value k across all of my sets A_1 to A_n.
I am looking for the fastest possible way to do this. I can roughly imagine how I might first initialize a matrix of bit vectors (setting each value to 0) and then loop through each set A_i, setting the corresponding entry of my matrix to 1, but this solution seems useless, because I still have to loop through every element in every set A_i.
I am trying to avoid having to loop through every element of every set by instead getting the count of appearances by summing rows of bits, but I can't figure out how to elegantly make this transformation in a time efficient manner.
Motivation: I'm trying to implement the ID3 decision tree algorithm and am trying to use bit vectors to calculate proportions of labels for entropy calculation.
The key in the presentation is that you don't explicitly form the sets just to build bitsets from them, but construct the bitsets instead of the sets.
In short, you have
std::vector<double> unsortedDataInRow(numDataInRow) = ...;
std::vector<int> labels(numLabels) = ...;
You then obtain
std::vector<unsigned> sortedIndices = getSortedIndices(unsortedDataInRow);
so that unsortedDataInRow[sortedIndices[i]] is sorted. But instead of building std::vector<int> sortedLabels from them, you instead fill a
std::vector<std::vector<bool>> bitsets(numLabels, std::vector<bool>(numDataInRow));
// this gets zero-initialized
in such a way that bitsets[label][i] == (unsortedLabels[sortedIndices[i]] == label):
for (auto sortedIndex : sortedIndices)
bitsets[unsortedLabels[sortedIndices]][sortedIndex] = true;
This helps performance because you (presumably) do the label counting in InfoGain (i.e. determining P(c), which can then be done much faster through popcnt than through counts[labels[i]]++;) much more often than you do the above.
Note that this is just a sketch - std::vector<bool> doesn't have an inbuilt way to get a popcnt. You'd have to hope that your compiler recognizes a handwritten one. Alternatively, use boost::dynamic_bitset, or some other library, or a handwritten one.
I currently have a solution but I feel it's not as efficient as it could be to this problem, so I want to see if there is a faster method to this.
I have two arrays (std::vectors for example). Both arrays contain only unique integer values that are sorted but are sparse in value, ie: 1,4,12,13... What I want to ask is there fast way I can find the INDEX to one of the arrays where the values are the same. For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value index is 1 in array2. The index into the array is what is important as I have other arrays that contain data that will use this index that "matches".
I am not confined to using arrays, maps are possible to. I am only comparing the two arrays once. They will not be reused again after the first matching pass. There can be small to large number of values (300,000+) in either array, but DO NOT always have the same number of values (that would make things much easier)
Worse case is a linear search O(N^2). Using map would get me better O(log N) but I would still have convert an array to into a map of value, index pairs.
What I currently have to not do any container type conversions is this. Loop over the smaller of the two arrays. Compare current element of small array (array1) with the current element of large array (array2). If array1 element value is larger than array2 element value, increment the index for array2 until is it no longer larger than array1 element value (while loop). Then, if array1 element value is smaller than array2 element, go to next loop iteration and begin again. Otherwise they must be equal and I have my index to either arrays of the matching value.
So in this loop, I am at best O(N) if all values have matches and at worse O(2N) if none match. So I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would way I would lean more toward most of the arrays will mostly have matches than not.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};
for(unsigned int i=0, z=0; i < array1.size(); i++)
{
int value1 = array1[i];
while(value1 > array2[z] && z < array2.size())
z++;
if (z >= array2.size())
break; // reached end of array2
if (value1 < array2[z])
continue;
// we have a match, i and z indices have same value
}
Result will be matching indexes for array1 = [1,3] and for array2= [2,3]
I wrote an implementation of this function using an algorithm that performs better with sparse distributions, than the trivial linear merge.
For distributions, that are similar†, it has O(n) complexity but ranges where the distributions are greatly different, it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case isn't better than O(n log n). On the other hand, I haven't been able to find that worst case either.
I templated it so that any type of ranges can be used, such as sub-ranges or raw arrays. Technically it works with non-random access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distribution, I mean that the pair of arrays have many crossings. By crossing, I mean a point where you would switch from one array to another if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>
// helper structure for the search
template<class Range, class Out>
struct search_data {
// is any there clearer way to get iterator that might be either
// a Range::const_iterator or const T*?
using iterator = decltype(std::cbegin(std::declval<Range&>()));
iterator curr;
const iterator begin, end;
Out out;
};
template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
return search_data<Range, Out>{
std::begin(range),
std::begin(range),
std::end(range),
out,
};
}
template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
auto search_data1 = init_search_data(in1, out1);
auto search_data2 = init_search_data(in2, out2);
// initial order is arbitrary
auto lesser = &search_data1;
auto greater = &search_data2;
// if either range is exhausted, we are finished
while(lesser->curr != lesser->end
&& greater->curr != greater->end) {
// difference of first values in each range
auto delta = *greater->curr - *lesser->curr;
if(!delta) { // matching value was found
// store both results and increment the iterators
*lesser->out++ = std::distance(lesser->begin, lesser->curr++);
*greater->out++ = std::distance(greater->begin, greater->curr++);
continue; // then start a new iteraton
}
if(delta < 0) { // set the order of ranges by their first value
std::swap(lesser, greater);
delta = -delta; // delta is always positive after this
}
// next crossing cannot be farther than the delta
// this assumption has following pre-requisites:
// range is sorted, values are integers, values in the range are unique
auto range_left = std::distance(lesser->curr, lesser->end);
auto upper_limit =
std::min(range_left, static_cast<decltype(range_left)>(delta));
// exponential search for a sub range where the value at upper bound
// is greater than target, and value at lower bound is lesser
auto target = *greater->curr;
auto lower = lesser->curr;
auto upper = std::next(lower, upper_limit);
for(int i = 1; i < upper_limit; i *= 2) {
auto guess = std::next(lower, i);
if(*guess >= target) {
upper = guess;
break;
}
lower = guess;
}
// skip all values in lesser,
// that are less than the least value in greater
lesser->curr = std::lower_bound(lower, upper, target);
}
}
#include <iostream>
#include <vector>
int main() {
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34};
std::vector<std::size_t> indices1;
std::vector<std::size_t> indices2;
match_indices(array1, array2,
std::back_inserter(indices1),
std::back_inserter(indices2));
std::cout << "indices in array1: ";
for(std::vector<int>::size_type i : indices1)
std::cout << i << ' ';
std::cout << "\nindices in array2: ";
for(std::vector<int>::size_type i : indices2)
std::cout << i << ' ';
std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distubtions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better than O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14 an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u, (say, u=500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, its less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0-2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other simply than the range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to be able to retrieve the index quickly or if it helps just be able to tell if the index exists in the other array quickly. Whether or not a value exists at a particular index requires only one bit, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off 5 LSBs and use that as a bit position, then shift the remaining 27 bits 5 places right and use that as an array index).
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers) then use that as a lookup-table to into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use 16 LSBs as indices into a lookup-table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of year ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.
I had an algorithm that started out like
int sumLargest2 ( int * arr, size_t n )
{
int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));
// ...
and I realized that the first is probably not optimal because calling max and then min is repetitious when you consider that the information required to know the minimum is already there once you've found the maximum. So I figured out that I could do
int largest = max(arr[0], arr[1]);
int secondLargest = arr[0] == largest ? arr[1] : arr[0];
to shave off the useless invocation of min, but I'm not sure that actually saves any number of operations. Are there any fancy bit-shifting algorithms that can do the equivalent of
int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));
?????
In C++, you can use std::minmax to produce a std::pair of the minimum and the maximum. This is particularly easy in combination with std::tie:
#include <algorithm>
#include <utility>
int largest, secondLargest;
std::tie(secondLargest, largest) = std::minmax(arr[0], arr[1]);
GCC, at least, is capable of optimizing the call to minmax into a single comparison, identical to the result of the C code below.
In C, you could write the test out yourself:
int largest, secondLargest;
if (arr[0] < arr[1]) {
largest = arr[1];
secondLargest = arr[0];
} else {
largest = arr[0];
secondLargest = arr[1];
}
How about:
int largestIndex = arr[1] > arr[0];
int largest = arr[largestIndex];
int secondLargest = arr[1 - largestIndex];
The first line relies on an implicit cast of a boolean result to 1 in the case of true and 0 in the case of false.
I'm going to assume that you'd rather solve the larger problem... That is, getting the sum of the largest two numbers in an array.
What you are trying to do is a std::partial_sort().
Let's implement it.
int sumLargest2(int * arr, size_t n) {
int * first = arr;
int * middle = arr + 2;
int * last = arr + n;
std::partial_sort(first, middle, last, std::greater<int>());
return arr[0] + arr[1];
}
And if you're unable to modify arr, then I'd recommend looking into std::partial_sort_copy().
x = max(a, b);
y = a + b - x;
It won't necessarily be faster, but it will be different.
Also beware of overflows.
If your intention is to reduce the function call to find min mad max you can try std::minmax_element. This is available since C++11.
auto result = std::minmax_element(arr, arr+n);
std::cout<< "min:"<< *result.first<<"\n";
std::cout<< "max :" <<*result.second << "\n";
If you just want to find the bigger of two values go:
if(a > b)
{
largest = a;
second = b;
}
else
{
largest = b;
second = a;
}
No function calls, one comparison, two assignments.
I'm assuming C++...
Short answer, use std::minmax and compile with the right optimizations and the right instruction set parameters.
Long ugly answer, The compiler cannot make all the assumptions necessary to make it really, really fast. You can. In this case, you can change the algorithm to process all data first and you can force alignment on the data. Doing all this, you can use intrinsics to make it faster.
Although I haven't tested it in this particular case, I've seen enormous performance improvements using these guidelines.
Since you're not passing 2 integers to the function, I'm assuming your using an array and want to iterate it somehow. You now have a choice to make: make 2 arrays and use min/max or use 1 array with both a and b. This decision alone can already influence the performance.
If you have 2 arrays, these can be allocated on 32-byte boundaries with aligned malloc's and then processed using intrinsics. If you are going for real, raw performance - this is the way to go.
F.ex, let's assume you have AVX2. (NOTE: I'm not sure if you do and you SHOULD check this using CPU id's!). Go to the cheat sheet here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and pick your poison.
The intrinsics you're looking for are in this case probably:
_mm256_min_epi32
_mm256_max_epi32
_mm256_stream_load_si256
If you have to do this for the entire array, you probably want to keep all the stuff in a single __mm256 register before merging the individual items. E.g.: do a min/max per 256-bit vector, and when the loop is done, extract the 32-bit items and do a min/max on that.
Long nicer answer: So ... as for the compiler. Compilers do attempt to optimize these kinds of things, but run into problems.
If you have 2 different arrays that you process, the compiler has to know that they are different in order to be able to optimize it. This is the reason why stuff like restrict exists, which tells the compiler exactly this little thing you probably already knew while writing the code.
Also, the compiler doesn't know your memory is aligned, so it has to check this and branch... for each call. We don't want this; which means we want it to inline its stuff. So, add inline, put it in a header file and that's that. You can also use aligned to give him a hint.
Your compiler also didn't get the hint that the int* won't change over time. If it cannot change, it's a good idea to tell him that using the const keyword.
A compiler uses an instruction set to do the compilation. Normally, they already use SSE, but AVX2 can help a lot (as I've shown with the intrinsics above). If you can compile it with those flags, make sure to use them - they help a lot.
Run in release mode, compile with optimizations on 'fast' and see what happens under the hood. If you do all this, you should see vpmax... instructions appearing in the inner loops, which means that the compiler uses the intrinsics just fine.
I don't know what else you want to do in the loop... if you use all these instructions you should hit the memory speed on big arrays.
How about a time-space trade-off?
#include <utility>
template<typename T>
std::pair<T, T>
minmax(T const& a, T const& b)
{ return b < a ? std::make_pair(b, a) : std::make_pair(a, b); }
//main
std::pair<int, int> values = minmax(a[0], a[1]);
int largest = values.second;
int secondLargest = values.first;
I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:
Vector of classes
class A
{
double count;
double val;
double sumA;
double sumB;
vector<double> sumVectorC;
vector<double> sumVectorD;
}
vector<A> classes(10000);
The code that needs to run as fast as possible is something like this:
vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
result[i] += classes[i].sumA;
vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if(it != classes[i].sumVectorC.end())
result[i] += *it;
}
The alternative is instead of one giant loop, split the computation into two separate loops such as:
for(int i = 0; i < classes.size(); i++)
{
result[i] += classes[i].sumA;
}
for(int i = 0; i < classes.size(); i++)
{
vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if(it != classes[i].sumVectorC.end())
result[i] += *it;
}
or to store each member of the class in a vector like so:
Class of vectors
vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...
and then operate as:
for(int i = 0; i < classes.size(); i++)
{
result[i] += classCounts[i];
...
}
Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines are the most important things to think about here?
Update
The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.
As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.
Empirical data usually beats speculation.
As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.
Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on XBox 360 and PS3).
It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your linear search O(n) into O(log n).
if you can guarrantee that std::numeric_limits<double>::infinity is not a possible value, ensuring that the arrays are sorted with a dummy infinite entry at the end and then manually coding the find so that the loop condition is a single test:
array[i]<test_val
and then an equality test.
then you know that the average number of looked at values is (size()+1)/2 in the not found case. Of course if the search array changes very frequently then the issue of keeping it sorted is an issue.
of course you don't tell us much about sumVectorC or the rest of A for that matter, so it is hard to ascertain and give really good advice. For example if sumVectorC is never updates then it is probably possible to find an EXTREMELY cheap hash (eg cast ULL and bit extraction) that is perfect on the sumVectorC values that fits into double[8]. Then the overhead is bit extract and 1 comparison versus 3 or 6
Also if you have a bound on sumVectorC.size() that is reasonable(you mentioned 4 or 5 so this assumption seems not bad) you could consider using an aggregated array or even just a boost::array<double> and add your own dynamic size eg :
class AggregatedArray : public boost::array<double>{
size_t _size;
size_t size() const {
return size;
}
....
push_back(..){...
pop(){...
resize(...){...
};
this gets rid of the extra cache line access to the allocated array data for sumVectorC.
In the case of sumVectorC very infrequently updating if finding a perfect hash (out of your class of hash algoithhms)is relatively cheap then you can incur that with profit when sumVectorC changes. These small lookups can be problematic and algorithmic complexity is frequently irrelevant - it is the constants that dominate. It is an engineering problem and not a theoretical one.
Unless you can guarantee that the small maps are in cache you can be almost be guaranteed that using a std::map will yield approximately 130% worse performance as pretty much each node in the tree will be in a separate cache line
Thus instead of accessing (4 times 1+1 times 2)/5 = 1.2 cache lines per search (the first 4 are in first cacheline, the 5th in the second cacheline, you will access (1 + 2 times 2 + 2 times 3) = 9/5) + 1 for the tree itself = 2.8 cachelines per search (the 1 being 1 node at the root, 2 nodes being children of the root, and the last 2 being grandchildren of the root, plus the tree itself)
So I would predict using a std::map to take 2.8/1.2 = 233% as long for a sumVectorC having 5 entries
This what I meant when I said: "It is an engineering problem and not a theoretical one."