Optimizing a comparison over array elements with two conditions; C++ abstraction mechanisms? - c++

My question is a follow-up to How to make this code faster (learning best practices)?, which has been put on hold (bummer). The problem is to optimize a loop over an array of floats which are tested for whether they lie within a given interval. Indices of matching elements are to be stored in a provided result array.
The test includes two conditions (smaller than the upper threshold and bigger than the lower one). The obvious code for the test is if( elem <= upper && elem >= lower ) .... I observed that branching (including the implicit branch involved in the short-circuiting operator&&) is much more expensive than the second comparison. What I came up with is below. It is about 20%-40% faster than a naive implementation, more than I expected. It uses the fact that bool is an integer type. The condition test result is used as an index into two result arrays. Only one of them will contain the desired data, the other one can be discarded. This replaces program structure with data structure and computation.
I am interested in more ideas for optimization. "Technical hacks" (of the kind provided here) are welcome. I'm also interested in whether modern C++ could provide means to be faster, e.g. by enabling the compiler to create parallel running code. Think visitor pattern/functor. Computations on the individual srcArr elements are almost independent, except that the order of indices in the result array depends on the order in which the source elements are tested. I would loosen the requirements a little so that the order of the matching indices reported in the result array is irrelevant. Can anybody come up with a fast way?
Here is the source code of the function. A supporting main is below. gcc needs -std=c++11 because of chrono. VS 2013 express was able to compile this too (and created 40% faster code than gcc -O3).
#include <cstdlib>
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
/// Check all elements in srcArr whether they lie in
/// the interval [lower, upper]. Store the indices of
/// such elements in the array pointed to by destArr[1]
/// and return the number of matching elements found.
/// This has been highly optimized, mainly to avoid branches.
int findElemsInInterval( const float srcArr[],   // contains candidates
                         int **const destArr,    // two arrays to be filled with indices
                         const int arrLen,       // length of each array
                         const float lower, const float upper // interval
                       )
{
    // Instead of branching, use the condition
    // as an index into two distinct arrays. We need to keep
    // separate indices for both those arrays.
    int destIndices[2];
    destIndices[0] = destIndices[1] = 0;
    for( int srcInd=0; srcInd<arrLen; ++srcInd )
    {
        // If the element is inside the interval, both conditions
        // are true and therefore equal. In all other cases
        // exactly one condition is true so that they are not equal.
        // Matching elements' indices are therefore stored in destArr[1].
        // destArr[0] is a kind of a dummy (it will incidentally contain
        // indices of non-matching elements).
        // This used to be (with a simple int *destArr)
        // if( srcArr[srcInd] <= upper && srcArr[srcInd] >= lower ) destArr[destIndex++] = srcInd;
        int isInInterval = (srcArr[srcInd] <= upper) == (srcArr[srcInd] >= lower);
        destArr[isInInterval][destIndices[isInInterval]++] = srcInd;
    }
    return destIndices[1]; // the number of elements in the results array
}
int main(int argc, char *argv[])
{
    int arrLen = 1000*1000*100;
    if( argc > 1 ) arrLen = atol(argv[1]);
    // destArr[1] will hold the indices of elements which
    // are within the interval.
    int *destArr[2];
    // We don't check destination boundaries, so make them
    // the same length as the source.
    destArr[0] = new int[arrLen];
    destArr[1] = new int[arrLen];
    float *srcArr = new float[arrLen];
    // Create always the same numbers for comparison (don't srand).
    for( int srcInd=0; srcInd<arrLen; ++srcInd ) srcArr[srcInd] = rand();
    // Create an interval in the middle of the rand() spectrum.
    float lowerLimit = RAND_MAX/3;
    float upperLimit = lowerLimit*2;
    cout << "lower = " << lowerLimit << ", upper = " << upperLimit << endl;
    int numInterval;
    auto t1 = high_resolution_clock::now(); // measure wall-clock time as an approximation
    // Call the function a few times to get a longer run time.
    for( int srcInd=0; srcInd<10; ++srcInd )
        numInterval = findElemsInInterval( srcArr, destArr, arrLen, lowerLimit, upperLimit );
    auto t2 = high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
    cout << numInterval << " elements found in " << duration << " milliseconds. " << endl;
    return 0;
}

Thinking of the integer range-check optimization that turns a <= x && x < b into ((unsigned)(x-a)) < b-a, a floating-point variant comes to mind:
You could try something like

const float radius = (b-a)/2;
if( fabs( x-(a+radius) ) < radius )
    ...

to reduce the check to a single conditional. (fabs needs <cmath>; note that the strict < makes the interval open at its endpoints, and rounding when computing (b-a)/2 and a+radius can shift the boundaries slightly.)
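A self-contained version of that idea, written with <= so the interval stays closed, might look like this (a sketch; the helper name is mine):

#include <cmath>

// One-comparison interval test: |x - centre| <= radius  <=>  a <= x <= b
// (exact only up to rounding when computing centre and radius).
inline bool inInterval(float x, float a, float b)
{
    const float radius = (b - a) / 2;
    const float centre = a + radius;
    return std::fabs(x - centre) <= radius;
}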

I see about a 10% speedup from this:

int destIndex = 0; // replaces the destIndices pair
int isInInterval = (srcArr[srcInd] <= upper) == (srcArr[srcInd] >= lower);
destArr[1][destIndex] = srcInd; // always write; a non-match gets overwritten next time
destIndex += isInInterval;      // advance the index only on a match

Eliminate the pair of output arrays. Instead, only advance the 'number written' by 1 if you want to keep the result; otherwise just keep overwriting the 'one past the end' index.
I.e., retval[destIndex] = curIndex; destIndex += isInArray; -- better coherency and less wasted memory.
Write two versions: one that supports a fixed array length (of say 1024 or whatever) and another that supports a runtime parameter. Use a template argument to remove code duplication. Assume the length is less than that constant.
Have the function return the size and an RVO'd std::array<unsigned, 1024> (a sketch follows below).
Write a wrapper function that merges results (create all results, then merge them). Then throw the Parallel Patterns Library at the problem (so the results get computed in parallel).
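A minimal sketch of that fixed-length version, under assumptions of my own (the names, the pair return type, and the arrLen <= MaxLen precondition are mine, not the commenter's):

#include <array>
#include <cstddef>
#include <utility>

// Fixed-capacity variant: MaxLen is a compile-time constant, which gives the
// optimizer a known bound on the trip count. Only the first 'count' entries
// of the returned array are valid. Assumes arrLen <= MaxLen.
template<std::size_t MaxLen>
std::pair<std::array<unsigned, MaxLen>, unsigned>
findElemsFixed(const float *srcArr, unsigned arrLen, float lower, float upper)
{
    std::array<unsigned, MaxLen> result;
    unsigned count = 0;
    for (unsigned i = 0; i < arrLen; ++i)
    {
        result[count] = i;                                     // always write...
        count += (srcArr[i] <= upper) == (srcArr[i] >= lower); // ...advance only on a match
    }
    return std::make_pair(result, count);
}

A runtime-length wrapper can forward to, say, findElemsFixed<1024> chunk by chunk, and a parallel driver can process the chunks independently and merge the (array, count) pairs afterwards.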

If you allow yourself vectorization using the SSE (or better, AVX) instruction set, you can perform 4/8 comparisons in one go, do this twice, 'and' the results, and retrieve the 4 results (-1 or 0 per lane). At the same time, this unrolls the loop.

#include <xmmintrin.h> // SSE intrinsics

// Preload the bounds into all four lanes
__m128 lo = _mm_set1_ps(lower);
__m128 up = _mm_set1_ps(upper);
int srcInd, dstIndex = 0;
for( srcInd = 0; srcInd + 3 < arrLen; )
{
    __m128 src = _mm_loadu_ps(&srcArr[srcInd]); // Load 4 values (new[] gives no 16-byte alignment guarantee)
    __m128 tst = _mm_and_ps(_mm_cmple_ps(src, up), _mm_cmpge_ps(src, lo)); // Test both bounds
    // A matching lane is all-ones, which reads as the integer -1
    // (m128_i32 is MSVC-specific; see the portable variant below).
    // Copy the 4 indices with conditional incrementation.
    dstArr[dstIndex] = srcInd++; dstIndex -= tst.m128_i32[0];
    dstArr[dstIndex] = srcInd++; dstIndex -= tst.m128_i32[1];
    dstArr[dstIndex] = srcInd++; dstIndex -= tst.m128_i32[2];
    dstArr[dstIndex] = srcInd++; dstIndex -= tst.m128_i32[3];
}
// A scalar loop is still needed for the leftover elements when arrLen is not a multiple of 4.
CAUTION: unchecked code.
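A more portable way to pull the lane results out, avoiding the MSVC-specific union member, is _mm_movemask_ps, which packs the four comparison sign bits into an int. A sketch of the same loop using it (lo, up, dstArr, and dstIndex as above):

#include <xmmintrin.h>

for (int srcInd = 0; srcInd + 3 < arrLen; srcInd += 4)
{
    __m128 src  = _mm_loadu_ps(&srcArr[srcInd]);
    __m128 tst  = _mm_and_ps(_mm_cmple_ps(src, up), _mm_cmpge_ps(src, lo));
    int    mask = _mm_movemask_ps(tst);   // bit k is set iff lane k matched
    for (int k = 0; k < 4; ++k)
    {
        dstArr[dstIndex] = srcInd + k;    // always write...
        dstIndex += (mask >> k) & 1;      // ...advance only on a match
    }
}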

Related

Why does sorting call the comparison function less often than a linear minimum search algorithm?

I'll start by giving some context. I'm learning to write a raytracer, a very simple one. I don't have any acceleration structures yet, so the code in question is intended to find the closest object that the ray hits. Since I'm still learning, I'd greatly appreciate it if the answers concentrated on the seemingly strange problem that I'm observing - I know the RT logic is very wrong as it is right now. It produces correct results, anyway.
1. The first approach: for every hit, add a hit-result structure object to a list, then apply std::sort with a predicate that compares the distance from the hit point to the ray origin. Should be O(N log N) according to the textbook, and I think it is suboptimal, since I only need the first result, not the whole sorted list.
2. The second approach: whenever there is a hit, take the distance and compare it to the minimum, which is first initialized to std::numeric_limits<float>::max(). Your standard "find the min in an array" algorithm. Should be O(N) and thus faster.
These pieces of code reside in a recursive function. Tested on the very same scene of 10 spheres, approach 1 is faster by an order of magnitude, and the number of calls to the distance function is a few times smaller than in approach 2. What am I missing?
I'm not sure if the context is required; in case there are "branches to be cut" off this question, tell me.
Code piece 1:
result rt_function(...) {
    static int count{};
    std::vector<result> hitList;
    for( const auto& obj : objList ) {
        const result res = obj->testOuter(ray);
        if( res.hit ) {
            hitList.push_back(res);
        }
    }
    if( !hitList.empty() ) {
        sort(hitList.begin(), hitList.end(), [=](result& hit1, result& hit2) -> bool {
            std::cerr << ++count << '\n';
            return cv::norm(hit1.point - ray.origin) <
                   cv::norm(hit2.point - ray.origin);
        });
        const result res = hitList.front();
        const SceneObject* near = res.obj;
        // the raytracing continues...
count == 180771
Code piece 2:
result rt_function(...) {
    static int count{};
    float min_distance = std::numeric_limits<float>::max(), distance{};
    result closest_res{}; bool have_hit{};
    for( const auto& obj : objList ) {
        const result res = obj->testOuter(ray);
        if( res.hit ) {
            have_hit = true;
            std::cerr << ++count << '\n';
            distance = cv::norm(res.point - ray.origin);
            if( distance < min_distance ) {
                min_distance = distance; closest_res = res;
            }
        }
    }
    if( have_hit ) {
        const result res = closest_res;
        const SceneObject* near = res.obj;
        // the raytracing continues...
count == 349633
I want to (a) understand why there are fewer comparisons and (b) find where the bottleneck is, since the run time is significantly higher, as I've noted above.
Statements like O(N²) describe how running time scales: double the number of points and the time taken quadruples. An O(log N) algorithm can still be slow for small N; the point is that when N doubles, or grows by a factor of 10, its running time barely increases.
Compare finding a specific word in a 1000-page dictionary with finding one in a 20-word sentence. Sorting the 20-word sentence before looking for the word takes longer than reading it straight through once.
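Incidentally, both code pieces above spend time on things besides the algorithm: the unbuffered std::cerr logging runs once per comparison or per hit, and code piece 1 recomputes cv::norm for the same points on every comparison. A sketch of code piece 2 with the logging removed and each distance computed exactly once, assuming res.point and ray.origin are cv::Point3f so that squared distances can replace cv::norm:

#include <limits>

// One pass over the scene objects, remembering the nearest hit.
result closest_res{}; bool have_hit{};
float min_d2 = std::numeric_limits<float>::max();
for (const auto& obj : objList) {
    const result res = obj->testOuter(ray);
    if (!res.hit) continue;
    const auto d = res.point - ray.origin;
    const float d2 = d.dot(d);  // squared length orders hits the same way the norm does
    if (d2 < min_d2) { min_d2 = d2; closest_res = res; have_hit = true; }
}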

first value of a loop in c++ different for the others

I need to make the first value of a loop 0, and then use a range for the rest of the loop.
In MATLAB this is possible: x = [0 -range:range] (range is an integer).
This gives the values [0, -range, -range+1, -range+2, ..., range-1, range].
The problem is I need to do this in C++. I tried building an array and then using its values in the loop, without success.
% After loading 2 images, put them into matrices and then compare each block.
for r=1:bRows
    for c=1:bCols
        rb=r*blockSize;
        cb=c*blockSize;
        % for each block, search in the nearby positions (1.5 block sizes)
        search=blockSize*1.5;
        for dr= [0 -search:search] % Here's the problem.
            for dc= [0 -search:search]
                % check if it is inside the image
                if(rb+dr-blockSize+1>0 && rb+dr<=rows && cb+dc-blockSize+1>0 && cb+dc<=cols)
                    % compute the error and check if it is lower than the previous one or not
                    block=I1(rb+dr-blockSize+1:rb+dr,cb+dc-blockSize+1:cb+dc,1);
                    TE=sum( sum( abs( block - cell2mat(B2(r,c)) ) ) );
                    if(TE<E)
                        M(r,c,:)=[dr dc]; % store the motion vector
                        Err(r,c,:)=TE;    % store the error
                        E=TE;
                    end
                end
            end
        end
        % reset the error for the next search
        E=255*blockSize^2;
    end
end
C++ doesn't natively support ranges of the kind you know from MatLab, although external solutions are available, if somewhat of an overkill for your use case. However, C++ allows you to implement them easily (and efficiently) using the primitives provided by the language, such as for loops and resizable arrays. For example:
#include <vector>

// Return a vector consisting of
// {0, -limit, -limit+1, ..., limit-1, limit}.
std::vector<int> build_range0(int limit)
{
    std::vector<int> ret{0};
    for (auto i = -limit; i <= limit; i++)
        ret.push_back(i);
    return ret;
}
The resulting vector can be easily used for iteration:
for (int dr : build_range0(search)) {
    for (int dc : build_range0(search)) {
        if (rb + dr - blockSize + 1 > 0 && ...)
            ...
    }
}
The above of course wastes some space to create a temporary vector, only to throw it away (which I suspect happens in your MatLab example as well). If you want to just iterate over the values, you will need to incorporate the loop such as the one in build_range0 directly in your function. This has the potential to reduce readability and introduce repetition. To keep the code maintainable, you can abstract the loop into a generic function that accepts a callback with the loop body:
// Call fn(0), fn(-limit), fn(-limit+1), ..., fn(limit-1), and fn(limit)
template<typename F>
void for_range0(int limit, F fn) {
    fn(0);
    for (auto i = -limit; i <= limit; i++)
        fn(i);
}
The above function can be used to implement iteration by providing the loop body as an anonymous function:
for_range0(search, [&](int dr) {
    for_range0(search, [&](int dc) {
        if (rb + dr - blockSize + 1 > 0 && ...)
            ...
    });
});
(Note that both anonymous functions capture enclosing variables by reference in order to be able to mutate them.)
Reading your comment, you could do something like this:

bool zero = true;   // true while we are on the special first iteration
for (int i = 0; i < 5; i++)
{
    cout << "hi" << endl;
    if (zero)
    {
        i = 3;
        zero = false;
    }
}

This starts at i = 0; then, after doing what you want it to do, it assigns i the value 3 and continues incrementing from there on each iteration.
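Alternatively, the special first value can simply be handled before the loop. A sketch (process stands for whatever the loop body does; note that 0 is then visited twice, exactly as in the MATLAB expression [0 -range:range], whose range also contains 0):

// Handle dr = 0 first, then sweep the whole range.
process(0);
for (int dr = -search; dr <= search; ++dr)
    process(dr);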

Finding the minimum in an array (but skipping some elements) using reduction in CUDA

I have a large array of floating point numbers and I want to find out the minimum value of the array (ignoring -1s wherever present) as well as its index, using reduction in CUDA. I have written the following code to do this, which in my opinion should work:
__global__ void get_min_cost(float *d_Cost, int n, int *last_block_number, int *number_in_last_block, int *d_index){
    int tid = threadIdx.x;
    int myid = blockDim.x * blockIdx.x + threadIdx.x;
    int s;
    if(blockIdx.x == (*last_block_number)-1){ // this is the (partial) last block
        s = (*number_in_last_block)/2;
    }else{
        s = 1024/2;
    }
    for(;s>0;s/=2){
        if(myid+s>=n)
            continue;
        if(tid<s){
            if(d_Cost[myid+s] == -1){
                continue;
            }else if(d_Cost[myid] == -1 && d_Cost[myid+s] != -1){
                d_Cost[myid] = d_Cost[myid+s];
                d_index[myid] = d_index[myid+s];
            }else{
                // both not -1
                if(d_Cost[myid]<=d_Cost[myid+s])
                    continue;
                else{
                    d_Cost[myid] = d_Cost[myid+s];
                    d_index[myid] = d_index[myid+s];
                }
            }
        }
        else
            continue;
        __syncthreads();
    }
    if(tid==0){
        d_Cost[blockIdx.x] = d_Cost[myid];
        d_index[blockIdx.x] = d_index[myid];
    }
    return;
}
The last_block_number argument is the id of the last block, and number_in_last_block is the number of elements in the last block (which is a power of 2). Thus, all blocks will launch 1024 threads every time; the last block will only use number_in_last_block threads, while the others will use all 1024.
After this function runs, I expect the minimum value for each block to be in d_Cost[blockIdx.x] and its index in d_index[blockIdx.x].
I call this function multiple times, each time updating the number of threads and blocks. The second time I call it, the number of threads becomes equal to the number of blocks remaining, and so on.
However, the above function isn't giving me the desired output. In fact, it gives a different output every time I run the program, i.e., it returns an incorrect value as the minimum during some intermediate iteration (though that incorrect value is quite close to the minimum every time).
What am I doing wrong here?
As I mentioned in my comment above, I would recommend avoiding writing reductions of your own and using CUDA Thrust whenever possible. This holds true even when you need to customize those operations; the customization is possible by properly overloading, e.g., the relational operations.
Below I'm providing a simple code to evaluate the minimum in an array along with its index. It is based on a classical example contained in the An Introduction to Thrust presentation. The only addition is skipping, as you requested, the -1's from the counting. This can reasonably be done by replacing all the -1's in the array with INT_MAX, i.e., the largest representable int (the example uses int data for simplicity; with float data you would use FLT_MAX instead).
#include <thrust/device_vector.h>
#include <thrust/replace.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <climits>
#include <cstdio>

// --- Struct returning the smaller of two (value, index) tuples
struct smaller_tuple
{
    __host__ __device__ thrust::tuple<int,int> operator()(thrust::tuple<int,int> a, thrust::tuple<int,int> b)
    {
        if (a < b)  // lexicographic: compares the value first, then the index
            return a;
        else
            return b;
    }
};

int main() {
    const int N = 20;
    const int large_value = INT_MAX;
    // --- Setting the data vector
    thrust::device_vector<int> d_vec(N,10);
    d_vec[3] = -1; d_vec[5] = -2;
    // --- Copying the data vector to a new vector where the -1's are changed to large_value
    thrust::device_vector<int> d_vec_temp(d_vec);
    thrust::replace(d_vec_temp.begin(), d_vec_temp.end(), -1, large_value);
    // --- Creating the index sequence [0, 1, 2, ... )
    thrust::device_vector<int> indices(d_vec_temp.size());
    thrust::sequence(indices.begin(), indices.end());
    // --- Setting the initial value of the search
    thrust::tuple<int,int> init(d_vec_temp[0],0);
    thrust::tuple<int,int> smallest;
    smallest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.begin(), indices.begin())),
                              thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.end(), indices.end())),
                              init, smaller_tuple());
    printf("Smallest %i %i\n", thrust::get<0>(smallest), thrust::get<1>(smallest));
    getchar();
    return 0;
}
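For this particular task (a minimum together with its index), thrust::min_element is an even shorter route, and it works directly on float data like yours. A sketch along the same lines (the -1 sentinels are masked with FLT_MAX first):

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/replace.h>
#include <cfloat>
#include <cstdio>

int main() {
    thrust::device_vector<float> d_cost(20, 10.0f);
    d_cost[3] = -1.0f; d_cost[7] = 4.0f;
    // Mask the -1 sentinels so they can never win the minimum.
    thrust::replace(d_cost.begin(), d_cost.end(), -1.0f, FLT_MAX);
    // min_element returns an iterator; its offset from begin() is the index.
    thrust::device_vector<float>::iterator it = thrust::min_element(d_cost.begin(), d_cost.end());
    int index = it - d_cost.begin();
    float value = *it;
    printf("Smallest %f at %i\n", value, index);
    return 0;
}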

Calculating averages of elements in a sequence without iterating over it

Write a class that takes in a step size (n) in its constructor. The only method in the class takes in an integer, adds it into a sequence of numbers, and returns the average of the last n values inserted into the sequence. Do not iterate over the sequence to calculate the average.
And NO, this isn't homework
Following is my way of doing it in C++:
Initialize two STL queue<int>s, one of which holds n values and is called buffer.
User-input values are stored dynamically in the buffer. Once this buffer is full, add the user-input value to "sum" and subtract the buffer.front() value.
Push the front value from buffer into the second queue<int>, named values.
Pop the front value (buffer.pop()).
Return the average by dividing sum by n.
Following is the code I came up with:
#ifndef calcAverage_Window_h
#define calcAverage_Window_h
#include <iostream>
#include <queue>
using namespace std;

class Window{
private:
    int n, sum;
    queue<int> values, buffer, sums;
public:
    Window(int);
    float calcAverage(int);
};
#endif
#include "Window.h"

Window::Window(int m){
    n = m;
    buffer.push(1);
    buffer.push(2);
    buffer.push(3);
    sum = 6;
}

float Window::calcAverage(int val){
    buffer.push(val);
    values.push(buffer.front());
    sum = sum + val - buffer.front();
    buffer.pop();
    return float(sum)/n; // float(sum) required so that calcAverage doesn't return an int
}
#include "Window.h"

int main()
{
    Window w(3);
    cout << w.calcAverage(4) << endl;
    cout << w.calcAverage(5) << endl;
    cout << w.calcAverage(6) << endl;
    return 0;
}
I have the following questions:
Is there a better way to do this?
If we are not allowed to use STL either, I would implement a queue and use that for buffer and values. Does anyone have a better idea?
I cheated a bit by initializing the buffer in the Window(n) constructor. That is because: 1) I did not know how else I would go about it,
2) it may be clear what to preload for the case when n = 2, but it is ambiguous for n = 3.
Where will this method / code fail?
I came to think of this way empirically. Is there an algorithmic way to look at this problem?
To answer a few of your questions:
Where will this method / code fail?
Well, assuming the above code is bug-free, it will not necessarily work correctly if you decide to move to floating-point data.
Note that its overflow behaviour is also subtly different compared to a direct implementation of a moving average.
Is there an algorithmic way to look at this problem?
Yes. With window size L the moving sums for time n and time n-1 are as follows:
y[n] = x[n] + x[n-1] + ... + x[n-L+1]
y[n-1] = x[n-1] + ... + x[n-L+1] + x[n-L]
Subtract one equation from the other and you get:
y[n] - y[n-1] = x[n] - x[n-L]
Move y[n-1] to the other side of the equals sign, and you're done.
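That recurrence maps directly onto code. A minimal sketch (my own class, not the asker's) that keeps the last L values in a circular buffer, updates the sum in O(1) per insertion, and averages only over the values actually present, which avoids preloading the buffer in the constructor:

#include <vector>

class MovingAverage {
    std::vector<int> window;  // the last L values, used as a circular buffer
    int pos;                  // next slot to overwrite
    int count;                // number of values inserted so far, capped at L
    long long sum;            // running sum: y[n] = y[n-1] + x[n] - x[n-L]
public:
    explicit MovingAverage(int L) : window(L, 0), pos(0), count(0), sum(0) {}
    double add(int x) {
        sum += x - window[pos];               // new value in, oldest value out
        window[pos] = x;
        pos = (pos + 1) % (int)window.size();
        if (count < (int)window.size()) ++count;
        return double(sum) / count;
    }
};

With MovingAverage w(3), the calls w.add(4), w.add(5), w.add(6) return 4, 4.5, and 5: the early averages cover only the values inserted so far instead of relying on made-up initial contents.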

Fast Popcount instruction or Hamming distance for binary array?

I'm implementing this in C++ on Visual Studio 2010.
I have two binary arrays. For example,
array1[100] = {1,0,1,0,0,1,1, .... }
array2[100] = {0,0,1,1,1,0,1, .... }
To calculate the Hamming distance between array1 and array2,
array3[100] stores the xor result of array1 and array2.
Then I have to count the number of 1 bits in array3. To do this, I know I can use the __popcnt instruction.
For now, I'm doing something like below:
popcnt_result = 0;
for (i = 0; i < 100; i++) {
    popcnt_result = popcnt_result + __popcnt(array3[i]);
}
It shows a good result but is slow. How can I make it faster?
array3 seems a bit wasteful, you're accessing a whole extra 400 bytes of memory that you don't need to. I would try comparing what you have with the following:
for (int i = 0; i < 100; ++i) {
    result += (array1[i] ^ array2[i]); // could also try != in place of ^
}
If that helps at all, then I leave it as an exercise for the reader how to apply both this change and duskwuff's.
As implemented, the __popcnt call is not helping. It's actually slowing you down.
__popcnt counts the number of set bits in its argument. You're only passing in one element, which looks like it's guaranteed to be 0 or 1, so the result (also 0 or 1) is not useful. Doing this would be slightly faster:
popcnt_result += array3[i];
Depending on how your array is laid out, you may or may not be able to use __popcnt in a cleverer way. Specifically, if your array consists of one-byte elements (e.g, char, bool, int8_t, or similar), you could perform a population count on four elements at a time:
for (i = 0; i < 100; i += 4) {
    uint32_t *p = (uint32_t *) &array3[i];
    popcnt_result += __popcnt(*p);
}
(Note that this depends on the fact that 100 is divisible evenly by 4. You'd have to add some special-case handling for the last few elements otherwise.)
If the array consists of larger values, such as int, though, you're out of luck, and there's still no guarantee that this will be any faster than the naïve implementation above.
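For what it's worth, a sketch combining both answers: skip array3 entirely, XOR four byte-sized elements at a time, and popcount the packed word. It assumes one-byte elements and a length divisible by 4; the helper name is mine:

#include <cstdint>
#include <cstring>
#include <intrin.h> // __popcnt (MSVC)

// Hamming distance of two arrays of 0/1 bytes, four elements per __popcnt.
unsigned hamming(const unsigned char *a, const unsigned char *b, int len)
{
    unsigned dist = 0;
    for (int i = 0; i < len; i += 4) {
        uint32_t wa, wb;
        std::memcpy(&wa, a + i, 4); // memcpy sidesteps alignment and aliasing issues
        std::memcpy(&wb, b + i, 4);
        dist += __popcnt(wa ^ wb);  // each differing element contributes exactly one set bit
    }
    return dist;
}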
If your arrays only contain two values (0 or 1) the Hamming distance is just the number of positions where corresponding values are different. This can be done in one pass using std::inner_product from the standard library.
#include <iostream>
#include <functional>
#include <numeric>

int main()
{
    int array1[100] = { 1,0,1,0,0,1,1, ... };
    int array2[100] = { 0,0,1,1,1,0,1, ... };
    int distance = std::inner_product(array1, array1 + 100, array2, 0,
                                      std::plus<int>(), std::not_equal_to<int>());
    std::cout << "distance=" << distance << '\n';
    return 0;
}