thrust vector distance calculation - c++

Consider the following dataset and centroids. There are 7 individuals and two means each with 8 dimensions. They are stored row major order.
short dim = 8;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114
};
I want to calculate each euclidean distances. c1 - d1, c1 - d2 ....
On CPU I would do:
float dist = 0.0, dist_sqrt;
for(int i = 0; i < 2; i++)
for(int j = 0; j < 7; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
std::cout << dist_sqrt << std::endl;
}
Is there any built in solution of vector distance calculation in THRUST?

It can be done in thrust. Explaining how will be rather involved, and the code is rather dense.
The key observation to start with is that the core operation can be done via a transformed reduction. The thrust transform operation is used to perform the elementwise subtraction of the vectors (individual-centroid) and squaring of each result, and the reduction sums the results together to produce the square of the euclidean distance. The starting point for this operation is thrust::reduce_by_key, but it gets rather involved to present the data correctly to reduce_by_key.
The final results are produced by taking the square root of each result from above, and we can use an ordinary thrust::transform for this.
The above is a summary description of the only 2 lines of thrust code that do all the work. However, the first line has considerable complexity to it. In order to exploit parallelism, the approach I took was to virtually "lay out" the necessary vectors in sequence, to be presented to reduce_by_key. To take a simple example, suppose we have 2 centroids and 4 individuals, and suppose our dimension is 2.
centroid 0: C00 C01
centroid 1: C10 C11
individ 0: I00 I01
individ 1: I10 I11
individ 2: I20 I21
individ 3: I30 I31
We can "lay out" the vectors like this:
C00 C01 C00 C01 C00 C01 C00 C01 C10 C11 C10 C11 C10 C11 C10 C11
I00 I01 I10 I11 I20 I21 I30 I31 I00 I01 I10 I11 I20 I21 I30 I31
To facilitate the reduce_by_key, we will also need to generate key values to delineate the vectors:
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
The above data "laid-out" data sets can be quite large, and we don't want to incur storage and retrieval cost, so we will generate these "on-the-fly" using thrust's collection of fancy iterators. This is where things get quite dense. With the above strategy in mind, we will use thrust::reduce_by_key to do the work. We'll create a custom functor provided to a transform_iterator to do the subtraction (and squaring) of the I and C vectors, which will be zipped together for this purpose. The "lay out" of the vectors will be created on the fly using permutation iterators with additional custom index-creation functors, to help with the replicated patterns in each of I and C.
Therefore, working from the "inside out", the sequence of steps is as follows:
for both I (data) and C (centr) use a counting_iterator combined with a custom indexing functor inside of a transform_iterator to produce the indexing sequences we will need.
using the indexing sequences created in step 1 and the base I and C vectors, virtually "lay out" the vectors via a permutation_iterator (one for each laid-out vector).
zip the 2 "laid out" virtual I and C vectors together, to create a <float, float> tuple vector (virtual).
take the zip_iterator from step 3, and combine with a custom distance-calculation functor ((I-C)^2) in a transform_iterator
use another transform_iterator, combining a counting_iterator with a custom key-generating functor, to produce the key sequence (virtual)
pass the iterators in steps 4 and 5 to reduce_by_keyas the inputs (keys, values) to be reduced. The output vectors for reduce_by_key are also keys and values. We don't need the keys, so we'll use a discard_iterator to dump those. The values we will save.
The above steps are all accomplished in a single line of thrust code.
Here's a code illustrating the above:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#define MAX_DATA 100000000
#define MAX_CENT 5000
#define TOL 0.001
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
unsigned verify(float *d1, float *d2, int len){
unsigned pass = 1;
for (int i = 0; i < len; i++)
if (fabsf(d1[i] - d2[i]) > TOL){
std::cout << "mismatch at: " << i << " val1: " << d1[i] << " val2: " << d2[i] << std::endl;
pass = 0;
break;}
return pass;
}
void eucl_dist_cpu(const float *centroids, const float *data, float *rdist, int num_centroids, int dim, int num_data, int print){
int out_idx = 0;
float dist, dist_sqrt;
for(int i = 0; i < num_centroids; i++)
for(int j = 0; j < num_data; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
rdist[out_idx++] = dist_sqrt;
if (print) std::cout << dist_sqrt << ", ";
}
if (print) std::cout << std::endl;
}
struct dkeygen : public thrust::unary_function<int, int>
{
int dim;
int numd;
dkeygen(const int _dim, const int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val/dim);
}
};
typedef thrust::tuple<float, float> mytuple;
struct my_dist : public thrust::unary_function<mytuple, float>
{
__host__ __device__ float operator()(const mytuple &my_tuple) const {
float temp = thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple);
return temp*temp;
}
};
struct d_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
d_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % (dim*numd));
}
};
struct c_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
c_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % dim) + (dim * (val/(dim*numd)));
}
};
struct my_sqrt : public thrust::unary_function<float, float>
{
__host__ __device__ float operator()(const float val) const {
return sqrtf(val);
}
};
unsigned long long eucl_dist_thrust(thrust::host_vector<float> &centroids, thrust::host_vector<float> &data, thrust::host_vector<float> &dist, int num_centroids, int dim, int num_data, int print){
thrust::device_vector<float> d_data = data;
thrust::device_vector<float> d_centr = centroids;
thrust::device_vector<float> values_out(num_centroids*num_data);
unsigned long long compute_time = dtime_usec(0);
thrust::reduce_by_key(thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), dkeygen(dim, num_data)), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(dim*num_data*num_centroids), dkeygen(dim, num_data)),thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_centr.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), c_idx(dim, num_data))), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), d_idx(dim, num_data))))), my_dist()), thrust::make_discard_iterator(), values_out.begin());
thrust::transform(values_out.begin(), values_out.end(), values_out.begin(), my_sqrt());
cudaDeviceSynchronize();
compute_time = dtime_usec(compute_time);
if (print){
thrust::copy(values_out.begin(), values_out.end(), std::ostream_iterator<float>(std::cout, ", "));
std::cout << std::endl;
}
thrust::copy(values_out.begin(), values_out.end(), dist.begin());
return compute_time;
}
int main(int argc, char *argv[]){
int dim = 8;
int num_centroids = 2;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
int num_data = 8;
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114,
0.721, 0.555, 0.979, 0.412, 0.007, 0.501, 0.844, 0.234
};
std::cout << "cpu results: " << std::endl;
float dist[num_data*num_centroids];
eucl_dist_cpu(centroids, data, dist, num_centroids, dim, num_data, 1);
thrust::host_vector<float> h_data(data, data + (sizeof(data)/sizeof(float)));
thrust::host_vector<float> h_centr(centroids, centroids + (sizeof(centroids)/sizeof(float)));
thrust::host_vector<float> h_dist(num_centroids*num_data);
std::cout << "gpu results: " << std::endl;
eucl_dist_thrust(h_centr, h_data, h_dist, num_centroids, dim, num_data, 1);
float *data2, *centroids2, *dist2;
num_centroids = 10;
num_data = 1000000;
if (argc > 2) {
num_centroids = atoi(argv[1]);
num_data = atoi(argv[2]);
if ((num_centroids < 1) || (num_centroids > MAX_CENT)) {std::cout << "Num centroids out of range" << std::endl; return 1;}
if ((num_data < 1) || (num_data > MAX_DATA)) {std::cout << "Num data out of range" << std::endl; return 1;}
if (num_data * dim * num_centroids > 2000000000) {std::cout << "data set out of range" << std::endl; return 1;}}
std::cout << "Num Data: " << num_data << std::endl;
std::cout << "Num Cent: " << num_centroids << std::endl;
std::cout << "result size: " << ((num_data*num_centroids*4)/1048576) << " Mbytes" << std::endl;
data2 = new float[dim*num_data];
centroids2 = new float[dim*num_centroids];
dist2 = new float[num_data*num_centroids];
for (int i = 0; i < dim*num_data; i++) data2[i] = rand()/(float)RAND_MAX;
for (int i = 0; i < dim*num_centroids; i++) centroids2[i] = rand()/(float)RAND_MAX;
unsigned long long dtime = dtime_usec(0);
eucl_dist_cpu(centroids2, data2, dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "cpu time: " << dtime/(float)USECPSEC << "s" << std::endl;
thrust::host_vector<float> h_data2(data2, data2 + (dim*num_data));
thrust::host_vector<float> h_centr2(centroids2, centroids2 + (dim*num_centroids));
thrust::host_vector<float> h_dist2(num_data*num_centroids);
dtime = dtime_usec(0);
unsigned long long ctime = eucl_dist_thrust(h_centr2, h_data2, h_dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "gpu total time: " << dtime/(float)USECPSEC << "s, gpu compute time: " << ctime/(float)USECPSEC << "s" << std::endl;
if (!verify(dist2, &(h_dist2[0]), num_data*num_centroids)) {std::cout << "Verification failure." << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
Notes:
The code is set up to do 2 passes, a short one using a data set similar to yours, with printout for visual check. Then a larger data set can be entered, via command-line sizing parameters (number of centroids, then number of individuals), for benchmark comparison and validation of results.
Contrary to what I stated in the comments, the thrust code is only running about 25% faster than the naive single-threaded CPU code. Your mileage may vary.
This is just one way to think about handling it. I have had other ideas, but not enough time to flesh them out.
The data sets can become rather large. The code right now is intended to be limited to data sets where the product of dimension*number_of_centroids*number_of_individuals is less than 2 billion. However, as you approach even this number, you will need a GPU and CPU that both have a few GB of memory. I briefly explored larger data set sizes. A few code changes would be needed in various places to extend from e.g. int to unsigned long long, etc. However I haven't provided that as I am still investigating an issue with that code.
For another, non-thrust-related look at computing euclidean distances on the GPU, you may be interested in this question. If you follow the sequence of optimizations that were made there, it may shed some light on either how this thrust code might be improved, or else how another non-thrust realization could be used.
Sorry I wasn't able to squeeze more performance out.

Related

Eigen: Slow access to columns of Matrix 4

I am using Eigen for operations similar to Cholesky update, implying a lot of AXPY (sum plus multiplication by a scalar) on the columns of a fixed size matrix, typically a Matrix4d. In brief, it is 3 times more expensive to access to the columns of a Matrix 4 than to a Vector 4.
Typically, the code below:
for(int i=0;i<4;++i ) L.col(0) += x*y[i];
is 3 times less efficient than the code below:
for(int i=0;i<4;++i ) l4 += x*y[i];
where L is typically a matrix of size 4, x, y and l4 are vectors of size 4.
Moreover, the time spent in the first line of code is not depending on the matrix storage organization (either RowMajor of ColMajor).
On a Intel i7 (2.5GHz), it takes about 0.007us for vector operations, and 0.02us for matrix operations (timings are done by repeating 100000 times the same operation). My application would need thousands of such operation in timings hopefully far below the millisecond.
Question: I am doing something improperly when accessing columns of my 4x4 matrix? Is there something to do to make the first line of code more efficient?
Full code used for timings is below:
#include <iostream>
#include <Eigen/Core>
#include <vector>
#include <sys/time.h>
typedef Eigen::Matrix<double,4,1,Eigen::ColMajor> Vector4;
//typedef Eigen::Matrix<double,4,4,Eigen::RowMajor,4,4> Matrix4;
typedef Eigen::Matrix<double,4,4,Eigen::ColMajor,4,4> Matrix4;
inline double operator- ( const struct timeval & t1,const struct timeval & t0)
{
/* TODO: double check the double conversion from long (on 64x). */
return double(t1.tv_sec - t0.tv_sec)+1e-6*double(t1.tv_usec - t0.tv_usec);
}
void sumCols( Matrix4 & L,
Vector4 & x4,
Vector4 & y)
{
for(int i=0;i<4;++i )
{
L.col(0) += x4*y[i];
}
}
void sumVec( Vector4 & L,
Vector4 & x4,
Vector4 & y)
{
for(int i=0;i<4;++i )
{
//L.tail(4-i) += x4.tail(4-i)*y[i];
L += x4 *y[i];
}
}
int main()
{
using namespace Eigen;
const int NBT = 1000000;
struct timeval t0,t1;
std::vector< Vector4> x4s(NBT);
std::vector< Vector4> y4s(NBT);
std::vector< Vector4> z4s(NBT);
std::vector< Matrix4> L4s(NBT);
for(int i=0;i<NBT;++i)
{
x4s[i] = Vector4::Random();
y4s[i] = Vector4::Random();
L4s[i] = Matrix4::Random();
}
int sample = int(z4s[55][2]/10*NBT);
std::cout << "*** SAMPLE = " << sample << std::endl;
gettimeofday(&t0,NULL);
for(int i=0;i<NBT;++i)
{
sumCols(L4s[i], x4s[i], y4s[i]);
}
gettimeofday(&t1,NULL);
std::cout << (t1-t0) << std::endl;
std::cout << "\t\t\t\t\t\t\tForce check" << L4s[sample](1,0) << std::endl;
gettimeofday(&t0,NULL);
for(int i=0;i<NBT;++i)
{
sumVec(z4s[i], x4s[i], y4s[i]);
}
gettimeofday(&t1,NULL);
std::cout << (t1-t0) << std::endl;
std::cout << "\t\t\t\t\t\t\tForce check" << z4s[sample][2] << std::endl;
return -1;
}
As I said in a comment, the generated assembly is exactly the same for both functions.
The problem is that your benchmark is biased in the sense that L4s is 4 times bigger than z4s, and you thus get more cache misses in the matrix case than in the vector case.

Thrust CUDA find maximum per each group(segment)

My data like
value = [1, 2, 3, 4, 5, 6]
key = [0, 1, 0, 2, 1, 2]
I need to now maximum(value and index) per each group(key).
So the result should be
max = [3, 5, 6]
index = [2, 4, 5]
key = [0, 1, 2]
How can I get it with cuda thrust?
I can do sort -> reduce_by_key but it's not really efficient. In my case vector size > 10M and key space ~ 1K(starts from 0 without gaps).
Since the original question focused on thrust, I didn't have any suggestions other than what I mentioned in the comments,
However, based on further dialog in the comments, I thought I would post an answer that covers both CUDA and thrust.
The thrust method uses a sort_by_key operation to group like keys together, followed by a reduce_by_key operation to find the max + index for each key-group.
The CUDA method uses a custom atomic approach I describe here to find a 32-bit max plus 32-bit index (for each key-group).
The CUDA method is substantially (~10x) faster, for this specific test case. I used a vector size of 10M and a key size of 10K for this test.
My test platform was CUDA 8RC, RHEL 7, and Tesla K20X GPU. K20X is a member of the Kepler generation which has much faster global atomics than previous GPU generations.
Here's a fully worked example, covering both cases, and providing a timing comparison:
$ cat t1234.cu
#include <iostream>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>
#include <cstdlib>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
const size_t ksize = 10000;
const size_t vsize = 10000000;
const int nTPB = 256;
struct my_max_func
{
template <typename T1, typename T2>
__host__ __device__
T1 operator()(const T1 t1, const T2 t2){
T1 res;
if (thrust::get<0>(t1) > thrust::get<0>(t2)){
thrust::get<0>(res) = thrust::get<0>(t1);
thrust::get<1>(res) = thrust::get<1>(t1);}
else {
thrust::get<0>(res) = thrust::get<0>(t2);
thrust::get<1>(res) = thrust::get<1>(t2);}
return res;
}
};
typedef union {
float floats[2]; // floats[0] = maxvalue
int ints[2]; // ints[1] = maxindex
unsigned long long int ulong; // for atomic update
} my_atomics;
__device__ unsigned long long int my_atomicMax(unsigned long long int* address, float val1, int val2)
{
my_atomics loc, loctest;
loc.floats[0] = val1;
loc.ints[1] = val2;
loctest.ulong = *address;
while (loctest.floats[0] < val1)
loctest.ulong = atomicCAS(address, loctest.ulong, loc.ulong);
return loctest.ulong;
}
__global__ void my_max_idx(const float *data, const int *keys,const int ds, my_atomics *res)
{
int idx = (blockDim.x * blockIdx.x) + threadIdx.x;
if (idx < ds)
my_atomicMax(&(res[keys[idx]].ulong), data[idx],idx);
}
int main(){
float *h_vals = new float[vsize];
int *h_keys = new int[vsize];
for (int i = 0; i < vsize; i++) {h_vals[i] = rand(); h_keys[i] = rand()%ksize;}
// thrust method
thrust::device_vector<float> d_vals(h_vals, h_vals+vsize);
thrust::device_vector<int> d_keys(h_keys, h_keys+vsize);
thrust::device_vector<int> d_keys_out(ksize);
thrust::device_vector<float> d_vals_out(ksize);
thrust::device_vector<int> d_idxs(vsize);
thrust::device_vector<int> d_idxs_out(ksize);
thrust::sequence(d_idxs.begin(), d_idxs.end());
cudaDeviceSynchronize();
unsigned long long et = dtime_usec(0);
thrust::sort_by_key(d_keys.begin(), d_keys.end(), thrust::make_zip_iterator(thrust::make_tuple(d_vals.begin(), d_idxs.begin())));
thrust::reduce_by_key(d_keys.begin(), d_keys.end(), thrust::make_zip_iterator(thrust::make_tuple(d_vals.begin(),d_idxs.begin())), d_keys_out.begin(), thrust::make_zip_iterator(thrust::make_tuple(d_vals_out.begin(), d_idxs_out.begin())), thrust::equal_to<int>(), my_max_func());
cudaDeviceSynchronize();
et = dtime_usec(et);
std::cout << "Thrust time: " << et/(float)USECPSEC << "s" << std::endl;
// cuda method
float *vals;
int *keys;
my_atomics *results;
cudaMalloc(&keys, vsize*sizeof(int));
cudaMalloc(&vals, vsize*sizeof(float));
cudaMalloc(&results, ksize*sizeof(my_atomics));
cudaMemset(results, 0, ksize*sizeof(my_atomics)); // works because vals are all positive
cudaMemcpy(keys, h_keys, vsize*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(vals, h_vals, vsize*sizeof(float), cudaMemcpyHostToDevice);
et = dtime_usec(0);
my_max_idx<<<(vsize+nTPB-1)/nTPB, nTPB>>>(vals, keys, vsize, results);
cudaDeviceSynchronize();
et = dtime_usec(et);
std::cout << "CUDA time: " << et/(float)USECPSEC << "s" << std::endl;
// verification
my_atomics *h_results = new my_atomics[ksize];
cudaMemcpy(h_results, results, ksize*sizeof(my_atomics), cudaMemcpyDeviceToHost);
for (int i = 0; i < ksize; i++){
if (h_results[i].floats[0] != d_vals_out[i]) {std::cout << "value mismatch at index: " << i << " thrust: " << d_vals_out[i] << " CUDA: " << h_results[i].floats[0] << std::endl; return -1;}
if (h_results[i].ints[1] != d_idxs_out[i]) {std::cout << "index mismatch at index: " << i << " thrust: " << d_idxs_out[i] << " CUDA: " << h_results[i].ints[1] << std::endl; return -1;}
}
std::cout << "Success!" << std::endl;
return 0;
}
$ nvcc -arch=sm_35 -o t1234 t1234.cu
$ ./t1234
Thrust time: 0.026593s
CUDA time: 0.002451s
Success!
$

omp parallel for no optimization achieved for quadratic sieve

I am trying to implement parallel quadratic sieve using open mp. In sieving phase, I am using log approximations to check the divisibility. This is my code.
#pragma omp parallel for schedule (dynamic) num_threads(4)
for (int i = 0; i < factorBase.size(); ++i) {
const uint32_t p = factorBase[i];
const float logp = std::log(factorBase[i]) / std::log(2);
// Sieve first sequence.
while (startIndex.first[i] < intervalEnd) {
logApprox[startIndex.first[i] - intervalStart] -= logp;
startIndex.first[i] += p;
}
if (p == 2)
continue; // a^2 = N (mod 2) only has one root.
// Sieve second sequence.
while (startIndex.second[i] < intervalEnd) {
logApprox[startIndex.second[i] - intervalStart] -= logp;
startIndex.second[i] += p;
}
}
Here factorbase and logApprox are std::vectors initialized as follows
std::vector<float> logApprox(INTERVAL_LENGTH, 0);
std::vector<uint32_t> factorBase;
Whenever, I run this code and compare the running time, there is no much difference between sequential and parallel run. What are some optimizations that can be done? I am a beginner in openmp and any help is appreciated.Thanks
Very interesting task you have! Thanks!
Decided to make my own implementation with very many optimizations.
I achieved 20.4x times boost compared to your original code (your code gives 17.86 seconds, my gives 0.87 seconds). Also I used 2x times less memory for sieving compared to your algorithm, while achieving same goal.
To make comparison I simplified your code in such a way that it still does almost same thing and runs exactly same time, but looks much more simple:
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
auto const p = factorBase[i];
float const logp = std::log(p) / std::log(2);
while (startIndex[i] < logApprox.size()) {
logApprox[startIndex[i]] += logp;
startIndex[i] += p;
}
}
You can see that I leaved only single sieve loop, second one does same thing and not necessary for demonstration, so I removed it. Also I removed startInterval as it is irrelevant to speed demonstration. And for simplicity I did += of logarithm instead of yours -=.
One important notice regarding your algorithm is that it doesn't do any synchronization, it means that different cores of CPU may write to same entry of logApprox array hence give wrong result.
And as I have measured this wrong result happens once or twice per hundred million entries of logApprox array. My optimized code overcame this limitation and did correct synchronization besides doing all speed optimizations.
I did following improvements to gain 20x times speedup:
I split whole array into blocks, approximately 2^13 elements in size. Each group of blocks is processed by separate thread/CPU-core hence no synchronization of threads is needed. Besides avoiding synchronization what is very important is that 2^13 block fits fully into L1 or L2 cache of CPU, hence speeds up things a lot.
Each block of 2^13 is processed for all possible primes. To keep track of which offsets of what primes are needed I created a special ring buffer of 2^7 size, this ring buffer is indexed with block number modulo 2^7 and keeps track which primes with which offsets are needed for each block (modulo 2^7).
I have as many threads as there are CPU cores. For each thread I precompute starting offsets of all primes for this thread, these starting offsets are computed through modular arithmetics based on startIndex array that you provided in your original code.
To speedup even more instead of float logarithm I use integer logarithm, which is based on uint16_t. This integer logarithm is computed as uint16_t integer_log = uint16_t(std::log2(p) * (1 << 8) + 0.5);. Besides increasing speed of computing += for integer logarithms, they also decrease occupied memory 2x times. If for some reason uint16_t logarithm is not enough for you then please replace using ILog2T = u16; with using ILog2T = u32; in my code, but this will double amount of used memory.
My code output following to console:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
Time simple is time of your original code for sieving array of size 2^28, time optimized is my code for same array, boost is how much my code is faster (you can see it is 20x times faster). Correct ratio says if there are any errors in your code, due to absence of multi-core synchronization (as you can see sometimes it is less than 1.0 hence there are some errors).
Full optimized code below:
Try it online!
#include <cstdint>
#include <random>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <thread>
#include <type_traits>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <mutex>
#include <omp.h>
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#define OSTR(code) ([&]{ std::ostringstream ss; ss code; return ss.str(); }())
#define COUT(code) { std::unique_lock<std::mutex> lock(cout_mux); std::cout code; std::cout << std::flush; }
#define LN { COUT(<< "LN " << __LINE__ << std::endl); }
#define DUMP(var) { COUT(<< #var << " = (" << (var) << ")" << std::endl); }
using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;
using ILog2T = u16;
using PrimeT = u32;
std::mutex cout_mux;
template <typename T>
std::vector<T> GenPrimes(size_t end) {
thread_local std::vector<T> primes = {2, 3};
while (primes.back() < end) {
for (T p = primes.back() + 2;; p += 2) {
bool is_prime = true;
for (auto d: primes) {
if (u64(d) * d > p)
break;
if (p % d == 0) {
is_prime = false;
break;
}
}
if (is_prime) {
primes.push_back(p);
break;
}
}
}
primes.pop_back();
return primes;
}
void SieveA(std::vector<float> & logApprox, std::vector<PrimeT> const & factorBase, std::vector<PrimeT> startIndex) {
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
auto const p = factorBase[i];
float const logp = std::log(p) / std::log(2);
while (startIndex[i] < logApprox.size()) {
logApprox[startIndex[i]] += logp;
startIndex[i] += p;
}
}
}
size_t NThreads() {
//return 1;
return std::thread::hardware_concurrency();
}
ILog2T LogToI(double x) { return ILog2T(x * (1ULL << (sizeof(ILog2T) * 8 - 8)) + 0.5); }
double IToLog(ILog2T x) { return x / double(1ULL << (sizeof(ILog2T) * 8 - 8)); }
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
std::string FloatToStr(double x, size_t round = 6) {
return OSTR(<< std::fixed << std::setprecision(round) << x);
}
double SieveB(std::vector<ILog2T> & logs, std::vector<PrimeT> const & primes, std::vector<PrimeT> const & starts0) {
auto const nthr = NThreads();
std::vector<std::vector<PrimeT>> starts(nthr, std::vector<PrimeT>(primes.size()));
std::vector<std::vector<ILog2T>> plogs(nthr, std::vector<ILog2T>(primes.size()));
std::vector<std::pair<u64, u64>> ranges(nthr);
size_t constexpr block_log2 = 13, block = 1 << block_log2, ring_log2 = 6, ring_size = 1ULL << ring_log2, ring_mask = ring_size - 1;
std::vector<std::vector<std::vector<std::pair<u32, u32>>>> ring(nthr, std::vector<std::vector<std::pair<u32, u32>>>(ring_size));
#pragma omp parallel for
for (size_t ithr = 0; ithr < nthr; ++ithr) {
size_t const nblock = ((logs.size() + nthr - 1) / nthr + block - 1) / block * block,
begin = ithr * nblock, end = std::min<size_t>(logs.size(), (ithr + 1) * nblock);
ranges[ithr] = {begin, end};
for (size_t i = 0; i < primes.size(); ++i) {
PrimeT const p = primes[i];
size_t const mod0 = begin % p, mod = starts0[i] < mod0 ? p + starts0[i] - mod0 : starts0[i] - mod0;
starts[ithr][i] = mod;
plogs[ithr][i] = LogToI(std::log2(p));
ring[ithr][((begin + starts[ithr][i]) >> block_log2) & ring_mask].push_back({i, begin + starts[ithr][i]});
}
}
auto tim = Time();
#pragma omp parallel for
for (size_t ithr = 0; ithr < nthr; ++ithr) {
auto const [begin, end] = ranges[ithr];
auto const [bbegin, bend] = std::make_tuple(begin / block, (end - 1) / block + 1);
auto const & cstarts = starts.at(ithr);
auto const & cplogs = plogs.at(ithr);
auto & cring = ring[ithr];
std::decay_t<decltype(cring[0])> tmp;
size_t hit_cnt = 0, miss_cnt = 0;
for (size_t iblock = bbegin; iblock < bend; ++iblock) {
size_t const cbegin = iblock << block_log2, cend = std::min<size_t>(end, (iblock + 1) << block_log2);
auto & ring_cur = cring[iblock & ring_mask];
tmp = ring_cur;
ring_cur.clear();
for (auto [ip, off]: tmp)
if (off >= cend) {
//++miss_cnt;
ring_cur.push_back({ip, off});
} else {
//++hit_cnt;
auto const p = primes[ip];
auto const plog = cplogs[ip];
for (; off < cend; off += p) {
//if (8192 - 10 <= off && off <= 8192 + 10) COUT(<< "logs.size() " << logs.size() << " begin " << begin << " end " << end << " bbegin " << bbegin << " bend " << bend << " cbegin " << cbegin << " cend " << cend << " iblock " << iblock << " off " << off << " p " << p << " plog " << plog << std::endl);
logs[off] += plog;
}
if (off < end)
cring[(off >> block_log2) & ring_mask].push_back({ip, off});
}
}
//COUT(<< "hit_ratio " << std::fixed << std::setprecision(6) << double(hit_cnt) / (hit_cnt + miss_cnt) << std::endl);
}
return Time() - tim;
}
void Test() {
size_t constexpr len = 1ULL << 28;
std::mt19937_64 rng{123};
auto const primes = GenPrimes<PrimeT>(1 << 12);
std::vector<PrimeT> starts;
for (auto p: primes)
starts.push_back(rng() % p);
ASSERT(primes.size() == starts.size());
double tA = 0, tB = 0;
std::vector<float> logsA(len);
std::vector<ILog2T> logsB(len);
{
tA = Time();
SieveA(logsA, primes, starts);
tA = Time() - tA;
}
{
tB = SieveB(logsB, primes, starts);
}
size_t correct = 0;
for (size_t i = 0; i < len; ++i) {
//ASSERT_MSG(std::abs(logsA[i] - IToLog(logsB[i])) < 0.1, "i " + std::to_string(i) + " logA " + FloatToStr(logsA[i], 3) + " logB " + FloatToStr(IToLog(logsB[i]), 3));
if (std::abs(logsA[i] - IToLog(logsB[i])) < 0.1)
++correct;
}
std::cout << std::fixed << std::setprecision(3) << "time_simple " << tA << " sec, time_optimized " << tB << " sec, boost " << (tA / tB) << ", correct_ratio " << std::setprecision(9) << double(correct) / len << std::endl;
}
int main() {
try {
omp_set_num_threads(NThreads());
Test();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
In my opinion, you should turn the schedule to static and give it chunk-size (https://software.intel.com/en-us/articles/openmp-loop-scheduling).
A small optimization should be :
outside of the big FOR loop, declare a const and initialize it to 1/std::log(2), and then inside the FOR loop, instead of dividing by std::log(2), do a multiplication of the previous const, division is very expensive in CPU cycles.

Rounding integers routine

There is something that baffles me with integer arithmetic in tutorials. To be precise, integer division.
The seemingly preferred method is by casting the divisor into a float, then rounding the float to the nearest whole number, then cast that back into integer:
#include <cmath>
int round_divide_by_float_casting(int a, int b){
return (int) std::roundf( a / (float) b);
}
Yet this seems like scratching your left ear with your right hand. I use:
int round_divide (int a, int b){
return a / b + a % b * 2 / b;
}
It's no breakthrough, but the fact that it is not standard makes me wonder if I am missing anything?
Despite my (albeit limited) testing, I couldn't find any scenario where the two methods give me different results. Did someone run into some sort of scenario where the int → float → int casting produced more accurate results?
Arithmetic solution
If one defined what your functions should return, she would describe it as something close as "f(a, b) returns the closest integer of the result of the division of a by b in the real divisor ring."
Thus, the question can be summarized as: can we define this closest integer using only integer division. I think we can.
There is exactly two candidates as the closest integer: a / b and (a / b) + 1(1). The selection is easy, if a % b is closer to 0 as it is to b, then a / b is our result. If not, (a / b) + 1 is.
One could then write something similar to, ignoring optimization and good practices:
int divide(int a, int b)
{
const int quot = a / b;
const int rem = a % b;
int result;
if (rem < b - rem) {
result = quot;
} else {
result = quot + 1;
}
return result;
}
While this definition satisfies out needs, one could optimize it by not computing two times the division of a by b with the use of std::div():
int divide(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
The analysis of the problem we did earlier assures us of the well defined behaviour of our implementation.
(1)There is just one last thing to check: how does it behaves when a or b is negative? This is left to the reader ;).
Benchmark
#include <iostream>
#include <iomanip>
#include <string>
// solutions
#include <cmath>
#include <cstdlib>
// benchmak
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>
//
// Solutions
//
namespace
{
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem)
{
++result;
}
return result;
}
}
//
// benchmark
//
class Randomizer
{
std::mt19937 _rng_engine;
std::uniform_int_distribution<int> _distri;
public:
Randomizer() : _rng_engine(std::time(0)), _distri(std::numeric_limits<int>::min(), std::numeric_limits<int>::max())
{
}
template<class ForwardIt>
void operator()(ForwardIt begin, ForwardIt end)
{
std::generate(begin, end, std::bind(_distri, _rng_engine));
}
};
class Clock
{
std::chrono::time_point<std::chrono::steady_clock> _start;
public:
static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }
Clock() : _start(now())
{
}
template<class DurationUnit>
std::size_t end()
{
return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
}
};
//
// Entry point
//
int main()
{
Randomizer randomizer;
std::array<int, 1000> dividends; // SCALE THIS UP (1'000'000 would be great)
std::array<int, dividends.size()> divisors;
std::array<int, dividends.size()> results;
randomizer(std::begin(dividends), std::end(dividends));
randomizer(std::begin(divisors), std::end(divisors));
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_float_casting(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_float_casting(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_modulo(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_modulo(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = divide_by_quotient_comparison(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "divide_by_quotient_comparison(): " << std::setprecision(3) << unit_time << " ns\n";
}
}
Outputs:
g++ -std=c++11 -O2 -Wall -Wextra -Werror main.cpp && ./a.out
round_divide_by_float_casting(): 54.7 ns
round_divide_by_modulo(): 24 ns
divide_by_quotient_comparison(): 25.5 ns
Demo
The two arithmetic solutions' performances are not distinguishable (their benchmark converges when you scale the bench size up).
It would really depend on the processor, and the range of the integer which is better (and using double would resolve most of the range issues)
For modern "big" CPUs like x86-64 and ARM, integer division and floating point division are roughly the same effort, and converting an integer to a float or vice versa is not a "hard" task (and does the correct rounding directly in that conversion, at least), so most likely the resulting operations are.
atmp = (float) a;
btmp = (float) b;
resfloat = divide atmp/btmp;
return = to_int_with_rounding(resfloat)
About four machine instructions.
On the other hand, your code uses two divides, one modulo and a multiply, which is quite likely longer on such a processor.
tmp = a/b;
tmp1 = a % b;
tmp2 = tmp1 * 2;
tmp3 = tmp2 / b;
tmp4 = tmp + tmp3;
So five instructions, and three of those are "divide" (unless the compiler is clever enough to reuse a / b for a % b - but it's still two distinct divides).
Of course, if you are outside the range of number of digits that a float or double can hold without losing digits (23 bits for float, 53 bits for double), then your method MAY be better (assuming there is no overflow in the integer math).
On top of all that, since the first form is used by "everyone", it's the one that the compiler recognises and can optimise.
Obviously, the results depend on both the compiler being used and the processor it runs on, but these are my results from running the code posted above, compiled through clang++ (v3.9-pre-release, pretty close to released 3.8).
round_divide_by_float_casting(): 32.5 ns
round_divide_by_modulo(): 113 ns
divide_by_quotient_comparison(): 80.4 ns
However, the interesting thing I find when I look at the generated code:
xorps %xmm0, %xmm0
cvtsi2ssl 8016(%rsp,%rbp), %xmm0
xorps %xmm1, %xmm1
cvtsi2ssl 4016(%rsp,%rbp), %xmm1
divss %xmm1, %xmm0
callq roundf
cvttss2si %xmm0, %eax
movl %eax, 16(%rsp,%rbp)
addq $4, %rbp
cmpq $4000, %rbp # imm = 0xFA0
jne .LBB0_7
is that the round is actually a call. Which really surprises me, but explains why on some machines (particularly more recent x86 processors), it is faster.
g++ gives better results with -ffast-math, which gives around:
round_divide_by_float_casting(): 17.6 ns
round_divide_by_modulo(): 43.1 ns
divide_by_quotient_comparison(): 18.5 ns
(This is with increased count to 100k values)
Prefer the standard solution. Use std::div family of functions declared in cstdlib.
See: http://en.cppreference.com/w/cpp/numeric/math/div
Casting to float and then to int may be very inefficient on some architectures, for example, microcontrollers.
Thanks for the suggestions so far. To shed some light I made a test setup to compare performance.
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem <= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 1000;
//while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
divide_by_quotient_comparison(i, j);
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_float_casting(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_modulo(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
//}
return 0;
}
The results I got on my machine (i7 with Visual Studio 2015) was as follows: the modulo arithmetic was about twice as fast as the int → float → int casting method. The method relying on std::div_t (suggested by #YSC and #teroi) was faster than the int → float → int, but slower than the modulo arithmetic method.
A second test was performed to avoid certain compiler optimizations pointed out by #YSC:
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include <vector>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem <= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 100;
vector <int> randi, randj;
for (int i = 0; i < itr; i++) {
randi.push_back(rand());
int rj = rand();
if (rj == 0)
rj++;
randj.push_back(rj);
}
vector<int> f, m, q;
while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
q.push_back( divide_by_quotient_comparison(randi[i] , randj[j]) );
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
f.push_back( round_divide_by_float_casting(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
m.push_back( round_divide_by_modulo(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
cout << endl;
f.clear();
m.clear();
q.clear();
}
return 0;
}
In this second test the slowest was the divide_by_quotient() reliant on std::div_t, followed by divide_by_float(), and the fastest again was the divide_by_modulo(). However this time the performance difference was much, much lower, less than 20%.

CUDA histogram reduce_by_key failing

I have the following CUDA Thrust code that uses reduce_by_key to histogram the values [0, 1024) into 256 buckets. I expect each bucket to have a count = 4, yet I see bucket 0 has 256, bucket 255 has 3, and the remainder have 4.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/pair.h>
#define SIZE 1024
struct binFunc {
const float minVal;
const float valRange;
const int numBins;
binFunc(float _minVal, float _valRange, int _numBins) :
minVal(_minVal), valRange(_valRange), numBins(_numBins) {}
__host__ __device__
int operator()(float v) const {
int b = int((v - minVal) / valRange * float(numBins));
return b;
}
};
int main() {
thrust::device_vector<float> d_vec(SIZE);
for (int i = 0; i < SIZE; ++i)
d_vec[i] = float(i);
thrust::device_vector<float>::iterator min;
thrust::device_vector<float>::iterator max;
thrust::pair<thrust::device_vector<float>::iterator,
thrust::device_vector<float>::iterator> minmax =
thrust::minmax_element(d_vec.begin(), d_vec.end());
min = minmax.first;
max = minmax.second;
float minVal = *min;
float maxVal = *max;
std::cout << "The minimum value is " << minVal
<< " and the maximum value is " << maxVal << "." << std::endl;
float valRange = maxVal - minVal;
std::cout << "The range is " << valRange << "." << std::endl;
int numBins = 256;
thrust::device_vector<int> d_binResults(SIZE);
thrust::transform(d_vec.begin(), d_vec.end(), d_binResults.begin(),
binFunc(minVal, valRange, numBins));
thrust::device_vector<int>::iterator d_binResults_iter =
d_binResults.begin();
for (int i = 0; i < 10; ++i) {
int b = *d_binResults_iter;
printf("d_binResults[%d]=%d\n", i, b);
d_binResults_iter++;
}
std::cout << "The numBins is " << numBins << "." << std::endl;
thrust::device_vector<int> d_binsKeys(numBins);
thrust::device_vector<int> d_binsValues(numBins);
thrust::pair<thrust::device_vector<int>::iterator,
thrust::device_vector<int>::iterator> keys_and_values =
thrust::reduce_by_key(d_binResults.begin(), d_binResults.end(),
thrust::constant_iterator<int>(1), d_binsKeys.begin(),
d_binsValues.begin());
thrust::device_vector<int>::iterator d_binsKeys_begin_iter =
d_binsKeys.begin();
thrust::device_vector<int>::iterator d_binsValues_begin_iter =
d_binsValues.begin();
for (int i = 0; i < numBins; ++i) {
int key = *d_binsKeys_begin_iter;
int val = *d_binsValues_begin_iter;
printf("d_binsValues[%d]=(%d,%d)\n", i, key, val);
d_binsKeys_begin_iter++;
d_binsValues_begin_iter++;
}
return 0;
}
The salient part of the output is:
d_binsValues[0]=(0,256)
d_binsValues[1]=(1,4)
d_binsValues[2]=(2,4)
...
d_binsValues[254]=(254,4)
d_binsValues[255]=(255,3)
So, bucket 0 has 256 elements, and bucket 255 has 3 elements? What's going on here?
If you print out all the d_binResults[] values instead of the first 10, you will discover that the last element (d_binResults[1023]) has a value of 256! But that is an invalid bin index. For numBins = 256, the valid indices are 0..255.
It is occurring due to the calculation arithmetic in your functor:
int b = int((v - minVal) / valRange * float(numBins));
Plugging in the relevant values for the last element, we have:
(1023 - 0)/1023*256 = 256
But 256 is an invalid bin index. It turns out that this breaks the reduce_by_key operation, causing both the last bin to have 3 elements and the first bin to be "corrupted".
If you fix this you will fix both issues you describe (first bin has 256 elements, last bin has 3.)
As a simple proof, add this line of code:
d_binResults[1023] = 255;
immediately after your thrust::transform operation. The results are then correct. How you choose to correct your bin calculation arithmetic is up to you. (possibly "fixable" by adding 1 to valRange but that may imply something about your expected histogram values).