Strided vs shuffling reduction - c++

I recently watched the CppCon talk about using Clang to compile CUDA code, where the speaker, after talking a bit about the architecture, implements a sum reduction. I was interested in his approach, which does the reduction by a shfl of the elements in the block, so with no working example available I took his code, modified it a little bit, and got a max-reduction.
The thing is that this max reduction is very slow: compared to a CPU implementation finding the max of 2^22 elements, I get times of about ~90 ms against ~20 ms. Here is the code for the shfl reduction.
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_functions.h>
#include <cuda_runtime_api.h>
using namespace std;
// Global reduce test
__global__ void d_max_reduce(const int *in, int *out, size_t N) {
    int sum = 0;
    size_t start = (threadIdx.x + blockIdx.x * blockDim.x) * 4;
    for (size_t i = start; i < start + 4 && i < N; i++) {
        sum = max(__ldg(in + i), sum);
    }
    for (int i = 16; i; i >>= 1) {
        sum = max(__shfl_down(sum, i), sum);
    }
    __shared__ int shared_max;
    shared_max = 0;
    __syncthreads();
    if (!(threadIdx.x % 32)) {
        atomicMax(&shared_max, sum);
    }
    __syncthreads();
    if (!threadIdx.x) {
        atomicMax(out, shared_max);
    }
}
int test_max_reduce(std::vector<int> &v) {
    int *in, *out;
    cudaMalloc(&in, v.size() * sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemcpy(in, v.data(), v.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(out, 0, sizeof(int));
    int threads = 256;
    d_max_reduce<<<ceil((float)v.size() / (threads * 4)), threads>>>(in, out, v.size());
    int res;
    cudaMemcpy(&res, out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    return res;
}
So I used one of Nvidia's examples of a strided reduction (which is also a sum), changed it to a max, and got times of about 7 ms. Here is the code for the strided reduction.
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_functions.h>
#include <cuda_runtime_api.h>
__global__ void d_max_reduction(const int *in, int *out, size_t N) {
    extern __shared__ int s_data[];
    size_t tid = threadIdx.x;
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        s_data[tid] = in[i];
    else
        s_data[tid] = 0;
    __syncthreads();
    for (size_t s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            s_data[tid] = max(s_data[tid], s_data[tid + s]);
        __syncthreads();
    }
    if (!tid)
        atomicMax(out, s_data[0]);
}
int test_max_reduction(std::vector<int> &v) {
    int *in;
    int *out;
    cudaMalloc(&in, v.size() * sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemcpy(in, v.data(), v.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(out, 0, sizeof(int));
    int threads = 128;
    d_max_reduction<<<ceil((float)v.size() / threads),
                      threads,
                      threads * sizeof(int)>>>(in, out, v.size());
    int res;
    cudaMemcpy(&res, out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    return res;
}
And, just in case, here is the rest so that there is an MWE.
#include <random>
#include <timer.hpp>
int test_max_reduce(std::vector<int> &v);
int test_max_reduction(std::vector<int> &v);
int main() {
    int N = 2000 * 2000; // * 2000;
    std::vector<int> vec(N);
    std::random_device dev;
    std::mt19937 mt(dev());
    std::uniform_int_distribution<int> dist(0, N << 2);
    for (size_t i = 0; i < vec.size(); i++) {
        vec[i] = dist(mt);
    }
    measure("GPU (shfl)", test_max_reduce, vec);
    measure("GPU strided", test_max_reduction, vec);
    measure("CPU",
            [](std::vector<int> &vec) -> int {
                int maximum = 0;
                for (size_t i = 0; i < vec.size(); i++) {
                    maximum = std::max(maximum, vec[i]);
                }
                return maximum;
            },
            vec);
    return 0;
}
And timer.hpp is
#ifndef TIMER_HPP
#define TIMER_HPP
#include <chrono>
#include <string>
#include <iostream>
template <typename F, typename ...Args>
void measure(std::string msg, F func, Args&&... args) {
    auto start = std::chrono::steady_clock::now();
    int val = func(std::forward<Args>(args)...);
    auto end = std::chrono::steady_clock::now();
    std::cout << msg << " Test " << std::endl;
    std::cout << " Max Value : " << val << std::endl;
    std::cout << " Time : ";
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>
                     (end - start).count() << std::endl;
}
#endif // TIMER_HPP
I generally get the following times
GPU (shfl) Test
Max Value : 15999999
Time : 86
GPU strided Test
Max Value : 15999999
Time : 7
CPU Test
Max Value : 15999999
Time : 23
EDIT: new timings after warmup
GPU (shfl) Test
Max Value : 16000000
Time : 4
GPU strided Test
Max Value : 16000000
Time : 6
CPU Test
Max Value : 16000000
Time : 23
So my more general question is: why is the shfl version slower than the strided one? This can be divided into:
Am I missing something in the launch parameters, or doing/assuming something wrong?
And when is it recommended to use the shfl intrinsics over a strided loop, and vice versa?
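For reference, here is a minimal sketch (not part of the code above; variable names are illustrative) of timing just the kernel with CUDA events after a throw-away warmup launch. The 86 ms vs 4 ms gap in the first table is mostly one-time context/initialization cost, which such a warmup absorbs; measure() above times the whole test function including allocation and copies, whereas this isolates the kernel.
// Hedged sketch: assumes d_in, d_out and N are already set up as in test_max_reduce above.
int threads = 256;
int blocks = (int)ceil((float)N / (threads * 4));
d_max_reduce<<<blocks, threads>>>(d_in, d_out, N);   // warmup launch, not timed
cudaDeviceSynchronize();
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
d_max_reduce<<<blocks, threads>>>(d_in, d_out, N);   // the launch actually measured
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);              // kernel time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);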

Related

CUDA thrust sort much slower when called from inside kernel [duplicate]

#include <iostream>
#include <math.h>
#include <vector>
#include <assert.h>
#include <fstream>
#include <map>
#include <algorithm>
#include <sstream>
#include <chrono>   // for MyTimer
#include <cstring>  // for memcpy
#include <cstdlib>  // for rand
#include <cuda_runtime_api.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cub/cub.cuh>
using namespace std;
typedef float real;
int MAX_N = 10000000;
int N;
real* a, *b;
real* d_a;
real* h_res1, *h_res2;
volatile real v_res = 0;
class MyTimer {
std::chrono::time_point<std::chrono::system_clock> start;
public:
void startCounter() {
start = std::chrono::system_clock::now();
}
int64_t getCounterNs() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count();
}
int64_t getCounterMs() {
return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();
}
double getCounterMsPrecise() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count()
/ 1000000.0;
}
};
void genData()
{
N = 100000;
for (int i = 0; i < N; i++) a[i] = float(rand() % 1000) / (rand() % 1000 + 1);
}
void __attribute__((noinline)) testCpu(real* arr, real* res, int N)
{
std::sort(arr, arr + N);
v_res = arr[rand() % N];
memcpy(res, arr, N * sizeof(real));
}
__global__
void sort_kernel(float* a, int N)
{
if (blockIdx.x==0 && threadIdx.x==0)
thrust::sort(thrust::device, a, a + N);
__syncthreads();
}
void __attribute__((noinline)) testGpu(real* arr, real* res, int N)
{
MyTimer timer;
timer.startCounter();
cudaMemcpy(d_a, arr, N * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cout << "Copy H2D cost = " << timer.getCounterMsPrecise() << "\n";
timer.startCounter();
//thrust::sort(thrust::device, d_a, d_a + N);
sort_kernel<<<1,1>>>(d_a, N);
cudaDeviceSynchronize();
cout << "Thrust sort cost = " << timer.getCounterMsPrecise() << "\n";
timer.startCounter();
cudaMemcpy(res, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();
cout << "Copy D2H cost = " << timer.getCounterMsPrecise() << "\n";
v_res = res[rand() % N];
}
void __attribute__((noinline)) deepCopy(real* a, real* b, int N)
{
for (int i = 0; i < N; i++) b[i] = a[i];
}
void testOne(int t, bool record = true)
{
MyTimer timer;
genData();
deepCopy(a, b, N);
timer.startCounter();
testCpu(a, h_res1, N);
cout << "CPU cost = " << timer.getCounterMsPrecise() << "\n";
timer.startCounter();
testGpu(b, h_res2, N);
cout << "GPU cost = " << timer.getCounterMsPrecise() << "\n";
for (int i = 0; i < N; i++) {
if (h_res1[i] != h_res2[i]) {
cout << "ERROR " << i << " " << h_res1[i] << " " << h_res2[i] << "\n";
exit(1);
}
}
cout << "-----------------\n";
}
int main()
{
a = new real[MAX_N];
b = new real[MAX_N];
cudaMalloc(&d_a, MAX_N * sizeof(float));
cudaMallocHost(&h_res1, MAX_N * sizeof(float));
cudaMallocHost(&h_res2, MAX_N * sizeof(float));
testOne(0, 0);
for (int i = 1; i <= 50; i++) testOne(i);
}
For legacy code reasons, I have to perform the sort entirely inside a kernel. Basically, I need:
__global__ void mainKernel(float** a, int N, float* global_pad)
{
int x;
...
cooperative_groups::grid_group g = cooperative_groups::this_grid();
sortFunc(a[x], N); // this can be a kernel. Then only 1 thread in the grid will call it
g.sync();
...
}
I tried to use thrust::sort but it's extremely slow. For example, with N = 100000, the benchmark result is:
CPU cost = 5.82228
Copy H2D cost = 0.088908
Thrust sort from CPU cost = 0.391211 (running line thrust::sort(thrust::device, d_a, d_a + N);)
Thrust sort inside kernel cost = 116 (running line sort_kernel<<<1,1>>>(d_a, N);)
Copy D2H cost = 0.067639
Why is thrust::sort so slow in this case? I want to find an implementation of sortFunc that is as fast as possible (global_pad can be used as temporary memory).
Edit: I'm using a 2080 Ti and CUDA 11.4. The compile command I use is
nvcc -o main main.cu -O3 -std=c++17
You need to turn on relocatable device code (for dynamic parallelism) in the compile command.
Use -rdc=true: nvcc -o main main.cu -O3 -std=c++17 -rdc=true.
Then the two code blocks below are equivalent
__global__
void sort_kernel(float* a, int N)
{
if (blockIdx.x==0 && threadIdx.x==0)
thrust::sort(thrust::device, a, a + N);
__syncthreads();
}
...
sort_kernel<<<1,1>>>(d_a, N);
and
thrust::sort(thrust::device, d_a, d_a + N);
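One small addition (not part of the answer above): errors from the in-kernel sort and its device-side child launches surface asynchronously, so a minimal host-side check after the launch can confirm it actually succeeded.
// Hedged sketch: check both the launch itself and anything that fails during execution.
sort_kernel<<<1, 1>>>(d_a, N);
cudaError_t err = cudaGetLastError();            // launch-configuration errors
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();               // errors from the kernel and its child launches
if (err != cudaSuccess)
    std::cerr << "sort_kernel failed: " << cudaGetErrorString(err) << "\n";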

I don't understand where I have a problem in code using SSE

I am new to SSE programming. I want to write code in which I sum up 4 consecutive numbers from vector v and write the result of this sum into the ans vector. I want to write optimized code using SSE. When I set size equal to 4 my program works, but when I set size to 8 my program doesn't work and I get this error message:
"Exception thrown: read access violation.
ans was 0x1110112.
If there is a handler for this exception, the program may be safely continued."
I don't understand where the problem is. I think I allocate the memory correctly, so where exactly is the issue? Could somebody help me? I would be really grateful.
#include <iostream>
#include <immintrin.h>
#include <pmmintrin.h>
#include <vector>
#include <math.h>
using namespace std;
using arith_t = double;
void init(arith_t *&v, size_t size) {
    for (int i = 0; i < size; ++i) {
        v[i] = i / 10.0;
    }
}
//accumulate with sse
void sub_func_sse(arith_t *v, size_t size, int start_idx, arith_t *ans, size_t start_idx_ans) {
    __m128d first_part = _mm_loadu_pd(v + start_idx);
    __m128d second_part = _mm_loadu_pd(v + start_idx + 2);
    __m128d sum = _mm_add_pd(first_part, second_part);
    sum = _mm_hadd_pd(sum, sum);
    _mm_store_pd(ans + start_idx_ans, sum);
}
int main() {
    const size_t size = 8;
    arith_t *v = new arith_t[size];
    arith_t *ans_sse = new arith_t[size / 4];
    init(v, size);
    init(ans_sse, size / 4);
    int num_repeat = 1;
    arith_t total_time_sse = 0;
    for (int p = 0; p < num_repeat; ++p) {
        for (int idx = 0, ans_idx = 0; idx < size; idx += 4, ans_idx++) {
            sub_func_sse(v, size, idx, ans_sse, ans_idx);
        }
    }
    for (size_t i = 0; i < size / 4; ++i) {
        cout << *(ans_sse + i) << endl;
    }
    delete[] ans_sse;
    delete[] v;
}
You're using unaligned memory, which requires special versions of the load and store functions. You correctly used _mm_loadu_pd, but _mm_store_pd isn't appropriate for working with unaligned memory, so you should change it to _mm_storeu_pd. Also consider using aligned memory, which will result in better performance.
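A minimal sketch of a corrected function, going one small step beyond the suggestion above: since each group of four inputs produces only one sum and ans holds size/4 elements, storing just the low lane with _mm_store_sd removes the alignment requirement and also avoids writing a second double past the end of ans on the last group (alternatively, _mm_storeu_pd works as suggested, provided ans has room for the extra element).
#include <cstddef>
#include <immintrin.h>

using arith_t = double;

// Sums v[start_idx .. start_idx+3] into ans[start_idx_ans].
void sub_func_sse(const arith_t *v, size_t /*size*/, int start_idx,
                  arith_t *ans, size_t start_idx_ans) {
    __m128d first_part  = _mm_loadu_pd(v + start_idx);      // v[start_idx], v[start_idx+1]
    __m128d second_part = _mm_loadu_pd(v + start_idx + 2);  // v[start_idx+2], v[start_idx+3]
    __m128d sum = _mm_add_pd(first_part, second_part);      // two partial sums
    sum = _mm_hadd_pd(sum, sum);                             // both lanes now hold the full sum
    _mm_store_sd(ans + start_idx_ans, sum);                  // single-double store, no alignment needed
}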

Avoid blas when involving temporary memory allocation?

I have a program that computes the matrix product x'Ay repeatedly. Is it better practice to compute this by making calls to MKL's blas, i.e. cblas_dgemv and cblas_ddot, which requires allocating memory to a temporary vector, or is better to simply take the sum of x_i * a_ij * y_j? In other words, does MKL's blas theoretically add any value?
I benchmarked this on my laptop. There was virtually no difference between the tests, other than that g++_no_blas performed about twice as poorly as the others (why?). There was also no difference between O2, O3 and Ofast.
g++_blas_static 57ms
g++_blas_dynamic 58ms
g++_no_blas 100ms
icpc_blas_static 57ms
icpc_blas_dynamic 58ms
icpc_no_blas 58ms
util.h
#ifndef UTIL_H
#define UTIL_H
#include <random>
#include <memory>
#include <iostream>
struct rng
{
rng() : unif(0.0, 1.0)
{
}
std::default_random_engine re;
std::uniform_real_distribution<double> unif;
double rand_double()
{
return unif(re);
}
std::unique_ptr<double[]> generate_square_matrix(const unsigned N)
{
std::unique_ptr<double[]> p (new double[N * N]);
for (unsigned i = 0; i < N; ++i)
{
for (unsigned j = 0; j < N; ++j)
{
p.get()[i*N + j] = rand_double();
}
}
return p;
}
std::unique_ptr<double[]> generate_vector(const unsigned N)
{
std::unique_ptr<double[]> p (new double[N]);
for (unsigned i = 0; i < N; ++i)
{
p.get()[i] = rand_double();
}
return p;
}
};
#endif // UTIL_H
main.cpp
#include <iostream>
#include <iomanip>
#include <memory>
#include <chrono>
#include "util.h"
#include "mkl.h"
double vtmv_blas(double* x, double* A, double* y, const unsigned n)
{
double temp[n];
cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, y, 1, 0.0, temp, 1);
return cblas_ddot(n, temp, 1, x, 1);
}
double vtmv_non_blas(double* x, double* A, double* y, const unsigned n)
{
double r = 0;
for (unsigned i = 0; i < n; ++i)
{
for (unsigned j = 0; j < n; ++j)
{
r += x[i] * A[i*n + j] * y[j];
}
}
return r;
}
int main()
{
std::cout << std::fixed;
std::cout << std::setprecision(2);
constexpr unsigned N = 10000;
rng r;
std::unique_ptr<double[]> A = r.generate_square_matrix(N);
std::unique_ptr<double[]> x = r.generate_vector(N);
std::unique_ptr<double[]> y = r.generate_vector(N);
auto start = std::chrono::system_clock::now();
const double prod = vtmv_blas(x.get(), A.get(), y.get(), N);
auto end = std::chrono::system_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
end - start);
std::cout << "Result: " << prod << std::endl;
std::cout << "Time (ms): " << duration.count() << std::endl;
GCC no blas is poor because it does not use vectorized SIMD instructions, while the others all do. icpc will auto-vectorize your loop.
You don't show your matrix size, but generally gemv is memory bound. As the matrix is much larger than a temporary vector, eliminating the temporary may not increase the performance by much.
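As an illustration of the vectorization point (a sketch, not code from the question), one common restructuring computes each row's dot product with y first and only then scales by x[i]; the inner loop is then a plain dot product, which compilers tend to vectorize more readily (for GCC this typically still needs -ffast-math or -Ofast, since it reorders a floating-point reduction).
// Same result as vtmv_non_blas above, but with the row dot product hoisted out.
double vtmv_restructured(const double* x, const double* A, const double* y, const unsigned n)
{
    double r = 0.0;
    for (unsigned i = 0; i < n; ++i)
    {
        double row = 0.0;                      // dot product of row i of A with y
        for (unsigned j = 0; j < n; ++j)
            row += A[i * n + j] * y[j];
        r += x[i] * row;                       // scale by x[i] once per row
    }
    return r;
}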

Using popcnt on the GPU

I need to compute
(a & b).count()
over a large set (> 10000) of bit vectors (std::bitset<N>) where N is anywhere from 2^10 to 2^16.
const size_t N = 2048;
std::vector<std::vector<char>> distances;
std::vector<std::bitset<N>> bits(100000);
load_from_file(bits);
for(int i = 0; i < bits.size(); i++){
for(int j = 0; j < bits.size(); j++){
distances[i][j] = (bits[i] & bits[j]).count();
}
}
Currently I'm relying on chunked multithreading and SSE/AVX to compute distances. Luckily I can use vpand from AVX to compute the & but my code is still using popcnt (%rax) and a loop to compute the bit counts.
Is there a way I can compute the (a & b).count() function on my GPU (nVidia 760m)? Ideally I would just pass 2 chunks of memory of N bits. I was looking at using thrust but I couldn't find a popcnt function.
EDIT:
Current CPU implementation.
double validate_pooled(const size_t K) const{
int right = 0;
const size_t num_examples = labels.size();
threadpool tp;
std::vector<std::future<bool>> futs;
for(size_t i = 0; i < num_examples; i++){
futs.push_back(tp.enqueue(&kNN<N>::validate_N, this, i, K));
}
for(auto& fut : futs)
if(fut.get()) right++;
return right / (double) num_examples;
}
bool validate_N(const size_t cmp, const size_t n) const{
const size_t num_examples = labels.size();
std::vector<char> dists(num_examples, -1);
for(size_t i = 0; i < num_examples; i++){
if(i == cmp) continue;
dists[i] = (bits[cmp] & bits[i]).count();
}
typedef std::unordered_map<std::string,size_t> counter;
counter counts;
for(size_t i = 0; i < n; i++){
auto iter = std::max_element(dists.cbegin(), dists.cend());
size_t idx = std::distance(dists.cbegin(), iter);
dists[idx] = -1; // Remove the top result.
counts[labels[idx]] += 1;
}
auto iter = std::max_element(counts.cbegin(), counts.cend(),
[](const counter::value_type& a, const counter::value_type& b){ return a.second < b.second; });
return labels[cmp] == iter->first;
}
EDIT:
This is what I've come up with. However, it's brutally slow. I'm not sure if I'm doing something wrong.
template<size_t N>
struct popl
{
typedef unsigned long word_type;
std::bitset<N> _cmp;
popl(const std::bitset<N>& cmp) : _cmp(cmp) {}
__device__
int operator()(const std::bitset<N>& x) const
{
int pop_total = 0;
#pragma unroll
for(size_t i = 0; i < N/64; i++)
pop_total += __popcll(x._M_w[i] & _cmp._M_w[i]);
return pop_total;
}
};
int main(void) {
const size_t N = 2048;
thrust::host_vector<std::bitset<N> > h_vec;
load_bits(h_vec);
thrust::device_vector<std::bitset<N> > d_vec = h_vec;
thrust::device_vector<int> r_vec(h_vec.size(), 0);
for(int i = 0; i < h_vec.size(); i++){
r_vec[i] = thrust::transform_reduce(d_vec.cbegin(), d_vec.cend(), popl<N>(d_vec[i]), 0, thrust::maximum<int>());
}
return 0;
}
CUDA has population count intrinsics for both 32-bit and 64-bit types. (__popc() and __popcll())
These could be used directly in a CUDA kernel or via thrust (in a functor) perhaps passed to thrust::transform_reduce.
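As a rough sketch of that thrust route, assuming each bitset is packed into consecutive 64-bit words in a flat device array (std::bitset itself is not usable in device code); the names d_vecs, d_query, words and num_vecs below are illustrative, not from the question:
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>

struct and_popc
{
    const unsigned long long *vecs;   // all bitsets, packed: word w of vector v at vecs[v * words + w]
    const unsigned long long *query;  // the fixed bitset being compared against
    int words;                        // 64-bit words per bitset (N / 64)

    __device__ int operator()(int v) const
    {
        int total = 0;
        for (int w = 0; w < words; ++w)
            total += __popcll(vecs[(size_t)v * words + w] & query[w]);
        return total;
    }
};

// Usage sketch: largest overlap between d_query and any of num_vecs packed bitsets
// (d_vecs/d_query would come from cudaMalloc or thrust::raw_pointer_cast).
// int best = thrust::transform_reduce(thrust::device,
//                                     thrust::counting_iterator<int>(0),
//                                     thrust::counting_iterator<int>(num_vecs),
//                                     and_popc{d_vecs, d_query, words},
//                                     0, thrust::maximum<int>());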
If that is the only function you want to do on the GPU, it may be difficult to get a net "win" because of the "cost" of transferring data to/from the GPU. Your overall input data set appears to be about 1GB in size (100000 vectors of bit length 65536), but the output data set appears to be 10-40GB in size based on my calculations (100000 * 100000 * 1-4 bytes per result).
Either the CUDA kernel or the thrust function and data layout should be crafted carefully with the objective of having the code run limited only by memory bandwidth. The cost of data transfer could also be mitigated, perhaps to a large extent, by overlap of copy and compute operations, mainly on the output data set.
At first glance, this problem appears to be somewhat similar to the problem of computing euclidean distances among sets of vectors, so this question/answer may be of interest, from a CUDA perspective.
EDIT: adding some code that I used to investigate this. I am able to get a significant speedup (~25x including data copy time) over a naive single-threaded CPU implementation, but I don't know how fast the CPU version would be using "chunked multithreading and SSE/AVX", so it would be interesting to see more of your implementation or get some performance numbers. I also don't think the CUDA code I have here is highly optimized; it's just a "first cut".
In this case, for proof-of-concept, I focused on a small problem size, N=2048, 10000 bitsets. For this small problem size, I can fit enough of the vector of bitsets in shared memory, for a "small" threadblock size, to take advantage of shared memory. So this particular approach would have to be modified for larger N.
$ cat t581.cu
#include <iostream>
#include <vector>
#include <bitset>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define nTPB 128
#define OUT_CHUNK 250
#define N_bits 2048
#define N_vecs 10000
const size_t N = N_bits;
__global__ void comp_dist(unsigned *in, unsigned *out, unsigned numvecs, unsigned start_idx, unsigned end_idx){
__shared__ unsigned sdata[(N/32)*nTPB];
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < numvecs)
for (int i = 0; i < (N/32); i++)
sdata[(i*nTPB)+threadIdx.x] = in[(i*numvecs)+idx];
__syncthreads();
int vidx = start_idx;
if (idx < numvecs)
while (vidx < end_idx) {
unsigned sum = 0;
for (int i = 0; i < N/32; i++)
sum += __popc(sdata[(i*nTPB)+ threadIdx.x] & in[(i*numvecs)+vidx]);
out[((vidx-start_idx)*numvecs)+idx] = sum;
vidx++;}
}
void cpu_test(std::vector<std::bitset<N> > &in, std::vector<std::vector<unsigned> > &out){
for (int i=0; i < in.size(); i++)
for (int j=0; j< in.size(); j++)
out[i][j] = (in[i] & in[j]).count();
}
int check_data(unsigned *d1, unsigned start_idx, std::vector<std::vector<unsigned> > &d2){
for (int i = start_idx; i < start_idx+OUT_CHUNK; i++)
for (int j = 0; j<N_vecs; j++)
if (d1[((i-start_idx)*N_vecs)+j] != d2[i][j]) {std::cout << "mismatch at " << i << "," << j << " was: " << d1[((i-start_idx)*N_vecs)+j] << " should be: " << d2[i][j] << std::endl; return 1;}
return 0;
}
unsigned long long get_time_usec(){
timeval tv;
gettimeofday(&tv, 0);
return (unsigned long long)(((unsigned long long)tv.tv_sec*1000000ULL)+(unsigned long long)tv.tv_usec);
}
int main(){
unsigned long long t1, t2;
std::vector<std::vector<unsigned> > distances;
std::vector<std::bitset<N> > bits;
for (int i = 0; i < N_vecs; i++){
std::vector<unsigned> dist_row(N_vecs, 0);
distances.push_back(dist_row);
std::bitset<N> data;
for (int j =0; j < N; j++) data[j] = rand() & 1;
bits.push_back(data);}
t1 = get_time_usec();
cpu_test(bits, distances);
t1 = get_time_usec() - t1;
unsigned *h_data = new unsigned[(N/32)*N_vecs];
memset(h_data, 0, (N/32)*N_vecs*sizeof(unsigned));
for (int i = 0; i < N_vecs; i++)
for (int j = 0; j < N; j++)
if (bits[i][j]) h_data[(i)+((j/32)*N_vecs)] |= 1U<<(31-(j&31));
unsigned *d_in, *d_out1, *d_out2, *h_out1, *h_out2;
cudaMalloc(&d_in, (N/32)*N_vecs*sizeof(unsigned));
cudaMalloc(&d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned));
cudaMalloc(&d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned));
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
h_out1 = new unsigned[N_vecs*OUT_CHUNK];
h_out2 = new unsigned[N_vecs*OUT_CHUNK];
t2 = get_time_usec();
cudaMemcpy(d_in, h_data, (N/32)*N_vecs*sizeof(unsigned), cudaMemcpyHostToDevice);
for (int i = 0; i < N_vecs; i += 2*OUT_CHUNK){
comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream1>>>(d_in, d_out1, N_vecs, i, i+OUT_CHUNK);
cudaStreamSynchronize(stream2);
if (i > 0) if (check_data(h_out2, i-OUT_CHUNK, distances)) return 1;
comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream2>>>(d_in, d_out2, N_vecs, i+OUT_CHUNK, i+2*OUT_CHUNK);
cudaMemcpyAsync(h_out1, d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_out2, d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream2);
cudaStreamSynchronize(stream1);
if (check_data(h_out1, i, distances)) return 1;
}
cudaDeviceSynchronize();
t2 = get_time_usec() - t2;
std::cout << "cpu time: " << ((float)t1)/(float)1000 << "ms gpu time: " << ((float) t2)/(float)1000 << "ms" << std::endl;
return 0;
}
$ nvcc -O3 -arch=sm_20 -o t581 t581.cu
$ ./t581
cpu time: 20324.1ms gpu time: 753.76ms
$
CUDA 6.5, Fedora20, Xeon X5560, Quadro5000 (cc2.0) GPU. The above test case includes results verification between the distances data produced on the CPU vs. the GPU. I've also broken this into a chunked algorithm with results data transfer (and verification) overlapped with compute operations, to make it more easily extendable to the case where there is a very large amount of output data (e.g. 100000 bitsets). I haven't actually run this through the profiler yet, however.
EDIT 2: Here's a "windows version" of the code:
#include <iostream>
#include <vector>
#include <bitset>
#include <stdlib.h>
#include <time.h>
#define nTPB 128
#define OUT_CHUNK 250
#define N_bits 2048
#define N_vecs 10000
const size_t N = N_bits;
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void comp_dist(unsigned *in, unsigned *out, unsigned numvecs, unsigned start_idx, unsigned end_idx){
__shared__ unsigned sdata[(N/32)*nTPB];
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < numvecs)
for (int i = 0; i < (N/32); i++)
sdata[(i*nTPB)+threadIdx.x] = in[(i*numvecs)+idx];
__syncthreads();
int vidx = start_idx;
if (idx < numvecs)
while (vidx < end_idx) {
unsigned sum = 0;
for (int i = 0; i < N/32; i++)
sum += __popc(sdata[(i*nTPB)+ threadIdx.x] & in[(i*numvecs)+vidx]);
out[((vidx-start_idx)*numvecs)+idx] = sum;
vidx++;}
}
void cpu_test(std::vector<std::bitset<N> > &in, std::vector<std::vector<unsigned> > &out){
for (unsigned i=0; i < in.size(); i++)
for (unsigned j=0; j< in.size(); j++)
out[i][j] = (in[i] & in[j]).count();
}
int check_data(unsigned *d1, unsigned start_idx, std::vector<std::vector<unsigned> > &d2){
for (unsigned i = start_idx; i < start_idx+OUT_CHUNK; i++)
for (unsigned j = 0; j<N_vecs; j++)
if (d1[((i-start_idx)*N_vecs)+j] != d2[i][j]) {std::cout << "mismatch at " << i << "," << j << " was: " << d1[((i-start_idx)*N_vecs)+j] << " should be: " << d2[i][j] << std::endl; return 1;}
return 0;
}
unsigned long long get_time_usec(){
return (unsigned long long)((clock()/(float)CLOCKS_PER_SEC)*(1000000ULL));
}
int main(){
unsigned long long t1, t2;
std::vector<std::vector<unsigned> > distances;
std::vector<std::bitset<N> > bits;
for (int i = 0; i < N_vecs; i++){
std::vector<unsigned> dist_row(N_vecs, 0);
distances.push_back(dist_row);
std::bitset<N> data;
for (int j =0; j < N; j++) data[j] = rand() & 1;
bits.push_back(data);}
t1 = get_time_usec();
cpu_test(bits, distances);
t1 = get_time_usec() - t1;
unsigned *h_data = new unsigned[(N/32)*N_vecs];
memset(h_data, 0, (N/32)*N_vecs*sizeof(unsigned));
for (int i = 0; i < N_vecs; i++)
for (int j = 0; j < N; j++)
if (bits[i][j]) h_data[(i)+((j/32)*N_vecs)] |= 1U<<(31-(j&31));
unsigned *d_in, *d_out1, *d_out2, *h_out1, *h_out2;
cudaMalloc(&d_in, (N/32)*N_vecs*sizeof(unsigned));
cudaMalloc(&d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned));
cudaMalloc(&d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned));
cudaCheckErrors("cudaMalloc fail");
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaCheckErrors("cudaStrem fail");
h_out1 = new unsigned[N_vecs*OUT_CHUNK];
h_out2 = new unsigned[N_vecs*OUT_CHUNK];
t2 = get_time_usec();
cudaMemcpy(d_in, h_data, (N/32)*N_vecs*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy fail");
for (int i = 0; i < N_vecs; i += 2*OUT_CHUNK){
comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream1>>>(d_in, d_out1, N_vecs, i, i+OUT_CHUNK);
cudaCheckErrors("cuda kernel loop 1 fail");
cudaStreamSynchronize(stream2);
if (i > 0) if (check_data(h_out2, i-OUT_CHUNK, distances)) return 1;
comp_dist<<<(N_vecs + nTPB - 1)/nTPB, nTPB, 0, stream2>>>(d_in, d_out2, N_vecs, i+OUT_CHUNK, i+2*OUT_CHUNK);
cudaCheckErrors("cuda kernel loop 2 fail");
cudaMemcpyAsync(h_out1, d_out1, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_out2, d_out2, N_vecs*OUT_CHUNK*sizeof(unsigned), cudaMemcpyDeviceToHost, stream2);
cudaCheckErrors("cuda kernel loop 3 fail");
cudaStreamSynchronize(stream1);
if (check_data(h_out1, i, distances)) return 1;
}
cudaDeviceSynchronize();
cudaCheckErrors("cuda kernel loop 4 fail");
t2 = get_time_usec() - t2;
std::cout << "cpu time: " << ((float)t1)/(float)1000 << "ms gpu time: " << ((float) t2)/(float)1000 << "ms" << std::endl;
return 0;
}
I've added CUDA error checking to this code. Be sure to build a release project in Visual Studio, not debug. When I run this on a windows 7 laptop with a Quadro1000M GPU I get about 35 seconds for the CPU execution and about 1.5 seconds for the GPU.
OpenCL 1.2 has popcount which would seem to do what you want. It can work on a vector, so up to ulong16 which is 1024 bits at a time. Note that NVIDIA drivers only support OpenCL 1.1 which does not include this function.
Of course you could just use a function or table to compute it pretty quickly, so an OpenCL 1.1 implementation is possible as well, and would likely run at the memory bandwidth of the device.
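For completeness, here is a sketch of the classic bit-twiddling population count that such an OpenCL 1.1 fallback (or any path without a hardware popcount) could use on each 32-bit word:
// Parallel-bit (SWAR) popcount for one 32-bit word; no hardware popcount instruction needed.
unsigned popcount32(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555u);                   // 2-bit sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);   // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                   // 8-bit sums
    return (x * 0x01010101u) >> 24;                     // add the four byte sums together
}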

radix select using cuda

I have been working to develop a radix select using CUDA which selects the k smallest elements from a given number of elements. The main idea behind this radix select is that it scans through the 32-bit integers starting from the MSB down to the LSB. It partitions all elements with a 0 bit onto the left side and all elements with a 1 bit onto the right side. The side that contains the k smallest elements is then solved recursively. My partition process works just fine, but I am having problems dealing with the recursive function calls. I am unable to stop the recursion. Please help me with that!
My kernel function looks like this: This is kernel.h
#include "header.h"
#define WARP_SIZE 32
#define BLOCK_SIZE 32
__device__ int Partition(int *d_DataIn, int firstidx, int lastidx, int k, int N, int bit)
{
int threadID = threadIdx.x + BLOCK_SIZE * blockIdx.x;
int WarpID = threadID >> 5;
int LocWarpID = threadID - 32 * WarpID;
int NumWarps = N / WARP_SIZE;
int pivot;
__shared__ int DataPartition[BLOCK_SIZE];
__shared__ int DataBinary[WARP_SIZE];
for(int i = 0; i < NumWarps; i++)
{
if(LocWarpID >= firstidx && LocWarpID <=lastidx)
{
int r = d_DataIn[i * WARP_SIZE + LocWarpID];
int p = (r>>(31-bit))&1;
unsigned int B = __ballot(p);
unsigned int B_flip = ~B;
if(p==1)
{
int b = B << (32-LocWarpID);
int RightLoc = __popc(b);
DataPartition[lastidx - RightLoc] = r;
}
else
{
int b_flip = B_flip << (32 - LocWarpID);
int LeftLoc = __popc(b_flip);
DataPartition[LeftLoc] = r;
}
if(LocWarpID <= lastidx - __popc(B))
{
d_DataIn[LocWarpID] = DataPartition[LocWarpID];
}
else
{
d_DataIn[LocWarpID] = DataPartition[LocWarpID];
}
pivot = lastidx - __popc(B);
return pivot+1;
}
}
}
__device__ int RadixSelect(int *d_DataIn, int firstidx, int lastidx, int k, int N, int bit)
{
if(firstidx == lastidx)
return *d_DataIn;
int q = Partition(d_DataIn, firstidx, lastidx, k, N, bit);
int length = q - firstidx;
if(k == length)
return *d_DataIn;
else if(k < length)
return RadixSelect(d_DataIn, firstidx, q-1, k, N, bit+1);
else
return RadixSelect(d_DataIn, q, lastidx, k-length, N, bit+1);
}
__global__ void radix(int *d_DataIn, int firstidx, int lastidx, int k, int N, int bit)
{
RadixSelect(d_DataIn, firstidx, lastidx, k, N, bit);
}
Host code is main.cu and it looks like:
#include "header.h"
#include <iostream>
#include <fstream>
#include "kernel.h"
#define BLOCK_SIZE 32
using namespace std;
int main()
{
int N = 32;
thrust::host_vector<float>h_HostFloat(N);
thrust::counting_iterator <unsigned int> Numbers(0);
thrust::transform(Numbers, Numbers + N, h_HostFloat.begin(), RandomFloatNumbers(1.f, 100.f));
thrust::host_vector<int>h_HostInt(N);
thrust::transform(h_HostFloat.begin(), h_HostFloat.end(), h_HostInt.begin(), FloatToInt());
thrust::device_vector<float>d_DeviceFloat = h_HostFloat;
thrust::device_vector<int>d_DeviceInt(N);
thrust::transform(d_DeviceFloat.begin(), d_DeviceFloat.end(), d_DeviceInt.begin(), FloatToInt());
int *d_DataIn = thrust::raw_pointer_cast(d_DeviceInt.data());
int *h_DataOut;
float *h_DataOut1;
int fsize = N * sizeof(float);
int size = N * sizeof(int);
h_DataOut = new int[size];
h_DataOut1 = new float[fsize];
int firstidx = 0;
int lastidx = BLOCK_SIZE-1;
int k = 20;
int bit = 1;
int NUM_BLOCKS = N / BLOCK_SIZE;
radix <<< NUM_BLOCKS, BLOCK_SIZE >>> (d_DataIn, firstidx, lastidx, k, N, bit);
cudaMemcpy(h_DataOut, d_DataIn, size, cudaMemcpyDeviceToHost);
WriteData(h_DataOut1, h_DataOut, 10, N);
return 0;
}
List of headers that I used:
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include "functor.h"
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
Another header file "functor.h" to convert floating point numbers to int type and to generate random floating numbers.
#include <thrust/random.h>
#include <sstream>
#include <fstream>
#include <iomanip>
struct RandomFloatNumbers
{
float a, b;
__host__ __device__
RandomFloatNumbers(float _a, float _b) : a(_a), b(_b) {};
__host__ __device__
float operator() (const unsigned int n) const{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a,b);
rng.discard(n);
return dist(rng);
}
};
struct FloatToInt
{
__host__ __device__
int operator() (const float &x)
const {
union {
float f_value;
int i_value;
} value;
value.f_value = x;
return value.i_value;
}
};
float IntToFloat(int &x)
{
union{
float f_value;
int i_value;
}value;
value.i_value = x;
return value.f_value;
}
bool WriteData(float *h_DataOut1, int *h_DataOut, int bit, int N)
{
std::ofstream data;
std::stringstream file;
file << "out\\Partition_";
file << std::setfill('0') <<std::setw(2) << bit;
file << ".txt";
data.open((file.str()).c_str());
if(data.is_open() == false)
{
std::cout << "File is not open" << std::endl;
return false;
}
for(int i = 0; i < N; i++)
{
h_DataOut1[i] = IntToFloat(h_DataOut[i]);
//cout << h_HostFloat[i] << " \t" << h_DataOut1[i] << endl;
//std::bitset<32>bitshift(h_DataOut[i]&1<<31-bit);
//data << bitshift[31-bit] << "\t" <<h_DataOut1[i] <<std::endl;
data << h_DataOut1[i] << std::endl;
}
data << std::endl;
data.close();
std::cout << "Partition=" <<bit <<"\n";
return true;
}
Per your request, I'm posting the code I used to investigate this and to help me study your code.
#include <stdio.h>
#include <stdlib.h>
__device__ int gpu_partition(unsigned int *data, unsigned int *partition, unsigned int *ones, unsigned int* zeroes, int bit, int idx, unsigned int* warp_ones){
int one = 0;
int valid = 0;
int my_one, my_zero;
if (partition[idx]){
valid = 1;
if(data[idx] & (1ULL<<(31-bit))) one=1;}
__syncthreads();
if (valid){
if (one){
my_one=1;
my_zero=0;}
else{
my_one=0;
my_zero=1;}
}
else{
my_one=0;
my_zero=0;}
ones[idx]=my_one;
zeroes[idx]=my_zero;
unsigned int warp_one = __popc(__ballot(my_one));
if (!(threadIdx.x & 31))
warp_ones[threadIdx.x>>5] = warp_one;
__syncthreads();
// reduce
for (int i = 16; i > 0; i>>=1){
if (threadIdx.x < i)
warp_ones[threadIdx.x] += warp_ones[threadIdx.x + i];
__syncthreads();}
return warp_ones[0];
}
__global__ void gpu_radixkernel(unsigned int *data, unsigned int m, unsigned int n, unsigned int *result){
__shared__ unsigned int loc_data[1024];
__shared__ unsigned int loc_ones[1024];
__shared__ unsigned int loc_zeroes[1024];
__shared__ unsigned int loc_warp_ones[32];
int l=0;
int bit = 0;
unsigned int u = n;
if (n<2){
if ((n == 1) && !(threadIdx.x)) *result = data[0];
return;}
loc_data[threadIdx.x] = data[threadIdx.x];
loc_ones[threadIdx.x] = (threadIdx.x<n)?1:0;
__syncthreads();
unsigned int *next = loc_ones;
do {
int s = gpu_partition(loc_data, next, loc_ones, loc_zeroes, bit++, threadIdx.x, loc_warp_ones);
if ((u-s) > m){
u = (u-s);
next = loc_zeroes;}
else{
l = (u-s);
next = loc_ones;}}
while ((u != l) && (bit<32));
if (next[threadIdx.x]) *result = loc_data[threadIdx.x];
}
int partition(unsigned int *data, int l, int u, int bit){
unsigned int *temp = (unsigned int *)malloc(((u-l)+1)*sizeof(unsigned int));
int pos = 0;
for (int i = l; i<=u; i++)
if(data[i] & (1ULL<<(31-bit))) temp[pos++] = data[i];
int result = u-pos;
for (int i = l; i<=u; i++)
if(!(data[i] & (1ULL<<(31-bit)))) temp[pos++] = data[i];
pos = 0;
for (int i = u; i>=l; i--)
data[i] = temp[pos++];
free(temp);
return result;
}
unsigned int radixselect(unsigned int *data, int l, int u, int m, int bit){
if (l == u) return(data[l]);
if (bit > 32) {printf("radixselect fail!\n"); return 0;}
int s = partition(data, l, u, bit);
if (s>=m) return radixselect(data, l, s, m, bit+1);
return radixselect(data, s+1, u, m, bit+1);
}
int main(){
unsigned int data[8] = {32767, 22, 88, 44, 99, 101, 0, 7};
unsigned int data1[8];
for (int i = 0; i<8; i++){
for (int j=0; j<8; j++) data1[j] = data[j];
printf("value[%d] = %d\n", i, radixselect(data1, 0, 7, i, 0));}
unsigned int *d_data;
cudaMalloc((void **)&d_data, 1024*sizeof(unsigned int));
unsigned int h_result, *d_result;
cudaMalloc((void **)&d_result, sizeof(unsigned int));
cudaMemcpy(d_data, data, 8*sizeof(unsigned int), cudaMemcpyHostToDevice);
for (int i = 0; i < 8; i++){
gpu_radixkernel<<<1,1024>>>(d_data, i, 8, d_result);
cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);
printf("gpu result index %d = %d\n", i, h_result);
}
unsigned int data2[1024];
unsigned int data3[1024];
for (int i = 0; i < 1024; i++) data2[i] = rand();
cudaMemcpy(d_data, data2, 1024*sizeof(unsigned int), cudaMemcpyHostToDevice);
for (int i = 0; i < 1024; i++){
for (int j = 0; j<1024; j++) data3[j] = data2[j];
unsigned int cpuresult = radixselect(data3, 0, 1023, i, 0);
gpu_radixkernel<<<1,1024>>>(d_data, i, 1024, d_result);
cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);
if (h_result != cpuresult) {printf("mismatch at index %d, cpu: %d, gpu: %d\n", i, cpuresult, h_result); return 1;}
}
printf("Finished\n");
return 0;
}
Here are some notes, in no particular order:
I got rid of all your thrust code; it's not doing anything useful as far as the radix select algorithm is concerned. I also find your casting of float to int curious. I haven't thought through the ramifications of trying to do a bitwise radix select, in order, on a sequence of exponent bits followed by a sequence of mantissa bits. It might work (although I think if you include the sign bit, it definitely won't work), but again I don't think it's central to understanding the algorithm. (A sketch of the usual bit mapping that does handle the sign bit appears after these notes.)
I included a host version that I wrote just to check my device results.
I'm pretty sure this algorithm will fail in some cases where there are duplicated elements. For example, if you hand it a vector of all zeroes, I think it will fail. I don't think it would be difficult to handle that case however.
my host version is recursive, but my device version is not. I don't see that recursion is that useful here, since the non-recursive form of the algorithm is easy to write as well, especially since there are at most 32 bits to travel through. Still, if you wanted to create a recursive device version, it should not be difficult, by incorporating the u,s, and l manipulation code inside the partition function.
I have dispensed with typical cuda error checking. However I recommend it.
I don't consider this to be a paragon of cuda programming. If you delve into for example a radix sort algorithm (such as here), you will see that it is pretty complex. A fast GPU radix select would look nothing like my code. I wrote my code to be analogous to the serial recursive partitioned radix sort, which is not the best way to do it on a massively parallel architecture.
Since radix select is not a sort, I attempted to write a device code that would do no data movement of the input data, since I considered this to be expensive and unnecessary. I do a single read from global memory for the data at the beginning of the kernel, and thereafter I do all work out of shared memory, and even in shared memory I am not re-arranging the data (as I do in my host version) so as to avoid the cost of data movement. Instead I keep flag arrays of ones and zeroes partitions, to feed to the next partitioning step. The data movement would involve a fair amount of uncoalesced and/or bank-conflicted traffic, whereas the flag arrays allow all accesses to be non-bank-conflicted.
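For reference, and going slightly beyond the notes above, the usual mapping that makes IEEE-754 floats compare in numeric order when their bit patterns are compared as unsigned integers (sign bit included) is the following; a bitwise radix select could operate on these keys instead of on the raw reinterpreted bits.
#include <cstring>

// Map a float's raw bits to an unsigned key that orders the same way the float does:
// negative values get all their bits flipped, non-negative values just get the sign bit set.
__host__ __device__ inline unsigned int float_to_ordered_key(float f)
{
    unsigned int u;
    memcpy(&u, &f, sizeof(u));                        // reinterpret the bits
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}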