I have two vectors (oldvector and newvector). I need to calculate the value of the residual which is defined by the following pseudocode:
residual = 0;
forall i : residual += (oldvector[i] - newvector[i])^2
Currently, I am calculating this with two CUDA Thrust operations which are essentially doing:
forall i : oldvector[i] = oldvector[i] - newvector[i]
followed by a thrust::transform_reduce with a square as unary operator, which is doing:
residual = 0;
forall i : residual += oldvector[i]^2;
The problem with this is obviously the intermediate store to global memory before transform_reduce. Is there a more efficient approach to this problem which would fuse these two steps? Apart from writing my own CUDA kernel, is there any other option?
One approach I thought of was to write a thrust::reduce with zip iterators. The problem with this is that the return type of the operator has to be the same type as its input. As I understand it, this means the reduction operator would have to return a tuple, which would mean an extra addition.
If I do write a reduction CUDA kernel, have there been any improvements over the CUDA 1.1 reduction kernel example?
thrust::inner_product will do it in a single function call. Your original idea can also be made to work (zipping together the two vectors and using thrust::transform_reduce). This code demonstrates both methods:
#include <iostream>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>
#define N 2
struct zdiffsq{
template <typename Tuple>
__host__ __device__ float operator()(Tuple a)
{
float result = thrust::get<0>(a) - thrust::get<1>(a);
return result*result;
}
};
struct diffsq{
__host__ __device__ float operator()(float a, float b)
{
return (b-a)*(b-a);
}
};
int main(){
thrust::device_vector<float> oldvector(N);
thrust::device_vector<float> newvector(N);
oldvector[0] = 1.0f; oldvector[1] = 2.0f;
newvector[0] = 2.0f; newvector[1] = 5.0f;
float result = thrust::inner_product(oldvector.begin(), oldvector.end(), newvector.begin(), 0.0f, thrust::plus<float>(), diffsq());
std::cout << "Result: " << result << std::endl;
float result2 = thrust::transform_reduce(thrust::make_zip_iterator(thrust::make_tuple(oldvector.begin(), newvector.begin())), thrust::make_zip_iterator(thrust::make_tuple(oldvector.end(), newvector.end())), zdiffsq(), 0.0f, thrust::plus<float>());
std::cout << "Result2: " << result2 << std::endl;
}
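To sanity-check either result on the host, the same fused subtract-square-sum can be written with std::inner_product, whose argument order mirrors the Thrust call. This is only a minimal CPU-side sketch and not part of the code above; the name residual_reference is made up for illustration.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical host-side reference: sum of squared differences in one pass,
// without storing the differences in an intermediate vector.
float residual_reference(const std::vector<float>& oldv,
                         const std::vector<float>& newv)
{
    assert(oldv.size() == newv.size());
    return std::inner_product(
        oldv.begin(), oldv.end(), newv.begin(), 0.0f,
        std::plus<float>(),                                    // reduction step
        [](float a, float b) { return (b - a) * (b - a); });   // per-element step
}
```

For the two-element vectors used above, residual_reference({1.0f, 2.0f}, {2.0f, 5.0f}) evaluates to 10, which is what both Thrust calls should print.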
You can also investigate eliminating the functor definition used with the inner product example, by using thrust placeholders.
Even if you want to write your own CUDA code, the standard recommendation now for oft-used algorithms like parallel reductions and sorting is to use CUB.
And yes, the CUDA parallel reduction sample and accompanying presentation are still a good basic intro to a fast parallel reduction.
Robert Crovella has already answered this question and has also suggested using CUB.
Unlike Thrust, CUB leaves performance-critical parameters, such as the choice of reduction algorithm and the degree of concurrency, unbound and selectable by the user. These parameters can be tuned to maximize performance for a particular architecture and application. They can be specified at compile time, which avoids runtime performance penalties.
Below is a full worked example of how to use CUB for the residual calculation.
#include <cub/cub.cuh>
#include <cuda.h>
#include "Utilities.cuh"
#include <iostream>
#define BLOCKSIZE 256
#define ITEMS_PER_THREAD 8
const int N = 4096;
/******************************/
/* TRANSFORM REDUCTION KERNEL */
/******************************/
__global__ void TransformSumKernel(const float * __restrict__ indata1, const float * __restrict__ indata2, float * __restrict__ outdata) {
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
// --- Specialize BlockReduce for type float and a block of BLOCKSIZE threads.
typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
__shared__ typename BlockReduceT::TempStorage temp_storage;
// --- Out-of-range threads contribute zero, so every thread of the block takes part in the collective Sum.
float diff = (tid < N) ? (indata1[tid] - indata2[tid]) : 0.f;
float result = BlockReduceT(temp_storage).Sum(diff * diff);
// --- The block-wide sum is only valid in thread 0.
if(threadIdx.x == 0) atomicAdd(outdata, result);
return;
}
/********/
/* MAIN */
/********/
int main() {
// --- Allocate host and device side space for the two input vectors and the result
float *h_data1 = (float *)malloc(N * sizeof(float));
float *h_data2 = (float *)malloc(N * sizeof(float));
float *h_result = (float *)malloc(sizeof(float));
float *d_data1; gpuErrchk(cudaMalloc(&d_data1, N * sizeof(float)));
float *d_data2; gpuErrchk(cudaMalloc(&d_data2, N * sizeof(float)));
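float *d_result; gpuErrchk(cudaMalloc(&d_result, sizeof(float)));
// --- The kernel accumulates into d_result with atomicAdd, so it must start at zero
gpuErrchk(cudaMemset(d_result, 0, sizeof(float)));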
for (int i = 0; i < N; i++) {
h_data1[i] = 1.f;
h_data2[i] = 3.f;
}
gpuErrchk(cudaMemcpy(d_data1, h_data1, N * sizeof(float), cudaMemcpyHostToDevice));
gpuErrchk(cudaMemcpy(d_data2, h_data2, N * sizeof(float), cudaMemcpyHostToDevice));
TransformSumKernel<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_data1, d_data2, d_result);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost));
std::cout << "output: ";
std::cout << h_result[0];
std::cout << std::endl;
gpuErrchk(cudaFree(d_data1));
gpuErrchk(cudaFree(d_data2));
gpuErrchk(cudaFree(d_result));
return 0;
}
I am developing my first Cuda application, and I have a kernel with "below-expected throughput", which seems to be the biggest bottleneck at the moment.
The task of the kernel is to compute an N by N sized matrix (DD) containing squared distances between all elements on a data matrix. The data matrix (Y) is size N by D (to support multi dimensional data) and stored as row-major.
Source:
__global__ void computeSquaredEuclideanDistance(const float * __restrict__ Y, float * __restrict__ DD, const int N, const int D) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < N * N; i += stride) {
const int m = i / N;
const int n = i % N;
float tmp = 0;
for (int d = 0; d < D; ++d) {
const float Ynd = Y[d + D * n];
const float Ymd = Y[d + D * m];
const float Ydiff = Ynd - Ymd;
tmp += Ydiff * Ydiff;
}
DD[n + N * m] = tmp;
}
}
This is being called with size_t blockSize = 256 and size_t numBlocks = (N*N + blockSize - 1)/blockSize.
How can I optimize this kernel? My initial thought is that the time-consuming part is reading data without exploiting some sort of shared memory, but can anyone give me pointers on how to approach this?
Remarks from the nvvp profiling tool:
Latency analysis:
Compute utilization at around 40%
Memory (L2 cache) utilization at around 35%
Occupancy is not an issue
Active Warps at 57.59 of a theoretical 64
Occupancy at 90% of a theoretical 100
For my application, typical values are:
5k < N < 30k
D is either 2 or 3
I typically disregard these types of optimization questions because they are on the verge of off-topic, in my opinion. Worse still, you provide no MCVE, so anyone trying to answer would have to write all their own support code to compile and benchmark your kernel. And this sort of work does require benchmarking and code analysis. But because your problem is basically a linear algebra problem (and I like linear algebra), I answered it rather than close voting it as too broad.
With that off my chest, there are a couple of things which immediately jump out in the code which could be improved and which would probably have a material effect on the run time.
The first is that the trip count of the inner loop is known a priori. Anytime you have a situation like that, let the compiler know. Loop unrolling and code reordering is a very powerful compiler optimization and the NVIDIA compiler is extremely good at it. If you move D into a template parameter, you can do something like this:
template<int D>
__device__ float esum(const float *x, const float *y)
{
float val = 0.f;
#pragma unroll
for(int i=0; i<D; i++) {
float diff = x[i] - y[i];
val += diff * diff;
}
return val;
}
template<int D>
__global__
void vdistance0(const float * __restrict__ Y, float * __restrict__ DD, const int N)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < N * N; i += stride) {
const int m = i / N;
const int n = i % N;
DD[n + N * m] = esum<D>(Y + D * n, Y + D * m);
}
}
template __global__ void vdistance0<2>(const float *, float *, const int);
template __global__ void vdistance0<3>(const float *, float *, const int);
The compiler will inline esum and unroll the inner loop and it can then use its reordering heuristics to better interleave loads and flops to improve throughput. The resulting code has a lower register footprint too. When I run this for N=10000 and D=2, I get about 35% speed up (7.1ms versus 4.5ms on a GTX 970 with CUDA 9.1).
But there is an even more glaringly obvious optimization than this. The calculation you are performing will produce a symmetric output matrix. You only need to do (N*N)/2 operations to compute the full matrix, rather than the N*N you are doing in your code [technically N(N-1)/2 because the diagonal entries are zero, but let's forget the diagonal for the purposes of this discussion].
So taking a different approach and using one block to calculate each row of the upper triangular output matrix, then you can do something like this:
struct udiag
{
float *p;
int m;
__device__ __host__ udiag(float *_p, int _m) : p(_p), m(_m) {};
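// Row i of the packed upper triangle holds m - i entries, so it starts at i*m - i*(i-1)/2.
__device__ __host__ float* get_row(int i) { return p + i * m - (i * (i - 1)) / 2; };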
};
template<int D>
__global__
void vdistance2(const float * __restrict__ Y, float * __restrict__ DD, const int N)
{
int rowid = blockIdx.x;
int colid = threadIdx.x;
udiag m(DD, N);
for(; rowid < N; rowid += gridDim.x) {
float* p = m.get_row(rowid);
const float* y = Y + D * rowid;
for(int i=colid; i < (N-rowid); i += blockDim.x) {
p[i] = esum<D>(y, y + D * i);
}
}
}
template __global__ void vdistance2<2>(const float *, float *, const int);
template __global__ void vdistance2<3>(const float *, float *, const int);
This uses a little helper class to encapsulate the triangle numbers needed for the addressing scheme for the upper triangular output matrix. Doing this saves an enormous amount of memory and memory bandwidth as well as reducing the total FLOP count for the calculation. If you need to do other things afterwards, BLAS (and CUBLAS) support computations on upper or lower triangular matrices. Use them. When I run this I get about 75% speedup (7.1ms versus 1.6ms on the same GTX 970).
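For reference, one common way to pack the upper triangle (diagonal included) in row-major order puts row r at offset r*N - r(r-1)/2, because row r holds N - r entries. Below is a minimal host-side sketch of that index calculation. The function name packed_upper_index is made up for illustration and is not part of the kernels above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: offset of element (row, col), with col >= row, in a
// row-major packed upper triangle (diagonal included) of an n x n matrix.
// Rows 0..row-1 hold n, n-1, ..., n-row+1 entries, which sums to
// row * (2n - row + 1) / 2, the same value as row*n - row*(row-1)/2.
std::size_t packed_upper_index(std::size_t row, std::size_t col, std::size_t n)
{
    assert(row <= col && col < n);
    return row * (2 * n - row + 1) / 2 + (col - row);
}
```

For n = 4 the ten stored entries get the offsets 0 to 9: row 0 covers 0 to 3, row 1 covers 4 to 6, row 2 covers 7 and 8, and the last diagonal entry lands at 9.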
Huge disclaimer: All the code you see here was written during a 45 minute lunch break and has been very lightly tested. I make absolutely no claims that anything in this answer is actually correct. I have confirmed that it compiles and doesn't produce a runtime error when I run it to get profiling data. That is it. Caveat emptor and all that.
I am working with cuda and cublas and I was trying to implement simple operations like matrix element-wise multiplication/division. I am using only float for my experiments. I know the most obvious way to do it is to write a kernel like this one:
__global__ void mul_elementwise(const unsigned int n, float* source, float* dest, const float value)
{
const unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int stride = blockDim.x * gridDim.x;
for (unsigned int i = offset; i < n; i += stride)
{
dest[i] = source[i] * value;
}
}
This kernel can work both for multiplication and division (just using 1/x as value). But this can be achieved using cublas library too: suppose we have a matrix A m x n stored in column-major style and a scalar x, then setting alpha = x or alpha = 1/x and d_ones as a vector of m*n 1s, we can invoke and obtain the same result
cublasSaxpy(cublas_handle, m * n, &alpha, d_ones, 1, A_dev, 1);
Both methods work just fine, but I am facing a few problems with one particular matrix, for which neither method works. I isolated this big matrix and built an MCVE available here (you can compile it with nvcc mcve.cu -lcublas). As you can see, the results in both cases are totally wrong: the host result is totally different. I am trying to figure out what's going on. I do not see any error in the code, but maybe I should try to use double instead of float and see what happens.
Any opinions about this situation? Thanks in advance!
EDIT #1: I tried using doubles, but nothing changes if I use cublasDaxpy, whereas it works perfectly with the custom kernel. I think the values are too small, so single-precision floating point is not enough.
Interesting MCVE. Wouldn't it have been possible to shrink your vector down to just a few elements? Isn't it possible to show the calculation discrepancy based on just 1 vector element?
Anyway I see several problems.
Your kernel implements the following function: y=alpha*x. But SAXPY implements y=alpha*x+y. Now, if y started out as (all) zero, then these two would be the same. But that's not what you have:
          CUBLAS    Your Kernel
-----------------------------------
alpha:    alpha     alpha
x:        1         ahost    (ahost is your huge data array)
y:        ahost     -
So your kernel is computing y=alpha * ahost, but your CUBLAS call is computing y = alpha*1 + ahost. I wouldn't expect the same result from these, in general.
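To make the difference concrete, here is a minimal host-side sketch of the two operations. It is plain C++ rather than CUBLAS, and the names saxpy_reference and scale_reference are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What SAXPY computes: y = alpha * x + y (accumulates into the existing y).
std::vector<float> saxpy_reference(float alpha, const std::vector<float>& x,
                                   std::vector<float> y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * x[i] + y[i];
    return y;
}

// What the custom kernel computes: y = alpha * x (no previous y is involved).
std::vector<float> scale_reference(float alpha, const std::vector<float>& x)
{
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * x[i];
    return y;
}
```

The two agree only when y starts out as all zeros. With alpha = 0.5 and x = {2, 4}, both give {1, 2} for a zero y, whereas a y of {10, 10} turns the SAXPY result into {11, 12}.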
Your analysis of error seems flawed in a few ways. First, you are computing the absolute error in a float variable (a number which will always be positive, since it's the absolute value), but then you're comparing it against a negative number:
float diff = abs(host[i]-dev[i]);
...
if (diff > (-1e12))
Won't that if test always be true? Perhaps you meant 1e-12, although that would still be flawed. A fixed error threshold in a floating point comparison should be scaled to the size of the numbers being compared. float quantities only contain about 6-7 accurate decimal digits. (And summing these errors is also troublesome.)
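A scaled comparison could look something like the sketch below. This is only an illustration of the idea: the helper name nearly_equal and both default tolerances are assumptions, not values taken from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b agree to within rel_tol of the larger
// magnitude, with abs_tol as a floor so that values at or near zero can match.
bool nearly_equal(float a, float b, float rel_tol = 1e-5f, float abs_tol = 1e-30f)
{
    assert(rel_tol >= 0.0f && abs_tol >= 0.0f);
    const float diff  = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

With this kind of test, two values around 1e-6 that differ only in the last digit or two compare as equal, while 1e-6 versus 2e-6 does not, regardless of how small the numbers are in absolute terms.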
Here is a complete code that has the above issues fixed, and produces zero sum error for all the comparisons (host<->kernel and host<->cublas):
static float array[] = {0x00000000,
0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0xB58DA1CF,0xB50D2FEC,0x34A48536,0xB4A1D5BC,0x358E1345,0x35943AAC,0xB5983F40,0xB43628BB,0xB4A95348,0xB4DB751C,0xB50C8D1A,0xB3EFCBB5,0x3552B8CD,0x3538A167,0x358FDE0D,0xB4D54CE9,0xB5D29BB7,0xB4A234EE,0x346EF2F4,0x35B5D9F2,0xB40F1487,0x3554BC20,0x33FD9466,0xB536D37D,0xB3C2E594,0xB59DA581,0x3584FC87,0x34438F09,0x35D293CB,0xB4FBB002,0xB59F41E9};
#include <iostream>
#include <stdio.h>
#include <cublas_v2.h>
#include <assert.h>
#define TOL 0.0001
typedef unsigned int u32;
#define GET_STRIDE() u32(blockDim.x * gridDim.x)
#define GET_OFFSET() u32(blockIdx.x * blockDim.x + threadIdx.x)
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
if (result != cudaSuccess) {
fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
assert(result == cudaSuccess);
}
#endif
return result;
}
__global__ void div_elementwise(const u32 n, float* source, float* dest, const float value)
{
for (u32 i = GET_OFFSET(); i < n; i += GET_STRIDE())
{
dest[i] = source[i] * value;
}
}
float check_eq(float* dev, float* host, u32 len)
{
float sum = 0.0f;
for (u32 i = 0; i < len; ++i)
{
if (dev[i]!=host[i])
{
//printf("diff %d %f %f\n", i, dev[i], host[i]);
//break;
float diff = abs((host[i]-dev[i])/host[i]);
sum += diff;
if (diff > (TOL))
printf("diff %d %f\n", i, diff);
}
}
printf("%f\n", sum);
return sum;
}
void div_host(float* a, float v, u32 len)
{
for (u32 i = 0; i < len; ++i)
{
a[i]=a[i]*v;
}
}
int main()
{
u32 len = sizeof(array)/sizeof(float);
printf("array len = %d\n", len);
for (int i =0; i < len; i++) if (isnan(array[i])) {printf("nan value at %d\n",i); return -1;}
float* adev, *adevcublas, *d_zero;
float* ahost = (float*) malloc(len * sizeof(float));
checkCuda(cudaMalloc(&adev, len * sizeof(float)));
checkCuda(cudaMalloc(&adevcublas, len * sizeof(float)));
checkCuda(cudaMalloc(&d_zero, len * sizeof(float)));
memcpy(ahost, &array[0], len * sizeof(float));
checkCuda(cudaMemcpy(adev, ahost, len * sizeof(float), cudaMemcpyHostToDevice));
checkCuda(cudaMemcpy(adevcublas, ahost, len * sizeof(float), cudaMemcpyHostToDevice));
checkCuda(cudaMemset(d_zero, 0, len*sizeof(float)));
float alpha = 1/2494.f;
printf("%f\n", alpha);
div_host(ahost, alpha, len);
u32 tb = 256;
div_elementwise<<<((len + tb - 1) / tb),tb>>>(len, adev, adev, alpha);
float* r = (float*) malloc(len * sizeof(float));
checkCuda(cudaMemcpy(r, adev, len * sizeof(float), cudaMemcpyDeviceToHost));
check_eq(r,ahost,len);
cublasHandle_t ch;
cublasCreate(&ch);
float* r0 = (float*) malloc(len * sizeof(float));
cublasStatus_t stat = cublasSaxpy(ch, len, &alpha, adevcublas, 1, d_zero, 1);
if (stat != CUBLAS_STATUS_SUCCESS) {std::cout << "CUBLAS error: " << (int)stat << std::endl; return 1;}
checkCuda(cudaMemcpy(r0, d_zero, len * sizeof(float), cudaMemcpyDeviceToHost));
check_eq(r0,ahost,len);
free(r);
free(r0);
free(ahost);
cudaFree(adev);
cudaFree(adevcublas);
cudaFree(d_zero);
cublasDestroy(ch);
return 0;
}
From some comments that I have read in here, for some reason it is preferable to have Structure of Arrays (SoA) over Array of Structures (AoS) for parallel implementations like CUDA? If that is true, can anyone explain why?
Thanks in advance!
Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).
E.g. for RGB pixels versus separate RGB planes:
struct {
uint8_t r, g, b;
} AoS[N];
struct {
uint8_t r[N];
uint8_t g[N];
uint8_t b[N];
} SoA;
If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.
However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.
One further consideration is padding/alignment. For the RGB example above each element in an AoS layout is aligned to a multiple of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned which can be more convenient for certain algorithms/architectures.
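The padding point can be checked with a small host-side sketch. The struct and function names here are made up for illustration, and the sizes in the comments are what common ABIs produce rather than something the C++ standard guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct RgbPacked { std::uint8_t r, g, b; };       // typically 3 bytes per pixel
struct RgbPadded { std::uint8_t r, g, b, pad; };  // dummy byte, typically 4 bytes

// True when pixel i of an array of PixelT starts on a 4-byte boundary,
// assuming the array itself starts on one.
template <typename PixelT>
bool starts_on_4_byte_boundary(std::size_t i)
{
    return (i * sizeof(PixelT)) % 4 == 0;
}

// Checks the size assumptions stated in the comments above.
void check_pixel_sizes()
{
    assert(sizeof(RgbPacked) == 3);
    assert(sizeof(RgbPadded) == 4);
}
```

With the 3-byte layout only every fourth pixel starts on a 4-byte boundary, whereas with the padded layout every pixel does, at the cost of one wasted byte per pixel.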
For most image processing type applications the AoS scenario is much more common, but for other applications, or for specific image processing tasks this may not always be the case. When there is no obvious choice I would recommend AoS as the default choice.
See also this answer for more general discussion of AoS v SoA.
I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).
In the example, I'm considering three different versions of the same code:
AoS (v1)
Straight arrays (v2)
SoA (v3)
In particular, version 2 considers the use of straight arrays. The timings of versions 2 and 3 are the same for this example and are better than those of version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading from uniform cache could be enabled through const __restrict__ for this case.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
#define BLOCKSIZE 1024
/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {
unsigned int x1;
unsigned int x2;
unsigned int code;
bool done;
};
/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {
unsigned int *x1;
unsigned int *x2;
unsigned int *code;
bool *done;
};
/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
cellAoS tempCell = d_cells[tid];
tempCell.x1 = tempCell.x1 + 10;
tempCell.x2 = tempCell.x2 + 10;
d_cells[tid] = tempCell;
}
}
/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
d_x1[tid] = d_x1[tid] + 10;
d_x2[tid] = d_x2[tid] + 10;
}
}
/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
cell.x1[tid] = cell.x1[tid] + 10;
cell.x2[tid] = cell.x2[tid] + 10;
}
}
/********/
/* MAIN */
/********/
int main() {
const int N = 2048 * 2048 * 4;
TimingGPU timerGPU;
thrust::host_vector<cellAoS> h_cells(N);
thrust::device_vector<cellAoS> d_cells(N);
thrust::host_vector<unsigned int> h_x1(N);
thrust::host_vector<unsigned int> h_x2(N);
thrust::device_vector<unsigned int> d_x1(N);
thrust::device_vector<unsigned int> d_x2(N);
for (int k = 0; k < N; k++) {
h_cells[k].x1 = k + 1;
h_cells[k].x2 = k + 2;
h_cells[k].code = k + 3;
h_cells[k].done = true;
h_x1[k] = k + 1;
h_x2[k] = k + 2;
}
d_cells = h_cells;
d_x1 = h_x1;
d_x2 = h_x2;
cellSoA cell;
cell.x1 = thrust::raw_pointer_cast(d_x1.data());
cell.x2 = thrust::raw_pointer_cast(d_x2.data());
cell.code = NULL;
cell.done = NULL;
timerGPU.StartCounter();
AoSvsSoA_v1 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_cells.data()), N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());
//timerGPU.StartCounter();
//AoSvsSoA_v2 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
//gpuErrchk(cudaPeekAtLastError());
//gpuErrchk(cudaDeviceSynchronize());
//printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());
timerGPU.StartCounter();
AoSvsSoA_v3 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(cell, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());
h_cells = d_cells;
h_x1 = d_x1;
h_x2 = d_x2;
// --- Check results
for (int k = 0; k < N; k++) {
if (h_x1[k] != k + 11) {
printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11);
break;
}
if (h_x2[k] != k + 12) {
printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12);
break;
}
if (h_cells[k].x1 != k + 11) {
printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11);
break;
}
if (h_cells[k].x2 != k + 12) {
printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12);
break;
}
}
}
The following are the timings (runs performed on a GTX960):
Array of struct 9.1ms (v1 kernel)
Struct of arrays 3.3ms (v3 kernel)
Straight arrays 3.2ms (v2 kernel)
SoA is effectively good for SIMD processing.
There are several reasons, but basically it's more efficient to load 4 consecutive floats into a register. With something like:
alignas(16) float v[4] = {0};
__m128 reg = _mm_load_ps( v );
than using:
struct vec { float x; float y; ....} ;
vec v = {0, 0, 0, 0};
and creating an __m128 value by accessing every member:
__m128 reg = _mm_set_ps(v.x, ....);
If your arrays are 16-byte aligned, data loads/stores are faster and some operations can be performed directly on memory.
Update!
My current code doesn't check for out of bounds memory access. When I run the cuda memcheck, it says memory access is bad even for matrices of just 2 by 2! I'm accessing memory where I shouldn't somehow and that's the problem!
To check for out of bounds memory access, run cuda-memcheck ./(insert executable here)
Shown below is my code for the matrix multiplication itself:
dim3 block(32,32);
dim3 grid( (n+31)/32, (n+31)/32 );
matrixMul<<<grid,block>>>(d_C, d_A, d_B, n, k);
kA and kB are matrices with values in them (they're all 2's to make it easier).
m, n, k are all the same number for my square matrices
kC is the matrix to store the answer.
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
#include <stdio.h>
__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{
int tx = blockIdx.x * 32 + threadIdx.x;
int ty = blockIdx.y * 32 + threadIdx.y;
float value = 0;
for (int i=0;i<n;i++)
{
float elementA=kA[ty*n+i];
float elementB=kB[i*k+tx];
value += elementA*elementB;
}
kC[ty*n+tx] = value;
}
#endif // #ifndef _MATRIXMUL_KERNEL_H_
Based on how you are defining the grid of threads, you should add a thread check to the kernel code like this:
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
#include <stdio.h>
__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{
int tx = blockIdx.x * 32 + threadIdx.x;
int ty = blockIdx.y * 32 + threadIdx.y;
if ((ty < n) && (tx < n)) { // add this line
float value = 0;
for (int i=0;i<n;i++)
{
float elementA=kA[ty*n+i];
float elementB=kB[i*k+tx];
value += elementA*elementB;
}
kC[ty*n+tx] = value;
} // add this line
}
#endif // #ifndef _MATRIXMUL_KERNEL_H_
Otherwise threads outside the valid array range will corrupt your results. Things work for matrix sizes that are multiples of 32 because there are no invalid threads. In that case you're launching exactly the required number of threads. But in other cases you are launching extra threads. These extra threads, if allowed to compute an invalid matrix position, will corrupt the results.
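To see how many extra threads a given size produces, here is a small host-side sketch of the arithmetic behind that launch configuration. The function name surplus_threads is made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: number of launched threads that fall outside an n x n
// matrix when each grid dimension is ceil(n / tile) blocks of tile threads.
long long surplus_threads(int n, int tile = 32)
{
    assert(n > 0 && tile > 0);
    const long long blocks_per_dim  = (n + tile - 1) / tile;   // ceil(n / tile)
    const long long threads_per_dim = blocks_per_dim * tile;
    return threads_per_dim * threads_per_dim - static_cast<long long>(n) * n;
}
```

For the 2 by 2 case from the question, one 32x32 block is launched, so 1020 of the 1024 threads have no valid element. For n = 64 the count is zero, which is why multiples of 32 appeared to work.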
We have the following serial C code operating on two vectors a[] and b[]:
double a[20000],b[20000],r=0.9,errors=0;
for(int i=1;i<=10000;++i)
{
a[i]=r*a[i]+(1-r)*b[i];
errors=max(errors,fabs(a[i]-b[i]));
b[i]=a[i];
}
Please tell us how to port this code to CUDA and cublas.
It's also possible to implement this reduction in Thrust using thrust::transform_reduce. This solution fuses the entire operation, as talonmies suggests:
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
// this functor unpacks a tuple and then computes
// a weighted absolute difference of its members
struct weighted_absolute_difference
{
double r;
weighted_absolute_difference(const double r)
: r(r)
{}
__host__ __device__
double operator()(thrust::tuple<double,double> t)
{
double a = thrust::get<0>(t);
double b = thrust::get<1>(t);
a = r * a + (1.0 - r) * b;
return fabs(a - b);
}
};
int main()
{
using namespace thrust;
const std::size_t n = 20000;
const double r = 0.9;
device_vector<double> a(n), b(n);
// initialize a & b
...
// do the reduction
double result =
transform_reduce(make_zip_iterator(make_tuple(a.begin(), b.begin())),
make_zip_iterator(make_tuple(a.end(), b.end())),
weighted_absolute_difference(r),
-1.0, // init must be a double: transform_reduce returns the type of init
maximum<double>());
// note that this solution does not set
// a[i] = r * a[i] + (1 - r) * b[i]
return 0;
}
Note that we do not perform the assignment a[i] = r * a[i] + (1 - r) * b[i] in this solution, though it would be simple to do so after the reduction using thrust::transform. It is not safe to modify transform_reduce's arguments in either functor.
This second line in your loop:
errors=max(errors,fabs(a[i]-b[i]));
is known as a reduction. Fortunately there is reduction example code in the CUDA SDK - take a look at this and use it as a template for your algorithm.
You probably want to split this into two separate operations (possibly as two separate kernels) - one for the parallel part (calculation of bp[] values) and a second for the reduction (calculate errors).
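The split can be sketched on the host first, which also gives you a reference to validate the GPU version against. This is a minimal serial sketch under the assumption that the whole vectors are processed (the loop in the question only touches indices 1 to 10000). The function name blend_and_max_error is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the loop in the question, split into
// an element-wise phase and a reduction phase. Returns the value of errors.
double blend_and_max_error(std::vector<double>& a, std::vector<double>& b, double r)
{
    assert(a.size() == b.size());

    // Phase 1: independent per element, so it maps to one thread per i.
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = r * a[i] + (1.0 - r) * b[i];

    // Phase 2: max reduction over |a[i] - b[i]|, using the old b values.
    double errors = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        errors = std::max(errors, std::fabs(a[i] - b[i]));

    // The original loop then copies a into b.
    b = a;
    return errors;
}
```

Because each element is handled independently, running the two phases one after the other gives the same values as the interleaved loop, as long as b is only overwritten after the reduction has read it.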