I am new to CUDA programming and I have been trying to figure out how to convert the following code into CUDA code.
for (int i = 0; i <= N; i += M)
{
    output[i].x = signal[i].x;
    output[i].y = signal[i].y;
}
Following a vector_add example, I was able to get this:
__global__ void dec(const complex * signal, int N, int M, complex * output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= N)
    {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}
And this is where I am stuck. In my understanding, all threads would run in parallel, so I wasn't sure how to tell the kernel to process only every M-th element. An alternative I thought of was to check i % M == 0. But I'd like to see if there is anything else I should know first to tackle this problem, such as thread syncing, etc.
Any help is appreciated.
Something like this should work:
__global__ void dec(const complex * signal, int N, int M, complex * output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i *= M; // add this line
    if (i <= N)
    {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}
You should also make sure that you don't overflow the int variable i. You can manage this by not launching unnecessary threads, i.e. don't launch a grid of significantly more than N/M threads.
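For completeness, here is a minimal host-side launch sketch for the version above (where thread t handles index t*M). The device pointers d_signal and d_output and the block size of 256 are assumptions, and error checking is omitted:

int numElements = N / M + 1;  // covers indices 0, M, 2M, ..., up to N
int blockSize = 256;
int gridSize = (numElements + blockSize - 1) / blockSize;  // round up, no excess blocks
dec<<<gridSize, blockSize>>>(d_signal, N, M, d_output);
cudaDeviceSynchronize();

Threads in the last block whose index lands past N are masked out by the if (i <= N) guard in the kernel.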
I have a matrix of size 50000x100 and I need to sort each row using CUDA in C++. My hardware is an NVIDIA K80 card.
Since the number of columns is small, I am currently running the sorting algorithm inside a kernel, using a modified bubble sort that runs on all rows of the matrix.
I am wondering if there is a more efficient way to proceed. I tried to use thrust::sort inside my kernel, but it is much slower. I also tried a merge sort algorithm, but the recursive part of the algorithm didn't work inside my kernel.
==edit==
here is my kernel:
__global__ void computeQuantilesKernel(float *matIn, int nRows, int nCols, int nQuantiles, float *outsideValues, float *quantilesAve, int param2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float values[100];       // big enough for 100 columns
    int keys[100];
    int nQuant[100];         // big enough for 100 quantiles (percentiles)
    float thisQuantile[100];
    int quant;

    if (idx >= nRows) return;

    // read matIn from global memory
    for (int i = 0; i < nCols; i++)
    {
        values[i] = matIn[idx * nCols + i + param2 * nCols * nRows];
        keys[i] = i;
    }

    // bubble sort:
    int i, j;
    int temp;
    float tempVal;
    for (i = 0; i < nCols - 1; i++)
    {
        for (j = 0; j < nCols - i - 1; j++)
        {
            if (values[j + 1] < values[j]) // ascending order simply changes to <
            {
                tempVal = values[j]; // swap elements
                temp = keys[j];      // swap elements
                values[j] = values[j + 1];
                keys[j] = keys[j + 1];
                values[j + 1] = tempVal;
                keys[j + 1] = temp;
            }
        }
    }
    // end of bubble sort

    // reset nQuant and thisQuantile
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        nQuant[iQuant] = 0;
        thisQuantile[iQuant] = 0;
    }

    // compute sum of outsideValues for each quantile
    for (int i = 0; i < nCols; i++)
    {
        quant = (int)(((float)i + 0.5) / ((float)nCols / (float)nQuantiles)); // quantile like Matlab
        nQuant[quant]++;
        thisQuantile[quant] += outsideValues[idx * nCols + keys[i]];
    }

    // divide by the size of each quantile to get averages
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        quantilesAve[idx + nRows * iQuant + param2 * nQuantiles * nRows] = thisQuantile[iQuant] / (float)nQuant[iQuant];
    }
}
Your code as it stands uses a single thread to handle each of your rows separately. As a result you are starved for fast scratch memory (registers, L1 cache, shared memory). You are allocating at least 1600 bytes per thread -- that is a lot! You want to stay at around 128 bytes per thread (32 registers of 32 bits each). Secondly, you are using local arrays indexed at run time -- those arrays will be spilled into local memory, thrash your L1 cache and end up in global memory again (1600 B x 32 threads gives 51 KB, which is already at or above the limit of shared memory/L1).
For that reason I would suggest handling a single row per block of 64 or 128 threads instead, and keeping the row you sort in shared memory. Bubble sort is actually very easy to parallelize:
// nCols must be a compile-time constant here (e.g. 100),
// or use dynamically-sized "extern __shared__" memory instead
__shared__ float values[nCols];
... load the data ...
__syncthreads();
for (int i = 0; i < nCols/2; i++)
{
    int j = threadIdx.x;
    // even phase: compare-and-swap pairs (0,1), (2,3), ...
    if (j % 2 == 0 && j < nCols-1)
        if (values[j+1] < values[j])
            swap(values[j+1], values[j]);  // swap() is an assumed device-side helper
    __syncthreads();
    // odd phase: compare-and-swap pairs (1,2), (3,4), ...
    if (j % 2 == 1 && j < nCols-1)
        if (values[j+1] < values[j])
            swap(values[j+1], values[j]);
    __syncthreads();
}
Notice how your inner for j = ... loop is replaced by threadIdx.x, but the core idea of the algorithm stays the same: in each iteration I perform the bubble swap first only on even pairs and then only on odd pairs, so that neighboring threads never write to the same element.
I assume that nCols is smaller than your block size, which for 100 elements is easily achievable.
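To make the block-per-row idea concrete, here is a minimal sketch of a complete kernel. It assumes the column count is fixed at compile time (100, per the question) and one block per row with blockDim.x >= 100, and it carries the keys along with the values as in the original kernel; the name rowSortKernel and the output layout are made up for illustration, and the quantile accumulation is omitted to keep the sketch short:

#define NCOLS 100  // column count, assumed known at compile time

__global__ void rowSortKernel(const float *matIn, int *keysOut, int nRows)
{
    __shared__ float values[NCOLS];
    __shared__ int keys[NCOLS];

    int row = blockIdx.x;      // one block per row
    int j = threadIdx.x;
    if (row >= nRows) return;  // uniform across the block, so safe with __syncthreads()

    if (j < NCOLS) {           // load one row into shared memory
        values[j] = matIn[row * NCOLS + j];
        keys[j] = j;
    }
    __syncthreads();

    // odd-even transposition sort: NCOLS phases guarantee the row is sorted
    for (int i = 0; i < NCOLS/2 + 1; i++)
    {
        if (j % 2 == 0 && j < NCOLS - 1 && values[j+1] < values[j]) {
            float tv = values[j]; values[j] = values[j+1]; values[j+1] = tv;
            int   tk = keys[j];   keys[j]   = keys[j+1];   keys[j+1]   = tk;
        }
        __syncthreads();
        if (j % 2 == 1 && j < NCOLS - 1 && values[j+1] < values[j]) {
            float tv = values[j]; values[j] = values[j+1]; values[j+1] = tv;
            int   tk = keys[j];   keys[j]   = keys[j+1];   keys[j+1]   = tk;
        }
        __syncthreads();
    }

    if (j < NCOLS)             // write out the sorted key order
        keysOut[row * NCOLS + j] = keys[j];
}

Launched as rowSortKernel<<<nRows, 128>>>(d_matIn, d_keysOut, nRows), each block needs only 800 bytes of shared memory (100 floats plus 100 ints), leaving registers and L1 free.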
There are many ways that the above code can be improved further, for example:
Cut the thread count in half and use j = threadIdx.x*2 for the first phase of each iteration and j = threadIdx.x*2+1 for the second. This way no thread stays idle.
Use only 32 threads, each handling two values of j sequentially. This way your problem fits in a single warp, allowing you to drop __syncthreads() altogether. With 32 threads, you might also be able to use warp shuffle intrinsics.
Experiment with #pragma unroll, although the amount of produced code may make this infeasible. Profiling will help.
Also consider experimenting with a hardcoded merge sort instead of bubble sort. If my memory serves me right, when I implemented a warp-sized bubble sort and a merge sort with all loops unrolled, merge sort performed almost twice as fast as bubble sort. Note, this was several years ago, on the first generation of CUDA-capable cards.
I can't figure out why Visual C++ can't auto-vectorize this loop... any ideas?
I get:
testvec.cpp:12: info C5002: loop not vectorized due to reason '1200'
where reason code 1200 is:
Loop contains loop-carried data dependences that prevent vectorization. Different iterations of the loop interfere with each other such that vectorizing the loop would produce wrong answers, and the auto-vectorizer cannot prove to itself that there are no such data dependences.
But why?
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int const n = argc;
    double
        *const p1 = (double *)malloc(n * n * sizeof(*p1)),
        *const p2 = (double *)malloc(n * n * sizeof(*p2));
    for (int j = 0; j < n; ++j)
    {
        double const sj = p1[n * j];
        for (int i = 0; i < n; ++i)
        {
            double const sum = p1[i] + sj, old = p1[i + n * j];
            p2[i + n * j] = sum < old ? sum : old;
        }
    }
}
I finally found how to fix it... it seems the multiplication in n * j is the culprit.
Hoisting it out of the loop as int nj = n * j; and using nj in the inner loop instead fixes the problem.
I still don't know why this happens, though.
If anyone knows, please post it!
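For reference, here is a sketch of the hoisted version described above; the rest of the program is unchanged:

for (int j = 0; j < n; ++j)
{
    int const nj = n * j;  // hoisted: no multiplication left in the inner loop
    double const sj = p1[nj];
    for (int i = 0; i < n; ++i)  // this loop now vectorizes
    {
        double const sum = p1[i] + sj, old = p1[i + nj];
        p2[i + nj] = sum < old ? sum : old;
    }
}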
I am trying to overlap kernel execution on a Kepler device, but from the NVVP timeline it seems that the kernels are not overlapping. Here is the code:
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define NY 1024
#define NX 1024

__global__ void kernel1(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    b[j*NY+i] = A[i*NY+j];
}

__global__ void kernel2(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int time = 0; time < 100; time++)
        b[j*NY+i] += 10;
}

int main()
{
    int nstreams = 4;
    int *a, *b;
    struct timeval t1, t2;
    cudaMalloc((void**)&a, NX*NY*sizeof(int));
    cudaMalloc((void**)&b, NX*NY*sizeof(int));
    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamCreate(&(streams[i]));
    }
    gettimeofday(&t1, NULL);
    for (int newvar = 0; newvar < NX; newvar++)
    {
        kernel1<<<1,NY,0,streams[newvar%nstreams]>>>(newvar, a, b);
    }
    for (int newvar = 0; newvar < NX; newvar++)
    {
        kernel2<<<1,NY,0,streams[newvar%nstreams]>>>(newvar, a, b);
    }
    cudaDeviceSynchronize();
    gettimeofday(&t2, NULL);
    return 0;
}
Please suggest some tips.
CUDA version 5.5
NVVP version 5.5
Linux machine, Ubuntu 12.10
Fundamentally, I think the problem is that your kernels are not executing for long enough. Each kernel runs for only a few microseconds, and the kernel launch overhead is also a few microseconds, so you're not seeing any overlap: by the time the API has completed the setup of the next kernel launch, the previous kernel has already finished.
I modified your kernel1 as follows:
__global__ void kernel1(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int q = 0; q < 1000; q++)        // busy-work loop to stretch the kernel's duration
        b[j*NY+i] = A[i*NY+j] + q/(j+1);  // j+1 avoids an integer division by zero when j == 0
}
There's nothing magical or special about these modifications; I'm just looking for a way to increase the kernel execution duration (from a few microseconds to a few milliseconds).
With the above changes, I saw good overlap of your kernel1 in the profiler.
I imagine something similar could be done with your kernel2.
You should also make sure you have not deselected the "enable concurrent kernel profiling" checkbox when you start a profiling session in nvvp.
I need frequent usage of matrix_vector_mult(), which multiplies a matrix with a vector; its implementation is below.
Question: Is there a simple way to make it significantly, at least twice, faster?
Remarks:
1) The size of the matrix is about 300x50. It doesn't change during the run.
2) It must work on both Windows and Linux.
double vectors_dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

void matrix_vector_mult(const double **mat, const double *vec, double *result, int rows, int cols)
{ // in matrix form: result = mat * vec;
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = vectors_dot_prod(mat[i], vec, cols);
    }
}
This is something that in theory a good compiler should do by itself; however, I tried it on my system (g++ 4.6.3) and got about twice the speed on a 300x50 matrix by hand-unrolling 4 multiplications (about 18 us per matrix instead of 34 us per matrix):
double vectors_dot_prod2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
    for (; i <= n-4; i += 4)
    {
        res += (x[i] * y[i] +
                x[i+1] * y[i+1] +
                x[i+2] * y[i+2] +
                x[i+3] * y[i+3]);
    }
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}
I expect however the results of this level of micro-optimization to vary wildly between systems.
As Zhenya says, just use a good BLAS or matrix math library.
If for some reason you can't do that, see if your compiler can unroll and/or vectorize your loops. Making sure rows and cols are both constants at the call site may help, assuming the functions you posted are available for inlining.
If you still can't get the speedup you need, you're looking at manual unrolling and vectorization using compiler extensions or inline assembler.
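As one concrete illustration of the intrinsics route, here is a sketch of the dot product using SSE2 (available from both MSVC and gcc, so it satisfies the Windows/Linux requirement); unaligned loads keep it safe for arbitrary pointers:

#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: SSE2 dot product processing two doubles per step.
double vectors_dot_prod_sse2(const double *x, const double *y, int n)
{
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i <= n - 2; i += 2)
        acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(x + i), _mm_loadu_pd(y + i)));

    double buf[2];
    _mm_storeu_pd(buf, acc);
    double res = buf[0] + buf[1];  // horizontal sum of the two lanes

    for (; i < n; i++)             // scalar tail for odd n
        res += x[i] * y[i];
    return res;
}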
If the size is constant and known in advance, pass it in as a compile-time constant (e.g. a preprocessor define or template parameter), which will permit the compiler to optimize more fully.
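And a sketch of that last suggestion, baking the fixed 300x50 size from the question in as template parameters (a #define works just as well):

// Hypothetical fixed-size variant: with COLS a compile-time constant, the
// compiler can fully unroll and vectorize the inner loop.
template <int ROWS, int COLS>
void matrix_vector_mult_fixed(const double **mat, const double *vec, double *result)
{
    for (int i = 0; i < ROWS; i++)
    {
        double res = 0.0;
        for (int j = 0; j < COLS; j++)  // trip count known at compile time
            res += mat[i][j] * vec[j];
        result[i] = res;
    }
}

// usage: matrix_vector_mult_fixed<300, 50>(mat, vec, result);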
I'm performing matrix multiplication with this simple algorithm. To be more flexible, I used objects for the matrices, which contain dynamically created arrays.
Comparing this solution to my first one with static arrays, it is 4 times slower. What can I do to speed up data access? I don't want to change the algorithm.
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);
    for (int i = 0; i < a.dim(); i++)
        for (int j = 0; j < a.dim(); j++) {
            int sum = 0;
            for (int k = 0; k < a.dim(); k++)
                sum += a(i,k) * b(k,j);
            c(i,j) = sum;
        }
    return c;
}
EDIT
I corrected my question above! I added the full source code below and tried some of your advice:
swapped the k and j loop iterations -> performance improvement
declared dim() and operator()() as inline -> performance improvement
passed arguments by const reference -> performance loss! Why? So I don't use it.
The performance is now nearly the same as it was in the old program. Maybe there should be a bit more improvement.
But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW
c99 main.c -o matrix -O3
NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3
EDIT
Here are some results: a comparison between the standard algorithm (std), the version with swapped j and k loop order (swap), and the blocked algorithm with block size 13 (block).
Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);
    for (int i = 0; i < a.dim(); i++)
        for (int k = 0; k < a.dim(); k++)
            for (int j = 0; j < a.dim(); j++)  // swapped order
                c(i,j) += a(i,k) * b(k,j);
    return c;
}
That's because a k index on the innermost loop causes a cache miss in b on every iteration. With j as the innermost index, both c and b are accessed contiguously, while a stays put.
Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).
How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple of cache lines, you'd be better off with a block access pattern instead of a full row at a time, as sketched below.
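A minimal sketch of such a blocked (tiled) pattern, assuming the same matrix interface as above and that the constructor zero-initializes the result (as the swapped version also requires); the block size of 16 is just a starting point to tune by profiling:

#include <algorithm>  // for std::min

matrix mult_block(matrix a, matrix b) {
    const int n = a.dim();
    const int B = 16;            // tile size: pick so three BxB tiles fit in L1
    matrix c(n, false, false);   // assumed zero-initialized, as in mult_std above
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                // multiply one BxB tile; std::min clamps the last, partial tile
                for (int i = ii; i < std::min(ii + B, n); i++)
                    for (int k = kk; k < std::min(kk + B, n); k++)
                        for (int j = jj; j < std::min(jj + B, n); j++)
                            c(i,j) += a(i,k) * b(k,j);
    return c;
}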
You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.
Use SIMD if you can. If you do extensive vector math, you absolutely have to use something like VMX/SSE registers, assuming you are using a platform capable of it; otherwise you will incur a huge performance hit.
Don't pass complex types like matrix by value - use a const reference.
Don't call a function in each iteration - cache dim() outside your loops.
Although compilers typically optimize this efficiently, it's often a good idea to have the caller provide a matrix reference for your function to fill in rather than returning a matrix by value, which in some cases may result in an expensive copy operation.
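A short sketch of that out-parameter style, with the other points applied (const references, dim() cached outside the loops); the caller is assumed to pass a zero-initialized c:

void mult_std(const matrix& a, const matrix& b, matrix& c) {
    const int n = a.dim();  // dim() cached, not called in every iteration
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c(i,j) += a(i,k) * b(k,j);
}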
Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock's code since it spares some index arithmetic.
#include <cstring>  // for memset

static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
    memset(dest, 0x0, dim * dim * sizeof(float));
    for (int i = 0; i < dim; i++) {
        for (int k = 0; k < dim; k++)
        {
            const float* a = srcA + i * dim + k;
            const float* b = srcB + k * dim;
            float* c = dest + i * dim;
            float* cMax = c + dim;
            while (c < cMax)
            {
                *c++ += (*a) * (*b++);
            }
        }
    }
}
Pass the parameters by const reference to start with:
matrix mult_std(matrix const& a, matrix const& b) {
To give you more detail we would need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.
The problem is undoubtedly in your code, as this problem has been solved a million times before.
Also, when asking this type of question, ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.
Without the code we are just guessing.
Edit
After fixing the main bug in the original C code (a buffer over-run), I have updated the code to run the tests side by side in a fair comparison:
// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

// DEFINES -------------------------------------------------------------------
// The original problem was here. MAXDIM was 500, but we were using arrays
// that had a size of 512 in each dimension. This caused a buffer overrun that
// clobbered the dim variable and reset it to 0, which in turn caused the
// multiplication loop to fall out before it had finished (as the loop was
// controlled by this global variable).
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage, as the compiler can optimize the
// loop explicitly for the fixed-size arrays and thus unroll loops more efficiently.
#define MAXDIM 512
#define RUNS   10
// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
    public:
        matrix(int dim)
            : dim_(dim)
        {
            data_ = new int[dim_ * dim_];
        }
        inline int dim() const {
            return dim_;
        }
        inline int& operator()(unsigned row, unsigned col) {
            return data_[dim_*row + col];
        }
        inline int operator()(unsigned row, unsigned col) const {
            return data_[dim_*row + col];
        }
    private:
        int dim_;
        int* data_;
};
// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++)
        for (int c = 0; c < MAXDIM; c++)
            matrix[r][c] = rand() % 100;
}

void random_matrix_class(matrix& matrix) {
    for (int r = 0; r < matrix.dim(); r++)
        for (int c = 0; c < matrix.dim(); c++)
            matrix(r, c) = rand() % 100;
}

template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
    float time = 0;
    for (int i = 0; i < RUNS; i++) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        f(a, b, c);
        gettimeofday(&end, NULL);
        long s = start.tv_sec * 1000 + start.tv_usec / 1000;
        long e = end.tv_sec * 1000 + end.tv_usec / 1000;
        time += e - s;
    }
    return time / RUNS;
}
// SEQ MULTIPLICATION ----------------------------------------------------------
void mult_seq(int const (&a)[MAXDIM][MAXDIM], int const (&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++) {
        for (int c = 0; c < MAXDIM; c++) {
            z[r][c] = 0;
            for (int i = 0; i < MAXDIM; i++)
                z[r][c] += a[r][i] * b[i][c];
        }
    }
}

void mult_std(matrix const& a, matrix const& b, matrix& z) {
    for (int r = 0; r < a.dim(); r++) {
        for (int c = 0; c < a.dim(); c++) {
            z(r,c) = 0;
            for (int i = 0; i < a.dim(); i++)
                z(r,c) += a(r,i) * b(i,c);
        }
    }
}
// MAIN ------------------------------------------------------------------------
using namespace std;

int main(int argc, char* argv[]) {
    srand(time(NULL));

    int matrix_a[MAXDIM][MAXDIM];
    int matrix_b[MAXDIM][MAXDIM];
    int matrix_c[MAXDIM][MAXDIM];
    random_matrix(matrix_a);
    random_matrix(matrix_b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

    matrix a(MAXDIM);
    matrix b(MAXDIM);
    matrix c(MAXDIM);
    random_matrix_class(a);
    random_matrix_class(b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_std, a, b, c));

    return 0;
}
The results now:
$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000
$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000
From this we see the C code is about twice as fast as the C++ code when fully optimized. I cannot see the reason in the code.
I'm taking a wild guess here, but if dynamically allocating the matrices makes such a huge difference, maybe the problem is memory fragmentation. Again, I have no idea how the underlying matrix is implemented.
Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself? See the sketch below.
Also, does the dim() method have any extra complexity? I would declare it inline, too.
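A minimal sketch of the contiguous-allocation idea from above; the helper names are made up for illustration:

#include <cstdlib>

// Hypothetical helpers: one contiguous block for the data, plus a hand-built
// table of row pointers so m[r][c] indexing still works.
int** alloc_matrix(int n, int*& block)
{
    block = (int*)malloc(n * n * sizeof(int));     // single contiguous allocation
    int** rows = (int**)malloc(n * sizeof(int*));  // row-pointer table
    for (int r = 0; r < n; r++)
        rows[r] = block + r * n;                   // each row points into the block
    return rows;
}

void free_matrix(int** rows, int* block)
{
    free(rows);
    free(block);
}

This keeps the data in one block (cache- and prefetch-friendly, no fragmentation) while preserving two-level indexing.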