Kepler - Concurrent kernel launches not overlapping - concurrency

I am trying to overlap kernel execution on a Kepler device, but from the NVVP timeline it seems that the kernels are not overlapping. Here is the code:
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define NY 1024
#define NX 1024

__global__ void kernel1(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    b[j*NY+i] = A[i*NY+j];
}

__global__ void kernel2(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int time = 0; time < 100; time++)
        b[j*NY+i] += 10;
}

int main()
{
    int nstreams = 4;
    int *a, *b;
    struct timeval t1, t2;
    cudaMalloc((void**)&a, NX*NY*sizeof(int));
    cudaMalloc((void**)&b, NX*NY*sizeof(int));
    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamCreate(&(streams[i]));
    }
    gettimeofday(&t1, NULL);
    for (int newvar = 0; newvar < NX; newvar++)
    {
        kernel1<<<1, NY, 0, streams[newvar % nstreams]>>>(newvar, a, b);
    }
    for (int newvar = 0; newvar < NX; newvar++)
    {
        kernel2<<<1, NY, 0, streams[newvar % nstreams]>>>(newvar, a, b);
    }
    cudaDeviceSynchronize();
    gettimeofday(&t2, NULL);
    return 0;
}
Please suggest some tips.
CUDA version 5.5, NVVP version 5.5, on a Linux machine (Ubuntu 12.10).

Fundamentally I think the problem is that your kernels are not executing long enough. The execution time of your kernels is a few microseconds, and the kernel launch overhead is also a few microseconds, so you're not seeing any overlap. By the time the API has completed the setup of the new kernel launch, the previous kernel has finished.
I modified your kernel1 as follows:
__global__ void kernel1(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int q = 0; q < 1000; q++)
        b[j*NY+i] = A[i*NY+j] + q/(j+1);   // j+1 avoids an integer division by zero when j == 0
}
There's nothing magical or special about these modifications; I'm just looking for a way to increase the kernel execution duration (from a few microseconds to a few milliseconds).
With the above changes, I saw good overlap of your kernel1 in the profiler.
I imagine something similar could be done with your kernel2.
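For example, here's a similar (untested) stretch of kernel2; the extra outer loop exists purely to lengthen the execution time for profiling, and it changes the numerical result:

__global__ void kernel2(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int q = 0; q < 1000; q++)                // extra work, only to stretch the run time
        for (int time = 0; time < 100; time++)
            b[j*NY+i] += 10 + q/(j+1);            // same trick as kernel1: work that depends on j, so it is less likely to be optimized away
}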
You should also make sure you have not deselected the "enable concurrent kernel profiling" checkbox when you start a profiling session in nvvp.

Related

Skipping every M elements when iterating through an array in CUDA

I am new to CUDA programming and I have been trying to figure out how to convert the following code into CUDA code.
for (int i = 0; i <= N; i += M)
{
    output[i].x = signal[i].x;
    output[i].y = signal[i].y;
}
Following a vector_add example, I was able to get this:
__global__ void dec(const complex * signal, int N, int M, complex * output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= N)
    {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}
And this is where I am stuck. In my understanding, all threads/units would calculate in parallel, so I wasn't sure where to tell the iterator to skip every M elements in CUDA. An alternative I thought of was to check i % M == 0. But I'd like to see if there is anything else I should know first to tackle this problem, such as thread synchronization, etc.
Any help is appreciated.
Something like this should work:
__global__ void dec(const complex * signal, int N, int M, complex * output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i *= M; // add this line
    if (i <= N)
    {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}
You should also make sure that you don't overflow the int variable. This should be manageable by not launching unnecessary threads, i.e. don't launch a grid of significantly more than N/M threads.
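As a rough illustration (a sketch only; d_signal and d_output are hypothetical device pointers, and the block size is an arbitrary choice), the launch could be sized like this:

int threadsNeeded = N / M + 1;                    // one thread per strided element
int block = 256;                                  // arbitrary block size
int grid = (threadsNeeded + block - 1) / block;   // round up
dec<<<grid, block>>>(d_signal, N, M, d_output);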

CUDA Array-Vector multiply

Hi, I am making my first steps in CUDA but I think I am not getting it right.
I am trying to multiply a two-dimensional array by a vector, but something is not working.
Here is the code I am trying to figure out:
#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void Multiply(int A[N][N], int B[N], int C[N]){
    int i = threadIdx.x;
    int j = threadIdx.y;
    int sum = A[i][j] * B[j];
    C[i] = sum;
    printf("%d,%d ", sum, C[i]);
}

int main(){
    int A[N][N] = { {1,1},
                    {1,1} };
    int B[N] = {4,6};
    int C[N] = {0,0};
    int (*aA)[N], (*aB), (*aC);
    cudaMalloc((void**)&aA, (N*N)*sizeof(int));
    cudaMalloc((void**)&aB, (N)*sizeof(int));
    cudaMalloc((void**)&aC, (N)*sizeof(int));
    cudaMemcpy(aA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(aB, B, (N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(aC, C, (N)*sizeof(int), cudaMemcpyHostToDevice);
    int numBlocks = 1;
    dim3 threadsPerBlock(N,N);
    Multiply<<<numBlocks,threadsPerBlock>>>(aA,aB,aC);
    cudaMemcpy(C, aC, (N)*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(aA);
    cudaFree(aB);
    cudaFree(aC);
    printf("\n");
    system("pause");
}
In this case the output is: 4,6 4,6 6,6 6,6. So sum is getting the right values, but C[i] always ends up as 6, even though sum is assigned to it.
What am I doing wrong?
1. Any time you are having trouble with a CUDA code, it's a good idea to use proper cuda error checking and run your code with cuda-memcheck. That's just a boilerplate statement I make; it wouldn't actually turn up issues with the code you have shown in this case.
2. As was pointed out already in an answer now deleted, you are not actually summing anything together. Even though you have a variable named sum, it is not the sum of anything: there is no + or summation operation anywhere in your kernel code, so you are not writing a kernel that will sum anything together.
3. To produce a correct result, your kernel depends on multiple threads cooperatively updating a single location (C[i]), and that requires coordination between threads. Without any coordination, the threads are in a race condition with each other and the results will be unpredictable. We could sort this out with a parallel reduction, summing the partial products from the individual threads, or, for simplicity, with an atomicAdd operation, which forces the threads to update (add to) C[i] one by one so they don't step on each other. Using atomicAdd therefore also supplies the necessary addition (+) operation, which is lacking in your kernel.
Here's a worked code with items 2 and 3 addressed. You can run it with cuda-memcheck to verify behavioral correctness even though it has no explicit error checking:
$ cat t1037.cu
#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void Multiply(int A[N][N], int B[N], int C[N]){
    int i = threadIdx.x;
    int j = threadIdx.y;
    int product = A[i][j] * B[j];
    atomicAdd(C+i, product);   // serialize the updates so the partial products accumulate correctly
    // printf("%d,%d ", product, C[i]);
}

int main(){
    int A[N][N] = { {1,1},
                    {1,1} };
    int B[N] = {4,6};
    int C[N] = {0,0};
    int (*aA)[N], (*aB), (*aC), i;
    cudaMalloc((void**)&aA, (N*N)*sizeof(int));
    cudaMalloc((void**)&aB, (N)*sizeof(int));
    cudaMalloc((void**)&aC, (N)*sizeof(int));
    cudaMemcpy(aA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(aB, B, (N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(aC, C, (N)*sizeof(int), cudaMemcpyHostToDevice);
    int numBlocks = 1;
    dim3 threadsPerBlock(N,N);
    Multiply<<<numBlocks,threadsPerBlock>>>(aA,aB,aC);
    cudaMemcpy(C, aC, (N)*sizeof(int), cudaMemcpyDeviceToHost);
    for (i = 0; i < N; i++){
        printf("C[%d] = %d\n", i, C[i]);
    }
    cudaFree(aA);
    cudaFree(aB);
    cudaFree(aC);
    printf("\n");
}
$ nvcc -o t1037 t1037.cu
$ cuda-memcheck ./t1037
========= CUDA-MEMCHECK
C[0] = 10
C[1] = 10
========= ERROR SUMMARY: 0 errors
$

Find max element in array OpenMP and PPL versions run much slower than serial code

I'm trying to implement two versions of a function that finds the max element in an array of floats. However, my parallel functions appear to run much slower than the serial code.
With an array of 4194304 (2048 * 2048) floats, I get the following numbers (in microseconds):
serial code: 9433
PPL code: 24184 (more than two times slower)
OpenMP code: 862093 (almost 100 times slower)
Here's the code:
PPL:
float find_largest_element_in_matrix_PPL(float* m, size_t dims)
{
    float max_element;
    concurrency::combinable<float> locals([] { return (float)INT_MIN; });
    concurrency::parallel_for(size_t(0), dims * dims, [&locals, m](size_t curr)
    {
        float &localMax = locals.local();
        localMax = max<float>(localMax, m[curr]);   // compare the element, not the index
    });
    max_element = locals.combine([](float left, float right) { return max<float>(left, right); });
    return max_element;
}
OpenMP:
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    int i, row, col, index;
    #pragma omp parallel for private(i) shared(max_value, index)
    for (i = 0; i < dims * dims; ++i)
    {
        #pragma omp critical
        if (m[i] > max_value)
        {
            max_value = m[i];
            index = i;
        }
    }
    //row = index / dims;
    //col = index % dims;
    return max_value;
}
What's making the code run so slowly? Am I missing something?
Could you help me find out what I'm doing wrong?
So, as Baum mit Augen noticed, the problem with OpenMP was that I had a critical section around the whole comparison, so the code didn't actually run in parallel but serially.
Removing that critical section did the trick.
As for PPL, I've found out that it does a lot more preparation (creating threads and so on) than OpenMP does, hence the slowdown.
Update
So, here's the corrected variant for finding the max element with OpenMP (a critical section is still needed, but only around the update inside the if block):
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < dims * dims; ++i)
    {
        if (m[i] > max_value)
        {
            #pragma omp critical
            {
                if (m[i] > max_value)   // re-check: another thread may have raised max_value since the unguarded test
                    max_value = m[i];
            }
        }
    }
    return max_value;
}
PS: not tested.
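For what it's worth, OpenMP 3.1 and later also provide a built-in max reduction, which avoids the critical section entirely. A minimal sketch (assuming an OpenMP 3.1+ compiler; also not tested):

float find_largest_element_in_matrix_reduction(const float *m, unsigned const int dims)
{
    float max_value = m[0];
    // each thread keeps a private maximum; OpenMP combines them at the end
    #pragma omp parallel for reduction(max:max_value)
    for (int i = 0; i < (int)(dims * dims); ++i)
        if (m[i] > max_value)
            max_value = m[i];
    return max_value;
}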

Simple and fast matrix-vector multiplication in C / C++

I need frequent usage of matrix_vector_mult(), which multiplies a matrix with a vector; its implementation is below.
Question: Is there a simple way to make it significantly (at least twice) faster?
Remarks: 1) The size of the matrix is about 300x50 and it doesn't change during the run. 2) It must work on both Windows and Linux.
double vectors_dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

void matrix_vector_mult(const double **mat, const double *vec, double *result, int rows, int cols)
{ // in matrix form: result = mat * vec;
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = vectors_dot_prod(mat[i], vec, cols);
    }
}
This is something that, in theory, a good compiler should do by itself. However, I tried it on my system (g++ 4.6.3) and got about twice the speed on a 300x50 matrix by hand-unrolling 4 multiplications (about 18us per matrix instead of 34us per matrix):
double vectors_dot_prod2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
    for (; i <= n-4; i += 4)
    {
        res += (x[i] * y[i] +
                x[i+1] * y[i+1] +
                x[i+2] * y[i+2] +
                x[i+3] * y[i+3]);
    }
    for (; i < n; i++)   // handle the remaining 0-3 elements
    {
        res += x[i] * y[i];
    }
    return res;
}
However, I expect the results of this level of micro-optimization to vary wildly between systems.
As Zhenya says, just use a good BLAS or matrix math library.
If for some reason you can't do that, see if your compiler can unroll and/or vectorize your loops; making sure rows and cols are both constants at the call site may help, assuming the functions you posted are available for inlining.
If you still can't get the speedup you need, you're looking at manual unrolling, and vectorizing using extensions or inline assembler.
If the size is constant and known in advance, pass it in as a compile-time constant (e.g. a preprocessor define), which will permit the compiler to optimize more fully.
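For instance, a sketch of that idea (assuming the 300x50 size really is fixed; the fixed-size, contiguous-array layout is an illustrative choice, not the original signature):

#define ROWS 300
#define COLS 50

/* With COLS known at compile time (and the matrix stored contiguously),
   the compiler can unroll and vectorize the inner loop on its own. */
void matrix_vector_mult_fixed(const double mat[ROWS][COLS],
                              const double *vec, double *result)
{
    for (int i = 0; i < ROWS; i++)
    {
        double res = 0.0;
        for (int j = 0; j < COLS; j++)
            res += mat[i][j] * vec[j];
        result[i] = res;
    }
}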

Multithreading taking equal time as single thread quick sorting

I'm working on Linux, but multithreading and single threading both take 340 ms. Can someone tell me what's wrong with what I'm doing?
Here is my code
#include <iostream>
#include <fstream>
#include <pthread.h>
#include <time.h>

#define SIZE_OF_ARRAY 1000000

using namespace std;

struct parameter
{
    int *data;
    int left;
    int right;
};

void readData(int *data)
{
    fstream iFile("Data.txt");
    for (int i = 0; i < SIZE_OF_ARRAY; i++)
        iFile >> data[i];
}
int threadCount = 4;

int Partition(int *data, int left, int right)
{
    int i = left, j = right, temp;
    int pivot = data[(left + right) / 2];
    while (i <= j)
    {
        while (data[i] < pivot)
            i++;
        while (data[j] > pivot)
            j--;
        if (i <= j)
        {
            temp = data[i];
            data[i] = data[j];
            data[j] = temp;
            i++;
            j--;
        }
    }
    return i;
}

void QuickSort(int *data, int left, int right)
{
    int index = Partition(data, left, right);
    if (left < index - 1)
        QuickSort(data, left, index - 1);
    if (index < right)
        QuickSort(data, index, right);   // with this partition scheme data[index] is not yet in its final position
}
//Multi threading code starts from here
void *Sort(void *param)
{
    parameter *param1 = (parameter *)param;
    QuickSort(param1->data, param1->left, param1->right);
    pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
    clock_t start, diff;
    int *data = new int[SIZE_OF_ARRAY];
    pthread_t threadID, threadID1;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    parameter param, param1;
    readData(data);
    start = clock();
    int index = Partition(data, 0, SIZE_OF_ARRAY - 1);
    if (0 < index - 1)
    {
        param.data = data;
        param.left = 0;
        param.right = index - 1;
        pthread_create(&threadID, &attr, Sort, (void *)&param);
    }
    if (index < SIZE_OF_ARRAY - 1)
    {
        param1.data = data;
        param1.left = index;               // data[index] still needs sorting
        param1.right = SIZE_OF_ARRAY - 1;  // last valid index
        pthread_create(&threadID1, &attr, Sort, (void *)&param1);
    }
    pthread_attr_destroy(&attr);
    pthread_join(threadID, NULL);
    pthread_join(threadID1, NULL);
    diff = clock() - start;
    cout << "Sorting Time = " << diff * 1000 / CLOCKS_PER_SEC << "\n";
    delete [] data;
    return 0;
}
//Multithreading Ends here
Single-thread main function:
int main(int argc, char *argv[])
{
    clock_t start, diff;
    int *data = new int[SIZE_OF_ARRAY];
    readData(data);
    start = clock();
    QuickSort(data, 0, SIZE_OF_ARRAY - 1);
    diff = clock() - start;
    cout << "Sorting Time = " << diff * 1000 / CLOCKS_PER_SEC << "\n";
    delete [] data;
    return 0;
}
//Single thread code ends here
Some of the functions are shared between the single-threaded and multithreaded versions.
clock returns total CPU time, not wall time.
If you have 2 CPUs and 2 threads, then after running both threads simultaneously for one second, clock will return a CPU time of 2 seconds (the sum of the CPU times of each thread).
So the result is totally expected. It does not matter how many CPUs you have, the total running time summed over all CPUs will be the same.
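If you want wall-clock time instead, one option on Linux is the POSIX clock_gettime (a sketch; older glibc versions need -lrt when linking):

#include <time.h>

// CLOCK_MONOTONIC measures elapsed real time, so two threads each running
// ~170 ms in parallel show up as ~170 ms, not as a 340 ms CPU-time sum.
struct timespec ts1, ts2;
clock_gettime(CLOCK_MONOTONIC, &ts1);
// ... sorting work ...
clock_gettime(CLOCK_MONOTONIC, &ts2);
long ms = (ts2.tv_sec - ts1.tv_sec) * 1000
        + (ts2.tv_nsec - ts1.tv_nsec) / 1000000;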
Note that you call Partition once from the main thread...
The code works on the same memory block, which prevents one CPU from working while the other accesses that same memory block. Unless your data is really large, you're likely to have many such hits.
Finally, if your algorithm works at memory speed when you run it with one thread, adding more threads doesn't help. I did such tests a while back with image data, and having multiple threads decreased the total speed because the process was so memory intensive that both threads were fighting over memory access... and the result was worse than not having threads at all.
What makes today's really fast computers fast is running one very intensive process per computer, not a large number of threads (or processes) on a single computer.
Build a thread pool with a producer-consumer queue and 24 threads hanging off it. Partition your data into two and issue a mergesort task object to the pool; the mergesort object should issue further pairs of mergesorts to the queue and wait on a signal for them to finish, and so on, until a mergesort object finds that it has [L1 cache-size data]. The object then quicksorts its data and signals completion to its parent task (a rough sketch follows after these points).
If that doesn't turn out to be blindingly quick on 24 cores, I'll stop posting about threads..
..and it will handle multiple sorts in parallel.
..and the pool can be used for other tasks.
.. and there is no performance-destroying, deadlock-generating join(), synchronize() (if you except the P-C queue, which only locks for long enough to push an object ref on), no thread-creation overhead, and no dodgy thread-stopping/terminating/destroying code. Like the barbers, there is no waiting - as soon as a thread is finished with a task it can get another.
No thread micro-management, no tuning (you could create 64 threads now, ready for the next generation of boxes). You could make the thread count tuneable - just add more threads at runtime, or delete some by queueing up poison-pills.
You don't need a reference to the threads at all - just set 'em off (pass the queue as a parameter).
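Here's a rough sketch of that pool idea, assuming C++11 (std::thread and a mutex-protected queue standing in for a hand-rolled pool; all names are hypothetical). One deliberate simplification: tasks never block waiting on children - partitions below the cutoff are sorted directly, and an atomic task counter plus a condition variable replaces the parent/child completion signals:

#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SortPool {
public:
    explicit SortPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { work(); });
    }
    void submit(std::function<void()> task) {          // producer side
        pending.fetch_add(1);
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(task)); }
        cv.notify_one();
    }
    void wait_idle() {                                 // block until every task (and its children) has run
        std::unique_lock<std::mutex> lk(m);
        done.wait(lk, [this] { return pending.load() == 0; });
    }
    ~SortPool() {
        { std::lock_guard<std::mutex> lk(m); stop = true; }
        cv.notify_all();
        for (auto &w : workers) w.join();
    }
private:
    void work() {                                      // consumer loop
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return stop || !q.empty(); });
                if (stop && q.empty()) return;
                task = std::move(q.front());
                q.pop();
            }
            task();
            if (pending.fetch_sub(1) == 1) {           // last outstanding task: wake wait_idle()
                std::lock_guard<std::mutex> lk(m);
                done.notify_all();
            }
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv, done;
    std::atomic<int> pending{0};
    bool stop = false;
};

// Split until the partition is small (standing in for "L1 cache-size data"),
// then sort that piece directly; Partition() is the one from the question.
void poolSort(SortPool &pool, int *data, int left, int right) {
    if (right - left < 16384) {
        std::sort(data + left, data + right + 1);
        return;
    }
    int index = Partition(data, left, right);
    pool.submit([&pool, data, left, index] { poolSort(pool, data, left, index - 1); });
    pool.submit([&pool, data, index, right] { poolSort(pool, data, index, right); });
}

// Usage:
//   SortPool pool(24);
//   pool.submit([&] { poolSort(pool, data, 0, SIZE_OF_ARRAY - 1); });
//   pool.wait_idle();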