I have a quite simple 1D CUDA kernel doing a inclusive sum, that is, if we have a input 1D array
[ x_0, x_1, x_2,..., x_n-1 ]
the output would be
[ x_0, x_0+x_1, x_0+x_1+x_2, ..., x_0+x_1+...x_n-1 ].
The kernel shown below actually does not completely finish this job, on the other hand it finishes its job within each block. Anyway my question is not about how I can completely implement the inclusive sum, but I think there is a possible negative-indexing error during thread calculation.
__global__ void parallel_scan_inefficient(float* input, float* output){
// num_threads and max_i are globalled defined
__shared__ float temp[num_threads];
int i = blockIdx.x*blockDim.x+threadIdx.x;//global index
if (i<max_i)
{
temp[threadIdx.x]=input[i];
}
for (unsigned int stride=1;stride<=threadIdx.x; stride*=2)
{
__syncthreads();
temp[threadIdx.x]+=temp[threadIdx.x-stride];
}
output[i]=temp[threadIdx.x];
}
This piece of program is in fact from Hwu&Kirk's textbook "Programming Massively Parallel Processors" Chapter 9 pp.203.
So as you can see in the for-loop
for (unsigned int stride=1;stride<=threadIdx.x; stride*=2)
{
__syncthreads();
temp[threadIdx.x]+=temp[threadIdx.x-stride];
}
since "threadIdx.x" starts from 0 for each block, but "stride" starts from 1. Wouldn't we see for example temp[-1] for the first element in a block ? Also after one iteration, "stride" then becomes 2 and we will see temp[-2] for threadIdx.x=0 ?
This doesn't quite make sense to me, though CUDA compiler doesn't report any errors - I ran cuda-memcheck for this kernel and it is still fine. Also the results are right (of course it is right for each block, as I said this kernel only partially finishes the inclusive sum)
I reckon I might make a very stupid mistake but I just couldn't spot it. Any light would be much appreciated. Many thanks.
If you have a code like this:
for (unsigned int stride=1;stride<=threadIdx.x; stride*=2)
{
__syncthreads();
temp[threadIdx.x]+=temp[threadIdx.x-stride];
}
Then for thread where threadIdx.x == 0 the for loop will be skipped entirely. Try running the following code in main:
for (unsigned int stride=1;stride<=0; stride*=2)
{
cout << "I am running" << endl;
}
And you'll see there is nothing in the console.
Related
Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.
I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then, I remembered threading. I thought I could make the loop run even faster if I split it in 4 parts, one for each core of my machine. So this is what I tried to do:
void doYourThing(int size,int threadNumber,int numOfThreads) {
int start = (threadNumber - 1) * size / numOfThreads;
int end = threadNumber * size / numOfThreads;
for (int i = start; i < end; i++) {
//Calculations...
}
}
int main(void) {
int size = 100000;
int numOfThreads = 4;
int start = 0;
int end = size / numOfThreads;
std::thread coreB(doYourThing, size, 2, numOfThreads);
std::thread coreC(doYourThing, size, 3, numOfThreads);
std::thread coreD(doYourThing, size, 4, numOfThreads);
for (int i = start; i < end; i++) {
//Calculations...
}
coreB.join();
coreC.join();
coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1)Do my threads really run on a different core? If that's true, I would expect a greater increase in speed. More specifically, I assumed it would take close to 1/4 of the initial time.
2)If they don't, should I use even more threads to split the work? Will it make my loop faster or slower?
(1).
The question #François Andrieux asked is good. Because in the original code there is a well-structured for-loop, and if you used -O3 optimization, the compiler might be able to vectorize the computation. This vectorization will give you speedup.
Also, it depends on what is the critical path in your computation. According to Amdahl's law, the possible speedups are limited by the un-parallelisable path. You might check if the computation are reaching some variable where you have locks, then the time could also spend to spin on the lock.
(2). to find out the total number of cores and threads on your computer you may have lscpu command, which will show you the cores and threads information on your computer/server
(3). It is not necessarily true that more threads will have a better performance
There is a header-only library in Github which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results into another vector. In this case, all you need to do is to say is
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like it is a good idea to split the work into 4 chunks like in your example code, you could do it for example like this:
#include "Lazy.h"
void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
for (int i = from; i < to; ++i) {
// Calculate vecOut[i]
}
}
int main(void) {
int size = 100000;
MyVector vecIn(size), vecOut(size)
// Load vecIn vector with input data...
Lazy::runForAll({{std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
[&](auto indexPair) {
doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
});
// Now the results are in vecOut
}
README.md gives further examples on parallel execution which you might find useful.
I'm trying to convert c++ code into Cuda code and I've got the following triple nested for loop that will fill an array for further OpenGL rendering (i'm simply creating a coordinate vertices array):
for(int z=0;z<263;z++) {
for(int y=0;y<170;y++) {
for(int x=0;x<170;x++) {
g_vertex_buffer_data_3[i]=(float)x+0.5f;
g_vertex_buffer_data_3[i+1]=(float)y+0.5f;
g_vertex_buffer_data_3[i+2]=-(float)z+0.5f;
i+=3;
}
}
}
I would like to get faster operations and so I'll use Cuda for some operations like the one listed above. I want to create one block for each iteration of the outermost loop and since the inner loops have iterations of 170 * 170 = 28900 total iterations, assign one thread to each innermost loop iteration. I converted the c++ code into this (it's just a small program that i made to understand how to use Cuda):
__global__ void mykernel(int k, float *buffer) {
int idz=blockIdx.x;
int idx=threadIdx.x;
int idy=threadIdx.y;
buffer[k]=idx+0.5;
buffer[k+1]=idy+0.5;
buffer[k+2]=idz+0.5;
k+=3;
}
int main(void) {
int dim=3*170*170*263;
float* g_vertex_buffer_data_2 = new float[dim];
float* g_vertex_buffer_data_3;
int i=0;
HANDLE_ERROR(cudaMalloc((void**)&g_vertex_buffer_data_3, sizeof(float)*dim));
dim3 dimBlock(170, 170);
dim3 dimGrid(263);
mykernel<<<dimGrid, dimBlock>>>(i, g_vertex_buffer_data_3);
HANDLE_ERROR(cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,sizeof(float)*dim,cudaMemcpyDeviceToHost));
for(int j=0;j<100;j++){
printf("g_vertex_buffer_data_2[%d]=%f\n",j,g_vertex_buffer_data_2[j]);
}
cudaFree(g_vertex_buffer_data_3);
return 0;
}
Trying to launch it I get a segmenation fault. Do you know what am i doing wrong?
I think the problem is that threadIdx.x and threadIdx.y grow at the same time, while I would like to have threadIdx.x to be the inner one and threadIdx.y to be the outer one.
There is a lot wrong here, but the source of the segfault is this:
cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
You either want
cudaMemcpy(&g_vertex_buffer_data_2[0],g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
or
cudaMemcpy(g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
Once you fix that you will notice that the kernel is actually never launching with an invalid launch error. This is because a block size of (170,170) is illegal. CUDA has a 1024 threads per block limit on all current hardware.
There might well be other problems in your code. I stopped looking after I found these two.
I have an array of float values, namely life, of which i want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for(int i = 0; i < MAX_PARTICLES; i++){
if(life[i]>0){
numParticles++;
}
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (life[idx]>0){
(*numParticles)++;
}
}
//life is a filled device pointer
int launchCount(float* life)
{
int numParticles = 0;
int* numParticles_d = 0;
cudaMalloc((void**)&numParticles_d, sizeof(int));
update<<<MAX_PARTICLES/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(life, numParticles_d);
cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?
This:
if (life[idx]>0){
(*numParticles)++;
}
is a read-after write hazard. Multiple threads will be simultaneously attempting to read and write from numParticles. The CUDA execution model does not guarantee anything about the order of simultaneous transactions.
You could make this work by using atomic memory transactions, for example:
if (life[idx]>0){
atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction type calculation and then sum the block local sums atomically or on the host, or in a second kernel.
Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks are executing (*numParticles)++; concurrently. It is a race condition. So you have the result 0, or if you are luck, sometimes a little bigger than 0.
As your attempt to sum up life[i]>0 ? 1 : 0 for all i, you could follow CUDA parallel reduction to implement your kernel, or use Thrust reduction to simplify your life.
(I have tried to simplify this as much as i could to find out where I'm doing something wrong.)
The ideea of the code is that I have a global array *v (I hope using this array isn't slowing things down, the threads should never acces the same value because they all work on different ranges) and I try to create 2 threads each one sorting the first half, respectively the second half by calling the function merge_sort() with the respective parameters.
On the threaded run, i see the process going to 80-100% cpu usage (on dual core cpu) while on the no threads run it only stays at 50% yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions, each thread will call merge_sort(..). Is this a problem? both threads calling same (normal) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n;";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n;";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too as they've hinted to this from the beginning.
I've used the inplace_merge(..) function instead of my own and the program run times are as they should now.
Here's my initial merge function (not really sure if the initial, i've probably modified it a few times since, also array indices might be wrong right now, i went back and forth between [a,b] and [a,b), this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
//create the 2 new arrays
int *st = new int[m-a+1];
int *dr = new int[c-m+1];
//copy the values
for (int i1 = 0; i1 <= m-a; i1++)
st[i1] = v[a+i1];
for (int i2 = 0; i2 <= c-(m+1); i2++)
dr[i2] = v[m+1+i2];
//merge them back together in sorted order
int is=0, id=0;
for (int i=0; i<=c-a; i++) {
if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
v[a+i] = st[is];
is++;
}
else {
v[a+i] = dr[id];
id++;
}
}
delete st, dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
Edit, some times on my 3ghz dual core cpu:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.
I am writing a program that does calculations in multiple threads and return the result using c++ future, here's a simplified version of my code
int main()
{
int length = 64;
vector<std::future<float>> threads(length);
vector<float> results(length);
int blockLength = 8;
int blockCount = length/blockLength;
for(int j=0;j<blockCount;j++)
{
for(int i=0;i<blockLength;i++)
{
threads[i + j * blockLength] = std::async(func1,i*j);
}
for(int i=0;i<blockLength;i++)
{
results[i + j * blockLength] = threads[i].get();
}
}
the definition of func1 is simplified as follows:
float func1(int input)
{
//calculations...
return result;
}
I would like that the program above does 64 times of calculations, in 8 threads at a time, so that the processor and memory usage would be better at the same time.
The program is conceived that it will post blockLength number of threads at a time, and wait till the calculation results are obtained, and proceed to the next loop.
the program will post blockLength number of threads for blockCount times, for example, 8 threads for 8 times.
but the program is not working, there is always a EXC_BAD_ACCESS exception when the first loop of blockLength threads finishes, besides, the calculation time of each thread is not guaranteed, any thread can run for a long time or finish quickly.
Here is a screenshot:
as is shown above, the CPU usage drops as some of the threads finish, but an exception is thrown as soon as the second loop starts.
Would you please point out what is wrong with my usage of future?
How can we correct it?
Thank you very much!