We all know that a GPGPU has several streaming multiprocessors (SMs), each containing many streaming processors (SPs), when we talk about its hardware architecture. But NVIDIA's CUDA programming model introduces two other concepts: block and thread.
We also know that a block corresponds to an SM and a thread corresponds to an SP. When we launch a CUDA kernel, we configure it as kernel<<<blockNum, threadsNum>>>. I have been writing CUDA programs like this for nearly two months, but I still have a lot of questions. A good programmer is never satisfied with a merely bug-free program; they want to delve inside and know how the program behaves.
My questions are the following:
Suppose a GPU has 14 SMs, each with 48 SPs, and we have a kernel like this:
__global__ void doubleElements(int *data, int dataNum){   // renamed: "double" is a reserved word in C++
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < dataNum){
        data[tid] *= 2;
        tid += blockDim.x * gridDim.x;   // grid-stride step; the original "blockDim.x + blockIdx.x" was a typo
    }
}
and data is an array of 1024 * 1024 ints, with the kernel configured as <<<128, 512>>>. That means the grid has 512 * 128 threads, and every thread will iterate (1024 * 1024)/(512 * 128) = 16 times in its while loop. But there are only 14 * 48 SPs, which would mean that only 14 * 48 threads can run simultaneously no matter how many blocks or threads are in the configuration. So what is the point of blockNum and threadNum in the configuration? Why not just use <<<number of SMs, number of SPs>>>?
And is there any difference between <<<128, 512>>> and <<<64, 512>>>? Perhaps the former will iterate 16 times in its while loop and the latter 32 times, but the former has twice as many blocks to schedule. Is there any way to know the best configuration other than just comparing results and choosing the best? We cannot try every combination, so the result would not be the true best, only the best among our attempts.
We know that only one block can run on an SM at a time, but where does CUDA store the other blocks' contexts? Suppose there are 512 blocks and 14 SMs; only 14 blocks have their contexts in SMs, so what about the other 498 blocks' contexts?
We also know that a block corresponds to an SM and a thread corresponds to an SP
This is incorrect. An SM can process multiple blocks simultaneously and an SP can process multiple threads simultaneously.
1) I think your question may stem from not separating the work that an application needs to have done from the resources available to do that work. When you launch a kernel, you specify the work you want to have done. The GPU then uses its resources to perform the work. The more resources a GPU has, the more work it can do in parallel. Any work that cannot be done in parallel is done in serial.
By letting the programmer specify the work that needs to be done without tying it to the amount of resources available on a given GPU, CUDA provides an abstraction that allows the app to seamlessly scale to any GPU.
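As a small, hedged illustration of that abstraction (not part of the original question; the exact sizing and the d_data name are just for the example): the host describes the work, and the hardware scheduler decides how many of those blocks actually run concurrently on however many SMs the particular GPU has.
#include <cuda_runtime.h>

__global__ void doubleElements(int *data, int dataNum);   // the kernel shown in the question above

int main()
{
    const int dataNum = 1024 * 1024;
    int *d_data;
    cudaMalloc(&d_data, dataNum * sizeof(int));

    const int threadsPerBlock = 512;
    // enough blocks to cover every element; the GPU decides how many run at once
    const int blocks = (dataNum + threadsPerBlock - 1) / threadsPerBlock;
    doubleElements<<<blocks, threadsPerBlock>>>(d_data, dataNum);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}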
But there are only 14 * 48 SPs, which would mean that only 14 * 48 threads can run simultaneously no matter how many blocks or threads are in the configuration
SPs are pipelined, so they process many threads simultaneously. Each thread is in a different stage of completion. Each SP can start one operation and yield the result of one operation per clock. Though, as you can see now, even if your statement was correct, it wouldn't lead to your conclusion.
2) Threads in a block can cooperate with each other using shared memory. If your app is not using shared memory, the only implication of block size is performance. Initially, you can get a good value for the block size by using the occupancy calculator. After that, you can further fine tune your block size for performance by testing with different sizes. Since threads are run in groups of 32, called warps, you want to have the block size be divisible by 32. So there are not that many block sizes to test.
3) An SM can run a number of blocks at the same time. The number of blocks depends on how many resources each block requires and how many resources the SM has. A block uses a number of different resources and one of the resources becomes the limiting factor in how many blocks will run simultaneously. The occupancy calculator will tell you what the limiting factor is.
Only blocks that run simultaneously consume resources on an SM. I think those resources are what you mean by context. Blocks that are completed and blocks that are not yet started do not consume resources on an SM.
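As a hedged sketch of the programmatic counterpart of the occupancy calculator mentioned above, using the CUDA occupancy API (available since CUDA 6.5; myKernel is just a placeholder for your own kernel):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(int *data) { if (data) data[threadIdx.x] = threadIdx.x; }

int main()
{
    int minGridSize = 0, blockSize = 0;
    // block size that maximizes theoretical occupancy for this particular kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int blocksPerSM = 0;
    // how many blocks of that size can be resident on one SM at the same time
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);

    printf("suggested block size: %d, resident blocks per SM: %d\n", blockSize, blocksPerSM);
    return 0;
}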
Related
Note: I am using a GT 740, with 2 SMs and 192 CUDA cores per SM.
I have a working CUDA kernel that is executed 4 times:
__global__ void foo(float *d_a, int i) {
    if (i < 1500) {
        ...
        ...
        ...
    }
}

int main() {
    float *d_mem;
    cudaMalloc(&d_mem, lots_of_bytes);
    for (int i = 0; i < 1500; i += 384)
        foo<<<1, 384>>>(d_mem, i);
    return 0;
}
Each kernel call reuses the memory allocated to d_mem because of memory constraints.
I would like to modify it to be executed from a single statement, like this:
foo<<<8,192>>>(d_mem);
I want both active thread blocks to access different halves of d_mem, though the specific halves are not important, because data is not shared between blocks.
For example, the following is 1 of several desirable access patterns:
Block 1: d_mem[0] and Block 2: d_mem[1]
Block 3: d_mem[0] and Block 4: d_mem[1]
...
While this is undesirable:
Block 1: d_mem[0] and Block 2: d_mem[0]
Block 3: d_mem[1] and Block 4: d_mem[1]
...
Essentially, I want a way to address d_mem so that any combination of active blocks access different parts of it.
I thought that addressing d_mem with a block's SM ID might work, but it appears that this ID is not guaranteed to remain the same throughout a block's life.
I also considered addressing d_mem with a thread's global ID modulo 2 (threadIdx.x + blockIdx.x * blockDim.x) % 2, but this relies on the blocks being processed in a particular order.
This is mainly relevant to the use of 1 block per SM, but I am also interested in how this could be solved for an arbitrary number of blocks per SM, if possible at all.
Simplest way would be
foo<<<2, 384>>> or foo<<<2, 192>>>
and putting a for loop around the calculations in your kernel. Then you can select the half of memory with blockIdx.x. It would work even if the two blocks are scheduled on the same SM. This method also works with more than one block per SM, e.g. (for a quarter of the memory per block)
foo<<<4, 96>>>
Having only 192 threads per SM is inefficient (better at least 384, even better 512, 576, 768, 960 or 1024). The SM needs to hide latencies by switching between active threads. If you run into memory problems by having more than 384 (= 2 SMs * 192) calculations active at the same time, consider whether you could use more than one thread for the same work package (the same value of threadIdx.x + i); threads can easily cooperate within warp boundaries or via shared memory. Sometimes it is beneficial to use more threads just for the parts of your kernel where you read and write global memory, since that is where the largest latencies occur.
So you could call your kernel as
foo<<<2, dim3(4, 192)>>>
and have 4 threads instead of one. For graphics, those 4 could be the rgba channels or xyz coordinates or triangle corners. They can also change their use throughout the kernel.
As a performance optimization this makes some calculations more complicated.
Your if-statement for the current implementation with one block probably should be
if(threadIdx.x + i < 1500)
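Putting the loop, the blockIdx.x selection, and the corrected bound together, a minimal sketch of what the kernel could look like (the per-item work and the half_elems parameter are placeholders, not the asker's real code):
__global__ void foo(float *d_mem, int n_items, int half_elems)
{
    // each block gets its own scratch region of d_mem, so the two resident
    // blocks never touch the same half
    float *scratch = d_mem + blockIdx.x * half_elems;

    // grid-stride loop over all 1500 work items, replacing the old host-side loop
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n_items;
         idx += gridDim.x * blockDim.x)
    {
        scratch[threadIdx.x] = (float)idx;   // placeholder for the real per-item work
    }
}

// launched once instead of four times, e.g.: foo<<<2, 384>>>(d_mem, 1500, half_elems);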
I wrote a C++ application that simulates simple heat flow. It uses OpenCL for the computation.
The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n). It returns a new array with the temperatures after each cycle:
pseudocode:
int t_id = get_global_id(0);
if(t_id < n * n)
{
m_new[t_id / n][t_id % n] = average of its and its neighbors (top, bottom, left, right) temperatures
}
As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computation cycles, it looks like this:
For 1 ... X
Copy memory to OpenCL device
Call kernel
Copy memory back
I would like to rewrite the kernel code to perform all X cycles without the constant memory copying to/from the OpenCL device:
Copy memory to OpenCL device
Call kernel X times OR call kernel one time and make it compute X cycles.
Copy memory back
I know that each thread in the kernel should wait until all the other threads have finished their work, and only after that should m[][] and m_new[][] be swapped. I have no idea how to implement either of those two things.
Or maybe there is another way to do this optimally?
Copy memory to OpenCL device
Call kernel X times
Copy memory back
This works. Make sure the kernel call is non-blocking (so 1-2 ms per cycle is saved) and that there aren't any host-accessible buffer properties such as USE_HOST_PTR or ALLOC_HOST_PTR.
If calling the kernel X times doesn't give satisfactory performance, you can try using a single workgroup (such as only 256 threads) that loops X times, where each cycle has a barrier() at the end so all 256 threads synchronize before starting the next cycle. This way you can compute M different heat-flow problems at the same time, where M is the number of compute units (or workgroups); if this is a server, it can serve that many computations.
Global synchronization is not possible because by the time the latest threads are launched, the first threads are already gone. The GPU runs (number of compute units) * (number of threads per workgroup) * (number of wavefronts per workgroup) threads concurrently. For example, an R7-240 GPU with 5 compute units and local-range = 256 can run maybe 5 * 256 * 20 = 25k threads at a time.
Then, for further performance, you can apply local-memory optimizations.
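For the "call kernel X times" variant, here is a hedged host-side sketch (plain OpenCL C API; the handle names and the kernel's (m, m_new, n) argument order are assumptions): enqueue the kernel X times and simply swap the roles of the two buffers between cycles instead of copying data back and forth.
#include <CL/cl.h>

void run_cycles(cl_command_queue queue, cl_kernel kernel,
                cl_mem buf_a, cl_mem buf_b, cl_int n, int X)
{
    size_t global = (size_t)n * (size_t)n;   // one work-item per cell
    for (int cycle = 0; cycle < X; ++cycle) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);   // read from buf_a
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);   // write to buf_b
        clSetKernelArg(kernel, 2, sizeof(cl_int), &n);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        cl_mem tmp = buf_a; buf_a = buf_b; buf_b = tmp;      // ping-pong the buffers
    }
    clFinish(queue);   // wait once at the end; buf_a now holds the latest result
}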
I'm using CUDA (in reality I'm using pyCUDA if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000];
float test2[20000];
With these arrays I perform simple calculations, like for example filling them with constant values. The point is that the kernel executes normally and performs the computations correctly (you can verify this by filling an array with a random component of test and sending that array from device to host).
The problem is that my NVIDIA card has only 2GB of memory, and the total amount of memory needed for the arrays test and test2 is 320 * 600 * 20000 * 4 bytes per array, which is much more than 2GB.
Where is this memory coming from, and how can CUDA perform the computation in every thread?
Thank you for your time
The actual sizing of the local/stack memory requirements is not as you suppose (for the entire grid, all at once) but is based on a formula described by @njuffa here.
Basically, the local/stack memory requirement is sized based on the maximum instantaneous capacity of the device you are running on, rather than the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
GTX 650m has 2 cc 3.0 (kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(kepler has 2048 max threads per multiprocessor)
so it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB per thread, I would conclude that your actual available GPU memory at the point of this bulk allocation attempt is somewhere above (160/512)*100%, i.e. above 31%, but below (320/512)*100%, i.e. below 62.5%, of 2GB. So the GPU memory available at the time of this bulk allocation request for the stack frame would be something less than 1.25GB.
You could try to see if this is the case by calling cudaMemGetInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and/or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch will all impact available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.
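A hedged sketch of the sizing rule described above, querying the numbers at runtime (cudaMemGetInfo reports memory that is actually free at that moment, which is the relevant figure here):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);   // free memory right now, not the nominal 2GB

    // maximum instantaneous capacity of the device, not the size of the grid
    size_t maxResidentThreads = (size_t)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    size_t perThreadLimit = freeMem / maxResidentThreads;
    if (perThreadLimit > 512 * 1024) perThreadLimit = 512 * 1024;   // hard cap for cc2.x and higher

    printf("%zu resident threads max, roughly %zu bytes of local/stack per thread\n",
           maxResidentThreads, perThreadLimit);
    return 0;
}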
I just started to code in CUDA and I'm trying to get my head around the concepts of how threads are executed and memory accessed in order to get the most out of the GPU. I read through the CUDA best practice guide, the book CUDA by Example and several posts here. I also found the reduction example by Mark Harris quite interesting and useful, but despite all the information I got rather confused on the details.
Let's assume we have a large 2D array (N*M) on which we do column-wise operations. I split the array into blocks so that each block has a number of threads that is a multiple of 32 (all threads fit into several warps). The first thread in each block allocates additional memory (a copy of the initial array, but only for the size of its own dimension) and shares the pointer using a __shared__ variable so that all threads of the same block can access the same memory. Since the number of threads is a multiple of 32, the memory should be as well in order to be accessed in a single read. However, I need extra padding around the memory block, a border, so that the width of my array becomes (32*x)+2 columns. The border comes from decomposing the large array, so that I have overlapping areas in which a copy of the neighbours is temporarily available.
Coalesced memory access:
Imagine the threads of a block are accessing the local memory block
1 int x = threadIdx.x;
2
3 for (int y = 0; y < height; y++)
4 {
5 double value_centre = array[y*width + x+1]; // remember we have the border so we need an offset of + 1
6 double value_left = array[y*width + x ]; // hence the left element is at x
7 double value_right = array[y*width + x+2]; // and the right element at x+2
8
9 // .. do something
10 }
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned? Note also, if that is not the case then I would have unaligned access to the array for each row after the first one, since the width of my array is (32*x)+2, and hence not 32-byte aligned. A further padding would however solve the problem for each new row.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one which is accessed without any offset?
Thread execution in a warp:
Threads in a warp are executed in parallel if and only if all the instructions are the same (according to link). If I have a conditional statement / diverging execution, then that particular thread will be executed by itself and not within a warp with the others.
For example if I initialise the array I could do something like
1 int x = threadIdx.x;
2
3 array[x+1] = globalArray[blockIdx.x * blockDim.x + x]; // remember the border and therefore use +1
4
5 if (x == 0 || x == blockDim.x-1) // border
6 {
7 array[x] = DBL_MAX;
8 }
Will the warp be of size 32 and execute in parallel until line 3, then stop for all other threads while only the first and last thread continue to initialise the border? Or will those threads be separated from all the others right from the beginning, since there is an if statement that the other threads do not fulfil?
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to be valid for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a single warp, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
I wonder what the best practice is: either have the memory perfectly aligned for each row (in which case I would run each block with (32*x-2) threads, so that the array with border, (32*x-2)+2, is a multiple of 32 for each new line), or do it the way I demonstrated above, with a multiple of 32 threads per block, and just live with the unaligned memory. I am aware that this sort of question is not always straightforward and often depends on the particular case, but sometimes certain things are bad practice and should not become habit.
When I experimented a little bit, I didn't really notice a difference in execution time, but maybe my examples were just too simple. I tried to get information from the visual profiler, but I haven't really understood all the information it gives me. I did however get a warning that my occupancy level is at 17%, which I think must be really low, so there is something I am doing wrong. I didn't manage to find information on how threads are executed in parallel and how efficient my memory access is.
-Edit-
Added and highlighted 2 questions, one about memory access, the other one about how threads are collected to a single warp.
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned?
Yes, it does matter "from where you start reading" if you are trying to achieve perfect coalescing. Perfect coalescing means the read activity for a given warp and a given instruction all comes from the same 128-byte aligned cacheline.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one which is accessed without any offset?
Yes. For cc2.0 and higher devices, the cache(s) may mitigate some of the drawbacks of unaligned access.
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to be valid for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a single warp, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
The grouping of threads into warps always follows the same rules, and will not vary based on the specifics of the code you write, but is only affected by your launch configuration. When you write code that not all the threads will participate in (such as in your if statement), then the warp still proceeds in lockstep, but the threads that do not participate are idle. When you are filling in borders like this, it's rarely possible to get perfectly aligned or coalesced reads, so don't worry about it. The machine gives you that flexibility.
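Not from the answer, but as a hedged sketch of one common way around the row-alignment bookkeeping discussed above: let cudaMallocPitch pad each row so that every row starts on an aligned boundary, and index rows through the returned pitch.
#include <cuda_runtime.h>

int main()
{
    const int width  = 32 * 10 + 2;   // logical row width in elements, including the border
    const int height = 600;

    double *d_array = NULL;
    size_t pitchBytes = 0;
    // pitchBytes is the padded row size chosen by the runtime (>= width * sizeof(double))
    cudaMallocPitch((void **)&d_array, &pitchBytes, width * sizeof(double), height);

    // inside a kernel, row y would then be addressed as:
    //   double *row = (double *)((char *)d_array + y * pitchBytes);

    cudaFree(d_array);
    return 0;
}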
Suppose I have N tasks, where each tasks can be performed by a single thread on the GPU. Suppose also that N = number of threads on the GPU.
Question 1:
Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?
cudaDeviceProp theProps;
cudaGetDeviceProperties(&theProps, 0); // fill in the properties of device 0 before using them
dim3 mygrid(theProps.maxGridSize[0], 1, 1);
dim3 myblocks(theProps.maxThreadsDim[0], 1, 1);
mykernel<<<mygrid, myblocks>>>(...);
Question 2:
What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?
Question 3:
Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?
Question 4:
Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?
It's recommended to ask one question per post. Having all sorts of questions makes it very difficult for anyone to give a complete answer. SO isn't really a tutorial service. You should avail yourself of the existing documentation and webinars, and of course there are many other resources available.
Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?
It's certainly possible: all of the threads launched (say this number is N) will be available to perform work, and it will launch a grid of maximum (1D) size. But why do you want to do that anyway? Most CUDA programming methodologies don't start out with that as a goal. The grid should be sized to the algorithm. If the 1D grid size appears to be a limiter, you can work around it by performing loops in the kernel to handle multiple data elements per thread, or else launch a 2D grid to get around the 1D grid limit. The limit for cc3.x devices has been expanded.
What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?
The first is a limit on the total threads in a multidimensional block (i.e. threads_x*threads_y*threads_z). The second is a limit on the first dimension (x) size. For a 1D threadblock, they are interchangeable, since the y and z dimensions are 1. For a multidimensional block, the multidimensional limit exists to inform users that threadblocks of, for example, maxThreadsDim[0]*maxThreadsDim[1]*maxThreadsDim[2] threads are not legal.
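A small hedged sketch that prints both limits side by side (on many recent devices maxThreadsPerBlock is 1024 while maxThreadsDim is (1024, 1024, 64), which is exactly why not every combination of the per-dimension maxima is legal):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("maxThreadsPerBlock : %d\n", p.maxThreadsPerBlock);
    printf("maxThreadsDim      : (%d, %d, %d)\n",
           p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
    printf("maxGridSize[0]     : %d\n", p.maxGridSize[0]);
    return 0;
}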
Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?
Again, I'm a bit skeptical of the methodology. But yes, the theoretical maximum of shared memory bytes per thread would be achieved by a threadblock with the smallest number of threads. However, allowing a threadblock to use all the available shared memory may mean that only one threadblock can be resident on an SM at a time. This may have negative consequences for occupancy, which may have negative consequences for performance. There are many useful recommendations for choosing threadblock size to maximize performance; I can't summarize them all here. But we want to choose the threadblock size as a multiple of the warp size, we generally want multiple warps per threadblock, and, all other things being equal, we want to enable maximum occupancy (which is related to the number of threadblocks that can be resident on an SM).
Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?
Yes, for a 1-D threadblock and grid structure, this line will give a globally unique thread ID:
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;