About organizing threads in CUDA / C++

General question: must the number of threads equal the number of elements I want to process? Example: if I have a matrix M[a][b], must I allocate exactly a*b threads, or can I allocate more threads than I need (more than a*b)? The thread assigned to element a*b+1 would take us out of bounds, wouldn't it? Or is the solution to add a condition (proceed only if the index is in range, i.e. below a*b)?
Specific question: let M[x][y] be a matrix with x rows and y columns, where 1000 <= x <= 300000 and y <= 100. How can I organize the threads so that the scheme is general for every input x and y, with each thread handling exactly one element of the matrix? Compute capability is 2.1. Thanks!

General answer: It depends on the problem.
In most cases a natural one-to-one mapping of the problem onto the grid of threads is fine to start with, but what you want to keep in mind is:
Achieving high occupancy.
Maximizing GPU resources usage and memory throughput.
Working with valid data.
Sometimes it may require using a single thread to process many elements, or many threads to process a single element.
For instance, imagine a series of independent operations A, B and C that need to be applied to an array of elements. You could run three different kernels, but it might be a better choice to allocate a grid containing three times more threads than there are elements and distinguish the operations by one of the dimensions of the grid (or anything else). On the other hand, you might have a problem that benefits from maximizing the use of shared memory (e.g. transforming an image): you could use a block of 16 threads to process a 5x5 image window, where each thread would calculate some statistics of a 2x2 slice.
The choice is yours; the best advice is to not always go with the obvious. Try different approaches and choose what works best.


Thread indexing for image rows / columns in CUDA

So I'm working on a CUDA program, and I'm experiencing some issues when it comes to indexing blocks and threads. Basically, I'm trying to implement the Pixel Sort algorithm in CUDA. (With one modification, we are only processing either rows or columns, not both at the same time)
The way I was visualizing it was to simply run N blocks with 1 thread each (one per row or column) and have each block process that row / column independently of the others.
So if we want to sort on columns, we launch the kernel like this (there are a couple extra parameters that are only relevant to our specific processing, so I've left them out for simplicity)
pixel_sort<<<cols, 1>>>(d_image, d_imageSort, rows, cols, ...);
Then in the kernel, I access the block index with
int tid = blockIdx.x;
This allows me to work with one row's / column's data per block, but it has some issues: it runs slower than our serial implementation of the algorithm for smaller images, and crashes outright when the image size becomes too large.
An alternative thread scheme I was considering would be to map each of the image's pixels to one thread, however I have a couple of questions on this.
If we were to launch N blocks with M threads each, representing N columns of M rows, how do we get around the 512 (or 1024?) limit on threads per block? Can we just have each thread process multiple pixels in the column in this case? What would the indexing look like in the kernel?
The algorithm basically requires that we work on the entire column, hence each thread cannot just do some work on its own pixel; the threads have to communicate, presumably using shared memory. Would it be a valid strategy to have one "master" thread per block that does the actual sorting calculations, with all of the other threads just participating in the shared memory?
Other Notes:
Our image data is read in through OpenCV, and has the RGBA values stored in a uchar4 array
If you have a single thread per block, you very quickly run into thread occupancy issues. If your goal is to do a full row sort (for columns you could transpose the image before sending to the GPU to take advantage of global coalescing), the fastest way that gets a decent result is probably to do a radix or merge-sort on a per-row basis, basically copying the steps from http://mgarland.org/files/papers/nvr-2008-001.pdf. You could assign k blocks of m threads each for each row such that km >= image width. Then you would be launching k*(image height) blocks. Your grid would then be of size (k, height, 1).
As for your specific questions:
You cannot get around the 512/1024 thread-per-block limit, you would have to restructure your algorithm.
A "master" thread would generally be poor design, causing stalls, overhead, and not taking full advantage of the many cores. You may sometimes need to utilize a single thread, say to output/broadcast a result, but mostly you want to avoid it. See the linked article for sample algorithms that mostly avoid this.

Multithreaded array of arrays?

I have a data structure which consists of 1,000 array elements, each array element is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"
You are correct that your current algorithm is not thread safe; there are a number of places where contention could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach will then be to just lock the entire structure for the full process. This will only work if the threads are doing a lot of other processing as well; if not, you will actually lose performance compared to single threading.
After that you can consider having a separate lock for different sections of the data structure. You will need to properly analyse what you are using when and where, and work out what would be useful to split. For example, you might have chunks of the sub-arrays, with each chunk having its own lock.
Be careful of deadlocks in this situation, though: you might have a thread claim lock 32 then want 79, while another thread already holds 79 and then wants 32. Make sure you always claim locks in the same order.
The fastest option (if possible) may even be to give each thread its own copy of the data structure; each processes 1/N of the work, and the results are merged at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.

CUDA - Understanding parallel execution of threads (warps) and coalesced memory access

I just started to code in CUDA and I'm trying to get my head around the concepts of how threads are executed and memory accessed in order to get the most out of the GPU. I read through the CUDA best practice guide, the book CUDA by Example and several posts here. I also found the reduction example by Mark Harris quite interesting and useful, but despite all the information I got rather confused on the details.
Let's assume we have a large 2D array (N*M) on which we do column-wise operations. I split the array into blocks so that each block has a number of threads that is a multiple of 32 (all threads fit into several warps). The first thread in each block allocates additional memory (a copy of the initial array, but only for the size of its own dimension) and shares the pointer using a __shared__ variable, so that all threads of the same block can access the same memory. Since the number of threads is a multiple of 32, so should the memory be, in order to be accessed in a single read. However, I need extra padding around the memory block, a border, so that the width of my array becomes (32*x)+2 columns. The border comes from decomposing the large array, so that I have overlapping areas in which a copy of each block's neighbours is temporarily available.
Coalesced memory access:
Imagine the threads of a block are accessing the local memory block
int x = threadIdx.x;

for (int y = 0; y < height; y++)
{
    double value_centre = array[y*width + x+1]; // remember we have the border, so we need an offset of +1
    double value_left   = array[y*width + x  ]; // hence the left element is at x
    double value_right  = array[y*width + x+2]; // and the right element at x+2

    // .. do something
}
Now, my understanding is that since I have an offset (+1, +2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements). Or does it not matter where I start reading, as long as the memory after the first thread is perfectly aligned? Note also that if that is not the case, I would have unaligned access to the array for every row after the first one, since the width of my array is (32*x)+2 and hence not 32-byte aligned. Further padding would, however, solve the problem for each new row.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one accessed without any offset?
Thread executed in a warp:
Threads in a warp are executed in parallel only if all the instructions are the same (according to link). If I have a conditional statement / diverging execution, that particular thread will be executed by itself and not within a warp with the others.
For example if I initialise the array I could do something like
1 int x = threadIdx.x;
2
3 array[x+1] = globalArray[blockIdx.x * blockDim.x + x]; // remember the border and therefore use +1
4
5 if (x == 0 || x == blockDim.x-1) // border
6 {
7 array[x] = DBL_MAX;
8 }
Will the warp be of size 32 and executed in parallel until line 3, then stop for all other threads while only the first and last thread continue to initialise the border? Or will those be separated from all the other threads right from the beginning, since there is an if statement that all other threads do not satisfy?
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to hold for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and therefore differs from the others. My understanding is that thread 1 is executed in a single warp, threads 2-33 (etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
I wonder what the best practice is: either have the memory perfectly aligned for each row (in which case I would run each block with (32*x)-2 threads, so that the bordered array of width ((32*x)-2)+2 is a multiple of 32 for each new line), or do it the way demonstrated above, with a multiple of 32 threads per block, and just live with the unaligned memory. I am aware that these sorts of questions are not always straightforward and often depend on particular cases, but sometimes certain things are bad practice and should not become habit.
When I experimented a little bit, I didn't really notice a difference in execution time, but maybe my examples were just too simple. I tried to get information from the visual profiler, but I haven't really understood all the information it gives me. I got however a warning that my occupancy level is at 17%, which I think must be really low and therefore there is something I do wrong. I didn't manage to find information on how threads are executed in parallel and how efficient my memory access is.
-Edit-
Added and highlighted 2 questions, one about memory access, the other one about how threads are collected to a single warp.
Now, my understanding is that since I have an offset (+1, +2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements). Or does it not matter where I start reading, as long as the memory after the first thread is perfectly aligned?
Yes, it does matter "from where you start reading" if you are trying to achieve perfect coalescing. Perfect coalescing means the read activity for a given warp and a given instruction all comes from the same 128-byte aligned cacheline.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one accessed without any offset?
Yes. For cc2.0 and higher devices, the cache(s) may mitigate some of the drawbacks of unaligned access.
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to hold for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and therefore differs from the others. My understanding is that thread 1 is executed in a single warp, threads 2-33 (etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
The grouping of threads into warps always follows the same rules, and will not vary based on the specifics of the code you write, but is only affected by your launch configuration. When you write code that not all the threads will participate in (such as in your if statement), then the warp still proceeds in lockstep, but the threads that do not participate are idle. When you are filling in borders like this, it's rarely possible to get perfectly aligned or coalesced reads, so don't worry about it. The machine gives you that flexibility.

MPI synchronize matrix of vectors

Excuse me if this question is common or trivial, I am not very familiar with MPI so bear with me.
I have a matrix of vectors. Each vector is empty or has a few items in it.
std::vector<someStruct*> partitions[matrix_size][matrix_size];
When I start the program each process will have the same data in this matrix, but as the code progresses each process might remove several items from some vectors and put them in other vectors.
So when I reach a barrier I somehow have to make sure each process has the latest version of this matrix. The big problem is that each process might manipulate any or all vectors.
How would I go about to make sure that every process has the correct updated matrix after the barrier?
EDIT:
I am sorry I was not clear. Each process may move one or more objects to another vector but only one process may move each object. In other words each process has a list of objects it may move, but the matrix may be altered by everyone. And two processes can't move the same object ever.
In that case you'll need to send messages using MPI_Bcast that inform the other processes about this and instruct them to do the same. Alternatively, if the ordering doesn't matter until you hit the barrier, you can send the messages only to the root process, which performs the permutations and then, after the barrier, sends the result to all the others using MPI_Bcast.
One more thing: vectors of pointers are usually quite a bad idea, as you'll need to manage the memory manually in there. If you can use C++11, use std::unique_ptr or std::shared_ptr instead (depending on what your semantics are), or use Boost which provides very similar facilities.
And lastly, representing a matrix as a fixed-size array of fixed-size arrays is really bad. First: the matrix size is fixed. Second: adjacent rows are not necessarily stored in contiguous memory, slowing your program down like crazy (it can literally be orders of magnitude). Instead, represent the matrix as a linear array of size Nrows*Ncols and index the elements as Ncols*i + j, where Ncols is the number of columns and i and j are the row and column indices, respectively. If you want column-major storage instead, address the elements by i + Nrows*j. You can wrap this index-juggling in inline functions that have virtually zero overhead.
I would suggest to lay out the data differently:
Each process has a map of its objects and their positions in the matrix. How that is implemented depends on how you identify objects. If all local objects are numbered, you could just use a vector<pair<int,int>>.
Treat that as the primary structure you manipulate, and communicate it with MPI_Allgather (each process sends its data to all other processes; at the end everyone has all the data). If you need fast lookup by coordinates, you can build up a cache.
That may or may not be performing well. Other optimizations (like sharing 'transactions') totally depend on your objects and the operations you perform on them.

Better way to copy several std::vectors into 1? (multithreading)

Here is what I'm doing:
I'm taking in bezier points and running bezier interpolation, then storing the result in a std::vector<std::vector<POINT> >.
The bezier calculation was slowing me down so this is what I did.
I start with a std::vector<USERPOINT> which is a struct with a point and 2 other points for bezier handles.
I divide these up into ~4 groups and assign each thread 1/4 of the work. To do this I created 4 std::vector<std::vector<POINT> > to store the results from each thread. In the end all the points have to be in one contiguous vector. Before I used multithreading I wrote into it directly, but now I reserve the size of the 4 vectors produced by the threads and insert them into the original vector in the correct order.
This works, but unfortunately the copy step is very slow and makes it slower than without multithreading. So now my new bottleneck is copying the results into the final vector. How could I do this more efficiently?
Thanks
Have all the threads put their results into a single contiguous vector just like before. You have to ensure each thread only accesses parts of the vector that are separate from the others. As long as that's the case (which it should be regardless -- you don't want to generate the same output twice) each is still working with memory that's separate from the others, and you don't need any locking (etc.) for things to work. You do, however, need/want to ensure that the vector for the result has the correct size for all the results first -- multiple threads trying (for example) to call resize() or push_back() on the vector will wreak havoc in a hurry (not to mention causing copying, which you clearly want to avoid here).
Edit: As Billy O'Neal pointed out, the usual way to do this would be to pass a pointer to each part of the vector where each thread will deposit its output. For the sake of argument, let's assume we're using the std::vector<std::vector<POINT> > mentioned as the original version of things. For the moment, I'm going to skip over the details of creating the threads (especially since it varies across systems). For simplicity, I'm also assuming that the number of curves to be generated is an exact multiple of the number of threads -- in reality, the curves won't divide up exactly evenly, so you'll have to "fudge" the count for one thread, but that's really unrelated to the question at hand.
std::vector<USERPOINT> inputs;                  // input data
std::vector<std::vector<POINT> > outputs;       // space for output data

const int thread_count = 4;

struct work_packet {             // describes the work for one thread
    USERPOINT *inputs;           // where to get its input
    std::vector<POINT> *outputs; // where to put its output
    int num_points;              // how many points to process
    HANDLE finished;             // signaled when it's done
};

std::vector<work_packet> packets(thread_count); // storage for the packets
std::vector<HANDLE> events(thread_count);       // parent's handles to the events

outputs.resize(inputs.size()); // can't resize output after processing starts.

for (int i = 0; i < thread_count; i++) {
    int offset = i * inputs.size() / thread_count;
    packets[i].inputs = &inputs[0] + offset;
    packets[i].outputs = &outputs[0] + offset;
    packets[i].num_points = inputs.size() / thread_count;
    events[i] = packets[i].finished = CreateEvent(NULL, FALSE, FALSE, NULL);
    threads[i].process(&packets[i]);
}
// wait for curves to be generated (Win32 style, for the moment).
WaitForMultipleObjects(thread_count, &events[0], TRUE, INFINITE);
Note that although we have to be sure that the outputs vector doesn't get resized while being operated on by multiple threads, the individual vectors of points in outputs can be, because each will only ever be touched by one thread at a time.
If the simple copy in between is slower than before you started using multithreading, it's entirely likely that what you're doing simply isn't going to scale to multiple cores. If it's something simple like the bezier calculation, I suspect that's the case.
Remember that the overhead of making the threads and such has an impact on total run time.
Finally.. for the copy, what are you using? Is it std::copy?
Multithreading is not going to speed up your process by itself. Processing the data on different cores could.