Concurrent writes to the same global memory location - C++

I have several blocks, each having some integers in a shared memory array of size 512. How can I check if the array in every block contains a zero as an element?
What I am doing is creating an array that resides in global memory. The size of this array depends on the number of blocks, and it is initialized to 0. Every block then writes a[blockid] = 1 if its shared memory array contains a zero.
My problem is when I have several threads in a single block writing at the same time. That is, if the array in the shared memory contains more than one zero, then several threads will write a[blockid] = 1. Would this generate any problem?
In other words, would it be a problem if 2 threads write the exact same value to the exact same array element in global memory?

For a CUDA program, if multiple threads in a warp write to the same location then the location will be updated but it is undefined how many times the location is updated (i.e. how many actual writes occur in series) and it is undefined which thread will write last (i.e. which thread will win the race).
For devices of compute capability 2.x, if multiple threads in a warp write to the same address then only one thread will actually perform the write; which thread does so is undefined.
From the CUDA C Programming Guide section F.4.2:
If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
See also section 4.1 of the guide for more info.
In other words, if all threads writing to a given location write the same value, then it is safe.
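A minimal sketch of that pattern, under the assumption that flags is a global array zero-initialized before the launch (the kernel and array names are illustrative, not from the original post):
__global__ void mark_blocks_with_zero(const int *in, int *flags)
{
    __shared__ int s_data[512];
    int tid = threadIdx.x;
    s_data[tid] = in[blockIdx.x * 512 + tid];
    __syncthreads();
    // Several threads may store the same value 1 to flags[blockIdx.x].
    // Per the quoted guide text, only one (unspecified) write lands, and
    // because every candidate value is identical the result is well defined.
    if (s_data[tid] == 0)
        flags[blockIdx.x] = 1;
}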

In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.
If this is a problem, then a better approach (from a correctness point of view), would be to have only one thread from each block do the global write. You can either use a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be. The more zeroes there are, the more attractive the reduction will be. CUDA includes warp level __any() and __all() operators which can be built into a very efficient boolean reduction in a few lines of code.
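As a rough sketch of that approach, a block-wide boolean reduction can be done with the __syncthreads_or() intrinsic (or built from the warp-level __any()/__any_sync() intrinsics mentioned above); the kernel and array names below are illustrative:
__global__ void mark_blocks_with_zero_reduced(const int *in, int *flags)
{
    __shared__ int s_data[512];
    int tid = threadIdx.x;
    s_data[tid] = in[blockIdx.x * 512 + tid];
    // __syncthreads_or() acts as a barrier and returns nonzero in every
    // thread if the predicate was nonzero for any thread in the block.
    int any_zero = __syncthreads_or(s_data[tid] == 0);
    // Exactly one thread performs the global write, so there is no race at all.
    if (tid == 0)
        flags[blockIdx.x] = any_zero ? 1 : 0;
}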

Yes, it will be a problem, known as a race condition.
You should consider synchronizing access to the global data, for example through semaphores.

While not a mutex or semaphore, CUDA does contain a synchronization primitive you can use to serialize access to a given code segment or memory location: the __syncthreads() function. It creates a barrier, so any given thread blocks at the point of the call until all threads in the block have executed __syncthreads(). That way you can serialize access to your memory location and avoid a situation where two threads need to write to the same memory location at the same time. The only caveat is that all threads have to execute __syncthreads() at some point, or else you will end up with a deadlock. So don't place the call inside a conditional if-statement that some threads may never reach. If you do approach your problem this way, you will need to make some provision for the threads that don't initially call __syncthreads() to call it later, in order to avoid deadlock.
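A small sketch of that warning (hypothetical kernels, not from the question): the first version places __syncthreads() in a branch that only some threads execute, which is exactly the deadlock-prone pattern described above; the second hoists the barrier out of the branch so every thread reaches it.
__global__ void barrier_in_branch(int *data)   // WRONG
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] = 0;
        __syncthreads();   // only part of the block reaches this barrier
    }
}

__global__ void barrier_for_all(int *data)     // OK
{
    if (threadIdx.x < 16)
        data[threadIdx.x] = 0;
    __syncthreads();       // every thread in the block reaches the barrier
}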

Related

Do I need to use a mutex to protect access to an array of mutexes from different threads?

Let's say I have a bunch of files, and an array with a mutex for each file. Now I have different threads reading from random files, but first they need to acquire the lock from the array. Should I have a lock on the entire array that must be acquired before taking the mutex for the particular file?
No. But be aware that you pull the memory in which these mutexes live into every thread's cache, since you placed the mutexes close together on purpose.
Keep each thread's memory accesses away from the data the other threads work on.
Associate each thread with its data as tightly packed (but aligned), and in as few cache lines, as possible: one mutex and one data set, nowhere close to what the other working threads need to access.
You can easily measure the effect by using a homemade std::hardware_constructive_interference_size of, say, 64 bytes (a popular, non-scientific, but common value).
Separate the data in such a way that no other thread needs to touch data within those 64 (or whatever number you come up with) bytes.
It's a "no kidding?" experience.
The number 64 is almost arbitrary. I can compile a program using that constant, but it will not be translated into something meaningful for a different target platform; it'll stay 64. It's a best guess.
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size
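As a sketch of the layout described above (the FileSlot type and its fields are hypothetical, only meant to illustrate the packing): each file gets its own mutex plus the data it protects, and each such unit is aligned to its own presumed 64-byte cache line so that threads working on different files never touch the same line.
#include <mutex>

struct alignas(64) FileSlot {       // 64: the homemade interference size
    std::mutex lock;                // protects only bytes_read below
    long bytes_read = 0;
};

FileSlot slots[128];                // one slot per file, indexed by file id

void work_on(int file_id) {
    std::lock_guard<std::mutex> g(slots[file_id].lock);
    slots[file_id].bytes_read += 1; // no other slot's cache line is touched
}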
No, accessing different elements of an array in different threads does not cause data races, and a mutex can be used by multiple threads unsynchronized, because it must be able to fulfill its purpose.
You do not need to add a lock for the array itself. The same is true for member functions of standard library containers that only access elements and do not modify the container itself.

Memory range sharing in threads: ensure data is not stuck in cache

When sending the address of a memory location from one thread to another, how do I ensure that the data is not stuck in the CPU cache, and that the second thread actually reads the correct value? (I'm using a socketpair() to send the pointer from one thread to another.)
And a related question: how does the C++ compiler, along with the thread primitives, figure out which memory addresses need to be handled specially for synchronization?
struct Test { int fld; };

void thread_1() {
    Test *ptr1 = new Test;
    ptr1->fld = 100;
    ::write(write_fd, &ptr1, sizeof(ptr1));
}

void thread_2() {
    Test *ptr2;
    ::read(read_fd, &ptr2, sizeof(ptr2));
    // WHAT MAGIC IS REQUIRED TO ENSURE THIS?
    assert(ptr2->fld == 100);
}
If you want to pass the value between threads in the same process, I would use std::atomic<int> as the type of the field, with the related setter and getter functions. Obviously, passing a pointer from one process to another doesn't work at all, unless it's from an area of memory that is guaranteed to have the same address in both processes - shared memory, for example - but then you shouldn't need sockets...
Compilers do not, in general, know how to deal with caches, except for atomic types (technically, atomics are usually dealt with using separate instructions rather than cache-flushing and cache-invalidation, and the processor hardware handles the relevant "talk to other processors about the cache content").
The OS (subject to bugs, of course) does that sort of thing when passing between processes - or within a process. But for passing pointers you can't rely on that: the newly received pointer value is correct, but the content the pointer is pointing at isn't cache-managed.
On some processors, you can use a memory barrier to ensure the correct ordering of memory content between threads. This forces the processor to "perform all memory operations before this point". However, in the case of system calls like read and write, the OS should take care of that for you and ensure that the memory has been properly written before read starts reading the memory it wants to store in the socket buffer, and write will have a memory barrier after it has stored your data (in this case the value of the pointer, but memory barriers affect all reads and/or writes that precede that point).
If you were to implement your own primitives for passing data, and the processors do not have cache coherency (most modern processors do), you would also need to add a cache flush on the writing side and a cache invalidate on the reading side. This is architecture dependent; there is no support for it in standard C or C++. (On some processors only OS functionality [kernel mode] can flush or invalidate cache content; on others it can be done in user-mode code. The granularity of such operations also varies: it may be necessary to flush or invalidate the entire cache system, or individual lines of 32, 64 or 128 bytes can be flushed at a time.)
In C++, you don't need to care about implementation details like caches. The only thing you need to do is to make sure there is a C++ happens-before relation.
As Mats Petersson's answer shows, std::atomic is one way to achieve that. All accesses to an atomic variable are ordered, although the order might not be statically determined (i.e. if you have two threads trying to write to the same atomic variable, you can't predict which write happens last).
Another mechanism to enforce synchronization is std::mutex. Threads can attempt to lock a mutex, but only one thread can have a mutex locked at a time. The other threads will block. The compiler will make certain that when one thread unlocks a mutex and the next thread locks the mutex, writes by the first thread can be read by the second thread. If this requires flushing the cache, the compiler will arrange that.
Yet another mechanism is std::atomic_thread_fence. This is useful if you have multiple objects shared between threads (all in the same direction). Instead of making them all atomic, you can make one of them atomic and "attach" a fence to that atomic variable. You then write to the atomic variable last, and read from it first. Obviously this is best encapsulated in a class.
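A minimal sketch of the atomic approach applied to the question's scenario (in-process only, without the socketpair; the names are illustrative): the pointer is published through a std::atomic with release/acquire ordering, which guarantees that the write to fld is visible to the reader once it sees the non-null pointer.
#include <atomic>
#include <cassert>
#include <thread>

struct Test { int fld; };

std::atomic<Test*> slot{nullptr};

void producer() {
    Test *p = new Test;
    p->fld = 100;
    slot.store(p, std::memory_order_release);      // publish the pointer
}

void consumer() {
    Test *p;
    while ((p = slot.load(std::memory_order_acquire)) == nullptr)
        ;                                          // spin until published
    assert(p->fld == 100);                         // guaranteed to hold
    delete p;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}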

Multithreading - hopefully a simple task

I have an iterative process coded in C++ which takes a long time and am considering converting my code to use multiple threads. But I am concerned that it could be very complicated and risk lock-ups and bugs. However I suspect that for this particular problem it may be trivial, but I would like confirmation.
I am hoping I can use threading code as simple as this here.
My program employs large amounts of global arrays and structures. I assume that the individual threads need not concern themselves if other threads are attempting to read the same data at the same time.
I would also assume that if one thread wanted to increment a global float variable by, say, 1.5 and another thread wanted to decrement it by 0.1, then so long as I didn't care about the order of events, both threads would succeed in their task without any special code (like mutexes and locks, etc.) and the float would eventually end up larger by 1.4. If all my assumptions are correct then my task will be easy - please advise.
EDIT: just to make it absolutely clear - it doesn't matter at all the order in which the float is incremented / decremented. So long as its value ends up larger by 1.4 then I am happy. The value of the float is not read until after all the threads have completed their task.
EDIT: As a more concrete example, imagine we had the task of finding the total donations made to a charity from different states in the US. We could have a global like this:
float total_donations = 0;
Then we could have 50 separate threads, each of which calculated a local float called donations_from_this_state. And each thread would separately perform:
total_donations += donations_from_this_state;
Obviously which order the threads performed their task in would make no difference to the end result.
I assume that the individual threads need not concern themselves if other threads are attempting to read the same data at the same time.
Correct. As long as all threads are readers no synchronization is needed as no values are changed in the shared data.
I would also assume that if one thread wanted to increment a global float variable by say 1.5 and another thread wanted to decrement it by 0.1 then so long as I didn't care about the order of events then both threads would succeed in their task without any special code (like mutexs and locks etc) and the float would eventually end up larger by 1.4
This assumption is not correct. If you have two or more threads writing to the same shared variable and that variable is not internally synchronized then you need external synchronization otherwise your code has undefined behavior per [intro.multithread]/21
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Where a conflicting action is specified by [intro.multithread]/4:
Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
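A sketch of a correct version of the charity example (hypothetical per-state values; C++20 also offers fetch_add directly on std::atomic<float>): the global total is a std::atomic<float> and each thread adds its local sum with a compare-exchange loop, so the final value is larger by the sum of all contributions regardless of ordering.
#include <atomic>
#include <thread>
#include <vector>

std::atomic<float> total_donations{0.0f};

void add_donations(float donations_from_this_state) {
    float old = total_donations.load();
    // Retry until our addition lands; old is refreshed on each failure.
    while (!total_donations.compare_exchange_weak(
               old, old + donations_from_this_state)) {
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int state = 0; state < 50; ++state)
        threads.emplace_back(add_donations, 1.0f); // dummy per-state value
    for (auto &t : threads)
        t.join();
    // total_donations.load() is now exactly 50.0f, whatever the order.
}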

Critical sections in OpenMP

I would like to know where we need to set critical sections.
If there are multiple threads with a shared array, and each one wants to write to a different place, does the write need to be in a critical section, even though each thread writes to a different place in the array?
Let's say that I have a two-dimensional array M[3][3], initial_array[3] and some double variable, and I want to calculate something and store it in the first column of M.
I could do this with a for loop, but I want to use OpenMP, so I did:
omp_set_num_threads(3);
#pragma omp parallel shared(M, variable)
{
    int id = omp_get_thread_num();
    double init = initial_array[id] * variable;
    M[id][0] = init;
}
It works fine, but I know that it can lead to deadlock or bad running time.
I mean, what if I had more threads and an even larger M?
What is the correct way to set a critical section?
Another thing I want to ask is about initial_array: does it also need to be shared?
This is safe code.
Random access in arrays does not cause any race conditions to other elements in the array. As long as you continue to read and write to unshared elements within the array concurrently, you'll never hit a race condition.
Keep in mind that a read can race with a write depending on the type and size of the element. Your example uses double, and I'd be concerned if you had reads concurrent with write operations on the same element. It is possible for there to be a context switch during a write, but that depends on your arch/platform. Anyway, you aren't doing this, but it is worth mentioning.
I don't see any problem with regard to concurrency, since you are accessing different parts of memory (different indices of the array); the only problem I see is a performance hit if your cores have dedicated L1 caches.
In this case there will be a performance hit due to cache coherency, where one core updates an index, invalidates the others' copies, does a write-back, and so on. For a small number of threads/cores this is not an issue, but for threads running on a large number of cores it certainly is. The data your threads work on isn't truly independent; it is read as a block of data into the cache (if you access M[0][0], then not only M[0][0] is read into the cache but M[0][0] through M[n][col], where n depends upon the cache block size). And if the block is large, it might contain more of the shared data.
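One way to reduce that false sharing, sketched here with a padded row type (64 bytes is the same rough guess discussed in the earlier answer; the real line size is platform-specific): each thread then writes only to a row that occupies its own cache line.
#include <omp.h>

struct alignas(64) PaddedRow {
    double col[3];
};

PaddedRow M[3];

void fill(const double *initial_array, double variable) {
    #pragma omp parallel num_threads(3)
    {
        int id = omp_get_thread_num();
        // Each thread writes to a row on its own cache line.
        M[id].col[0] = initial_array[id] * variable;
    }
}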

What is faster in CUDA: global memory write + __threadfence() or atomicExch() to global memory?

Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know if its writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x] , value);
Thanks.
On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at the same time. If such a conflict were to occur, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic...() function.
__threadfence() delays the current thread (and only the current thread!) to ensure that any subsequent writes by that thread do actually happen later.
As such, __threadfence() on its own, without any follow-up code, is not very interesting.
For that reason, I don't think there is a point in comparing the efficiency of those two. Maybe if you could show a more concrete use case I could relate...
Note that neither of those actually gives you any guarantees on the actual order of execution of the threads.
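To illustrate that point about follow-up code, here is a sketch of the usual reason __threadfence() appears at all, loosely modeled on the "threadfence reduction" pattern from the programming guide (the names partial and done_count are illustrative):
constexpr int GRID_SIZE = 128;
__device__ int partial[GRID_SIZE];
__device__ unsigned int done_count = 0;

__global__ void publish_partial(int value)
{
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = value;   // publish this block's result
        __threadfence();               // make it visible device-wide...
        atomicAdd(&done_count, 1u);    // ...before announcing completion
        // A block that later observes done_count == gridDim.x can safely
        // read every partial[] entry written so far.
    }
}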