Critical Sections openMP - c++

I would like to know where do we need to set critical sections?
If there are multiple threads with a shared array, and each one want
to write in different place does it need to be in a critical section, even though each
thread write to a different place in the array?
lets say that I have 2 dimensional array M[3][3], initial_array[3] and some double variable
and I want to calculate something and store it at the first column of M.
I can use with a for loop, but I want to use with openMP , so I did:
omp_set_num_threads(3);
#pragma omp parallel shared(M,variable)
{
int id = omp_get_thread_num();
double init = initial_array[id]*variable;
M[id][0] = init;
}
It works fine, but I know that it can cause to deadlock or for bad running time.
I mean what if I had more threads and even a larger M..
what is the correct way to set critical section?
another thing i want to ask is about the initial_array, is it also need to be shared?

This is safe code.
Random access in arrays does not cause any race conditions to other elements in the array. As long as you continue to read and write to unshared elements within the array concurrently, you'll never hit a race condition.
Keep in mind that a read can race with a write depending on the type and size of the element. Your example shows double, and I'd be concerned if you had reads concurrent with write operations on the same element. It is possible for there to be a context switch during a write, but that depends on your arch/platform. Anyways, you aren't doing this but it is worth mentioning.

I don't see any problem with regards to concurrency since you are accessing different parts of the memory (different indices of the array), but the only problem I see is performance hit if your cores have dedicated L1 caches.
In this case there will be a performance hit due to cache coherency, where one updates the index, invalidates others, does a write back etc. For small no of threads/cores not an issue but on threads running on large number of cores it sure it. Because the data your threads running on aren't truly independent, they are read as a block of data in cache (if you are accessing M[0][0], then not only M[0][0] is read into the cache but M[0][0] to M[n][col] where n depends upon the cache block size ). And if the block is large, it might contain more of shared data.

Related

If I make a piece of code in which each thread modifies completely different parts of an array, will that maintain cache coherency?

So I am making some parallel code using OpenMP (but this question should be reasonably applicable to other frameworks), in which I have an array of objects:
std::vector<Body> bodies;
And then I do a little parallel loop to do some things to the bodies. At the start of this parallel section, a team of threads is set up to execute the loop individually. The loop basically uses the values of foo on every Body (apart from the one in question) to update the value of bar on the body in question. So essentially no writes are being done to the values of foo on each body, and the only reads being done on bar are localised to the thread controlling that particular body; in pseudocode, it looks like this:
//create team of threads, and then this section is executed by each thread separately
for each Body i
for each Body j =/= i
i.bar += (j.foo * 2);
end for
end for
My question is whether this will, as I think it should, maintain the coherency of the cache? Because as far as I see it, none of the threads are reaching for things that are being edited by the other threads, so it should be safe, I feel. But This is quite an important point in the report that I need to write on this, so I want to be sure.
Thank you.
The rule is you need synchronization if you have more than one thread and at least one of them is a writer and the threads are accessing the same object. If all of your threads are reading then you do not need any synchronization at all.
With an array/vector if you are writing to it but each thread is writing to its own unique section then you do not need any synchronization either as you are not accessing the same underlying objects (as long as you are not modifying the vector itself like adding or removing elements). The only hazard with this is false sharing. If two threads are working on the different parts of the array but they happen to be on the same cache line then any modification is going to dirty the cache line and both threads will be impacted. This is just a performance impact though and does not lead to undefined behavior.

Thread safety and block write size

Is there a minimal block size that multiple threads can write to in a contiguous block of memory that avoids race conditions or the losing of values?
For example, can 32 threads write to individual elements of this array without affecting the other values?
int array[32];
How about this?
bool array[32];
How about an object that stores simple true/false into a bit array?
I'm guessing there is some block write size or cache related functionality that would come into play that determines this. Is that correct? And is there anything standard/safe with regards to this size(platform defines, etc)?
Is there a minimal block size that multiple threads can write to in a contiguous block of memory that avoids race conditions or the losing of values?
No. A conflict (and potential data race) only occurs if two threads access the same memory location (that is, the same byte). Two threads accessing different objects won't conflict.
For example, can 32 threads write to individual elements of this array without affecting the other values?
Yes, each element has its own memory location, so two threads accessing different elements won't conflict.
How about this?
Yes; again, each bool has its own location.
How about an object that stores simple true/false into a bit array?
No; that would pack multiple values into a larger element, and two threads accessing the same larger element would conflict.
I'm guessing there is some block write size or cache related functionality that would come into play that determines this.
There could be a performance impact (known as false sharing) when multiple threads access the same cache line; but the language guarantees that it won't affect the program's correctness as long as they don't access the same memory location.
There is no garuntee in standard. If you need exclusive access to element you may use std::atomic.i.e. You may use like:
std::vector<std::atomic<int> > array;
Otherwise you are always free to use std::mutex.
can 32 threads write to individual elements of this array without affecting the other values?
You are free to do this provided that one thread does interfare with other. i.e thread i modifies the value of array[i] ONLY.

What is faster in CUDA: global memory write + __threadfence() or atomicExch() to global memory?

Assuming that we have lots of threads that will access global memory sequentially, which option performs faster in the overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes but the writes are coalesced. In the other hand atomicExch() takes into account just the important memory addresses but I don't know if the writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x] , value);
Thanks.
On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at a time. If such conflict would occur, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point to use any atomic... function.
__threadfence() delays the current thread (and only the current thread!) to ensure that any subsequent writes by given thread do actually happen later.
As such, __threadfence() on its own, without any follow-up code is not very interesting.
For that reason, I don't think there is a point to compare the efficiency of those two. Maybe if you could show a bit more concrete use case I could relate...
Note, that neither of those actually give you any guarantees on the actual order of execution of the threads.

Concurrent writes in the same global memory location

I have several blocks, each having some integers in a shared memory array of size 512. How can I check if the array in every block contains a zero as an element?
What I am doing is creating an array that resides in the global memory. The size of this array depends on the number of blocks, and it is initialized to 0. Hence every block writes to a[blockid] = 1 if the shared memory array contains a zero.
My problem is when I have several threads in a single block writing at the same time. That is, if the array in the shared memory contains more than one zero, then several threads will write a[blockid] = 1. Would this generate any problem?
In other words, would it be a problem if 2 threads write the exact same value to the exact same array element in global memory?
For a CUDA program, if multiple threads in a warp write to the same location then the location will be updated but it is undefined how many times the location is updated (i.e. how many actual writes occur in series) and it is undefined which thread will write last (i.e. which thread will win the race).
For devices of compute capability 2.x, if multiple threads in a warp write to the same address then only one thread will actually perform the write, which thread is undefined.
From the CUDA C Programming Guide section F.4.2:
If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
See also section 4.1 of the guide for more info.
In other words, if all threads writing to a given location write the same value, then it is safe.
In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.
If this is a problem, then a better approach (from a correctness point of view), would be to have only one thread from each block do the global write. You can either use a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be. The more zeroes there are, the more attractive the reduction will be. CUDA includes warp level __any() and __all() operators which can be built into a very efficient boolean reduction in a few lines of code.
Yes, it will be a problem called as Race Condition.
You should consider synchronizing access to the global data through process Semaphores
While not a mutex or semaphore, CUDA does contain a synchronization primative you can utilize for serializing access to a given code segment or memory location. Through the __syncthreads() function, you can create a barrier so that any given thread blocks at the point of the command call until all the threads in a given block have executed the __syncthreads() command. That way you can hopefully serialize access to your memory location and avoid a situation where two threads need to write to the same memory location at the same time. The only warning is that all the threads have to at some point execute __syncthreads(), or else you will end up with a dead-lock situation. So don't place the call inside some conditional if-statement where some threads may never execute the command. If you do approach your problem like this, there will need to be some provision made for the threads that don't initially call __syncthreads() to call the function later in order to avoid deadlock.

Any issues with large numbers of critical sections?

I have a large array of structures, like this:
typedef struct
{
int a;
int b;
int c;
etc...
}
data_type;
data_type data[100000];
I have a bunch of separate threads, each of which will want to make alterations to elements within data[]. I need to make sure that no to threads attempt to access the same data element at the same time. To be precise: one thread performing data[475].a = 3; and another thread performing data[475].b = 7; at the same time is not allowed, but one thread performing data[475].a = 3; while another thread performs data[476].a = 7; is allowed. The program is highly speed critical. My plan is to make a separate critical section for each data element like so:
typedef struct
{
CRITICAL_SECTION critsec;
int a;
int b;
int c;
etc...
}
data_type;
In one way I guess it should all work and I should have no real questions, but not having had much experience in multithreaded programming I am just feeling a little uneasy about having so many critical sections. I'm wondering if the sheer number of them could be creating some sort of inefficiency. I'm also wondering if perhaps some other multithreading technique could be faster? Should I just relax and go ahead with plan A?
With this many objects, most of their critical sections will be unlocked, and there will be almost no contention. As you already know (other comment), critical sections don't require a kernel-mode transition if they're unowned. That makes critical sections efficient for this situation.
The only other consideration would be whether you would want the critical sections inside your objects or in another array. Locality of reference is a good reason to put the critical sections inside the object. When you've entered the critical section, an entire cacheline (e.g. 16 or 32 bytes) will be in memory. With a bit of padding, you can make sure each object starts on a cacheline. As a result, the object will be (partially) in cache once its critical section is entered.
Your plan is worth trying, but I think you will find that Windows is unhappy creating that many Critical Sections. Each CS contains some kernel handle(s) and you are using up precious kernel space. I think, depending on your version of Windows, you will run out of handle memory and InitializeCriticalSection() or some other function will start to fail.
What you might want to do is have a pool of CSs available for use, and store a pointer to the 'in use' CS inside your struct. But then this gets tricky quite quickly and you will need to use Atomic operations to set/clear the CS pointer (to atomically flag the array entry as 'in use'). Might also need some reference counting, etc...
Gets complicated.
So try your way first, and see what happens. We had a similar situation once, and we had to go with a pool, but maybe things have changed since then.
Depending on the data member types in your data_type structure (and also depending on the operations you want to perform on those members), you might be able to forgo using a separate synchronization object, using the Interlocked functions instead.
In your sample code, all the data members are integers, and all the operations are assignments (and presumably reads), so you could use InterlockedExchange() to set the values atomically and InterlockedCompareExchange() to read the values atomically.
If you need to use non-integer data member types, or if you need to perform more complex operations, or if you need to coordinate atomic access to more than one operation at a time (e.g., read data[1].a and then write data[1].b), then you will have to use a synchronization object, such as a CRITICAL_SECTION.
If you must use a synchronization object, I recommend that you consider partitioning your data set into subsets and use a single synchronization object per subset. For example, you might consider using one CRITICAL_SECTION for each span of 1000 elements in the data array.
You could also consider MUTEX.
This is nice method.
Each client could reserve the resource by itself with mutex (mutual-exclusion).
This is more common, some libraries also support this with threads.
Read about boost::thread and it's mutexes
With Your approach:
data_type data[100000];
I'd be afraid of stack overflow, unless You're allocating it at the heap.
EDIT:
Boost::MUTEX
uses win32 Critical Sections
As others have pointed out, yes there is an issue and it is called too fine-grained locking.. it's resource wasteful and even though the chances are small you will start creating a lot of backing primitives and data when the things do get an occasional, call it longer than usual or whatever, contention. Plus you are wasting resources as it is not really a trivial data structure as for example in VM impls..
If I recall correctly you will have a higher chance of a SEH exception from that point onwards on Win32 or just higher memory usage. Partitioning and pooling them is probably the way to go but it is a more complex implementation. Paritioning on something else (re:action) and expecting some short-lived contention is another way to deal with it.
In any case, it is a problem of resource management with what you have right now.