OpenMP critical section only if data race exists vs lock? - c++

My question is somewhat similar to this one: How to use lock in OpenMP?
in the sense that their answers sort of answer my question but not well enough.
I'm trying to implement a simple work-stealing scheduler in OpenMP (from scratch).
Let's say I have an array of some object, say int. I have multiple threads which will manipulate the entries of this array, in no particular order. I would like to make sure that no two threads try to access the same element of the array at the same time. However, I am allowing the threads to access the same element, as long as the accesses are not simultaneous. Also, I am allowing the threads to access the array simultaneously, as long as each thread wishes to access a different entry of the array during this time. I could use a critical section, as in the following:
int array[1000];
#pragma omp parallel
{
bool flag = true;
while(flag){
int x = rand()%1000;
#pragma omp critical
{
array[x] = some_function(array[x]);
if (some_condition(array[x])){
flag = false;
}
}
}
}
This code creates some threads and the threads randomly access and manipulate entries of the array until some stopping condition that kills the thread. This code works fine, since the critical section ensures that no two threads will ever write to the array at the same time (in case they generated the same value of x). However, if at some time two threads do not happen to generate the same value of x, the critical section is redundant, as they threads are not accessing the same entry. Is there a way to make it so that a thread will stall if and only if the value of x it generated is the same as a thread that is currently also using x? Right now, this code is inefficient, and basically serial, even if every thread happens to generate a different value of x. I want to make it so they stall only if there is a collision.
Perhaps what I am looking for are locks ,but I am not sure. Are critical sections not the right way to go here?

I meant something like this:
#include <stdlib.h>
#include <omp.h>
int main()
{
int array[1000];
omp_lock_t locks[1000];
for (int i = 0; i < 1000; i++)
omp_init_lock(&locks[i]);
#pragma omp parallel
{
bool flag = true;
while(flag){
int x = rand()%1000;
omp_set_lock(&locks[x]);
array[x] = some_function(array[x]);
if (some_condition(array[x])){
flag = false;
}
omp_unset_lock(&locks[x]);
}
}
for (int i = 0; i < 1000; i++)
omp_destroy_lock(&locks[i]);
}

Related

Why the following program does not mix the output when mutex is not used?

I have made multiple runs of the program. I do not see that the output is incorrect, even though I do not use the mutex. My goal is to demonstrate the need of a mutex. My thinking is that different threads with different "num" values will be mixed.
Is it because the objects are different?
using VecI = std::vector<int>;
class UseMutexInClassMethod {
mutex m;
public:
VecI compute(int num, VecI veci)
{
VecI v;
num = 2 * num -1;
for (auto &x:veci) {
v.emplace_back(pow(x,num));
std::this_thread::sleep_for(std::chrono::seconds(1));
}
return v;
}
};
void TestUseMutexInClassMethodUsingAsync()
{
const int nthreads = 5;
UseMutexInClassMethod useMutexInClassMethod;
VecI vec{ 1,2,3,4,5 };
std::vector<std::future<VecI>> futures(nthreads);
std::vector<VecI> outputs(nthreads);
for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
futures[i] = std::async(&UseMutexInClassMethod::compute,
&useMutexInClassMethod,
i,vec
);
}
for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
outputs[i] = futures[i].get();
for (auto& x : outputs[i])
cout << x << " ";
cout << endl;
}
}
If you want an example that does fail with a high degree of certainty you can look at the below. It sets up a variable called accumulator to be shared by reference to all the futures. This is what is missing in your example. You are not actually sharing any memory. Make sure you understand the difference between passing by reference and passing by value.
#include <vector>
#include <memory>
#include <thread>
#include <future>
#include <iostream>
#include <cmath>
#include <mutex>
struct UseMutex{
int compute(std::mutex & m, int & num)
{
for(size_t j = 0;j<1000;j++)
{
///////////////////////
// CRITICAL SECTIION //
///////////////////////
// this code currently doesn't trigger the exception
// because of the lock on the mutex. If you comment
// out the single line below then the exception *may*
// get called.
std::scoped_lock lock{m};
num++;
std::this_thread::sleep_for(std::chrono::nanoseconds(1));
num++;
if(num%2!=0)
throw std::runtime_error("bad things happened");
}
return 0;
}
};
template <typename T> struct F;
void TestUseMutexInClassMethodUsingAsync()
{
const int nthreads = 16;
int accumulator=0;
std::mutex m;
std::vector<UseMutex> vs{nthreads};
std::vector<std::future<int>> futures(nthreads);
for (auto i = 0; i < nthreads; ++i) {
futures[i]= std::async([&,i](){return vs[i].compute(m,accumulator);});
}
for(auto i = 0; i < nthreads; ++i){
futures[i].get();
}
}
int main(){
TestUseMutexInClassMethodUsingAsync();
}
You can comment / uncomment the line
std::scoped_lock lock{m};
which protects the increment of the shared variable num. The rule for this mini program is that at the line
if(num%2!=0)
throw std::runtime_error("bad things happened");
num should be a multiple of two. But as multiple threads are accessing this variable without a lock you can't guarantee this. However if you add a lock around the double increment and test then you can be sure no other thread is accessing this memory during the duration of the increment and test.
Failing
https://godbolt.org/z/sojcs1WK9
Passing
https://godbolt.org/z/sGdx3x3q3
Of course the failing one is not guaranteed to fail but I've set it up so that it has a high probability of failing.
Notes
[&,i](){return vs[i].compute(m,accumulator);};
is a lambda or inline function. The notation [&,i] means it captures everything by reference except i which it captures by value. This is important because i changes on each loop iteration and we want each future to get a unique value of i
Is it because the objects are different?
Yes.
Your code is actually perfectly thread safe, no need for mutex here. You never share any state between threads except for copying vec from TestUseMutexInClassMethodUsingAsync to compute by std::async (and copying is thread-safe) and moving computation result from compute's return value to futures[i].get()'s return value. .get() is also thread-safe: it blocks until the compute() method terminates and then returns its computation result.
It's actually nice to see that even a deliberate attempt to get a race condition failed :)
You probably have to fully redo your example to demonstrate is how simultaneous* access to a shared object breaks things. Get rid of std::async and std::future, use simple std::thread with capture-by-reference, remove sleep_for (so both threads do a lot of operations instead of one per second), significantly increase number of operations and you will get a visible race. It may look like a crash, though.
* - yes, I'm aware that "wall-clock simulateneous access" does not exist in multithreaded systems, strictly speaking. However, it helps getting a rough idea of where to look for visible race conditions for demonstration purposes.
Comments have called out the fact that just not protecting a critical section does not guarantee that the risked behavior actually occurs.
That also applies for multiple runs, because while you are not allowed to test a few times and then rely on the repeatedly observed behavior, it is likely that optimization mechanisms cause a likely enough reoccurring observation as to be perceived has reproducible.
If you intend to demonstrate the need for synchronization you need to employ synchronization to poise things to a near guaranteed misbehavior of observable lack of protection.
Allow me to only outline a sequence for that, with a few assumptions on scheduling mechanisms (this is based on a rather simple, single core, priority based scheduling environment I have encountered in an embedded environment I was using professionally), just to give an insight with a simplified example:
start a lower priority context.
optionally set up proper protection before entering the critical section
start critical section, e.g. by outputting the first half of to-be-continuous output
asynchronously trigger a higher priority context, which is doing that which can violate your critical section, e.g. outputs something which should not be in the middle of the two-part output of the critical section
(in protected case the other context is not executed, in spite of being higher priority)
(in unprotected case the other context is now executed, because of being higher priority)
end critical section, e.g. by outputting the second half of the to-be-continuous output
optionally remove the protection after leaving the critical section
(in protected case the other context is now executed, now that it is allowed)
(in unprotected case the other context was already executed)
Note:
I am using the term "critical section" with the meaning of a piece of code which is vulnerable to being interrupted/preempted/descheduled by another piece of code or another execution of the same code. Specifically for me a critical section can exist without applied protection, though that is not a good thing. I state this explicitly because I am aware of the term being used with the meaning "piece of code inside applied protection/synchronization". I disagree but I accept that the term is used differently and requires clarification in case of potential conflicts.

Sampling conditional distribution OpenMP

I have a function which draws a random sample:
Sample sample();
I have a function which checks weather a sample is valid:
bool is_valid(Sample s);
This simulates a conditional distribution. Now I want a lot of valid samples (most samples will not be valid).
So I want to parallelize this code with openMP
vector<Sample> valid_samples;
while(valid_samples.size() < n) {
Sample s = sample();
if(is_valid(s)) {
valid_samples.push_back(s);
}
}
How would I do this? Most of the OpenMP code I found were simple for loops where the number of iterations is determined in the beginning.
The sample() function has a
thread_local std::mt19937_64 gen([](){
std::random_device d;
std::uniform_int_distribution<int> dist(0,10000);
return dist(d);
}());
as a random number generator. Is is valid and thread save if I assume that my device has a source of randomness? Are there better solutions?
You may employ OpenMP task parallelism. The simplest solution would be to define a task as a single sample insertion:
vector<Sample> valid_samples(n); // need to be resized to allow access in parallel
void insert_ith(size_t i)
{
do {
valid_samples[i] = sample();
} while (!is_valid(valid_samples[i]));
}
#pragma omp parallel
{
#pragma omp single
{
for (size_t i = 0; i < n; i++)
{
#pragma omp task
insert_ith(i);
}
}
}
Note that there might be performance issues with such single-task-single-insertion mapping. First, there would be false sharing involved, but likely worse, tasking management has some overhead which might be significant for very small tasks. In such a case, a remedy is simple — instead of a single insertion per tasks, insert multiple items at once, such as 100. Usually, a suitable number is a trade-off: the lower creates more tasks = more overhead, the higher may result in worse load balancing.
You need to take care of the critical section in your code, which is the insertion into the answer vector
something like this should work (haven't compiled because functions and types are not given)
// create vector before parallel section because it shall be shared
vector<Sample> valid_samples(n); // set initial size to avoid reallocation
int reached_count = 0;
#pragma omp parallel shared(valid_samples, n, reached_count)
{
while(reached_count < n) { // changed this, see comments for discussion
Sample s = sample(); // I assume this to be thread indepent
if(is_valid(s)) {
#pragma omp critical
{
// check condition again, another thread might have already
// reached maximum number
if(reached_count < n) {
valid_samples.push_back(s);
reached_count = valid_samples.size();
}
}
}
}
}
note that neither sample() nor isvalid(s) are inside of the critical section because I assume these functions to be far more expensive than the vector insertion or the acceptance is very rare
If that is not the case, you could work with indepent local vectors and merge in the end, but that would only gain a significant benefit if you reduce the number of synchronization in some way, like giving a fixed number of iterations (at least for a large part)

Spinning thread barrier using Atomic Builtins

I'm trying to implement a spinning thread barrier using atomics, specifically __sync_fetch_and_add. https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html
I basically want an alternative to the pthread barrier. I'm using Ubuntu on a system that can run about a hundred threads in parallel.
int bar = 0; //global variable
int P = MAX_THREADS; //number of threads
__sync_fetch_and_add(&bar,1); //each thread comes and adds atomically
while(bar<P){} //threads spin until bar increments to P
bar=0; //a thread sets bar=0 to be used in the next spinning barrier
This does not work for obvious reasons (a thread may set bar=0, and another thread gets stuck in an infinite while loop etc). I saw an implementation here: Writing a (spinning) thread barrier using c++11 atomics, however it seems too complex and I think its performance might be worse than a pthread barrier.
This implementation is also expected to produce more traffic within the memory hierarchy due to bar's cache line being ping-ponged among threads.
Any ideas on how to use these atomic instructions to make a simple barrier? A communication-optimal scheme would also be helpful additionally.
Instead of spinning on the counter of the threads, it is better to spin on the number of the barries passed, which will be incremented only by the last thread, faced the barrier. Such way you also reduce memory cache pressure, as spinning variable is now updated only by single thread.
int P = MAX_THREADS;
int bar = 0; // Counter of threads, faced barrier.
volatile int passed = 0; // Number of barriers, passed by all threads.
void barrier_wait()
{
int passed_old = passed; // Should be evaluated before incrementing *bar*!
if(__sync_fetch_and_add(&bar,1) == (P - 1))
{
// The last thread, faced barrier.
bar = 0;
// *bar* should be reseted strictly before updating of barriers counter.
__sync_synchronize();
passed++; // Mark barrier as passed.
}
else
{
// Not the last thread. Wait others.
while(passed == passed_old) {};
// Need to synchronize cache with other threads, passed barrier.
__sync_synchronize();
}
}
Note, that you need to use volatile modificator for spinning variable.
C++ code could be somewhat faster than C one, as it can use acquire/release memory barriers instead of the full one, which is the only barrier available from __sync functions:
int P = MAX_THREADS;
std::atomic<int> bar = 0; // Counter of threads, faced barrier.
std::atomic<int> passed = 0; // Number of barriers, passed by all threads.
void barrier_wait()
{
int passed_old = passed.load(std::memory_order_relaxed);
if(bar.fetch_add(1) == (P - 1))
{
// The last thread, faced barrier.
bar = 0;
// Synchronize and store in one operation.
passed.store(passed_old + 1, std::memory_order_release);
}
else
{
// Not the last thread. Wait others.
while(passed.load(std::memory_order_relaxed) == passed_old) {};
// Need to synchronize cache with other threads, passed barrier.
std::atomic_thread_fence(std::memory_order_acquire);
}
}

Forking once and then use the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
#pragma omp single
{
for (int t=0; t<1000; t++) {
evolve();
}
}
}
}
void evolve() {
#pragma omp parallel for num_threads(2)
for (int i=0; i<100; i++) {
do_stuff(i);
}
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way to much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work. I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get a different behavior, only the thread which executes the single directive is used for the loop in evolve()
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
// Each thread calls evolve() a thousand times
for (int t=0; t<1000; t++) {
evolve();
}
}
}
void evolve() {
// The orphaned construct inside evolve()
// will bind to the innermost parallel region
#pragma omp for
for (int i=0; i<100; i++) {
do_stuff(i);
} // Implicit thread synchronization
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
#pragma omp parallel
#pragma omp single
bar();
}
void bar() {
#pragma omp parallel
printf("...)";
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the your actual implementation of OpenMP used and the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal, if an implementation ignored the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn on and off nesting, if the implementation supports it.
In your case, things by chance went well, since you sent all threads to sleep except one. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern to create a parallel team, have one thread executing the call stacks, and then distribute work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution will be OpenMP tasks.
Cheers,
-michael
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.

c++ OpenMP critical: "one-way" locking?

Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this threadsafe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist( int A )
{
static std::map<int, float> my_map;
std::map::iterator it_find = my_map.find(A); //many threads do this often.
bool found_A = it_find != my_map.end();
if (found_A)
{
return it_find->second;
}
else
{
float result_for_A = calculate_value(A); //should only be done once, really.
my_map[A] = result_for_A;
return result_for_A;
}
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical ?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.
According to this link on Stack Overflow inserting into an std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid this else clause being operated on twice with the same value of A and then inserted twice.
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int,float> &theMap, int i)
{
std::map<int,float>::iterator ite = theMap.find(i);
if(ite == theMap.end())
{
theMap[i] = 3.14 * i * i;
}
}
Then called like this:
std::map<int,float> myMap;
int i;
#pragma omp parallel for
for(i=1;i<=100000;++i)
{
testFunc(myMap,i % 100);
}
if(myMap.size() != 100)
{
std::cout << "Problem!" << std::endl;
}
Edit: edited to correct error in earler version.
OpenMP is a compiler "tool" for automatic loop parallelization, not a thread communication or synchronization library; so it doesn't have sophisticated mutexes, like a read/write mutex: acquire lock on writing, but not on reading.
Here's an implementation example.
Anyway Chris A.'s answer is better than mine though :)
While #ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Then, any section with that name is considered the same critical section. If this is what you would like to do, you can easily cause onyl small portions of your method to be critical.
#pragma omp critical map_protect
{
std::map::iterator it_find = my_map.find(A); //many threads do this often.
bool found_A = it_find != my_map.end();
}
...
#pragma omp critical map_protect
{
float result_for_A = calculate_value(A); //should only be done once, really.
my_map[A] = result_for_A;
}
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes a write to a memory location (lvalue in the expression preceded by the directive) to always be atomic.
flush ensures that any memory expected to be available to all threads is actually written to all threads, not stored in a processor cache and unavailable where it should be.