C++ OpenMP critical: "one-way" locking?

Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this threadsafe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist( int A )
{
    static std::map<int, float> my_map;
    std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
    bool found_A = it_find != my_map.end();
    if (found_A)
    {
        return it_find->second;
    }
    else
    {
        float result_for_A = calculate_value(A); //should only be done once, really.
        my_map[A] = result_for_A;
        return result_for_A;
    }
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical ?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.

According to this link on Stack Overflow, inserting into a std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid the else clause being executed twice for the same value of A (and the result being inserted twice).
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int,float> &theMap, int i)
{
    std::map<int,float>::iterator ite = theMap.find(i);
    if(ite == theMap.end())
    {
        theMap[i] = 3.14 * i * i;
    }
}
Then called like this:
std::map<int,float> myMap;
int i;
#pragma omp parallel for
for(i=1;i<=100000;++i)
{
    testFunc(myMap, i % 100);
}
if(myMap.size() != 100)
{
    std::cout << "Problem!" << std::endl;
}
Edit: edited to correct an error in an earlier version.

OpenMP is a compiler "tool" for automatic loop parallelization, not a thread communication or synchronization library, so it doesn't have sophisticated mutexes such as a read/write mutex (one that locks for writing but allows concurrent reads).
Here's an implementation example.
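For a rough idea of what such a construct buys you (only a sketch, not the code behind that link): with a C++17 std::shared_mutex, many readers can search the map at the same time and only the rare insert needs exclusive access. calculate_value is assumed from the question.

#include <map>
#include <mutex>
#include <shared_mutex>

float calculate_value(int A); // from the question

float get_stored_value__or__calculate_if_does_not_yet_exist(int A)
{
    static std::map<int, float> my_map;
    static std::shared_mutex mtx;

    {   // shared (read) lock: many threads may search the map concurrently
        std::shared_lock<std::shared_mutex> read_lock(mtx);
        auto it = my_map.find(A);
        if (it != my_map.end())
            return it->second;
    }

    // exclusive (write) lock for the rare insert; re-check because another
    // thread may have inserted A after we released the shared lock
    std::unique_lock<std::shared_mutex> write_lock(mtx);
    auto it = my_map.find(A);
    if (it == my_map.end())
        it = my_map.emplace(A, calculate_value(A)).first;
    return it->second;
}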
Anyway Chris A.'s answer is better than mine though :)

While #ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Any section with the same name is then treated as the same critical section. If this is what you would like to do, you can easily make only small portions of your method critical.
#pragma omp critical (map_protect)
{
    std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
    bool found_A = it_find != my_map.end();
}
...
#pragma omp critical (map_protect)
{
    float result_for_A = calculate_value(A); //should only be done once, really.
    my_map[A] = result_for_A;
}
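Putting the pieces together, a sketch of the whole function with a named critical section could look like this (assuming calculate_value from the question). Note the re-check inside the second critical section, so the expensive calculation runs at most once per key even if two threads miss at the same time:

#include <map>

float calculate_value(int A); // from the question

float get_stored_value__or__calculate_if_does_not_yet_exist(int A)
{
    static std::map<int, float> my_map;
    float result = 0.0f;
    bool found = false;

    #pragma omp critical (map_protect)
    {
        std::map<int, float>::iterator it = my_map.find(A);
        if (it != my_map.end()) { result = it->second; found = true; }
    }
    if (found)
        return result;

    #pragma omp critical (map_protect)
    {
        // re-check: another thread may have inserted A in the meantime
        std::map<int, float>::iterator it = my_map.find(A);
        if (it != my_map.end()) {
            result = it->second;
        } else {
            result = calculate_value(A);
            my_map[A] = result;
        }
    }
    return result;
}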
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes a write to a memory location (the lvalue in the expression following the directive) to always be performed atomically.
flush ensures that memory expected to be visible to all threads is actually written back, rather than sitting in a processor cache where other threads cannot see it.
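For instance, a toy use of the atomic directive (unrelated to the map problem) could look like this:

int hits = 0;
#pragma omp parallel for
for (int i = 0; i < 1000; ++i) {
    #pragma omp atomic   // the read-modify-write on hits is performed atomically
    hits++;
}
// hits == 1000 here; the implicit barrier at the end of the loop also flushes memory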


Why does the following program not mix the output when a mutex is not used?

I have made multiple runs of the program. I do not see that the output is incorrect, even though I do not use the mutex. My goal is to demonstrate the need for a mutex. My thinking was that output from different threads, with different "num" values, would get mixed.
Is it because the objects are different?
// includes needed to compile this snippet
#include <vector>
#include <thread>
#include <future>
#include <mutex>
#include <cmath>
#include <chrono>
#include <iostream>
using namespace std;

using VecI = std::vector<int>;

class UseMutexInClassMethod {
    mutex m;
public:
    VecI compute(int num, VecI veci)
    {
        VecI v;
        num = 2 * num - 1;
        for (auto &x : veci) {
            v.emplace_back(pow(x, num));
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        return v;
    }
};

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 5;
    UseMutexInClassMethod useMutexInClassMethod;
    VecI vec{ 1,2,3,4,5 };
    std::vector<std::future<VecI>> futures(nthreads);
    std::vector<VecI> outputs(nthreads);
    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        futures[i] = std::async(&UseMutexInClassMethod::compute,
                                &useMutexInClassMethod,
                                i, vec
        );
    }
    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        outputs[i] = futures[i].get();
        for (auto& x : outputs[i])
            cout << x << " ";
        cout << endl;
    }
}
If you want an example that does fail with a high degree of certainty, you can look at the one below. It sets up a variable called accumulator that is shared by reference across all the futures. This is what is missing in your example: you are not actually sharing any memory. Make sure you understand the difference between passing by reference and passing by value.
#include <vector>
#include <memory>
#include <thread>
#include <future>
#include <iostream>
#include <cmath>
#include <mutex>
#include <chrono>

struct UseMutex{
    int compute(std::mutex & m, int & num)
    {
        for(size_t j = 0; j < 1000; j++)
        {
            //////////////////////
            // CRITICAL SECTION //
            //////////////////////
            // this code currently doesn't trigger the exception
            // because of the lock on the mutex. If you comment
            // out the single line below then the exception *may*
            // get called.
            std::scoped_lock lock{m};
            num++;
            std::this_thread::sleep_for(std::chrono::nanoseconds(1));
            num++;
            if(num % 2 != 0)
                throw std::runtime_error("bad things happened");
        }
        return 0;
    }
};

template <typename T> struct F;

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 16;
    int accumulator = 0;
    std::mutex m;
    std::vector<UseMutex> vs{nthreads};
    std::vector<std::future<int>> futures(nthreads);
    for (auto i = 0; i < nthreads; ++i) {
        futures[i] = std::async([&,i](){ return vs[i].compute(m, accumulator); });
    }
    for (auto i = 0; i < nthreads; ++i) {
        futures[i].get();
    }
}

int main(){
    TestUseMutexInClassMethodUsingAsync();
}
You can comment / uncomment the line
std::scoped_lock lock{m};
which protects the increment of the shared variable num. The rule for this mini program is that at the line
if(num%2!=0)
throw std::runtime_error("bad things happened");
num should be a multiple of two. But as multiple threads are accessing this variable without a lock you can't guarantee this. However if you add a lock around the double increment and test then you can be sure no other thread is accessing this memory during the duration of the increment and test.
Failing
https://godbolt.org/z/sojcs1WK9
Passing
https://godbolt.org/z/sGdx3x3q3
Of course the failing one is not guaranteed to fail but I've set it up so that it has a high probability of failing.
Notes
[&,i](){return vs[i].compute(m,accumulator);};
is a lambda or inline function. The notation [&,i] means it captures everything by reference except i, which it captures by value. This is important because i changes on each loop iteration and we want each future to get a unique value of i.
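As a toy illustration of that capture rule (hypothetical code, not from the question): capturing i by value gives each callable its own copy, while capturing it by reference would leave all of them referring to the same, changing variable.

#include <functional>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::function<int()>> fs;
    for (int i = 0; i < 3; ++i) {
        fs.push_back([&, i] { return i; });  // each lambda keeps its own copy of i
        // fs.push_back([&]  { return i; }); // dangerous: all would refer to the loop's i
    }
    for (auto& f : fs)
        std::cout << f() << " ";             // prints: 0 1 2
    std::cout << "\n";
}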
Is it because the objects are different?
Yes.
Your code is actually perfectly thread safe, no need for mutex here. You never share any state between threads except for copying vec from TestUseMutexInClassMethodUsingAsync to compute by std::async (and copying is thread-safe) and moving computation result from compute's return value to futures[i].get()'s return value. .get() is also thread-safe: it blocks until the compute() method terminates and then returns its computation result.
It's actually nice to see that even a deliberate attempt to get a race condition failed :)
You probably have to fully redo your example to demonstrate how simultaneous* access to a shared object breaks things. Get rid of std::async and std::future, use plain std::thread with capture-by-reference, remove the sleep_for (so both threads do a lot of operations instead of one per second), significantly increase the number of operations, and you will get a visible race. It may look like a crash, though.
* - yes, I'm aware that "wall-clock simultaneous access" does not exist in multithreaded systems, strictly speaking. However, it helps to get a rough idea of where to look for visible race conditions for demonstration purposes.
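A minimal sketch of such a rework (hypothetical code, not derived from the question): with the lock line commented out, the printed count is almost always short of the expected value on a multi-core machine.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main()
{
    long counter = 0;
    std::mutex m;                                  // used only if you uncomment the lock below
    auto work = [&] {
        for (int i = 0; i < 1000000; ++i) {
            // std::lock_guard<std::mutex> lock(m); // uncomment to fix the race
            ++counter;                             // unsynchronized read-modify-write
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(work);
    for (auto& th : threads)
        th.join();
    std::cout << counter << " (expected 4000000)\n";
}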
Comments have called out the fact that just not protecting a critical section does not guarantee that the risked behavior actually occurs.
That also applies to multiple runs: you are not really allowed to test a few times and then rely on the repeatedly observed behavior, yet optimization mechanisms can make the same observation recur often enough to be perceived as reproducible.
If you intend to demonstrate the need for synchronization, you need to employ synchronization to set things up so that the lack of protection almost certainly produces observable misbehavior.
Allow me to only outline a sequence for that, with a few assumptions about scheduling mechanisms (this is based on a rather simple, single-core, priority-based scheduling environment I encountered professionally in an embedded setting), just to give an insight with a simplified example:
start a lower priority context.
optionally set up proper protection before entering the critical section
start critical section, e.g. by outputting the first half of to-be-continuous output
asynchronously trigger a higher priority context, which is doing that which can violate your critical section, e.g. outputs something which should not be in the middle of the two-part output of the critical section
(in protected case the other context is not executed, in spite of being higher priority)
(in unprotected case the other context is now executed, because of being higher priority)
end critical section, e.g. by outputting the second half of the to-be-continuous output
optionally remove the protection after leaving the critical section
(in protected case the other context is now executed, now that it is allowed)
(in unprotected case the other context was already executed)
Note:
I am using the term "critical section" with the meaning of a piece of code which is vulnerable to being interrupted/preempted/descheduled by another piece of code or another execution of the same code. Specifically for me a critical section can exist without applied protection, though that is not a good thing. I state this explicitly because I am aware of the term being used with the meaning "piece of code inside applied protection/synchronization". I disagree but I accept that the term is used differently and requires clarification in case of potential conflicts.

How can I write code that reuses threads with C++ OpenMP?

I have a function f in which I can use parallel processing. For this purpose, I used OpenMP.
However, this function is called many times, and it seems that threads are created on every call.
How can I reuse the threads?
void f(X &src, Y &dest) {
    ... // do processing based on "src"
    #pragma omp parallel for
    for (...) {
    }
    ... // put output into "dest"
}

int main() {
    ...
    for(...) { // It is impossible to make this loop itself parallel.
        f(...);
    }
    ...
    return 0;
}
OpenMP implements a thread pool internally; it tries to reuse threads unless you change some of its settings in between or use different application threads to call parallel regions while others are still active.
One can verify that the threads are indeed the same by using thread-local variables. I'd recommend verifying your claim that the threads are recreated. The OpenMP runtime does lots of smart optimizations beyond the obvious thread-pool idea; you just need to know how to tune and control it properly.
While it is unlikely that threads are recreated, it is easy to see how threads can have gone to sleep by the time you call the parallel region again, and waking them up takes a noticeable amount of time. You can prevent threads from going to sleep by using OMP_WAIT_POLICY=active and/or implementation-specific environment variables such as KMP_BLOCKTIME=infinite (for the Intel/LLVM runtimes).
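A quick way to check this yourself (a sketch, assuming the usual case where OpenMP workers are ordinary OS threads): a thread_local counter only keeps growing across calls if the same threads are being reused.

#include <cstdio>
#include <omp.h>

thread_local int regions_entered = 0;

void f()
{
    #pragma omp parallel num_threads(4)
    {
        ++regions_entered;
        #pragma omp critical
        std::printf("thread %d has now entered %d parallel region(s)\n",
                    omp_get_thread_num(), regions_entered);
    }
}

int main()
{
    for (int call = 0; call < 3; ++call)
        f();   // with thread reuse, each thread's counter reaches 3
}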
This is just in addition to Anton's correct answer. If you are really concerned about the issue, for most programs you can easily move the parallel region to the outside and keep the serial work serial, as follows:
void f(X &src, Y &dest) {
    // You can also do simple computations
    // without side effects outside of the single section
    #pragma omp single
    {
        ... // do processing based on "src"
    }
    #pragma omp for // note parallel is missing
    for (...) {
    }
    #pragma omp critical
    ... // each thread puts its own part of the output into "dest"
}

int main() {
    ...
    // make sure to declare loop variable locally or explicitly private
    #pragma omp parallel
    for (type variable; ...; ...) {
        f(...);
    }
    ...
    return 0;
}
Use this only if you have measured evidence that you are suffering from the overhead of reopening parallel regions. You may have to juggle shared variables, or manually inline f, because all variables declared within f will be private; how this looks in detail depends on your specific application.

OpenMP critical section only if data race exists vs lock?

My question is somewhat similar to this one: How to use lock in OpenMP?
in the sense that their answers sort of answer my question but not well enough.
I'm trying to implement a simple work-stealing scheduler in OpenMP (from scratch).
Let's say I have an array of some object, say int. I have multiple threads which will manipulate the entries of this array, in no particular order. I would like to make sure that no two threads try to access the same element of the array at the same time. However, I am allowing the threads to access the same element, as long as the accesses are not simultaneous. Also, I am allowing the threads to access the array simultaneously, as long as each thread wishes to access a different entry of the array during this time. I could use a critical section, as in the following:
int array[1000];
#pragma omp parallel
{
    bool flag = true;
    while(flag){
        int x = rand() % 1000;
        #pragma omp critical
        {
            array[x] = some_function(array[x]);
            if (some_condition(array[x])){
                flag = false;
            }
        }
    }
}
This code creates some threads, and the threads randomly access and manipulate entries of the array until some stopping condition kills each thread. This code works, since the critical section ensures that no two threads will ever write to the array at the same time (in case they generated the same value of x). However, whenever two threads do not happen to generate the same value of x, the critical section is redundant, as the threads are not accessing the same entry. Is there a way to make a thread stall if and only if the value of x it generated is the same as the x another thread is currently using? Right now this code is inefficient, and basically serial, even when every thread happens to generate a different value of x. I want the threads to stall only if there is a collision.
Perhaps what I am looking for are locks, but I am not sure. Are critical sections not the right way to go here?
You can use an array of OpenMP locks, one per array element, something like this:
#include <stdlib.h>
#include <omp.h>

int main()
{
    int array[1000];
    omp_lock_t locks[1000];          // one lock per array element

    for (int i = 0; i < 1000; i++)
        omp_init_lock(&locks[i]);

    #pragma omp parallel
    {
        bool flag = true;
        while(flag){
            int x = rand() % 1000;
            omp_set_lock(&locks[x]); // blocks only if another thread holds element x
            array[x] = some_function(array[x]);
            if (some_condition(array[x])){
                flag = false;
            }
            omp_unset_lock(&locks[x]);
        }
    }

    for (int i = 0; i < 1000; i++)
        omp_destroy_lock(&locks[i]);
}

Why is std::mutex faster than std::atomic?

I want to put objects into a std::vector from multiple threads, so I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?
I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.
#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread>
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while(sync.exchange(true)) {
                std::this_thread::yield();
            };
            first_result[counter] = counter;
            sync.store(false);
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex;
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    return 0;
}
I don't think your question can be answered referring only to the standard; mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.
Mutexes are not slow. You may have seen some articles that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right comparison; these are not interchangeable.
Spin locks are very fast when they are held for a relatively short amount of time: acquiring them is very cheap, but other threads that are also trying to lock stay active the whole time (spinning constantly in a loop).
A custom spin-lock could be implemented this way:
class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
        : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while(_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear();
    }
};
A mutex is a much more complicated primitive. In particular, on Windows we have two such primitives: a Critical Section, which works on a per-process basis, and a Mutex, which doesn't have that limitation.
Locking a mutex (or critical section) is much more expensive, but the OS can really put other waiting threads to "sleep", which improves performance and helps the task scheduler manage resources efficiently.
Why do I write this? Because modern mutexes are often so-called "hybrid mutexes": when such a mutex is contended, it first behaves like a normal spin-lock (other waiting threads perform some number of "spins"), and only then is the heavy mutex locked, to avoid wasting resources.
In your case, the mutex is locked in each loop iteration to perform this instruction:
second_result[counter] = counter;
It looks like a fast operation, so the "real" mutex may never be locked. That means that in this case your "mutex" can be as fast as the atomic-based solution (because it effectively becomes an atomic-based solution itself).
Also, in the first solution you implemented some kind of spin-lock-like behaviour, but I am not sure this behaviour is predictable in a multi-threaded environment. Make sure the "locking" operation has at least acquire semantics and the unlocking operation release semantics; relaxed memory ordering would be too weak for this use case.
I edited the code to be more compact and correct. It uses std::atomic_flag, which is the only type (unlike the std::atomic<> specializations) that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).
Also, referring to the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behavior. For example, the Boost library implements spinlock::lock() as follows:
void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}
source: boost/smart_ptr/detail/spinlock_std_atomic.hpp
Where detail::yield() is (Win32 version):
inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}
[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]
First, the thread spins a fixed number of times (4 in this case). If the mutex is still locked, the pause instruction is used (if available) or Sleep(0) is called, which basically causes a context switch and allows the scheduler to give another blocked thread a chance to do something useful. Then Sleep(1) is called to perform an actual (short) sleep. Very nice!
Also, this statement:
The purpose of a spinlock is busy waiting
is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement lock primitive, but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's use of _mm_pause() as a method of yielding inside lock()):
In the spin-wait loop, the pause intrinsic improves the speed at which
the code detects the release of the lock and provides especially
significant performance gain.
So, implementations like
void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); }
may not be as good as it seems.
There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potential) modification of a memory location (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that require the cache line holding the lock to be in the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a load/read only, and try to acquire the lock only when there is a chance that the operation will succeed.
A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++
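A sketch of a test-and-test-and-set ("TTAS") spinlock following that advice: spin on a plain load, and only attempt the expensive exchange once the lock looks free.

#include <atomic>

class TTASSpinLock
{
    std::atomic<bool> locked{false};
public:
    void lock()
    {
        for (;;)
        {
            // read-only spin: keeps the cache line in the shared state
            while (locked.load(std::memory_order_relaxed))
            { /* optionally pause/yield here */ }
            // the lock looked free: now try to actually take it
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }
};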

Forking once and then use the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];

int main() {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            for (int t=0; t<1000; t++) {
                evolve();
            }
        }
    }
}

void evolve() {
    #pragma omp parallel for num_threads(2)
    for (int i=0; i<100; i++) {
        do_stuff(i);
    }
}

void do_stuff(int i) {
    // expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way too much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work; I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get a different behavior: only the thread which executes the single directive is used for the loop in evolve().
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];

int main() {
    #pragma omp parallel num_threads(2)
    {
        // Each thread calls evolve() a thousand times
        for (int t=0; t<1000; t++) {
            evolve();
        }
    }
}

void evolve() {
    // The orphaned construct inside evolve()
    // will bind to the innermost parallel region
    #pragma omp for
    for (int i=0; i<100; i++) {
        do_stuff(i);
    } // Implicit thread synchronization
}

void do_stuff(int i) {
    // expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
    #pragma omp parallel
    #pragma omp single
    bar();
}

void bar() {
    #pragma omp parallel
    printf("...");
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the actual OpenMP implementation used and the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal for an implementation to ignore the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn nesting on and off, if the implementation supports it.
In your case, things went well by chance, since you sent all threads but one to sleep. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern to create a parallel team, have one thread executing the call stacks, and then distribute work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution will be OpenMP tasks.
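For completeness, a minimal sketch of that task-based pattern (reusing the evolve()/do_stuff() structure from the question): one thread walks the time loop inside single and creates tasks, and the whole team executes them.

void do_stuff(int i) {
    // expensive calculation on array element A[i], as in the question
}

void evolve() {
    for (int i = 0; i < 100; i++) {
        #pragma omp task firstprivate(i)
        do_stuff(i);
    }
    #pragma omp taskwait   // finish this time step before starting the next
}

int main() {
    #pragma omp parallel num_threads(2)
    #pragma omp single
    {
        for (int t = 0; t < 1000; t++) {
            evolve();   // tasks are executed by all threads in the team
        }
    }
}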
Cheers,
-michael
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.