I have a function memoize that reads and writes a static std::map, e.g.:
int memoize(int i)
{
    static std::map<int,int> memoizer;
    const auto iterator = memoizer.find(i);
    if (iterator == memoizer.end())
        return memoizer[i] = i + 1;
    else
        return iterator->second;
}
memoize is called by other functions which are called by other functions which are called.... by functions in main. Now if in main I have something like
#pragma omp parallel for
for (int i = 0; i < 1000; i++)
    f(i); // function calling memoize
for (int i = 0; i < 1000; i++)
    g(i); // function calling memoize
then I have a problem in the first loop since std::map is not thread-safe. I am trying to figure out a way to lock the static map only if OpenMP is used (thus only for the first loop). I would rather avoid having to rewrite all functions to take an extra omp_lock_t argument.
What's the best way to achieve that? Hopefully with the least amount of macros possible.
A very simple solution to your problem would be to protect both the read and the update parts of your code with a critical OpenMP directive. Of course, to reduce the risk of unwanted interaction/collision with some other critical you might already have somewhere else in your code, you should give it a name to identify it clearly. And should your OpenMP implementation permit it (i.e. the version of the standard being high enough), and if you have the corresponding knowledge, you can also add a hint about how much contention you expect for this update among the threads.
Then, I would suggest you check whether this simple implementation gives you sufficient performance, meaning whether the impact of the critical directive isn't too big. If you're happy with the performance, then just leave it as is. If not, then come back with a more precise question and we'll see what can be done.
For the simple solution, the code could look like this (with here a hint suggesting that high contention is expected, which is only an example for your sake, not a true expectation of mine):
int memoize( int i ) {
    static std::map<int,int> memoizer;
    bool found = false;
    int value;
    #pragma omp critical( memoize ) hint( omp_sync_hint_contended )
    {
        const auto it = memoizer.find( i );
        if ( it != memoizer.end() ) {
            value = it->second;
            found = true;
        }
    }
    if ( !found ) {
        // here, we didn't find i in memoizer, so we compute it
        value = compute_actual_value_for_memoizer( i );
        #pragma omp critical( memoize ) hint( omp_sync_hint_contended )
        memoizer[i] = value;
    }
    return value;
}
In general, it's not thread-safe to access the same instance of std::map from different threads.
But could it be thread-safe under the following conditions:
no more elements are added to/removed from the instance of std::map after it has been initialized
the type of the values of the std::map is std::atomic<T>
Here is the demo code:
#include <atomic>
#include <chrono>
#include <thread>
#include <map>
#include <vector>
#include <iostream>
class Demo{
public:
    Demo()
    {
        mp_.insert(std::make_pair(1, true));
        mp_.insert(std::make_pair(2, true));
        mp_.insert(std::make_pair(3, true));
    }
    int Get(const int& integer, bool& flag)
    {
        const auto itr = mp_.find(integer);
        if( itr == mp_.end())
        {
            return -1;
        }
        else
        {
            flag = itr->second;
            return 0;
        }
    }
    int Set(const int& integer, const bool& flag)
    {
        const auto itr = mp_.find(integer);
        if( itr == mp_.end())
        {
            return -1;
        }
        else
        {
            itr->second = flag;
            return 0;
        }
    }
private:
    std::map<int, std::atomic<bool>> mp_;
};
int main()
{
    Demo demo;
    std::vector<std::thread> vec;
    vec.push_back(std::thread([&demo](){
        while(true)
        {
            for(int i=0; i<9; i++)
            {
                bool cur_flag = false;
                if(demo.Get(i, cur_flag) == 0)
                {
                    demo.Set(i, !cur_flag);
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(1000));
            }
        }
    }));
    vec.push_back(std::thread([&demo](){
        while(true)
        {
            for(int i=0; i<9; i++)
            {
                bool cur_flag = false;
                if(demo.Get(i, cur_flag) == 0)
                {
                    std::cout << "(" << i << "," << cur_flag << ")" << std::endl;
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        }
    }));
    for(auto& thread : vec)
    {
        thread.join();
    }
}
What's more, ThreadSanitizer (the -fsanitize=thread option) does not report anything either.
Yes, this is safe.
Data races are best thought of as unsynchronized conflicting access (potential concurrent reads and writes).
std::thread construction imposes an order: the actions which preceded in code are guaranteed to come before the thread starts. So the map is completely populated before the concurrent accesses.
The library rules say that a standard library function may only access the object itself, the function arguments, and the required properties of any container elements. std::map::find is non-const, but the standard requires that, for the purposes of data races, it is treated as const. Operations on iterators are required to at most access (but not modify) the container. So the concurrent accesses to the std::map are all non-modifying.
That leaves the load and store from the std::atomic<bool> which is race-free as well.
This should avoid data-race UB, since you aren't mutating the std::map data structure at all after starting either thread. And given the limited ways you're modifying the atomic values of the map, that's also safe.
You didn't provide accessor functions that would allow atomic RMW, so the only things you can do are .store(flag) and .load() with the default memory_order_seq_cst. Not .exchange or .compare_exchange_weak, or ^= 1 to flip.
Your if(Get) Set isn't equivalent to a flag ^= 1; atomic flip of that flag (although that doesn't compile as efficiently as one might like: Efficient way to toggle an atomic_bool).
If another thread was also flipping the same flag, they could step on each other. e.g. 10 total flips should get it back to the original value, but with separate atomic load and store, both threads could load the same value and then store the same value, resulting in only one flip for 2 (or many more) if(Get) Set operations across multiple threads.
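If concurrent flips matter, a CAS retry loop would make the flip atomic. Here is a minimal sketch of a hypothetical Flip member for the Demo class above (my addition, not part of the original code):
Hypothetical member of Demo: atomically flip one flag.
    // Returns -1 if the key is absent, 0 on success.
    int Flip(const int& integer)
    {
        const auto itr = mp_.find(integer);
        if (itr == mp_.end())
            return -1;
        bool expected = itr->second.load();
        // On failure, expected is reloaded with the current value, so we
        // always swap in the negation of what is actually stored.
        while (!itr->second.compare_exchange_weak(expected, !expected))
        {
        }
        return 0;
    }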
Of course, if you don't have multiple threads writing the same flag, it is more efficient to separately load and store like you're doing.
Especially if you avoid the default memory_order_seq_cst, e.g. provide a std::memory_order ord = std::memory_order_seq_cst optional arg to your accessor functions, like std::atomic functions do. SC stores on most ISAs are more costly. (Especially on x86, where even mo_release is free in asm, but mo_seq_cst needs a full barrier.)
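As a sketch of that suggestion, the Get accessor from the question could take the order as a defaulted parameter (my variant, not the original code):
    int Get(const int& integer, bool& flag,
            std::memory_order ord = std::memory_order_seq_cst)
    {
        const auto itr = mp_.find(integer);
        if (itr == mp_.end())
            return -1;
        flag = itr->second.load(ord);  // e.g. pass std::memory_order_acquire
        return 0;
    }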
(But as discussed on Efficient way to toggle an atomic_bool, std::atomic<bool> has no portable way to atomically flip it: ^= 1 only exists for the integer atomics. atomic<uint8_t> may be a better choice, taking the low bit as the actual boolean value, with fetch_xor(1, ord) as the form that takes a memory_order arg.)
Since you need to be able to return failure, perhaps return a pointer to the std::atomic<bool> objects, with NULL indicating not found. That would let the caller use any std::atomic member function. But it does make misuse possible, holding onto the reference for too long.
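Such an accessor could look something like this (a hypothetical Find member, my naming):
    // Sketch: expose the atomic itself; nullptr means "not found".
    // The caller can then use any std::atomic member function with any
    // memory_order, but must not hold the pointer past the map's lifetime.
    std::atomic<bool>* Find(const int& integer)
    {
        const auto itr = mp_.find(integer);
        return (itr == mp_.end()) ? nullptr : &itr->second;
    }
Usage would then be e.g. if (auto* f = demo.Find(2)) f->store(true, std::memory_order_release);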
C++ compilers emit warnings when a local variable may be uninitialized on first use. However, sometimes I know that the variable will always be written before being used, so I do not need to initialize it. When I do this, the compiler emits a warning, of course. Since my team is building with -Werror, the code will not compile. How can I turn off this warning for specific local variables? I have the following restrictions:
I am not allowed to change compiler flags
The solution must work on all compilers (i.e., no gnu-extensions or other compiler specific attributes)
I want to use this only on specific local variables. Other uninitialized locals should still trigger a warning
The solution should not generate any instructions.
I cannot alter the class of the local variable. I.e., I cannot simply add a "do nothing" constructor.
Of course, the easiest solution would be to initialize the variable. However, the variable is of a type that is costly to initialize (even default initialization is costly) and the code is used in a very hot loop, so I do not want to waste the CPU cycles for an initialization that is guaranteed to be overwritten before it is read anyway.
So is there a platform-independent, compiler-independent way of telling the compiler that a local variable does not need to be initialized?
Here is some example code that might trigger such a warning:
void foo(){
    T t;
    for(int i = 0; i < 100; i++){
        if (i == 0) t = ...;
        if (i == 1) doSomethingWith(t);
    }
}
As you see, the first loop cycle initializes t and the second one uses it, so t will never be read uninitialized. However, the compiler is not able to deduce this, so it emits a warning. Note that this code is quite simplified for the sake of brevity.
My answer recommends another approach: instead of disabling the warning, just reformulate the implementation a bit. I see two options:
First Option
You can use pointers instead of a real object and guarantee that it will be initialized just when you need it, something like:
std::unique_ptr<T> t;
for(int i = 0; i < 100; i++)
{
    if(i == 0) { if(!t) t.reset(new T); *t = ...; }
    if(i == 1) { if(!t) t.reset(new T); doSomethingWith(*t); }
}
It's interesting to note that probably when i==0, you don't need to construct t using the default constructor. I can't guess how your operator= is implemented, but I suppose that you are probably assigning an object that's already allocated in the code that you are omitting in the ... segment.
Second Option
As your code experiences such a huge performance loss, I can infer that T is never a basic type (ints, floats, etc). So, instead of using pointers, you can reimplement your class T so that it uses an init method and avoids doing the heavy initialization in the constructor. You can use a boolean to indicate whether the class still needs initialization:
class FooClass
{
public:
    FooClass() : initialized_(false){ ... }
    //Class implementation
    void init()
    {
        //Do your heavy initialization code here.
        initialized_ = true;
    }
    bool initialized() const { return initialized_; }
private:
    bool initialized_;
};
Then you will be able to write it like this:
T t;
for(int i = 0; i < 100; i++)
{
    if(i == 0) { if(!t.initialized()) t.init(); t = ...; }
    if(i == 1) { if(!t.initialized()) t.init(); doSomethingWith(t); }
}
If the code is not very complex, I usually unroll one of the iterations:
void foo(){
    T t;
    t = ...;
    for(int i = 1; i < 100; i++){
        doSomethingWith(t);
    }
}
I'm trying to come up with an equivalent replacement for an Intel TBB parallel_for loop that uses a tbb::blocked_range, using OpenMP. Digging around online, I've only managed to find mention of one other person doing something similar: a patch submitted to the Open Cascade project, wherein the TBB loop appeared as follows (but did not use a tbb::blocked_range):
tbb::parallel_for_each (aFaces.begin(), aFaces.end(), *this);
and the OpenMP equivalent was:
int i, n = aFaces.size();
#pragma omp parallel for private(i)
for (i = 0; i < n; ++i)
    Process (aFaces[i]);
Here is the TBB loop I'm trying to replace:
tbb::parallel_for( tbb::blocked_range<size_t>( 0, targetList.size() ), DoStuff( targetList, data, vec, ptr ) );
It uses the DoStuff class to carry out the work:
class DoStuff
{
private:
    List& targetList;
    Data* data;
    vector<things>& vec;
    Worker* ptr;
public:
    DoStuff( List& pass_targetList,
             Data* pass_data,
             vector<things>& pass_vec,
             Worker* pass_worker )
        : targetList(pass_targetList), data(pass_data), vec(pass_vec), ptr(pass_worker)
    {
    }
    void operator() ( const tbb::blocked_range<size_t>& range ) const
    {
        for ( size_t idx = range.begin(); idx != range.end(); ++idx )
        {
            ptr->PerformWork(&targetList[idx], data->getData(), &vec);
        }
    }
};
My understanding based on this reference is that TBB will divide the blocked range into smaller subsets and give each thread one of the ranges to loop through. Since each thread gets its own copy of the DoStuff object, which holds a bunch of references and pointers, the threads essentially end up sharing those underlying resources.
Here's what I've come up with as an equivalent replacement in OpenMP:
int index = 0;
#pragma omp parallel for private(index)
for (index = 0; index < targetList.size(); ++index)
{
    ptr->PerformWork(&targetList[index], data->getData(), &vec);
}
Because of circumstances outside of my control (this is merely one component in a much larger system that spans 5+ computers), stepping through the code with a debugger to see exactly what's happening is... unlikely. I'm working on getting remote debugging going, but it's not looking very promising. All I know for sure is that the above OpenMP code is somehow doing something different than TBB was, and the expected results after calling PerformWork for each index are not obtained.
Given the information above, does anyone have any ideas on why the OpenMP and TBB code are not functionally equivalent?
Following Ben and Rick's advice, I tested the following loop without the omp pragma (serially) and obtained my expected results (very slowly). After adding the pragma back in, the parallel code also performs as expected. Looks like the problem was either declaring index as private outside of the loop, or evaluating targetList.size() inside the loop condition. Or both.
int numTargets = targetList.size();
#pragma omp parallel for
for (int index = 0; index < numTargets; ++index)
{
ptr->PerformWork(&targetList[index], data->getData(), &vec);
}
I have some code as follows (simplified):
for( int i = 0; i < input.rows; i++ )
{
    if( IsGoodMatch(input[i]) )
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        output.push_back( newValues );
    }
}
This code works well, but if I want to make it parallel using omp parallel for, I get an error on output.push_back, and it seems the memory gets corrupted during the vector resize.
What is the problem and how can I fix it?
How can I make sure only one thread inserting a new item into vector at any time?
The simple answer is that std::vector::push_back is not thread-safe.
In order to safely do this in parallel you need to synchronize in order to ensure that push_back isn't called from multiple threads at the same time.
Synchronization in C++11 can easily be achieved by using an std::mutex.
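For example, a minimal sketch using the question's loop (the lock placement here is just one reasonable choice; only the insertion is serialized, the matching still runs in parallel):
#include <mutex>

std::mutex output_mutex;

#pragma omp parallel for
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        // Serialize only the push_back; the lock is released at the
        // end of the if block.
        std::lock_guard<std::mutex> lock(output_mutex);
        output.push_back(newValues);
    }
}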
std::vector's push_back cannot guarantee correct behavior when called concurrently like you are doing now (there is no thread-safety).
However since the elements don't depend on each other, it would be very reasonable to resize the vector and modify elements inside the loop separately:
output.resize(input.rows);
int k = 0;
#pragma omp parallel for shared(k, input)
for( int i = 0; i < input.rows; i++ )
{
    if( IsGoodMatch(input[i]) )
    {
        Newvalues newValues;
        ...
        // ! prevent other threads from modifying k here !
        output[k] = newValues;
        k++;
        // ! allow other threads to modify k again !
    }
}
output.resize(k);
since direct access using operator[] doesn't depend on other members of std::vector, which might otherwise cause inconsistencies between the threads. However, this solution still needs explicit synchronization (i.e. a mechanism such as a mutex) to ensure that each thread uses a correct, unique value of k.
"How can I make sure only one thread inserting a new item into vector at any time?"
You don't need to. Threads will be modifying different elements (that reside in different parts of memory). You just need to make sure that the element each thread tries to modify is the correct one.
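One way to get that correct, unique value of k without a full mutex is OpenMP's atomic capture (my sketch, building on the resize approach above; note that the order of the results is non-deterministic):
output.resize(input.rows);
int k = 0;
#pragma omp parallel for shared(k, input, output)
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        int my_slot;
        // Atomically claim the next free slot of output.
        #pragma omp atomic capture
        my_slot = k++;
        output[my_slot] = newValues;
    }
}
output.resize(k);  // trim the unused tail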
Use concurrent vector
#include <concurrent_vector.h>
Concurrency::concurrent_vector<int> is a thread-safe version of vector. Note that it is not part of standard C++11; it ships with Microsoft's Parallel Patterns Library (Intel TBB provides the equivalent tbb::concurrent_vector).
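For example (a sketch assuming Microsoft's PPL header; with TBB you would use tbb::concurrent_vector the same way):
#include <concurrent_vector.h>

Concurrency::concurrent_vector<Newvalues> output;

#pragma omp parallel for
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        output.push_back(newValues);  // push_back is safe to call concurrently
    }
}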
Put a #pragma omp critical before the push_back.
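That is, in the question's loop (sketch):
#pragma omp critical
output.push_back(newValues);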
I solved a similar problem by deriving from the standard std::vector class just to implement an atomic_push_back method, suitable for the OpenMP paradigm.
Here is my "OpenMP-safe" vector implementation:
template <typename T>
class omp_vector : public std::vector<T>
{
private:
    omp_lock_t lock;
public:
    omp_vector()
    {
        omp_init_lock(&lock);
    }
    ~omp_vector()
    {
        omp_destroy_lock(&lock);
    }
    void atomic_push_back(T const &p)
    {
        omp_set_lock(&lock);
        std::vector<T>::push_back(p);
        omp_unset_lock(&lock);
    }
};
of course you have to include omp.h. Then your code could be as follows:
omp_vector<...> output;
#pragma omp parallel for shared(input, output)
for( int i = 0; i < input.rows; i++ )
{
    if( IsGoodMatch(input[i]) )
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        output.atomic_push_back( newValues );
    }
}
If you still need the output vector somewhere else in a non-parallel section of the code, you could just use the normal push_back method.
You can try to use a mutex to fix the problem.
Usually I prefer to implement such a thing myself:
#include <atomic>

static std::atomic<int> mutex(1);

void signal(std::atomic<int>& x)
{
    x.fetch_add(1);  // release the lock
}

void wait(std::atomic<int>& x)
{
    // Spin until we atomically take the lock (1 -> 0). Note that a plain
    // int here would itself be a data race, so an atomic is required.
    int expected = 1;
    while (!x.compare_exchange_weak(expected, 0))
        expected = 1;
}
for( int i = 0; i < input.rows; i++ )
{
    if( IsGoodMatch(input[i]) )
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        wait(mutex);
        output.push_back( newValues );
        signal(mutex);
    }
}
Hope this helps.
I have some old code that uses qsort to sort an MFC CArray of structures but am seeing the occasional crash that may be down to multiple threads calling qsort at the same time. The code I am using looks something like this:
struct Foo
{
    CString str;
    time_t t;
    Foo(LPCTSTR lpsz, time_t ti) : str(lpsz), t(ti)
    {
    }
};
class Sorter
{
public:
    static void DoSort();
    static int __cdecl SortProc(const void* elem1, const void* elem2);
};
...
void Sorter::DoSort()
{
    CArray<Foo*, Foo*> data;
    for (int i = 0; i < 100; i++)
    {
        Foo* foo = new Foo("some string", 12345678);
        data.Add(foo);
    }
    qsort(data.GetData(), data.GetCount(), sizeof(Foo*), SortProc);
    ...
}
int __cdecl Sorter::SortProc(const void* elem1, const void* elem2)
{
    Foo* foo1 = (Foo*)elem1;
    Foo* foo2 = (Foo*)elem2;
    // 0xC0000005: Access violation reading location blah here
    return (int)(foo1->t - foo2->t);
}
...
Sorter::DoSort();
I am about to refactor this horrible code to use std::sort instead but wondered if the above is actually unsafe?
EDIT: Sorter::DoSort is actually a static function but uses no static variables itself.
EDIT2: The SortProc function has been changed to match the real code.
Your problem doesn't necessarily have anything to do with thread safety.
The sort callback function takes pointers to each item, not the item itself. Since you are sorting Foo*, what you actually want to do is access the parameters as Foo**, like this:
int __cdecl SortProc(const void* elem1, const void* elem2)
{
    Foo* foo1 = *(Foo**)elem1;
    Foo* foo2 = *(Foo**)elem2;
    if (foo1->t < foo2->t) return -1;
    else if (foo1->t > foo2->t) return 1;
    else return 0;
}
Your SortProc isn't returning correct results, and this likely leads to memory corruption by something assuming that the data is, well, sorted after you get done sorting it. You could even be leading qsort into corruption as it tries to sort, but that of course varies by implementation.
The comparison function for qsort must return negative if the first object is less than the second, zero if they are equal, and positive otherwise. Your current code only ever returns 0 or 1, and returns 1 when you should be returning negative.
int __cdecl Sorter::SortProc(const void* ap, const void* bp) {
    Foo const& a = **(Foo* const*)ap;
    Foo const& b = **(Foo* const*)bp;
    if (a.t == b.t) return 0;
    return (a.t < b.t) ? -1 : 1;
}
C++ doesn't really make any guarantees about thread safety. About the most you can say is that either multiple readers OR a single writer to a data structure will be OK. Any combination of readers and writers, and you need to serialise the access somehow.
Since you tagged your question with the MFC tag, I suppose you should select the Multi-threaded Runtime Library in Project Settings.
Right now, your code is thread-safe, but useless, as the DoSort method only uses local variables and doesn't even return anything. If the data you are sorting is a member of Sorter, then it is not safe to call the function from multiple threads. In general, read up on reentrancy; this may give you an idea of what you need to look out for.
What makes it thread-safe is whether your objects are thread-safe: to make a qsort call thread-safe, you must ensure that everything that reads from or writes to the objects being sorted is itself thread-safe.
The pthreads man page lists the standard functions which are not required to be thread-safe. qsort is not among them, so it is required to be thread-safe in POSIX.
http://www.kernel.org/doc/man-pages/online/pages/man7/pthreads.7.html
I can't find the equivalent list for Windows, though, so this isn't really an answer to your question. I'd be a bit surprised if it was different.
Be aware what "thread-safe" means in this context, though. It means you can call the same function concurrently on different arrays -- it doesn't mean that concurrent access to the same data via qsort is safe (it isn't).
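If you really do need to sort the same array from several threads, you have to serialize the calls yourself; a minimal sketch (my wrapper, not a standard API):
#include <mutex>
#include <cstdlib>

static std::mutex g_sortMutex;  // guards the one shared array

// qsort may run concurrently on *different* arrays, but concurrent
// calls on the *same* array must be serialized by the caller.
void SortSharedArray(void* base, size_t count, size_t size,
                     int (__cdecl *cmp)(const void*, const void*))
{
    std::lock_guard<std::mutex> lock(g_sortMutex);
    qsort(base, count, size, cmp);
}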
As a word of warning, you may find std::sort is not as fast as qsort. If you do find that, try std::stable_sort.
I once wrote a BWT compressor based on the code presented by Mark Nelson in Dr. Dobb's, and when I turned it into classes I found that regular sort was a lot slower; stable_sort fixed the speed problems.
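For reference, the std::sort refactor the question mentions could look something like this (a sketch against the question's CArray of Foo*):
#include <algorithm>

// data is the CArray<Foo*, Foo*> from the question; its buffer is
// contiguous, so raw Foo** pointers work as random-access iterators.
Foo** begin = data.GetData();
Foo** end   = begin + data.GetCount();
std::sort(begin, end,
          [](const Foo* a, const Foo* b) { return a->t < b->t; });
// If std::sort turns out slower, try std::stable_sort with the
// same arguments.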