Sampling conditional distribution OpenMP - c++

I have a function which draws a random sample:
Sample sample();
I have a function which checks whether a sample is valid:
bool is_valid(Sample s);
This simulates a conditional distribution. Now I want a lot of valid samples (most samples will not be valid).
So I want to parallelize this code with OpenMP:
vector<Sample> valid_samples;
while (valid_samples.size() < n) {
    Sample s = sample();
    if (is_valid(s)) {
        valid_samples.push_back(s);
    }
}
How would I do this? Most of the OpenMP examples I found were simple for loops where the number of iterations is determined at the beginning.
The sample() function has a
thread_local std::mt19937_64 gen([](){
std::random_device d;
std::uniform_int_distribution<int> dist(0,10000);
return dist(d);
}());
as a random number generator. Is this valid and thread safe if I assume that my device has a source of randomness? Are there better solutions?

You may employ OpenMP task parallelism. The simplest solution would be to define a task as a single sample insertion:
vector<Sample> valid_samples(n); // need to be resized to allow access in parallel

void insert_ith(size_t i)
{
    do {
        valid_samples[i] = sample();
    } while (!is_valid(valid_samples[i]));
}

#pragma omp parallel
{
    #pragma omp single
    {
        for (size_t i = 0; i < n; i++)
        {
            #pragma omp task
            insert_ith(i);
        }
    }
}
Note that there might be performance issues with such a single-task-per-insertion mapping. First, there would be false sharing involved, but likely worse, task management has some overhead which might be significant for very small tasks. In such a case, the remedy is simple: instead of a single insertion per task, insert multiple items at once, such as 100. A suitable chunk size is usually a trade-off: smaller chunks create more tasks and thus more overhead, larger chunks may result in worse load balancing.
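To make the chunked variant concrete, here is a minimal sketch; the chunk size of 100 is only a tuning placeholder, and sample()/is_valid() are the functions from the question:
#include <algorithm> // for std::min

const size_t chunk_size = 100; // tuning parameter, not a prescribed value
vector<Sample> valid_samples(n); // pre-sized so tasks write to disjoint ranges

#pragma omp parallel
{
    #pragma omp single
    {
        for (size_t start = 0; start < n; start += chunk_size)
        {
            size_t end = std::min(start + chunk_size, n);
            #pragma omp task firstprivate(start, end)
            {
                for (size_t i = start; i < end; i++)
                {
                    do {
                        valid_samples[i] = sample();
                    } while (!is_valid(valid_samples[i]));
                }
            }
        }
    }
}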

You need to take care of the critical section in your code, which is the insertion into the result vector.
Something like this should work (I haven't compiled it because the functions and types are not given):
// create the vector before the parallel section because it shall be shared
vector<Sample> valid_samples;
valid_samples.reserve(n); // reserve capacity to avoid reallocation

int reached_count = 0;

#pragma omp parallel shared(valid_samples, n, reached_count)
{
    while (reached_count < n) { // changed this, see comments for discussion
        Sample s = sample(); // I assume this to be thread independent
        if (is_valid(s)) {
            #pragma omp critical
            {
                // check the condition again, another thread might have already
                // reached the maximum number
                if (reached_count < n) {
                    valid_samples.push_back(s);
                    reached_count = valid_samples.size();
                }
            }
        }
    }
}
Note that neither sample() nor is_valid(s) is inside the critical section, because I assume these functions to be far more expensive than the vector insertion, or that acceptance is very rare.
If that is not the case, you could work with independent local vectors and merge them at the end, but that only gains a significant benefit if you reduce the number of synchronizations in some way, for example by giving each thread a fixed number of iterations (at least for a large part of the work). A sketch of this idea follows below.
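A rough sketch of that idea, assuming each thread collects valid samples in a thread-local buffer and only enters the critical section once per batch (the batch size of 128 is an arbitrary placeholder); the result may overshoot n slightly and is trimmed at the end:
vector<Sample> valid_samples;
valid_samples.reserve(n);

#pragma omp parallel shared(valid_samples, n)
{
    vector<Sample> local; // thread-private buffer, no locking needed
    const size_t batch_size = 128;
    bool done = false;

    while (!done)
    {
        local.clear();
        while (local.size() < batch_size)
        {
            Sample s = sample();
            if (is_valid(s))
                local.push_back(s);
        }

        #pragma omp critical
        {
            if (valid_samples.size() < n)
                valid_samples.insert(valid_samples.end(), local.begin(), local.end());
            done = valid_samples.size() >= n;
        }
    }
}

if (valid_samples.size() > n)
    valid_samples.resize(n); // trim the overshoot if the exact count matters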


Why does the following program not mix the output when the mutex is not used?

I have made multiple runs of the program, and I do not see that the output is incorrect, even though I do not use the mutex. My goal is to demonstrate the need for a mutex. My thinking is that the output of different threads, with different "num" values, will be mixed.
Is it because the objects are different?
using VecI = std::vector<int>;

class UseMutexInClassMethod {
    mutex m;
public:
    VecI compute(int num, VecI veci)
    {
        VecI v;
        num = 2 * num - 1;
        for (auto &x : veci) {
            v.emplace_back(pow(x, num));
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        return v;
    }
};
void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 5;
    UseMutexInClassMethod useMutexInClassMethod;
    VecI vec{ 1,2,3,4,5 };
    std::vector<std::future<VecI>> futures(nthreads);
    std::vector<VecI> outputs(nthreads);

    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        futures[i] = std::async(&UseMutexInClassMethod::compute,
                                &useMutexInClassMethod,
                                i, vec);
    }

    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        outputs[i] = futures[i].get();
        for (auto &x : outputs[i])
            cout << x << " ";
        cout << endl;
    }
}
If you want an example that does fail with a high degree of certainty, you can look at the one below. It sets up a variable called accumulator that is shared by reference with all the futures. This is what is missing in your example. You are not actually sharing any memory. Make sure you understand the difference between passing by reference and passing by value.
#include <vector>
#include <memory>
#include <thread>
#include <future>
#include <iostream>
#include <cmath>
#include <mutex>

struct UseMutex {
    int compute(std::mutex &m, int &num)
    {
        for (size_t j = 0; j < 1000; j++)
        {
            //////////////////////
            // CRITICAL SECTION //
            //////////////////////
            // this code currently doesn't trigger the exception
            // because of the lock on the mutex. If you comment
            // out the single line below then the exception *may*
            // get thrown.
            std::scoped_lock lock{m};
            num++;
            std::this_thread::sleep_for(std::chrono::nanoseconds(1));
            num++;
            if (num % 2 != 0)
                throw std::runtime_error("bad things happened");
        }
        return 0;
    }
};

template <typename T> struct F;

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 16;
    int accumulator = 0;
    std::mutex m;
    std::vector<UseMutex> vs{nthreads};
    std::vector<std::future<int>> futures(nthreads);

    for (auto i = 0; i < nthreads; ++i) {
        futures[i] = std::async([&, i]() { return vs[i].compute(m, accumulator); });
    }
    for (auto i = 0; i < nthreads; ++i) {
        futures[i].get();
    }
}

int main() {
    TestUseMutexInClassMethodUsingAsync();
}
You can comment / uncomment the line
std::scoped_lock lock{m};
which protects the increment of the shared variable num. The rule for this mini program is that at the line
if(num%2!=0)
throw std::runtime_error("bad things happened");
num should be a multiple of two. But as multiple threads are accessing this variable without a lock you can't guarantee this. However if you add a lock around the double increment and test then you can be sure no other thread is accessing this memory during the duration of the increment and test.
Failing
https://godbolt.org/z/sojcs1WK9
Passing
https://godbolt.org/z/sGdx3x3q3
Of course the failing one is not guaranteed to fail but I've set it up so that it has a high probability of failing.
Notes
[&,i](){return vs[i].compute(m,accumulator);};
is a lambda or inline function. The notation [&,i] means it captures everything by reference, except for i, which it captures by value. This is important because i changes on each loop iteration and we want each future to get a unique value of i.
Is it because the objects are different?
Yes.
Your code is actually perfectly thread safe; there is no need for a mutex here. You never share any state between threads, except for copying vec from TestUseMutexInClassMethodUsingAsync to compute by std::async (and copying is thread-safe) and moving the computation result from compute's return value to futures[i].get()'s return value. .get() is also thread-safe: it blocks until the compute() method terminates and then returns its computation result.
It's actually nice to see that even a deliberate attempt to get a race condition failed :)
You probably have to fully redo your example to demonstrate how simultaneous* access to a shared object breaks things. Get rid of std::async and std::future, use a simple std::thread with capture-by-reference, remove sleep_for (so both threads do a lot of operations instead of one per second), significantly increase the number of operations, and you will get a visible race. It may look like a crash, though.
* - yes, I'm aware that "wall-clock simultaneous access" does not exist in multithreaded systems, strictly speaking. However, it helps to get a rough idea of where to look for visible race conditions for demonstration purposes.
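For example, a minimal sketch along those lines (the names and the iteration count are arbitrary): two plain std::threads increment a counter captured by reference, with no lock and no sleep_for, and the final value will usually be well below the expected one:
#include <iostream>
#include <thread>

int main()
{
    long counter = 0;            // shared state, intentionally unprotected
    const long iterations = 1000000;

    auto work = [&counter]() {   // capture by reference: both threads touch the same variable
        for (long i = 0; i < iterations; ++i)
            ++counter;           // data race: load/increment/store is not atomic
    };

    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();

    // Expected 2000000; a smaller number demonstrates the lost updates.
    std::cout << counter << " (expected " << 2 * iterations << ")\n";
}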
Comments have called out the fact that merely failing to protect a critical section does not guarantee that the risked behavior actually occurs.
That also applies to multiple runs: while you are not allowed to test a few times and then rely on the repeatedly observed behavior, it is likely that optimization mechanisms make the observation recur often enough to be perceived as reproducible.
If you intend to demonstrate the need for synchronization, you need to employ synchronization to set things up for a near-guaranteed, observable misbehavior in the absence of protection.
Allow me to only outline a sequence for that, with a few assumptions about the scheduling mechanism (this is based on a rather simple, single-core, priority-based scheduling environment I encountered in an embedded system I was using professionally), just to give some insight via a simplified example:
start a lower priority context.
optionally set up proper protection before entering the critical section
start critical section, e.g. by outputting the first half of to-be-continuous output
asynchronously trigger a higher priority context, which is doing that which can violate your critical section, e.g. outputs something which should not be in the middle of the two-part output of the critical section
(in protected case the other context is not executed, in spite of being higher priority)
(in unprotected case the other context is now executed, because of being higher priority)
end critical section, e.g. by outputting the second half of the to-be-continuous output
optionally remove the protection after leaving the critical section
(in protected case the other context is now executed, now that it is allowed)
(in unprotected case the other context was already executed)
Note:
I am using the term "critical section" with the meaning of a piece of code which is vulnerable to being interrupted/preempted/descheduled by another piece of code or another execution of the same code. Specifically for me a critical section can exist without applied protection, though that is not a good thing. I state this explicitly because I am aware of the term being used with the meaning "piece of code inside applied protection/synchronization". I disagree but I accept that the term is used differently and requires clarification in case of potential conflicts.

OpenMP critical section only if data race exists vs lock?

My question is somewhat similar to this one: How to use lock in OpenMP?
in the sense that their answers sort of answer my question but not well enough.
I'm trying to implement a simple work-stealing scheduler in OpenMP (from scratch).
Let's say I have an array of some object, say int. I have multiple threads which will manipulate the entries of this array, in no particular order. I would like to make sure that no two threads try to access the same element of the array at the same time. However, I am allowing the threads to access the same element, as long as the accesses are not simultaneous. Also, I am allowing the threads to access the array simultaneously, as long as each thread wishes to access a different entry of the array during this time. I could use a critical section, as in the following:
int array[1000];

#pragma omp parallel
{
    bool flag = true;
    while (flag) {
        int x = rand() % 1000;
        #pragma omp critical
        {
            array[x] = some_function(array[x]);
            if (some_condition(array[x])) {
                flag = false;
            }
        }
    }
}
This code creates some threads, and the threads randomly access and manipulate entries of the array until some stopping condition kills the thread. This code works fine, since the critical section ensures that no two threads will ever write to the array at the same time (in case they generated the same value of x). However, whenever two threads do not happen to generate the same value of x, the critical section is redundant, as the threads are not accessing the same entry. Is there a way to make it so that a thread will stall if and only if the value of x it generated is the same as that of a thread that is currently also using x? Right now, this code is inefficient, and basically serial, even if every thread happens to generate a different value of x. I want to make them stall only if there is a collision.
Perhaps what I am looking for are locks, but I am not sure. Are critical sections not the right way to go here?
I meant something like this:
#include <stdlib.h>
#include <omp.h>

int main()
{
    int array[1000];
    omp_lock_t locks[1000];

    for (int i = 0; i < 1000; i++)
        omp_init_lock(&locks[i]);

    #pragma omp parallel
    {
        bool flag = true;
        while (flag) {
            int x = rand() % 1000;
            omp_set_lock(&locks[x]);
            array[x] = some_function(array[x]);
            if (some_condition(array[x])) {
                flag = false;
            }
            omp_unset_lock(&locks[x]);
        }
    }

    for (int i = 0; i < 1000; i++)
        omp_destroy_lock(&locks[i]);
}

Using #pragma omp parallel for makes the program slower

My C++ program takes about 300 s to run.
Inside my program I need to divide my vectors component-wise. The VS analyzer tells me this takes about 15% of the running time. Here is the code:
template <class T>
myVector<T> cWisDivide(myVector<T> &vec1, myVector<T> &vec2)
{
    try
    {
        if (vec1._rows == vec2._rows)
        {
            myVector<T> result(vec1._rows);

            //#pragma omp parallel for
            for (int r = 1; r <= vec1._rows; r++)
            {
                if (vec2(r) != 0)
                {
                    result(r) = vec1(r) / vec2(r);
                }
                else
                {
                    throw std::runtime_error("");
                }
            }
            return result;
        }
    }
    catch (const exception &e)
    {
        ....
    }
}
This function is called many times.
If I use the #pragma ... before the loop, the CPU usage sticks at 100% for about 350 s, which is more than the time it takes to run the program sequentially.
I would appreciate it if anyone could help me with this issue.
This can go wrong in a number of ways:
without knowing the type of result, it's possible that barriers have to be built in to avoid a race condition when modifying it -- you could avoid that by having per-thread result vectors that you merge afterwards.
the copy overhead for the vec1 and vec2 vectors might be bigger than the performance reward.
all in all, this is a question about parallelizable vector types -- refer to your OpenMP documentation of choice to learn more about types that can be accessed in parallel.
Anyway, I just looked it up, and from the OpenMP specification ...
• A throw executed inside a loop region must cause execution to resume within the same iteration of the loop region, and the same thread that threw the exception must catch it.
I knew I didn't like the look of the exception.
OpenMP API V4.0 page 59.
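One way to stay within that rule is to record the error inside the offending iteration and throw only after the parallel loop has finished. A hedged sketch of the division function rewritten that way (myVector, _rows and operator() are taken from the question, so this is untested against the real class):
template <class T>
myVector<T> cWisDivide(myVector<T> &vec1, myVector<T> &vec2)
{
    if (vec1._rows != vec2._rows)
        throw std::runtime_error("size mismatch");

    myVector<T> result(vec1._rows);
    int divide_by_zero = 0;       // shared error flag instead of throwing inside the loop

    #pragma omp parallel for
    for (int r = 1; r <= vec1._rows; r++)
    {
        if (vec2(r) != 0)
        {
            result(r) = vec1(r) / vec2(r);
        }
        else
        {
            #pragma omp atomic write
            divide_by_zero = 1;   // remember the error, keep the iteration exception-free
        }
    }

    if (divide_by_zero)
        throw std::runtime_error("division by zero");
    return result;
}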

Multithreading recursive program c++

I am working on a recursive algorithm which we want to parallelize to improve the performance.
I implemented multithreading using Visual C++ 12.0 and the <thread> library. However, I don't see any performance improvement. The time taken is either less by a few milliseconds or more than the time with a single thread.
Kindly let me know if I am doing something wrong and what corrections I should make to the code.
Here is my code
void nonRecursiveFoo(<className> &data, int first, int last)
{
    // process the data between first and last index and set its value to true based on some condition
    // no threads are created here
}

void recursiveFoo(<className> &data, int first, int last)
{
    int partitionIndex = -1;
    data[first] = true;
    data[last] = true;

    for (int i = first + 1; i < last; i++)
    {
        // some logic setting the index
        if (some condition is true)
            partitionIndex = i;
    }

    // no dependency of partitions on one another and so can be parallelized
    if (partitionIndex != -1)
    {
        data[partitionIndex] = true;

        // assume some thread limit
        if (Commons::GetCurrentThreadCount() < Commons::GetThreadLimit())
        {
            std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
            Commons::IncrementCurrentThreadCount();
            recursiveFoo(data, partitionIndex, last);
            t1.join();
        }
        else
        {
            nonRecursiveFoo(data, first, partitionIndex);
            nonRecursiveFoo(data, partitionIndex, last);
        }
    }
}
//main
int main()
{
recursiveFoo(data,0,data.size-1);
}
//commons
std::mutex threadCountMutex;
static void Commons::IncrementCurrentThreadCount()
{
threadCountMutex.lock();
CurrentThreadCount++;
threadCountMutex.unlock();
}
static int GetCurrentThreadCount()
{
return CurrentThreadCount;
}
static void SetThreadLimit(int count)
{
ThreadLimit = count;
}
static int GetThreadLimit()
{
return ThreadLimit;
}
static int GetMinPointsPerThread()
{
return MinimumPointsPerThread;
}
Without further information (see comments) this is mostly guesswork, but there are a few things you should watch out for:
First of all, make sure that your partitioning logic is very short and fast compared to the processing. Otherwise, you are just creating more work than you gain processing power.
Make sure, there is enough work to begin with or the speedup might be not enough to pay for the additional overhead of thread creation.
Check that your work gets evenly distributed among the different threads and don't spawn more threads than you have cores on your computer (print the number of total threads at the end - don't rely on your ThreadLimit).
Don't let your partitions get too small (especially not smaller than 64 bytes), or you end up with false sharing.
It would be MUCH more efficient to implement CurrentThreadCount as a std::atomic<int>, in which case you don't need a mutex.
Put the increment of the counter before the creation of the thread. Otherwise, the newly created thread might read the counter before it is incremented and spawn a new thread again, even if the max number of threads is already reached (This is still not a perfect solution, but I would only invest more time on this if you have verified, that overcommitting is your actual problem)
If you really must use a mutex (for reasons outside of the example code) you have to use it for every access to CurrentThreadCount (read and write access). Otherwise this is - strictly speaking - a race condition and thus UB.
By using t1.join you're basically waiting for the other thread to finish - i.e. not doing anything in parallel.
By looking at your algorithm I don't see how it can be parallelized (and thus improved) by using threads - you have to wait for a single recursive call to end.
First of all, you are not doing anything in parallel, as every thread creation blocks until the created thread has finished. Hence, your multithreaded code will always be slower than the non-multithreaded version.
In order to parallelize, you could spawn threads for the part where the non-recursive function is called, put the thread handles into a vector, and join at the highest level of the recursion by walking through the vector. (There are more elegant ways to do that, but as a first shot this would be OK, I think.)
Thus, all non-recursive calls will run in parallel. But you should use a different condition than the max number of threads, namely the size of the problem, e.g. last - first < threshold. A rough sketch follows below.
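A rough sketch of how these suggestions could fit together, combining an atomic thread budget (as suggested above) with a problem-size threshold; ClassName and findPartition are placeholders for the question's actual type and partitioning logic, and std::async is used instead of the vector-of-threads bookkeeping for brevity:
#include <atomic>
#include <future>

static std::atomic<int> g_activeThreads{1};   // contexts currently doing work
static const int g_threadLimit = 8;           // e.g. std::thread::hardware_concurrency()
static const int g_sizeThreshold = 1024;      // below this, stay sequential

void recursiveFoo(ClassName &data, int first, int last)
{
    int partitionIndex = findPartition(data, first, last); // the question's partition logic
    if (partitionIndex == -1)
        return;
    data[partitionIndex] = true;

    bool small = (last - first) < g_sizeThreshold;
    // Reserve a slot *before* spawning so two threads cannot both claim the last one.
    if (!small && g_activeThreads.fetch_add(1) < g_threadLimit)
    {
        auto left = std::async(std::launch::async, recursiveFoo,
                               std::ref(data), first, partitionIndex);
        recursiveFoo(data, partitionIndex, last);   // this thread keeps working in parallel
        left.get();
        g_activeThreads.fetch_sub(1);
    }
    else
    {
        if (!small)
            g_activeThreads.fetch_sub(1);           // give the reserved slot back
        recursiveFoo(data, first, partitionIndex);
        recursiveFoo(data, partitionIndex, last);
    }
}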

c++ OpenMP critical: "one-way" locking?

Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this threadsafe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist( int A )
{
    static std::map<int, float> my_map;

    std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
    bool found_A = it_find != my_map.end();

    if (found_A)
    {
        return it_find->second;
    }
    else
    {
        float result_for_A = calculate_value(A); //should only be done once, really.
        my_map[A] = result_for_A;
        return result_for_A;
    }
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical ?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.
According to this link on Stack Overflow, inserting into a std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid the else clause being executed twice with the same value of A and the result being inserted twice.
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int, float> &theMap, int i)
{
    std::map<int, float>::iterator ite = theMap.find(i);
    if (ite == theMap.end())
    {
        theMap[i] = 3.14 * i * i;
    }
}
Then called like this:
std::map<int, float> myMap;
int i;

#pragma omp parallel for
for (i = 1; i <= 100000; ++i)
{
    testFunc(myMap, i % 100);
}

if (myMap.size() != 100)
{
    std::cout << "Problem!" << std::endl;
}
Edit: edited to correct an error in the earlier version.
OpenMP is a compiler "tool" for automatic loop parallelization, not a thread communication or synchronization library, so it doesn't have sophisticated mutexes such as a read/write mutex (acquire the lock for writing, but not for reading).
Here's an implementation example.
Anyway, Chris A.'s answer is better than mine, though :)
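For reference, if C++17 is available, one way to get the "lock on write, not on read" behaviour outside of OpenMP is std::shared_mutex. A hedged sketch of the cached-lookup function from the question (the second find after taking the exclusive lock avoids computing the same A twice):
#include <map>
#include <mutex>
#include <shared_mutex>

float calculate_value(int A); // assumed to exist, as in the question

float get_stored_value__or__calculate_if_does_not_yet_exist(int A)
{
    static std::map<int, float> my_map;
    static std::shared_mutex map_mutex;

    {
        std::shared_lock<std::shared_mutex> read_lock(map_mutex); // many readers at once
        auto it = my_map.find(A);
        if (it != my_map.end())
            return it->second;
    }

    std::unique_lock<std::shared_mutex> write_lock(map_mutex);    // exclusive for insertion
    auto it = my_map.find(A);   // re-check: another thread may have inserted A meanwhile
    if (it == my_map.end())
        it = my_map.emplace(A, calculate_value(A)).first;
    return it->second;
}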
While #ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Then, any section with that name is considered the same critical section. If this is what you would like to do, you can easily make only small portions of your method critical.
#pragma omp critical(map_protect)
{
    std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
    bool found_A = it_find != my_map.end();
}

...

#pragma omp critical(map_protect)
{
    float result_for_A = calculate_value(A); //should only be done once, really.
    my_map[A] = result_for_A;
}
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes a write to a memory location (the lvalue in the expression preceded by the directive) to always be atomic.
flush ensures that any memory expected to be visible to all threads is actually made visible to them, rather than sitting in a processor cache or register where other threads cannot see it.
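For completeness, a tiny sketch of the atomic directive on a hypothetical counter; it only covers a single scalar update, so it cannot replace the critical section around the map insertion:
#include <omp.h>

int main()
{
    int hits = 0; // hypothetical shared counter
    #pragma omp parallel for
    for (int i = 0; i < 100000; ++i)
    {
        #pragma omp atomic   // the single update below is performed atomically
        hits += 1;
    }
    return hits == 100000 ? 0 : 1;
}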