Is the vector reference operator[] thread-safe for writing? - c++

Let's say I have the following code snippet.
// Some function declaration
void generateOutput(const MyObj1& in, MyObj2& out);

void doTask(const std::vector<MyObj1>& input, std::vector<MyObj2>& output) {
    output.resize(input.size());

    // Use OpenMP to run in parallel
    #pragma omp parallel for
    for (size_t i = 0; i < input.size(); ++i) {
        generateOutput(input[i], output[i]);
    }
}
Is the above thread-safe?
I am mainly concerned about writing to output[i].
Do I need some sort of locking? Or is it unnecessary?
ex:
// Some function prototype
void generateOutput(const MyObj1& in, MyObj2& out);

void doTask(const std::vector<MyObj1>& input, std::vector<MyObj2>& output) {
    output.resize(input.size());

    // Use OpenMP to run in parallel
    #pragma omp parallel for
    for (size_t i = 0; i < input.size(); ++i) {
        MyObj2 tmpOutput;
        generateOutput(input[i], tmpOutput);

        #pragma omp critical
        output[i] = std::move(tmpOutput);
    }
}
I am not worried about the reading portion. As mentioned in this answer, it looks like reading input[i] is thread-safe.

output[i] does not write to output. This is just a call to std::vector<MyObj2>::operator[]. It returns an unnamed MyObj2&, which is then used to call generateOutput. The latter is where the write happens.
I'll assume that generateOutput itself is thread-safe, and MyObj2 as well, since we don't have the code for either. So the write to the MyObj2& inside generateOutput is also thread-safe.
As a result, all parts are thread-safe.

As long as it is guaranteed that the threads operate on completely separate items (i.e., no item is accessed by different threads without some kind of synchronization) this is safe.
Since you are using a simple parallel for loop, in which each item is accessed exactly once, this is safe.
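For contrast, this guarantee relies on output being resized up front so that each thread only writes to a distinct, already-existing element. Something like the following sketch would not be safe, because push_back mutates the vector object itself (its size, capacity, and possibly its storage) from several threads at once:
// NOT safe (sketch, reusing the question's names): concurrent push_back races on the vector.
#pragma omp parallel for
for (size_t i = 0; i < input.size(); ++i) {
    MyObj2 tmp;
    generateOutput(input[i], tmp);
    output.push_back(std::move(tmp));  // data race on output's size/capacity/storage
}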

To avoid making any assumptions about the implementation of std::vector, you can modify your code as below to make it thread-safe (the pointer addresses will, by definition, refer to distinct locations in memory, so writing through them from different threads is safe):
// Some function declaration
void generateOutput(const MyObj1& in, MyObj2 *out); // use a raw pointer for the output

void doTask(const std::vector<MyObj1>& input, std::vector<MyObj2>& output) {
    output.resize(input.size());

    // Pointer to the vector's underlying data, taken outside the OMP threading
    auto data = output.data();

    // Use OpenMP to run in parallel
    #pragma omp parallel for
    for (size_t i = 0; i < input.size(); ++i) {
        generateOutput(input[i], &data[i]); // each thread accesses a distinct element (indexed by i only)
    }
}

Related

OpenMP thread initialization and de-initialization before doing work

I am using an API that needs to be started and stopped for every thread in which it is used. So if I want to do something with the API in a specific thread I have to call api_start() (and api_stop() afterwards).
Now I have a very trivial problem I can solve in parallel, which I want to try with OpenMP. Consider that the problem looks like this:
#pragma omp parallel for num_threads(NUM_THREADS), default(none)
for (auto i = 0; i < count; i++)
{
    api_process(i);
}
This will not work because the OpenMP worker threads have not called api_start() or api_stop(), so a working solution would be:
#pragma omp parallel for num_threads(NUM_THREADS), default(none)
for (auto i = 0; i < count; i++)
{
    api_start();
    api_process(i);
    api_stop();
}
But this solution introduces overhead, because each thread now calls api_start() and api_stop() multiple times (if NUM_THREADS < count).
So my question is: Is there a way in OpenMP to define a function to call for every created thread once on startup and once on deletion?
Thanks in advance!
You can call the functions manually at the beginning/end of the first/last iteration, respectively, or use something like std::call_once. However, this would add some overhead to each iteration (branching).
EDIT: Actually, this wouldn't work since only a single thread would call those functions. You would need to define some thread-local flags and check them in the iterations. Same downside.
A much better alternative is simply to split the parallel and for OpenMP constructs:
#pragma omp parallel
{
    api_start();

    #pragma omp for
    for (auto i = 0; i < count; i++)
    {
        api_process(i);
    }

    api_stop();
}
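One stylistic variation, shown here only as a sketch (api_start(), api_stop(), and api_process() are the question's functions, assumed to be declared elsewhere): wrap the start/stop pair in a small RAII guard, so the pairing is enforced by the guard's lifetime instead of by two separate calls at the top and bottom of the region.
#include <omp.h>

// Declarations of the question's API, assumed to exist.
void api_start();
void api_stop();
void api_process(int i);

// RAII guard: api_start() in the constructor, api_stop() in the destructor.
struct ApiSession {
    ApiSession()  { api_start(); }
    ~ApiSession() { api_stop(); }
    ApiSession(const ApiSession&) = delete;
    ApiSession& operator=(const ApiSession&) = delete;
};

void run(int count) {
    #pragma omp parallel
    {
        ApiSession session;   // one start/stop pair per OpenMP thread

        #pragma omp for
        for (int i = 0; i < count; i++)
        {
            api_process(i);
        }
    }   // session is destroyed here, calling api_stop() on each thread
}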

OpenMP critical section only if data race exists vs lock?

My question is somewhat similar to this one: How to use lock in OpenMP?
in the sense that their answers sort of answer my question but not well enough.
I'm trying to implement a simple work-stealing scheduler in OpenMP (from scratch).
Let's say I have an array of some object, say int. I have multiple threads which will manipulate the entries of this array, in no particular order. I would like to make sure that no two threads try to access the same element of the array at the same time. However, I am allowing the threads to access the same element, as long as the accesses are not simultaneous. Also, I am allowing the threads to access the array simultaneously, as long as each thread wishes to access a different entry of the array during this time. I could use a critical section, as in the following:
int array[1000];

#pragma omp parallel
{
    bool flag = true;
    while (flag) {
        int x = rand() % 1000;
        #pragma omp critical
        {
            array[x] = some_function(array[x]);
            if (some_condition(array[x])) {
                flag = false;
            }
        }
    }
}
This code creates some threads, and the threads randomly access and manipulate entries of the array until some stopping condition kills the thread. This code works fine, since the critical section ensures that no two threads will ever write to the array at the same time (in case they generated the same value of x). However, if at some time two threads do not happen to generate the same value of x, the critical section is redundant, as the threads are not accessing the same entry. Is there a way to make it so that a thread will stall if and only if the value of x it generated is the same as that of a thread that is currently also using x? Right now, this code is inefficient, and basically serial, even if every thread happens to generate a different value of x. I want to make them stall only if there is a collision.
Perhaps what I am looking for are locks, but I am not sure. Are critical sections not the right way to go here?
I meant something like this:
#include <stdlib.h>
#include <omp.h>

int main()
{
    int array[1000];
    omp_lock_t locks[1000];

    for (int i = 0; i < 1000; i++)
        omp_init_lock(&locks[i]);

    #pragma omp parallel
    {
        bool flag = true;
        while (flag) {
            int x = rand() % 1000;
            omp_set_lock(&locks[x]);
            array[x] = some_function(array[x]);
            if (some_condition(array[x])) {
                flag = false;
            }
            omp_unset_lock(&locks[x]);
        }
    }

    for (int i = 0; i < 1000; i++)
        omp_destroy_lock(&locks[i]);
}
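If stalling on a collision is also undesirable, one possible variant (a sketch reusing the locks, array, flag, and hypothetical some_function/some_condition from the code above) is omp_test_lock(), which returns immediately instead of blocking, so a thread that hits a contended entry simply draws a new x on the next pass of the while loop:
int x = rand() % 1000;
if (omp_test_lock(&locks[x])) {
    // Lock acquired: no other thread is working on entry x right now.
    array[x] = some_function(array[x]);
    if (some_condition(array[x])) {
        flag = false;
    }
    omp_unset_lock(&locks[x]);
}
// Lock not acquired: fall through and try another x on the next iteration.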

How to allocate dynamic memory in OpenMP C++

I tried to parallelize a function that allocates memory, but I got a bad-heap exception. Several threads must have been using the memory at the same time.
void GetDoubleParameters( CInd *ci )
{
    for (int i = 0; i < ci->size(); i++)
    {
        void *tmp;
        #pragma omp parallel private (tmp)
        {
            for (int j = 0; j < ci[i].getValues().size(); j++)
            {
                tmp = (void*)new double(ci[i].getValues()[j]);
                ci->getParameters().push_back(tmp);
            }
        }
    }
}
The problem is the line:
ci->getParameters().push_back(tmp);
ci is accessed by all parallel threads at once, and its parameters container, on which push_back is called (probably a std::vector), is most likely not thread-safe.
You will have to organize some guards around this code. Something like:
omp_lock_t my_lock;

// initialize the lock before the parallel region
omp_init_lock(&my_lock);

#pragma omp parallel
{
    // ... do something sensible in parallel ...
    {
        omp_guard my_guard(my_lock);
        // protected region starts here:
        // only one thread at a time works here
    }
    // ... more parallel work ...
}

omp_destroy_lock(&my_lock);
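The omp_guard used above is not part of OpenMP itself; it refers to a small RAII wrapper around omp_set_lock/omp_unset_lock. One possible definition, as a sketch (the class name is simply the one used above):
#include <omp.h>

// Minimal RAII wrapper: acquires the lock on construction, releases it on destruction.
class omp_guard {
public:
    explicit omp_guard(omp_lock_t& lock) : lock_(lock) { omp_set_lock(&lock_); }
    ~omp_guard() { omp_unset_lock(&lock_); }

    omp_guard(const omp_guard&) = delete;
    omp_guard& operator=(const omp_guard&) = delete;

private:
    omp_lock_t& lock_;
};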

Forking once and then using the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];

int main() {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            for (int t = 0; t < 1000; t++) {
                evolve();
            }
        }
    }
}

void evolve() {
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 100; i++) {
        do_stuff(i);
    }
}

void do_stuff(int i) {
    // expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way too much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work. I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get different behavior: only the thread which executes the single directive is used for the loop in evolve().
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];

int main() {
    #pragma omp parallel num_threads(2)
    {
        // Each thread calls evolve() a thousand times
        for (int t = 0; t < 1000; t++) {
            evolve();
        }
    }
}

void evolve() {
    // The orphaned construct inside evolve()
    // will bind to the innermost parallel region
    #pragma omp for
    for (int i = 0; i < 100; i++) {
        do_stuff(i);
    } // Implicit thread synchronization
}

void do_stuff(int i) {
    // expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
    #pragma omp parallel
    #pragma omp single
    bar();
}

void bar() {
    #pragma omp parallel
    printf("...");
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the actual OpenMP implementation used and the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal, if an implementation ignored the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn on and off nesting, if the implementation supports it.
In your case, things by chance went well, since you sent all threads to sleep except one. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern to create a parallel team, have one thread executing the call stacks, and then distribute work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution will be OpenMP tasks.
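For reference, a minimal sketch of that task-based pattern (assuming a compiler with OpenMP 4.5 taskloop support; on older compilers the tasks would have to be spelled out by hand): the single thread walking the call stack generates tasks, and the other threads of the team pick them up.
void evolve() {
    // Called from within the existing parallel/single region in main():
    // the loop iterations become tasks that any thread of the team can execute.
    // The implicit taskgroup makes the generating thread wait for them.
    #pragma omp taskloop
    for (int i = 0; i < 100; i++) {
        do_stuff(i);
    }
}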
Cheers,
-michael
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.

Elegantly initializing openmp threads in parallel for loop

I have a for loop that uses a (somewhat complicated) counter object sp_ct to initialize an array. The serial code looks like
sp_ct.depos(0);
for (int p = 0; p < size; p++, sp_ct.increment()) {
    in[p] = sp_ct.parable_at_basis();
}
My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code-fragment:
int firstloop = -1;
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for (int p = 0; p < size; p++) {
    if (firstloop == -1) {
        sp_ct.depos(p);
        firstloop = 0;
    } else {
        sp_ct.increment();
    }
    in[p] = sp_ct.parable_at_basis();
} // end omp parallel for
I dislike this because of the clutter that obscures what is really going on, and because it has an unnecessary branch inside the loop (Yes, I know that this is likely to not have a measurable influence on running time because it is so predictable...).
I would prefer to write something like
#pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct)
for (int p = 0; p < size; p++) {
    #pragma omp initialize // or something
    { sp_ct.depos(p); }
    in[p] = sp_ct.parable_at_basis();
    sp_ct.increment();
} // end omp parallel for
Is this possible?
If I generalize your problem, the question is "How do I execute some initialization code for each thread of a parallel section?", is that right? You can use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".
struct thread_initializer
{
    explicit thread_initializer(
        int size /*initialization params*/) : size_(size) {}

    // Copy constructor that does the per-thread init
    thread_initializer(const thread_initializer& _it) : size_(_it.size_)
    {
        // Here goes the once-per-thread initialization
        for (int p = 0; p < size_; p++)
            sp_ct.depos(p);
    }

    int size_;
    scp_type sp_ct;
};
Then the loop may be written:
thread_initializer init(size);
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(init)
for (int p = 0; p < size; p++) {
    in[p] = init.sp_ct.parable_at_basis();
    init.sp_ct.increment();
}
The bad things are that you have to write this extra initializer and that some code is moved away from its actual execution point. The good thing is that you can reuse it, as well as the cleaner loop syntax.
From what I can tell, you can do this by manually defining the chunks. This looks somewhat like something I was trying to do with induction in OpenMP: Induction with OpenMP: getting range values for a parallelized for loop in OpenMP.
So you probably want something like this:
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int start  = ithread * size / nthreads;
    const int finish = (ithread + 1) * size / nthreads;

    Counter_class_name sp_ct;
    sp_ct.depos(start);
    for (int p = start; p < finish; p++, sp_ct.increment()) {
        in[p] = sp_ct.parable_at_basis();
    }
}
Notice that except for some declarations and changing the range values this code is almost identical to the serial code.
Also, you don't have to declare anything shared or private. Everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and clearer (IMHO).
I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!
sp_ct.depos(0);
in[0] = sp_ct.parable_at_basis();

#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(sp_ct)
for (int p = 1; p < size; p++) {
    sp_ct.increment();
    in[p] = sp_ct.parable_at_basis();
} // end omp parallel for
Riko, implement sp_ct.depos(), so it will invoke .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:
sp_ct.depos(0);
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(sp_ct)
for (int p = 0; p < size; p++) {
    sp_ct.depos(p);
    in[p] = sp_ct.parable_at_basis();
} // end omp parallel for
This solution has one additional benefit: your implementation only works if each thread receives exactly one contiguous chunk of 0 - size, which is the case when specifying schedule(static) and omitting the chunk size (OpenMP 4.0 Specification, chapter 2.7.1, page 57). But since you did not specify a schedule, the schedule used will be implementation defined (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses dynamic or guided, threads will receive multiple chunks with gaps between them, so one thread could receive chunk 0-20 and then chunk 70-90, which would make p and sp_ct out of sync on the second chunk. The solution above is compatible with all schedules.
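For completeness, a hypothetical sketch of what such a depos() could look like, assuming the counter keeps its current position in a member (pos_ and reset() are invented names, not taken from the question) and that increment() is only ever driven through depos():
// Inside the counter class (sketch).
void depos(int target) {
    if (target < pos_) {
        reset();          // assumed: restores the initial (zero) state
        pos_ = 0;
    }
    while (pos_ < target) {
        increment();      // advance one step at a time
        ++pos_;
    }
}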