How to allocate dynamic memory in OpenMP C++

I tried to parallelize a function that allocates memory, but I got a bad-heap exception. The memory must have been used by several threads at the same time.
void GetDoubleParameters( CInd *ci )
{
    for(int i=0; i<ci->size(); i++)
    {
        void *tmp;
        #pragma omp parallel private(tmp)
        {
            for(int j=0; j<ci[i].getValues().size(); j++)
            {
                tmp = (void*)new double(ci[i].getValues()[j]);
                ci->getParameters().push_back(tmp);
            }
        }
    }
}

The problem is the line:
ci->getParameters().push_back(tmp);
ci is accessed by all parallel threads at once, and its parameters member, mutated via push_back (probably on a std::vector), is almost certainly not thread-safe.
You will have to organize some guards around this code. Something like:
omp_lock_t my_lock;

// initialize lock
omp_init_lock (&my_lock);
...
#pragma omp parallel
{
    // do something sensible in parallel
    ...
    {
        omp_guard my_guard (my_lock);
        // protected region starts here
        // only one thread at a time works here
    }
    // more parallel work
    ...
}
omp_destroy_lock (&my_lock);

Related

parallelizing (openmp) a function call causes memory error

I'm calling a class member function in parallel using OpenMP. It works fine in serial, or when the my_func part is put in a critical section (which is apparently slow). Running in parallel, however, keeps giving the following error message:
malloc: *** error for object 0x7f961dcef750: pointer being freed was
not allocated
I think the problem is with the new operators in my_func, i.e., the pointer myDyna seems to be shared among threads. My questions are:
1) Isn't everything inside the parallel region private, including all the pointers in my_func? Meaning that each thread should have its own copy of myDyna, so why did the error occur?
2) Without changing my_func too much, what can be done to make the parallelization work at the main level? For example, would it work to add a copy constructor to myDyna?
void my_func(){
    Variables *theVars = new Variables();
    // trDynaPS constructor for the class.
    trDynaPS *myDyna = new trDynaPS(theVars, starttime, stepsize, stoptime);
    ResultRate = myDyna->trDynaPS_Drive(1, 1);
    if (theVars != nullptr) {
        myDyna->theVars = nullptr;
        delete theVars;
    }
    delete myDyna;
}
int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        // I have multiple copies of myclass as a vector
        myclass[i]->run(); // my_func is in the run operation
    }
    return 0;
}

OpenMP: how to realize thread local object in task?

What I am trying to do is to iterate over all elements of a container in parallel comparable to #pragma omp for; however the container in question does not offer a random access iterator. I therefore use a workaround via tasks described in this stackoverflow answer:
for (std::map<A,B>::iterator it = my_map.begin();
     it != my_map.end();
     ++it)
{
    #pragma omp task
    { /* do work with it */ }
}
My problem is that a 'scratch space' object is needed in each iteration; this object is expensive to construct or copy into the data environment of the task. It would only be necessary to have a single thread local object for each thread; in the sense that each task uses the object of the thread it is executed on. private requires a copy and shared results in a race condition. Is there a way to realize this with OpenMP?
I researched #pragma omp threadprivate, however the objects are not static, as the structure of the program looks something like this:
method(int argument_for_scratch_object){
    #pragma omp parallel
    {
        Object scratch(argument_for_scratch_object);
        // some computations are done here...
        #pragma omp single nowait
        {
            // here goes the for loop creating the tasks above
            // each task uses the scratch space object
        }
    }
}
If scratch was declared static (and then made threadprivate) before the parallel region, it would be initialized with argument_for_scratch_object of the first method call; which might not be correct for the subsequent method calls.
According to your update, I would suggest using a global/static threadprivate pointer and then initializing it in each thread within your parallel section.
static Object* scratch_ptr;
#pragma omp threadprivate(scratch_ptr)

void method(int argument_for_scratch_object)
{
    #pragma omp parallel
    {
        scratch_ptr = new Object(argument_for_scratch_object);
        ...
        delete scratch_ptr;
    }
}

Is the vector reference operator[] threadsafe for writing?

Let's say I have the following code snippet.
// Some function declaration
void generateOutput(const MyObj1& in, MyObj2& out);

void doTask(const std::vector<MyObj1>& input, std::vector<MyObj2>& output) {
    output.resize(input.size());
    // Use OpenMP to run in parallel
    #pragma omp parallel for
    for (size_t i = 0; i < input.size(); ++i) {
        generateOutput(input[i], output[i]);
    }
}
Is the above threadsafe?
I am mainly concerned about writing to output[i].
Do I need some sort of locking? Or is it unnecessary?
ex:
// Some function prototype
void generateOutput(const MyObj1& in, MyObj2& out);

void doTask(const std::vector<MyObj1>& input, std::vector<MyObj2>& output) {
    output.resize(input.size());
    // Use OpenMP to run in parallel
    #pragma omp parallel for
    for (size_t i = 0; i < input.size(); ++i) {
        MyObj2 tmpOutput;
        generateOutput(input[i], tmpOutput);
        #pragma omp critical
        output[i] = std::move(tmpOutput);
    }
}
I am not worried about the reading portion. As mentioned in this answer, it looks like reading input[i] is threadsafe.
output[i] does not write to output. This is just a call to std::vector<MyObj2>::operator[]. It returns an unnamed MyObj2&, which is then used to call generateOutput. The latter is where the write happens.
I'll assume that generateOutput is threadsafe itself, and MyObj2 too, since we don't have code for that. So the write to MyObj2& inside generateOutput is also threadsafe.
As a result, all parts are threadsafe.
As long as it is guaranteed that the threads operate on completely separate items (i.e., no item is accessed by different threads without some kind of synchronization) this is safe.
Since you are using a simple parallel for loop, in which each item is accessed exactly once, this is safe.
To avoid making any assumptions about the implementation of std::vector, you can modify your code as below to make it threadsafe (pointer addresses will by definition point to different zones in memory and hence be thread-safe):
// Some function declaration
void generateOutput(const MyObj1& in, MyObj2 *out); // use raw data pointer for output

void doTask(const std::vector<MyObj1>& input, std::vector<MyObj2>& output) {
    output.resize(input.size());
    // Use OpenMP to run in parallel
    auto data = output.data(); // pointer to the vector's underlying data, taken outside of OMP threading
    #pragma omp parallel for
    for (size_t i = 0; i < input.size(); ++i) {
        generateOutput(input[i], &data[i]); // distinct data elements, i.e. addresses (indexed by i only, in each omp thread)
    }
}

Am I using this deque in a thread safe manner?

I'm trying to understand multithreading in C++. In the following bit of code, will the deque tempData declared in retrieve() always have every element processed once and only once, or could there be multiple copies of tempData across multiple threads with stale data, causing some elements to be processed multiple times? I'm not sure whether passing by reference actually ensures there is only one copy in this case.
static mutex m;

void AudioAnalyzer::analysisThread(deque<shared_ptr<AudioAnalysis>>& aq)
{
    while (true)
    {
        m.lock();
        if (aq.empty())
        {
            m.unlock();
            break;
        }
        auto aa = aq.front();
        aq.pop_front();
        m.unlock();
        if (false) // testing
        {
            retrieveFromDb(aa);
        }
        else
        {
            analyzeAudio(aa);
        }
    }
}

void AudioAnalyzer::retrieve()
{
    deque<shared_ptr<AudioAnalysis>> tempData(data);
    vector<future<void>> futures;
    for (int i = 0; i < NUM_THREADS; ++i)
    {
        futures.push_back(async(bind(&AudioAnalyzer::analysisThread, this, _1), ref(tempData)));
    }
    for (auto& f : futures)
    {
        f.get();
    }
}
Looks OK to me.
Threads share memory, and if the reference to tempData turns up as a pointer in the thread, then every thread sees exactly the same pointer value and the same single copy of tempData. [You can check that if you like with a bit of global code or some logging.]
Then the mutex ensures single-threaded access to the deque, at least inside the worker threads.
One problem: somewhere there must be a push onto the deque, and that may need to be locked by the mutex as well. [Obviously the push_back onto the futures queue is just local.]

Elegantly initializing openmp threads in parallel for loop

I have a for loop that uses a (somewhat complicated) counter object sp_ct to initialize an array. The serial code looks like
sp_ct.depos(0);
for(int p=0; p<size; p++, sp_ct.increment()) {
    in[p] = sp_ct.parable_at_basis();
}
My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code-fragment:
int firstloop=-1;
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p=0; p<size; p++) {
    if( firstloop == -1 ) {
        sp_ct.depos(p); firstloop=0;
    } else {
        sp_ct.increment();
    }
    in[p] = sp_ct.parable_at_basis();
} // end omp parallel for
I dislike this because of the clutter that obscures what is really going on, and because it has an unnecessary branch inside the loop (Yes, I know that this is likely to not have a measurable influence on running time because it is so predictable...).
I would prefer to write something like
#pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0; p<size; p++) {
    #pragma omp initialize // or something like it
    { sp_ct.depos(p); }
    in[p] = sp_ct.parable_at_basis();
    sp_ct.increment();
} // end omp parallel for
Is this possible?
If I generalize your problem, the question is "How do I execute some initialization code for each thread of a parallel section?", is that right? You may use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".
struct thread_initializer
{
    explicit thread_initializer(
        int size /*initialization params*/) : size_(size) {}

    // Copy constructor that does the init
    thread_initializer(thread_initializer& _it) : size_(_it.size_)
    {
        // Here goes the once-per-thread initialization
        for(int p=0; p<size_; p++)
            sp_ct.depos(p);
    }

    int size_;
    scp_type sp_ct;
};
Then the loop may be written:
thread_initializer init(size);
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(init)
for(int p=0; p<size; p++) {
    init.sp_ct.increment();
    in[p] = init.sp_ct.parable_at_basis();
}
The bad things are that you have to write this extra initializer and that some code is moved away from its actual execution point. The good things are that you can reuse it and that the loop syntax stays cleaner.
From what I can tell, you can do this by manually defining the chunks. This looks somewhat like what I was trying to do with induction in Induction with OpenMP: getting range values for a parallized for loop in OpenMP.
So you probably want something like this:
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int start = ithread*size/nthreads;
    const int finish = (ithread+1)*size/nthreads;
    Counter_class_name sp_ct;
    sp_ct.depos(start);
    for(int p=start; p<finish; p++, sp_ct.increment()) {
        in[p] = sp_ct.parable_at_basis();
    }
}
Notice that except for some declarations and changing the range values this code is almost identical to the serial code.
Also you don't have to declare anything shared or private. Everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and clearer (IMHO).
I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!
sp_ct.depos(0);
in[0] = sp_ct.parable_at_basis();
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(sp_ct)
for(int p = 1; p < size; p++) {
    sp_ct.increment();
    in[p] = sp_ct.parable_at_basis();
} // end omp parallel for
Riko, implement sp_ct.depos() so that it invokes .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:
sp_ct.depos(0);
#pragma omp parallel for \
    default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0; p<size; p++) {
    sp_ct.depos(p);
    in[p] = sp_ct.parable_at_basis();
} // end omp parallel for
This solution has one additional benefit: your implementation only works if each thread receives exactly one chunk out of 0 - size, which is the case when specifying schedule(static) and omitting the chunk size (OpenMP 4.0 Specification, chapter 2.7.1, page 57). Since you did not specify a schedule, the schedule used will be implementation dependent (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses dynamic or guided, threads will receive multiple chunks with gaps between them; one thread could receive chunk 0-20 and then chunk 70-90, which would make p and sp_ct fall out of sync on the second chunk. The solution above is compatible with all schedules.