I am attempting to parallelize a for loop that runs within a genetic algorithm using OpenMP, and I am encountering a segfault; I'm assuming it's a thread-safety issue.
What is unclear to me (and perhaps this reflects a gap in my knowledge of C++ threading) is that, as far as I can see, there should not be any cross-talk between variables.
For reference, here is the loop that I am parallelizing:
void GA::evaluate(double cfgNRG, double cfgNA, double cfgAC)
{
    // Evaluate individuals in the population:
    #pragma omp parallel num_threads(3)
    {
        #pragma omp for
        for(unsigned int indv = 0; indv < population_.size(); ++indv)
        {
            std::cout << "Individual [" << indv << "]" << std::endl;
            // Retrieve the individual:
            Genome& genome = population_[indv];
            // Have we already evaluated this individual?
            if(genome.is_evaluated()) {
                continue;
            }
            // Evaluate individual:
            {
                GA::SimulationResults results = evaluate(genome, cfgNRG, cfgNA, cfgAC);
                genome.set_trace(results.first);
                genome.set_fitness(results.second);
            }
        }
    }
    // Sort the population:
    sort_population();
}
The issue comes from within the internal evaluate function. However, the only variable acted upon is the genome reference pulled out of the population_ vector. I had thought that acting upon a single element (one that does not interact with anything else until the end of the for loop) would be thread-safe, and yet I receive the segfault. If I put the evaluate call inside a critical section, the program works as normal (and the program also works just fine without parallelization).
My one thought was that the threads were not being joined at the end of the loop; however, according to the documentation, an implicit join should occur at the closing brace of the parallel region.
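For reference, the critical-section workaround mentioned above looks roughly like this. It is only a sketch based on the snippet above, and it serializes the evaluations, so the parallel speedup is lost:

// Inside the loop body, replacing the evaluation block:
#pragma omp critical
{
    GA::SimulationResults results = evaluate(genome, cfgNRG, cfgNA, cfgAC);
    genome.set_trace(results.first);
    genome.set_fitness(results.second);
}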
I was trying to write some code that allows me to observe reordering of memory operations.
In the following example I expected that on some executions of set_values() the order of the assignments could change. In particular, notification = 1 might be performed before the rest of the operations, but that doesn't happen even after thousands of iterations.
I've compiled the code with -O3 optimization.
Here is the YouTube material I'm referring to: https://youtu.be/qlkMbxUbKfw?t=200
#include <iostream>
#include <thread>

int a{0};
int b{0};
int c{0};
int notification{0};

void set_values()
{
    a = 1;
    b = 2;
    c = 3;
    notification = 1;
}

void calculate()
{
    while(notification != 1);
    a += b + c;
}

void reset()
{
    a = 0;
    b = 0;
    c = 0;
    notification = 0;
}

int main()
{
    a = 6; // just to allow the first iteration
    for(int i = 0; a == 6; i++)
    {
        reset();
        std::thread t1(calculate);
        std::thread t2(set_values);
        t1.join();
        t2.join();
        std::cout << "Iteration: " << i << ", " "a = " << a << std::endl;
    }
    return 0;
}
Now the program is stuck in an infinite loop. I expected that on some iterations the order of the instructions in set_values() could change (due to compiler or cache optimizations). For example, notification = 1 would be executed before c = 3, which would let calculate() run and give a == 3; since that no longer satisfies the a == 6 loop condition, the loop would terminate and prove the reordering.
Or maybe someone can provide another trivial example of code that helps observe reordering of memory operations?
The compiler can indeed reorder your assignments in the function set_values. However, it is not required to do so. In this case it has no reason to reorder anything, since you are assigning constants to all four variables.
Now the program is stuck in an infinite loop.
This is probably because while(notification != 1); gets optimized into an infinite loop: the compiler sees no write to notification in that thread and no synchronization, so it is allowed to assume the value never changes.
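In other words, the generated code may behave as if the loop had been written like this (a sketch of what the optimizer is allowed to do, not actual compiler output):

// The load of notification can be hoisted out of the loop, since nothing in
// this thread writes it and there is no synchronization:
if (notification != 1)
{
    while (true) { }    // never re-reads notification
}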
With a bit of work, we can find a way to make the compiler reorder the assignment notification = 1 before the other statements; see https://godbolt.org/z/GY-pAw.
Notice that the program reads x from standard input; this is done to force the compiler to read from a memory location.
I've also made the variable notification volatile, so that while(notification != 1); doesn't get optimised away.
You can try this example on your machine; I've been able to consistently fail the assertion using g++ 9.2 with -O3, running on an Intel Sandy Bridge CPU.
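Here is a sketch along the same lines. This is not the exact code from the godbolt link, just the same idea, and the data race on a, b and c is intentional, as in the question:

#include <cassert>
#include <iostream>
#include <thread>

int x = 0;
int a = 0, b = 0, c = 0;
volatile int notification = 0;

void set_values()
{
    // The values depend on x, which the compiler cannot know at compile time.
    a = x + 1;
    b = x + 2;
    c = x + 3;
    notification = 1;   // the compiler may move this store ahead of the ones above
}

void calculate()
{
    while (notification != 1)
        ;               // not optimised away, because notification is volatile
    // May fire if the store to notification was reordered before a, b or c:
    assert(a == x + 1 && b == x + 2 && c == x + 3);
}

int main()
{
    std::cin >> x;      // force a genuine memory read
    std::thread t1(calculate);
    std::thread t2(set_values);
    t1.join();
    t2.join();
    return 0;
}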
Be aware that the CPU itself can reorder instructions if they are independent of each other; see https://en.wikipedia.org/wiki/Out-of-order_execution. This is, however, a bit tricky to test and reproduce consistently.
Your compiler optimizes in unexpected ways but is allowed to do so because you are violating a fundamental rule of the C++ memory model.
You cannot access a memory location from multiple threads if at least one of them is a writer.
To synchronize, either use a std::mutex or use std::atomic<int> instead of int for your variables.
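As a minimal sketch of the second option: making notification a std::atomic<int> (with the default sequentially consistent ordering) is enough to order the handshake in the example above.

#include <atomic>
#include <thread>

int a = 0, b = 0, c = 0;
std::atomic<int> notification{0};

void set_values()
{
    a = 1;
    b = 2;
    c = 3;
    notification.store(1);          // the plain stores above cannot be moved past this
}

void calculate()
{
    while (notification.load() != 1)
        ;                           // this load synchronizes with the store above
    a += b + c;                     // a, b, c are now guaranteed to be 1, 2, 3
}

int main()
{
    std::thread t1(calculate);
    std::thread t2(set_values);
    t1.join();
    t2.join();
    return 0;
}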
I don't have experience with OpenMP in C++ and I would like to learn how to solve my problem properly. I have 30 files that need to be processed independently by the same function. Each time the function is called, a new output file (out01.txt to out30.txt) is generated to save the results. I have 12 processors in my machine and would like to use 10 for this problem.
I need my code to wait until all 30 files have been processed before executing the other routines. At the moment, I'm not able to force my code to wait for the whole OpenMP block to finish before moving on to the second function.
Please find below a draft of my code.
int W = 10;
int i = 1;
ostringstream fileName;
int th_id, nthreads;

omp_set_num_threads(W);
#pragma omp parallel shared(nFiles) private(i, fileName, th_id)
{
    #pragma omp for schedule(static)
    for ( i = 1; i <= nFiles; i++)
    {
        th_id = omp_get_thread_num();
        cout << "Th_id: " << th_id << endl;
        // CALCULATION IS PERFORMED HERE FOR EACH FILE
    }
}
// THIS is the point where the program should wait for the whole block to be finished
// Calling the second function ...
Both the "omp for" and "omp parallel" pragmas have an implicit barrier at the end of its scope. Therefore, the code after the parallel section can't be executed until the parallel section has concluded. So your code should run perfectly.
If there is still a problem, then it isn't because your code isn't waiting at the end of the parallel region.
Please give us more details about what happens during execution of this code; that way we might be able to find the real cause of your problem.
I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>

mpz_class collatz(mpz_class n) {
    if (mpz_odd_p(n.get_mpz_t())) {
        n *= 3;
        n += 1;
    } else {
        n /= 2;
    }
    return n;
}

int main() {
    mpz_class x = 1;
    #pragma omp parallel
    while (true) {
        //std::cout << x.get_str(10);
        while (true) {
            if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
            x = collatz(x);
        }
        x++;
        //std::cout << " OK" << std::endl;
    }
}
Given that I do not get this error when I uncomment the outputs to the screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is check whether the Collatz conjecture holds for all numbers. The program you posted is wrong on many levels, both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway: one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:
void check_collatz(mpz_class n) {
    while (n != 1) {
        n = collatz(n);
    }
}

int main() {
    mpz_class x = 1;
    while (true) {
        std::cout << x.get_str(10);
        check_collatz(x);
        x++;
    }
}
The while (true) loop is awkward to reason about and to parallelize, so let's just make an equivalent for loop:
for (mpz_class x = 1;; x++) {
    check_collatz(x);
}
Now we can talk about parallelizing the code. The basis of OpenMP parallelization is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately, you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
    check_collatz(check);
}
Note that check is implicitly private: there is a copy for each thread working on it. Also, OpenMP will take care of distributing the work [1 ... 2^63] among the threads. When a thread calls check_collatz, a new, private mpz_class object is created for it.
Now, you might notice that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can instead reuse a single thread-private mpz_class working object (by breaking check_collatz apart again). For this, you split the combined parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>

// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
    if (mpz_odd_p(n.get_mpz_t()))
    {
        n *= 3;
        n += 1;
    }
    else
    {
        n /= 2;
    }
}

int main()
{
    #pragma omp parallel
    {
        mpz_class x;
        #pragma omp for
        for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
        {
            // Note: The structure of this fits perfectly in a for loop.
            for (x = check; x != 1; collatz(x));
        }
    }
}
Note that declaring x in the parallel region will make sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private. The latter will often lead to confusion, because explicitly private variables from an outside scope are uninitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
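For instance, a sketch of the loop above with dynamic scheduling (the chunk size of 100000 is an arbitrary choice):

#pragma omp for schedule(dynamic, 100000)
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
    // Threads grab a new chunk of 100000 numbers whenever they finish one.
    for (x = check; x != 1; collatz(x));
}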
Edit: For the sake of academic interest, here is one idea of how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
    // Note this is not a "parallel" loop:
    // these are just separate loops over distinct strided sequences.
    int nthreads = omp_get_num_threads();

    mpz_class check = 1;
    // we already checked those in the other program
    check += std::numeric_limits<long>::max();
    check += omp_get_thread_num();

    mpz_class x;
    for (; ; check += nthreads)
    {
        // Note: The structure of this fits perfectly in a for loop.
        for (x = check; x != 1; collatz(x));
    }
}
You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets its own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are shared, so there is one instance shared between all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see the same value of x without requiring mutexes or other synchronization techniques.
I'm new to OpenMP and, from what I have read about OpenMP 2.0 (which comes standard with Microsoft Visual Studio 2010), global variables are considered troublesome and error-prone when used in parallel programming. I have also come to share that feeling, since I have found very little on how to deal with global variables and static global variables efficiently, or at all for that matter.
I have this snippet of code which runs, but because of the local variable created in the parallel block I don't get the answer I'm looking for: I get 8 different printouts (because that's how many threads I have on my PC) instead of 1 answer. I know that it's because of the local variable list created in the parallel block, but this code will not run if I move the list variable out and make it a global variable. Actually, the code does run, but it never gives me an answer back. This is the sample code that I would like to modify to use a global list variable:
#pragma omp parallel
{
    vector<int> list;
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        list.push_back(i);
    }
    cout << list.size() << endl;
}
Output:
6250
6250
6250
6250
6250
6250
6250
6250
They add up to 50000, but I did not get the single answer of 50000; instead it's divided up.
Solution:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        cout << i << endl;
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
According to the MSDN Documentation the parallel clause
Defines a parallel region, which is code that will be executed by
multiple threads in parallel.
And since the list variable is declared inside this section every thread will have its own list.
On the other hand, the for pragma
Causes the work done in a for loop inside a parallel region to be
divided among threads.
So the 50000 iterations will be split among threads but each thread will have its own list.
I think what you are trying to do can be achieved by:
Taking the list definition outside the "parallel" section.
Protecting the list.push_back statement with a critical section.
Try this:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
I don't think you should get any speedup from OpenMP in this case, because there will be contention for the critical section. A faster solution (if you don't care about the order of elements) would be for every thread to have its own list, and to merge those lists after the loop finishes, as shown in the sketch below. The implementation using std::list instead of std::vector would look cleaner in this case (because you wouldn't have to copy arrays).
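A sketch of that idea, keeping std::vector for consistency with the question (each thread fills a private vector and merges it into the shared one once, instead of locking on every push_back):

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<int> list;

    #pragma omp parallel
    {
        vector<int> local;                  // one private vector per thread

        #pragma omp for nowait
        for(int i = 0; i < 50000; i++)
        {
            local.push_back(i);             // no locking needed here
        }

        #pragma omp critical
        list.insert(list.end(), local.begin(), local.end());  // merge once per thread
    }

    cout << list.size() << endl;            // 50000, element order not guaranteed
    return 0;
}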
Some apps are memory bound and not compute bound. Bottom line: check if you actually get a speedup from OpenMP.
Why do you need the first pragma here (#pragma omp parallel)? I think that's the issue.
I'm doing some time trials on my code, and logically it seems really easy to parallelize with OpenMP as each trial is independent of the others. As it stands, my code looks something like this:
for(int size = 30; size < 50; ++size) {
    #pragma omp parallel for
    for(int trial = 0; trial < 8; ++trial) {
        time_t start, end;
        //initializations
        time(&start);
        //perform computation
        time(&end);
        output << size << "\t" << difftime(end,start) << endl;
    }
    output << endl;
}
I have a sneaking suspicion that this is kind of a faux pas, however, as two threads may simultaneously write values to the output, thus screwing up the formatting. Is this a problem, and if so, will surrounding the output << size << ... code with a #pragma omp critical statement fix it?
Never mind whether your output will be screwed up (it likely will). Unless you're really careful to assign your OpenMP threads to different processors that don't share resources like memory bandwidth, your time trials aren't very meaningful either. Different runs will be interfering with each other.
The solution to the problem you're asking about is to write the result times into designated elements of an array, with one slot for each trial, and output the results after the fact.
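A sketch of that, based on the loop in the question (the fixed-size trials array is the only addition):

for(int size = 30; size < 50; ++size) {
    double trials[8];                           // one slot per trial
    #pragma omp parallel for
    for(int trial = 0; trial < 8; ++trial) {
        time_t start, end;
        //initializations
        time(&start);
        //perform computation
        time(&end);
        trials[trial] = difftime(end, start);   // each thread writes only its own slot
    }
    // implicit barrier: all trials have finished before we print
    for(int trial = 0; trial < 8; ++trial) {
        output << size << "\t" << trials[trial] << endl;
    }
    output << endl;
}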
As long as you don't mind the individual lines being out of order, you'll be fine. OpenMP should make sure a whole line is printed at a time.
However, you will need to declare start and end as private in the pragma; otherwise the threads will overwrite them and mess up your timings.
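A sketch of that suggestion, assuming start and end are declared before the loop rather than inside it:

time_t start, end;
#pragma omp parallel for private(start, end)
for(int trial = 0; trial < 8; ++trial) {
    //initializations
    time(&start);        // each thread works on its own private copies
    //perform computation
    time(&end);
    output << size << "\t" << difftime(end,start) << endl;
}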