Data race with OpenMP and Eigen but data is read-only? - c++

I am trying to find the data race in my code but I just can't seem to grasp why it happens. The data in the threads is used read-only and the only variable that is written to is protected by a critical region.
I tried using the Intel Inspector but I am compiling with g++ 9.3.0 and apparently even the 2021 version can't deal with the OpenMP implementation for it. The release notes do not explicitly state it as exception as it was for older versions but there is a warning about false positives because it is not supported. It also always shows a data race for the pragma statements which isn't helpful at all.
My current suspects are either Eigen or the fact that I use a reference to a std::vector. Eigen itself I compile with EIGEN_DONT_PARALLELIZE to not mess with nested parallelism although I think I don't use anything that would use it anyway.
Edit:
I am not sure whether it is really a "data race" (or a wrong memory access?), but the example produces non-deterministic output: the result differs between runs for the same input. When this happens, the loop in main breaks. With more than one thread this happens early (usually after 5-12 iterations). If I run it with only one thread or compile without OpenMP, I have to end the example program manually.
Minimal (not) working example below.
#include <Eigen/Dense>
#include <vector>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_set_num_threads(number)
#endif
typedef Eigen::Matrix<double, 9, 1> Vector9d;
typedef std::vector<Vector9d, Eigen::aligned_allocator<Vector9d>> Vector9dList;
Vector9d derivPath(const Vector9dList& pathPositions, int index){
int n = pathPositions.size()-1;
if(index >= 0 && index < n+1){
// path is one point, no derivative possible
if(n == 0){
return Vector9d::Zero();
}
else if(index == n){
return Vector9d::Zero();
}
// path is a line, derivative is in the direction of start to end
else {
return n * (pathPositions[index+1] - pathPositions[index]);
}
}
else{
return Vector9d::Zero();
}
}
// ********************************
// data race occurs here somewhere
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
#pragma omp parallel default(none) shared(pathPositions, err, n)
{
double err_private = 0;
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
// when I replace this with pathPositions[i][0] the loop in the main doesn't break
// (or at least I always had to manually end the program)
// but it does break if I use derivX_i[0];
double err_i = derivX_i.norm();
err_private = err_private + err_i;
}
#pragma omp critical
{
err += err_private;
}
}
err = err / static_cast<double>(n);
return err;
}
// ***************************************
int main(int argc, char **argv){
// setup data
int n = 100;
Vector9dList pathPositions;
pathPositions.reserve(n+1);
double a = 5.0;
double b = 1.0;
double c = 1.0;
Eigen::Vector3d f, u;
f << 0, 0, -1;//-p;
u << 0, 1, 0;
for(int i = 0; i<n+1; ++i){
double t = static_cast<double>(i)/static_cast<double>(n);
Eigen::Vector3d p;
double x = 2*t*a - a;
double z = -b/(a*a) * x*x + b + c;
p << x, 0, z;
Vector9d cam;
cam << p, f, u;
pathPositions.push_back(cam);
}
omp_set_num_threads(8);
//reference value
double pe = errorFunc(pathPositions);
int i = 0;
do{
double pe_i = errorFunc(pathPositions);
// there is a data race
if(std::abs(pe-pe_i) > std::numeric_limits<double>::epsilon()){
std::cout << "Difference detected at iteration " << i << " diff:" << std::abs(pe-pe_i);
break;
}
i++;
}
while(true);
}
Output for running the example multiple times
Difference detected at iteration 13 diff:1.77636e-15
Difference detected at iteration 1 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 7 diff:1.77636e-15
Difference detected at iteration 8 diff:1.77636e-15
Difference detected at iteration 6 diff:1.77636e-15
As you can see, the difference is minor but there and it doesn't always happen in the same iteration which makes it non-deterministic. There is no output if I run it single threaded as I usually end the program after letting it run for a couple of minutes. Therefore, it has to have to do with the parallelization somehow.
I know I could use a reduction in this case but in the original code in my project I have to compute other things in the parallel region as well and I wanted to keep the minimal example as close to the original structure as possible.
I use OpenMP in other parts of my program too, and I am not sure whether I have a data race there as well, but the structure is similar (except that I use #pragma omp parallel for and the collapse clause). I have some variable or vector I write to, but it's always either in a critical region or each thread only writes to its own subset of the vector. Data that is used by multiple threads is always read-only. The read-only data is always a std::vector, a reference to a std::vector or a numerical data type like int or double. The vectors always contain an Eigen type or double.

There are no race conditions. You are observing a natural consequence of the non-associative arithmetic of finite-precision floating-point numbers: (A + B) + C is not always the same as A + (B + C) when A, B, and C are floating-point numbers, because every intermediate result is rounded. 1.77636E-15 x 100 (the absolute error when commenting out err = err / static_cast<double>(n);) in binary is:
0 | 01010101 | 00000000000000000001100
S exponent mantissa
As you can see, the error is in the least significant bits of the mantissa, hinting at it being the result of accumulation of rounding errors.
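A minimal sketch of that effect, using only the standard library (the helper names are illustrative and not part of the question's code): the same three doubles are summed with two different groupings, much as thread sub-sums are added in a different order from run to run.

```cpp
#include <cstdio>

// Hypothetical helpers: same operands, different grouping.
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }

// Prints both groupings with enough digits to expose a last-bit difference.
void show_grouping_difference() {
    const double l = sum_left(0.1, 0.2, 0.3);
    const double r = sum_right(0.1, 0.2, 0.3);
    std::printf("%.17g\n%.17g\n", l, r);
}
```

With IEEE 754 doubles and default compiler flags (no -ffast-math), the two groupings of 0.1, 0.2 and 0.3 do not compare equal, while exactly representable inputs such as 1.0, 2.0 and 3.0 do.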
The problem occurs here:
#pragma omp parallel default(none) shared(pathPositions, err, n)
{
...
#pragma omp critical
{
err += err_private;
}
}
The final value of err depends on the order in which the different threads arrive at the critical section and their contributions get added, which is why sometimes you see discrepancy right away and sometimes it takes a couple of iterations.
To demonstrate that it is not an OpenMP problem per se, simply modify the function to read:
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
std::vector<double> errs(n+1);
#pragma omp parallel default(none) shared(pathPositions, errs, n)
{
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
errs[i] = derivX_i.norm();
}
}
for (int i = 0; i < n+1; ++i)
err += errs[i];
err = err / static_cast<double>(n);
return err;
}
This removes the dependency on how the sub-sums are computed and added together and the return value will always be the same no matter the number of OpenMP threads.
Another version only fixes the order in which err_private are reduced into err:
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
std::vector<double> errs(omp_get_max_threads());
int nthreads;
#pragma omp parallel default(none) shared(pathPositions, errs, n, nthreads)
{
#pragma omp master
nthreads = omp_get_num_threads();
double err_private = 0;
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
double err_i = derivX_i.norm();
err_private = err_private + err_i;
}
errs[omp_get_thread_num()] = err_private;
}
for (int i = 0; i < nthreads; i++)
err += errs[i];
err = err / static_cast<double>(n);
return err;
}
Again, this code produces the same result each and every time as long as the number of threads is kept constant. The value may differ slightly (in the LSBs) with different number of threads.
You can't easily get around such discrepancies; you can only learn to live with them and take precautions to minimise their influence on the rest of the computation. In fact, you are really lucky to stumble upon this in 2021, in the post-x87 era, when virtually all commodity FPUs use 64-bit IEEE 754 operands, and not in the 1990s, when x87 FPUs used 80-bit operands and the result of a repeated accumulation would depend on whether you kept the value in an FPU register all the time or periodically stored it to memory and loaded it back, which rounds the 80-bit representation to a 64-bit one.
In the mean time, mandatory reading for anyone dealing with math on digital computers.
P.S. Although it is 2021 and we've been living for 21 years in the post-x87 era (started when Pentium 4 introduced the SSE2 instruction set back in 2000), if your CPU is an x86 one, you can still partake in the x87 madness. Just compile your code with -mfpmath=387 :)
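One possible precaution, sketched under assumptions: the question's loop compares against machine epsilon as an absolute threshold, which is smaller than the observed difference of 1.77636e-15. A comparison scaled to the magnitude of the operands tolerates last-bit differences. The helper name and the default tolerances below are illustrative choices, not tuned values.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b agree to within rel_tol of the larger
// magnitude, or to within abs_tol when both values are close to zero.
bool nearly_equal(double a, double b, double rel_tol = 1e-12, double abs_tol = 1e-15) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

In the question's main loop, the check would then read if (!nearly_equal(pe, pe_i)) instead of comparing the absolute difference against epsilon.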

Related

Windows threading synchronization performance issue

I have a threading issue under windows.
I am developing a program that runs complex physical simulations for different conditions. Say a condition per hour of the year, would be 8760 simulations. I am grouping those simulations per thread such that each thread runs a for loop of 273 simulations (on average)
I bought an AMD ryzen 9 5950x with 16 cores (32 threads) for this task. On Linux, all the threads seem to be between 98% to 100% usage, while under windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and windows executions.
I did my best to make the objects immutable to avoid issues, and that provided a big gain with my old 8-thread Intel i7. However, with many more threads, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about windows threads that produces this behaviour?
The custom parallel for code:
/**
* parallel for
* @tparam Index integer type
* @tparam Callable function type
* @param start start index of the loop
* @param end final +1 index of the loop
* @param func function to evaluate
* @param nb_threads number of threads, if zero, it is determined automatically
*/
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start + 1;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (int k1, int k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory access a shared resource (the heap), so the threads end up waiting for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt it to your example code, you could replace the b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I had very good experience with is mimalloc (there are others such as hoard or the one from TBB). You do not need to modify your source code, you just need to link with a specific library as described in the documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements are located behind each other in memory, allowing more efficient access.
But in general I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX, ...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
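The note above about contiguous storage could look roughly like this. It is a sketch, not the project's Matrix class: FlatMatrix and its members are hypothetical names, and row-major layout is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix backed by one allocation instead of one per row.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Element (i, j) lives at offset i * cols + j in the single buffer.
    double& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    const double* data() const { return data_.data(); }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};
```

Constructing such a matrix costs a single heap allocation, and walking along a row touches consecutive memory addresses.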
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.

How to optimize omp parallelization when batching

I am generating class Objects and putting them into std::vector. Before adding, I need to check if they intersect with the already generated objects. As I plan to have millions of them, I need to parallelize this function as it takes a lot of time (The function must check each new object against all previously generated).
Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advice would be appreciated.
bool
Generator::_check_cube (std::vector<Cube> &cubes, const Cube &cube)
{
auto ptr_cube = &cube;
auto npol = cubes.size();
auto ptr_cubes = cubes.data();
const auto nthreads = omp_get_max_threads();
bool check = false;
#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
{
#pragma omp single nowait
{
const auto batch_size = npol / nthreads;
for (int32_t i = 0; i < nthreads; i++)
{
const auto bstart = batch_size * i;
const auto bend = ((bstart + batch_size) > npol) ? npol : bstart + batch_size;
#pragma omp task firstprivate(i, bstart, bend) shared (check)
{
struct bd bd1{}, bd2{};
bd1 = allocate_bd();
bd2 = allocate_bd();
for (auto j = bstart; j < bend; j++)
{
bool loc_check;
#pragma omp atomic read
loc_check = check;
if (loc_check) break;
if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
{
#pragma omp atomic write
check = true;
break;
}
}
free_bd(&bd1);
free_bd(&bd2);
}
}
}
}
return check;
}
UPDATE: The Cube is actually made of smaller objects Cuboids, each of them have size (L, W, H), position coordinates and rotation. The intersect function:
bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
const auto nom = number_of_cuboids();
const auto onom = other.number_of_cuboids();
for (int32_t i = 0; i < nom; i++)
{
get_mcoord(i, bd1);
for (int32_t j = 0; j < onom; j++)
{
other.get_mcoord(j, bd2);
if (check_gjk_intersection(bd1, bd2))
{
return true;
}
}
}
return false;
}
//get_mcoord calculates vertices of the cuboids
void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
for (int32_t i = 0; i < 8; i++)
{
for (int32_t j = 0; j < 3; j++)
{
bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
}
}
}
inline struct bd
allocate_bd()
{
struct bd bd{};
bd.numpoints = 8;
bd.coord = (double **) malloc(8 * sizeof(double *));
for (int32_t i = 0; i < 8; i++)
{
bd.coord[i] = (double *) malloc(3 * sizeof(double));
}
return bd;
}
Typical values: npol > 1 million, threads 32, and each npol Cube consists of 1 - 3 smaller cuboids which are directly checked against other if intersect.
The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, one task may break early, but all the others will still go through their full search.
With recent versions of OpenMP (5, I think) there is a solution for that.
(Not sure about this one: Make your tasks much more fine-grained, for instance one for each intersection test);
Spawn your tasks in a taskloop;
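Once you find your intersection (or any condition that causes you to break), do cancel taskgroup (the cancel construct has no taskloop type; a taskloop runs inside an implicit taskgroup unless nogroup is given).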
Small problem: cancelling is disabled by default. Set the environment variable OMP_CANCELLATION to true.
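A rough sketch of the cancellation idea, under stated assumptions: it uses explicit tasks inside a taskgroup with cancel taskgroup, which is one of the construct types the cancel directive accepts, rather than a taskloop. The hits function is a hypothetical stand-in for the expensive intersection test, and the chunk size is an arbitrary illustrative value. Without OMP_CANCELLATION=true the result is still correct, but other tasks are not cut short.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive intersection test.
static bool hits(int candidate, int probe) { return candidate == probe; }

// Returns true if any element matches; tasks stop early once one match is found
// (the early stop of *other* tasks only happens when cancellation is enabled).
bool any_hit(const std::vector<int>& items, int probe, std::size_t chunk = 4096) {
    bool found = false;
    #pragma omp parallel shared(found)
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            for (std::size_t begin = 0; begin < items.size(); begin += chunk) {
                const std::size_t end = std::min(begin + chunk, items.size());
                #pragma omp task firstprivate(begin, end) shared(found)
                {
                    for (std::size_t j = begin; j < end; ++j) {
                        // Leave this task if another task requested cancellation.
                        #pragma omp cancellation point taskgroup
                        if (hits(items[j], probe)) {
                            #pragma omp atomic write
                            found = true;
                            #pragma omp cancel taskgroup
                            break;
                        }
                    }
                }
            }
        }
    }
    return found;
}
```

Compiled without OpenMP support the pragmas are ignored and the function degrades to a plain serial search.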
Do you have more intersections being true or more being false? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid using a shared resource, so there is no mutex: you let all threads run, and at the end you take a decision given the results. This will likely run faster, but the benefit also depends on a few parameters (e.g., nthreads, ncuboids).
It is possible that on another architecture (e.g., GPU), your algorithm works well as it is. It may be worth benchmarking it on a GPU to see whether you would benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).
You also have a complexity problem: for every new cuboid you compare against up to the whole set of existing cuboids. One way to address this is to gather the cuboids' extents (ranges) per dimension and keep them ordered, and to insert the new cuboid's ranges in order. If there is an intersection in one dimension, you test the next one, and so on. You can also run these tests in parallel. Before running through the ranges, you test whether you are inside the global range at all; if not, it's useless to test the intersection locally.
Here and in general you want to parallelize with a minimum of dependencies (shared resources, mutexes), so you want to find a point of view where this is the case. Parallelizing over dimensions over ordered ranges (segments) might be better than parallelizing over cuboids.
Algorithms and benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are not relevant, but that one may find a smarter approach given those values.
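The per-dimension range test described above can be sketched as a cheap rejection step before the detailed intersection test. The Aabb type is hypothetical: it assumes each object (rotation included) has been enclosed in an axis-aligned box beforehand, so an overlap here is necessary but not sufficient for a real intersection.

```cpp
// Hypothetical axis-aligned bounding box: per-axis [lo, hi] range of one object.
struct Aabb {
    double lo[3];
    double hi[3];
};

// Two boxes can only intersect if their ranges overlap on every axis,
// so one separated axis rejects the pair without a detailed test.
bool ranges_overlap(const Aabb& a, const Aabb& b) {
    for (int d = 0; d < 3; ++d) {
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) {
            return false;
        }
    }
    return true;
}
```

Only pairs for which ranges_overlap returns true would then need the expensive check; keeping the boxes sorted along one axis additionally limits which pairs are visited at all.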
I think your code is memory bound, so its bottleneck is memory reads/writes, not calculations. This can be the main reason for the poor speed increase. As already mentioned by @Soleil, different hardware (GPU) can be beneficial here.
You mentioned in the comments that Generator::_check_cube is called many times. To reduce OpenMP overhead, my suggestion is to move the parallel region out of this function; you can even put it in your main function:
int main(){
#pragma omp parallel
#pragma omp single nowait
{
//your code
}
}
In this case you have to use #pragma omp taskwait to wait for the tasks to complete.
for (int32_t i = 0; i < nthreads; i++)
{
#pragma omp task default(none) firstprivate(...) shared (..)
{
//your code comes here
}
}
#pragma omp taskwait
I also suggest using default(none) clause in #pragma omp task directive so you have to explicitly tell the sharing attribute of all your variables.
Do you really need the function get_mcoord? It looks like a redundant memory copy to me. I think it may be better to write a check_gjk_intersection function which takes _cuboids or its indices as parameters. In that case you get rid of many memory allocations/deallocations of bd1 and bd2, which can also be time consuming, as @Victor pointed out.
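A sketch of how the per-task allocations could be avoided, assuming the intersection routine can be adapted to a fixed-size layout (the question's struct bd uses double** and may be dictated by the GJK code, in which case this does not apply directly). FixedBd and fill_bd are hypothetical names.

```cpp
// Hypothetical replacement for the malloc-based bd: 8 vertices x 3 coordinates
// stored inline, so creating one needs no heap allocation and no free call.
struct FixedBd {
    int numpoints = 8;
    double coord[8][3] = {};
};

// Copies 24 doubles (x, y, z per vertex, an assumed source layout) into bd.
void fill_bd(FixedBd& bd, const double* xyz) {
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 3; ++j) {
            bd.coord[i][j] = xyz[3 * i + j];
        }
    }
}
```

Each task could then declare two FixedBd objects on its own stack instead of calling allocate_bd and free_bd.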

Doesn't see any significant improvement while using parallel block in OpenMP C++

I am receiving an array of Eigen::MatrixXf and Eigen::Matrix4f in realtime. Both of these arrays are having an equal number of elements. All I am trying to do is just multiply elements of both the arrays together and storing the result in another array at the same index.
Please see the code snippet below-
#define COUNT 4
while (all_ok())
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
// at each iteration, new data is filled
// in 'trans' and 'in_data' variables
#pragma omp parallel num_threads(COUNT)
{
#pragma omp for
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_data[i];
}
}
Please note that COUNT is a constant. The size of trans and in_data is (4 x 4) and (4 x n) respectively, where n is approximately 500,000. In order to parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of for loop.
Any suggestions? Any alternatives to perform the same operation, please?
Edit: My idea is to define 4 (=COUNT) threads wherein each of them is taking care of multiplication. In this way, we don't need to create threads every time, I guess!
Works for me using the following self-contained example, that is, I get a x4 speed up when enabling openmp:
#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;
const int COUNT = 4;
EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
#pragma omp parallel for num_threads(COUNT)
for (int i = 0; i < COUNT; i++)
out_data[i] = trans[i] * in_data[i];
}
int main()
{
Eigen::Matrix4f trans[COUNT];
Eigen::MatrixXf in_data[COUNT];
Eigen::MatrixXf out_data[COUNT];
int n = 500000;
for (int i = 0; i < COUNT; i++)
{
trans[i].setRandom();
in_data[i].setRandom(4,n);
out_data[i].setRandom(4,n);
}
int tries = 3;
int rep = 1;
BenchTimer t;
BENCH(t, tries, rep, foo(trans, in_data, out_data));
std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";
return 0;
}
So 1) make sure you measure the wall-clock time and not the CPU time, and 2) make sure that the product is the bottleneck and not the filling of in_data.
Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler's optimization ON.
For the record, on my computer the above example takes 0.25s without openmp, and 0.065s with.
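To illustrate the wall-clock versus CPU-time point, here is a small sketch using only the standard library. The helper names are made up for this example. On typical Linux systems std::clock reports processor time summed over all threads of the process, so a well-parallelized loop can show a smaller wall time while the CPU time stays about the same or grows.

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Hypothetical helper: elapsed wall-clock seconds for one call of f.
template <typename F>
double wall_seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// For contrast: processor time consumed by the process during one call of f.
template <typename F>
double cpu_seconds(F&& f) {
    const std::clock_t c0 = std::clock();
    std::forward<F>(f)();
    const std::clock_t c1 = std::clock();
    return static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
}
```

Wrapping the parallel loop in wall_seconds (or using omp_get_wtime) is what should be compared between the serial and the OpenMP build.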
You need to specify -fopenmp during both compilation and linking. But you will quickly hit the limit where RAM access prevents any further speed-up. You really should have a look at vector intrinsics. Depending on your CPU, you could accelerate your operations by a factor of up to the size of your register divided by the size of your variable (float = 4 bytes). So if your processor supports, say, AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here:
https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp
The code does the whole shebang for float/double real and complex.

Thread safety while looping with OpenMP

I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>
mpz_class collatz(mpz_class n) {
if (mpz_odd_p(n.get_mpz_t())) {
n *= 3;
n += 1;
} else {
n /= 2;
}
return n;
}
int main() {
mpz_class x = 1;
#pragma omp parallel
while (true) {
//std::cout << x.get_str(10);
while (true) {
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
x = collatz(x);
}
x++;
//std::cout << " OK" << std::endl;
}
}
Given that I did not get this error when I uncomment the outputs to screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is to check if the collatz conjecture holds for all numbers. The program you posted is wrong on many levels both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
Means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway, one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:
void check_collatz(mpz_class n) {
while (n != 1) {
n = collatz(n);
}
}
int main() {
mpz_class x = 1;
while (true) {
std::cout << x.get_str(10);
check_collatz(x);
x++;
}
}
The while (true) loop is bad to reason about and parallelize, so let's just make an equivalent for loop:
for (mpz_class x = 1;; x++) {
check_collatz(x);
}
Now, we can talk about parallelizing the code. The basis for OpenMP parallelizing is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check < std::numeric_limits<long>::max(); check++)
{
check_collatz(check);
}
Note that check is implicitly private, there is a copy for each thread working on it. Also OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz a new, private, mpz_class object will be created for it.
Now, you might notice, that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse that (by breaking check_collatz again) and creating a thread-private mpz_class working object. For this, you split the compound parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>
// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
if (mpz_odd_p(n.get_mpz_t()))
{
n *= 3;
n += 1;
}
else
{
n /= 2;
}
}
int main()
{
#pragma omp parallel
{
mpz_class x;
#pragma omp for
for (long check = 1; check < std::numeric_limits<long>::max(); check++)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
}
Note that declaring x inside the parallel region makes sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private, which often leads to confusion because explicitly private variables from an outer scope are uninitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
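A sketch of the dynamic-scheduling suggestion, with assumptions stated: it uses 64-bit integers instead of GMP so that it stays self-contained, and the chunk size of 1024 is an arbitrary illustrative value. The function names are made up for this example.

```cpp
#include <cstdint>

// Number of Collatz steps needed to reach 1
// (assumes the trajectory of n does not overflow 64 bits).
int collatz_steps(std::uint64_t n) {
    int steps = 0;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n / 2;
        ++steps;
    }
    return steps;
}

// Longest trajectory among 1..limit. Iteration costs vary a lot, so chunks of
// iterations are handed out on demand instead of being split up front.
int max_steps_up_to(long long limit) {
    int best = 0;
    #pragma omp parallel for schedule(dynamic, 1024) reduction(max : best)
    for (long long n = 1; n <= limit; ++n) {
        const int s = collatz_steps(static_cast<std::uint64_t>(n));
        if (s > best) best = s;
    }
    return best;
}
```

With schedule(dynamic, 1024) a thread that finishes its chunk early simply fetches the next one, at the cost of some scheduling overhead per chunk.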
Edit: For academic sake, here is one idea how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
// Note this is not a "parallel" loop;
// these are just separate loops over distinct strided sequences
int nthreads = omp_get_num_threads();
mpz_class check = 1;
// we already checked those in the other program
check += std::numeric_limits<long>::max();
check += omp_get_thread_num();
mpz_class x;
for (; ; check += nthreads)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets its own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are shared, so there is one instance used by all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see the same value of x without requiring mutexes or other synchronization techniques.

OpenMP: pragma cancel for ON NUMA

---------------------EDIT-------------------------
I have edited the code as follows:
#pragma omp parallel for private(i, piold, err) shared(threshold_err) reduction(+:pi) schedule (static)
{
for (i = 0; i < 10000000000; i++){ //1000000000//705035067
piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
err = fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
}
}
}
pi = 4*pi;
I compile it with LLVM3.9/Clang4.0. When I run it with one thread I get expected results with pragma cancel action (checked against non pragma cancel version, resulted in faster run).
But when I run it with two or more threads, the program seems to loop forever. I am running the code on NUMA machines. What is happening? Perhaps the cancel condition is never satisfied, but then the code takes longer than the single-threaded version without pragma cancel! FYI, it runs fine when OMP_CANCELLATION=false.
I have following OpenMP code. I am using LLVM-3.9/Clang-4.0 to compile this code.
#pragma omp parallel private(i, piold, err) shared(pi, threshold_err)
{
#pragma omp for reduction(+:pi) schedule (static)
for (i = 0; i < 10000000 ; i++){
piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
#pragma omp critical
{
err = fabs(pi-piold);// printf("Err: %0.11f\n", err);
}
if ( err < threshold_err){
printf("Cancelling!\n");
#pragma omp cancel for
}
}
}
Unfortunately, I do not think the #pragma omp cancel for is terminating the whole for loop. I print the err value at the end, but with parallelism it is unclear which thread's value is being printed. The final value of err is smaller than threshold_err. "Cancelling!" is printed, but at the very beginning of the program, which is surprising. The program keeps running after that!
How can I make sure that this is a correct implementation? BTW, OMP_CANCELLATION is set to true and a small test program returns '1' from the corresponding function, omp_get_cancellation().
As I understand it, omp cancel is just a break signal: it notifies the runtime so that no further work is started. Threads which are still running continue until the end of their current work. See http://bisqwit.iki.fi/story/howto/openmp/ and http://jakascorner.com/blog/2016/08/omp-cancel.html
In fact, in my opinion, your program produces an acceptable approximation. However, some variables can be kept in a smaller scope. This is my suggestion:
#include <iostream>
#include <cmath>
#include <iomanip>
int main() {
long double pi = 0.0;
long double threshold_err = 1e-7;
int cancelFre = 0;
#pragma omp parallel shared(pi, threshold_err, cancelFre)
{
#pragma omp for reduction(+:pi) schedule (static)
for (int i = 0; i < 100000000; i++){
long double piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
long double err = std::fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
cancelFre++;
}
}
}
std::cout << std::setprecision(10) << pi * 4 << " " << cancelFre;
return 0;
}
Okay so I solved it. In my code above the problem was here:
err = fabs(pi-piold);
In the line above, pi is changed before the following if condition is checked, and multiple threads do the same. As I understand it, this makes the program deadlock.
I solved it by forcing only one thread, master, to do this check:
if(omp_get_thread_num()==0){
err = fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
}
}
I could have used #pragma omp single but it gave error about nested pragmas.
Here the performance suffers with a low number of threads (1-4 threads are slower than the plain sequential code); beyond that, the performance improves. This is not the best solution, and someone can surely improve upon it.
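One alternative that avoids cancellation altogether, offered as a sketch rather than as the fix used above: in this series the change of pi in iteration i is exactly the term 1/(2i+1), so the index at which it drops below the threshold can be computed before the loop, and a plain reduction over that many terms suffices. The function name is hypothetical and threshold is assumed to lie in (0, 1].

```cpp
#include <cmath>

// Leibniz series for pi, summed up to and including the first term whose
// magnitude 1/(2i+1) is below threshold (assumes 0 < threshold <= 1).
double leibniz_pi(double threshold) {
    // Smallest i with 1/(2i+1) < threshold is floor((1/threshold - 1)/2) + 1;
    // one more is added because that term itself is still summed.
    const long long n_terms =
        static_cast<long long>((1.0 / threshold - 1.0) / 2.0) + 2;
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum) schedule(static)
    for (long long i = 0; i < n_terms; ++i) {
        sum += ((i & 1) ? -1.0 : 1.0) / (2.0 * i + 1.0);
    }
    return 4.0 * sum;
}
```

Because the iteration count is fixed up front, every thread gets a well-defined share of the work and no thread has to inspect another thread's partial sum.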