Nested OpenMP causes segmentation fault (Mac OS X only) - C++

I am building a C++ application that uses nested OpenMP, but it crashes. The problem goes away when either one of the two OpenMP regions is removed, or when the wait function is placed in the main file itself. The OS is Mac OS X Lion; the compiler should be either llvm-gcc or gcc-4.2 (I am not sure, I simply used CMake...). I built the following app to demonstrate:
EDIT: I have now tried the same on a Linux machine and it works fine, so it is purely a Mac OS X (Lion) issue.
OMP_NESTED is set to true.
The main:
#include "waiter.h"
#include "iostream"
#include "time.h"
#include <omp.h>
void wait(){
int seconds = 1;
#pragma omp parallel for
for (int i=0;i<2;i++){
clock_t endwait;
endwait = clock () + seconds * CLOCKS_PER_SEC ;
while (clock() < endwait) {}
std::cout << i << "\n";
}
}
int main(){
std::cout << "blub\n";
#pragma omp parallel for
for(int i=0;i<5;i++){
Waiter w; // causes crash
// wait(); // works
}
std::cout << "blub\n";
return 0;
}
header:
#ifndef WAITER_H_
#define WAITER_H_

class Waiter {
public:
    Waiter();
};

#endif // WAITER_H_
implementation:
#include "waiter.h"
#include "omp.h"
#include "time.h"
#include <iostream>
Waiter::Waiter(){
int seconds = 1;
#pragma omp parallel for
for (int i=0;i<5;i++){
clock_t endwait;
endwait = clock () + seconds * CLOCKS_PER_SEC ;
while (clock() < endwait) {}
std::cout << i << "\n";
}
}
CMakeLists.txt:
cmake_minimum_required (VERSION 2.6)
project (waiter)
set(CMAKE_CXX_FLAGS "-fPIC -fopenmp")
set(CMAKE_C_FLAGS "-fPIC -fopenmp")
set(CMAKE_SHARED_LINKER_FLAGS "-fPIC -fopenmp")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${PROJECT_BINARY_DIR}/lib)
set(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR}/bin)
add_library(waiter SHARED waiter.cpp waiter.h)
add_executable(use_waiter use_waiter.cpp)
target_link_libraries(use_waiter waiter)
Thanks for the help!

EDIT: rewritten with more details.
OpenMP causes intermittent failures with gcc 4.2, but they are fixed by gcc 4.6.1 (or perhaps 4.6). You can get a 4.6.1 binary from http://hpc.sourceforge.net/ (look for gcc-lion.tar.gz).
The OpenMP failure on Lion with anything older than gcc 4.6.1 is intermittent. It seems to happen after many OpenMP calls, so nesting probably makes it more likely, but nesting is not required. The linked example has no nested OpenMP (there is a parallel for inside an ordinary single-threaded for loop), yet it still fails. My own code (with no nested pragmas) hung or crashed intermittently due to OpenMP after many minutes of working fine with gcc 4.2 on Lion, and switching to gcc 4.6.1 fixed it completely.
I downloaded your code and compiled it with gcc 4.2 and it ran fine on my machine (with both the Waiter w; and wait(); options :-). I just used:
g++ -v -fPIC -fopenmp use_waiter.cpp waiter.cpp -o waiter
I tried increasing the loop maxes but still couldn't get it to fail. I see both the starting and ending blub.
What error message do you see?
Are you sure that the gcc 4.6 you downloaded is being used (use -v to make sure)?
See also here.
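If it helps, here is a small sketch of mine (not part of the original post) that prints which OpenMP runtime built the binary and whether nesting is really enabled, so you can rule out a mismatched compiler before blaming the nested regions:
#include <omp.h>
#include <iostream>

int main() {
    // _OPENMP encodes the OpenMP spec date implemented by the compiler,
    // which helps confirm which compiler/runtime actually built the binary.
    std::cout << "_OPENMP = " << _OPENMP << "\n";
    std::cout << "nesting enabled: " << omp_get_nested() << "\n";
    std::cout << "max threads:     " << omp_get_max_threads() << "\n";

    omp_set_nested(1); // same effect as OMP_NESTED=TRUE
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)
        {
            #pragma omp critical
            std::cout << "outer " << outer << ", inner "
                      << omp_get_thread_num() << "\n";
        }
    }
    return 0;
}
With nesting active you should see four "outer x, inner y" lines; if the inner regions are serialized you will only ever see "inner 0".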

Related

Why is my OpenMP program slow when running it with CLion and WSL, but fast when I compile it manually?

I have a very simple matrix multiplication program to try out OpenMP:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    const int N = 1000;
    // static so the ~8 MB matrix does not overflow the stack
    static double matrix[N][N];
    static double vector[N];
    static double outvector[N];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix[i][j] = rand() % 50;
        }
    }
    for (int i = 0; i < N; i++) {
        vector[i] = rand() % 50;
    }

    double t = omp_get_wtime();
    omp_set_num_threads(12);
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            outvector[i] += matrix[i][j] * vector[j];
        }
    }
    t = omp_get_wtime() - t;

    printf("%g\n", t);
    return 0;
}
I compile it two ways:
1. Using my Windows Subsystem for Linux (WSL) directly, with g++ main.cpp -o main -fopenmp. This way, it runs significantly faster than if I comment out #pragma omp parallel for (as one would expect).
2. Using my CLion toolchain. It's the default WSL toolchain, with the following CMakeLists file:
cmake_minimum_required(VERSION 3.9)
project(Fernuni)
set(CMAKE_CXX_STANDARD 14)
# There was no difference between using this or find_package
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -fopenmp")
find_package(OpenMP)
if (OPENMP_FOUND)
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()
add_executable(Fernuni main.cpp)
This way, it runs about ten times slower if #pragma omp parallel for is there.
Why?
Edit: here are my specific timings with various combinations of optimization settings and with the pragma directive enabled or disabled:
          OMP,O0      OMP,O3      O0          O3        OMP only    nothing
CLion     0.0332578   0.0234029   0.0023873   6.4e-06   0.0058386   0.0094753
WSL/g++   0.007106    0.0012252   0.0038349   5.1e-06   0.0008419   0.0021912
You are mainly measuring the overheads of the operating system, the hardware, and the OpenMP runtime, plus the effect of compiler optimizations.
Indeed, the computation itself should take less than 1 ms, while the time to create the OS threads, schedule them, and initialize OpenMP can be of the same order, depending on the OS and the runtime (it takes between 0.3 and 0.8 ms on my machine to do only that). Not to mention that the caches are cold (many cache misses occur) and the processor frequency may not yet be high (because of frequency scaling).
Moreover, the compiler can optimize the loop away entirely, since it has no side effects and its result is never read. GCC actually does this with -O3 without OpenMP, but not with it (see here); this is likely because with OpenMP the optimizer can no longer see that the output is unused (due to internal escaped references). That explains why you get better timings without OpenMP. You should read the results, at least to check that they are correct! Put shortly, the benchmark is flawed.
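For example, something along these lines at the end of your main (a rough sketch reusing your N, outvector and t) is enough to keep the compiler honest:
// Consume the result so the compiler cannot discard the timed loop.
// A checksum is the cheapest way; comparing against a serial reference
// computation would additionally verify correctness.
double checksum = 0.0;
for (int i = 0; i < N; i++) {
    checksum += outvector[i];
}
printf("time = %g s, checksum = %g\n", t, checksum);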
Note that using -march=native, or -mavx together with -mfma, can speed up the loop a lot in this case, since these instruction sets can theoretically divide the number of executed instructions by four. In practice, though, the program will likely be memory-bound (due to the size of the matrix).

OpenMP 4.5 won't offload to GPU with target directive

I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below, it shows that it can see the 8 GPUs, but when I try to offload it reports that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>

#define n 10000
#define m 10000

using namespace std;

int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();

    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;

    for (int iter = 0; iter < iter_max; iter++) {
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j) {
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }

    if (target[0]) {
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time, target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running in OpenACC quickly.
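To give a flavour of that translation (a sketch of mine, not something I have run on the questioner's cluster), the update loop from the question would look roughly like this in OpenACC; the copyin/copy clauses make the data movement explicit, and PGI reports what it offloaded when you compile with pgc++ -acc -Minfo=accel:
// OpenACC sketch of the questioner's stencil update (the arrays have static
// sizes, so the data clauses can name the whole arrays).
#pragma acc parallel loop reduction(max:err) copyin(A) copy(Anew)
for (int j = 1; j < n-1; ++j) {
    #pragma acc loop reduction(max:err)
    for (int i = 1; i < m-1; i++) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        err = fmax(err, fabs(Anew[j][i] - A[j][i]));
    }
}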
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
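For instance, a minimal self-contained check along those lines (my sketch; device 0 is just an assumption, any id below omp_get_num_devices() would do):
#include <omp.h>
#include <stdio.h>

int main() {
    int on_host = -1;
    // Request device 0 explicitly and map the flag back to the host.
    #pragma omp target device(0) map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
    printf("devices seen: %d, target region ran on %s\n",
           omp_get_num_devices(), on_host ? "the host CPU" : "a GPU");
    return 0;
}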

Program works slower when running second time after recompilation

Performance of this simple program (generate 1,200,000 unique randomly shuffled integers, then sort them) is slower when I run it from Qt Creator a second time after recompilation (and on all subsequent runs, until the next recompilation).
#include <iostream>
#include <random>
#include <algorithm>
#include <chrono>
#include <iterator>
#include <utility>
#include <cstdint>

using size_type = std::uint32_t;

alignas(64) size_type v[1200000];

// behaviour really does not depend on CPU affinity
#ifdef __linux__
#include <sched.h>
#endif

int main()
{
#ifdef __linux__
    {
        cpu_set_t m;
        int status;
        CPU_ZERO(&m);
        CPU_SET(0, &m);
        status = sched_setaffinity(0, sizeof(m), &m);
        if (status != 0) {
            perror("sched_setaffinity");
        }
    }
#endif
    std::mt19937 g(0);
    for (size_type i = 1; i < std::size(v); ++i) {
        v[i] = std::exchange(v[g() % i], i);
    }
    for (size_type i = 0; i < 10; ++i) { // first output does not depend on number of iterations
        auto start = std::chrono::high_resolution_clock::now();
        std::sort(std::begin(v), std::end(v));
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - start).count() << std::endl;
    }
}
Say, the first time it prints:
97896
26069
25628
25771
25863
25722
25976
25855
25687
25735
and then:
137238
35056
34880
34468
34746
27309
25781
25932
25502
25383
Yet another run (and all further runs look like the second and third):
137648
35086
34966
26005
26305
26435
25683
25440
25981
25632
If I recompile the program, then it all repeats again.
If I recompile the program and run it from the console, then every run starts from a value near 137000, even the first one, and looks like this:
137207
35059
35035
34844
34563
34586
34466
34132
34327
34487
If it matters, I build and run the above program on Ubuntu Desktop 16.04.3 64-bit, on an AMD A10-7800 Radeon R7 (12 Compute Cores 4C+8G) with 8 GB RAM and an SSD, without root privileges and without a debugger attached. I use g++-7 -m32 -march=native -mtune=native -O3, with gold and ccache.
I expected the inverse result, because of (maybe) branch-prediction caching or some other caching (if that is even possible between consecutive runs of the same executable), but the results are discouraging.

How to make GNU GCC optimize OpenMP threads similarly

This is my first post here. Yay! Back to the problem:
I'm learning how to use OpenMP. My IDE is Code::Blocks. I want to improve some of my older programs. I need to be sure that the results will be exactly the same. It appears that "for" loops are optimized differently in the master thread than in the other threads.
Example:
#include <iostream>
#include <omp.h>

int main()
{
    std::cout.precision(17);
    #pragma omp parallel for schedule(static, 1) ordered
    for (int i = 0; i < 4; i++)
    {
        double sum = 0.;
        for (int j = 0; j < 10; j++)
        {
            sum += 10.1;
        }
        #pragma omp ordered
        std::cout << "thread " << omp_get_thread_num() << " says " << sum << "\n";
    }
    return 0;
}
produces
thread 0 says 101
thread 1 says 100.99999999999998579
thread 2 says 100.99999999999998579
thread 3 says 100.99999999999998579
Can I somehow make sure all threads receive the same optimization that my single-threaded programs (which didn't use OpenMP) have received?
EDIT:
The compiler is "compiler and GDB debugger from TDM-GCC (version 4.9.2, 32 bit, SJLJ)", whatever that means. It's the IDE's "default". I'm not familiar with compiler differences.
The output provided comes from the "Release" build, which adds the "-O2" argument.
None of the "-O", "-O1", or "-O3" arguments produces "101".
You can try my .exe from dropbox (zip file, also contains possibly required dlls).
This happens because the float and double data types cannot represent some numbers, such as 20.2, exactly:
#include <iostream>

int main()
{
    std::cout.precision(17);
    double a = 20.2;
    std::cout << a << std::endl;
    return 0;
}
Its output will be:
20.199999999999999
For more information on this, see:
Unexpected Output when adding two float numbers
I don't know why this does not happen for the first thread, but even if you remove OpenMP you will get the same (inexact) result.
From what I can tell, this is simply numerical accuracy. For a double you should expect about 16 digits of precision.
I.e. the result is 101 +/- 1e-16*101.
This is exactly the range you get, and unless you use something like quadruple precision, this is as good as it gets.
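To make the rounding concrete, here is a small sketch of my own (not from the answers above): summing 10.1 ten times rounds after every addition and accumulates error, while a single multiplication rounds only once and happens to land exactly on 101. If the compiler folds one thread's chunk of the loop into such a constant while the others add at run time, the printed digits differ, which may be what is happening to the master thread here.
#include <iostream>

int main()
{
    std::cout.precision(17);

    // Ten repeated additions: every partial sum is rounded, so error accumulates.
    double repeated = 0.0;
    for (int j = 0; j < 10; j++)
        repeated += 10.1;

    // A single multiplication: one rounding step, which lands on exactly 101.
    double single = 10 * 10.1;

    std::cout << "repeated addition: " << repeated << "\n"; // 100.99999999999998579
    std::cout << "single multiply:   " << single << "\n";   // 101
    return 0;
}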

C++ + OpenMP for parallel computing: how to set up in Visual Studio?

I have a C++ program that creates an object and then calls two functions of this object that are independent of one another. So it looks like this:
Object myobject(arg1, arg2);
double answer1 = myobject.function1();
double answer2 = myobject.function2();
I would like to have those two computations run in parallel to save computation time. I've seen that this could be done using OpenMP, but I couldn't figure out how to set it up. The only examples I found sent the same calculation ("hello world!", for example) to the different cores, and the output was "hello world!" twice. How can I do it in this situation?
I use Windows XP with Visual Studio 2005.
You should look into the sections construct of OpenMP. It works like this:
#pragma omp parallel sections
{
    #pragma omp section
    {
        ... section 1 block ...
    }
    #pragma omp section
    {
        ... section 2 block ...
    }
}
Both blocks might execute in parallel given that there are at least two threads in the team but it is up to the implementation to decide how and where to execute each section.
There is a cleaner solution using OpenMP tasks, but it requires that your compiler supports OpenMP 3.0. MSVC only supports OpenMP 2.0 (even in VS 11!).
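For reference, here is a sketch of that task-based version (mine, assuming an OpenMP 3.0 compiler such as a recent GCC, so it will not build with MSVC's OpenMP 2.0 support), reusing the questioner's myobject:
double answer1 = 0.0, answer2 = 0.0;

#pragma omp parallel num_threads(2)
{
    #pragma omp single nowait
    {
        // Each task can be picked up by any thread in the team.
        #pragma omp task shared(answer1, myobject)
        answer1 = myobject.function1();

        #pragma omp task shared(answer2, myobject)
        answer2 = myobject.function2();

        #pragma omp taskwait // both results are ready past this point
    }
}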
You should explicitly enable OpenMP support in your project's settings. If you are doing compilation from the command line, the option is /openmp.
If your code does not require a lot of memory, you can use the MPI library too. For this purpose, first install MPI for your Visual Studio from this tutorial: Compiling MPI Programs in Visual Studio
or from here: MS-MPI with Visual Studio 2008.
Then use this MPI hello-world code:
#include <iostream>
#include <mpi.h>

using namespace std;

int main(int argc, char** argv) {
    int mynode, totalnodes;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    cout << "Hello world from process " << mynode;
    cout << " of " << totalnodes << endl;
    MPI_Finalize();
    return 0;
}
For your base code, add your functions to it and assign each process its job with if statements like these:
if (mynode == 0) { function1(); }
if (mynode == 1) { function2(); }
function1 and function2 can be anything you like that executes at the same time, but be careful that these two functions are independent of each other.
That's it!
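One caveat the above glosses over: with MPI the two results live in different processes, so rank 1 has to send its value back if rank 0 needs both answers. A sketch of that step (mine, assuming function1 and function2 return double as in the question):
double answer1 = 0.0, answer2 = 0.0;

if (mynode == 0) {
    answer1 = function1();
    // Receive rank 1's result so both answers are available on rank 0.
    MPI_Recv(&answer2, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else if (mynode == 1) {
    answer2 = function2();
    MPI_Send(&answer2, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}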
The first part of this is getting OpenMP up and running with Visual Studio 2005, which is quite old; it takes some doing, but it's described in the answer to this question.
Once that's done, it's fairly easy to do this simple form of task parallelism if you have two methods which are genuinely completely independent. Note that qualifier; if the methods are reading the same data, that's OK, but if they're updating any state that the other method uses, or calling any other routines that do so, then things will break.
As long as the methods are completely independent, you can use sections for these (tasks are actually the more modern, OpenMP 3.0 way of doing this, but you probably won't be able to get OpenMP 3.0 support for such an old compiler); you will also see people misusing parallel for loops to achieve this, which at least has the advantage of letting you control the thread assignments, so I include that here for completeness even though I can't really recommend it:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int f1() {
    int tid = omp_get_thread_num();
    printf("Thread %d in function f1.\n", tid);
    sleep(rand() % 10);
    return 1;
}

int f2() {
    int tid = omp_get_thread_num();
    printf("Thread %d in function f2.\n", tid);
    sleep(rand() % 10);
    return 2;
}

int main(int argc, char **argv) {
    int answer;
    int ans1, ans2;

    /* using sections */
    #pragma omp parallel num_threads(2) shared(ans1, ans2, answer) default(none)
    {
        #pragma omp sections
        {
            #pragma omp section
            ans1 = f1();
            #pragma omp section
            ans2 = f2();
        }
        #pragma omp single
        answer = ans1 + ans2;
    }
    printf("Answer = %d\n", answer);

    /* hacky approach, mis-using a for loop */
    answer = 0;
    #pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:answer) default(none)
    for (int i = 0; i < 2; i++) {
        if (i == 0)
            answer += f1();
        if (i == 1)
            answer += f2();
    }
    printf("Answer = %d\n", answer);

    return 0;
}
Running this gives
$ ./sections
Thread 0 in function f1.
Thread 1 in function f2.
Answer = 3
Thread 0 in function f1.
Thread 1 in function f2.
Answer = 3