This is my first post here. Yay! Back to the problem:
I'm learning how to use OpenMP. My IDE is Code::Blocks. I want to improve some of my older programs. I need to be sure that the results will be exactly the same. It appears that "for" loops are optimized differently in the master thread than in the other threads.
Example:
#include <iostream>
#include <omp.h>

int main()
{
    std::cout.precision(17);
    #pragma omp parallel for schedule(static, 1) ordered
    for (int i = 0; i < 4; i++)
    {
        double sum = 0.;
        for (int j = 0; j < 10; j++)
        {
            sum += 10.1;
        }
        #pragma omp ordered
        std::cout << "thread " << omp_get_thread_num() << " says " << sum << "\n";
    }
    return 0;
}
produces
thread 0 says 101
thread 1 says 100.99999999999998579
thread 2 says 100.99999999999998579
thread 3 says 100.99999999999998579
Can I somehow make sure all threads receive the same optimization as my single-threaded programs (which didn't use OpenMP) received?
EDIT:
The compiler is "compiler and GDB debugger from TDM-GCC (version 4.9.2, 32 bit, SJLJ)", whatever that means. It's the IDE's "default". I'm not familiar with compiler differences.
The output provided comes from the "Release" build, which adds the "-O2" flag.
None of the "-O", "-O1" or "-O3" flags produces "101" either.
You can try my .exe from dropbox (zip file, also contains possibly required dlls).
This happens because the float and double data types cannot exactly represent some numbers, like 20.2.
#include <iostream>

int main()
{
    std::cout.precision(17);
    double a = 20.2;
    std::cout << a << std::endl;
    return 0;
}
Its output will be
20.199999999999999
For more information on this, see
Unexpected Output when adding two float numbers
I don't know why this doesn't happen for the first thread, but if you remove OpenMP you will still get the same result.
From what I can tell, this is simply numerical accuracy. For a double value you should expect about 16 digits of precision.
I.e. the result is 101 +/- 1.e-16*101.
This is exactly the range you get, and unless you use something like quadruple precision, this is as good as it gets.
Related
I am trying to make a simple GPU offloading program using openMP. However, when I try to offload it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below it shows me that it can see the 8 GPUs, but when I try to offload it says that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>

#define n 10000
#define m 10000

using namespace std;

int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();

    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;

    for (int iter = 0; iter < iter_max; iter++) {
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j) {
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }

    if (target[0]) {
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time, target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running with OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) clause on your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
I'm learning OpenMP in C++ using gcc 8.1.0 and MinGW64 (latest version as of this month), and I'm running into a weird debug error when my program encounters a segmentation fault.
I know the cause of the crash, attempting to create too many OpenMP threads (50,000), but it's the error itself that has me puzzled. I didn't compile gcc or MinGW64 from source, I just used the installers, and I'm on Windows.
Why is it looking for cygwin.s, and why use that file structure on Windows? My code and the error message from gdb are below the closing.
I'm learning OpenMP in the process of programming a path tracer, and I think I have a workaround for the thread limit (using while (threads < runs) and letting OpenMP set the thread count automatically), but I am stumped as to the error. Is there a workaround or solution for this?
It works fine with ~10,000 threads. I know it's not actually creating 10,000 threads simultaneously, but it's what I was doing before I thought of the workaround.
Thank you for the heads up about rand() and thread safety. I ended up replacing my RNG code with some that appears to be working fine in OpenMP, and it's literally a night and day difference visually. I will try the other changes and report back. Thanks!
WOW! It runs so much faster and the image is artifact-free! Thank you!
Jadan Bliss
Final code:
#pragma omp parellel // note: "parellel" is misspelled, so this outer pragma is ignored by the compiler
for (j = options.height - 1; j >= 0; j--) {
    for (i = 0; i < options.width; i++) {
        #pragma omp parallel for reduction(Vector3Add:col)
        for (int s = 0; s < options.samples; s++)
        {
            float u = (float(i) + scene_drand()) / float(options.width);
            float v = (float(j) + scene_drand()) / float(options.height);
            Ray r = cam.get_ray(u, v); // was: origin, lower_left_corner + u*horizontal + v*vertical);
            col += color(r, world, 0);
        }
        col /= real(options.samples);
        render.set(i, j, col);
        col = Vector3(0.0);
    }
}
Error:
Starting program:
C:\Users\Jadan\Documents\CBProjects\learnOMP\bin\Debug\learnOMP.exe
[New Thread 22136.0x6620] [New Thread 22136.0x80a8] [New Thread
22136.0x8008] [New Thread 22136.0x5428]
Thread 1 received signal SIGSEGV, Segmentation fault.
___chkstk_ms () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:126 126
../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file
or directory.
Here are some remarks on your code.
Using a huge number of threads will not bring you any gain and is the probable reason for your problems. Thread creation has a time and resource cost. The time cost will probably dominate your program's runtime, making the parallel version far slower than its sequential counterpart. Concerning the resource cost, each thread has its own stack segment. Its size is system dependent, but typical values are measured in MB. I do not know the characteristics of your system, but with 100,000 threads this is probably why your code is crashing. I have no explanation for the message about cygwin.S, but after a stack overflow the behavior can be weird.
Threads are a means to parallelize code, and for data parallelism it is most of the time useless to have more threads than the number of logical processors on your system. Let OpenMP set it, but you can experiment later to tune this number.
Besides that, there are other problems.
rand() is not thread safe, as it uses a global state that is modified concurrently by the threads. rand_r() is thread safe, as the state of the random generator is not global and can be stored per thread.
You should not modify a shared variable like result without atomic access, as concurrent thread accesses can lead to unexpected results. While safe, using an atomic modification for every value is not a very efficient solution, though: atomic accesses are expensive, and it is better to use a reduction that does local accumulation in every thread and a single atomic access at the end.
#include <omp.h>
#include <iostream>
#include <random>
#include <time.h>

int main()
{
    int runs = 100000;
    double result = 0.0;

    #pragma omp parallel
    {
        // per-thread initialisation of the rand_r seed
        unsigned int rand_state = omp_get_thread_num() * time(NULL);
        // or whatever thread-dependent seed

        #pragma omp for reduction(+:result)
        for (int i = 0; i < runs; i++)
        {
            double d = double(rand_r(&rand_state)) / double(RAND_MAX);
            result += d;
        }
    }
    result /= double(runs);
    std::cout << "The computed average over " << runs << " runs was "
              << result << std::endl;
    return 0;
}
OpenMP used to run my project on 6 threads, and now (I have no idea why) the program is single-threaded.
My code is pretty simple. I only use OpenMP in one cpp file, where I added
#include <omp.h>
Then the function to parallelize is:
#pragma omp parallel for collapse(2) num_threads(IntervalMapEstimator::m_num_thread)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        // used for debugging
        omp_set_num_threads(5);
        std::cout << "omp_get_num_threads = " << omp_get_num_threads() << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads() << std::endl;

        if (split_points) {
            extract_relevant_points_from_angle_lists(relevant_points, pointcloud_ff_polar_angle_lists, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        } else {
            extract_relevant_points_multithread_with_localvector(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        }
    }
}
omp_get_num_threads returns 1
omp_get_max_threads returns 5
IntervalMapEstimator::m_num_thread is set to 6
Any lead would be greatly appreciated.
EDIT 1:
I modified my code, but the problem remains; the program still runs in a single thread.
omp_get_num_threads returns 1
omp_get_max_threads returns 8
Is there a way to know how many threads are available at run time?
#pragma omp parallel for collapse(2)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        std::cout << "omp_get_num_threads = " << omp_get_num_threads() << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads() << std::endl;
        extract_relevant_points(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
    }
}
I just saw that my computer is beginning to run low on memory; could that be part of the problem?
According to https://msdn.microsoft.com/en-us/library/bx15e8hb.aspx:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You request 6 threads, the implementation can only provide 5, so it is free to do what it wants.
I'm also pretty sure you are not supposed to change the number of threads while inside a parallel region, so your omp_set_num_threads will do nothing at best and blow up in your face at worst.
I found the answer thanks to another post: Why does the compiler ignore OpenMP pragmas?
In the end it was simply a compiler flag I hadn't added, and I didn't notice because I compile with cmake, so I never type the command line directly. Also, I use catkin_make to compile, so the console shows me only errors, not warnings.
So basically, to use OpenMP you have to pass -fopenmp to your compiler, and if you don't... well, the pragmas are just ignored.
I'm trying to use OpenMP and I'm seeing strange results.
The parallel "for" runs faster with OpenMP, as expected. But the serial "for" runs much faster when OpenMP is disabled (without the /openmp option, VS 2013).
Test code
const int n = 5000;
const int m = 2000000;
vector<double> a(n, 0);

double start = omp_get_wtime();
#pragma omp parallel for shared(a)
for (int i = 0; i < n; i++)
{
    double StartVal = i;
    for (int j = 0; j < m; ++j)
    {
        a[i] = (StartVal + log(exp(exp((double)i))));
    }
}
cout << "omp Time: " << (omp_get_wtime() - start) << endl;

start = omp_get_wtime();
for (int i = 0; i < n; i++)
{
    double StartVal = i;
    for (int j = 0; j < m; ++j)
    {
        a[i] = (StartVal + log(exp(exp((double)i))));
    }
}
cout << "serial Time: " << (omp_get_wtime() - start) << endl;
Output without /openmp option
0
omp Time: 6.4389
serial Time: 6.37592
Output with /openmp option
0
1
2
3
omp Time: 1.84636
serial Time: 16.353
Are these results correct? Or am I doing something wrong?
I believe part of the answer lies hidden in the architecture of the computer you run on. I tried running the same code another machine (GCC 4.8 on GNU+Linux, quad Core2 CPU), and over many runs, found a slightly odd thing: while the time for both loops varied, and OpenMP with many threads always ran faster, the second loop never ran significantly faster than the first, even without OpenMP.
The next step was to try to eliminate a dependency between the loops, allocating a second vector for the second loop. It still ran no faster than the first. So I tried reversing them, running the OpenMP loop after the serial one; and while it still ran fast when multithreaded, it would now see delays when the first loop didn't. It's looking more like an operating system behaviour at this point; long-lived threads simply seem more likely to get interrupted. I had taken some measures to reduce interruptions (niceness -15, specific cpu set) but this is not a system dedicated to benchmarking.
None of my results were anywhere near as extreme as yours, however. My first guess as to what caused your large difference was that you reused the same array and ran the parallel loop first. This would distribute the array into the caches of all cores, causing a slight dilemma of whether to migrate the thread to the data or the other way around; and OpenMP may have chosen any distribution, including iteration i to thread i%threads (as with schedule(static,1)), which would probably hurt multithreaded runtime, or one cache line each, which would hurt later single-threaded reading if it fit in per-core caches. However, all of the array accesses are writes, so the processor shouldn't need to wait for them in the first place.
In summary, your results are certainly platform dependent and unexpected. I would suggest rerunning the test with swapped order, the two loops operating on different arrays, and placed in different compilation units, and of course to verify the written results. It is possible you've found a flaw in your compiler.
I'm doing some time trials on my code, and logically it seems really easy to parallelize with OpenMP as each trial is independent of the others. As it stands, my code looks something like this:
for(int size = 30; size < 50; ++size) {
#pragma omp parallel for
for(int trial = 0; trial < 8; ++trial) {
time_t start, end;
//initializations
time(&start);
//perform computation
time(&end);
output << size << "\t" << difftime(end,start) << endl;
}
output << endl;
}
I have a sneaking suspicion that this is kind of a faux pas, however, as two threads may simultaneously write values to the output, thus screwing up the formatting. Is this a problem, and if so, will surrounding the output << size << ... code with a #pragma omp critical statement fix it?
Never mind whether your output will be screwed up (it likely will). Unless you're really careful to assign your OpenMP threads to different processors that don't share resources like memory bandwidth, your time trials aren't very meaningful either. Different runs will be interfering with each other.
The solution to the problem you're asking about is to write the result times into designated elements of an array, with one slot for each trial, and output the results after the fact.
As long as you don't mind the individual lines being out of order, you'll mostly be fine, but note that OpenMP itself does not guarantee that a whole line is printed atomically: each << is a separate call, so pieces of lines from different threads can interleave. Wrapping the output statement in #pragma omp critical, as you suggest, is the reliable fix.
Also note that in your code start and end are declared inside the loop body, so they are already private to each iteration; if you hoist them out of the loop, you will need to declare them private in the pragma, otherwise the threads will overwrite them and mess up your timings.