OpenMP in MEX: stack overflow error - C++

I have the following fragment of code, which gives me a stack overflow error:
#pragma omp parallel shared(Mo1, Mo2, sum_normalized_p_gn, Data, Mean_Out,Covar_Out,Prior_Out, det) private(i) num_threads( number_threads )
{
//every thread has a new copy
double* normalized_p_gn = (double*)malloc(NMIX*sizeof(double));
#pragma omp critical
{
int id = omp_get_thread_num();
int threads = omp_get_num_threads();
mexEvalString("drawnow");
}
#pragma omp for
//some parallel process.....
}
The variables listed in the shared clause are allocated with malloc, and they consume a large amount of memory.
I have two questions about the code above:
1) Why does this produce a stack overflow error (i.e. a segmentation fault) before execution reaches the parallel for loop? It works fine when run sequentially.
2) Is it correct to dynamically allocate memory for each thread, as done for "normalized_p_gn" above?
Regards
Edwin

In place of malloc, use mxMalloc in mex files (see here). Don't forget to mxFree when you're finished with the memory.

One possibility we can't exclude, since your code snippet doesn't reveal any sizes, is that you are simply trying to allocate too much memory when you run this in parallel. If you can confirm that this is not the problem, comment or edit your question and I'll take another look.
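Regarding the second question, here is a minimal sketch of per-thread scratch memory, assuming the buffer only has to live for the duration of the parallel region. The function name and the arithmetic are hypothetical stand-ins for the real computation, and the sketch deliberately makes no MEX API calls inside the region:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-thread work in the question: every
// thread owns a scratch buffer of `nmix` doubles (like normalized_p_gn).
double sum_with_thread_local_scratch(std::size_t nmix, int n_items)
{
    assert(nmix > 0);
    double total = 0.0;

    #pragma omp parallel
    {
        // Declared inside the region, so each thread gets its own heap
        // buffer; it is released automatically when the region ends.
        std::vector<double> scratch(nmix, 0.0);

        #pragma omp for reduction(+:total)
        for (int i = 0; i < n_items; ++i) {
            for (std::size_t k = 0; k < nmix; ++k) {
                scratch[k] = 1.0 / static_cast<double>(nmix);
            }
            double row_sum = 0.0;
            for (double v : scratch) {
                row_sum += v;
            }
            total += row_sum;  // each item contributes about 1.0
        }
    }
    return total;
}
```

Because the vector is declared inside the region, there is nothing to free by hand and no per-thread pointer to list in a private clause.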

Related

Is seq_cst needed for synchronization of OpenMP atomic updates?

I read here that sequential memory consistency (seq_cst) "might be needed" to make sure an atomic update is viewed by all threads consistently in an OpenMP parallel region.
Consider the following MWE, which is admittedly trivial and could be realized with a reduction rather than atomics, but which illustrates my question that arose in a more complex piece of code:
#include <iostream>
int main()
{
double a = 0;
#pragma omp parallel for
for (int i = 0; i < 10000000; ++i)
{
#pragma omp atomic
a += 5.5;
}
std::cout.precision(17);
std::cout << a << std::endl;
return 0;
}
I compiled this with g++ -fopenmp -O3 using GCC versions 6 to 12 on an Intel Core i9-9880H CPU, and then ran it using 4 or 8 threads, which always correctly prints:
55000000
When adding seq_cst to the atomic directive, the result is exactly the same. I would have expected the code without seq_cst to (occasionally) produce smaller results due to race conditions / outdated memory view. Is this hardware dependent? Is the code guaranteed to be free of race conditions even without seq_cst, and if so, why? Would the answer be different when using a compiler that was still based on OpenMP 3.1, as that apparently worked somewhat differently?
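For reference, the three variants discussed in the question could be written side by side as below. This is only a sketch of the syntax (the function names are made up, and the seq_cst clause assumes a compiler with OpenMP 4.0 or later); it is not evidence about whether seq_cst is required in more complex code:

```cpp
#include <cassert>

// Atomic update without an explicit memory-order clause.
double sum_atomic(int n)
{
    assert(n >= 0);
    double a = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic update
        a += 5.5;
    }
    return a;
}

// The same loop with the seq_cst clause added to the atomic directive.
double sum_atomic_seq_cst(int n)
{
    assert(n >= 0);
    double a = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic update seq_cst
        a += 5.5;
    }
    return a;
}

// The reduction alternative mentioned in the question.
double sum_reduction(int n)
{
    assert(n >= 0);
    double a = 0.0;
    #pragma omp parallel for reduction(+:a)
    for (int i = 0; i < n; ++i) {
        a += 5.5;
    }
    return a;
}
```

All partial sums here are multiples of 0.5 far below 2^53, so each variant can represent every intermediate value exactly regardless of the order in which threads add.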

Simple C++ Loop Not Benefitting from Multithreading

I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>
void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);
namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;
int main() {
int* x = new int[itemCount];
int* y = new int[itemCount];
// Initialize arrays
for (int i = 0; i < itemCount; i++) {
x[i] = 1;
y[i] = 2;
}
// Call add() on multiple threads
std::thread threads[threadCount];
startTimer();
for (int i = 0; i < threadCount; ++i) {
threads[i] = std::thread(add, x, y, i);
}
for (auto& thread : threads) {
thread.join();
}
stopTimer();
// Verify results
for (int i = 0; i < itemCount; ++i) {
if (y[i] != 3) {
std::cout << "Error!";
}
}
delete[] x;
delete[] y;
}
void add(int* x, int* y, int threadIdx) {
int firstIdx = threadIdx * itemsPerThread;
int lastIdx = firstIdx + itemsPerThread - 1;
for (int i = firstIdx; i <= lastIdx; ++i) {
y[i] = x[i] + y[i];
}
}
void startTimer() {
start = ch::steady_clock::now();
}
void stopTimer() {
auto end = ch::steady_clock::now();
auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
std::cout << duration << " ms\n";
}
You may simply be hitting the memory transfer rate of your machine: you are doing 8GB of reads and 4GB of writes.
On my machine your test completes in about 500ms, which is 24GB/s (similar to the results given by a memory bandwidth tester).
Since you hit each memory address with a single read and a single write, the caches aren't much use because you aren't reusing memory.
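The arithmetic behind that figure can be sketched as follows, assuming 4-byte int values and counting 2^30 bytes as one "GB" as the numbers above do. The function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Effective memory bandwidth in units of 2^30 bytes per second.
// accesses_per_item counts every read and write of one element.
double effective_bandwidth_gib_per_s(std::uint64_t items,
                                     std::uint64_t bytes_per_item,
                                     std::uint64_t accesses_per_item,
                                     double seconds)
{
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(items)
                       * static_cast<double>(bytes_per_item)
                       * static_cast<double>(accesses_per_item);
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return bytes / gib / seconds;
}
```

For the loop in the question, each of the 2^30 elements involves two reads (x[i] and y[i]) and one write (y[i]), so calling it with (1ull << 30, 4, 3, 0.5) reproduces the 24 quoted above.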
Your problem is not the processor. You are running up against RAM read and write latency. Your cache can hold a few megabytes of data, and you exceed that by far. Multi-threading is only useful as long as you can shovel data into your processor fast enough. The cache in your processor is incredibly fast compared to your RAM. Once you exceed the cache capacity, this turns into a RAM latency test.
If you want to see the advantages of multi-threading, choose data sizes in the range of your cache size.
EDIT
Another thing to do would be to create a higher workload for the cores, so that the memory latency goes unnoticed.
Side note: keep in mind that your core has several execution units, one or more for each type of operation (integer, float, shift and so on). That means one core can execute more than one instruction per step, in particular one operation per execution unit. You can keep the size of the test data and do more with it; be creative =) Filling the queue with integer operations only will give you an advantage in multi-threading. If you can vary when and where your code does different operations, do it; this will also have an impact on the speedup. Or avoid it, if you want to see a nice speedup from multi-threading.
To avoid any kind of optimization, you should use randomized test data, so that neither the compiler nor the processor itself can predict the outcome of your operation.
Also avoid branches like if and while. Each decision the processor has to predict and execute will slow you down and alter the result. With branch prediction, you will never get a deterministic result. Later, in a "real" program, be my guest and do what you want. But when you want to explore the multi-threading world, this could lead you to wrong conclusions.
BTW
Please use a delete for every new to avoid memory leaks. Even better, avoid plain pointers, new and delete. You should use RAII. I advise using std::array or std::vector, or simply any STL container. This will save you tons of debugging time and headaches.
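As a rough sketch of that advice, the add() step from the question could be written with std::vector and a container of std::thread objects. The names add_range and parallel_add are hypothetical, and unlike the original code the sketch also handles sizes that do not divide evenly by the thread count:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// y[i] += x[i] over [first, last); each thread gets a disjoint range.
void add_range(const std::vector<int>& x, std::vector<int>& y,
               std::size_t first, std::size_t last)
{
    for (std::size_t i = first; i < last; ++i) {
        y[i] += x[i];
    }
}

void parallel_add(const std::vector<int>& x, std::vector<int>& y,
                  unsigned thread_count)
{
    assert(x.size() == y.size() && thread_count > 0);
    const std::size_t n = x.size();
    // Round up so that the last thread picks up any remainder.
    const std::size_t chunk = (n + thread_count - 1) / thread_count;

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t first = std::min(n, t * chunk);
        const std::size_t last = std::min(n, first + chunk);
        threads.emplace_back(add_range, std::cref(x), std::ref(y),
                             first, last);
    }
    for (auto& th : threads) {
        th.join();
    }
}
```

The vectors free their storage when they go out of scope, so no delete[] is needed. This changes only the memory management; it does not remove the memory bandwidth limit discussed in the other answers.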
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
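Amdahl's law can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of workers. A small sketch (the function name is made up, and the example values are illustrative, not measurements of the code in the question):

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup with `workers` workers when only
// `parallel_fraction` of the run time can be parallelized.
double amdahl_speedup(double parallel_fraction, int workers)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(workers >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / workers);
}
```

For example, if 75% of the time were parallelizable, three workers would give at most a 2x speedup, and even an unlimited number of workers could not exceed 4x.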
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might also want to explore whether using all 8 hyperthreads is useful or whether it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.
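The roofline bound itself is simple: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the operations performed per byte moved. A sketch, where the function name and the peak value in the usage note are assumptions chosen for illustration:

```cpp
#include <algorithm>
#include <cassert>

// Roofline model: attainable throughput in Gops/s is limited either by
// the compute peak or by how fast memory can feed the operations.
double roofline_gops(double peak_gops, double bandwidth_gb_per_s,
                     double ops_per_item, double bytes_per_item)
{
    assert(bytes_per_item > 0.0);
    const double memory_bound =
        bandwidth_gb_per_s * ops_per_item / bytes_per_item;
    return std::min(peak_gops, memory_bound);
}
```

The loop in the question performs one addition per 12 bytes moved (two 4-byte reads and one 4-byte write). With the 24GB/s quoted in another answer and a hypothetical peak of 100 Gops/s, roofline_gops(100.0, 24.0, 1.0, 12.0) gives 2 Gops/s, so extra threads cannot help once that bandwidth is saturated.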

This code prints the value of x around 5000 but not 10000, why is that?

This code that I have written creates 2 threads and a for loop that iterates 10000 times but the value of x at the end comes out near 5000 instead of 10000, why is that happening?
#include<unistd.h>
#include<stdio.h>
#include<sys/time.h>
#include "omp.h"
using namespace std;
int x=0;
int main(){
omp_set_num_threads(2);
#pragma omp parallel for
for(int i= 0;i<10000;i++){
x+=1;
}
printf("x is: %d\n",x);
}
x is not an atomic type and is read and written in different threads. (Thinking that int is an atomic type is a common misconception.)
The behaviour of your program is therefore undefined.
Using std::atomic<int> x; is the fix.
The reason is that when multiple threads access the same variable, race conditions can occur.
The operation x+=1 can be understood as x = x + 1. So you first read the value of x and then write x + 1 to x. When two threads are running and operating on the same x, the following can happen: thread A reads the value of x, which is 0. Thread B reads the value of x, which is still 0. Then thread A writes 0+1 to x, and then thread B writes 0+1 to x. Now you have missed one increment and x is just 1 instead of 2. A fix for this problem might be to use an atomic_int.
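The interleaving described above can be replayed deterministically in a single thread, which makes the lost increment easy to see. This is only a simulation of one possible schedule (the function name is made up), not actual multi-threaded code:

```cpp
#include <cassert>

// Replays one unlucky interleaving of two "x += 1" operations.
int lost_update_interleaving()
{
    int x = 0;
    const int a_read = x;      // thread A reads 0
    const int b_read = x;      // thread B reads 0 as well
    assert(a_read == b_read);  // both work from the same stale value
    x = a_read + 1;            // thread A writes 1
    x = b_read + 1;            // thread B writes 1, overwriting A's update
    return x;                  // one increment has been lost
}
```

In the real program the scheduler decides how often such overlaps happen, which is why the final value varies from run to run.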
Modifying one (shared) value by multiple threads is a race condition and leads to wrong results. If multiple threads work with one value, all of them must only read the value.
The idiomatic solution is to use an OpenMP reduction as follows:
#pragma omp parallel for reduction(+:x)
for(int i= 0;i<10000;i++){
x+=1;
}
Internally, each thread has its own x, and they are added together after the loop.
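Conceptually, the reduction behaves roughly like the hand-written version below. This is a sketch of the idea, not of how any particular OpenMP runtime implements it, and the function name is hypothetical:

```cpp
#include <cassert>

// Hand-written equivalent of reduction(+:x): private partial sums that
// are combined once per thread after the loop.
int count_manual_reduction(int n)
{
    assert(n >= 0);
    int x = 0;
    #pragma omp parallel
    {
        int local = 0;  // private partial sum for this thread

        #pragma omp for
        for (int i = 0; i < n; ++i) {
            local += 1;
        }

        // Only one synchronized update per thread, not one per iteration.
        #pragma omp critical
        x += local;
    }
    return x;
}
```

The shared variable is touched once per thread instead of once per iteration, which is where the performance advantage over per-iteration atomics comes from.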
Using atomics is an alternative, but will perform significantly worse. Atomic operations are more costly in themselves and also very bad for caches.
If you use atomics, you should use OpenMP atomics which are applied to the operation, not the variable. I.e.
#pragma omp parallel for
for (int i= 0;i<10000;i++){
#pragma omp atomic
x+=1;
}
You should not, as other answers suggest, use C++11 atomics. Using them is explicitly unspecified behavior in OpenMP. See this question for details.

Fortran array values change if printed [duplicate]

I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable, because it exists before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
#pragma omp for lastprivate(i)
for (i=0; i<n-1; i++)
a[i] = b[i] + b[i+1];
}
a[i]=b[i];
So, in this example, I understand that lastprivate allows for i to be returned outside of the loop as the last value it was.
I just started learning OpenMP today.
private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>
int main (void)
{
int i = 10;
#pragma omp parallel private(i)
{
printf("thread %d: i = %d\n", omp_get_thread_num(), i);
i = 1000 + omp_get_thread_num();
}
printf("i = %d\n", i);
return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
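A small sketch of both directions of transfer follows. The function names are hypothetical, and the first function only reads its private copy so that the example does not depend on the number of threads:

```cpp
#include <cassert>

// firstprivate: every thread's private copy of i starts at the value i
// had before the region. Returns 1 if all threads observed 10.
int firstprivate_sees_outer_value()
{
    int i = 10;
    int all_ten = 1;
    #pragma omp parallel firstprivate(i) reduction(&&:all_ten)
    {
        all_ten = all_ten && (i == 10);
    }
    return all_ten;
}

// lastprivate: after the loop, `last` holds the value assigned in the
// sequentially last iteration, whichever thread executed it.
int lastprivate_value(int n)
{
    assert(n > 0);
    int last = -1;
    #pragma omp parallel for lastprivate(last)
    for (int k = 0; k < n; ++k) {
        last = 2 * k;
    }
    return last;
}
```

With n = 8 the second function returns 14, the value from iteration k = 7, no matter how the iterations were distributed.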
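You cannot use the local variable i before it is initialised: reading an uninitialised local variable is undefined behaviour in C++, although compilers typically only warn about it rather than reject the program.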

OpenMP odd behaviour with SIMD linear and parallel for linear directives

I am learning how to use OpenMP with C++ using GNU C compiler 6.2.1 and I tested the following code:
#include <stdio.h>
#include <omp.h>
#include <iostream>
int b=10;
int main()
{
int array[8];
std::cout << "Test with #pragma omp simd linear:\n";
#pragma omp simd linear(b)
for (int n=0;n<8;++n) array[n]=b;
for (int n=0;n<8;++n) printf("Iteration %d: %d\n", n, array[n]);
std::cout << "Test with #pragma omp parallel for linear:\n";
#pragma omp parallel for linear(b)
for (int n=0;n<8;++n) array[n]=b;
for (int n=0;n<8;++n) printf("Iteration %d: %d\n", n, array[n]);
}
In both cases I expected a list of numbers going from 10 to 17, however, this was not the case. The #pragma omp simd linear(b) is outright ignored, printing only 10 for each value in array. For #pragma omp parallel for linear(b) the program outputs 10,10,12,12,14,14,16,16.
I compile the file using g++ -fopenmp -Wall main.cpp -o main.o. How can I fix this?
EDIT: Reading the specification more carefully I found that the linear clause overwrites the initial value with the last value obtained (i.e. if we start with b=10 after the first cycle we have b=17).
However, the program runs correctly if I add schedule(dynamic) to the parallel for cycles. Why would I have to specify that parameter in order to have a correct execution?
The OpenMP specification says:
The linear clause declares one or more list items to be private and to
have a linear relationship with respect to the iteration space of a
loop associated with the construct on which the clause appears.
This only informs the compiler about the linear behavior of a variable in a loop, but in your code b is not incremented at all. That is why you always get 10 in the first loop. So the strange results are not the compiler's fault. To correct it you have to use
array[n]=b++;
On the other hand, for the #pragma omp parallel for linear(b) loop, OpenMP calculates the starting value of b for each thread (based on the linear relationship), but this value is still not incremented within a given thread. So, depending on the number of threads used, you will see a different number of "steps".
With the schedule(dynamic) clause, the chunk_size is 1, so each loop iteration is a separate chunk (and may run on a different thread). In this case the initial value of b is always calculated by OpenMP, so you get only correct values.
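Following the suggestion above, a version in which the loop body itself advances b by the declared step could look like this. The function name is hypothetical, and a local b is used so that the global from the question is not modified:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fills out[n] = start + n, with b declared linear and incremented by
// exactly the declared step (1) in every iteration.
std::vector<int> fill_linear(int start, int count)
{
    assert(count >= 0);
    std::vector<int> out(static_cast<std::size_t>(count));
    int b = start;
    #pragma omp simd linear(b:1)
    for (int n = 0; n < count; ++n) {
        out[n] = b;
        ++b;  // keeps the code consistent with the linear(b:1) promise
    }
    return out;
}
```

Because the increment in the body matches the declared step, the sequential meaning of the loop and the meaning promised to the compiler agree, so fill_linear(10, 8) yields the values 10 to 17.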