Every time I execute the cal() function with the same parameter I get different output. The function g() always calculates the same result for the same input. Are the threads overwriting some variable?
void cal(uint_fast64_t n) {
    Bint num = N(n);
    Bint total = 0, i, max_size(__UINT64_MAX__);
    for(i = 1; i <= num; i += max_size){
        #pragma omp parallel shared(i,num,total)
        {
            int id = omp_get_thread_num();
            int numthreads = omp_get_num_threads();
            Bint sum(0), k;
            for(uint64_t j = id; (j < __UINT64_MAX__); j += numthreads){
                k = i+j;
                if(k > num){
                    i = k;
                    break;
                }
                sum = sum + g(k);
            }
            #pragma omp critical
            total += sum;
        }
    }
    std::cout << total << std::endl;
}
if(k > num){
    i = k;
    break;
}
Here you modify the shared variable i (possibly from several threads at once) while other threads may be reading it (in k = i+j), all without synchronization. This is a data race, and your code therefore has undefined behavior.
The value of j depends on the value of id. If different threads are used to do the math, you'll get different results.
int id = omp_get_thread_num();                               // <---
int numthreads = omp_get_num_threads();
Bint sum(0), k;
for(uint64_t j = id; (j < __UINT64_MAX__); j += numthreads){ // <---
    k = i+j;
    if(k > num){
        i = k;                                               // <---
        break;
    }
    sum = sum + g(k);
}
Further, you change i to k when k > num. This can happen much sooner or much later depending on which thread is picked up first to run the inner loop.
You may want to look at this question and answer: Does an OpenMP ordered for always assign parts of the loop to threads in order, too?
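One way to remove the race (a sketch only, assuming Bint supports the arithmetic and comparison operators used here) is to never write to i inside the parallel region: each thread only reads i to derive its own sub-range and simply stops when it runs past num, while i is advanced solely by the sequential outer loop:
void cal(uint_fast64_t n) {
    Bint num = N(n);
    Bint total = 0, max_size(__UINT64_MAX__);
    for (Bint i = 1; i <= num; i += max_size) {      // i is modified only here, sequentially
        #pragma omp parallel shared(i, num, total)
        {
            int id = omp_get_thread_num();
            int numthreads = omp_get_num_threads();
            Bint sum(0), k;
            for (uint64_t j = id; j < __UINT64_MAX__; j += numthreads) {
                k = i + j;                           // i is only read, never written
                if (k > num)
                    break;                           // just stop; don't touch i
                sum = sum + g(k);
            }
            #pragma omp critical
            total += sum;
        }
    }
    std::cout << total << std::endl;
}
The outer loop still terminates correctly, because each chunk covers exactly max_size candidates and i eventually exceeds num on its own.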
How do I get a better optimization for this piece of code using OpenMP?
The number of threads is 6, but I can't get better performance. I have tried different scheduling options, but I can't get it to optimize any better. Is there a way of getting a better result?
int length = 40000;
int idx;
long *result = new long[ length ];
#pragma omp parallel for private(idx) schedule(dynamic)
for ( int i = 0; i < length; i++ ) {
    for ( int j = 0; j < i; j++ ) {
        idx = (int)( someCalculations( i, j ) );
        #pragma omp atomic
        result[ idx ] += 1;
    }
}
This piece of code does improve the calculation time, but I still need a better result.
Thanks in advance.
Since OpenMP 4.0 you can write your own reduction.
The idea is:
- in the for loop, you tell the compiler to reduce the array you modify in each iteration;
- since OpenMP doesn't know how to reduce such an array, you write your own adder my_add, which simply sums two arrays element by element;
- you tell OpenMP how to use it in your reduction declaration (myred).
#include <stdio.h>
#include <stdlib.h>

#define LEN 40000

int someCalculations(int i, int j)
{
    return i * j % 40000;
}

/* simple adder: sums y into x, element by element */
long *my_add(long *x, long *y)
{
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < LEN; ++i)
    {
        x[i] += y[i];
    }
    free(y);
    return x;
}

/* reduction declaration:
   name
   type
   operation to be performed
   initializer */
#pragma omp declare reduction(myred: long*:omp_out=my_add(omp_out,omp_in))\
    initializer(omp_priv=calloc(LEN, sizeof(long)))

int main(void)
{
    int i, j;
    long *result = calloc(LEN, sizeof *result);

    // tell omp how to use it
    #pragma omp parallel for reduction(myred:result) private(i, j)
    for (i = 0; i < LEN; i++) {
        for (j = 0; j < i; j++) {
            int idx = someCalculations(i, j);
            result[idx] += 1;
        }
    }

    // simple display; I store it in a file and compare
    // result files with/without OpenMP to be sure it's correct...
    for (i = 0; i < LEN; ++i) {
        printf("%ld\n", result[i]);
    }
    return 0;
}
Without -fopenmp: real 0m3.727s
With -fopenmp: real 0m0.835s
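As a side note (not part of the timing comparison above): if your compiler supports OpenMP 4.5 or newer, you can also reduce directly over an array section and skip the user-defined reduction entirely. A minimal sketch of the same loop:
/* OpenMP 4.5+ array-section reduction: each thread gets a private
   LEN-element copy of result, and the copies are summed at the end. */
#pragma omp parallel for reduction(+:result[0:LEN]) private(j)
for (i = 0; i < LEN; i++) {
    for (j = 0; j < i; j++) {
        result[someCalculations(i, j)] += 1;
    }
}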
Hello, I'm having a hard time with this program. I'm supposed to go through the whole data vector sequentially and sum up each of the vectors inside it in parallel using OpenMP (storing each sum in solution[i]). But the program gets stuck for some reason. The input vectors I'm given aren't many, but they are very large (around 2.5M ints each). Any idea what I am doing wrong?
Here is the code (PS: ignore the unused minVectorSize parameter):
void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data, vector<long> &solution, unsigned long minVectorSize) {
    unsigned long vectorNum = data.size();
    for (int i = 0; i < vectorNum; i++) {
        #pragma omp parallel
        {
            unsigned long sum = 0;
            int thread = omp_get_thread_num();
            int threadnum = omp_get_num_threads();
            int begin = thread * data[i].size() / threadnum;
            int end = ((thread + 1) * data[i].size() / threadnum) - 1;
            for (int j = begin; j <= end; j++) {
                sum += data[i][j];
            }
            #pragma omp critical
            {
                solution[i] += sum;
            }
        }
    }
}
void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data, vector<long> &solution, unsigned long minVectorSize) {
    unsigned long vectorNum = data.size();
    for (int i = 0; i < vectorNum; i++) {
        unsigned long sum = 0;
        int begin = 0;
        int end = data[i].size();
        #pragma omp parallel for reduction(+:sum)
        for (int j = begin; j < end; j++) {
            sum += data[i][j];
        }
        solution[i] += sum;
    }
}
Something like this should be more elegant and work better. Could you compile it and comment on whether or not it works for you?
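If creating and tearing down a parallel region for every vector turns out to be costly, one possible refinement (a sketch along the same lines, not tested on your data) keeps a single thread team alive and shares out each inner loop inside it:
#include <cstdint>
#include <vector>
using namespace std;  // to match the vector<vector<int8_t>> signature used above

void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data,
                                  vector<long> &solution,
                                  unsigned long /*minVectorSize*/) {
    const size_t vectorNum = data.size();
    #pragma omp parallel
    {
        for (size_t i = 0; i < vectorNum; i++) {    // every thread walks all vectors
            long sum = 0;                           // per-thread partial sum
            #pragma omp for                         // split this vector across the team
            for (size_t j = 0; j < data[i].size(); j++) {
                sum += data[i][j];
            }
            #pragma omp atomic                      // combine the partial sums
            solution[i] += sum;
        }
    }
}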
I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on Mac (10.12.3, Core i7 2.7 GHz).
In Xcode I made a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found KMP_VERSION_MAJOR=5, KMP_VERSION_MINOR=0 and KMP_VERSION_BUILD=20150701, libiomp5) on macOS 10.12.3 (CPU: Core i7 2.7 GHz).
For Windows I use VS2010 SP1. Additionally I set C/C++ -> Optimization -> Optimization = Maximize Speed (/O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (/Ot).
If I run the application in a single thread, the time difference roughly corresponds to the frequency ratio of the processors. But if I run 4 threads, the difference becomes dramatic: the Windows program is about 70 times faster than the Mac program.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>

static double ActionWithNumber(double number)
{
    double sum = 0.0f;
    for (std::uint32_t i = 0; i < 50; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    return sum;
}

static double TestOpenMP(void)
{
    const std::uint32_t len = 4000000;
    double *a;
    double *b;
    double *c;
    double sum = 0.0;
    std::mutex _mutex;

    a = new double[len];
    b = new double[len];
    c = new double[len];
    for (std::uint32_t i = 0; i < len; i++)
    {
        c[i] = 0.0;
        a[i] = sin((double)i);
        b[i] = cos((double)i);
    }

    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();

    double k = 2.0;
    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < len; i++)
    {
        c[i] = k*a[i] + b[i] + k;
        if (c[i] > 0.0)
        {
            c[i] += ActionWithNumber(c[i]);
        }
        else
        {
            c[i] -= ActionWithNumber(c[i]);
        }
        std::lock_guard<std::mutex> scoped(_mutex);
        sum += c[i];
    }

    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    double sum2 = 0.0;
    for (std::uint32_t i = 0; i < len; i++)
    {
        sum2 += c[i];
        c[i] /= sum2;
    }
    if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");

    delete[] a;
    delete[] b;
    delete[] c;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const std::uint32_t steps = 5;
    for (std::uint32_t i = 0; i < steps; i++)
    {
        sum += TestOpenMP();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
I specifically use a mutex here to compare the performance of OpenMP on the Mac and on Windows. On Windows the function returns a time of 0.39 seconds. On the Mac the function returns a time of 25 seconds, i.e. about 70 times slower.
What is the cause of this difference?
First of all, thanks for editing my post (I use a translator to write the text).
In the real app, I update the values in a huge matrix (20000×20000) in random order. Each thread determines the new value and writes it into a particular cell. I create a mutex for each row, since in most cases different threads write to different rows. But apparently, when two threads write to the same row, there is a long lock. At the moment I can't divide the rows between threads, since the order of writes is determined by the FEM elements.
So simply putting a critical section there is not an option, as it would block writes to the entire matrix.
I wrote code similar to what is in the real application.
static double ActionWithNumber(double number)
{
    const unsigned int steps = 5000;
    double sum = 0.0f;
    for (unsigned int i = 0; i < steps; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    sum /= (double)steps;
    return sum;
}

static double RealAppTest(void)
{
    const unsigned int elementsNum = 10000;
    double* matrix;
    unsigned int* elements;
    boost::mutex* mutexes;

    elements = new unsigned int[elementsNum*3];
    matrix = new double[elementsNum*elementsNum];
    mutexes = new boost::mutex[elementsNum];

    for (unsigned int i = 0; i < elementsNum; i++)
        for (unsigned int j = 0; j < elementsNum; j++)
            matrix[i*elementsNum + j] = (double)(rand() % 100);

    for (unsigned int i = 0; i < elementsNum; i++) // build FEM element like Triangle
    {
        elements[3*i]   = rand()%(elementsNum-1);
        elements[3*i+1] = rand()%(elementsNum-1);
        elements[3*i+2] = rand()%(elementsNum-1);
    }

    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();

    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < elementsNum; i++)
    {
        unsigned int* elems = &elements[3*i];
        for (unsigned int j = 0; j < 3; j++)
        {
            // in here set mutex for row with index = elems[j];
            boost::lock_guard<boost::mutex> lockup(mutexes[i]);
            double res = 0.0;
            for (unsigned int k = 0; k < 3; k++)
            {
                res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
            }
            for (unsigned int k = 0; k < 3; k++)
            {
                matrix[elems[j]*elementsNum + elems[k]] = res;
            }
        }
    }

    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    delete[] elements;
    delete[] matrix;
    delete[] mutexes;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const unsigned int steps = 5;
    for (unsigned int i = 0; i < steps; i++)
    {
        sum += RealAppTest();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has its own runtime system, and a manually created POSIX-style mutex via std::mutex. It's probably not surprising that there are some interoperability hiccups with some compiler/OS combinations.
My guess here is that in the slow case, the OpenMP runtime is going overboard to make sure that there are no interactions between higher-level ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    #pragma omp critical
    sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    omp_set_lock(&sumlock);
    sum += c[i];
    omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
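For this particular pattern, a plain global sum, you can also avoid the lock altogether with an OpenMP reduction. A minimal sketch of the same timed loop (same variables as in the code above):
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < len; i++)
{
    c[i] = k*a[i] + b[i] + k;
    if (c[i] > 0.0)
        c[i] += ActionWithNumber(c[i]);
    else
        c[i] -= ActionWithNumber(c[i]);
    sum += c[i];  // each thread accumulates a private copy; OpenMP combines them at the end
}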
Updated: there's no problem with using an array of OpenMP locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx = 0; idx < elementsNum; idx++)
    omp_init_lock(&(sumlocks[idx]));

//...
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
    unsigned int* elems = &elements[3*i];
    for (unsigned int j = 0; j < 3; j++)
    {
        // in here set mutex for row with index = elems[j];
        double res = 0.0;
        for (unsigned int k = 0; k < 3; k++)
        {
            res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
        }
        omp_set_lock(&(sumlocks[i]));
        for (unsigned int k = 0; k < 3; k++)
        {
            matrix[elems[j]*elementsNum + elems[k]] = res;
        }
        omp_unset_lock(&(sumlocks[i]));
    }
}
for (unsigned idx = 0; idx < elementsNum; idx++)
    omp_destroy_lock(&(sumlocks[idx]));
I have the simple problem of comparing all elements to each other. The comparison itself is symmetric; therefore, it doesn't have to be done twice.
The following code example shows what I am looking for by showing the indices of the accessed elements:
int n = 5;
for (int i = 0; i < n; i++)
{
    for (int j = i + 1; j < n; j++)
    {
        printf("%d %d\n", i, j);
    }
}
The output is:
0 1
0 2
0 3
0 4
1 2
1 3
1 4
2 3
2 4
3 4
So each pair of elements is compared exactly once. When I want to parallelize this code I have two problems: I have to stick to dynamic scheduling because the calculation time of each iteration varies to a huge extent, AND I cannot use collapse because the bounds of the nested loop depend on the index of the outer loop.
Using #pragma omp parallel for schedule(dynamic, 3) for the outer loop may lead to single-core execution at the end, whereas using it for the inner loop may lead to such executions within each iteration of the outer loop.
Is there a more sophisticated way of doing/parallelizing that?
I haven't thought it through thoroughly, but you could also try an approach like this:
int total = n * (n-1) / 2; // total number of combinations

#pragma omp parallel for
for (int k = 0; k < total; ++k) {
    int i = first(k, n);
    int j = second(k, n, i);
    printf("%d %d\n", i, j);
}

int first(int k, int n) {
    int i = 0;
    for (; k >= n - 1; ++i) {
        k -= n - 1;
        n -= 1;
    }
    return i;
}

int second(int k, int n, int i) {
    int t = i * (2*n - i - 1) / 2;
    return (t == 0 ? k + i + 1 : (k % t) + i + 1);
}
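If you want to convince yourself that the mapping is correct before relying on it, a quick sequential check (a sketch; it assumes the first and second functions above are visible through prototypes) can enumerate k for a small n and verify that every pair 0 <= i < j < n shows up exactly once:
#include <stdio.h>

int first(int k, int n);
int second(int k, int n, int i);

int main(void)
{
    const int n = 5;
    int total = n * (n - 1) / 2;
    int seen[5][5] = {{0}};          // seen[i][j] counts how often pair (i, j) appears

    for (int k = 0; k < total; ++k) {
        int i = first(k, n);
        int j = second(k, n, i);
        if (i < 0 || j <= i || j >= n || seen[i][j]++)
            printf("bad mapping at k=%d -> (%d,%d)\n", k, i, j);
    }
    printf("checked %d pairs\n", total);
    return 0;
}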
Indeed, the OpenMP standard says about the collapse clause that: "The iteration count for each associated loop is computed before entry to the outermost loop. If execution of any associated loop changes any of the values used to compute any of the iteration counts, then the behavior is unspecified."
So you cannot collapse your loops, which would have been the easiest way.
However, since you're not particularly interested in the order in which the pairs of indexes are computed, you can change your loops a bit as follows:
for ( int i = 0; i < n; i++ ) {
    for ( int j = 0; j < n / 2; j++ ) {
        int ii, jj;
        if ( j < i ) {
            ii = n - 1 - i;
            jj = n - 1 - j;
        }
        else {
            ii = i;
            jj = j + 1;
        }
        printf( "%d %d\n", ii, jj );
    }
}
This should give you all the pairs you want, in a somewhat mangled order, but with fixed iteration limits which allow for balanced parallelisation, and even loop collapsing if you want. The only catch: if n is even, the column corresponding to n/2 will be visited twice, so either you live with that or you slightly modify the algorithm to avoid it...
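Putting this together with the scheduling requirement from the question, the remapped loops can then be parallelized and collapsed directly. A minimal sketch (schedule(dynamic) is optional, added only because the per-pair work is said to vary a lot):
#pragma omp parallel for collapse(2) schedule(dynamic)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n / 2; j++) {
        int ii, jj;
        if (j < i) {          /* mirror this part of the iteration space */
            ii = n - 1 - i;
            jj = n - 1 - j;
        } else {
            ii = i;
            jj = j + 1;
        }
        printf("%d %d\n", ii, jj);
    }
}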
I have previously had good results with the following:
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        if (j <= i)
            continue;
        printf("%d %d\n", i, j);
    }
}
Do remember that printf does not represent any real parallel workload, so it would be best if you profiled this on your specific work. You could try adding schedule(dynamic, 10) or something greater than 10, depending on how many iterations you're performing.
I'm wondering if it is feasible to make this loop parallel using OpenMP.
Of course there is the issue of race conditions. I'm unsure how to deal with the n in the inner loop being generated by the outer loop, and the race condition on D where D=D+A[n]. Do you think it is practical to try to make this parallel?
for (n = 0; n < 10000000; ++n) {
    for (n2 = 0; n2 < 100; ++n2) {
        A[n] = A[n] + B[n2][n + C[n2] + 200];
    }
    D = D + A[n];
}
Yes, this is indeed parallelizable assuming none of the pointers are aliased.
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n = 0; n < 10000000; ++n) {
    for (n2 = 0; n2 < 100; ++n2) {
        A[n] = A[n] + B[n2][n + C[n2] + 200];
    }
    D += A[n];
}
It could actually be optimized somewhat as follows:
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n = 0; n < 10000000; ++n) {
    int tmp = A[n];
    for (n2 = 0; n2 < 100; ++n2) {
        tmp += B[n2][n + C[n2] + 200];
    }
    A[n] = tmp;
    D += tmp;
}