The goal is to add OpenMP parallelization to the for (i = 0; i < n; i++) loop of a lower-triangular solver for systems of the form Ax=b. The expected result must be exactly the same as the result obtained when NO parallelization is added to for (i = 0; i < n; i++).
vector<vector<double>> represents a 2-D matrix. makeMatrix(int m, int n) initializes a vector<vector<double>> of all zeroes of size mxn.
Two of the most prominent attempts have been left in the comments.
vector<vector<double>> lowerTriangleSolver(vector<vector<double>> A, vector<vector<double>> b)
{
vector<vector<double>> x = makeMatrix(A.size(), 1);
int i, j;
int n = A.size();
double s;
//#pragma omp parallel for reduction(+: s)
//#pragma omp parallel for shared(s)
for (i = 0; i < n; i++)
{
s = 0.0;
#pragma omp parallel for
for (j = 0; j < i; j++)
{
s = s + A[i][j] * x[j][0];
}
x[i][0] = (b[i][0] - s) / A[i][i];
}
return x;
}
You could try to assign the outer loop iterations among threads, instead of the inner loop. In this way, you increase the granularity of the parallel tasks and avoid the reduction of the 's' variable.
#pragma omp parallel for
for (int i = 0; i < n; i++){
double s = 0.0;
for (int j = 0; j < i; j++){
s = s + A[i][j] * x[j][0];
}
x[i][0] = (b[i][0] - s) / A[i][i];
}
Unfortunately, that is not possible because there is a dependency between s = s + A[i][j] * x[j][0]; and x[i][0] = (b[i][0] - s) / A[i][i];. More precisely, x[i][0] depends on every x[j][0] with j < i, and those values are computed by earlier iterations of the outer loop.
So you can try two approaches:
for (int i = 0; i < n; i++){
double s = 0.0;
#pragma omp parallel for reduction(+:s)
for (int j = 0; j < i; j++){
s = s + A[i][j] * x[j][0];
}
x[i][0] = (b[i][0] - s) / A[i][i];
}
or using SIMD:
for (int i = 0; i < n; i++){
double s = 0.0;
#pragma omp simd reduction(+:s)
for (int j = 0; j < i; j++){
s = s + A[i][j] * x[j][0];
}
x[i][0] = (b[i][0] - s) / A[i][i];
}
I was getting the error: "free(): corrupted unsorted chunks" when trying to run:
#pragma omp parallel for reduction(+:save) shared(save2)
for (size_t i = 0; i <= N; ++i) {
vector<float> dist = cdist(i, arestas);
vector<float> distinv(dist.size());
for (size_t j = 0; j < N(); ++j) {
if (arr[j] > 0)
arrv[j] = (1/N) + (1 / arr[j]);
else
arrv[j] = 0;
}
save = accumulate(arrv.begin(), arrv.end(), 0.0);
vector<double>::iterator iter = save2.begin() + i;
save2.insert(iter, sum);
}
I might miss the point here, but what about just doing it this way (not tested)?
vector<double> sum2(N);
#pragma omp parallel for num_threads(8)
for ( size_t i = 0; i < N; i++ ) {
double sum = 0;
for ( size_t j = 0; j < dist.size(); ++j ) {
if ( dist[j] > 0 ) {
sum += 1. / dist[j];
}
}
sum2[i] = sum;
}
There is still some room for improving this version (for example, by removing the if statement to help vectorization), but unless your code has some unexplained constraints, I think this version is a good starting point.
I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on Mac (10.12.3, Core i7 2.7 GHz).
In Xcode I made a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found the defines KMP_VERSION_MAJOR=5, KMP_VERSION_MINOR=0 and KMP_VERSION_BUILD=20150701, libiomp5) on macOS 10.12.3 (CPU: Core i7 2.7 GHz).
On Windows I use VS2010 SP1. Additionally, I set C/C++ -> Optimization -> Optimization = Maximize Speed (O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (Ot).
If I run the application in a single thread, the time difference roughly corresponds to the ratio of the processor frequencies. But with 4 threads the difference becomes substantial: the Windows program is ~70 times faster than the Mac program.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>
static double ActionWithNumber(double number)
{
double sum = 0.0f;
for (std::uint32_t i = 0; i < 50; i++)
{
double coeff = sqrt(pow(std::abs(number), 0.1));
double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
sum += sqrt(res);
}
return sum;
}
static double TestOpenMP(void)
{
const std::uint32_t len = 4000000;
double *a;
double *b;
double *c;
double sum = 0.0;
std::mutex _mutex;
a = new double[len];
b = new double[len];
c = new double[len];
for (std::uint32_t i = 0; i < len; i++)
{
c[i] = 0.0;
a[i] = sin((double)i);
b[i] = cos((double)i);
}
boost::chrono::time_point<boost::chrono::system_clock> start, end;
start = boost::chrono::system_clock::now();
double k = 2.0;
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
c[i] = k*a[i] + b[i] + k;
if (c[i] > 0.0)
{
c[i] += ActionWithNumber(c[i]);
}
else
{
c[i] -= ActionWithNumber(c[i]);
}
std::lock_guard<std::mutex> scoped(_mutex);
sum += c[i];
}
end = boost::chrono::system_clock::now();
boost::chrono::duration<double> elapsed_time = end - start;
double sum2 = 0.0;
for (std::uint32_t i = 0; i < len; i++)
{
sum2 += c[i];
c[i] /= sum2;
}
if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");
delete[] a;
delete[] b;
delete[] c;
return elapsed_time.count();
}
int main()
{
double sum = 0.0;
const std::uint32_t steps = 5;
for (std::uint32_t i = 0; i < steps; i++)
{
sum += TestOpenMP();
}
sum /= (double)steps;
std::cout << "Elapsed time = " << sum;
return 0;
}
I deliberately use a mutex here to compare the performance of OpenMP on Mac and Windows. On Windows the function returns a time of 0.39 seconds. On the Mac it returns 25 seconds, i.e. roughly 70 times slower.
What is the cause of this difference?
First of all, thanks for editing my post (I use a translator to write the text).
In the real app, I update the values in a huge matrix (20000x20000) in random order. Each thread determines the new value and writes it to a particular cell. I create a mutex for each row, since in most cases different threads write to different rows. But apparently, when 2 threads write to the same row, there is a long lock. At the moment I can't assign the rows to different threads, since the order of the writes is determined by the FEM elements.
So simply putting a critical section there is not an option, as it would block writes to the entire matrix.
I wrote code that resembles the real application.
static double ActionWithNumber(double number)
{
const unsigned int steps = 5000;
double sum = 0.0f;
for (u32 i = 0; i < steps; i++)
{
double coeff = sqrt(pow(abs(number), 0.1));
double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
sum += sqrt(res);
}
sum /= (double)steps;
return sum;
}
static double RealAppTest(void)
{
const unsigned int elementsNum = 10000;
double* matrix;
unsigned int* elements;
boost::mutex* mutexes;
elements = new unsigned int[elementsNum*3];
matrix = new double[elementsNum*elementsNum];
mutexes = new boost::mutex[elementsNum];
for (unsigned int i = 0; i < elementsNum; i++)
for (unsigned int j = 0; j < elementsNum; j++)
matrix[i*elementsNum + j] = (double)(rand() % 100);
for (unsigned int i = 0; i < elementsNum; i++) //build FEM element like Triangle
{
elements[3*i] = rand()%(elementsNum-1);
elements[3*i+1] = rand()%(elementsNum-1);
elements[3*i+2] = rand()%(elementsNum-1);
}
boost::chrono::time_point<boost::chrono::system_clock> start, end;
start = boost::chrono::system_clock::now();
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
unsigned int* elems = &elements[3*i];
for (unsigned int j = 0; j < 3; j++)
{
//lock the mutex for the row with index = elems[j]
boost::lock_guard<boost::mutex> lockup(mutexes[elems[j]]);
double res = 0.0;
for (unsigned int k = 0; k < 3; k++)
{
res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
}
for (unsigned int k = 0; k < 3; k++)
{
matrix[elems[j]*elementsNum + elems[k]] = res;
}
}
}
end = boost::chrono::system_clock::now();
boost::chrono::duration<double> elapsed_time = end - start;
delete[] elements;
delete[] matrix;
delete[] mutexes;
return elapsed_time.count();
}
int main()
{
double sum = 0.0;
const u32 steps = 5;
for (u32 i = 0; i < steps; i++)
{
sum += RealAppTest();
}
sum /= (double)steps;
std::cout<<"Elapsed time = " << sum;
return 0;
}
You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has a runtime system, and a manually created POSIX mutex via std::mutex. It's probably not surprising that there are some interoperability hiccups with some compiler/OS combinations.
My guess is that in the slow case, the OpenMP runtime is going overboard to make sure that there are no interactions between higher-level ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
//...
// replacing this: std::lock_guard<std::mutex> scoped(_mutex);
#pragma omp critical
sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
//...
// replacing this: std::lock_guard<std::mutex> scoped(_mutex);
omp_set_lock(&sumlock);
sum += c[i];
omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
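For the scalar sum in the first test program, a reduction clause is another option worth trying: each thread accumulates into a private copy and OpenMP combines the copies at the end, so no lock is taken inside the loop. This is a minimal sketch with a hypothetical helper sumWithReduction. It has not been timed on the machines discussed here, and it does not cover the per-row matrix updates, which still need some form of locking.

```cpp
#include <vector>

// Hypothetical helper: sums the elements of c without any lock in the loop body.
double sumWithReduction(const std::vector<double>& c)
{
    const int len = static_cast<int>(c.size());
    double sum = 0.0;

    // Each thread gets its own private "sum"; the partial sums are added
    // together when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < len; i++) {
        sum += c[i];
    }
    return sum;
}
```

Because the partial sums are combined in an unspecified order, the floating-point result can differ slightly from a strictly sequential sum.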
Updated: There's no problem with using an array of openmp locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx=0; idx<elementsNum; idx++)
omp_init_lock(&(sumlocks[idx]));
//...
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
unsigned int* elems = &elements[3*i];
for (unsigned int j = 0; j < 3; j++)
{
//lock the row with index = elems[j] while it is being written
double res = 0.0;
for (unsigned int k = 0; k < 3; k++)
{
res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
}
omp_set_lock(&(sumlocks[elems[j]]));
for (unsigned int k = 0; k < 3; k++)
{
matrix[elems[j]*elementsNum + elems[k]] = res;
}
omp_unset_lock(&(sumlocks[elems[j]]));
}
}
for (unsigned idx=0; idx<elementsNum; idx++)
omp_destroy_lock(&(sumlocks[idx]));
Is another for loop allowed in the counter section (third part) of a for loop? In my attempt to get elegant in writing code to produce a right triangle, I wrote this but it wouldn't compile:
#include <stdio.h>
int main()
{
int i, j, N = 5;
for (i = 1;
i <= N;
(for (j = 1; j <= i; j++, printf("%c", '0'));), i++)
printf("\n");
}
return 0;
}
No, only expressions or declarations are allowed there.
EDIT: I am sorry. I thought you were talking about the condition part of the loop. In the expression part of the loop, only expressions are allowed.
In C++, you could use a lambda expression that contains this for loop. For example
for ( i = 1;
i <= N;
[]( int i ) { for ( int j = 1; j <= i; j++, printf("%c", '0' ) ); }( i ), i++)
Here is a demonstrative example
#include <iostream>
int main()
{
int N = 10;
for ( int i = 1;
i < N;
[]( int i )
{
for ( int j = 1; j < i; j++, ( std::cout << '*' ) );
}( i ), i++ )
{
std::cout << std::endl;
}
return 0;
}
The output is
*
**
***
****
*****
******
*******
********
Or you could define the lambda expression outside the outer loop to make the program more readable. For example
#include <iostream>
int main()
{
int N = 10;
auto inner_loop = []( int i )
{
for ( int j = 1; j < i; j++, ( std::cout << '*' ) );
};
for ( int i = 1; i < N; inner_loop( i ), i++ )
{
std::cout << std::endl;
}
return 0;
}
Take into account that, in general, the nested loops shown in other posts cannot substitute for the loop with the lambda expression. For example, the outer loop can contain continue statements that would skip the inner loop. So if you need the inner loop to be executed in any case, regardless of the continue statements, then this construction with the lambda expression will be helpful. :)
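The point about continue can be checked with a small sketch. The helper trianglesWithContinue is hypothetical and collects its output in a std::string instead of printing, purely so the result is easy to inspect. The body skips even values of i with continue, yet the lambda in the increment part still runs for every i.

```cpp
#include <string>

// Hypothetical helper: builds the rows in a string instead of printing them.
std::string trianglesWithContinue(int n)
{
    std::string out;

    // The "inner loop", wrapped in a lambda so it can be used as an expression.
    auto inner_loop = [&out](int i) {
        for (int j = 0; j < i; ++j) {
            out += '*';
        }
    };

    // The increment part runs after every iteration, including those that
    // leave the body early through continue.
    for (int i = 1; i <= n; inner_loop(i), out += '\n', ++i) {
        if (i % 2 == 0) {
            continue;       // skips only the rest of the body
        }
        out += '#';         // body work, done for odd i only
    }
    return out;
}
```

With n = 3 this yields the rows "#*", "**" and "#***": the second row has no '#' because the body was skipped, but its stars are still there.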
There is no need to do so. Because a for loop can easily be replaced with a while loop, every part of a for loop can be moved to another place where complex constructions are possible. In your case, you can just change the loop to the following:
for (i = 1; i <= N; i++) {
printf("\n");
for (j = 1; j <= i; j++) {
printf("%c", '0');
}
}
However, if you really have to place a complex action there, you may use a GCC extension (a compound statement enclosed in parentheses):
for (i = 1;
i <= N;
({for (j = 1; j <= i; j++) putchar('0'); }), i++) {
printf("\n");
}
In the counter section of a for() loop, expressions are allowed, but statements are not.
A for() loop in C/C++ is a statement, not an expression.
However, you can nest several for() loops if you want.
For example, since you want a new loop in the counter section, you need to run that loop at the end of each iteration of the main for() loop.
This is the scheme:
for (int i = 0; i < i_max; i++) {
// stuff...
for (int j = 0; j < j_max; j++) {
// stuff..
}
}
You can't do that because the condition and increment parts of a for can only contain expressions, whereas a for loop is an iteration statement.
Simply nest the loops like sane programmers do:
#include <stdio.h>
int main()
{
int N = 5;
for (int i = 1; i <= N; i++) {
for (int j = 1; j <= i; j++)
printf("0");
printf("\n");
}
}
If you're not feeling well, though, you could use a lambda:
#include <stdio.h>
int main()
{
int N = 5;
for (
int i = 1;
i <= N;
[=](){ for (int j = 1; j <= i; j++) printf("0"); }(), printf("\n"), i++
) ;
}
Shortest-possible solution:
main(i){for(i=1;i<11;printf("%0*d\n",i++,0));}
Output:
0
00
000
0000
00000
000000
0000000
00000000
000000000
0000000000
Elegance comes with clarity.
When I want to create a string of characters, I construct a C++ object called std::string.
#include <iostream>
#include <string>
int main()
{
char c = '0';
const int n = 5;
for (int i = 1; i <= n; ++i)
{
std::cout << std::string(i, c) << '\n';
}
}
So there is no need for a nested for-loop in this particular case.
Otherwise put a for-statement in the body of the outer loop as other answers suggested.
Hat tip to Michael Burr for the suggestion to use lambda. And thanks to the commentators requesting me to use putchar().
#include <stdio.h>
int main() {
int N;
scanf("%d", &N);
for (int i = 0; i < N; i++, [i] {
for (int j = 1; j <= i; j++, putchar('0'))
;
}(), printf("\n"))
;
return 0;
}
I think we should avoid obfuscated code unless we get paid for making the code complex.
I'm wondering if it is feasible to make this loop parallel using OpenMP.
Of course there is the issue of race conditions. I'm unsure how to deal with the n in the inner loop being generated by the outer loop, and with the race condition on D=D+A[n]. Do you think it is practical to try to make this parallel?
for(n=0; n < 10000000; ++n) {
for (n2=0; n2< 100; ++n2) {
A[n]=A[n]+B[n2][n+C[n2]+200];
}
D=D+A[n];
}
Yes, this is indeed parallelizable assuming none of the pointers are aliased.
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n=0; n < 10000000; ++n) {
for (n2 = 0; n2 < 100; ++n2) {
A[n] = A[n] + B[n2][n + C[n2] + 200];
}
D += A[n];
}
It could actually be optimized somewhat as follows:
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n=0; n < 10000000; ++n) {
int tmp = A[n];
for (n2 = 0; n2 < 100; ++n2) {
tmp += B[n2][n + C[n2] + 200];
}
A[n] = tmp;
D += tmp;
}
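The same idea can be written as a self-contained function. In this sketch the loop counters are declared inside the loops, which makes them private to each thread without a private clause. The function name accumulateRows, the std::vector containers, and the long long accumulator are assumptions made for illustration, since the original post does not show the declarations.

```cpp
#include <vector>

// Hypothetical helper: A, B and C mirror the arrays from the question.
// Every row of B must have at least A.size() + max(C) + 200 elements.
long long accumulateRows(std::vector<int>& A,
                         const std::vector<std::vector<int>>& B,
                         const std::vector<int>& C)
{
    const int n = static_cast<int>(A.size());
    const int rows = static_cast<int>(C.size());
    long long D = 0;

    // Each iteration touches only A[i], so the writes do not conflict;
    // D is combined across threads by the reduction clause.
    #pragma omp parallel for reduction(+:D)
    for (int i = 0; i < n; ++i) {
        int tmp = A[i];                  // local accumulator, private per thread
        for (int r = 0; r < rows; ++r) {
            tmp += B[r][i + C[r] + 200];
        }
        A[i] = tmp;
        D += tmp;
    }
    return D;
}
```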