I'm trying to learn parallelization of C++ using OpenMP, and I'm working through the following example. But for some reason, when I increase the number of threads, the code runs slower. I'm compiling it with the -fopenmp flag. It would be nice if I could get your expert opinion.
#include <omp.h>
#include <iostream>

static long num_steps = 100000000;
#define NUM_THREADS 4
double step;

int main(){
    int i, nthreads;
    double pi, sum[NUM_THREADS]; // should be shared : hence promoted scalar sum into an array
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        //if(id==0) nthreads = nthrds; // This is done because the number of threads can be different
                                       // ie the environment can give you a different number of threads
                                       // than requested
        for(i=id, sum[id] = 0.0; i<num_steps; i=i+nthrds){
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    double t2 = omp_get_wtime();
    std::cout << "Time : ";
    double ms_double = t2 - t1;
    std::cout << ms_double << "ms\n";
    for(i=0, pi=0.0; i < nthreads; i++){
        pi += sum[i]*step;
    }
}
Minor complaints aside, your big problem is the loop update i=i+nthrds. It means that each cache line will be accessed by all 4 of your threads. (By the way, use the OMP_NUM_THREADS environment variable to set the number of threads; do not hardcode it.) This is called false sharing and it is really bad for performance: you want each cache line to live exclusively in one core.
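If you want to keep the manual per-thread array for learning purposes, one classic workaround is to pad each slot so that every thread writes to its own cache line. A sketch only (it reuses num_steps, step and NUM_THREADS from your program, and assumes 64-byte cache lines), not the fix I would actually recommend:
#define PAD 8   // assuming 64-byte cache lines, i.e. 8 doubles per line
double sum[NUM_THREADS][PAD];
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double x;
    sum[id][0] = 0.0;
    for (long i = id; i < num_steps; i += nthrds) {
        x = (i + 0.5) * step;
        sum[id][0] += 4.0 / (1.0 + x * x);  // each thread stays on its own cache line
    }
}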
The main advantage of OpenMP is that you do not have to do the reduction manually: you just add one extra line to the serial code. So your code should be something like this, which is free of false sharing:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(unsigned long i=0; i<num_steps; ++i){
    const double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}
double pi = sum*step;
Note that your code had an uninitialized variable (pi) and did not handle the case properly where you got fewer threads than requested.
What #Victor Ejkhout called "minor complaints" might not be so minor. It is only normal that using a new API (OpenMP) for the first time can be confusing, and more often than not that shows in the coding style of the application code as well. But especially in such cases, special attention should be paid to readability.
The code below is the "prettied-up" version of your attempt. Next to the OpenMP parallel integration, it also contains a single-threaded and a multi-threaded (std::thread) version, so you can compare them with each other.
#include <omp.h>
#include <iostream>
#include <thread>

constexpr int MAX_PARALLEL_THREADS = 4; // long is wrong - is it an isize_t or a int32_t or an int64_t???

// the function we want to integrate
double f(double x) {
    return 4.0 / (1.0 + x * x);
}

// performs the summation of function values on the interval [left,right[
double sum_interval(double left, double right, double step) {
    double sum = 0.0;
    for (double x = left; x < right; x += step) {
        sum += f(x);
    }
    return sum;
}

double integrate_single_threaded(double left, double right, double step) {
    return sum_interval(left, right, step) / (right - left);
}

double integrate_multi_threaded(double left, double right, double step) {
    double sums[MAX_PARALLEL_THREADS];
    std::thread threads[MAX_PARALLEL_THREADS];
    for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
        threads[i] = std::thread( [&sums,left,right,step,i] () {
            double ileft = left + (right - left) / MAX_PARALLEL_THREADS * i;
            double iright = left + (right - left) / MAX_PARALLEL_THREADS * (i + 1);
            sums[i] = sum_interval(ileft, iright, step);
        });
    }
    double total_sum = 0.0;
    for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
        threads[i].join();
        total_sum += sums[i];
    }
    return total_sum / (right - left);
}

double integrate_parallel(double left, double right, double step) {
    double sums[MAX_PARALLEL_THREADS];
    int thread_count = 0;
    omp_set_num_threads(MAX_PARALLEL_THREADS);
    #pragma omp parallel
    {
        thread_count = omp_get_num_threads(); // 0 is impossible, there is always 1 thread minimum...
        int interval_index = omp_get_thread_num();
        double ileft = left + (right - left) / thread_count * interval_index;
        double iright = left + (right - left) / thread_count * (interval_index + 1);
        sums[interval_index] = sum_interval(ileft, iright, step);
    }
    double total_sum = 0.0;
    for (int i = 0; i < thread_count; i++) {
        total_sum += sums[i];
    }
    return total_sum / (right - left);
}

int main (int argc, const char* argv[]) {
    double left = -1.0;
    double right = 1.0;
    double step = 1.0E-9;

    // run single threaded calculation
    std::cout << "single" << std::endl;
    double tstart = omp_get_wtime();
    double i_single = integrate_single_threaded(left, right, step);
    double tend = omp_get_wtime();
    double st_time = tend - tstart;

    // run multi threaded calculation
    std::cout << "multi" << std::endl;
    tstart = omp_get_wtime();
    double i_multi = integrate_multi_threaded(left, right, step);
    tend = omp_get_wtime();
    double mt_time = tend - tstart;

    // run omp calculation
    std::cout << "omp" << std::endl;
    tstart = omp_get_wtime();
    double i_omp = integrate_parallel(left, right, step);
    tend = omp_get_wtime();
    double omp_time = tend - tstart;

    std::cout
        << "i_single: " << i_single
        << " st_time: " << st_time << std::endl
        << "i_multi: " << i_multi
        << " mt_time: " << mt_time << std::endl
        << "i_omp: " << i_omp
        << " omp_time: " << omp_time << std::endl;
    return 0;
}
When I compile this on my Debian with g++ --std=c++17 -Wall -O3 -lpthread -fopenmp -o para para.cpp -pthread, I get the following results:
single
multi
omp
i_single: 3.14159e+09 st_time: 2.37662
i_multi: 3.14159e+09 mt_time: 0.635195
i_omp: 3.14159e+09 omp_time: 0.660593
So my conclusion, at least, is that it is not worth the effort to learn OpenMP, given that the (more generally applicable) std::thread version looks just as nice and performs at least equally well.
I do not fully trust the computed integral value in either case, but that was not my focus here. All three versions produce the same value, and that is the important part.
Related
I am trying to compare two methods of computing the value of PI, one sequential and the other parallel, and to measure the difference between them with a steady_clock. When I run them in debug mode, everything works fine and the output is comparable for the two methods:
sequential: 2238223
parallel: 506050
The problem I am facing is that when I am compiling in release mode, the steady_clock does not measure any time difference for the sequential version:
sequential: 0
parallel: 271027
Another strange thing is the order of the output. At the end of the test I print the value of pi returned by each of the two methods. Immediately after the program launches, the console prints 0 for the sequential method, then it waits a while before printing the result of the parallel method, and then it waits a while longer before printing the two pi values. This makes me think that the program only executes the first method at the point where its value of pi needs to be printed.
Is my guess right? If so, why does the program behave this way?
This is the code for the two methods:
double ComputePIParallel()
{
    long long i;
    double area = 0;
    double h = 1.0 / n;

    #pragma omp parallel for shared(n, h) reduction(+:area)
    for (i = 1; i <= n; i++)
    {
        double x = h * (i - 0.5);
        area += (4.0 / (1.0 + x*x));
    }

    double pi = h * area;
    return pi;
}

double ComputePISequntial()
{
    long long i;
    double area = 0;
    double h = 1.0 / n;

    for (i = 1; i <= n; i++)
    {
        double x = h * (i - 0.5);
        area += (4.0 / (1.0 + x*x));
    }

    return h * area;
}
and the main method:
int main()
{
    steady_clock::time_point begin = steady_clock::now();
    double pi1 = ComputePISequntial();
    steady_clock::time_point end = steady_clock::now();
    long long duration = duration_cast<microseconds>(end - begin).count();
    cout<<"sequential: "<<duration<<endl;

    begin = steady_clock::now();
    double pi2 = ComputePIParallel();
    end = steady_clock::now();
    duration = duration_cast<microseconds>(end - begin).count();
    std::cout<<"parallel: "<<duration;
    std::cout<<endl<<pi1<<endl<<pi2;

    _getch();
    return 0;
}
The value of n is 100000000 and it is a global. The code is compiled with the MS compiler from Visual Studio 2012, with OpenMP support enabled.
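One crude way to test the guess above (that the optimizer defers or drops the unobserved sequential computation) is to make its result observable inside the timed region, for example through a volatile sink. This is only a sketch of the idea, with sink being a name introduced here for illustration, not a guaranteed fix:
volatile double sink;                            // volatile stores count as observable behaviour

steady_clock::time_point begin = steady_clock::now();
double pi1 = ComputePISequntial();
sink = pi1;                                      // forces the result to be materialized here
steady_clock::time_point end = steady_clock::now();
If the sequential time then stops reading 0 in release mode, that supports the guess that the computation was being moved or eliminated rather than executed between the two timestamps.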
I use this version of the Pi calculation with the thread-safe function rand_r.
But it appears to be slower (and the answer is wrong) when this program runs in parallel, compared to a sequential program using rand(), which is not thread-safe. It seems that the way I am using rand_r is not thread-safe either. But I do not understand why, because I have read many questions about thread-safe PRNGs and learned that rand_r should be safe enough.
#include <iostream>
#include <random>
#include <ctime>
#include "omp.h"
#include <stdlib.h>

using namespace std;

unsigned seed;

int main()
{
    double start = time(0);
    int i, n, N;
    double x, y;
    N = 1<<30;
    n = 0;
    double pi;

    #pragma omp threadprivate(seed)
    #pragma omp parallel private(x, y) reduction(+:n)
    {
        for (i = 0; i < N; i++) {
            seed = 25234 + 17 * omp_get_thread_num();
            x = rand_r(&seed) / (double) RAND_MAX;
            y = rand_r(&seed) / (double) RAND_MAX;
            if (x*x + y*y <= 1)
                n++;
        }
    }

    pi = 4. * n / (double) (N);
    cout << pi << endl;

    double stop = time(0);
    cout << (stop - start) << endl;
    return 0;
}
P.S. By the way, what are the magic numbers in seed = 25234 + 17 * omp_get_thread_num();? I stole them from some answer.
EDIT: The comment by Gilles helped me. The resolution was:
1. To swap the lines of the for loop and the seed initialization.
2. To add #pragma omp for.
The modified code reads:
#pragma omp parallel private(x, y, seed)
{
    seed = 25234 + 17 * omp_get_thread_num();
    #pragma omp for reduction(+:n)
    for (int i = 0; i < N; i++) {
        x = (double) rand_r(&seed) / (double) RAND_MAX;
        y = (double) rand_r(&seed) / (double) RAND_MAX;
        if (x*x + y*y <= 1)
            n++;
    }
}
The problem is resolved.
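As an aside (not part of the original fix): the <random> header that is already included offers a cleaner thread-safe pattern, namely one engine per thread. A minimal sketch, reusing the same magic seeds purely for illustration and with my own variable names (hits, gen, dist):
#include <omp.h>
#include <random>

long long hits = 0;
const long long N = 1LL << 30;
#pragma omp parallel reduction(+:hits)
{
    std::mt19937 gen(25234 + 17 * omp_get_thread_num());    // one engine per thread, no shared state
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    #pragma omp for
    for (long long i = 0; i < N; i++) {
        double x = dist(gen);
        double y = dist(gen);
        if (x * x + y * y <= 1.0)
            hits++;
    }
}
double pi = 4.0 * hits / (double)N;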
Apparently rand_r() executes more instructions than rand(); the snippets below are copied from one implementation. So it is reasonable that rand_r() takes more time to complete one round than rand().
int
rand_r(unsigned int *ctx)
{
    u_long val = (u_long) *ctx;
    int r = do_rand(&val);

    *ctx = (unsigned int) val;
    return (r);
}

static u_long next = 1;

int
rand()
{
    return (do_rand(&next));
}
And since rand() is not thread-safe, the output could be incorrect if you use rand() in parallel. The worst part is that you would still get a result and would not know whether it is correct from a small-scale test.
I wrote some (probably inefficient, but anyway...) Rcpp code using inline to simulate a stochastic SEIR model.
The serial version compiles and works perfectly, but I need to run it a large number of times, and it looks like an embarrassingly parallel problem (I just need to simulate again for other parameter values and return a matrix with the results). So I tried to add #pragma omp parallel for and to compile with -fopenmp -lgomp, but... boom!
I get a segfault even for very small examples!
I tried to add setenv("OMP_STACKSIZE","24M",1); and values well over 24M but still the segfault happens.
I'll briefly explain the code, since it is a bit long (I tried to shorten it, but then the result changes and I can't reproduce the problem..):
I have two nested loops, the inner one execute the model for a given parameter set and the outer one changes the parameters.
The only way a race condition could happen is if the code tried to execute the set of instructions inside the inner loop in parallel (which cannot be done because of the model structure: iteration t depends on iteration t-1) instead of parallelizing the outer loop. But if I'm not mistaken, parallelizing the outer loop is exactly what the parallel for construct does by default when placed just outside it...
This is basically the form of the code I'm trying to run:
mat result(n_param,T_MAX);

#pragma omp parallel for
for(int i=0; i<n_param_set; i++){
    t = 0;
    rowvec jnk(T_MAX);
    while(t < T_MAX){
        ...
        jnk(t) = something(jnk(t-1));
        ...
        t++;
    }
    result.row(i) = jnk;
}
return wrap(result);
And my question is: how do I tell the compiler that I only want the outer loop computed in parallel (even distributing it statically, like n_loops/n_threads iterations per thread) and not the inner one (which is genuinely non-parallelizable)?
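For reference, #pragma omp parallel for placed on the outer loop only distributes the outer iterations; the inner time loop inside each iteration still runs sequentially on whichever thread owns that iteration. A stripped-down sketch of the pattern, without the Rcpp/Armadillo machinery (initial_value, something and write_row are placeholders, not functions from the real code):
#include <vector>

#pragma omp parallel for schedule(static)
for (int i = 0; i < n_param_set; i++) {
    std::vector<double> jnk(T_MAX);        // private: declared inside the loop body
    jnk[0] = initial_value(i);             // placeholder initialisation
    for (int t = 1; t < T_MAX; t++) {
        jnk[t] = something(jnk[t - 1]);    // sequential in t, as the model requires
    }
    write_row(result, i, jnk);             // each i writes its own row, so no race here
}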
The real code is a bit more involved and I'll present it here for the sake of reproducibility if you're really willing, but I'm only asking about the behavior of OpenMP. Please notice that the only OpenMP instruction appears at line 122.
library(Rcpp);library(RcppArmadillo);library(inline)
misc='
#include <math.h>
#define _USE_MATH_DEFINES
#include <omp.h>
using namespace arma;
template <typename T> int sgn(T val) {
return (T(0) < val) - (val < T(0));
}
uvec rmultinomial(int n,vec prob)
{
int K = prob.n_elem;
uvec rN = zeros<uvec>(K);
double p_tot = sum(prob);
double pp;
for(int k = 0; k < K-1; k++) {
if(prob(k)>0) {
pp = prob[k] / p_tot;
rN(k) = ((pp < 1.) ? (rbinom(1,(double) n, pp))(0) : n);
n -= rN[k];
} else
rN[k] = 0;
if(n <= 0) /* we have all*/
return rN;
p_tot -= prob[k]; /* i.e. = sum(prob[(k+1):K]) */
}
rN[K-1] = n;
return rN;
}
'
model_and_summary='
mat SEIR_sim_plus_summaries()
{
vec alpha;
alpha << 0.002 << 0.0045;
vec beta;
beta << 0.01 << 0.01;
vec gamma;
gamma << 1.0/14.0 << 1.0/14.0;
vec sigma;
sigma << 1.0/(3.5) << 1.0/(3.5);
vec phi;
phi << 0.8 << 0.8;
int S_0 = 800;
int E_0 = 100;
int I_0 = 100;
int R_0 = 0;
int pop = 1000;
double tau = 0.01;
double t_0 = 0;
vec obs_time;
obs_time << 1 << 2 << 3 << 4 << 5 << 6 << 7 << 8 << 9 << 10 << 11 << 12 << 13 << 14 << 15 << 16 << 17 << 18 << 19 << 20 << 21 << 22 << 23 << 24;
const int n_obs = obs_time.n_elem;
const int n_part = alpha.n_elem;
mat stat(n_part,6);
//#pragma omp parallel for
for(int k=0;k<n_part;k++) {
ivec INC_i(n_obs);
ivec INC_o(n_obs);
// Event variables
double alpha_t;
int nX; //current number of people moving
vec rates(8);
uvec trans(4); // current transitions, e.g. from S to E,I,R,Universe
vec r(4); // rates e.g. from S to E, I, R, Univ.
/*********************** Initialize **********************/
int S_curr = S_0;
int S_prev = S_0;
int E_curr = E_0;
int E_prev = E_0;
int I_curr = I_0;
int I_prev = I_0;
int R_curr = R_0;
int R_prev = R_0;
int IncI_curr = 0;
int IncI_prev = 0;
int IncO_curr = 0;
int IncO_prev = 0;
double t_curr = t_0;
int t_idx =0;
while( t_idx < n_obs ) {
// next time preparation
t_curr += tau;
S_prev = S_curr;
E_prev = E_curr;
I_prev = I_curr;
R_prev = R_curr;
IncI_prev = IncI_curr;
IncO_prev = IncO_curr;
/*********************** description (rates) of the events **********************/
alpha_t = alpha(k)*(1+phi(k)*sin(2*M_PI*(t_curr+0)/52)); //real contact rate, time expressed in weeks
rates(0) = (alpha_t * ((double)I_curr / (double)pop ) * ((double)S_curr)); //e+1, s-1, r,i one s gets infected (goes in E, not yet infectious)
rates(1) = (sigma(k) * E_curr); //e-1, i+1, r,s one exposed becomes infectious (goes in I) INCIDENCE!!
rates(2) = (gamma(k) * I_curr); //i-1, s,e, r+1 one i recover
rates(3) = (beta(k) * I_curr); //i-1, s, r,e one i dies
rates(4) = (beta(k) * R_curr); //i,e, s, r-1 one r dies
rates(5) = (beta(k) * E_curr); //e-1, s, r,i one e dies
rates(6) = (beta(k) * S_curr); //s-1 e, i ,r one s dies
rates(7) = (beta(k) * pop); //s+1 one susc is born
// Let the events occour
/*********************** S compartement **********************/
if((rates(0)+rates(6))>0){
nX = rbinom(1,S_prev,1-exp(-(rates(0)+rates(6))*tau))(0);
r(0) = rates(0)/(rates(0)+rates(6)); r(1) = 0.0; r(2) = 0; r(3) = rates(6)/(rates(0)+rates(6));
trans = rmultinomial(nX, r);
S_curr -= nX;
E_curr += trans(0);
I_curr += trans(1);
R_curr += trans(2);
//trans(3) contains dead individual, who disappear...we could avoid this using sequential conditional binomial
}
/*********************** E compartement **********************/
if((rates(1)+rates(5))>0){
nX = rbinom(1,E_prev,1-exp(-(rates(1)+rates(5))*tau))(0);
r(0) = 0.0; r(1) = rates(1)/(rates(1)+rates(5)); r(2) = 0.0; r(3) = rates(5)/(rates(1)+rates(5));
trans = rmultinomial(nX, r);
S_curr += trans(0);
E_curr -= nX;
I_curr += trans(1);
R_curr += trans(2);
IncI_curr += trans(1);
}
/*********************** I compartement **********************/
if((rates(2)+rates(3))>0){
nX = rbinom(1,I_prev,1-exp(-(rates(2)+rates(3))*tau))(0);
r(0) = 0.0; r(1) = 0.0; r(2) = rates(2)/(rates(2)+rates(3)); r(3) = rates(3)/(rates(2)+rates(3));
trans = rmultinomial(nX, r);
S_curr += trans(0);
E_curr += trans(1);
I_curr -= nX;
R_curr += trans(2);
IncO_curr += trans(2);
}
/*********************** R compartement **********************/
if(rates(4)>0){
nX = rbinom(1,R_prev,1-exp(-rates(4)*tau))(0);
r(0) = 0.0; r(1) = 0.0; r(2) = 0.0; r(3) = rates(4)/rates(4);
trans = rmultinomial(nX, r);
S_curr += trans(0);
E_curr += trans(1);
I_curr += trans(2);
R_curr -= nX;
}
/*********************** Universe **********************/
S_curr += pop - (S_curr+E_curr+I_curr+R_curr); //it should be poisson, but since the pop is fixed...
/*********************** Save & Continue **********************/
// Check if the time is interesting for us
if(t_curr > obs_time[t_idx]){
INC_i(t_idx) = IncI_curr;
INC_o(t_idx) = IncO_curr;
IncI_curr = IncI_prev = 0;
IncO_curr = IncO_prev = 0;
t_idx++;
}
//else just go on...
}
/*********************** Finished - Starting w/ stats **********************/
// INC_i is the useful variable, how can I change its reference without copying it?
ivec incidence = INC_i; //just so if I want to use INC_o i have to change just this...
//Scan the epidemics to recover the summary stats (naively divide the data each 52 weeks)
double n_years = ceil((double)obs_time(n_obs-1)/52.0);
vec mu_attack(n_years);
vec ratio_attack(n_years-1);
vec peak(n_years);
vec atk(52);
peak(0)=0.0;
vec tmpExplo(52); //explosiveness
vec explo(n_years);
int year=0;
int week;
for(week=0 ; week<n_obs ; week++){
if(week - 52*year > 51){
mu_attack(year) = sum( atk )/(double)pop;
if(year>0)
ratio_attack(year-1) = mu_attack(year)/mu_attack(year-1);
for(int i=0;i<52;i++){
if(atk(i)>(peak(year)/2.0)){
tmpExplo(i) = 1.0;
} else {
tmpExplo(i) = 0.0;
}
}
explo(year) = sum(tmpExplo);
year++;
peak(year)=0.0;
}
atk(week-52*year) = incidence(week);
if( peak(year) < incidence(week) )
peak(year)=incidence(week);
}
if(week - 52*year > 51){
mu_attack(year) = sum( atk )/(double)pop;
} else {
ivec idx(52);
for(int i=0;i<52;i++)
{ idx(i) = i; } //take just the updated ones...
vec tmp = atk.elem(find(idx<(week - 52*year)));
mu_attack(year) = sum( tmp )/((double)pop * (tmp.n_elem/52.0));
ratio_attack(year-1) = mu_attack(year)/mu_attack(year-1);
for(int i=0;i<tmp.n_elem;i++){
if(tmp(i)>(peak(year)/2.0)){
tmpExplo(i) = 1.0;
} else {
tmpExplo(i) = 0.0;
}
}
for(int i=tmp.n_elem;i<52;i++)
tmpExplo(i) = 0.0; //to reset the others
explo(year) = sum(tmpExplo);
}
double correlation2;
double correlation4;
vec autocorr = acf(peak);
/***** ACF *****/
if(n_years<3){
correlation2=0.0;
correlation4=0.0;
} else {
if(n_years<5){
correlation2 = autocorr(1);
correlation4 = 0.0;
} else {
correlation2 = autocorr(1);
correlation4 = autocorr(3);
}
}
rowvec jnk(6);
jnk << sum(mu_attack)/(year+1.0)
<< (sum( log(ratio_attack)%log(ratio_attack) )/(n_years-1)) - (pow(sum( log(ratio_attack) )/(n_years-1),2))
<< correlation2 << correlation4 << max(peak) << sum(explo)/n_years;
stat.row(k) = jnk;
}
return stat;
}
'
main='
std::cout << "max_num_threads " << omp_get_max_threads() << std::endl;
RNGScope scope;
mat summaries = SEIR_sim_plus_summaries();
return wrap(summaries);
'
plug = getPlugin("RcppArmadillo")
## modify the plugin for Rcpp to support OpenMP
plug$env$PKG_CXXFLAGS <- paste('-fopenmp', plug$env$PKG_CXXFLAGS)
plug$env$PKG_LIBS <- paste('-fopenmp -lgomp', plug$env$PKG_LIBS)
SEIR_sim_summary = cxxfunction(sig=signature(),main,settings=plug,inc = paste(misc,model_and_summary),verbose=TRUE)
SEIR_sim_summary()
Thanks for the help!
NB: before you ask, I slightly modified the Rcpp multinomial sampling function just because I liked that way more than the one using pointer...not any other particular reason! :)
The core pseudo-random number generators (PRNGs) in R are not designed to be used in multithreaded environments. That is, their state is stored in a static array (dummy from src/main/PRNG.c) and therefore is shared among all threads. Moreover several other static structures are used to store states for the higher-level interfaces to the core PRNGs.
A possible solution is to put each call to rnorm() or the other sampling functions inside named critical sections, all sharing the same name, e.g.:
...
#pragma omp critical(random)
rN(k) = ((pp < 1.) ? (rbinom(1,(double) n, pp))(0) : n);
...
if((rates(0)+rates(6))>0){
    #pragma omp critical(random)
    nX = rbinom(1,S_prev,1-exp(-(rates(0)+rates(6))*tau))(0);
    ...
Note that the critical construct operates on the structured block following it and therefore locks the entire statement. If a random number is being drawn inline inside a call to a time-consuming function, e.g.
#pragma omp critical(random)
x = slow_computation(rbinom(...));
this is better transformed to:
#pragma omp critical(random)
rb = rbinom(...);
x = slow_computation(rb);
That way only the rb = rbinom(...); statement will be protected.
I was trying to prove a point about OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance with OpenMP.
The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every operation can be performed in any order, and there should be no dependency between tasks. So in theory no processor should ever have to wait for another process to perform any kind of synchronization.
The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower.
For many other parallelized linear routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operation I perform on the array is thread-safe, so that it cannot be really effective.
See the example below.
Does anyone have any clue on how to make this more effective with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
                    std::vector<double> & x,
                    double h)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 1; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 2; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}
Addition:
I have now also tried a raw-pointer implementation, and it shows the same behavior as the STL container, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.
First of all, make sure that the x vector is aligned to cache boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
Second, you can try to assign more computation to each thread (that way you keep cache lines separated), but I suspect that OpenMP already does something like this under the hood, so it may not matter for large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base, k, i;

#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
    base = j * workGroupSize;
    for (int k = 0; k < workGroupSize; k+=2)
    {
        i = base + k + (redSweep ? 1 : 0);
        if ( i == 0 || i+1 == N) continue;
        sigma = -invh2* ( x[i-1] + x[i+1] );
        x[i] = ( h2/2.0 ) * ( b[i] - sigma );
    }
}
In conclusion, you definitely have a problem of cache fighting, but given the way OpenMP works (sadly, I am not that familiar with it), it should be enough to work with properly allocated buffers.
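As a footnote to the "assign more computation to each thread" idea: the same effect can also be requested with an explicit schedule clause instead of hand-rolled work groups. A sketch of the red sweep from redBlackSmooth (the chunk size is a guess that would need tuning; it counts loop iterations, i.e. every other array element):
// Chunked static schedule: each thread gets contiguous blocks of iterations,
// so threads only share cache lines at the block boundaries.
#pragma omp parallel for schedule(static, 4096)
for (int i = 1; i < N - 1; i += 2)
{
    double sigma = -invh2 * (x[i - 1] + x[i + 1]);
    x[i] = (h2 / 2.0) * (b[i] - sigma);
}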
I think the main problem is the type of array structure you are using. Let's try comparing the results with vectors and arrays (arrays = C arrays allocated with the new operator).
Vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to keep the runtime above 0.1 seconds.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (million lattice point updates per second)
MFLOPS would be misleading when it comes to grid calculations: a few changes in the basic equation can make the same runtime look like higher performance.
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    double runtime(0.0), wcs, wce;
    int repeat = 1;
    timing(&wcs);

    for(; runtime < 0.1; repeat*=2)
    {
        for(int r = 0; r < repeat; ++r)
        {
            // Red sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 1; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }

            // Black sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 2; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }
            // cout << "In Array: " << r << endl;
        }
        if(x[0] != 0) dummy(x[0]);
        timing(&wce);
        runtime = (wce-wcs);
    }
    // cout << "Before division: " << repeat << endl;
    repeat /= 2;

    cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
         << "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;

    return runtime;
}
I didn't change anything in the code except the array type. For better cache access and blocking, you should look into data alignment (_mm_malloc).
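For example, a possible way to allocate the grids on cache-line boundaries with _mm_malloc could look like the sketch below (N and h as in the code above; 64 bytes is the usual x86 cache-line size, adjust as needed):
#include <xmmintrin.h>   // _mm_malloc / _mm_free (on some compilers also available via <malloc.h>)

double *b = static_cast<double*>(_mm_malloc(N * sizeof(double), 64));
double *x = static_cast<double*>(_mm_malloc(N * sizeof(double), 64));
// ... initialise b and x, then call my_redBlackSmooth(b, x, h, N) ...
_mm_free(x);
_mm_free(b);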
I wrote a program which implements this formula:
Pi = (1/n) * sum_{i=1..n} 4 / (1 + ((i - 0.5)/n)^2)
Program code:
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

using namespace std;

const long double PI = double(M_PI);

int main(int argc, char* argv[])
{
    typedef struct timeval tm;
    tm start, end;
    int timer = 0;

    int n;
    if (argc == 2) n = atoi(argv[1]);
    else n = 8000;

    long double pi1 = 0;

    gettimeofday ( &start, NULL );
    for(int i = 1; i <= n; i++) {
        pi1 += 4 / ( 1 + (i-0.5) * (i-0.5) / (n*n) );
    }
    pi1 /= n;
    gettimeofday ( &end, NULL );

    timer = ( end.tv_usec - start.tv_usec );

    long double delta = pi1 - PI;
    printf("pi = %.12Lf\n", pi1);
    printf("delta = %.12Lf\n", delta);
    cout << "time = " << timer << endl;
    return 0;
}
How can I rewrite it in an optimal way, so that there are fewer floating-point operations in this part:
for(int i = 1; i <= n; i++) {
    pi1 += 4 / ( 1 + (i-0.5) * (i-0.5) / (n*n) );
}
Thanks
One idea would be:
double nn = n*n;
for(double i = 0.5; i < n; i += 1) {
    pi1 += 4 / ( 1 + i * i / nn );
}
But you need to test whether it makes any difference compared with the current code.
I suggest you read this excellent document:
Software Optimization Guide for AMD64 Processors
It is also a great read even if you do not have an AMD processor.
But if I were you, I would replace the whole calculation loop with just
pi1 = M_PI;
which will probably be the fastest... If you are actually interested in faster algorithms for computing Pi, look at the Wikipedia category: Category:Pi algorithm.
If you just want to microoptimize your code, read the above mentioned software optimization guide.
Examples of simple optimization:
compute double one_per_n = 1.0/n; outside the for loop, reducing the cost of dividing by n on each iteration
compute double j = (i-0.5) * one_per_n inside the loop
pi1 += 4 / (1 + j*j);
This should be faster and also avoids the integer overflow you get in n*n for larger values of n. For even more optimized code you will have to look at the generated code and use a profiler to make appropriate changes; code optimized this way might behave differently on machines with a different CPU or cache. Avoiding divisions is always a good way to save computation time.
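Putting the steps above together, the rewritten loop might look like this (a sketch reusing pi1 and n from your program; one_per_n and j are just the names used above):
double one_per_n = 1.0 / n;                  // hoisted: one division in total
for (int i = 1; i <= n; i++) {
    double j = (i - 0.5) * one_per_n;        // multiply instead of dividing by n*n each iteration
    pi1 += 4.0 / (1.0 + j * j);
}
pi1 /= n;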
#include <iostream>
#include <iomanip>   // std::setprecision
#include <cmath>
#include <cstdlib>   // atoi
#include <chrono>

#ifndef M_PI //M_PI is non-standard, so make sure you catch this case
#define M_PI 3.14159265358979323846
#endif

typedef long double float_type; // renamed from float_t, which <cmath> may already define
const float_type PI = double(M_PI);

int main(int argc, char* argv[])
{
    int n = argc == 2 ? atoi(argv[1]) : 8000;
    float_type pi1 = 0.0;

    //if you can, using auto here is a no brainer
    std::chrono::time_point<std::chrono::system_clock> start
        = std::chrono::system_clock::now();

    unsigned n2 = n*n;
    for(int i = 1; i <= n; i++)
    {
        pi1 += 4.0 / ( 1.0 + (i-0.5) * (i-0.5) / n2 );
    }
    pi1 /= n;

    std::chrono::duration<double> time
        = std::chrono::system_clock::now() - start;

    float_type delta = pi1 - PI;
    std::cout << "pi = " << std::setprecision(12) << pi1
              << "\ndelta = " << std::setprecision(12) << delta
              << "\ntime = " << time.count() << std::endl;
    return 0;
}