I am trying to compare two methods of computing the value of pi, one sequential and one parallel, and I measure the running time of each with a steady clock. When I run them in debug mode, everything works fine and the timings (in microseconds) are comparable for the two methods:
sequential: 2238223
parallel: 506050
The problem is that when I compile in release mode, steady_clock does not measure any elapsed time for the sequential version:
sequential: 0
parallel: 271027
Another strange thing: I print the two timings, and at the end of the test I print the value of pi returned by each method. The console prints 0 for the sequential method immediately after the program starts, then waits a while before printing the timing of the parallel method, and then waits a while longer before printing the pi values. This makes me think that the program executes the first method only when its value of pi needs to be printed.
Is my guess right, and if so, why does the program behave differently in release mode?
This is the code for the two methods:
double ComputePIParallel()
{
    long long i;
    double area = 0;
    double h = 1.0 / n;
    #pragma omp parallel for shared(n, h) reduction(+:area)
    for (i = 1; i <= n; i++)
    {
        double x = h * (i - 0.5);
        area += (4.0 / (1.0 + x*x));
    }
    double pi = h * area;
    return pi;
}
double ComputePISequntial()
{
    long long i;
    double area = 0;
    double h = 1.0 / n;
    for (i = 1; i <= n; i++)
    {
        double x = h * (i - 0.5);
        area += (4.0 / (1.0 + x*x));
    }
    return h * area;
}
and the main method:
int main()
{
    steady_clock::time_point begin = steady_clock::now();
    double pi1 = ComputePISequntial();
    steady_clock::time_point end = steady_clock::now();
    long long duration = duration_cast<microseconds>(end - begin).count();
    cout<<"sequential: "<<duration<<endl;
    begin = steady_clock::now();
    double pi2 = ComputePIParallel();
    end = steady_clock::now();
    duration = duration_cast<microseconds>(end - begin).count();
    std::cout<<"parallel: "<<duration;
    std::cout<<endl<<pi1<<endl<<pi2;
    _getch();
    return 0;
}
The value of n is 100000000 and it is a global. The code is compiled with the MS compiler from Visual Studio 2012, with OpenMP support enabled.
I'm trying to learn parallelization of C++ using OpenMP, and I'm trying to use the following example. But for some reason, when I increase the number of threads, the code runs slower. I'm compiling it using the -fopenmp flag. It would be nice if I could get your expert opinion.
#include <omp.h>
#include <iostream>
static long num_steps =100000000;
#define NUM_THREADS 4
double step;
int main(){
int i,nthreads;
double pi, sum[NUM_THREADS]; // should be shared : hence promoted scalar sum into an array
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
double t1 = omp_get_wtime();
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
//if(id==0) nthreads = nthrds; // This is done because the number of threads can be different
// ie the environment can give you a different number of threads
// than requested
for(i=id, sum[id] = 0.0; i<num_steps;i=i+nthrds){
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
double t2 = omp_get_wtime();
std::cout << "Time : " ;
double ms_double = t2 - t1;
std::cout << ms_double << "ms\n";
for(i=0,pi=0.0; i < nthreads; i++){
pi += sum[i]*step;
}
}
Minor complaints aside, your big problem is the combination of the loop update i=i+nthrds with the shared sum array: on every iteration each thread writes its own sum[id], and those neighbouring elements sit in the same cache line. This means that the cache line will be accessed by all 4 of your threads. (Btw, use the OMP_NUM_THREADS environment variable to set the number of threads. Do not hardcode.) This is called false sharing and it's really bad for performance: you want each cache line to be exclusively in one core.
The main advantage of OpenMP is that you do not have to do reduction manually. You just have to add an extra line to the serial code. So, your code should be something like this (which is free from false-sharing):
double sum=0;
#pragma omp parallel for reduction(+:sum)
for(unsigned long i=0; i<num_steps; ++i){
    const double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}
double pi = sum*step;
Note that your code had an uninitialized variable (nthreads, because the line that sets it is commented out), and it did not handle the case properly where you get fewer threads than requested.
What @Victor Ejkhout called "minor complaints" might not be so minor. It is only normal that using a new API (OpenMP) for the first time is confusing, and more often than not that shows in the coding style of the application code as well. But especially in such cases, special attention should be paid to readability.
The code below is the "prettied-up" version of your attempt. And next to the omp parallel integration it also has the single threaded and a multi threaded (using std::thread) version so you can compare them to each other.
#include <omp.h>
#include <iostream>
#include <thread>
constexpr int MAX_PARALLEL_THREADS = 4; // long is wrong - is it an isize_t or a int32_t or an int64_t???
// the function we want to integrate
double f(double x) {
return 4.0 / (1.0 + x * x);
}
// performs the summation of function values on the interval [left,right[
double sum_interval(double left, double right, double step) {
double sum = 0.0;
for (double x = left; x < right; x += step) {
sum += f(x);
}
return sum;
}
double integrate_single_threaded(double left, double right, double step) {
return sum_interval(left, right, step) / (right - left);
}
double integrate_multi_threaded(double left, double right, double step) {
double sums[MAX_PARALLEL_THREADS];
std::thread threads[MAX_PARALLEL_THREADS];
for (int i= 0; i < MAX_PARALLEL_THREADS;i++) {
threads[i] = std::thread( [&sums,left,right,step,i] () {
double ileft = left + (right - left) / MAX_PARALLEL_THREADS * i;
double iright = left + (right - left) / MAX_PARALLEL_THREADS * (i + 1);
sums[i] = sum_interval(ileft,iright,step);
});
}
double total_sum = 0.0;
for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
threads[i].join();
total_sum += sums[i];
}
return total_sum / (right - left);
}
double integrate_parallel(double left, double right, double step) {
double sums[MAX_PARALLEL_THREADS];
int thread_count = 0;
omp_set_num_threads(MAX_PARALLEL_THREADS);
#pragma omp parallel
{
thread_count = omp_get_num_threads(); // 0 is impossible, there is always 1 thread minimum...
int interval_index = omp_get_thread_num();
double ileft = left + (right - left) / thread_count * interval_index;
double iright = left + (right - left) / thread_count * (interval_index + 1);
sums[interval_index] = sum_interval(ileft,iright,step);
}
double total_sum = 0.0;
for (int i = 0; i < thread_count; i++) {
total_sum += sums[i];
}
return total_sum / (right - left);
}
int main (int argc, const char* argv[]) {
double left = -1.0;
double right = 1.0;
double step = 1.0E-9;
// run single threaded calculation
std::cout << "single" << std::endl;
double tstart = omp_get_wtime();
double i_single = integrate_single_threaded(left, right, step);
double tend = omp_get_wtime();
double st_time = tend - tstart;
// run multi threaded calculation
std::cout << "multi" << std::endl;
tstart = omp_get_wtime();
double i_multi = integrate_multi_threaded(left, right, step);
tend = omp_get_wtime();
double mt_time = tend - tstart;
// run omp calculation
std::cout << "omp" << std::endl;
tstart = omp_get_wtime();
double i_omp = integrate_parallel(left, right, step);
tend = omp_get_wtime();
double omp_time = tend - tstart;
std::cout
<< "i_single: " << i_single
<< " st_time: " << st_time << std::endl
<< "i_multi: " << i_multi
<< " mt_time: " << mt_time << std::endl
<< "i_omp: " << i_omp
<< " omp_time: " << omp_time << std::endl;
return 0;
}
When I compile this on my Debian with g++ --std=c++17 -Wall -O3 -lpthread -fopenmp -o para para.cpp -pthread, I get the following results:
single
multi
omp
i_single: 3.14159e+09 st_time: 2.37662
i_multi: 3.14159e+09 mt_time: 0.635195
i_omp: 3.14159e+09 omp_time: 0.660593
So, at least my conclusion is that it is not worth the effort to learn OpenMP, given that the (more general-purpose) std::thread version looks just as nice and performs at least equally well.
I do not really trust the computed integral result in any of the three cases, though. But I did not really focus on that. They all produce the same value. That is the important part.
I'm wondering if there is a simple way to run a function multiple times in parallel. I've tried multithreading, but either there is something I don't understand or it doesn't actually speed up the calculations (quite the opposite, in fact). Here is the function that I want to run in parallel:
void heun_update_pos(vector<planet>& planets, vector<double> x_i, vector<double> y_i, vector<double> mass, size_t n_planets, double h, int i)
{
if (planets[i].mass != 0) {
double sum_gravity_x = 0;
double sum_gravity_y = 0;
//loop for collision check and gravitational contribution
for (int j = 0; j < n_planets; j++) {
if (planets[j].mass != 0) {
double delta_x = planets[i].x_position - x_i[j];
double delta_y = planets[i].y_position - y_i[j];
//computing the distances between two planets in x and y
if (delta_x != 0 && delta_y != 0) {
//collision test
if (collision_test(planets[i], planets[j], delta_x, delta_y) == true) {
planets[i].mass += planets[j].mass;
planets[j].mass = 0;
}
//sum of the gravity contributions from other planets
sum_gravity_x += gravity_x(delta_x, delta_y, mass[j]);
sum_gravity_y += gravity_y(delta_x, delta_y, mass[j]);
}
}
};
double sx_ip1 = planets[i].x_speed + (h / 2) * sum_gravity_x;
double sy_ip1 = planets[i].y_speed + (h / 2) * sum_gravity_y;
double x_ip1 = planets[i].x_position + (h / 2) * (planets[i].x_speed + sx_ip1);
double y_ip1 = planets[i].y_position + (h / 2) * (planets[i].y_speed + sy_ip1);
planets[i].update_position(x_ip1, y_ip1, sx_ip1, sy_ip1);
};
}
And here is how I tried to use multithreading with it:
const int cores = 6;
vector<thread> threads(cores);
int active_threads = 0;
int closing_threads = 1;
for (int i = 0; i < n_planets; i++) {
threads[active_threads] = thread(&heun_update_pos, ref(planets), x_i, y_i, mass, n_planets, h, i);
if (i > cores - 2) threads[closing_threads].join();
//There should only be as many threads as there are cores
closing_threads++;
if (closing_threads > cores - 1) closing_threads = 0;
active_threads++; // counting the number of active threads
if (active_threads >= cores) active_threads = 0;
};
//CLOSING REMAINING THREADS
for (int k = 0; k < cores; k++) {
if (threads[k].joinable()) threads[k].join();
};
I just started learning C++ today (used Python before), this is my first code, so I am not very familiar with all the C++ functionalities.
Creating a new thread takes a lot of time, typically 50-100 microseconds. Depending on how long your serial version takes, starting one thread per planet may not help at all. If you run this code several times, it is worth trying a thread pool, since waking up an existing thread takes at most 5 microseconds.
Check out a similar answer here:
Is there a performance benefit in using a pool of threads over simply creating threads?
There is a framework for multithreading calculation in C++ called OpenMP. You might think about using it.
https://bisqwit.iki.fi/story/howto/openmp/
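A minimal sketch of what the OpenMP route could look like for this kind of problem. The Body struct, the function name compute_accelerations, and a gravitational constant of 1 are all assumptions made for illustration; the question's planet class and collision handling are not reproduced. Each iteration only reads the shared input and writes its own output slot, which is what makes the loop safe to parallelize.

```cpp
#include <cmath>
#include <vector>

// Hypothetical minimal body type; the question's `planet` class is not shown in full.
struct Body {
    double x, y, mass;
};

// Computes the gravitational acceleration on every body (G taken as 1, an assumption).
// Iteration i reads `bodies` and writes only ax[i] and ay[i], so there is no data race.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
void compute_accelerations(const std::vector<Body>& bodies,
                           std::vector<double>& ax,
                           std::vector<double>& ay) {
    const int n = static_cast<int>(bodies.size());
    ax.assign(bodies.size(), 0.0);
    ay.assign(bodies.size(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double sum_x = 0.0;
        double sum_y = 0.0;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            const double dx = bodies[j].x - bodies[i].x;
            const double dy = bodies[j].y - bodies[i].y;
            const double r2 = dx * dx + dy * dy;
            if (r2 == 0.0) continue;  // coincident bodies: skip to avoid dividing by zero
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            sum_x += bodies[j].mass * dx * inv_r3;
            sum_y += bodies[j].mass * dy * inv_r3;
        }
        ax[i] = sum_x;
        ay[i] = sum_y;
    }
}
```

Merging colliding planets changes other planets' masses, so in a design like this it would have to happen in a separate serial pass after the parallel loop.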
Suppose we need to generate a very long harmonic signal, ideally infinitely long. At first glance, the solution seems trivial:
Sample1:
float t = 0;
while (runned)
{
float v = sinf(w * t);
t += dt;
}
Unfortunately, this solution does not work. For t >> dt, incorrect values will be obtained because of limited float precision. Fortunately, we can recall that sin(2*PI*n + x) = sin(x), where n is an arbitrary integer, so it is not difficult to modify the example to get an "infinite" analog:
Sample2:
float t = 0;
float tau = 2 * M_PI / w;
while (runned)
{
float v = sinf(w * t);
t += dt;
if (t > tau) t -= tau;
}
For one physical simulation, I needed to get an infinite signal which is a sum of harmonic signals, like this:
Sample3:
float getSignal(float x)
{
float ret = 0;
for (int i = 0; i < modNum; i++)
ret += sin(w[i] * x);
return ret;
}
float t = 0;
while (runned)
{
float v = getSignal(t);
t += dt;
}
In this form, the code does not work correctly for large t, for the same reason as Sample1. The question is: how can I get an "infinite" implementation of the Sample3 algorithm? I assume that the solution should look like Sample2. A very important note: in general, w[i] is arbitrary and the components are not harmonics of each other, that is, the frequencies are not multiples of some base frequency, so I can't find a common tau. Using types with greater precision (double, long double) is not allowed.
Thanks for your advice!
You can choose an arbitrary tau and store the phase remainder for each mode when subtracting it from t (as @Damien suggested in the comments).
Also, representing the time as t = dt * it, where it is an integer, can improve numerical stability (I think).
Maybe something like this:
int ndt = 1000; // accumulate phase every 1000 steps for example
float tau = dt * ndt;
std::vector<float> phases(modNum, 0.0f);
int it = 0;
float t = 0.0f;
while (runned)
{
    t = dt * it;
    float v = 0.0f;
    for (int i = 0; i < modNum; i++)
    {
        v += sinf(w[i] * t + phases[i]);
    }
    if (++it >= ndt)
    {
        it = 0;
        for (int i = 0; i < modNum; ++i)
        {
            phases[i] = fmod(w[i] * tau + phases[i], 2 * M_PI);
        }
    }
}
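The same scheme can be packaged so that it is easy to check against a double-precision reference. This is a sketch under a few assumptions: the struct name BlockPhaseSignal and its next() method are made up, and a float constant for 2π replaces M_PI so that no double arithmetic is involved.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper around the block-wise phase accumulation shown above.
struct BlockPhaseSignal {
    std::vector<float> w;       // angular frequencies, one per mode
    std::vector<float> phases;  // accumulated phase per mode, kept in (-2*pi, 2*pi)
    float dt;                   // time step
    int ndt;                    // number of steps per block
    int it = 0;                 // step index inside the current block

    BlockPhaseSignal(std::vector<float> freqs, float dt_in, int ndt_in)
        : w(std::move(freqs)), phases(w.size(), 0.0f), dt(dt_in), ndt(ndt_in) {}

    // Returns the signal value for the current step and advances by one step.
    float next() {
        const float two_pi = 6.2831853f;
        const float t = dt * static_cast<float>(it);  // local time stays below dt * ndt
        float v = 0.0f;
        for (std::size_t i = 0; i < w.size(); ++i) {
            v += std::sin(w[i] * t + phases[i]);
        }
        if (++it >= ndt) {
            it = 0;
            const float tau = dt * static_cast<float>(ndt);
            for (std::size_t i = 0; i < w.size(); ++i) {
                phases[i] = std::fmod(w[i] * tau + phases[i], two_pi);
            }
        }
        return v;
    }
};
```

The arguments passed to sin stay small no matter how long the loop runs. The phases still pick up a small rounding error at every block update (and a tiny bias from the float value of 2π), so the drift is slow but not strictly zero.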
I use this version of the calculation of pi with the thread-safe function rand_r. But it appears that it is slower (and the answer is wrong) when running this program in parallel, compared to a sequential program that uses rand(), which is not thread-safe. It seems that this way of using rand_r is also not thread-safe. But I do not understand why, because I have read many questions about thread-safe PRNGs and learned that rand_r should be safe enough.
#include <iostream>
#include <random>
#include <ctime>
#include "omp.h"
#include <stdlib.h>
using namespace std;
unsigned seed;
int main()
{
double start = time(0);
int i, n, N;
double x, y;
N = 1<<30;
n = 0;
double pi;
#pragma omp threadprivate(seed)
#pragma omp parallel private(x, y) reduction(+:n)
{
for (i = 0; i < N; i++) {
seed = 25234 + 17 * omp_get_thread_num();
x = rand_r(&seed) / (double) RAND_MAX;
y = rand_r(&seed) / (double) RAND_MAX;
if (x*x + y*y <= 1)
n++;
}
}
pi = 4. * n / (double) (N);
cout << pi << endl;
double stop = time(0);
cout << (stop - start) << endl;
return 0;
}
P.S. By the way, what are the magic numbers in seed = 25234 + 17 * omp_get_thread_num(); ? I stole them from some answer.
EDIT: The comment by Gilles helped me. The resolution was:
1. To swap the for loop line and the seed initialization, so that the seed is set once before the loop.
2. To add #pragma omp for
The modified code reads:
#pragma omp parallel private(x, y, seed)
{
    seed = 25234 + 17 * omp_get_thread_num();
    #pragma omp for reduction(+:n)
    for (int i = 0; i < N; i++) {
        x = (double) rand_r(&seed) / (double) RAND_MAX;
        y = (double) rand_r(&seed) / (double) RAND_MAX;
        if (x*x + y*y <= 1)
            n++;
    }
}
The problem is resolved.
Apparently there are more instructions in rand_r() compared to rand(). The code below is copied from one implementation. So it's reasonable that rand_r() takes more time to complete one call than rand().
int
rand_r(unsigned int *ctx)
{
u_long val = (u_long) *ctx;
int r = do_rand(&val);
*ctx = (unsigned int) val;
return (r);
}
static u_long next = 1;
int
rand()
{
return (do_rand(&next));
}
And since rand() is not thread-safe, the output could be incorrect if you use rand() in parallel. The worst part is that you would still get a result, and a small-scale test would not tell you whether it is correct.
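An alternative that sidesteps both rand() and rand_r() is to give each thread its own engine from <random>. This is a different technique from the one in the question, shown as a hedged sketch: the function name estimate_pi_monte_carlo is made up, and the seed spacing of 17 is simply carried over from the question. Seeds are handed out inside a critical section so that the sketch needs no calls into the OpenMP runtime.

```cpp
#include <random>

// Monte Carlo estimate of pi with one std::mt19937 per thread (illustrative sketch).
// Without OpenMP enabled the pragmas are ignored and a single engine is used.
double estimate_pi_monte_carlo(long long samples, unsigned base_seed = 25234u) {
    long long hits = 0;
    unsigned next_seed = base_seed;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned my_seed;
        #pragma omp critical
        {
            my_seed = next_seed;  // each thread takes a distinct seed exactly once
            next_seed += 17u;
        }
        std::mt19937 engine(my_seed);  // thread-local generator state
        std::uniform_real_distribution<double> unit(0.0, 1.0);
        #pragma omp for
        for (long long i = 0; i < samples; ++i) {
            const double x = unit(engine);
            const double y = unit(engine);
            if (x * x + y * y <= 1.0) {
                ++hits;
            }
        }
    }
    return 4.0 * static_cast<double>(hits) / static_cast<double>(samples);
}
```

Because the generator state lives entirely inside the parallel region, there is no shared seed to get wrong.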
I have a problem with the following code:
int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
// initialize dist2
for (int i = 0; i < x.n; ++i) {
dist2[i].first = std::numeric_limits<float>::max();
dist2[i].second = i;
}
// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;
double begin, end;
double elapsed_secs;
while (ndx < k) {
float sum_distribution = 0.0;
// look for the point that is furthest from any center
begin = omp_get_wtime();
#pragma omp parallel for reduction(+:sum_distribution)
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
sum_distribution += dist2[i].first;
}
end = omp_get_wtime() - begin;
std::cout << "center assigning -- "
<< ndx << " of " << k << " = "
<< (float)ndx / k * 100
<< "% is done. Elasped time: "<< (float)end <<"\n";
/**/
bool unique = true;
do {
// choose a random interval according to the new distribution
float r = sum_distribution * (float)rand() / (float)RAND_MAX;
float sum_cdf = dist2[0].first;
int cdf_ndx = 0;
while (sum_cdf < r) {
sum_cdf += dist2[++cdf_ndx].first;
}
chosen_pts[ndx] = cdf_ndx;
for (int i = 0; i < ndx; ++i) {
unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
}
} while (! unique);
++ndx;
}
As you can see, I use OpenMP to parallelize the for loop. It works fine and I can achieve a significant speedup. However, if I increase the value of x.n over 20000000, the function stops working after 8-10 iterations:
It doesn't produce any output (std::cout)
Only one core works
No error whatsoever
If I comment out the do-while loop, it works again as expected. All cores are busy, there is output after each iteration, and I can increase x.n over 100 million, just as I need.
It's not OpenMP parallel for getting stuck, it's obviously in your serial do-while loop.
One particular issue that I see is that there are no array bounds checks in the inner while loop accessing dist2. In theory, out-of-bounds access should never happen; but in practice it may (see below for why). So first of all I would rewrite the calculation of cdf_ndx to guarantee that the loop ends when all elements are inspected:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
    sum_cdf += dist2[cdf_ndx].first;
    ++cdf_ndx;
}
Now, how can it happen that sum_cdf does not reach r? It is due to the specifics of floating-point arithmetic and the fact that sum_distribution was computed in parallel, while sum_cdf is computed serially. The problem is that the contribution of one element to the sum can be below the accuracy of a float; in other words, when you add two float values that differ by more than ~8 orders of magnitude, the smaller one does not affect the sum.
So, with 20M floats, after some point it might happen that the next value to add is so small compared to the accumulated sum_cdf that adding it does not change the sum! On the other hand, sum_distribution was essentially computed as several independent partial sums (one per thread) that were then combined. Thus it is more accurate, and possibly bigger than sum_cdf can ever reach.
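The absorption effect is easy to reproduce in isolation. The sketch below (function names are illustrative) adds the value 1.0f many times, once into a single running float and once in blocks. Assuming float arithmetic is evaluated in IEEE single precision, the running sum stops growing at 2^24 = 16777216, because from there on adding 1.0f rounds back to the same value.

```cpp
#include <algorithm>
#include <cstddef>

// Adds 1.0f `count` times into one running float.
float naive_sum_of_ones(std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += 1.0f;  // has no effect once sum reaches 2^24
    }
    return sum;
}

// Same total, but accumulated in blocks so that every addition combines
// values of comparable magnitude. `block` must be greater than zero.
float blocked_sum_of_ones(std::size_t count, std::size_t block) {
    float total = 0.0f;
    for (std::size_t start = 0; start < count; start += block) {
        const std::size_t end = std::min(start + block, count);
        float block_sum = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            block_sum += 1.0f;
        }
        total += block_sum;
    }
    return total;
}
```

With 20 million terms the naive version falls short while the blocked version with a block size of 10000 reaches the exact total, which mirrors the difference between the serial sum_cdf and the per-thread partial sums behind sum_distribution.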
A solution can be to compute sum_cdf in portions, having two nested loops. For example:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
    float block_sum = 0;
    int block_end = min(cdf_ndx+10000, x.n); // 10000 is an arbitrarily selected block size
    for (int i=cdf_ndx; i<block_end; ++i ) {
        block_sum += dist2[i].first;
        if( sum_cdf+block_sum >=r ) {
            block_end = i; // adjust to correctly compute cdf_ndx
            break;
        }
    }
    sum_cdf += block_sum;
    cdf_ndx = block_end;
}
And after the loop you need to check that cdf_ndx < x.n, otherwise repeat with a new random interval.
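Put together, the blocked search and the final bounds check could look like the following sketch. It works on a plain std::vector<float> of weights instead of dist2, and the function name pick_index_blocked is made up for illustration. A return value equal to weights.size() signals that r was never reached and that the caller should draw a new random value.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the first index at which the running sum of `weights` reaches r,
// or weights.size() if the sum never gets there. `block_size` must be > 0.
std::size_t pick_index_blocked(const std::vector<float>& weights, float r,
                               std::size_t block_size = 10000) {
    const std::size_t n = weights.size();
    float sum_cdf = 0.0f;
    std::size_t ndx = 0;
    while (sum_cdf < r && ndx < n) {
        float block_sum = 0.0f;
        std::size_t block_end = std::min(ndx + block_size, n);
        for (std::size_t i = ndx; i < block_end; ++i) {
            block_sum += weights[i];
            if (sum_cdf + block_sum >= r) {
                block_end = i;  // the target falls into element i
                break;
            }
        }
        sum_cdf += block_sum;
        ndx = block_end;
    }
    return ndx;
}
```

A caller would loop until the returned index is smaller than weights.size(), drawing a fresh r each time, and only then run the uniqueness check from the question.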