OpenMP implementation slower than serial implementation [duplicate] - c++

This question already has an answer here:
OpenMP program is slower than sequential one
(1 answer)
Closed 5 years ago.
I am currently trying to get familiar with OpenMP. For practice I implemented a greedy "learning" algorithm with OpenMP. Then I measured the time with
time ./a.out
I compared it with my serial implementation, and no matter how many iterations my program does, the OpenMP version is always significantly slower.
Here is my code; the comments should hopefully explain everything:
#include <omp.h>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <stdio.h>
#include <ctime>
#define THREADS 4
using namespace std;
struct TrainData {
double input;
double output;
};
//Long Term Memory struct
struct LTM {
double a; //parameter a of the polynomial
double b;
double c;
double score; //score to be minimized!
LTM()
{
a=0;
b=0;
c=0;
score=0;
}
//random LTM with parameters from low to high (including low and high)
LTM(int low, int high)
{
score=0;
a= rand() % high + low;
b= rand() % high + low;
c= rand() % high + low;
}
LTM(double _a, double _b, double _c)
{
a=_a;
b=_b;
c=_c;
}
void print()
{
cout<<"Score: "<<score<<endl;
cout<<"a: "<<a<<" b: "<<b<<" c: "<<c<<endl;
}
};
//the actual polynomial function, evaluated with the passed LTM
inline double evaluate(LTM &ltm, const double &x)
{
double ret;
ret = ltm.a*x*x + ltm.b*x + ltm.c;
return ret;
}
//scoring function calculates the Root Mean Square error (RMS)
inline double score_function(LTM &ltmnew, vector<TrainData> &td)
{
double score;
double val;
int tdsize=td.size();
score=0;
for(int i=0; i< tdsize; i++)
{
val = (td.at(i)).output - evaluate(ltmnew, (td.at(i)).input);
val *= val;
score += val;
}
score /= (double)tdsize;
score = sqrt(score);
return score;
}
LTM iterate(int iterations, vector<TrainData> td, int low, int high)
{
LTM fav = LTM(low,high);
fav.score = score_function(fav, td);
fav.print();
LTM favs[THREADS]; // array for collecting the favorites of each thread
#pragma omp parallel num_threads(THREADS) firstprivate(fav, low, high, td)
{
#pragma omp master
printf("Threads: %d\n", omp_get_num_threads());
LTM cand;
#pragma omp for private(cand)
for(int i=0; i<iterations; i++)
{
cand = LTM(low, high);
cand.score = score_function(cand, td);
if(cand.score < fav.score)
fav = cand;
}
//save the favorite before ending the parallel section
#pragma omp critical
favs[omp_get_thread_num()] = fav;
}
//search for the best one in the array
for(int i=0; i<THREADS; i++)
{
if(favs[i].score < fav.score)
fav=favs[i];
}
return fav;
}
//generate training data from -50 up to 50 with the train LTM
void generateTrainData(vector<TrainData> *td, LTM train)
{
#pragma omp parallel for schedule(dynamic, 25)
for(int i=-50; i< 50; i++)
{
struct TrainData d;
d.input = i;
d.output = evaluate(train, (double)i);
#pragma omp critical
td->push_back(d);
//cout<<"input: "<<d.input<<" -> "<<d.output<<endl;
}
}
int main(int argc, char *argv[])
{
int its= 10000000; //number of iterations
int a=2;
int b=4;
int c=6;
srand(time(NULL));
LTM pol = LTM(a,b,c); //original polynomial parameters
vector<TrainData> td;
//first generate some training data and save it to td
generateTrainData(&td, pol);
//try to find the best solution
LTM fav = iterate( its, td, 1, 6);
printf("Final: a=%f b=%f c=%f score: %f\n", fav.a, fav.b, fav.c, fav.score);
return 0;
}
On my home PC this implementation took 12 s; the serial one only 6 s.
If I increase the number of iterations by a factor of 10, it takes around 2 min / 1 min (OpenMP / serial).
Can anyone help me?

Okay, thanks to the comments on my initial question I could solve the performance issue.
As said in the comments, the problem was the rand() function I was using.
I replaced it with the thread-safe drand48_r().
Like this:
...
LTM(double low, double high, struct drand48_data *buff)
{
score=0;
double x;
drand48_r(buff,&x);
a= low + x * (high - low);
drand48_r(buff,&x);
b= low + x * (high - low);
drand48_r(buff,&x);
c= low + x * (high - low);
}
...
Now I get times under one second!
Thanks! :)
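For reference, here is a minimal sketch (my own, not part of the original answer) of how each thread can keep its own drand48_data buffer, seeded once per thread before the loop; the seeding expression is only an illustrative choice:
#include <omp.h>
#include <stdlib.h>   // drand48_r, srand48_r (glibc)
#include <time.h>

void example(int iterations)
{
    #pragma omp parallel
    {
        struct drand48_data buff;
        // seed each thread differently so the streams do not coincide
        srand48_r(time(NULL) + omp_get_thread_num(), &buff);

        #pragma omp for
        for (int i = 0; i < iterations; i++)
        {
            double x;
            drand48_r(&buff, &x);   // x is uniform in [0, 1)
            // ... build and score a candidate LTM with x, as in the constructor above ...
        }
    }
}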

Related

How can I make my c++ code run faster with eigen library?

I have written a parallelized C++ code, which works as follows:
There are 75 'w' points, and each of them is sent to one processor.
For each 'w' point, I define a matrix and then diagonalize it. I use the eigenvectors to compute a particular quantity by summing over the fourth power of each of their components, and then I average this quantity over 300 iterations of the matrix.
I am using the Eigen package for this calculation, and I compile the code with mpiCC -I eigen -Ofast filename.cpp. For a 512 x 512 matrix the whole procedure takes 2.5 hours. Currently I need to do the same for a 2748 x 2748 matrix, and it is still running after approximately 12.5 hours. Is there any way I can make the code run faster?
The code is given here for reference :
#include <iostream>
#include <complex>
#include <cmath>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#include<Eigen/Dense>
#include<fstream>
#include<random>
#include "mpi.h"
#define pi 3.14159
using namespace std;
using namespace Eigen;
#define no_of_processor 75 // no of processors used for computing
#define no_of_disorder_av 300 //300 iterations for each w
#define A_ratio 1 //aspect ratio Ly/Lx
#define Lx 8
#define w_init 0.1 // initial value of potential strength
#define del_w 0.036 // increment of w in each loop
#define w_loop 75 // no of different w
#define alpha (sqrt(5.0)-1.0)/(double)2.0
double onsite_pot(int x,int y, int z, double phi, double alpha_0){
double B11=alpha;
double B12=alpha;
double B13=alpha;
double b1= (double)B11*x+(double)B12*y+(double)B13*z;
double c11= 1.0-cos(2*M_PI*b1+phi); //printf("%f\n",c1);
double c12= 1.0+(alpha_0*cos(2*M_PI*b1+phi));
double c1=c11/c12;
return c1;
}
int main(int argc, char *argv[])
{
clock_t begin = clock();
/*golden ratio----------------------------*/
char filename[200];
double t=1.0;
int i,j,k,l,m;//for loops
double alpha_0=0;
int Ly=A_ratio*Lx;
int Lz= A_ratio*Lx;
int A=Lx*Ly;
int V=A*Lz; //size of the matrix
int numtasks,rank,RC;
RC=MPI_Init(&argc,&argv);
if (RC != 0) {
printf ("Error starting MPI program. Terminating.\n");
MPI_Abort(MPI_COMM_WORLD, RC);
}
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
sprintf(filename,"IPR3D%dalpha%g.dat",rank+1,alpha_0);
ofstream myfile;
myfile.open(filename, ios::app); //preparing file to write in
int n = w_loop/no_of_processor;
double w=w_init+(double)(n*rank*del_w);
int var_w_loop = 0;
MatrixXcd H(V,V); // matrix getting defined here
MatrixXcd evec(V,V); // matrix for eigenvector
VectorXcd temp(V); // vector for a temporary space used later in calculation
double IPR[V], E_levels[V]; // for average value of the quantity and eigen values.
do{
for(i=0;i<V;i++)
{
IPR[i]=E_levels[i]=0.0;
}
/*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
/*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
/*----loop for disorder average---------------------------*/
for (l=0;l<no_of_disorder_av;l++){
for (i=V; i<V; i++)
{
for (j=V; j<V; j++)
H(i,j)=0;
}
double phi=(2*M_PI*(double)l)/(double)no_of_disorder_av;
//matrix assignment starts
int z=0;
for (int plane=0; plane<Lz; plane++)
{
z += 1 ;
int y=0;
int indx1= plane*A ; //initial index of each plane
int indx2= indx1+A-1; // last index of each plane
for (int linchain=0; linchain<Ly; linchain++){
y += 1;
int x=0;
int indx3= indx1 + linchain*Lx ; //initial index of each chain
int indx4= indx3 + Lx-1 ; //last index of each chain
for (int latpoint=0; latpoint<Lx; latpoint++){
x += 1;
int indx5= indx3 +latpoint; //index of each lattice point
H(indx5,indx5)= 2*w*onsite_pot(x,y,z,phi,alpha_0); //onsite potential
if (indx5<indx4){ //hopping inside a chain
H(indx5,(indx5+1))= t; //printf("%d %d\n",indx5,indx5+1);
H((indx5+1),indx5)= t;
}
if (indx5<=(indx2-Lx)){ //hopping between different chain
H(indx5,(indx5+Lx))= t; //printf("%d %d\n",indx5,indx5+Lx);
H((indx5+Lx),indx5)= t;
printf("%d\n",indx5);
}
if (indx5<(V-A)){
H(indx5,(indx5+A))= t; //printf("%d %d\n",indx5,indx5+A);// hopping between different plane
H((indx5+A),indx5)= t;
}
} //latpoint loop
}//linchain loop
}//plane loop
//PB..............................................
for (int plane=0; plane<Lz; plane++){
int indx1= plane*A; //initial index of each plane
int indx2 = indx1+A-1 ;//last indx of each plane
//periodic boundary condition x
for (int linchain=0; linchain<Ly; linchain++){
int indx3 = indx1 + linchain*Lx; // initial index of each chain
int indx4=indx3+ Lx-1; //last index of each chain
H(indx3,indx4)= t; //printf("%d %d\n",indx3,indx4);
H(indx4,indx3)= t;
}//linchain loop
//periodic boundary condition y
for (int i=0; i<Lx; i++){
int indx5 = indx1+i;
int indx6 = indx5+(Ly-1)*Lx; //printf("%d %d\n",indx5,indx6);
H(indx5,indx6)=t;
H(indx6,indx5)=t;
}
}//plane loop
//periodic boundary condition in z
for (int i=0; i<A; i++){
int indx1=i ;
int indx2=(Lz-1)*A+i ;
H(indx1,indx2)= t; //printf("%d %d\n",indx1,indx2);
H(indx2,indx1)= t ;
}
//matrix assignment ends
/**-------------------------------------------------------*/
double Tr = abs(H.trace());
for(i=0;i<V;i++)
{
for(j=0;j<V;j++)
{
if(i==j)
{
H(i,j) = H(i,j)-(Tr/(double)V);
}
}
}
SelfAdjointEigenSolver<MatrixXcd> es(H); //defining the diagonalizing function
double *E = NULL;
E = new double[V]; // for the eigenvalues
for(i=0;i<V;i++)
{
E[i]=es.eigenvalues()[i];
//cout<<"original eigenvalues "<<E[i]<<"\n";
}
evec=es.eigenvectors();
double bandwidth = E[V-1] - E[0];
for(i=0;i<V;i++)
E[i]=E[i]/bandwidth;
for(i=0;i<V;i++)
{
E_levels[i] = E_levels[i]+E[i]; //summing over energies for each iteration
}
delete[] E;
E=NULL;
//main calculation process
for(i=0;i<V;i++)
{
temp = evec.col(i);
double num=0.0,denom=0.0;
for(j=0;j<V;j++)
{
num=num+pow(abs(temp(j)),4);
denom=denom+pow(abs(temp(j)),2);
}
IPR[i] = IPR[i]+(num/(denom*denom));
} //calculation ends
}//no_of_disorder_av loop (l)
for(i=0; i<V; i++)
{
myfile<<w<<"\t"<<(E_levels[i]/(double)no_of_disorder_av)<<"\t"
<<(IPR[i]/(double)no_of_disorder_av)<<"\n"; //taking output in file
}
var_w_loop++; // counts number of w loop
w+= del_w; // proceeds to next w
}while(var_w_loop<n) ; // w varying do while loop
MPI_Finalize();
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("time spent %f s\n\n",time_spent);
return 0;
}
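As an illustrative aside (a sketch only, not an answer from the thread): the per-eigenvector quantity computed in the "main calculation process" loop above, the sum of |c_j|^4 divided by the square of the sum of |c_j|^2, can also be written compactly with Eigen's array operations, reusing the question's evec, IPR and V:
// Sketch: same per-eigenvector quantity as the explicit loops above.
for (int i = 0; i < V; i++)
{
    Eigen::ArrayXd p = evec.col(i).cwiseAbs2().array(); // |c_j|^2 for eigenvector i
    double num   = (p * p).sum();                       // sum_j |c_j|^4
    double denom = p.sum();                             // sum_j |c_j|^2
    IPR[i] += num / (denom * denom);
}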

Why does threading floating point computations on the CPU make them take significantly longer?

I am currently working on a scientific simulation (gravitational n-body). I first wrote it with a naive single-threaded algorithm, and this performed acceptably for a small number of particles. I then multi-threaded this algorithm (it is embarrassingly parallel), and the program took about 3x as long. What follows is a minimal, complete, verifiable example of a trivial algorithm with similar properties that outputs to a file in /tmp (it is designed to run on Linux, but the C++ is otherwise standard). Be warned that if you decide to run this code, it will produce a 152.62 MB file. The data is written out to prevent the compiler from optimizing the computation out of the program.
#include <iostream>
#include <functional>
#include <thread>
#include <vector>
#include <atomic>
#include <random>
#include <fstream>
#include <chrono>
constexpr unsigned ITERATION_COUNT = 2000;
constexpr unsigned NUMBER_COUNT = 10000;
void runThreaded(unsigned count, unsigned batchSize, std::function<void(unsigned)> callback){
unsigned threadCount = std::thread::hardware_concurrency();
std::vector<std::thread> threads;
threads.reserve(threadCount);
std::atomic<unsigned> currentIndex(0);
for(unsigned i=0;i<threadCount;++i){
threads.emplace_back([&currentIndex, batchSize, count, callback]{
unsigned startAt = currentIndex.fetch_add(batchSize);
if(startAt >= count){
return;
}else{
for(unsigned i=0;i<count;++i){
unsigned index = startAt+i;
if(index >= count){
return;
}
callback(index);
}
}
});
}
for(std::thread &thread : threads){
thread.join();
}
}
void threadedTest(){
std::mt19937_64 rnd(0);
std::vector<double> numbers;
numbers.reserve(NUMBER_COUNT);
for(unsigned i=0;i<NUMBER_COUNT;++i){
numbers.push_back(rnd());
}
std::vector<double> newNumbers = numbers;
std::ofstream fout("/tmp/test-data.bin");
for(unsigned i=0;i<ITERATION_COUNT;++i) {
std::cout << "Iteration: " << i << "/" << ITERATION_COUNT << std::endl;
runThreaded(NUMBER_COUNT, 100, [&numbers, &newNumbers](unsigned x){
double total = 0;
for(unsigned y=0;y<NUMBER_COUNT;++y){
total += numbers[y]*(y-x)*(y-x);
}
newNumbers[x] = total;
});
fout.write(reinterpret_cast<char*>(newNumbers.data()), newNumbers.size()*sizeof(double));
std::swap(numbers, newNumbers);
}
}
void unThreadedTest(){
std::mt19937_64 rnd(0);
std::vector<double> numbers;
numbers.reserve(NUMBER_COUNT);
for(unsigned i=0;i<NUMBER_COUNT;++i){
numbers.push_back(rnd());
}
std::vector<double> newNumbers = numbers;
std::ofstream fout("/tmp/test-data.bin");
for(unsigned i=0;i<ITERATION_COUNT;++i){
std::cout << "Iteration: " << i << "/" << ITERATION_COUNT << std::endl;
for(unsigned x=0;x<NUMBER_COUNT;++x){
double total = 0;
for(unsigned y=0;y<NUMBER_COUNT;++y){
total += numbers[y]*(y-x)*(y-x);
}
newNumbers[x] = total;
}
fout.write(reinterpret_cast<char*>(newNumbers.data()), newNumbers.size()*sizeof(double));
std::swap(numbers, newNumbers);
}
}
int main(int argc, char *argv[]) {
if(argv[1][0] == 't'){
threadedTest();
}else{
unThreadedTest();
}
return 0;
}
When I run this (compiled with clang 7.0.1 on Linux), I get the following times from the Linux time command. The difference between these is similar to what I see in my real program. The entry labelled "real" is what is relevant to this question, as this is the clock time that the program takes to run.
Single-threaded:
real 6m27.261s
user 6m27.081s
sys 0m0.051s
Multi-threaded:
real 14m32.856s
user 216m58.063s
sys 0m4.492s
So I ask: what is causing this massive slowdown when I expect it to speed up significantly (roughly by a factor of 8, as I have an 8-core, 16-thread CPU)? I am not implementing this on the GPU, because the next step is to change the algorithm from O(n²) to O(n log n), and that version is also not amenable to a GPU. The changed algorithm will differ less from my currently implemented O(n²) algorithm than the included example does. Lastly, I want to note that the subjective time to run each iteration (judged by the time between the iteration lines appearing) varies significantly in both the threaded and unthreaded runs.
It's kind of hard to follow this code, but I think you're duplicating work on a massive scale because each thread does nearly all the work, just skipping a small portion of it at the start.
I'm presuming the inner loop of runThreaded should be:
unsigned startAt = currentIndex.fetch_add(batchSize);
while (startAt < count) {
if (startAt >= count) {
return;
} else {
for(unsigned i=0;i<batchSize;++i){
unsigned index = startAt+i;
if(index >= count){
return;
}
callback(index);
}
}
startAt = currentIndex.fetch_add(batchSize);
}
The key here is i < batchSize: each thread should only do as much work as its batch dictates, not count iterations, which is the whole list minus the initial offset.
With this update the code runs significantly faster. I'm not sure whether it does all the required work, because the output is very minimal and it's hard to tell.
For easy parallelization over multiple CPUs I recommend using tbb::parallel_for. It uses the correct number of CPUs and splits the range for you, completely eliminating the risk of implementing it wrong. Alternatively, there is a parallel std::for_each in C++17 (via execution policies). In other words, this problem has a number of good solutions.
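A minimal sketch of that suggestion (my own illustration, not part of the benchmark below), applying tbb::parallel_for to the question's per-index loop:
#include <tbb/parallel_for.h>
#include <vector>

// Sketch: TBB chooses the thread count and splits the [0, n) index range.
void computeAll(const std::vector<double>& numbers, std::vector<double>& newNumbers)
{
    const int n = static_cast<int>(numbers.size());
    tbb::parallel_for(0, n, [&](int x) {
        double total = 0;
        for (int y = 0; y < n; ++y)
            total += numbers[y] * (y - x) * (y - x);
        newNumbers[x] = total;
    });
}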
Vectorizing code is a difficult problem, and neither clang++-6 nor g++-8 auto-vectorizes the baseline code. Hence, for the SIMD version below I used the excellent Vc library: portable, zero-overhead C++ types for explicitly data-parallel programming.
Below is a working benchmark that compares:
The baseline version.
SIMD version.
SIMD + multi-threading version.
#include <Vc/Vc>
#include <tbb/parallel_for.h>
#include <algorithm>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
constexpr int ITERATION_COUNT = 20;
constexpr int NUMBER_COUNT = 20000;
double baseline() {
double result = 0;
std::vector<double> newNumbers(NUMBER_COUNT);
std::vector<double> numbers(NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers)
n = rnd();
for(int i = 0; i < ITERATION_COUNT; ++i) {
for(int x = 0; x < NUMBER_COUNT; ++x) {
double total = 0;
for(int y = 0; y < NUMBER_COUNT; ++y) {
auto d = (y - x);
total += numbers[y] * (d * d);
}
newNumbers[x] = total;
}
result += std::accumulate(newNumbers.begin(), newNumbers.end(), 0.);
swap(numbers, newNumbers);
}
return result;
}
double simd() {
double result = 0;
constexpr int SIMD_NUMBER_COUNT = NUMBER_COUNT / Vc::double_v::Size;
using vector_double_v = std::vector<Vc::double_v, Vc::Allocator<Vc::double_v>>;
vector_double_v newNumbers(SIMD_NUMBER_COUNT);
vector_double_v numbers(SIMD_NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers) {
alignas(Vc::VectorAlignment) double t[Vc::double_v::Size];
for(double& v : t)
v = rnd();
n.load(t, Vc::Aligned);
}
Vc::double_v const incv(Vc::double_v::Size);
for(int i = 0; i < ITERATION_COUNT; ++i) {
Vc::double_v x(Vc::IndexesFromZero);
for(auto& new_n : newNumbers) {
Vc::double_v totals;
int y = 0;
for(auto const& n : numbers) {
for(unsigned j = 0; j < Vc::double_v::Size; ++j) {
auto d = y - x;
totals += n[j] * (d * d);
++y;
}
}
new_n = totals;
x += incv;
}
result += std::accumulate(newNumbers.begin(), newNumbers.end(), Vc::double_v{}).sum();
swap(numbers, newNumbers);
}
return result;
}
double simd_mt() {
double result = 0;
constexpr int SIMD_NUMBER_COUNT = NUMBER_COUNT / Vc::double_v::Size;
using vector_double_v = std::vector<Vc::double_v, Vc::Allocator<Vc::double_v>>;
vector_double_v newNumbers(SIMD_NUMBER_COUNT);
vector_double_v numbers(SIMD_NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers) {
alignas(Vc::VectorAlignment) double t[Vc::double_v::Size];
for(double& v : t)
v = rnd();
n.load(t, Vc::Aligned);
}
Vc::double_v const v0123(Vc::IndexesFromZero);
for(int i = 0; i < ITERATION_COUNT; ++i) {
constexpr int SIMD_STEP = 4;
tbb::parallel_for(0, SIMD_NUMBER_COUNT, SIMD_STEP, [&](int ix) {
Vc::double_v xs[SIMD_STEP];
for(int is = 0; is < SIMD_STEP; ++is)
xs[is] = v0123 + (ix + is) * Vc::double_v::Size;
Vc::double_v totals[SIMD_STEP];
int y = 0;
for(auto const& n : numbers) {
for(unsigned j = 0; j < Vc::double_v::Size; ++j) {
for(int is = 0; is < SIMD_STEP; ++is) {
auto d = y - xs[is];
totals[is] += n[j] * (d * d);
}
++y;
}
}
std::copy_n(totals, SIMD_STEP, &newNumbers[ix]);
});
result += std::accumulate(newNumbers.begin(), newNumbers.end(), Vc::double_v{}).sum();
swap(numbers, newNumbers);
}
return result;
}
struct Stopwatch {
using Clock = std::chrono::high_resolution_clock;
using Seconds = std::chrono::duration<double>;
Clock::time_point start_ = Clock::now();
Seconds elapsed() const {
return std::chrono::duration_cast<Seconds>(Clock::now() - start_);
}
};
std::ostream& operator<<(std::ostream& s, Stopwatch::Seconds const& a) {
auto precision = s.precision(9);
s << std::fixed << a.count() << std::resetiosflags(std::ios_base::floatfield) << 's';
s.precision(precision);
return s;
}
void benchmark() {
Stopwatch::Seconds baseline_time;
{
Stopwatch s;
double result = baseline();
baseline_time = s.elapsed();
std::cout << "baseline: " << result << ", " << baseline_time << '\n';
}
{
Stopwatch s;
double result = simd();
auto time = s.elapsed();
std::cout << " simd: " << result << ", " << time << ", " << (baseline_time / time) << "x speedup\n";
}
{
Stopwatch s;
double result = simd_mt();
auto time = s.elapsed();
std::cout << " simd_mt: " << result << ", " << time << ", " << (baseline_time / time) << "x speedup\n";
}
}
int main() {
benchmark();
benchmark();
benchmark();
}
Timings:
baseline: 2.76582e+257, 6.399848397s
simd: 2.76582e+257, 1.600373449s, 3.99897x speedup
simd_mt: 2.76582e+257, 0.168638435s, 37.9501x speedup
Notes:
My machine supports AVX but not AVX-512, so the speedup from SIMD is roughly 4x.
The simd_mt version uses 8 threads on my machine and larger SIMD steps. The theoretical speedup is 128x; in practice it is about 38x.
Neither clang++-6 nor g++-8 can auto-vectorize the baseline code.
g++-8 generates considerably faster code for the SIMD versions than clang++-6.
Your heart is certainly in the right place, minus a bug or two.
par_for is a complex issue that depends on the payload of your loop; there is no one-size-fits-all solution. The payload can be anything from a couple of adds to almost indefinitely blocking mutex sections, for example when doing memory allocation.
The atomic-variable-as-work-item pattern has always worked well for me, but remember that atomic variables have a high cost on x86 (~400 cycles), and they can incur a high cost even when they sit in an unexecuted branch, as I found to my cost.
Some permutation of the following is usually good. Choosing the right chunks_per_thread (your batchSize) is critical. If you don't trust your users, you can test-execute a few iterations of the loop to guess the best chunking level.
#include <atomic>
#include <future>
#include <thread>
#include <vector>
#include <stdio.h>
template<typename Func>
void par_for(int start, int end, int step, int chunks_per_thread, Func func) {
using namespace std;
using namespace chrono;
atomic<int> work_item{start};
vector<future<void>> futures(std::thread::hardware_concurrency());
for (auto &fut : futures) {
fut = async(std::launch::async, [&work_item, end, step, chunks_per_thread, &func]() {
for(;;) {
int wi = work_item.fetch_add(step * chunks_per_thread);
if (wi > end) break;
int wi_max = std::min(end, wi+step * chunks_per_thread);
while (wi < wi_max) {
func(wi);
wi += step;
}
}
});
}
for (auto &fut : futures) {
fut.wait();
}
}
int main() {
using namespace std;
using namespace chrono;
for (int k = 0; k != 2; ++k) {
auto t0 = high_resolution_clock::now();
constexpr int loops = 100000000;
if (k == 0) {
for (int i = 0; i != loops; ++i ) {
if (i % 10000000 == 0) printf("%d\n", i);
}
} else {
par_for(0, loops, 1, 100000, [](int i) {
if (i % 10000000 == 0) printf("%d\n", i);
});
}
auto t1 = high_resolution_clock::now();
duration<double, milli> ns = t1 - t0;
printf("k=%d %fms total\n", k, ns.count());
}
}
results
...
k=0 174.925903ms total
...
k=1 27.924738ms total
About a 6x speedup.
I avoid the term "embarrassingly parallel", as it is almost never the case. You pay exponentially higher costs the more resources you use on the journey from level-1 cache (ns latency) to a globe-spanning cluster (ms latency). But I hope this code snippet is useful as an answer.
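For reference, a sketch (my adaptation, not from the original answer) of how the par_for above could be applied to the question's per-element computation, assuming the numbers/newNumbers vectors from the question and the includes from the snippet above:
// Sketch: one call per outer index x, chunked 100 indices at a time (the original batchSize).
void computeIteration(const std::vector<double>& numbers, std::vector<double>& newNumbers)
{
    const int count = static_cast<int>(numbers.size());
    par_for(0, count, 1, 100, [&](int x) {
        double total = 0;
        for (int y = 0; y < count; ++y)
            total += numbers[y] * (y - x) * (y - x);
        newNumbers[x] = total;
    });
}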

Thread-safe parallel RNG slower than sequential rand()

I use this version of the calculation of Pi with the thread-safe function
rand_r
but it appears to be slower (and the answer is wrong) when running this program in parallel, compared to the sequential program that uses
rand()
which is not thread-safe. It seems that this way of using rand_r is also not thread-safe, but I do not understand why, because I have read many questions about thread-safe PRNGs and learned that rand_r should be safe enough.
#include <iostream>
#include <random>
#include <ctime>
#include "omp.h"
#include <stdlib.h>
using namespace std;
unsigned seed;
int main()
{
double start = time(0);
int i, n, N;
double x, y;
N = 1<<30;
n = 0;
double pi;
#pragma omp threadprivate(seed)
#pragma omp parallel private(x, y) reduction(+:n)
{
for (i = 0; i < N; i++) {
seed = 25234 + 17 * omp_get_thread_num();
x = rand_r(&seed) / (double) RAND_MAX;
y = rand_r(&seed) / (double) RAND_MAX;
if (x*x + y*y <= 1)
n++;
}
}
pi = 4. * n / (double) (N);
cout << pi << endl;
double stop = time(0);
cout << (stop - start) << endl;
return 0;
}
P.S. By the way, what are the magic numbers in
seed = 25234 + 17 * omp_get_thread_num();
? I stole them from some answer.
EDIT: The comment by Gilles helped me. The resolution was:
1. To swap the seed initialization and the for loop, so the seed is set once per thread rather than on every iteration.
2. To add #pragma omp for.
The modified code reads:
#pragma omp parallel private(x, y, seed)
{
seed = 25234 + 17 * omp_get_thread_num();
#pragma omp for reduction(+:n)
for (int i = 0; i < N; i++) {
x = (double) rand_r(&seed) / (double) RAND_MAX;
y = (double) rand_r(&seed) / (double) RAND_MAX;
if (x*x + y*y <= 1)
n++;
}
}
The problem is resolved.
Apparently rand_r() involves more instructions than rand(); the code below is copied from one implementation. So it is reasonable that rand_r() takes more time to complete one round than rand().
int
rand_r(unsigned int *ctx)
{
u_long val = (u_long) *ctx;
int r = do_rand(&val);
*ctx = (unsigned int) val;
return (r);
}
static u_long next = 1;
int
rand()
{
return (do_rand(&next));
}
And since rand() is not thread-safe, the output can be incorrect if you use rand() in parallel. The worse part is that you would still get a result and would not know whether it is correct in a small-scale test.
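As an aside (my own sketch, not part of the answer): the same Monte Carlo loop can also use one C++ <random> engine per thread, which sidesteps both the shared state of rand() and the cost of rand_r; the seed expression mirrors the question's magic numbers purely for illustration:
#include <omp.h>
#include <random>

// Sketch: one independently seeded engine per thread.
double estimate_pi(long N)
{
    long n = 0;
    #pragma omp parallel reduction(+:n)
    {
        std::mt19937 gen(25234 + 17 * omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        #pragma omp for
        for (long i = 0; i < N; i++) {
            double x = dist(gen), y = dist(gen);
            if (x * x + y * y <= 1.0)
                n++;
        }
    }
    return 4.0 * n / (double)N;
}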

Having problems with ctime, and working out function running time

I'm having trouble working out the running time of my two max-subarray functions (timed right at the bottom of the code).
The output it gives me:
Inputsize: 101 Time using Brute Force: 0 Time Using DivandCon: 12
The second clock() difference looks correct, but the first difference, diff1, just gives me 0 and I'm not sure why.
Edit: Revised Code.
Edit2: Added Output.
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <limits.h>
using namespace std;
int Kedane(int a[], int size)
{
int max_so_far = 0, max_ending_here = 0;
int i;
for(i = 0; i < size; i++)
{
max_ending_here = max_ending_here + a[i];
if(max_ending_here < 0)
max_ending_here = 0;
if(max_so_far < max_ending_here)
max_so_far = max_ending_here;
}
return max_so_far;
}
int BruteForce(int array[],int n)
{
int sum,ret=0;
for(int j=-1;j<=n-2;j++)
{
sum=0;
for(int k=j+1;k<+n-1;k++)
{
sum+=array[k];
if(sum>ret)
{
ret=sum;
}
}
}
return ret;
}
//------------------------------------------------------
// FUNCTION WHICH FINDS MAX OF 2 INTS
int max(int a, int b) { return (a > b)? a : b; }
// FUNCTION WHICH FINDS MAX OF 3 NUMBERS
// CALL MAX FUNCT FOR 2 VARIS TWICE!
int max(int a, int b, int c) { return max(max(a, b), c); }
// WORKS OUT FROM MIDDLE+1->RIGHT THE MAX SUM &
// THE MAX SUM FROM MIDDLE->LEFT + RETURNS SUM OF THESE
int maxCrossingSum(int arr[], int l, int m, int h)
{
int sum = 0; // LEFT OF MID
int LEFTsum = INT_MIN; // INITIALISES SUM TO LOWEST POSSIBLE INT
for (int i = m; i >= l; i--)
{
sum = sum + arr[i];
if (sum > LEFTsum)
LEFTsum = sum;
}
sum = 0; // RIGHT OF MID
int RIGHTsum = INT_MIN;
for (int i = m+1; i <= h; i++)
{
sum = sum + arr[i];
if (sum > RIGHTsum)
RIGHTsum = sum;
}
// RETURN SUM OF BOTH LEFT AND RIGHT SIDE MAX'S
return LEFTsum + RIGHTsum;
}
// Returns sum of maximum sum subarray in arr[l..h]
int maxSubArraySum(int arr[], int l, int h)
{
// Base Case: Only one element
if (l == h)
return arr[l];
// Find middle point
int m = (l + h)/2;
/* Return maximum of following three possible cases
a) Maximum subarray sum in left half
b) Maximum subarray sum in right half
c) Maximum subarray sum such that the subarray crosses the midpoint */
return max(maxSubArraySum(arr, l, m),
maxSubArraySum(arr, m+1, h),
maxCrossingSum(arr, l, m, h));
}
// DRIVER
int main(void)
{
std::srand (time(NULL));
// CODE TO FILL ARRAY WITH RANDOMS [-50;50]
int size=30000;
int array[size];
for(int i=0;i<=size;i++)
{
array[i]=(std::rand() % 100) -50;
}
// TIMING VARIABLES
clock_t t1,t2;
clock_t A,B;
clock_t K1,K2;
volatile int mb, md, qq;
//VARYING ELEMENTS IN THE ARRAY
for(int n=101;n<size;n=n+100)
{
t1=clock();
mb=BruteForce(array,n);
t2=clock();
A=clock();
md=maxSubArraySum(array, 0, n-1) ;
B=clock();
K1=clock();
qq=Kedane(array, n);
K2=clock();
cout<< n << "," << (double)t2-(double)t1 << ","<<(double)B-(double)A << ","<<(double)K2-(double)K1<<endl;
}
return 0;
}
101,0,0,0
201,0,0,0
301,1,0,0
401,0,0,0
501,0,0,0
601,0,0,0
701,0,0,0
801,1,0,0
901,1,0,0
1001,0,0,0
1101,1,0,0
1201,1,0,0
1301,0,0,0
1401,1,0,0
1501,1,0,0
1601,2,0,0
1701,1,0,0
1801,2,0,0
1901,1,1,0
2001,1,0,0
2101,2,0,0
2201,3,0,0
2301,2,0,0
2401,3,0,0
2501,3,0,0
2601,3,0,0
2701,4,0,0
2801,4,0,0
2901,4,0,0
3001,4,0,0
3101,4,0,0
3201,5,0,0
3301,5,0,0
3401,6,0,0
3501,5,0,0
3601,6,0,0
3701,6,0,0
3801,8,0,0
3901,7,0,0
4001,8,0,0
4101,7,0,0
4201,10,1,0
4301,9,0,0
4401,8,0,0
4501,9,0,0
4601,10,0,0
4701,11,0,0
4801,11,0,0
4901,11,0,0
5001,12,0,1
5101,11,1,0
5201,13,0,0
5301,13,0,0
5401,15,0,0
5501,14,0,0
5601,16,0,0
5701,15,0,0
5801,15,1,0
5901,16,0,0
6001,17,0,0
6101,18,0,0
6201,18,0,0
6301,19,0,0
6401,21,0,0
6501,19,0,0
6601,21,1,0
6701,20,0,0
6801,22,0,0
6901,23,0,0
7001,22,0,0
7101,24,0,0
7201,26,0,0
7301,26,0,0
7401,24,1,0
7501,26,0,0
7601,27,0,0
7701,28,0,0
7801,28,0,0
7901,30,0,0
8001,29,0,0
8101,31,0,0
8201,31,1,0
8301,35,0,0
8401,33,0,0
8501,35,0,0
8601,35,1,0
8701,35,0,0
8801,36,1,0
8901,37,0,0
9001,38,0,0
9101,39,0,0
9201,41,1,0
9301,40,0,0
9401,41,0,0
9501,42,0,0
9601,45,0,0
9701,45,0,0
9801,44,0,0
9901,47,0,0
10001,47,0,0
10101,48,0,0
10201,50,0,0
10301,51,0,0
10401,50,0,0
10501,51,0,0
10601,53,0,0
10701,55,0,0
10801,54,0,0
10901,56,0,0
11001,57,0,0
11101,56,0,0
11201,60,0,0
11301,60,0,0
11401,61,1,0
11501,61,1,0
11601,63,0,0
11701,62,1,0
11801,66,1,0
11901,65,0,0
12001,68,1,0
12101,68,0,0
12201,70,0,0
12301,71,0,0
12401,72,0,0
12501,73,1,0
12601,73,1,0
12701,76,0,0
12801,77,0,0
12901,78,1,0
13001,79,1,0
13101,80,0,0
13201,83,0,0
13301,82,0,0
13401,86,0,0
13501,85,1,0
13601,86,0,0
13701,89,0,0
13801,90,0,1
13901,90,0,0
14001,91,0,0
14101,97,0,0
14201,93,0,0
14301,96,0,0
14401,99,0,0
14501,100,0,0
14601,101,0,0
14701,101,0,0
14801,103,1,0
14901,104,0,0
15001,107,0,0
15101,108,0,0
15201,109,0,0
15301,109,0,0
15401,114,0,0
15501,114,0,0
15601,115,0,0
15701,116,0,0
15801,119,0,0
15901,118,0,0
16001,124,0,0
16101,123,1,0
16201,123,1,0
16301,125,0,0
16401,127,1,0
16501,128,1,0
16601,131,0,0
16701,132,0,0
16801,134,0,0
16901,134,1,0
17001,135,1,0
17101,139,0,0
17201,139,0,0
17301,140,1,0
17401,143,0,0
17501,145,0,0
17601,147,0,0
17701,147,0,0
17801,150,1,0
17901,152,1,0
18001,153,0,0
18101,155,0,0
18201,157,0,0
18301,157,1,0
18401,160,0,0
18501,160,1,0
18601,163,1,0
18701,165,0,0
18801,169,0,0
18901,171,0,1
19001,170,1,0
19101,173,1,0
19201,178,0,0
19301,175,1,0
19401,176,1,0
19501,180,0,0
19601,180,1,0
19701,182,1,0
19801,184,0,0
19901,187,1,0
20001,188,1,0
20101,191,0,0
20201,192,1,0
20301,193,1,0
20401,195,0,0
20501,199,0,0
20601,200,0,0
20701,201,0,0
20801,209,1,0
20901,210,0,0
21001,206,0,0
21101,210,0,0
21201,210,0,0
21301,213,0,0
21401,215,1,0
21501,217,1,0
21601,218,1,0
21701,221,1,0
21801,222,1,0
21901,226,1,0
22001,225,1,0
22101,229,0,0
22201,232,0,0
22301,233,1,0
22401,234,1,0
22501,237,1,0
22601,238,0,1
22701,243,0,0
22801,242,1,0
22901,246,1,0
23001,246,0,0
23101,250,1,0
23201,250,1,0
23301,254,1,0
23401,254,0,0
23501,259,0,1
23601,260,1,0
23701,263,1,0
23801,268,0,0
23901,266,1,0
24001,271,0,0
24101,272,1,0
24201,274,1,0
24301,280,0,1
24401,279,0,0
24501,281,0,0
24601,285,0,0
24701,288,0,0
24801,289,0,0
24901,293,0,0
25001,295,1,0
25101,299,1,0
25201,299,1,0
25301,302,0,0
25401,305,1,0
25501,307,0,0
25601,310,1,0
25701,315,0,0
25801,312,1,0
25901,315,0,0
26001,320,1,0
26101,320,0,0
26201,322,0,0
26301,327,1,0
26401,329,0,0
26501,332,1,0
26601,339,1,0
26701,334,1,0
26801,337,0,0
26901,340,0,0
27001,341,1,0
27101,342,1,0
27201,347,0,0
27301,348,1,0
27401,351,1,0
27501,353,0,0
27601,356,1,0
27701,360,0,1
27801,361,1,0
27901,362,1,0
28001,366,1,0
28101,370,0,1
28201,372,0,0
28301,375,1,0
28401,377,1,0
28501,380,0,0
28601,384,1,0
28701,384,0,0
28801,388,1,0
28901,391,1,0
29001,392,1,0
29101,399,1,0
29201,399,0,0
29301,404,1,0
29401,405,0,0
29501,409,1,0
29601,412,2,0
29701,412,1,0
29801,422,1,0
29901,419,1,0
The return values from BruteForce and maxSubArraySum are never used, and this gives the compiler a lot of latitude when it comes to optimizing them.
On my machine for example, using clang -O3 reduces the call to BruteForce to a vector copy and nothing else.
One method for forcing the evaluation of these functions is to write their results to volatile variables:
volatile int mb, md;
// ...
mb = BruteForce(array, n);
// ...
md = maxSubArraySum(array, 0, n-1);
As the variables are volatile, the value given by the right-hand side of the assignments must be stored, despite the absence of any other side-effects, which prevents the compiler from optimising the computation away.
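If finer-grained timestamps are also wanted, here is a sketch of my own (not part of the answer, and assuming it sits in the same file as the question's code, after BruteForce is defined) that keeps the volatile sink but uses std::chrono instead of clock():
#include <chrono>
#include <iostream>

volatile int sink;   // volatile sink, as described above

// Sketch: time one call to BruteForce with a steady, high-resolution clock.
void timeBruteForce(int array[], int n)
{
    auto t1 = std::chrono::steady_clock::now();
    sink = BruteForce(array, n);
    auto t2 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t2 - t1).count();
    std::cout << n << "," << ms << " ms" << std::endl;
}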

OpenMP C++ Not able to get linear speedup with number of processors [closed]

Closed 9 years ago.
Please see the results below and let me know where I can optimise my code further to get a better speedup.
Result
Machine used: MacBook Pro, Processor: 2.5 GHz Intel Core i5 (at least 4 logical cores)
Memory: 4GB 1600 MHz
Compiler: Mac OS X compiler
Sequential time: 0.016466
Using two threads: 0.0120111
Using four threads: 0.0109911 (speedup ~ 1.5)
Using 8 threads: 0.0111289
II Machine:
OS: Linux
Hardware: Intel(R) Core™ i5-3550 CPU @ 3.30 GHz × 4
Memory: 7.7 GiB
Compiler: G++ version 4.6
Sequential time: 0.0128901
Using two threads: 0.00838804
Using four threads: 0.00612688 (speedup = 2)
Using 8 threads: 0.0101049
Please let me know what overhead in my code is preventing a linear speedup. There is not much in the code. I am calling the function "findParallelUCHWOUP" in the main function like this:
#pragma omp parallel for private(th_id)
for (th_id = 0; th_id < nthreads; th_id++)
findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);
Code:
class Point {
double i, j;
public:
Point() {
i = 0;
j = 0;
}
Point(double x, double y) {
i = x;
j = y;
}
double x() const {
return i;
}
double y() const {
return j;
}
void setValue(double x, double y) {
i = x;
j = y;
}
};
typedef std::vector<Point> Vector;
int second(std::stack<int> &s);
double crossProduct(Point v[], int a, int b, int c);
bool myfunction(Point a, Point b) {
return ((a.x() < b.x()) || (a.x() == b.x() && a.y() < b.y()));
}
class CTPoint {
int i, j;
public:
CTPoint() {
i = 0;
j = 0;
}
CTPoint(int x, int y) {
i = x;
j = y;
}
double getI() const {
return i;
}
double getJ() const {
return j;
}
};
const int nthreads = 4;
const int inp_size = 1000000;
Point output[inp_size];
int numElems = inp_size / nthreads;
int sizes[nthreads];
CTPoint ct[nthreads][nthreads];
//function that is called from different threads
int findParallelUCHWOUP(Point* iv, int id, int thread_num, int inp_size, int first, int last) {
output[first] = iv[first];
std::stack<int> s;
s.push(first);
int i = first + 1;
while (i < last) {
if (crossProduct(iv, i, first, last) > 0) {
s.push(i);
i++;
break;
} else {
i++;
}
}
if (i == last) {
s.push(last);
return 0;
}
for (; i <= last; i++) {
if (crossProduct(iv, i, first, last) >= 0) {
while (s.size() > 1 && crossProduct(iv, s.top(), second(s), i) <= 0) {
s.pop();
}
s.push(i);
}
}
int count = s.size();
sizes[id - 1] = count;
while (!s.empty()) {
output[first + count - 1] = iv[s.top()];
s.pop();
count--;
}
return 0;
}
double crossProduct(Point* v, int a, int b, int c) {
return (v[c].x() - v[b].x()) * (v[a].y() - v[b].y())
- (v[a].x() - v[b].x()) * (v[c].y() - v[b].y());
}
int second(std::stack<int> &s) {
int temp = s.top();
s.pop();
int sec = s.top();
s.push(temp);
return sec;
}
//reads points from a file and divides the array of points to different threads
int main(int argc, char *argv[]) {
// read points from a file and assign them to the input array.
Point *points = new Point[inp_size];
unsigned i = 0;
while (i < Points.size()) {
points[i] = Points[i];
i++;
}
numElems = inp_size / nthreads;
int first[nthreads];
int last[nthreads];
for(int i=1;i<=nthreads;i++){
first[i-1] = (i - 1) * numElems;
if (i == nthreads) {
last[i-1] = inp_size - 1;
} else {
last[i-1] = i * numElems - 1;
}
}
/* Parallel Code starts here*/
int th_id;
omp_set_num_threads(nthreads);
double start = omp_get_wtime();
#pragma omp parallel for private(th_id)
for (th_id = 0; th_id < nthreads; th_id++)
findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);
/* Parallel Code ends here*/
double end = omp_get_wtime();
double diff = end - start;
std::cout << "Time Elapsed in seconds:" << diff << '\n';
return 0;
}
Threading in general, and OpenMP in your particular case, introduces a certain amount of overhead that essentially prevents you from getting a "real" linear speedup. You have to account for that.
Second, the runtime of your test is extremely short (I assume the times measured are in seconds?). At that level you are also running into issues with the precision of timing the functions, as a very small amount of overhead has a large impact on the measured result.
Last, you are also dealing with memory access here, and if both the chunks you are processing and the stack you are creating do not fit into the processor cache, you also have to account for the overhead of fetching data from memory. The latter gets worse if you have multiple threads reading from, and possibly writing to, the same area of memory. That results in invalidated cache lines, which means your cores will be waiting for data to be fetched into the cache and/or written back to main memory.
For starters, I would massively increase the size of your data so the runtimes are in the seconds, then measure again. The longer your test code runs the better, because the startup and general overhead of the threading plays less of a role the more processing you do.
Once you have established a better baseline, you will probably need a good profiler that gives you deeper insight into threading to see where the hotspots in your code are. It is not unusual to have to roll custom data structures for the parallelized part to improve performance.
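A minimal sketch of that advice (my own illustration, with a hypothetical helper name): repeat the measured region several times and keep the best wall-clock time, so startup overhead and timer precision matter less.
#include <omp.h>
#include <algorithm>

// Sketch: run the region under test several times and report the minimum wall time.
template <typename Work>
double bestTime(int repetitions, Work work)
{
    double best = 1e300;
    for (int r = 0; r < repetitions; ++r) {
        double start = omp_get_wtime();
        work();                                   // e.g. the parallel loop calling findParallelUCHWOUP
        best = std::min(best, omp_get_wtime() - start);
    }
    return best;
}
Called, for example, as bestTime(10, [&]{ /* the parallel section from main */ });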