I use this version of a Pi calculation with the thread-safe function rand_r. But it turns out to be slower (and the answer is wrong) when the program runs in parallel, compared to a sequential program using rand(), which is not thread-safe. It seems that this way of using rand_r is also not thread-safe, but I do not understand why: I have read many questions about thread-safe PRNGs and learned that rand_r should be safe enough.
#include <iostream>
#include <random>
#include <ctime>
#include "omp.h"
#include <stdlib.h>
using namespace std;
unsigned seed;
int main()
{
double start = time(0);
int i, n, N;
double x, y;
N = 1<<30;
n = 0;
double pi;
#pragma omp threadprivate(seed)
#pragma omp parallel private(x, y) reduction(+:n)
{
for (i = 0; i < N; i++) {
seed = 25234 + 17 * omp_get_thread_num();
x = rand_r(&seed) / (double) RAND_MAX;
y = rand_r(&seed) / (double) RAND_MAX;
if (x*x + y*y <= 1)
n++;
}
}
pi = 4. * n / (double) (N);
cout << pi << endl;
double stop = time(0);
cout << (stop - start) << endl;
return 0;
}
P.S. By the way, what are the magic numbers in
seed = 25234 + 17 * omp_get_thread_num();
? I stole them from some answer.
EDIT: The comment by Gilles helped me. The resolution was:
1. To swap the order of the seed initialization and the for loop.
2. To add #pragma omp for
The modified code reads:
#pragma omp parallel private(x, y, seed)
{
seed = 25234 + 17 * omp_get_thread_num();
#pragma omp for reduction(+:n)
for (int i = 0; i < N; i++) {
x = (double) rand_r(&seed) / (double) RAND_MAX;
y = (double) rand_r(&seed) / (double) RAND_MAX;
if (x*x + y*y <= 1)
n++;
}
}
The problem is resolved.
Apparently rand_r() involves more instructions than rand(); the snippet below is copied from one implementation. So it is reasonable that rand_r() takes more time to complete one round than rand().
int
rand_r(unsigned int *ctx)
{
u_long val = (u_long) *ctx;
int r = do_rand(&val);
*ctx = (unsigned int) val;
return (r);
}
static u_long next = 1;
int
rand()
{
return (do_rand(&next));
}
And since rand() is not thread-safe, the output could be incorrect if you use it in parallel. The worse part is that you would still get a result and would not know whether it is correct from a small-scale test.
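Since the question already includes <random>, here is a minimal sketch (my suggestion, not part of the original post) of how each thread could own its own C++ engine instead of sharing rand() or hand-seeding rand_r; the seeding scheme (a base seed combined with the thread number) is an assumption for illustration:

#include <omp.h>
#include <random>

// Each thread constructs its own engine, so no PRNG state is shared.
double estimate_pi(long long samples)
{
    long long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        std::mt19937 gen(12345u + omp_get_thread_num());   // per-thread seed (assumption)
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for
        for (long long i = 0; i < samples; ++i) {
            double x = dist(gen);
            double y = dist(gen);
            if (x * x + y * y <= 1.0)
                ++hits;
        }
    }
    return 4.0 * hits / static_cast<double>(samples);
}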
Related
I'm trying to learn parallelization of C++ using OpenMP, and I'm using the following example. But for some reason, when I increase the number of threads the code runs slower. I'm compiling it with the -fopenmp flag. It would be nice if I could get your expert opinion.
#include <omp.h>
#include <iostream>
static long num_steps =100000000;
#define NUM_THREADS 4
double step;
int main(){
int i,nthreads;
double pi, sum[NUM_THREADS]; // should be shared : hence promoted scalar sum into an array
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
double t1 = omp_get_wtime();
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
//if(id==0) nthreads = nthrds; // This is done because the number of threads can be different
// ie the environment can give you a different number of threads
// than requested
for(i=id, sum[id] = 0.0; i<num_steps;i=i+nthrds){
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
double t2 = omp_get_wtime();
std::cout << "Time : " ;
double ms_double = t2 - t1;
std::cout << ms_double << "ms\n";
for(i=0,pi=0.0; i < nthreads; i++){
pi += sum[i]*step;
}
}
Minor complaints aside, your big problem is the loop update i=i+nthrds. This means that each cache line will be accessed by all 4 of your threads. (Btw, use the OMP_NUM_THREADS environment variable to set the number of threads. Do not hardcode.) This is called false sharing and it's really bad for performance: you want each cache line to be owned exclusively by one core.
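As an illustration (mine, not part of the original answer), one way to keep the per-thread partial sums on separate cache lines is to pad them; this reuses NUM_THREADS, num_steps and step from the question, and assumes a 64-byte cache line:

// Hedged sketch: padding each partial sum onto its own cache line avoids
// threads invalidating each other's lines when they update sums[id].
struct alignas(64) PaddedSum { double value = 0.0; };

PaddedSum sums[NUM_THREADS];
#pragma omp parallel num_threads(NUM_THREADS)
{
    const int id = omp_get_thread_num();
    #pragma omp for
    for (long i = 0; i < num_steps; ++i) {
        const double x = (i + 0.5) * step;
        sums[id].value += 4.0 / (1.0 + x * x);
    }
}
double pi = 0.0;
for (int i = 0; i < NUM_THREADS; ++i)
    pi += sums[i].value * step;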
The main advantage of OpenMP, though, is that you do not have to do the reduction manually. You just have to add an extra line to the serial code. So your code should be something like this (which is free from false sharing):
double sum=0;
#pragma omp parallel for reduction(+:sum)
for(unsigned long i=0; i<num_steps; ++i){
const double x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
double pi = sum*step;
Note that your code had an uninitialized variable (pi) and did not properly handle the case where you get fewer threads than requested.
What #Victor Ejkhout called "minor complaints" might not be so minor. It is only normal that using a new API (omp) for the first time can be confusing. And that reflects on the coding style of the application code as well, more often than not. But especially in such cases, special attention should be paid to readability.
The code below is the "prettied-up" version of your attempt. Next to the OpenMP integration it also has a single-threaded and a multi-threaded (std::thread) version, so you can compare them to each other.
#include <omp.h>
#include <iostream>
#include <thread>
constexpr int MAX_PARALLEL_THREADS = 4; // long is wrong - is it an isize_t or a int32_t or an int64_t???
// the function we want to integrate
double f(double x) {
return 4.0 / (1.0 + x * x);
}
// performs the summation of function values on the interval [left,right[
double sum_interval(double left, double right, double step) {
double sum = 0.0;
for (double x = left; x < right; x += step) {
sum += f(x);
}
return sum;
}
double integrate_single_threaded(double left, double right, double step) {
return sum_interval(left, right, step) / (right - left);
}
double integrate_multi_threaded(double left, double right, double step) {
double sums[MAX_PARALLEL_THREADS];
std::thread threads[MAX_PARALLEL_THREADS];
for (int i= 0; i < MAX_PARALLEL_THREADS;i++) {
threads[i] = std::thread( [&sums,left,right,step,i] () {
double ileft = left + (right - left) / MAX_PARALLEL_THREADS * i;
double iright = left + (right - left) / MAX_PARALLEL_THREADS * (i + 1);
sums[i] = sum_interval(ileft,iright,step);
});
}
double total_sum = 0.0;
for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
threads[i].join();
total_sum += sums[i];
}
return total_sum / (right - left);
}
double integrate_parallel(double left, double right, double step) {
double sums[MAX_PARALLEL_THREADS];
int thread_count = 0;
omp_set_num_threads(MAX_PARALLEL_THREADS);
#pragma omp parallel
{
thread_count = omp_get_num_threads(); // 0 is impossible, there is always 1 thread minimum...
int interval_index = omp_get_thread_num();
double ileft = left + (right - left) / thread_count * interval_index;
double iright = left + (right - left) / thread_count * (interval_index + 1);
sums[interval_index] = sum_interval(ileft,iright,step);
}
double total_sum = 0.0;
for (int i = 0; i < thread_count; i++) {
total_sum += sums[i];
}
return total_sum / (right - left);
}
int main (int argc, const char* argv[]) {
double left = -1.0;
double right = 1.0;
double step = 1.0E-9;
// run single threaded calculation
std::cout << "single" << std::endl;
double tstart = omp_get_wtime();
double i_single = integrate_single_threaded(left, right, step);
double tend = omp_get_wtime();
double st_time = tend - tstart;
// run multi threaded calculation
std::cout << "multi" << std::endl;
tstart = omp_get_wtime();
double i_multi = integrate_multi_threaded(left, right, step);
tend = omp_get_wtime();
double mt_time = tend - tstart;
// run omp calculation
std::cout << "omp" << std::endl;
tstart = omp_get_wtime();
double i_omp = integrate_parallel(left, right, step);
tend = omp_get_wtime();
double omp_time = tend - tstart;
std::cout
<< "i_single: " << i_single
<< " st_time: " << st_time << std::endl
<< "i_multi: " << i_multi
<< " mt_time: " << mt_time << std::endl
<< "i_omp: " << i_omp
<< " omp_time: " << omp_time << std::endl;
return 0;
}
When I compile this on my Debian with g++ --std=c++17 -Wall -O3 -lpthread -fopenmp -o para para.cpp -pthread, I get the following results:
single
multi
omp
i_single: 3.14159e+09 st_time: 2.37662
i_multi: 3.14159e+09 mt_time: 0.635195
i_omp: 3.14159e+09 omp_time: 0.660593
So, at least my conclusion is that it is not worth the effort to learn OpenMP, given that the (more generally usable) std::thread version looks just as nice and performs at least equally well.
I am not really trusting the computed integral result in either case, though. But I did not really focus on that. They all produce the same value. That is the important part.
I'm trying to understand why the following runs much faster on 1 thread than on 4 threads with OpenMP. The code is based on a similar question: OpenMP recursive tasks. But when trying to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (and I'm not sure what it is). Do people get better speed when running the code below on 4 threads than on 1 thread? I'm getting a 10x slowdown when running on 4 cores (I should be getting a moderate speedup rather than a significant slowdown).
int fib(int n)
{
if(n == 0 || n == 1)
return n;
if (n < 20) //EDITED CODE TO INCLUDE CUTOFF
return fib(n-1)+fib(n-2);
int res, a, b;
#pragma omp task shared(a)
a = fib(n-1);
#pragma omp task shared(b)
b = fib(n-2);
#pragma omp taskwait
res = a+b;
return res;
}
int main(){
omp_set_nested(1);
omp_set_num_threads(4);
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
cout << fib(25) << endl;
}
}
double time = omp_get_wtime() - start_time;
std::cout << "Time(ms): " << time*1000 << std::endl;
return 0;
}
Have you tried it with a larger number?
In multi-threading, it takes some time to set up the work on the CPU cores. For small jobs, which finish very quickly on a single core, threading slows the job down because of this overhead.
Multi-threading pays off when the job normally takes longer than a second, not milliseconds.
There is another bottleneck for threading as well. If your code tries to create too many threads, usually through recursion, this can delay all running threads and cause a massive setback.
This OpenMP/Tasks wiki page mentions the issue and suggests a manual cutoff: you need two versions of the function, and when the recursion goes too deep, it continues single-threaded.
EDIT: The cutoff variable needs to be increased before entering the OMP zone.
The following code is for the OP to test:
#define CUTOFF 5
int fib_s(int n)
{
if (n == 0 || n == 1)
return n;
int res, a, b;
a = fib_s(n - 1);
b = fib_s(n - 2);
res = a + b;
return res;
}
int fib_m(int n,int co)
{
if (co >= CUTOFF) return fib_s(n);
if (n == 0 || n == 1)
return n;
int res, a, b;
co++;
#pragma omp task shared(a)
a = fib_m(n - 1,co);
#pragma omp task shared(b)
b = fib_m(n - 2,co);
#pragma omp taskwait
res = a + b;
return res;
}
int main()
{
omp_set_nested(1);
omp_set_num_threads(4);
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
cout << fib_m(25,1) << endl;
}
}
double time = omp_get_wtime() - start_time;
std::cout << "Time(ms): " << time * 1000 << std::endl;
return 0;
}
RESULT:
With the CUTOFF value set to 10, it took under 8 seconds to calculate the 45th term.
co=1 14.5s
co=2 9.5s
co=3 6.4s
co=10 7.5s
co=15 7.0s
co=20 8.5s
co=21 >18.0s
co=22 >40.0s
I believe there is no way (or at least I do not know one) to tell the compiler not to create parallel tasks below a certain depth: omp_set_max_active_levels seems to have no effect and omp_set_nested is deprecated (and also has no effect).
So I have to specify manually after which level no more tasks should be created, which IMHO is sad. I still believe there should be a way to do this (if somebody knows one, kindly let me know). Here is how I attempted it; for input sizes above 20, the parallel version runs a bit faster than the serial one (taking about 70-80% of the time).
Ref: Code taken from an assignment from course (solution was not provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x
#include <stdio.h>
#include <omp.h>
#include <math.h>
int fib(int n, int rec_height)
{
int x = 1, y = 1;
if (n < 2)
return n;
int tCount = 0;
if (rec_height > 0) //Surprisingly, without this check the parallel code is slower than the serial one (I believe it should not be needed, I just don't know how to use OpenMP properly)
{
rec_height -= 1;
#pragma omp task shared(x)
x = fib(n - 1, rec_height);
#pragma omp task shared(y)
y = fib(n - 2, rec_height);
#pragma omp taskwait
}
else{
x = fib(n - 1, rec_height);
y = fib(n - 2, rec_height);
}
return x+y;
}
int main()
{
int tot_thread = 16;
int recDepth = (int)log2f(tot_thread);
if( ((int)pow(2, recDepth)) < tot_thread) recDepth += 1;
printf("\nrecDepth: %d\n",recDepth);
omp_set_max_active_levels(recDepth);
omp_set_nested(recDepth-1);
int n,fibonacci;
double starttime;
printf("\nPlease insert n, to calculate fib(n):\n");
scanf("%d",&n);
omp_set_num_threads(tot_thread);
starttime=omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
fibonacci=fib(n, recDepth);
}
}
printf("\n\nfib(%d)=%d \n",n,fibonacci);
printf("calculation took %lf sec\n",omp_get_wtime()-starttime);
return 0;
}
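As a side note (not from the original post): OpenMP 3.1 and later also provide a final clause on tasks. Once its expression evaluates to true, the task and all of its descendants are executed immediately instead of being deferred, which expresses this kind of cutoff without a second serial function. A minimal sketch, with an arbitrarily chosen depth threshold:

#include <omp.h>

// Sketch only: the cutoff depth of 4 is an arbitrary illustration.
int fib_final(int n, int depth)
{
    if (n < 2)
        return n;
    int x, y;
    #pragma omp task shared(x) final(depth >= 4)
    x = fib_final(n - 1, depth + 1);
    #pragma omp task shared(y) final(depth >= 4)
    y = fib_final(n - 2, depth + 1);
    #pragma omp taskwait
    return x + y;
}

Note that tasks below the threshold are still task constructs (just undeferred), so there is still some overhead compared to a plain recursive call.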
This question already has an answer here:
OpenMP program is slower than sequential one
(1 answer)
Closed 5 years ago.
I am currently trying to get familiar with OpenMP. For practice I implemented a greedy "learning" algorithm with OpenMP. Then I measured the time with
time ./a.out
I compared it with my serial implementation, and no matter how many iterations my program does, the OpenMP version is always significantly slower.
Here is my code; the comments should hopefully explain everything:
#include <omp.h>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <stdio.h>
#include <ctime>
#define THREADS 4
using namespace std;
struct TrainData {
double input;
double output;
};
//Long Term Memory struct
struct LTM {
double a; //parameter a of the polynomial
double b;
double c;
double score; //score to be minimized!
LTM()
{
a=0;
b=0;
c=0;
score=0;
}
//random LTM with parameters from low to high (including low and high)
LTM(int low, int high)
{
score=0;
a= rand() % high + low;
b= rand() % high + low;
c= rand() % high + low;
}
LTM(double _a, double _b, double _c)
{
a=_a;
b=_b;
c=_c;
}
void print()
{
cout<<"Score: "<<score<<endl;
cout<<"a: "<<a<<" b: "<<b<<" c: "<<c<<endl;
}
};
//the actual polynomial function, evaluated with the passed LTM
inline double evaluate(LTM &ltm, const double &x)
{
double ret;
ret = ltm.a*x*x + ltm.b*x + ltm.c;
return ret;
}
//scoring function calculates the Root Mean Square error (RMS)
inline double score_function(LTM &ltmnew, vector<TrainData> &td)
{
double score;
double val;
int tdsize=td.size();
score=0;
for(int i=0; i< tdsize; i++)
{
val = (td.at(i)).output - evaluate(ltmnew, (td.at(i)).input);
val *= val;
score += val;
}
score /= (double)tdsize;
score = sqrt(score);
return score;
}
LTM iterate(int iterations, vector<TrainData> td, int low, int high)
{
LTM fav = LTM(low,high);
fav.score = score_function(fav, td);
fav.print();
LTM favs[THREADS]; // array for collecting the favorites of each thread
#pragma omp parallel num_threads(THREADS) firstprivate(fav, low, high, td)
{
#pragma omp master
printf("Threads: %d\n", omp_get_num_threads());
LTM cand;
#pragma omp for private(cand)
for(int i=0; i<iterations; i++)
{
cand = LTM(low, high);
cand.score = score_function(cand, td);
if(cand.score < fav.score)
fav = cand;
}
//save the favorite before ending the parallel section
#pragma omp critical
favs[omp_get_thread_num()] = fav;
}
//search for the best one in the array
for(int i=0; i<THREADS; i++)
{
if(favs[i].score < fav.score)
fav=favs[i];
}
return fav;
}
//generate training data from -50 up to 50 with the train LTM
void generateTrainData(vector<TrainData> *td, LTM train)
{
#pragma omp parallel for schedule(dynamic, 25)
for(int i=-50; i< 50; i++)
{
struct TrainData d;
d.input = i;
d.output = evaluate(train, (double)i);
#pragma omp critical
td->push_back(d);
//cout<<"input: "<<d.input<<" -> "<<d.output<<endl;
}
}
int main(int argc, char *argv[])
{
int its= 10000000; //number of iterations
int a=2;
int b=4;
int c=6;
srand(time(NULL));
LTM pol = LTM(a,b,c); //original polynom parameters
vector<TrainData> td;
//first genarte some training data and save it to td
generateTrainData(&td, pol);
//try to find the best solution
LTM fav = iterate( its, td, 1, 6);
printf("Final: a=%f b=%f c=%f score: %f\n", fav.a, fav.b, fav.c, fav.score);
return 0;
}
On my home PC this implementation took 12 s; the serial one only 6 s.
If I increase the number of iterations by a factor of 10, it is around 2 min / 1 min (OpenMP / serial).
Can anyone help me?
Okay, thanks to the comments on my initial question I could solve the performance issue.
As the comments said, the problem was the rand() function I was using.
I replaced it with the thread-safe drand48_r().
Like:
...
LTM(double low, double high, struct drand48_data *buff)
{
score=0;
double x;
drand48_r(buff,&x);
a= low + x * (high - low);
drand48_r(buff,&x);
b= low + x * (high - low);
drand48_r(buff,&x);
c= low + x * (high - low);
}
...
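For context, each thread needs its own drand48_data buffer, seeded once before the loop. A minimal sketch of how the parallel section might look with that constructor, reusing the names from the question (the seeding scheme is my assumption, not the original code):

#pragma omp parallel num_threads(THREADS) firstprivate(fav, low, high, td)
{
    struct drand48_data buff;
    // per-thread seed; combining the time with the thread number is just one possible choice
    srand48_r(time(NULL) + omp_get_thread_num(), &buff);

    LTM cand;
    #pragma omp for
    for(int i = 0; i < iterations; i++)
    {
        cand = LTM(low, high, &buff);
        cand.score = score_function(cand, td);
        if(cand.score < fav.score)
            fav = cand;
    }
    #pragma omp critical
    favs[omp_get_thread_num()] = fav;
}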
Now I get times under one second!
Thanks! :)
I have written the following code in R and C++, which performs the same algorithm:
a) Simulate the random variable X 500 times. (X has value 0.9 with probability 0.5 and 1.1 with probability 0.5.)
b) Multiply these 500 simulated values together to get a value, and save that value in a container.
c) Repeat 10000000 times, so that the container has 10000000 values.
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>
const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;
void generatereturns(size_t steps, int RUNS){
mutex2.lock();
// setting seed
try{
std::mt19937 tmpgenerator(seed_);
seed_ = tmpgenerator();
std::cout << "SEED : " << seed_ << std::endl;
}catch(int exception){
mutex2.unlock();
}
mutex2.unlock();
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
std::lock_guard<std::mutex> guard(mutex1);
cache.push_back(returns);
}
}
int main(){
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
size_t steps = 500;
seed_ = 777;
unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
int remainder = MCsize % concurentThreadsSupported;
std::vector<std::thread> threads;
// starting sub-thread simulations
if(concurentThreadsSupported != 1){
for(int i = 0 ; i != concurentThreadsSupported - 1; ++i){
if(remainder != 0){
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
remainder--;
}else{
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
}
}
}
//starting main thread simulation
if(remainder != 0){
generatereturns(steps, MCsize / concurentThreadsSupported + 1);
remainder--;
}else{
generatereturns(steps, MCsize / concurentThreadsSupported);
}
for (auto& th : threads) th.join();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now() ;
typedef std::chrono::duration<int,std::milli> millisecs_t ;
millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) ) ;
std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n" ;
return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I have used four threads in the C++ code. Can anyone enlighten me, please? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking every time, I created a temporary vector and then used std::move to shift the elements of that tmpvec into cache. Now the elapsed time has dropped to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15 s to ~4.5 s on my laptop (Windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and I tried the code at home. I tried to change the random number generator; my implementation of std::binomial_distribution requires on average about 9.6 calls to generator().
I know the question is more about comparing R and C++ performance, but since you ask "How should I improve my C++ code to make it run faster?", I will insist on the pow optimization. You can easily avoid half of the calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9,steps);
double ratio = 1.1/0.9;
for(int i = 0; i!= RUNS; ++i){
...
returns = power1 * pow(ratio, (double)steps - power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?
I wrote a program which implements this formula:
Pi = (1/n) * sum over i = 1..n of 4 / (1 + ((i - 0.5)/n)^2)
Program code:
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
using namespace std;
const long double PI = double(M_PI);
int main(int argc, char* argv[])
{
typedef struct timeval tm;
tm start, end;
int timer = 0;
int n;
if (argc == 2) n = atoi(argv[1]);
else n = 8000;
long double pi1 = 0;
gettimeofday ( &start, NULL );
for(int i = 1; i <= n; i++) {
pi1 += 4 / ( 1 + (i-0.5) * (i-0.5) / (n*n) );
}
pi1/=n;
gettimeofday ( &end, NULL );
timer = ( end.tv_usec - start.tv_usec );
long double delta = pi1 - PI;
printf("pi = %.12Lf\n",pi1);
printf("delta = %.12Lf\n", delta);
cout << "time = " << timer << endl;
return 0;
}
How can I write it in a more optimal way, so that there are fewer floating-point operations in this part:
for(int i = 1; i <= n; i++) {
pi1 += 4 / ( 1 + (i-0.5) * (i-0.5) / (n*n) );
}
Thanks
One idea would be:
double nn = double(n) * n;
for(double i = 0.5; i < n; i += 1) {
pi1 += 4 / ( 1 + i * i / nn );
}
but you need to test whether it makes any difference compared to the current code.
I suggest you read this excellent document:
Software Optimization Guide for AMD64 Processors
It is also useful even if you do not have an AMD processor.
But if I were you, I would replace the whole calculation loop with just
pi1 = M_PI;
That will probably be the fastest... If you are actually interested in a faster algorithm for Pi calculations, look at the Wikipedia article: Category:Pi algorithm
If you just want to micro-optimize your code, read the above-mentioned software optimization guide.
Examples of simple optimization:
compute double one_per_n = 1.0/n; outside the for loop, reducing the cost of dividing by n on each iteration
compute double j = (i-0.5) * one_per_n inside the loop
pi1 += 4 / (1 + j*j);
This should be faster, and it also avoids the integer overflow you have for greater values of n; a short sketch follows. For even more optimized code you will have to look at the generated code and use a profiler to make appropriate changes. Code optimized this way might behave differently on machines with a different CPU or cache. Avoiding divisions is always a good way to save computation time.
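A minimal sketch of the loop with those changes applied (my reading of the suggestion, using pi1 and n from the question):

const double one_per_n = 1.0 / n;      // hoisted: one division for the whole loop
for (int i = 1; i <= n; i++) {
    const double j = (i - 0.5) * one_per_n;
    pi1 += 4.0 / (1.0 + j * j);        // no n*n, so no integer overflow
}
pi1 /= n;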
#include <iostream>
#include <iomanip>   // std::setprecision
#include <cstdlib>   // atoi
#include <cmath>
#include <chrono>
#ifndef M_PI // M_PI is non-standard, so make sure you catch this case
#define M_PI 3.14159265358979323846
#endif
typedef long double real_t; // "float_t" would clash with the typedef from <cmath>
const real_t PI = real_t(M_PI);
int main(int argc, char* argv[])
{
    int n = argc == 2 ? atoi(argv[1]) : 8000;
    real_t pi1 = 0.0;
    //if you can, using auto here is a no-brainer
    std::chrono::time_point<std::chrono::system_clock> start
        = std::chrono::system_clock::now();
    double n2 = double(n) * n; // double avoids integer overflow for large n
    for (int i = 1; i <= n; i++)
    {
        pi1 += 4.0 / ( 1.0 + (i-0.5) * (i-0.5) / n2 );
    }
    pi1 /= n;
    std::chrono::duration<double> time
        = std::chrono::system_clock::now() - start;
    real_t delta = pi1 - PI;
    std::cout << "pi = " << std::setprecision(12) << pi1
              << "\ndelta = " << std::setprecision(12) << delta
              << "\ntime = " << time.count() << std::endl;
    return 0;
}