This is my first time using multithreading to speed up a heavy calculation.
Background: The idea is to calculate a kernel covariance matrix by reading a list of 3D points x_test and computing the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculation by computing only the lower triangular matrix. Since all the entries are independent of each other, I tried to speed up the process further (x_test.size() = 27000 in my case) by splitting the matrix entries row-wise and assigning a range of rows to each thread.
On a single core the calculation took about 280 seconds each time; on 4 cores it took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreading calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
char buffer[8192];
std::ofstream out("lower_half_matrix_" + std::to_string(threadCounter) +".csv");
out.rdbuf()->pubsetbuf(buffer, sizeof(buffer)); // pass the actual array size (8192 bytes)
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
Locking caused by simultaneous access to the data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know it should prevent simultaneous data access, since every thread has its own copy.
Output -> Every thread writes its part of the lower triangular matrix to its own file. My task manager shows nowhere near full SSD utilization.
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
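To make the size computation mentioned above concrete, here is a minimal sketch of the offset arithmetic. It assumes every value is written in a fixed-width field followed by exactly one separator character (a comma, or a newline at the end of a row). The names valuesBeforeRow, rowOffset and fileSize are hypothetical and not part of the code in this answer. The point is only that each row's byte position is known in advance, which is what would let several threads write disjoint regions of a mapped file.

```cpp
#include <cassert>
#include <cstddef>

// Row i of the lower triangle holds i + 1 values, so the rows before it hold
// 1 + 2 + ... + i = i * (i + 1) / 2 values in total.
std::size_t valuesBeforeRow(std::size_t row) {
    return row * (row + 1) / 2;
}

// Byte offset at which row `row` starts, assuming each value occupies exactly
// `fieldWidth` characters plus one separator (',' or '\n').
std::size_t rowOffset(std::size_t row, std::size_t fieldWidth) {
    assert(fieldWidth > 0);
    return valuesBeforeRow(row) * (fieldWidth + 1);
}

// Total size in bytes of an n x n lower-triangular matrix under the same assumption.
std::size_t fileSize(std::size_t n, std::size_t fieldWidth) {
    return rowOffset(n, fieldWidth);
}
```

For n = 27000 and an assumed 12-character field, this gives 364,513,500 values and 4,738,675,500 bytes (about 4.7 GB), so a 64-bit std::size_t is needed.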
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
OK, I've written an implementation with optimized formatting.
With @Nelfeal's code, the run took around 250 seconds on my system, with the write step taking by far the most time. Or rather, the std::ofstream formatting took most of the time.
I've written a C++20 version using std::format_to/std::format. It is a multithreaded version that takes around 25-40 seconds to complete all the computation, formatting, and writing. Run in a single thread, it takes around 70 seconds on my system. The same performance should be achievable with the fmt library on C++11/14/17.
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here as it is the default for std::ofstream. More digits means more writing time to disk and lower performance.
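For compilers without <format>, the row-at-a-time buffering used above can also be sketched with std::snprintf. This is a stand-in under my own assumptions, not the fmt-based version mentioned earlier, and I have not benchmarked it. The helper names appendValue and formatRow are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Append one value with 6 significant digits ("%.6g"), the same precision
// that std::ofstream uses by default.
void appendValue(std::string& buffer, double value) {
    char text[32];  // ample room for any "%.6g" output
    const int length = std::snprintf(text, sizeof text, "%.6g", value);
    assert(length > 0 && length < static_cast<int>(sizeof text));
    buffer.append(text, static_cast<std::size_t>(length));
}

// Format a whole row into one string, so the stream is written to once per
// row instead of once per value.
std::string formatRow(const std::vector<double>& row) {
    std::string buffer;
    buffer.reserve(row.size() * 14);  // rough guess: about 14 characters per value
    for (std::size_t j = 0; j < row.size(); ++j) {
        if (j != 0) buffer += ',';
        appendValue(buffer, row[j]);
    }
    buffer += '\n';
    return buffer;
}
```

A thread would then pass the returned string to out.write(line.data(), line.size()), in the same place where the buffer is written in the code above.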
I've recently learnt the basics of threading and thought I'd take it for a spin. However, in some tests the threaded version of the code is actually slower than the serial code. Can anybody spot any problems with the following program? If there are none (highly doubtful), can you propose a strategy with which I would observe a speed-up? The other thing that crossed my mind is that this is a simple problem, so maybe the overhead incurred by threading isn't worth the effort. In reality, the BealeFunction below will be a system of ODEs, so it will be computationally expensive to evaluate.
Here are the results:
/**
* Results
* -------
*
* In serial: fitness = 163.179; computation took: 4085 microseonds
* In serial: fitness = 163.179; computation took: 3606 microseonds
* In serial: fitness = 163.179; computation took: 4288 microseonds
*
* With threading: fitness = 163.179; computation took: 16893 microseonds
* With threading: fitness = 163.179; computation took: 14333 microseonds
* With threading: fitness = 163.179; computation took: 13636 microseonds
*
*/
And the code to generate them:
#include <chrono>
#include <cmath> // pow, floor
#include <random>
#include <iostream>
#include <thread>
#include <future>
#include <mutex>
#include <vector>
double BealeFunction(double *parameters) {
double x = parameters[0];
double y = parameters[1];
double first = pow(1.5 - x + x * y, 2);
double second = pow(2.25 - x + x * pow(y, 2), 2);
double third = pow(2.625 - x + x * pow(y, 3), 2);
return first + second + third;
};
double inSerial(std::vector<std::vector<double>> matrix){
double total = 0;
for (auto & i : matrix){
total += BealeFunction(i.data());
}
return total;
}
double withThreading(std::vector<std::vector<double>> matrix){
double total = 0;
std::mutex mtx;
std::vector<std::shared_future<double>> futures;
auto compute1 = [&](int startIndex, int endIndex) {
double sum = 0;
for (int i = startIndex; i <= endIndex; i++) {
sum += BealeFunction(matrix[i].data());
}
return sum;
};
// deal with situation where population size < hardware_concurrency.
int numThreads = 0;
if (matrix.size() < (int) std::thread::hardware_concurrency() - 1) {
numThreads = matrix.size() - 1; // account for 0 index
} else {
numThreads = (int) std::thread::hardware_concurrency() - 1; // account for main thread
}
int numPerThread = floor(matrix.size() / numThreads);
int remainder = matrix.size() % numThreads;
int startIndex = 0;
int endIndex = numPerThread;
for (int i = 0; i < numThreads; i++) {
if (i < remainder) {
// we need to add one more job for these threads
startIndex = i * (numPerThread + 1);
endIndex = startIndex + numPerThread;
} else {
startIndex = i * numPerThread + remainder;
endIndex = startIndex + (numPerThread - 1);
}
std::cout << "thread " << i << "; start index: " << startIndex << "; end index: " << endIndex << std::endl;
std::shared_future<double> f = std::async(std::launch::async, compute1, startIndex, endIndex);
futures.push_back(f);
}
// now collect the results from futures
for (auto &future : futures) {
total += future.get();
}
return total;
}
int main() {
auto start = std::chrono::steady_clock::now();
int N = 2000;
int M = 2;
// (setup code)
std::vector<std::vector<double>> matrix(N, std::vector<double>(M));
int seed = 5;
std::default_random_engine e(seed);
std::uniform_real_distribution<double> dist1(2.9, 3.1);
std::uniform_real_distribution<double> dist2(0.4, 0.6);
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
matrix[i][0] = dist1(e);
matrix[i][1] = dist2(e);
}
}
double total = withThreading(matrix);
// double total = inSerial(matrix);
auto end = std::chrono::steady_clock::now();
std::cout << "fitness: " << total << std::endl;
std::cout << "computation took: " << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
<< " microseonds" << std::endl;
}
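One thing worth ruling out, offered as a sketch rather than a diagnosis: both functions above take the matrix by value, and the timer in main also covers the setup and the std::cout calls, so the measurement includes more than the summation itself. The version below is a minimal rewrite under my own assumptions (the names Point, beale, serialSum and parallelSum are hypothetical). It stores the points contiguously, passes them by const reference, and splits the work into half-open index ranges. With only 2000 cheap evaluations, thread start-up cost may still dominate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

using Point = std::array<double, 2>;

// Beale function, written with plain multiplications instead of pow().
double beale(const Point& p) {
    const double x = p[0], y = p[1];
    const double a = 1.5 - x + x * y;
    const double b = 2.25 - x + x * y * y;
    const double c = 2.625 - x + x * y * y * y;
    return a * a + b * b + c * c;
}

// Sum of beale() over the half-open index range [begin, end).
double serialSum(const std::vector<Point>& pts, std::size_t begin, std::size_t end) {
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) sum += beale(pts[i]);
    return sum;
}

// Split [0, pts.size()) into numTasks nearly equal chunks, one async task each.
double parallelSum(const std::vector<Point>& pts, std::size_t numTasks) {
    assert(numTasks > 0);
    std::vector<std::future<double>> futures;
    const std::size_t base = pts.size() / numTasks;
    const std::size_t extra = pts.size() % numTasks;  // the first `extra` chunks get one more item
    std::size_t begin = 0;
    for (std::size_t t = 0; t < numTasks; ++t) {
        const std::size_t end = begin + base + (t < extra ? 1 : 0);
        // std::cref avoids copying the vector into each task.
        futures.push_back(std::async(std::launch::async, serialSum, std::cref(pts), begin, end));
        begin = end;
    }
    double total = 0.0;
    for (auto& f : futures) total += f.get();
    return total;
}
```

To compare the two fairly, start the clock after the data has been generated and time only the call to serialSum(pts, 0, pts.size()) or parallelSum(pts, numTasks).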
Please check out my code and the question below. Thanks.
Code:
#include <iostream>
#include <chrono>
using namespace std;
int bufferWriteIndex = 0;
float curSample = 0;
float damping[5] = { 1, 1, 1, 1, 1 };
float modeDampingTermsExp[5] = { 0.447604, 0.0497871, 0.00247875, 0.00012341, 1.37263e-05 };
float modeDampingTermsExp2[5] = { -0.803847, -3, -6, -9, -11.1962 };
int main(int argc, char** argv) {
float subt = 0;
int subWriteIndex = 0;
auto now = std::chrono::high_resolution_clock::now();
while (true) {
curSample = 0;
for (int i = 0; i < 5; i++) {
//Slow version
damping[i] = damping[i] * modeDampingTermsExp2[i];
//Fast version
//damping[i] = damping[i] * modeDampingTermsExp[i];
float cosT = 2 * damping[i];
for (int m = 0; m < 5; m++) {
curSample += cosT;
}
}
//t += tIncr;
bufferWriteIndex++;
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
}
}
As you can see, I'm measuring the number of calculations (increments of bufferWriteIndex) per second.
Question:
Why is performance faster when using modeDampingTermsExp -
Program output:
12625671
12285846
12819392
11179072
12272587
11722863
12648955
vs using modeDampingTermsExp2 ?
1593620
1668170
1614495
1785965
1814576
1851797
1808568
1801945
It's about 10x faster. It seems like the numbers in those 2 arrays have an impact on calculation time. Why?
I am using Visual Studio 2019 with the following flags: /O2 /Oi /Ot /fp:fast
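This is because you are hitting denormal (subnormal) numbers (also see this question). With modeDampingTermsExp2, damping[0] is multiplied by -0.803847 on every pass, so its magnitude shrinks into the denormal range, where floating-point arithmetic is much slower on many CPUs.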
You can get rid of denormals like so:
#include <cmath>
// [...]
for (int i = 0; i < 5; i++) {
damping[i] = damping[i] * modeDampingTermsExp2[i];
if (std::fpclassify(damping[i]) == FP_SUBNORMAL) {
damping[i] = 0; // Treat denormals as 0.
}
float cosT = 2 * damping[i];
for (int m = 0; m < 5; m++) {
curSample += cosT;
}
}
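To see why only modeDampingTermsExp2 triggers this, here is a minimal sketch, assuming IEEE 754 floats with round-to-nearest and no flush-to-zero mode (the helper names decay and isSubnormal are hypothetical). Repeated multiplication by a factor whose magnitude is above 0.5, such as -0.803847, cannot round the smallest subnormals down to zero, so the value stays subnormal indefinitely. A factor below 0.5, such as 0.447604, eventually rounds to exactly zero.

```cpp
#include <cassert>
#include <cmath>

// Multiply 1.0f by `factor` `steps` times and return the magnitude of the result.
float decay(float factor, int steps) {
    assert(steps >= 0);
    float value = 1.0f;
    for (int i = 0; i < steps; ++i) {
        value *= factor;
    }
    return std::fabs(value);
}

// True if the value lies in the subnormal (denormal) range.
bool isSubnormal(float value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}
```

Under those assumptions, isSubnormal(decay(-0.803847f, 2000)) is true, while decay(0.447604f, 2000) is exactly 0.0f. This matches the observation that the first entry of modeDampingTermsExp2 is slow and the first entry of modeDampingTermsExp is not.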
I'm moving outside my comfort zone and trying to write a random number distribution program while making sure the result is still somewhat uniform.
Here is my code.
This is the RandomDistribution.h file:
#pragma once
#include <vector>
#include <random>
#include <iostream>
static float randy(float low, float high) {
static std::random_device rd;
static std::mt19937 random(rd());
std::uniform_real_distribution<float> ran(low, high);
return ran(random);
}
typedef std::vector<float> Vfloat;
class RandomDistribution
{
public:
RandomDistribution();
RandomDistribution(float percent, float contents, int container);
~RandomDistribution();
void setvariables(float percent, float contents, int container);
Vfloat RunDistribution();
private:
float divider;
float _percent;
int jar_limit;
float _contents;
float _maxdistribution;
Vfloat Jar;
bool is0;
};
This is my RandomDistribution.cpp:
#include "RandomDistribution.h"
RandomDistribution::RandomDistribution() {
}
RandomDistribution::RandomDistribution(float percent, float contents, int containers):_contents(contents),jar_limit(containers)
{
Jar.resize(containers);
if (percent < 0)
_percent = 0;
else {
_percent = percent;
}
divider = jar_limit * percent;
is0 = false;
}
RandomDistribution::~RandomDistribution()
{
}
void RandomDistribution::setvariables(float percent, float contents, int container) {
if (jar_limit != container)
Jar.resize(container);
_contents = contents;
jar_limit = container;
is0 = false;
if (percent < 0)
_percent = 0;
else {
_percent = percent;
}
divider = jar_limit * percent;
}
Vfloat RandomDistribution::RunDistribution() {
for (int i = 0; i < jar_limit; i++) {
if (!is0) {
if (i + 1 >= jar_limit || _contents < 2) {
Jar[i] = _contents;
_contents -= Jar[i];
is0 = true;
}
if (_percent > 0) {//making sure it does not get the whole container at once
_maxdistribution = (_contents / (divider)) * (i + 1);
}
else {
_maxdistribution = _contents;
}
Jar[i] = randy(0, _maxdistribution);
if (Jar[i] < 1) {
Jar[i] = 0;
continue;
}
_contents -= Jar[i];
}
else {
Jar[0];
}
//mixing Jar so it is randomly spaced out instead all at the top
int swapper = randy(0, i);
float hold = Jar[i];
Jar[i] = Jar[swapper];
Jar[swapper] = hold;
}
return Jar;
}
source code
int main(){
RandomDistribution distribution[100];
for (int i = 0; i < 100; i++) {
distribution[i] = {RandomDistribution(1.0f, 5000.0f, 2000) };
}
Vfloat k;
k.resize(200);
for (int i = 0; i < 10; i++) {
auto t3 = chrono::steady_clock::now();
for (int b = 0; b < 100; b++) {
k = distribution[b].RunDistribution();
distribution[b].setvariables(1.0f, 5000.0f, 2000);
}
auto t4 = chrono::steady_clock::now();
auto time_span = chrono::duration_cast<chrono::duration<double>>(t4 - t3);
cout << time_span.count() << " seconds\n";
}
}
What prints out is usually between 1 and 2 seconds for each cycle. I want to bring it down to a tenth of a second if possible, because this is going to be only one step of the whole process and I want to run it a lot more than 100 times. What can I do to speed this up? Is there any trick, or something I'm just missing here?
Here is a sample of the timings:
4.71113 seconds
1.35444 seconds
1.45008 seconds
1.74961 seconds
2.59192 seconds
2.76171 seconds
1.90149 seconds
2.2822 seconds
2.36768 seconds
2.61969 seconds
Cheinan Marks has some benchmarks and performance tips related to random generators and friends in his CppCon 2016 talk "I Just Wanted a Random Integer!". He mentions some fast generators as well, IIRC. I'd start there.
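As a rough illustration of what a "fast generator" can look like, here is a minimal xorshift32 sketch. It is not taken from the talk, its statistical quality is lower than that of std::mt19937, and the names Xorshift32 and uniform are hypothetical. Treat it as something to benchmark against randy, not as a recommendation.

```cpp
#include <cassert>
#include <cstdint>

// Marsaglia's 32-bit xorshift generator with the (13, 17, 5) shift triple.
struct Xorshift32 {
    std::uint32_t state;

    explicit Xorshift32(std::uint32_t seed) : state(seed) {
        assert(seed != 0);  // an all-zero state would stay zero forever
    }

    std::uint32_t next() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }

    // Float between low and high, built from the top 24 bits of next().
    // The unit value is in [0, 1); rounding may occasionally yield `high` itself.
    float uniform(float low, float high) {
        const float unit = static_cast<float>(next() >> 8) * (1.0f / 16777216.0f);
        return low + (high - low) * unit;
    }
};
```

One generator object would be created once and reused for every call, instead of constructing a distribution inside each call as randy does.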
I need to convert this program, which runs a single loop, so that the iteration steps are divided among 4 threads: if there are n iterations, each thread executes a quarter of them. The program takes an average of 4.7 seconds to run. The sum is accessible to all 4 threads, and there is a problem when updating it: I'm getting 1.5 as the answer instead of 3.14159 for the value of pi. Also, threading does not decrease the run time. Please help me.
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <thread>
#include <functional>
#include <mutex>
//std::mutex m;
long num_rects = 100000000;
struct params
{
int start;
int end;
double mid;
double height;
double width;
params(int st,int en)
{
start = st;
end = en;
width = 1.0 / (double)num_rects;
}
};
double sum = 0.0;
void sub1(params param){
for (int i = param.start; i < param.end; i++)
{
param.mid = (i + 0.5)*param.width;
param.height = 4.0 / (1.0 + param.mid*param.mid);
//m.lock();
sum += param.height;
//m.unlock();
}
}
int _tmain(int argc, _TCHAR* argv[])
{
int i;
double mid, height, width;
double area;
auto begin = std::chrono::high_resolution_clock::now();
params par(0, num_rects / 4);
std::thread t(sub1, par);
params par1(num_rects / 4, num_rects / 2);
std::thread t1(sub1, par1);
params par2(num_rects / 2, (num_rects *3)/ 4);
std::thread t2(sub1, par2);
params par3((num_rects * 3) / 4, num_rects );
std::thread t3(sub1, par3);
t.join();
t1.join();
t2.join();
t3.join();
/*
sub1(par);
sub1(par1);
sub1(par2);
sub1(par3);
*/
width = 1.0 / (double)num_rects;
area = sum*width;
std::cout << area << std::endl;
auto end = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << "ms" << std::endl;
std::cin.get();
return 0;
}
You have a race condition on sum: two threads can read the same old value, each add their own term, and write the result back, so one of the two updates is lost.
This change should work.
double sub1(params param){
double sum = 0.0; // thread local
for (int i = param.start; i < param.end; i++)
{
param.mid = (i + 0.5)*param.width;
param.height = 4.0 / (1.0 + param.mid*param.mid);
sum += param.height;
}
return sum;
}
#include <future>
int SubMain() {
int i;
double mid, height, width;
double area;
auto begin = std::chrono::high_resolution_clock::now();
params par(0, num_rects / 4);
// std::launch::async requests a separate thread; the default policy may defer the call until get()
std::future<double> fut1 = std::async (std::launch::async, sub1, par);
params par1(num_rects / 4, num_rects / 2);
std::future<double> fut2 = std::async (std::launch::async, sub1, par1);
params par2(num_rects / 2, (num_rects *3)/ 4);
std::future<double> fut3 = std::async (std::launch::async, sub1, par2);
params par3((num_rects * 3) / 4, num_rects );
std::future<double> fut4 = std::async (std::launch::async, sub1, par3);
sum = fut1.get() + fut2.get() + fut3.get() + fut4.get();
width = 1.0 / (double)num_rects;
area = sum*width;
std::cout << area << std::endl;
auto end = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << "ms" << std::endl;
std::cin.get();
return 0;
}
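If you would rather keep std::thread than switch to std::async, the same idea can be sketched as follows: each thread accumulates into a local variable and takes the mutex only once, at the end, to add its partial result to the shared total. The function names partialArea and computePi are hypothetical, and this is a sketch of the pattern, not a tuned implementation.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Midpoint-rule sum of 4 / (1 + x^2) over rectangles [start, end), added to
// `total` under the mutex exactly once per thread.
void partialArea(long long start, long long end, double width, double& total, std::mutex& m) {
    double local = 0.0;  // private to this thread, so no locking inside the loop
    for (long long i = start; i < end; ++i) {
        const double mid = (i + 0.5) * width;
        local += 4.0 / (1.0 + mid * mid);
    }
    std::lock_guard<std::mutex> lock(m);
    total += local;
}

double computePi(long long numRects, int numThreads) {
    assert(numRects > 0 && numThreads > 0);
    const double width = 1.0 / static_cast<double>(numRects);
    double total = 0.0;
    std::mutex m;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        // Integer bounds; the last chunk ends exactly at numRects.
        const long long start = numRects * t / numThreads;
        const long long end = numRects * (t + 1) / numThreads;
        threads.emplace_back(partialArea, start, end, width, std::ref(total), std::ref(m));
    }
    for (auto& th : threads) th.join();
    return total * width;
}
```

Locking inside the loop, as in the commented-out m.lock()/m.unlock() lines of the question, would also give the right answer, but the threads would then spend most of their time waiting for each other.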
I made some changes to Surt's code; this is the final optimized version:
double sub1(params param){
double sum = 0.0; // thread local
for (int i = param.start; i < param.end; i++)
{
param.mid = (i + 0.5)*param.width;
param.height = 4.0 / (1.0 + param.mid*param.mid);
sum += param.height;
}
return sum;
}
#include <future>
#include <vector>
int SubMain() {
int i;
double mid, height, width;
double area;
auto begin = std::chrono::high_resolution_clock::now();
std::vector<std::future<double>> futures;
double k = 0;
for (int j = 0; j < 4; j++)
{
params par(num_rects *k, num_rects *(k + 0.25));
k += 0.25;
futures.push_back(std::async(sub1, par));
}
for (std::vector<std::future<double>> ::iterator it = futures.begin(); it != futures.end(); it++)
{
sum += it->get();
}
/* params par(0, num_rects / 4);
std::future<double> fut1 = std::async(sub1, par);
params par1(num_rects / 4, num_rects / 2);
std::future<double> fut2 = std::async(sub1, par1);
params par2(num_rects / 2, (num_rects * 3) / 4);
std::future<double> fut3 = std::async(sub1, par2);
params par3((num_rects * 3) / 4, num_rects);
std::future<double> fut4 = std::async(sub1, par3);
sum = fut1.get() + fut2.get() + fut3.get() + fut4.get();*/
width = 1.0 / (double)num_rects;
area = sum*width;
std::cout << area << std::endl;
auto end = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << "ms" << std::endl;
std::cin.get();
return 0;
}