Puzzling behaviour of async - c++

This might be some weird Linux quirk, but I'm observing very strange behavior.
The following code should compare a synchronous version of summing numbers with an async version. The thing is that I'm seeing a performance increase (it's not caching; it happens even when I split the code into two separate programs), while the program still appears single-threaded (only one core is used).
strace does show some thread activity, but monitoring tools such as top and its clones still show only one core in use.
The second problem is that if I increase the number of threads spawned, the memory usage just explodes. What is the memory overhead of a thread? With 5000 threads I get ~10 GB of memory usage.
#include <iostream>
#include <random>
#include <chrono>
#include <future>
#include <vector>     // std::vector
#include <functional> // std::cref
#include <cstdint>    // uint32_t
using namespace std;
long long sum2(const vector<int>& v, size_t from, size_t to)
{
const size_t boundary = 5*1000*1000;
if (to-from <= boundary)
{
long long rsum = 0;
for (;from < to; from++)
{
rsum += v[from];
}
return rsum;
}
else
{
size_t mid = from + (to-from)/2;
auto s2 = async(launch::async,sum2,cref(v),mid,to);
long long rsum = sum2(v,from,mid);
rsum += s2.get();
return rsum;
}
}
long long sum2(const vector<int>& v)
{
return sum2(v,0,v.size());
}
long long sum(const vector<int>& v)
{
long long rsum = 0;
for (auto i : v)
{
rsum += i;
}
return rsum;
}
int main()
{
const size_t vsize = 100*1000*1000;
vector<int> x;
x.reserve(vsize);
mt19937 rng;
rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
uniform_int_distribution<uint32_t> dist(0,10);
for (size_t i = 0; i < vsize; i++)
{
x.push_back(dist(rng));
}
auto start = chrono::high_resolution_clock::now();
long long suma = sum(x);
auto end = chrono::high_resolution_clock::now();
cout << "Sum is " << suma << endl;
cout << "Duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;
start = chrono::high_resolution_clock::now();
suma = sum2(x);
end = chrono::high_resolution_clock::now();
cout << "Async sum is " << suma << endl;
cout << "Async duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;
return 0;
}

Maybe you observe one core being used because the overlap between threads doing work simultaneously is too short to be noticeable. Summing 5 million values from a contiguous area of memory should be very fast on modern hardware, so by the time the parent finishes summing, the child may have barely started, and the parent may spend most or all of its time waiting for the child's result. Have you tried increasing the work unit to see whether the overlap becomes noticeable?
Regarding increased performance: even if there is zero overlap between threads because the work unit is too small, the multithreaded version can still benefit from additional L1 cache. For such a test, memory will likely be the bottleneck; the sequential version uses only one L1 cache, while the multithreaded version uses as many as there are cores.

Have you checked the times that are being printed? On my machine, the serial time is under 1s at -O2, whilst the parallel sum is several times faster. It's therefore entirely possible that the CPU usage is not high for long enough for tools like "top" to register it, since they typically only refresh once per second.
If you increase the number of threads by reducing the count per thread, you effectively increase the thread-management overhead. If you have 5000 threads active, your task will take 5000 times the minimum thread stack size in additional memory. On my machine that's 20 GB!
Why don't you try increasing the size of the source container? If you make the parallel section take long enough, you'll see the corresponding parallel CPU usage. However, be prepared: summing integers is fast, and the time taken generating the random numbers can take an order of magnitude or two longer than the time to add the numbers together.
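One way to address both points raised in the answers (too little work per thread, and memory that grows with the thread count) is to split the vector into a small, fixed number of chunks instead of recursing. The following is only a minimal sketch, not the asker's code: chunked_sum and its num_tasks parameter are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums v using at most num_tasks asynchronous tasks.
// Each task sums one contiguous chunk; the caller adds up the partial sums.
long long chunked_sum(const std::vector<int>& v, unsigned num_tasks)
{
    if (num_tasks == 0) num_tasks = 1; // e.g. hardware_concurrency() may return 0
    // Ceiling division so that at most num_tasks chunks are created.
    const std::size_t chunk = (v.size() + num_tasks - 1) / num_tasks;

    std::vector<std::future<long long>> parts;
    for (std::size_t from = 0; from < v.size(); from += chunk) {
        const std::size_t to = std::min(v.size(), from + chunk);
        parts.push_back(std::async(std::launch::async, [&v, from, to] {
            return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(from),
                                   v.begin() + static_cast<std::ptrdiff_t>(to),
                                   0LL);
        }));
    }

    long long total = 0;
    for (auto& p : parts) total += p.get(); // waits for every task to finish
    return total;
}
```

Calling chunked_sum(x, std::thread::hardware_concurrency()) would start roughly one task per hardware thread, so the number of thread stacks stays bounded no matter how large the vector is.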

Related

Why does the runtime of high_resolution_clock increase with the greater frequency I call it?

In the following code, I repeatedly call std::chrono::high_resolution_clock::now twice, and measure the time it took between these two calls. I would expect this time to be very small, since no other code runs between these two calls. However, I observe strange behavior.
For small N, the max element is within a few nanoseconds, as expected. However, the more I increase N, the larger the outliers I get, up to a few milliseconds. Why does this happen?
In other words, why does the max element of v increase as I increase N in the following code?
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
int main()
{
using ns = std::chrono::nanoseconds;
uint64_t N = 10000000;
std::vector<uint64_t> v(N, 0);
for (uint64_t i = 0; i < N; i++) {
auto start = std::chrono::high_resolution_clock::now();
v[i] = std::chrono::duration_cast<ns>(std::chrono::high_resolution_clock::now() - start).count();
}
std::cout << "max: " << *std::max_element(v.begin(), v.end()) << std::endl;
return 0;
}
The longer you run your loop, the more likely it is that your OS will decide that your thread has consumed enough resources for the moment and suspend it. And the longer you run your loop, the more likely it is that this suspension will happen between those calls.
Since you're only looking at the "max" time, this only has to happen once to cause the max time to spike into the millisecond range.
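Because a single suspension is enough to inflate the maximum, it can help to look at the minimum and the median alongside it. The following is a minimal sketch of such a summary; Summary and summarize are hypothetical names introduced here and are not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of a set of timing samples (in nanoseconds).
struct Summary {
    std::uint64_t min;
    std::uint64_t median;
    std::uint64_t max;
};

// Takes the samples by value because nth_element reorders them.
// Precondition: samples is not empty.
Summary summarize(std::vector<std::uint64_t> samples)
{
    assert(!samples.empty());
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    const std::uint64_t median = *mid; // upper median for even-sized input
    const auto mm = std::minmax_element(samples.begin(), samples.end());
    return {*mm.first, median, *mm.second};
}
```

Passing the vector v from the question to summarize would let you compare the median with the max: a rare preemption moves the max a lot but barely affects the median.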

What is causing the threads to execute slower than the serial case?

I have a simple function which computes the sum of "n" numbers.
I am attempting to use threads to implement the sum in parallel. The code is as follows,
#include <iostream>
#include <thread>
#include <functional> // std::ref
void Add(double &sum, const int startIndex, const int endIndex)
{
sum = 0.0;
for (int i = startIndex; i < endIndex; i++)
{
sum = sum + 0.1;
}
}
int main()
{
int n = 100'000'000;
double sum1;
double sum2;
std::thread t1(Add, std::ref(sum1), 0, n / 2);
std::thread t2(Add, std::ref(sum2), n / 2, n);
t1.join();
t2.join();
std::cout << "sum: " << sum1 + sum2 << std::endl;
// double serialSum;
// Add(serialSum, 0, n);
// std::cout << "sum: " << serialSum << std::endl;
return 0;
}
However, the code runs much slower than the serial version. If I modify the function such that it does not take in the sum variable, then I obtain the desired speed-up (nearly 2x).
I read several resources online but all seem to suggest that variables must not be accessed by multiple threads. I do not understand why that would be the case for this example.
Could someone please clarify my mistake?
The problem here is hardware.
You probably know that CPUs have caches to speed up operations. These caches are many times faster than memory, but they work in units called cache lines, probably 64 bytes on your system. Your two doubles are 8 bytes each and will almost certainly end up in the same 64-byte region of the stack. Each core in a CPU generally has its own L1 cache, while larger caches may be shared between cores.
Now when one thread accesses sum1, its core loads the relevant cache line into its cache. When the second thread accesses sum2, the other core attempts to load the same cache line into its own cache. The x86 architecture keeps the caches coherent for you: it asks the first cache to hand over the cache line so that both threads always see the same data.
So while you have two separate variables, they are in the same cache line, and on every access that cache line bounces from one core to the other and back, which is a rather slow operation. This is called false sharing.
So you need to put some separation between sum1 and sum2 to make this work fast. See std::hardware_destructive_interference_size (C++17, declared in <new>) for the distance you need.
Another, and probably way simpler, way is to modify the worker function to use local variables:
void Add(double &sum, const int startIndex, const int endIndex)
{
double t = 0.0;
for (int i = startIndex; i < endIndex; i++)
{
t = t + 0.1;
}
sum = t;
}
You still have false sharing and the two threads will fight over access to sum1 and sum2. But now it only happens once and becomes irrelevant.
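For completeness, the first suggestion (separating the two sums) can be sketched as follows. This is a minimal illustration under the assumption of 64-byte cache lines; kCacheLineSize, PaddedSum, AddPadded and ParallelAdd are hypothetical names, and where the standard library provides it, std::hardware_destructive_interference_size could replace the hard-coded constant.

```cpp
#include <cstddef>
#include <functional>
#include <thread>

// Assumption: cache lines are 64 bytes, as suggested in the answer above.
constexpr std::size_t kCacheLineSize = 64;

// alignas forces every PaddedSum onto its own cache line, so two adjacent
// array elements can never share one.
struct alignas(kCacheLineSize) PaddedSum {
    double value = 0.0;
};

// Same loop as the original Add, still writing through the reference on
// every iteration, but into a padded slot.
void AddPadded(PaddedSum& sum, int startIndex, int endIndex)
{
    sum.value = 0.0;
    for (int i = startIndex; i < endIndex; i++) {
        sum.value = sum.value + 0.1;
    }
}

double ParallelAdd(int n)
{
    PaddedSum sums[2]; // sums[0] and sums[1] are on different cache lines
    std::thread t1(AddPadded, std::ref(sums[0]), 0, n / 2);
    std::thread t2(AddPadded, std::ref(sums[1]), n / 2, n);
    t1.join();
    t2.join();
    return sums[0].value + sums[1].value;
}
```

The local-variable version is usually still the simpler fix; padding mainly matters when threads genuinely have to update neighbouring memory locations often.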

TBB: How can I measure scheduling overhead?

I have a very simple parallel_for loop that does some work on a large vector. While this is a contrived example, I am hoping to measure any potential overhead with scheduling by varying grain size. The loop is as follows:
tbb_start = std::chrono::high_resolution_clock::now();
tbb::parallel_for(tbb::blocked_range<int>(0, values.size(), grainSize),
[&](tbb::blocked_range<int> r)
{
for (int i = r.begin(); i < r.end(); ++i)
{
values[i] = std::sin(i * 0.001);
}
});
tbb_end = std::chrono::high_resolution_clock::now();
loop_duration = (tbb_end - tbb_start);
std::cout << "TBB Time: " << loop_duration.count() << "ms" << std::endl;
My assumption here is that increasing the grain size will reduce the scheduling overhead (since more work will be done by fewer threads). How then to measure this change in overhead? Per this paper:
https://www.epcc.ed.ac.uk/sites/default/files/PDF/ewomp99paper.pdf
The authors take the difference between the parallel run time and serial run time, divide that by the number of processors and that number will be the overhead. Is this the standard way to do it? Is there another (possibly better) way?

Cannot get 100% CPU usage from thread class

I tried to write an answer to the question "How to get 100% CPU usage from a C program" using the std::thread class. Here is my code:
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
using namespace std;
static int primes = 0;
void prime(int a, int b);
mutex mtx;
int main()
{
unsigned int nthreads = thread::hardware_concurrency();
vector<thread> threads;
int limit = 1000000;
int intrvl = (int) limit / nthreads;
for (int i = 0; i < nthreads; i++)
{
threads.emplace_back(prime, i*intrvl+1, i*intrvl+intrvl);
}
cout << "Number of logical cores: " << nthreads << "\n";
cout << "Calculating number of primes less than " << limit << "... \n";
for (thread & t : threads) {
t.join();
}
cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
return 0;
}
void prime(int a, int b)
{
for (a; a <= b; a++) {
int i = 2;
while(i <= a) {
if(a % i == 0)
break;
i++;
}
if(i == a) {
mtx.lock();
primes++;
mtx.unlock();
}
}
}
But when I run it, I get the following diagram.
It is sinusoidal. But when I run @Mysticial's answer, which uses OpenMP, I get this.
I checked both programs with ps -eLf, and both of them use 8 threads. Why do I get this unsteady diagram, and how can I get the same result as OpenMP does with std::thread?
There are some fundamental differences between Mysticial's answer and your code.
Difference #1
Your code creates a chunk of work for each CPU, and lets it run to completion. This means that once a thread has finished, there will be a sharp drop in the CPU usage since a CPU will be idle while the other threads run to completion. This happens because scheduling is not always fair. One thread may progress, and finish, much faster than the others.
The OpenMP solution solves this by declaring schedule(dynamic) which tells OpenMP to, internally, create a work queue that all the threads will consume work from. When a chunk of work is finished, the thread that would have then exited in your code consumes another chunk of work and gets busy with it.
Eventually, this becomes a balancing act of picking adequately sized chunks. Too large, and the CPUs may not be maxed out toward the end of the task. Too small, and there can be significant overhead.
Difference #2
You are writing to a variable, primes that is shared between all of the threads.
This has 2 consequences:
It requires synchronization to prevent a data race.
It makes the cache on a modern CPU very unhappy, since the cache line holding primes has to move between cores before a write on one thread is visible to another thread.
The OpenMP solution solves this by reducing, via operator+(), the result of the individual values of primes each thread held into the final result. This is what reduction(+ : primes) does.
With this knowledge of how OpenMP is splitting up, scheduling the work, and combining the results, we can modify your code to behave similarly.
#include <iostream>
#include <thread>
#include <vector>
#include <utility>
#include <algorithm>
#include <functional>
#include <mutex>
#include <future>
using namespace std;
int prime(int a, int b)
{
int primes = 0;
for (a; a <= b; a++) {
int i = 2;
while (i <= a) {
if (a % i == 0)
break;
i++;
}
if (i == a) {
primes++;
}
}
return primes;
}
int workConsumingPrime(vector<pair<int, int>>& workQueue, mutex& workMutex)
{
int primes = 0;
unique_lock<mutex> workLock(workMutex);
while (!workQueue.empty()) {
pair<int, int> work = workQueue.back();
workQueue.pop_back();
workLock.unlock(); //< Don't hold the mutex while we do our work.
primes += prime(work.first, work.second);
workLock.lock();
}
return primes;
}
int main()
{
int nthreads = thread::hardware_concurrency();
int limit = 1000000;
// A place to put work to be consumed, and a synchronisation object to protect it.
vector<pair<int, int>> workQueue;
mutex workMutex;
// Put all of the ranges into a queue for the threads to consume.
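int chunkSize = max(limit / (nthreads*16), 10); //< Hand-waving: 16 was picked as a good-enough factor.
for (int i = 0; i < limit; i += chunkSize) {
// prime(a, b) counts the inclusive range [a, b], so start each chunk at i + 1
// to avoid counting a chunk boundary twice.
workQueue.push_back(make_pair(i + 1, min(limit, i + chunkSize)));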
}
// Start the threads.
vector<future<int>> futures;
for (int i = 0; i < nthreads; ++i) {
packaged_task<int()> task(bind(workConsumingPrime, ref(workQueue), ref(workMutex)));
futures.push_back(task.get_future());
thread(move(task)).detach();
}
cout << "Number of logical cores: " << nthreads << "\n";
cout << "Calculating number of primes less than " << limit << "... \n";
// Sum up all the results.
int primes = 0;
for (future<int>& f : futures) {
primes += f.get();
}
cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
}
This is still not a perfect reproduction of how the OpenMP example behaves. For example, this is closer to OpenMP's static schedule since chunks of work are a fixed size. Also, OpenMP does not use a work queue at all. So I may have lied a little bit -- call it a white lie, since I wanted to be more explicit about showing the work being split up. What it is likely doing behind the scenes is storing the iteration that the next thread should start at when it becomes available, plus a heuristic for the next chunk size.
Even with these differences, I'm able to max out all my CPUs for an extended period of time.
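The idea of "storing the iteration that the next thread should start at" can be sketched with a shared atomic counter instead of a mutex-protected queue. This is only an illustration of the idea, not a description of what any OpenMP runtime actually does; is_prime and count_primes_dynamic are hypothetical names, and the sketch assumes limit plus nthreads times chunk fits comfortably in an int.

```cpp
#include <algorithm>
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

bool is_prime(int n)
{
    if (n < 2) return false;
    for (int i = 2; static_cast<long long>(i) * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

// Counts the primes in [0, limit]. Each worker repeatedly claims the next
// unclaimed block of `chunk` numbers by bumping a shared atomic counter.
int count_primes_dynamic(int limit, unsigned nthreads, int chunk)
{
    if (nthreads == 0) nthreads = 1;
    if (chunk < 1) chunk = 1;

    std::atomic<int> next{0};              // first number nobody has claimed yet
    std::vector<int> partial(nthreads, 0); // one result slot per worker
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&next, &partial, limit, chunk, t] {
            int local = 0; // accumulate locally, publish once at the end
            for (;;) {
                const int begin = next.fetch_add(chunk, std::memory_order_relaxed);
                if (begin > limit) break;
                const int end = std::min(limit, begin + chunk - 1);
                for (int n = begin; n <= end; ++n)
                    if (is_prime(n)) ++local;
            }
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

A thread that draws cheap blocks simply comes back for more, so no core should sit idle until the blocks run out. The chunk parameter controls the same trade-off described earlier: too large and the tail is unbalanced, too small and the counter becomes a point of contention.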
Looking to the future...
You probably noticed that the OpenMP version is a lot more readable. This is because it's meant to solve problems just like this. So, when we try to solve them without a library or compiler extension, we end up reinventing the wheel. Luckily, there is a lot of work being done to bring this sort of functionality directly into C++. Specifically, the Parallelism TS can help us out if we could represent this as a standard C++ algorithm. Then we could tell the library to distribute the algorithm across all CPUs as it sees fit so it does all the heavy lifting for us.
In C++11, with a little bit of help from Boost, this algorithm could be written as:
#include <iostream>
#include <iterator>
#include <algorithm>
#include <boost/range/irange.hpp>
using namespace std;
bool isPrime(int n)
{
if (n < 2)
return false;
for (int i = 2; i < n; ++i) {
if (n % i == 0)
return false;
}
return true;
}
int main()
{
auto range = boost::irange(0, 1000001);
auto numPrimes = count_if(begin(range), end(range), isPrime);
cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}
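And to parallelise the algorithm, you just need to #include <execution_policy> and pass std::par as the first parameter to count_if. (Those names come from the Parallelism TS proposal; in C++17, where this was standardised, the header is <execution> and the policy is std::execution::par.)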
auto numPrimes = count_if(par, begin(range), end(range), isPrime);
And that's the kind of code that makes me happy to read.
Note: Absolutely no time was spent optimising this algorithm at all. If we were to do any optimisation, I'd look into something like the Sieve of Eratosthenes, which uses previous prime computations to help with future ones.
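For reference, a single-threaded Sieve of Eratosthenes can be sketched in a few lines. This is a minimal illustration rather than part of the original answer; count_primes_sieve is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Returns the number of primes in [0, limit] using the Sieve of Eratosthenes.
int count_primes_sieve(int limit)
{
    if (limit < 2) return 0;
    // composite[n] becomes true once n is known to have a smaller prime factor.
    std::vector<bool> composite(static_cast<std::size_t>(limit) + 1, false);
    int count = 0;
    for (int n = 2; n <= limit; ++n) {
        if (composite[static_cast<std::size_t>(n)]) continue;
        ++count; // n was never crossed out, so it is prime
        // Cross out multiples starting at n*n; smaller multiples were already
        // crossed out by smaller primes. long long avoids overflow in n*n.
        for (long long m = static_cast<long long>(n) * n; m <= limit; m += n)
            composite[static_cast<std::size_t>(m)] = true;
    }
    return count;
}
```

For the limit used in the question, count_primes_sieve(1000000) returns 78498, the number of primes below one million.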
First, you need to realize that OpenMP usually has a fairly sophisticated thread pool under the covers, so matching it (exactly) will probably be at least somewhat difficult.
Second, it seems to me that before optimizing the threading, you should attempt to start with at least a halfway decent basic algorithm. In this case, the basic algorithm you're implementing is basically pretty awful. It's checking whether numbers are prime, but doing a lot of work that doesn't accomplish anything useful.
It's checking whether even numbers are prime. Other than 2, they're not. Ever.
It's checking whether odd numbers are divisible by even numbers. Again, they're not. Ever.
It's checking whether numbers are divisible by numbers larger than their square root. If there's no divisor smaller than the square root, there can't be one larger than the square root either.
Although it probably doesn't affect speed, I also find it a lot easier to have a function that checks whether a single number is prime, and just returns true/false to indicate the result, than to have somewhat elaborate code to figure out whether a preceding loop ran to completion or exited early.
You can optimize the algorithm by eliminating more than that, but that much doesn't strike me as "optimization" nearly so much as simply avoiding completely unnecessary pessimization.
At least in my opinion, it's also a bit easier (in this case) to use std::async to launch the threads. This lets us return a value from our thread (the count we want back) pretty easily.
So, let's start by fixing prime based on those observations:
int prime(int a, int b)
{
int count = 0;
if (a <= 2) {
// 2 is the only even prime; 0 and 1 are not prime, so skip straight to 3.
if (b >= 2)
++count;
a = 3;
}
if (a % 2 == 0)
++a;
auto check = [](int i) -> bool {
for (int j = 3; j*j <= i; j += 2)
if (i % j == 0)
return false;
return true;
};
for (a; a <= b; a+=2) {
if (check(a))
++count;
}
return count;
}
Now, let me point out that this is already so much faster (even single-threaded) that if all we wanted was the 4x (or so) speed-up we'd get from perfect thread scaling, we're done, without using threading at all. For the limit you gave, this finishes in well under 1 second.
For the sake of argument, however, let's assume we want to get more, and make use of multiple cores too. One thing to realize here is that we generally want at least a few more threads than cores. The problem is fairly simple: with only one thread per core, we have nothing to make up for the fact that we haven't really distributed the load evenly between the threads. The thread processing the largest numbers has quite a bit more work to do than the thread processing the smallest numbers, so on (for example) a 4-core machine, as soon as one thread finishes, we can only use 75% of the CPU. Then when another thread finishes, it drops to 50%. Then 25%, and finally it finishes, using only one core.
We could probably do some computation to attempt to distribute the load more evenly, but it's a lot easier to just split the load into, say, six or eight times as many threads as cores. This way the computation can continue using all the cores until there are only three threads remaining [1].
Putting all that into code, we can end up with something like this:
int main() {
using namespace chrono;
int limit = 50000000;
unsigned int nthreads = 8 * thread::hardware_concurrency();
cout << "\nComputing multi-threaded:\n";
cout << "Number of threads: " << nthreads << "\n";
cout << "Calculating number of primes less than " << limit << "... \n";
auto start2 = high_resolution_clock::now();
vector<future<int>> threads;
int intrvl = limit / nthreads;
for (int i = 0; i < nthreads; i++)
threads.emplace_back(std::async(std::launch::async, prime, i*intrvl + 1, (i + 1)*intrvl));
int primes = 0;
for (auto &t : threads)
primes += t.get();
auto end2 = high_resolution_clock::now();
cout << "Primes: " << primes << ", Time: " << duration_cast<milliseconds>(end2 - start2).count() << "\n";
}
Note a couple of points:
This runs so much faster that I've increased the upper limit by a fairly large factor, so it'll run long enough that we can at least see it use 100% of the CPU time for a few seconds before it's done [2].
I've added some timing code to get a little more accurate idea of how long it runs for.
At least when I run this, it seems to act about as we'd expect/hope: it uses 100% of the CPU time until it gets very close to the end, when it starts to drop just before finishing (i.e., when we have fewer threads to execute than we have cores to execute them).
[1] In case you wonder how OpenMP avoids this: it usually uses a thread pool, so some number of iterations of the loop is dispatched to the thread pool as a task. This lets it produce a large number of tasks without having a huge number of threads contending for CPU time simultaneously.
[2] With the upper limit you used, it finished on my machine in about 90 milliseconds, which isn't long enough for it to even make a noticeable blip on the CPU usage graph.
The OpenMP example is using "reduction" on the sum variable primes which means that each task sums its own local primes variable.
OpenMP adds the thread local copies of primes together at the end of the parallel part to get the grand total.
That means it does not need to lock.
As @Sam says, a thread will get put to sleep if it cannot acquire the mutex lock.
So in your case, the threads will spend a fair amount of time asleep.
If you don't want to use OpenMP, try static std::atomic<int> primes{0}; then you don't need the mutex lock and unlock.
Or you could simulate OpenMP reduction by using an array primes[numThreads] where thread i sums into primes[i] then sum primes[] at the end.

C++ 11 std thread summation with atomic very slow

I wanted to learn to use C++ 11 std::threads with VS2012 and I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. Test program is given below:
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum = 0;
for(unsigned int j = 0; j < 2; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void test_with_2_threds()
{
std::thread t[2];
sum = 0;
//Launch a group of threads
for (int i = 0; i < 2; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < 2; ++i) {
t[i].join();
}
}
int _tmain(int argc, _TCHAR* argv[])
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
cout << "-----------------------------------------\n";
cout << "test with 2_threds\n";
start = chrono::system_clock::now();
test_with_2_threds();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
_getch();
return 0;
}
Now, when I use for the counter just the long long variable (which is commented) I get value which is different from the correct - 100000000 instead of 200000000. I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at beginning. Performance is 110 ms with two threads vs 200 ms for one thread.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
> I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
> So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
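One way to apply that rule to the code in the question is to let each thread count in a local variable and touch the shared atomic only once. The following is a minimal sketch; count_with_local_totals and its parameters are hypothetical names introduced for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread performs `range` increments on a private counter and then
// publishes its subtotal with a single atomic add.
long long count_with_local_totals(unsigned num_threads, unsigned range)
{
    std::atomic<long long> sum(0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&sum, range] {
            long long local = 0;
            // Mirrors the loop from the question; an optimizer is free to
            // collapse it, since nothing else can observe `local`.
            for (unsigned k = 0; k < range; ++k)
                ++local;
            // One write to shared data per thread instead of `range` writes.
            sum.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join(); // join makes all the adds visible here
    return sum.load();
}
```

Relaxed ordering is enough here because the final value is only read after every thread has been joined.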
Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code, so (regardless of the value you give for range) it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel, without synchronizing, then adding together the individual results at the end.
Another point is to ensure against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the whole computation at compile-time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <numeric>
const int num_threads = 4;
struct val {
long long sum;
int pad[2];
val &operator=(long long i) { sum = i; return *this; }
operator long long &() { return sum; }
operator long long() const { return sum; }
};
val sum[num_threads];
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum[0] = 0LL;
for(unsigned int j = 0; j < num_threads; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum[0] ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum[tid] ++ ;
}
void test_with_threads()
{
std::thread t[num_threads];
std::fill_n(sum, num_threads, 0);
//Launch a group of threads
for (int i = 0; i < num_threads; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i) {
t[i].join();
}
long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}
int main()
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
cout << "-----------------------------------------\n";
cout << "test with threads\n";
start = chrono::system_clock::now();
test_with_threads();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
_getch();
return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the sums are identical, but N threads increases speed by a factor of approximately N (up to the number of cores available).
Try to use prefix increment, which will give performance improvement.
Tested on my machine: std::memory_order_relaxed does not give any advantage.