I tried to write an answer to the question How to get 100% CPU usage from a C program using the std::thread class. Here is my code:
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
using namespace std;
static int primes = 0;
void prime(int a, int b);
mutex mtx;
int main()
{
    unsigned int nthreads = thread::hardware_concurrency();
    vector<thread> threads;
    int limit = 1000000;
    int intrvl = (int) limit / nthreads;
    for (int i = 0; i < nthreads; i++)
    {
        threads.emplace_back(prime, i*intrvl+1, i*intrvl+intrvl);
    }
    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";
    for (thread & t : threads) {
        t.join();
    }
    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
    return 0;
}
void prime(int a, int b)
{
    for (a; a <= b; a++) {
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            mtx.lock();
            primes++;
            mtx.unlock();
        }
    }
}
But when I run it I get the following diagram:
[CPU usage graph from the original post: usage oscillates like a sinusoid]
But when I run @Mysticial's answer, which uses OpenMP, I get this:
[CPU usage graph from the original post]
I checked both programs with ps -eLf and both of them use 8 threads. Why do I get this unsteady diagram, and how can I get the same result with std::thread that OpenMP achieves?
There are some fundamental differences between Mysticial's answer and your code.
Difference #1
Your code creates a chunk of work for each CPU and lets each thread run to completion. This means that once a thread has finished, there will be a sharp drop in CPU usage: a core sits idle while the remaining threads run to completion. This happens because scheduling is not always fair; one thread may progress, and finish, much faster than the others.
The OpenMP solution solves this by declaring schedule(dynamic), which tells OpenMP to create, internally, a work queue that all the threads consume work from. When a chunk of work is finished, the thread that would have exited in your code instead consumes another chunk of work and gets busy with it.
Eventually, this becomes a balancing act of picking adequately sized chunks. Too large, and the CPUs may not be maxed out toward the end of the task. Too small, and there can be significant overhead.
Difference #2
You are writing to a variable, primes, that is shared between all of the threads.
This has two consequences:
It requires synchronization to prevent a data race.
It makes the cache on a modern CPU very unhappy since a cache flush is required before writes on one thread are visible to another thread.
The OpenMP solution solves this by reducing, via operator+(), each thread's private copy of primes into the final result. This is what reduction(+ : primes) does.
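For reference, here is a minimal sketch of the kind of OpenMP loop being described. This is not Mysticial's exact code; the trial-division test inside is just a stand-in:

#include <iostream>

int main()
{
    const int limit = 1000000;
    int primes = 0;
    // schedule(dynamic) hands out loop chunks from an internal work queue;
    // reduction(+ : primes) gives each thread a private counter that is
    // summed into the shared total at the end of the parallel region.
    #pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (int n = 2; n < limit; n++) {
        bool isPrime = true;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) { isPrime = false; break; }
        }
        if (isPrime) primes++;
    }
    std::cout << primes << "\n"; // compile with -fopenmp (GCC/Clang) or /openmp (MSVC)
}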
With this knowledge of how OpenMP is splitting up, scheduling the work, and combining the results, we can modify your code to behave similarly.
#include <iostream>
#include <thread>
#include <vector>
#include <utility>
#include <algorithm>
#include <functional>
#include <mutex>
#include <future>
using namespace std;

int prime(int a, int b)
{
    int primes = 0;
    for (; a <= b; a++) {
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            primes++;
        }
    }
    return primes;
}

int workConsumingPrime(vector<pair<int, int>>& workQueue, mutex& workMutex)
{
    int primes = 0;
    unique_lock<mutex> workLock(workMutex);
    while (!workQueue.empty()) {
        pair<int, int> work = workQueue.back();
        workQueue.pop_back();

        workLock.unlock(); //< Don't hold the mutex while we do our work.
        primes += prime(work.first, work.second);
        workLock.lock();
    }
    return primes;
}

int main()
{
    int nthreads = thread::hardware_concurrency();
    int limit = 1000000;

    // A place to put work to be consumed, and a synchronisation object to protect it.
    vector<pair<int, int>> workQueue;
    mutex workMutex;

    // Put all of the ranges into a queue for the threads to consume.
    int chunkSize = max(limit / (nthreads * 16), 10); //< Handwaving: 16 chunks per thread seemed a good factor.
    for (int i = 0; i < limit; i += chunkSize) {
        workQueue.push_back(make_pair(i, min(limit, i + chunkSize)));
    }

    // Start the threads.
    vector<future<int>> futures;
    for (int i = 0; i < nthreads; ++i) {
        packaged_task<int()> task(bind(workConsumingPrime, ref(workQueue), ref(workMutex)));
        futures.push_back(task.get_future());
        thread(move(task)).detach();
    }

    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    // Sum up all the results.
    int primes = 0;
    for (future<int>& f : futures) {
        primes += f.get();
    }

    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
}
This is still not a perfect reproduction of how the OpenMP example behaves. For example, this is closer to OpenMP's static schedule since chunks of work are a fixed size. Also, OpenMP does not use a work queue at all. So I may have lied a little bit -- call it a white lie, since I wanted to be more explicit about showing the work being split up. What it is likely doing behind the scenes is storing the iteration at which the next thread should start when it becomes available, along with a heuristic for the next chunk size.
Even with these differences, I'm able to max out all my CPUs for an extended period of time.
Looking to the future...
You probably noticed that the OpenMP version is a lot more readable. This is because it's meant to solve problems just like this. So, when we try to solve them without a library or compiler extension, we end up reinventing the wheel. Luckily, there is a lot of work being done to bring this sort of functionality directly into C++. Specifically, the Parallelism TS can help us out if we could represent this as a standard C++ algorithm. Then we could tell the library to distribute the algorithm across all CPUs as it sees fit so it does all the heavy lifting for us.
In C++11, with a little bit of help from Boost, this algorithm could be written as:
#include <iostream>
#include <iterator>
#include <algorithm>
#include <boost/range/irange.hpp>
using namespace std;

bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int i = 2; i < n; ++i) {
        if (n % i == 0)
            return false;
    }
    return true;
}

int main()
{
    auto range = boost::irange(0, 1000001);
    auto numPrimes = count_if(begin(range), end(range), isPrime);
    cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}
And to parallelise the algorithm, you just need to #include <execution_policy> and pass std::par as the first parameter to count_if. (That was the Parallelism TS spelling; as adopted into C++17, the header became <execution> and the policy std::execution::par.)
auto numPrimes = count_if(par, begin(range), end(range), isPrime);
And that's the kind of code that makes me happy to read.
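For completeness, here is a sketch of the same program with the parallel algorithms as eventually standardized in C++17. This assumes a standard library that actually implements the parallel overloads (e.g., recent MSVC, or GCC's libstdc++ linked against TBB); the vector-plus-iota is just a stand-in for boost::irange:

#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int i = 2; i < n; ++i) {
        if (n % i == 0)
            return false;
    }
    return true;
}

int main()
{
    std::vector<int> range(1000001);
    std::iota(range.begin(), range.end(), 0); // 0, 1, ..., 1000000
    // The library distributes the work across threads as it sees fit.
    auto numPrimes = std::count_if(std::execution::par, range.begin(), range.end(), isPrime);
    std::cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}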
Note: Absolutely no time was spent optimising this algorithm at all. If we were to do any optimisation, I'd look into something like the Sieve of Eratosthenes, which uses previous prime computations to help with future ones.
First, you need to realize that OpenMP usually has a fairly sophisticated thread pool under the covers, so matching it (exactly) will probably be at least somewhat difficult.
Second, it seems to me that before optimizing the threading, you should attempt to start with at least a halfway decent basic algorithm. In this case, the algorithm you're implementing is pretty awful. It checks whether numbers are prime, but does a lot of work that doesn't accomplish anything useful:
It's checking whether even numbers are prime. Other than 2, they're not. Ever.
It's checking whether odd numbers are divisible by even numbers. Again, they're not. Ever.
It's checking whether numbers are divisible by numbers larger than their square root. If there's no divisor smaller than the square root, there can't be one larger than it either.
Although it probably doesn't affect speed, I also find it a lot easier to have a function that checks whether a single number is prime, and just returns true/false to indicate the result, than to have somewhat elaborate code to figure out whether a preceding loop ran to completion or exited early.
You can optimize the algorithm by eliminating more than that, but that much doesn't strike me as "optimization" nearly so much as simply avoiding completely unnecessary pessimization.
At least in my opinion, it's also a bit easier (in this case) to use std::async to launch the threads. This lets us return a value from our thread (the count we want back) pretty easily.
So, let's start by fixing prime based on those observations:
int prime(int a, int b)
{
    int count = 0;
    if (a < 2)
        a = 2;        // don't count 1 (or below) as prime
    if (a == 2)
        ++count;
    if (a % 2 == 0)
        ++a;

    auto check = [](int i) -> bool {
        for (int j = 3; j * j <= i; j += 2)
            if (i % j == 0)
                return false;
        return true;
    };

    for (; a <= b; a += 2) {
        if (check(a))
            ++count;
    }
    return count;
}
Now, let me point out that this is already enough faster (even single-threaded) that, if all we wanted was for the job to finish about 4 times faster (roughly what perfect thread-scaling would buy us), we'd be done without using threading at all. For the limit you gave, this finishes in well under 1 second.
For the sake of argument, however, let's assume we want to get more, and make use of multiple cores too. One thing to realize here is that we generally want at least a few more threads than cores. The problem is fairly simple: with only one thread per core, we have nothing to make up for the fact that we haven't really distributed the load evenly between the threads. The thread processing the largest numbers has quite a bit more work to do than the thread processing the smallest numbers, so if we have (for example) a 4-core machine, as soon as one thread finishes we can only use 75% of the CPU. Then when another thread finishes, it drops to 50%. Then 25%, and finally the last thread finishes, using only one core.
We could probably do some computation to attempt to distribute the load more evenly, but it's a lot easier to just split the load into, say, six or eight times as many threads as cores. This way the computation can continue using all the cores until there are only three threads remaining (on the 4-core machine from the example above).
Putting all that into code, we can end up with something like this:
// Headers needed if this snippet is compiled together with prime() above:
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

int main() {
    using namespace chrono;
    int limit = 50000000;
    unsigned int nthreads = 8 * thread::hardware_concurrency();

    cout << "\nComputing multi-threaded:\n";
    cout << "Number of threads: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";
    auto start2 = high_resolution_clock::now();

    vector<future<int>> threads;
    int intrvl = limit / nthreads;
    for (unsigned int i = 0; i < nthreads; i++)
        threads.emplace_back(std::async(std::launch::async, prime, i * intrvl + 1, (i + 1) * intrvl));

    int primes = 0;
    for (auto &t : threads)
        primes += t.get();

    auto end2 = high_resolution_clock::now();
    cout << "Primes: " << primes << ", Time: " << duration_cast<milliseconds>(end2 - start2).count() << "\n";
}
Note a couple of points:
This runs enough faster that I've increased the upper limit by a fairly large factor, so it'll run long enough that we can at least see it use 100% of the CPU time for a few seconds before it's done.
I've added some timing code to get a little more accurate idea of how long it runs for.
At least when I run this, it seems to act about as we'd expect/hope: it uses 100% of the CPU time until it gets very close to the end, when it starts to drop just before finishing (i.e., when we have fewer threads to execute than we have cores to execute them).
In case you wonder how OpenMP avoids this: it usually uses a thread pool, so some number of iterations of the loop is dispatched to the thread pool as a task. This lets it produce a large number of tasks without having a huge number of threads contending for CPU time simultaneously.
With the upper limit you used, it finished on my machine in about 90 milliseconds, which isn't long enough for it to even make a noticeable blip on the CPU usage graph.
The OpenMP example is using "reduction" on the sum variable primes, which means that each task sums into its own local primes variable.
OpenMP adds the thread local copies of primes together at the end of the parallel part to get the grand total.
That means it does not need to lock.
As @Sam says, a thread will be put to sleep if it cannot acquire the mutex lock.
So in your case, the threads will spend a fair amount of time asleep.
If you don't want to use OpenMP, try static std::atomic<int> primes{0}; then you don't need the mutex lock and unlock.
Or you could simulate OpenMP's reduction by using an array primes[numThreads], where thread i sums into primes[i], and then summing primes[] at the end, as sketched below.
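A minimal sketch of that last suggestion (the function and variable names here are mine, not from the question's code): each thread accumulates into a local counter and writes its slot exactly once, so no mutex is needed, and summing the slots at the end plays the role of OpenMP's reduction.

#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// Same trial-division test as the question, but summing into a local
// counter and writing the per-thread slot once at the end (a single
// write also avoids false sharing on the vector during the hot loop).
void countPrimes(int a, int b, int& slot)
{
    int local = 0;
    for (; a <= b; a++) {
        int i = 2;
        while (i <= a && a % i != 0)
            i++;
        if (i == a)
            local++;
    }
    slot = local;
}

int main()
{
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 1; // hardware_concurrency may return 0
    int limit = 1000000;
    int intrvl = limit / nthreads;

    std::vector<int> partial(nthreads, 0); // one counter per thread
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nthreads; i++)
        threads.emplace_back(countPrimes, i * intrvl + 1, (i + 1) * intrvl,
                             std::ref(partial[i]));
    for (auto& t : threads)
        t.join();

    int primes = 0; // the "reduction" step
    for (int p : partial)
        primes += p;
    std::cout << primes << "\n";
}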
Related
In the following code, I repeatedly call std::chrono::high_resolution_clock::now twice and measure the time between these two calls. I would expect this time to be very small, since no other code runs between the two calls. However, I observe strange behavior.
For small N, the max element is within a few nanoseconds, as expected. However, the more I increase N, the larger the outliers get; I have seen up to a few milliseconds. Why does this happen?
In other words, why does the max element of v increase as I increase N in the following code?
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cstdint> // for uint64_t

int main()
{
    using ns = std::chrono::nanoseconds;
    uint64_t N = 10000000;
    std::vector<uint64_t> v(N, 0);
    for (uint64_t i = 0; i < N; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        v[i] = std::chrono::duration_cast<ns>(std::chrono::high_resolution_clock::now() - start).count();
    }
    std::cout << "max: " << *std::max_element(v.begin(), v.end()) << std::endl;
    return 0;
}
The longer you run your loop, the more likely it is that your OS will decide that your thread has consumed enough resources for the moment and suspend it. And the longer you run your loop, the more likely it is that this suspension will happen between those calls.
Since you're only looking at the "max" time, this only has to happen once to cause the max time to spike into the millisecond range.
I am very new to the modern C++ library, and I am trying to learn how to use std::async to perform some operations on a big pointer array. The sample code I have written crashes at the point where the async task is launched.
Sample code:
#include <iostream>
#include <future>
#include <tuple>
#include <numeric>
#include <cstdlib> // for rand

#define maximum(a,b) (((a) > (b)) ? (a) : (b))

class Foo {
    bool flag;
public:
    Foo(bool b) : flag(b) {}

    //******
    //
    //******
    std::tuple<long long, int> calc(int* a, int begIdx, int endIdx) {
        long sum = 0;
        int max = 0;
        if (!(*this).flag) {
            return std::make_tuple(sum, max);
        }
        if (endIdx - begIdx < 100)
        {
            for (int i = begIdx; i < endIdx; ++i)
            {
                sum += a[i];
                if (max < a[i])
                    max = a[i];
            }
            return std::make_tuple(sum, max);
        }
        int midIdx = endIdx / 2;
        auto handle = std::async(&Foo::calc, this, std::ref(a), midIdx, endIdx);
        auto resultTuple = calc(a, begIdx, midIdx);
        auto asyncTuple = handle.get();
        sum = std::get<0>(asyncTuple) + std::get<0>(resultTuple);
        max = maximum(std::get<1>(asyncTuple), std::get<1>(resultTuple));
        return std::make_tuple(sum, max);
    }

    //******
    //
    //******
    void call_calc(int*& a) {
        auto handle = std::async(&Foo::calc, this, std::ref(a), 0, 10000);
        auto resultTuple = handle.get();
        std::cout << "Sum = " << std::get<0>(resultTuple) << " Maximum = " << std::get<1>(resultTuple) << std::endl;
    }
};

//******
//
//******
int main() {
    int* nums = new int[10000];
    for (int i = 0; i < 10000; ++i)
        nums[i] = rand() % 10000 + 1;

    Foo foo(true);
    foo.call_calc(nums);
    delete[] nums;
}
Can anyone help me identify why it crashes?
Is there any better approach to apply parallelism to operations on a big pointer array?
The fundamental problem is your code wants to launch more than array size / 100 threads. That means more than 100 threads. 100 threads won't do anything good; they'll thrash. See std::thread::hardware_concurrency, and in general don't use raw async or thread in production applications; write task pools and splice together futures and the like.
That many threads is both extremely inefficient and could exhaust system resources.
The second problem is you failed to calculate the average of 2 values.
The average of begIdx and endIdx is not endIdx/2 but rather:
int midIdx = begIdx + (endIdx-begIdx) / 2;
You'll notice I discovered the problem with your program by adding intermediate output. In particular, I had it print out the ranges it was working on, and I noticed it was repeating ranges. This is known as "printf debugging", and it's pretty powerful, especially when step-based debugging isn't practical (with this many threads, stepping through the code would be mind-numbing).
The problem with async calls is that they are not executed in some universe where an infinite number of tasks can all run at exactly the same time.
Async calls are executed on a processor that has a certain number of cores, and the calls have to be lined up to run on them.
Now here is where problems of synchronization, and the problems of blocking, starvation, ... and other multithreaded issues come into play.
Your algorithm is very difficult to follow, as it spawns tasks inside already-created tasks. Something is happening, but it is hard to trace.
I would solve this problem by:
Creating a vector of results (which will be from async threads)
In a loop execute the async calls (assigning the result to the vector)
Afterwards, loop through the results vector gathering the results, as sketched below
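A minimal sketch of that structure (the chunk size and names are mine, purely illustrative):

#include <algorithm>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> nums(10000, 1);
    const std::size_t chunk = 1000; // fixed chunk size, chosen arbitrarily

    // 1) A vector to hold the results (futures) from the async threads.
    std::vector<std::future<long long>> results;

    // 2) In a loop, execute the async calls, assigning each future to the vector.
    for (std::size_t beg = 0; beg < nums.size(); beg += chunk) {
        std::size_t end = std::min(beg + chunk, nums.size());
        results.push_back(std::async(std::launch::async, [&nums, beg, end] {
            return std::accumulate(nums.begin() + beg, nums.begin() + end, 0LL);
        }));
    }

    // 3) Afterwards, loop through the results vector gathering the results.
    long long sum = 0;
    for (auto& f : results)
        sum += f.get();
    std::cout << "Sum = " << sum << "\n";
}

Note that, unlike raw std::thread, the futures returned by std::async propagate exceptions from the workers, and get() both joins and retrieves the value.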
In a larger numerical computation, I have to perform the trivial task of summing up the products of the elements of two vectors. Since this task needs to be done very often, I tried to make use of the auto-vectorization capabilities of my compiler (VC2015). I introduced a temporary vector, where the products are saved in a first loop, and then performed the summation in a second loop. Optimization was set to full and fast code was preferred. This way, the first loop got vectorized by the compiler (I know this from the compiler output).
The result was surprising. The vectorized code performed 3 times slower on my machine (core i5-4570 3.20 GHz) than the simple code. Could anybody explain why and what might improve the performance? I've put both versions of the algorithm fragment into a minimal running example, which I used myself for testing:
#include "stdafx.h"
#include <vector>
#include <Windows.h>
#include <iostream>
using namespace std;
int main()
{
// Prepare timer
LARGE_INTEGER freq,c_start,c_stop;
QueryPerformanceFrequency(&freq);
int size = 20000000; // size of data
double v = 0;
// Some data vectors. The data inside doesn't matter
vector<double> vv(size);
vector<double> tt(size);
vector<float> dd(size);
// Put random values into the vectors
for (int i = 0; i < size; i++)
{
tt[i] = rand();
dd[i] = rand();
}
// The simple version of the algorithm fragment
QueryPerformanceCounter(&c_start); // start timer
for (int p = 0; p < size; p++)
{
v += tt[p] * dd[p];
}
QueryPerformanceCounter(&c_stop); // Stop timer
cout << "Simple version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
cout << v << endl; // We use v once. This avoids its calculation to be optimized away.
// The version that is auto-vectorized
for (int i = 0; i < size; i++)
{
tt[i] = rand();
dd[i] = rand();
}
v = 0;
QueryPerformanceCounter(&c_start); // start timer
for (int p = 0; p < size; p++) // This loop is vectorized according to compiler output
{
vv[p] = tt[p] * dd[p];
}
for (int p = 0; p < size; p++)
{
v += vv[p];
}
QueryPerformanceCounter(&c_stop); // Stop timer
cout << "Vectorized version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
cout << v << endl; // We use v once. This avoids its calculation to be optimized away.
cin.ignore();
return 0;
}
You added a large amount of work by storing the products in a temporary vector.
For such a simple computation on large data, the CPU time that you expect to save by vectorization doesn't matter. Only memory references matter.
You added memory references, so it runs slower.
I would have expected the compiler to optimize the original version of that loop. I doubt the optimization would affect the execution time (because it is dominated by memory access regardless), but it should be visible in the generated code. If you want to hand-optimize code like that, a temporary vector is always the wrong way to go. The right direction is the following (for simplicity, I assumed size is even):
double v1 = 0; // second accumulator (missing in the original snippet)
for (int p = 0; p < size; p += 2)
{
    v  += tt[p]     * dd[p];
    v1 += tt[p + 1] * dd[p + 1];
}
v += v1;
Note that your data is large enough, and the operation simple enough, that NO optimization should be able to improve on the simplest version. That includes my sample hand optimization. But I assume your test is not exactly representative of what you are really trying to do or understand. So with smaller data or a more complicated operation, the approach I showed may help.
Also notice that my version relies on reordering the additions, i.e., on addition being associative. For real numbers, addition is associative. But in floating point, it isn't. The answer is likely to be different by an amount too tiny for you to care about. But that is data dependent. If you have large values of opposite sign in odd/even positions canceling each other early in the original sequence, then by segregating the even and odd positions my "optimization" would totally destroy the answer. (Of course, the opposite can also be true. For example, if all the even positions were tiny and the odd ones included large values canceling each other, then the original sequence produced garbage and the changed sequence would be more correct.)
I'm experimenting with the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up the square roots of some random numbers.
#include <iostream>
#include <thread>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <cmath>
#include <ctime> // for time(NULL)

double add_single(int N) {
    double sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += sqrt(1.0 * rand() / RAND_MAX);
    }
    return sum / N;
}

void add_multi(int N, double& result) {
    double sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += sqrt(1.0 * rand() / RAND_MAX);
    }
    result = sum / N;
}

int main() {
    srand(time(NULL));
    int N = 1000000;

    // single-threaded
    auto t1 = std::chrono::high_resolution_clock::now();
    double result1 = add_single(N);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << "time single: " << time_elapsed << std::endl;

    // multi-threaded
    std::vector<std::thread> th;
    int nr_threads = 3;
    double partial_results[] = {0, 0, 0};
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < nr_threads; ++i)
        th.push_back(std::thread(add_multi, N / nr_threads, std::ref(partial_results[i])));
    for (auto& a : th)
        a.join();

    double result_multicore = 0;
    for (double result : partial_results)
        result_multicore += result;
    result_multicore /= nr_threads;
    t2 = std::chrono::high_resolution_clock::now();
    time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << "time multi: " << time_elapsed << std::endl;

    return 0;
}
Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is
time single: 33
time multi: 565
So the multi-threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and less prone to compiler optimizations, so I'm out of ideas.
edit:
The problem persists for larger N, so it is not caused by the short runtime.
The time for creating the threads is not the problem; excluding it does not change the result significantly.
Wow, I found the problem. It was indeed rand(). I replaced it with a C++11 equivalent and now the runtime scales perfectly. Thanks everyone!
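For reference, one way the rand() call can be replaced with C++11 <random> facilities (a sketch; the asker didn't post their exact replacement): each thread gets its own generator, so there is no shared hidden state to contend on.

#include <cmath>
#include <random>

// One generator per thread: unlike rand(), there is no shared global
// state, so threads never serialize on a lock.
void add_multi(int N, double& result)
{
    thread_local std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double sum = 0;
    for (int i = 0; i < N; ++i)
        sum += std::sqrt(dist(rng));
    result = sum / N;
}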
On my system the behavior is the same, but as Maxim mentioned, rand is not thread-safe. When I change rand to rand_r, the multi-threaded code is faster, as expected.
void add_multi(int N, double& result) {
    double sum = 0;
    unsigned int seed = time(NULL);
    for (int i = 0; i < N; ++i) {
        sum += sqrt(1.0 * rand_r(&seed) / RAND_MAX);
    }
    result = sum / N;
}
As you discovered, rand is the culprit here.
For those who are curious, it's possible that this behavior comes from your implementation of rand using a mutex for thread safety.
For example, eglibc defines rand in terms of __random, which is defined as:
long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);

  return retval;
}
This kind of locking would force multiple threads to run serially, resulting in lower performance.
The time needed to execute the program is very small (33 ms). This means that the overhead of creating and handling several threads may be more than the real benefit. Try using programs that need longer execution times (e.g., 10 s).
To make this faster, use a thread pool pattern.
This will let you enqueue tasks in other threads without the overhead of creating a std::thread each time you want to use more than one thread.
Don't count the overhead of setting up the queue in your performance metrics, just the time to enqueue and extract the results.
Create a set of threads and a queue of tasks (a structure containing a std::function<void()>) to feed them. The threads wait on the queue for new tasks to do, do them, then wait on new tasks.
The tasks are responsible for communicating their "done-ness" back to the calling context, such as via a std::future<>. The code that lets you enqueue functions into the task queue might do this wrapping for you, i.e., with this signature:
template<typename R = void>
std::future<R> enqueue(std::function<R()> f) {
    std::packaged_task<R()> task(f);
    std::future<R> retval = task.get_future();
    this->add_to_queue(std::move(task)); //< the queue must accept move-only types
    return retval;
}
which turns a naked std::function returning R into a nullary packaged_task, then adds that to the task queue. Note that the task queue needs to be move-aware, because packaged_task is move-only.
Note 1: I am not all that familiar with std::future, so the above could be in error.
Note 2: If tasks put into the above described queue are dependent on each other for intermediate results, the queue could deadlock, because no provision to "reclaim" threads that are blocked and execute new code is described. However, "naked computation" non-blocking tasks should work fine with the above model.
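To make the design concrete, here is a deliberately minimal, fixed-signature sketch of the pool described above (all names are mine; a real pool would be generic over the task's return type, and per Note 2 the tasks must not block on one another):

#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// A tiny pool for tasks of one fixed signature, int(): it shows the
// queue-plus-enqueue shape without the generality of a real pool.
class TinyPool {
    std::deque<std::packaged_task<int()>> queue_; // move-aware task queue
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::vector<std::thread> workers_;

public:
    explicit TinyPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::packaged_task<int()> task;
                    {
                        std::unique_lock<std::mutex> lock(m_);
                        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
                        if (done_ && queue_.empty()) return;
                        task = std::move(queue_.front());
                        queue_.pop_front();
                    }
                    task(); // run outside the lock
                }
            });
    }

    std::future<int> enqueue(std::function<int()> f) {
        std::packaged_task<int()> task(std::move(f));
        std::future<int> fut = task.get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
        return fut;
    }

    ~TinyPool() { // drain remaining tasks, then join the workers
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
};

int main() {
    TinyPool pool(3);
    auto f = pool.enqueue([] { return 6 * 7; });
    std::cout << f.get() << "\n"; // prints 42 once a pool thread runs the task
}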
This might be some weird Linux quirk, but I'm observing very strange behavior.
The following code compares a synchronous version of summing numbers with an async version. The thing is that I'm seeing a performance increase (it's not caching; it happens even when I split the code into two separate programs), while still observing the program as single-threaded (only one core is used).
strace does show some thread activity, but monitoring tools like top still show only one core in use.
The second problem I'm observing is that if I increase the spawn ratio, memory usage explodes. What is the memory overhead of a thread? With 5000 threads I get ~10 GB of memory usage.
#include <iostream>
#include <random>
#include <chrono>
#include <future>
#include <vector> // for std::vector

using namespace std;

long long sum2(const vector<int>& v, size_t from, size_t to)
{
    const size_t boundary = 5*1000*1000;
    if (to - from <= boundary)
    {
        long long rsum = 0;
        for (; from < to; from++)
        {
            rsum += v[from];
        }
        return rsum;
    }
    else
    {
        size_t mid = from + (to - from)/2;
        auto s2 = async(launch::async, sum2, cref(v), mid, to);
        long long rsum = sum2(v, from, mid);
        rsum += s2.get();
        return rsum;
    }
}

long long sum2(const vector<int>& v)
{
    return sum2(v, 0, v.size());
}

long long sum(const vector<int>& v)
{
    long long rsum = 0;
    for (auto i : v)
    {
        rsum += i;
    }
    return rsum;
}

int main()
{
    const size_t vsize = 100*1000*1000;
    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
    uniform_int_distribution<uint32_t> dist(0, 10);
    for (auto i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    auto start = chrono::high_resolution_clock::now();
    long long suma = sum(x);
    auto end = chrono::high_resolution_clock::now();
    cout << "Sum is " << suma << endl;
    cout << "Duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;

    start = chrono::high_resolution_clock::now();
    suma = sum2(x);
    end = chrono::high_resolution_clock::now();
    cout << "Async sum is " << suma << endl;
    cout << "Async duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;

    return 0;
}
Maybe you observe one core being used because the overlap between threads doing work simultaneously is too short to be noticeable. Summing 5 million values from a contiguous area of memory should be very fast on modern hardware, so by the time the parent finishes summing, the child may have barely started, and the parent may spend most or all of its time waiting for the result from the child. Have you tried increasing the work unit to see if the overlap becomes noticeable?
Regarding the increased performance: even if there is zero overlap between threads because the work unit is too small, the multithreaded version can still benefit from additional L1 cache memory. For such a test, memory will likely be the bottleneck, and the sequential version will use only one L1 cache while the multithreaded version will use as many as there are cores.
Have you checked the times that are being printed? On my machine, the serial time is under 1 s at -O2, whilst the parallel sum is several times faster. It's therefore entirely possible that the CPU usage isn't high for long enough for tools like top to register it, since they typically refresh only once per second.
If you increase the number of threads by reducing the count per thread, then you effectively increase the overhead of the thread management. If you have 5000 threads active, then your task will take 5000 × min-thread-stack-size in additional memory. On my machine that's 20 GB!
Why don't you try increasing the size of the source container? If you make the parallel section take long enough, you'll see the corresponding parallel CPU usage. However, be prepared: summing integers is fast, and the time taken generating the random numbers can be an order of magnitude or two longer than the time taken to add the numbers together.