I have a very simple parallel_for loop that does some work on a large vector. While this is a contrived example, I am hoping to measure any potential scheduling overhead by varying the grain size. The loop is as follows:
tbb_start = std::chrono::high_resolution_clock::now();
tbb::parallel_for(tbb::blocked_range<int>(0, values.size(), grainSize),
    [&](const tbb::blocked_range<int>& r)
    {
        for (int i = r.begin(); i < r.end(); ++i)
        {
            values[i] = std::sin(i * 0.001);
        }
    });
tbb_end = std::chrono::high_resolution_clock::now();
loop_duration = (tbb_end - tbb_start); // assumed to be a std::chrono::duration<double, std::milli>
std::cout << "TBB Time: " << loop_duration.count() << "ms" << std::endl;
My assumption here is that increasing the grain size will reduce the scheduling overhead (since the work is split into fewer, larger tasks). How then to measure this change in overhead? Per this paper:
https://www.epcc.ed.ac.uk/sites/default/files/PDF/ewomp99paper.pdf
The authors take the parallel run time minus the serial run time divided by the number of processors, i.e. T_p - T_s/p, and call that the overhead. Is this the standard way to do it? Is there another (possibly better) way?
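In code, the paper's definition amounts to something like this (a sketch; recent TBB exposes the worker count as tbb::this_task_arena::max_concurrency(), older versions as tbb::task_scheduler_init::default_num_threads()):

auto t0 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (int)values.size(); ++i)   // serial reference run
    values[i] = std::sin(i * 0.001);
auto t1 = std::chrono::high_resolution_clock::now();
tbb::parallel_for(tbb::blocked_range<int>(0, values.size(), grainSize),
    [&](const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i < r.end(); ++i)
            values[i] = std::sin(i * 0.001);
    });
auto t2 = std::chrono::high_resolution_clock::now();

double Ts = std::chrono::duration<double, std::milli>(t1 - t0).count();
double Tp = std::chrono::duration<double, std::milli>(t2 - t1).count();
int p = tbb::this_task_arena::max_concurrency(); // number of workers
double overhead = Tp - Ts / p;                   // EPCC-style overhead, in ms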
So I want to optimize the sum of a really big array, and in order to do that I have written multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2, 3, or 4 threads...
Can someone explain why this happens?
(Also, I've only started coding in C++ this semester; until now I only knew C, so I'm sorry for possible dumb mistakes.)
This is the thread code (the threadsum signature is not shown in the question; it is inferred here from the call in main):
void threadsum(int threadID, int numThreads, double *v, double *localSum, size_t stop)
{
    *localSum = 0.0;
    for (size_t i = 0; i < stop; i++)
        *localSum += v[i];
}
Main process code (the opening of main is restored here; the includes are omitted in the question):
int main(int argc, char *argv[])
{
    int numThreads = atoi(argv[1]);
    int N = 100000000;

    // create the input vector v and put some values in v
    vector<double> v(N);
    for (int i = 0; i < N; i++)
        v[i] = i;

    // this vector will contain the partial sum for each thread
    vector<double> localSum(numThreads, 0);

    // create threads. Each thread will compute part of the sum and store
    // its result in localSum[threadID] (threadID = 0, 1, ... numThreads-1)
    startChrono();
    vector<thread> myThreads(numThreads);
    for (int i = 0; i < numThreads; i++) {
        int start = i * v.size() / numThreads;
        myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i], v.size()/numThreads);
    }
    for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));

    // calculate global sum
    double globalSum = 0.0;
    for (int i = 0; i < numThreads; i++)
        globalSum += localSum[i];

    cout.precision(12);
    cout << "Sum = " << globalSum << endl;
    cout << "Runtime: " << stopChrono() << endl;
    exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. A vectorized streaming add will be really hard to beat. You need a more complex function than addition, or a much larger array, to really see results.
2- Related: the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and a single core can easily saturate its functional units. For a second thread to help, it can't even be a hyperthread on the same core; it would need to be on a different core entirely (the hyperthreads would both compete for the floating-point units).
To test this, you can create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
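A sketch of that experiment (the flag names and worker signature are mine, not from the question): create the threads up front, release them all at once with a start flag, and time only the window until every thread has reported completion.

#include <atomic>

std::atomic<bool> go{false};
std::atomic<int> numDone{0};

void threadsumFlagged(const double *v, double *localSum, size_t stop)
{
    while (!go.load(std::memory_order_acquire)) { } // spin until released
    double sum = 0.0;
    for (size_t i = 0; i < stop; i++)
        sum += v[i];
    *localSum = sum;
    numDone.fetch_add(1, std::memory_order_release);
}

// In main: create all the threads first, then
//   startChrono(); go = true;
//   while (numDone < numThreads) { }  // busy-wait until all workers finish
//   stopChrono();
// so thread creation and joining stay outside the timed region.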
3- All your localSum entries share the same cache line, so every write by one thread invalidates that line in the other cores' caches (false sharing). Better would be to accumulate into a local variable on the stack and write the result into the array once at the end, instead of adding into the array element on every iteration: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
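Applied to the thread function above, the fix is just this (a sketch reusing the inferred threadsum signature):

void threadsum(int threadID, int numThreads, double *v, double *localSum, size_t stop)
{
    double sum = 0.0;              // lives in a register or on this thread's stack
    for (size_t i = 0; i < stop; i++)
        sum += v[i];
    *localSum = sum;               // one write to the shared array at the end
}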
If, for some reason, you need to keep the running sum observable to others in that array, pad the localSum vector entries like this so they don't share a cache line:
struct localsumentry {
    double sum;
    char pad[56]; // pads the struct to 64 bytes, a typical cache-line size
};
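In C++11 the same padding can be expressed more directly with alignas (a sketch; the 64-byte figure assumes a typical x86 cache line):

struct alignas(64) localsumentry {
    double sum; // the alignment requirement pads sizeof to 64 bytes
};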
I tried to write an answer to the question "How to get 100% CPU usage from a C program" using the std::thread class. Here is my code:
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>

using namespace std;

static int primes = 0;
void prime(int a, int b);
mutex mtx;

int main()
{
    unsigned int nthreads = thread::hardware_concurrency();
    vector<thread> threads;
    int limit = 1000000;
    int intrvl = limit / nthreads;
    for (unsigned int i = 0; i < nthreads; i++)
    {
        threads.emplace_back(prime, i*intrvl+1, i*intrvl+intrvl);
    }
    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";
    for (thread & t : threads) {
        t.join();
    }
    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
    return 0;
}

void prime(int a, int b)
{
    for (; a <= b; a++) {
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            mtx.lock();
            primes++;
            mtx.unlock();
        }
    }
}
But when I run it I get the following CPU-usage diagram (graph in the original post): it looks like a sinusoid. But when I run @Mysticial's answer, which uses OpenMP, the usage stays steady (graph in the original post).
I checked both programs with ps -eLf and both of them use 8 threads. Why do I get this unsteady diagram, and how can I get the same result with std::thread that OpenMP gives?
There are some fundamental differences between Mysticial's answer and your code.
Difference #1
Your code creates a chunk of work for each CPU, and lets it run to completion. This means that once a thread has finished, there will be a sharp drop in the CPU usage since a CPU will be idle while the other threads run to completion. This happens because scheduling is not always fair. One thread may progress, and finish, much faster than the others.
The OpenMP solution solves this by declaring schedule(dynamic) which tells OpenMP to, internally, create a work queue that all the threads will consume work from. When a chunk of work is finished, the thread that would have then exited in your code consumes another chunk of work and gets busy with it.
Eventually, this becomes a balancing act of picking adequately sized chunks. Too large, and the CPUs may not be maxed out toward the end of the task. Too small, and there can be significant overhead.
Difference #2
You are writing to a variable, primes, that is shared between all of the threads.
This has 2 consequences:
It requires synchronization to prevent a data race.
It makes the cache on a modern CPU very unhappy, since the cache line holding primes must bounce between cores before writes on one thread become visible to another thread.
The OpenMP solution solves this by giving each thread its own private copy of primes and reducing those copies, via operator+(), into the final result at the end. This is what reduction(+ : primes) does.
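For reference, the core of such an OpenMP solution looks roughly like this (a sketch from memory, compiled with -fopenmp; the linked answer's exact code may differ):

int primes = 0;
#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
for (int a = 2; a < 1000000; a++) {
    int i = 2;
    while (i <= a && a % i != 0)
        i++;
    if (i == a)
        primes++; // increments this thread's private copy
}
// OpenMP sums the private copies into the shared primes here.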
With this knowledge of how OpenMP splits up and schedules the work and then combines the results, we can modify your code to behave similarly.
#include <iostream>
#include <thread>
#include <vector>
#include <utility>
#include <algorithm>
#include <functional>
#include <mutex>
#include <future>

using namespace std;

int prime(int a, int b)
{
    int primes = 0;
    for (; a <= b; a++) {
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            primes++;
        }
    }
    return primes;
}

int workConsumingPrime(vector<pair<int, int>>& workQueue, mutex& workMutex)
{
    int primes = 0;
    unique_lock<mutex> workLock(workMutex);
    while (!workQueue.empty()) {
        pair<int, int> work = workQueue.back();
        workQueue.pop_back();

        workLock.unlock(); //< Don't hold the mutex while we do our work.
        primes += prime(work.first, work.second);
        workLock.lock();
    }
    return primes;
}

int main()
{
    int nthreads = thread::hardware_concurrency();
    int limit = 1000000;

    // A place to put work to be consumed, and a synchronisation object to protect it.
    vector<pair<int, int>> workQueue;
    mutex workMutex;

    // Put all of the ranges into a queue for the threads to consume.
    int chunkSize = max(limit / (nthreads*16), 10); //< Handwaving: 16 chunks per thread and a minimum of 10 seemed reasonable.
    for (int i = 0; i < limit; i += chunkSize) {
        workQueue.push_back(make_pair(i, min(limit, i + chunkSize) - 1)); //< inclusive ranges; the -1 avoids testing chunk boundaries twice
    }

    // Start the threads.
    vector<future<int>> futures;
    for (int i = 0; i < nthreads; ++i) {
        packaged_task<int()> task(bind(workConsumingPrime, ref(workQueue), ref(workMutex)));
        futures.push_back(task.get_future());
        thread(move(task)).detach();
    }

    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    // Sum up all the results.
    int primes = 0;
    for (future<int>& f : futures) {
        primes += f.get();
    }

    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
}
This is still not a perfect reproduction of how the OpenMP example behaves. For example, this is closer to OpenMP's static schedule, since chunks of work are a fixed size. Also, OpenMP does not use a work queue at all. So I may have lied a little bit -- call it a white lie, since I wanted to be more explicit about showing the work being split up. What it is likely doing behind the scenes is storing the iteration that the next thread should start at when it becomes available, plus a heuristic for the next chunk size.
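That heuristic is easy to approximate without a queue: a single shared atomic counter hands out chunk start indices (a sketch using the prime() function above; the names are mine):

#include <atomic>

std::atomic<int> nextStart{0};

// Each thread repeatedly grabs the next fixed-size chunk until the range is exhausted.
int consumeChunks(int chunkSize, int limit)
{
    int found = 0;
    for (;;) {
        int start = nextStart.fetch_add(chunkSize);
        if (start >= limit)
            return found;
        found += prime(start, min(limit, start + chunkSize) - 1);
    }
}

No mutex is needed; the fetch_add is the whole scheduler.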
Even with these differences, I'm able to max out all my CPUs for an extended period of time.
Looking to the future...
You probably noticed that the OpenMP version is a lot more readable. This is because it's meant to solve problems just like this. So, when we try to solve them without a library or compiler extension, we end up reinventing the wheel. Luckily, there is a lot of work being done to bring this sort of functionality directly into C++. Specifically, the Parallelism TS can help us out, if we can represent this as a standard C++ algorithm. Then we could tell the library to distribute the algorithm across all CPUs as it sees fit, so it does all the heavy lifting for us.
In C++11, with a little bit of help from Boost, this algorithm could be written as:
#include <iostream>
#include <iterator>
#include <algorithm>
#include <boost/range/irange.hpp>

using namespace std;

bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int i = 2; i < n; ++i) {
        if (n % i == 0)
            return false;
    }
    return true;
}

int main()
{
    auto range = boost::irange(0, 1000001);
    auto numPrimes = count_if(begin(range), end(range), isPrime);
    cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}
And to parallelise the algorithm, you just need to #include <execution_policy> and pass std::par as the first parameter to count_if.
auto numPrimes = count_if(par, begin(range), end(range), isPrime);
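In standard C++17, which later absorbed this proposal, the header is <execution> and the policy object is std::execution::par (with GCC's libstdc++, the parallel algorithms additionally require linking against TBB):

auto numPrimes = count_if(execution::par, begin(range), end(range), isPrime);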
And that's the kind of code that makes me happy to read.
Note: Absolutely no time was spent optimising this algorithm at all. If we were to do any optimisation, I'd look into something like the Sieve of Eratosthenes, which uses previous prime computations to help with future ones.
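For a taste of what that direction looks like, here is a minimal sieve sketch (an illustration, not code from any of the answers):

#include <vector>

int countPrimesBelow(int limit)
{
    std::vector<bool> composite(limit, false);
    int count = 0;
    for (int i = 2; i < limit; ++i) {
        if (!composite[i]) {
            ++count;                      // i is prime
            for (long long j = (long long)i * i; j < limit; j += i)
                composite[j] = true;      // cross out its multiples
        }
    }
    return count;
}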
First, you need to realize that OpenMP usually has a fairly sophisticated thread pool under the covers, so matching it (exactly) will probably be at least somewhat difficult.
Second, it seems to me that before optimizing the threading, you should start with at least a halfway decent basic algorithm. In this case, the algorithm you're implementing is pretty awful: it checks whether numbers are prime, but does a lot of work that accomplishes nothing useful.
It's checking whether even numbers are prime. Other than 2, they're not. Ever.
It's checking whether odd numbers are divisible by even numbers. Again, they're not. Ever.
It's checking whether numbers are divisible by numbers larger than their square root. If there's no divisor smaller than the square root, there can't be one larger than the square root either.
Although it probably doesn't affect speed, I also find it a lot easier to have a function that checks whether a single number is prime, and just returns true/false to indicate the result, than to have somewhat elaborate code to figure out whether a preceding loop ran to completion or exited early.
You can optimize the algorithm by eliminating more than that, but that much doesn't strike me as "optimization" nearly so much as simply avoiding completely unnecessary pessimization.
At least in my opinion, it's also a bit easier (in this case) to use std::async to launch the threads. This lets us return a value from our thread (the count we want back) pretty easily.
So, let's start by fixing prime based on those observations:
int prime(int a, int b)
{
    int count = 0;
    if (a < 3) {         // handle 1 and 2 explicitly: 2 is the only even prime
        count += (b >= 2);
        a = 3;
    }
    if (a % 2 == 0)      // start on an odd number
        ++a;
    auto check = [](int i) -> bool {
        for (int j = 3; j*j <= i; j += 2)
            if (i % j == 0)
                return false;
        return true;
    };
    for (; a <= b; a += 2) {
        if (check(a))
            ++count;
    }
    return count;
}
Now, let me point out that this is already enough faster (even single-threaded) that, if all we wanted was the roughly 4x speedup perfect thread-scaling would give us, we'd be done without using threading at all. For the limit you gave, this finishes in well under 1 second.
For the sake of argument, however, let's assume we want more, and want to make use of multiple cores too. One thing to realize here is that we generally want at least a few more threads than cores. The problem is fairly simple: with only one thread per core, nothing makes up for the fact that the load isn't distributed evenly between the threads; the thread processing the largest numbers has quite a bit more work to do than the thread processing the smallest numbers. So if we have (for example) a 4-core machine, as soon as one thread finishes, we can only use 75% of the CPU. Then when another thread finishes, it drops to 50%. Then 25%, and finally the last thread finishes, using only one core.
We could probably do some computation to attempt to distribute the load more evenly, but it's a lot easier to just split the load into, say, six or eight times as many threads as cores. This way the computation can continue using all the cores until only a few threads remain.
Putting all that into code, we can end up with something like this:
int main() {
    using namespace chrono;
    int limit = 50000000;
    unsigned int nthreads = 8 * thread::hardware_concurrency();

    cout << "\nComputing multi-threaded:\n";
    cout << "Number of threads: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    auto start2 = high_resolution_clock::now();
    vector<future<int>> threads;
    int intrvl = limit / nthreads;
    for (unsigned int i = 0; i < nthreads; i++)
        threads.emplace_back(std::async(std::launch::async, prime, i*intrvl + 1, (i + 1)*intrvl));
    int primes = 0;
    for (auto &t : threads)
        primes += t.get();
    auto end2 = high_resolution_clock::now();

    cout << "Primes: " << primes << ", Time: " << duration_cast<milliseconds>(end2 - start2).count() << "\n";
}
Note a couple of points:
This runs enough faster that I've increased the upper limit by a fairly large factor, so it'll run long enough that we can at least see it use 100% of the CPU time for a few seconds before it's done.
I've added some timing code to get a little more accurate idea of how long it runs for.
At least when I run this, it seems to act about as we'd expect/hope: it uses 100% of the CPU time until it gets very close to the end, when it starts to drop just before finishing (i.e., when we have fewer threads to execute than we have cores to execute them).
In case you wonder how OpenMP avoids this: it usually uses a thread pool, so some number of iterations of the loop is dispatched to the thread pool as a task. This lets it produce a large number of tasks without having a huge number of threads contending for CPU time simultaneously.
With the upper limit you used, it finished on my machine in about 90 milliseconds, which isn't long enough for it to even make a noticeable blip on the CPU usage graph.
The OpenMP example is using "reduction" on the sum variable primes, which means that each task sums its own local primes variable.
OpenMP adds the thread-local copies of primes together at the end of the parallel part to get the grand total.
That means it does not need to lock.
As @Sam says, a thread will be put to sleep if it cannot acquire the mutex lock.
So in your case, the threads will spend a fair amount of time asleep.
If you don't want to use OpenMP, try static std::atomic<int> primes{0}; then you don't need the mutex lock and unlock.
Or you could simulate OpenMP's reduction by using an array primes[numThreads], where thread i sums into primes[i], and then summing primes[] at the end.
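A sketch of both alternatives (the names are mine; the alignas padding assumes a 64-byte cache line and avoids the false sharing discussed earlier):

#include <atomic>
#include <vector>

// Option 1: lock-free shared counter. Each thread keeps a local tally and
// publishes it exactly once, so there is no per-prime contention.
std::atomic<int> primes{0};
// inside each thread:  primes.fetch_add(localCount, std::memory_order_relaxed);

// Option 2: simulated reduction. One slot per thread, padded to a cache line
// so the slots don't false-share; sum the slots after join().
struct alignas(64) PrimeSlot { int count = 0; };
// in main:  std::vector<PrimeSlot> perThread(numThreads);
//           thread i writes only perThread[i].count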
In a larger numerical computation, I have to perform the trivial task of summing up the products of the elements of two vectors. Since this task needs to be done very often, I tried to make use of the auto-vectorization capabilities of my compiler (VC2015). I introduced a temporary vector, in which the products are stored in a first loop, and then performed the summation in a second loop. Optimization was set to full and fast code was preferred. This way, the first loop got vectorized by the compiler (I know this from the compiler output).
The result was surprising. The vectorized code ran 3 times slower on my machine (core i5-4570 3.20 GHz) than the simple code. Could anybody explain why, and what might improve the performance? I've put both versions of the algorithm fragment into a minimal running example, which I used myself for testing:
#include "stdafx.h"
#include <vector>
#include <Windows.h>
#include <iostream>
using namespace std;
int main()
{
// Prepare timer
LARGE_INTEGER freq,c_start,c_stop;
QueryPerformanceFrequency(&freq);
int size = 20000000; // size of data
double v = 0;
// Some data vectors. The data inside doesn't matter
vector<double> vv(size);
vector<double> tt(size);
vector<float> dd(size);
// Put random values into the vectors
for (int i = 0; i < size; i++)
{
tt[i] = rand();
dd[i] = rand();
}
// The simple version of the algorithm fragment
QueryPerformanceCounter(&c_start); // start timer
for (int p = 0; p < size; p++)
{
v += tt[p] * dd[p];
}
QueryPerformanceCounter(&c_stop); // Stop timer
cout << "Simple version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
cout << v << endl; // We use v once. This avoids its calculation to be optimized away.
// The version that is auto-vectorized
for (int i = 0; i < size; i++)
{
tt[i] = rand();
dd[i] = rand();
}
v = 0;
QueryPerformanceCounter(&c_start); // start timer
for (int p = 0; p < size; p++) // This loop is vectorized according to compiler output
{
vv[p] = tt[p] * dd[p];
}
for (int p = 0; p < size; p++)
{
v += vv[p];
}
QueryPerformanceCounter(&c_stop); // Stop timer
cout << "Vectorized version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
cout << v << endl; // We use v once. This avoids its calculation to be optimized away.
cin.ignore();
return 0;
}
You added a large amount of work by storing the products in a temporary vector.
For such a simple computation on large data, the CPU time that you expect to save by vectorization doesn't matter. Only memory references matter.
You added memory references, so it runs slower.
I would have expected the compiler to optimize the original version of that loop. I doubt the optimization would affect the execution time (because it is dominated by memory access regardless), but it should be visible in the generated code. If you wanted to hand-optimize code like that, a temporary vector is always the wrong way to go. The right direction is the following (for simplicity, I assumed size is even):
double v1 = 0; // second accumulator, so the two adds can overlap
for (int p = 0; p < size; p += 2)
{
    v  += tt[p]   * dd[p];
    v1 += tt[p+1] * dd[p+1];
}
v += v1;
Note that your data is large enough and operation simple enough, that NO optimization should be able to improve on the simplest version. That includes my sample hand optimization. But I assume your test is not exactly representative of what you are really trying to do or understand. So with smaller data or a more complicated operation, the approach I showed may help.
Also notice my version relies on addition being associative. Over the real numbers, addition is associative; in floating point, it isn't. The answer is likely to differ by an amount too tiny for you to care about, but that is data dependent. If you have large values of opposite sign in adjacent positions canceling each other early in the original sequence, then by segregating the even and odd positions my "optimization" would totally destroy the answer. (Of course, the opposite can also be true. For example, if all the even positions were tiny and the odd ones included large values canceling each other, then the original sequence produced garbage and the changed sequence would be more correct.)
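A two-line demonstration of that order sensitivity (the magnitudes are chosen so the rounding is visible):

#include <iostream>
int main()
{
    double a = 1e16, b = -1e16, c = 1.0;
    std::cout << (a + b) + c << "\n"; // prints 1: a and b cancel exactly first
    std::cout << a + (b + c) << "\n"; // prints 0: c is lost when added to b, whose ulp is 2
}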
I'm new to C++ programming and am trying to write some sparse matrix and vector stuff as practice.
The sparse matrix is built from a vector of maps, where the vector accesses the rows and the map is used for the sparse entries in the columns.
What I was trying to do is fill a diagonally dominant sparse matrix with an equation system for a Poisson equation.
When filling the matrix in test cases, I was able to provoke the following very weird problem, which I have broken down to the essential operations:
#include <vector>
#include <iterator>
#include <iostream>
#include <map>
#include <ctime>

int main()
{
    unsigned int nDim = 100000;
    double clock1;

    // alternative: std::map<unsigned int, std::map<unsigned int, double> > mat;
    std::vector<std::map<unsigned int, double> > mat;
    mat.resize(nDim);

    // if clause and number set
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for (unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        for (unsigned int colIter = 0; colIter < nDim; colIter++)
        {
            if (rowIter == colIter)
            {
                mat[rowIter][colIter] = 1.;
            }
        }
    }
    std::cout << "time for diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    // if clause and number insert
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for (unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        for (unsigned int colIter = 0; colIter < nDim; colIter++)
        {
            if (rowIter == colIter)
            {
                mat[rowIter].insert(std::pair<unsigned int, double>(colIter, 1.));
            }
        }
    }
    std::cout << "time for insert diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    // only number set
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for (unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        mat[rowIter][rowIter] += 1.;
    }
    std::cout << "time for easy diagonal fill: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    // only if clause
    clock1 = double(clock())/CLOCKS_PER_SEC;
    for (unsigned int rowIter = 0; rowIter < nDim; rowIter++)
    {
        for (unsigned int colIter = 0; colIter < nDim; colIter++)
        {
            if (rowIter == colIter)
            {
            }
        }
    }
    std::cout << "time for if clause: " << 1e3 * (double(clock())/CLOCKS_PER_SEC - clock1) << " ms" << std::endl;

    return 0;
}
Running this in gcc (newest version, 4.8.1 I think) the following times appear:
time for diagonal fill: 26317ms
time for insert diagonal: 8783ms
time for easy diagonal fill: 10ms !!!!!!!
time for if clause: 0ms
I only kept the loop around the bare if clause to be sure that the loop itself is not responsible for the lack of speed.
Optimization level is O3, but the problem also appears on other levels.
So I thought let's try the Visual Studio (2012 Express).
It is a little bit faster, but still as slow as ketchup:
time for diagonal fill: 9408ms
time for insert diagonal: 8860ms
time for easy diagonal fill: 11ms !!!!!!!
time for if clause: 0ms
So MSVSC++ fails, too.
It will probably not even be necessary to use this combination of if clause and matrix fill, but if it is... I'm screwed.
Does anybody know where this huge performance gap is coming from and how I could deal with it?
Is it some optimization problem caused by the fact that the if clause is inside the loop? Do I maybe just need another compiler flag?
I would also be interested to know whether it occurs on other systems/compilers, too. I might run it on the Xeon E5 machine at work and see what that baby makes of this devilish piece of code :).
EDIT:
I ran it on the Xeon machine: Much faster, still slow.
Times with gcc:
2778ms
2684ms
1ms
0ms
The most obvious performance issue is the allocation within your maps. Each time you assign/insert a new item into a map, it has to allocate space for the node and rebalance the tree appropriately. Doing that thousands of times is bound to be slow.
It's also very significant that you're not clearing the maps after your first loop. That means your subsequent loops don't have to do as much work, so your performance comparisons are not equivalent.
Finally, the nested loops are obviously going to be doing vastly more iterations than your single loop: nDim times as many, i.e. 10^10 versus 10^5 for nDim = 100000. From a strict algorithm-analysis standpoint, the nested version may do the same amount of actual work on the data, but the program still has to run through all those extra iterations because that's what you've told it to do. The compiler can only optimise them out if there is literally nothing being processed/modified in the loop body.
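To level the playing field between the timed sections, the rows should be emptied before each one, e.g. (a sketch):

// before each timed section:
for (auto& row : mat)
    row.clear(); // drop all nodes so every loop pays the full allocation cost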
In the first loop, the runtime system is doing loads of memory allocation, so it takes a lot of time on memory management.
The other loops don't have that overhead: you didn't release the allocations done by the first loop, so they don't have to repeat the memory allocation, and they don't take anywhere near as long.
The last loop is optimized out by the compiler; it has no side effects, so it doesn't get included in the program.
Morals:
Memory allocation has a cost.
Benchmarking is hard.
This might be some weird Linux quirk, but I'm observing very strange behavior.
The following code compares a synchronous version of summing numbers with an async version. The thing is that I'm seeing a performance increase (it's not caching; it happens even when I split the code into two separate programs), while still observing the program as single-threaded (only one core is used).
strace does show some thread activity, but monitoring tools like top (and its clones) still show only one core in use.
The second problem I'm observing is that if I increase the spawn ratio, the memory usage explodes. What is the memory overhead of a thread? With 5000 threads I get ~10GB of memory usage.
#include <iostream>
#include <random>
#include <chrono>
#include <vector>
#include <future>

using namespace std;

long long sum2(const vector<int>& v, size_t from, size_t to)
{
    const size_t boundary = 5*1000*1000;
    if (to - from <= boundary)
    {
        long long rsum = 0;
        for (; from < to; from++)
        {
            rsum += v[from];
        }
        return rsum;
    }
    else
    {
        size_t mid = from + (to - from)/2;
        auto s2 = async(launch::async, sum2, cref(v), mid, to);
        long long rsum = sum2(v, from, mid);
        rsum += s2.get();
        return rsum;
    }
}

long long sum2(const vector<int>& v)
{
    return sum2(v, 0, v.size());
}

long long sum(const vector<int>& v)
{
    long long rsum = 0;
    for (auto i : v)
    {
        rsum += i;
    }
    return rsum;
}

int main()
{
    const size_t vsize = 100*1000*1000;
    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
    uniform_int_distribution<uint32_t> dist(0, 10);
    for (size_t i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    auto start = chrono::high_resolution_clock::now();
    long long suma = sum(x);
    auto end = chrono::high_resolution_clock::now();
    cout << "Sum is " << suma << endl;
    cout << "Duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;

    start = chrono::high_resolution_clock::now();
    suma = sum2(x);
    end = chrono::high_resolution_clock::now();
    cout << "Async sum is " << suma << endl;
    cout << "Async duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;

    return 0;
}
Maybe you observe one core being used because the overlap between threads doing work simultaneously is too short to be noticeable. Summing 5 million values from a contiguous area of memory should be very fast on modern hardware, so by the time the parent finishes summing, the child may have barely started, and the parent may then spend most or all of its time waiting for the result from the child. Have you tried increasing the work unit to see if the overlap becomes noticeable?
Regarding the increased performance: even if there is zero overlap between threads because the work unit is too small, the multithreaded version can still benefit from the additional L1 cache memory. For such a test, memory will likely be the bottleneck, and the sequential version will use only one L1 cache while the multithreaded version will use as many as there are cores.
Have you checked the times that are being printed? On my machine, the serial time is under 1s at -O2, whilst the parallel sum time is several times faster. It's therefore entirely possible that the CPU usage is not enough for long enough for things like "top" to register, since they typically only refresh once per second.
If you increase the number of threads by reducing the count per thread, then you effectively increase the overhead of the thread management. If you have 5000 threads active, then your task will take 5000 × min-thread-stack-size in additional memory. On my machine that's 20 GB!
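As a rough sanity check: with a 4 MiB per-thread stack (defaults vary by platform; ulimit -s reports it on Linux), 5000 threads reserve about 5000 × 4 MiB ≈ 20 GiB of address space, which matches the figure above.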
Why don't you try increasing the size of the source container? If you make the parallel section take long enough, you'll see the corresponding parallel CPU usage. However, be prepared: summing integers is fast, and the time taken generating the random numbers can take an order of magnitude or two longer than the time to add the numbers together.