Multi-threaded GEMM slower than single threaded one? - c++

I wrote some naive GEMM code and I am wondering why it is much slower than the equivalent single-threaded GEMM code.
With a 200x200 matrix: single-threaded 7 ms, multi-threaded 108 ms. CPU: 3930K, 12 threads in the thread pool.
template <unsigned M, unsigned N, unsigned P, typename T>
static Matrix<M, P, T> multiply( const Matrix<M, N, T> &lhs, const Matrix<N, P, T> &rhs, ThreadPool & pool )
{
    Matrix<M, P, T> result = {0};
    Task<void> task(pool);
    for (auto i = 0u; i < M; ++i)
        for (auto j = 0u; j < P; ++j)
            task.async([&result, &lhs, &rhs, i, j]() {
                T sum = 0;
                for (auto k = 0u; k < N; ++k)
                    sum += lhs[i * N + k] * rhs[k * P + j];
                result[i * P + j] = sum;   // row stride of the M x P result is P
            });
    task.wait();
    return std::move(result);
}

I do not have experience with GEMM, but your problem seems to be related to issues that appear in all kinds of multi-threading scenarios.
When using multi-threading, you introduce a couple of potential overheads, the most common of which usually are:
1. creation/cleanup of starting/ending threads
2. context switches when (number of threads) > (number of CPU cores)
3. locking of resources, waiting to obtain a lock
4. cache synchronization issues
Items 2. and 3. probably don't play a role in your example: you are using 12 threads on 12 (hyperthreading) cores, and your algorithm does not involve locks.
However, 1. might be relevant in your case: you are creating a total of 40,000 tiny tasks, each of which multiplies and adds just 200 values. I'd suggest trying less fine-grained threading, maybe splitting the work only after the first loop. It's always a good idea not to split the problem into pieces smaller than necessary.
Also, 4. will very likely be important in your case. While you're not running into a race condition when writing the results to the array (because every thread writes to its own index position), you are very likely to provoke a large overhead of cache syncs.
"Why?" you might think, because you're writing to different places in memory. That's because a typical CPU cache is organized in cache lines, which on the current Intel and AMD CPU models are 64 bytes long. This is the smallest size that can be used for transfers from and to the cache, when something is changed. Now that all CPU cores are reading and writing to adjacent memory words, this leads to synchronization of 64 bytes between all the cores whenever you are writing just 4 bytes (or 8, depending on the size of the data type you're using).
If memory is not an issue, you can simply "pad" every output array element with "dummy" data so that there is only one output element per cache line. If you're using 4byte data types, this would mean to skip 15 array elements for each 1 real data element. The cache issues will also improve when you make your threading less fine-grained, because every thread will access its own continuous region in memory practically without interfering with other threads' memory.
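For illustration, a less fine-grained version might look roughly like this (a sketch reusing the question's Matrix, Task and ThreadPool types): one task per output row, so each task writes a contiguous run of P elements and tasks don't share cache lines except at row boundaries.
for (auto i = 0u; i < M; ++i)
    task.async([&result, &lhs, &rhs, i]() {
        for (auto j = 0u; j < P; ++j) {
            T sum = 0;
            for (auto k = 0u; k < N; ++k)
                sum += lhs[i * N + k] * rhs[k * P + j];
            result[i * P + j] = sum;   // one task owns the whole row i
        }
    });
task.wait();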
Edit: A more detailed description by Herb Sutter (one of the Gurus of C++) can be found here: http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273
Edit2: BTW, it's best to avoid std::move in the return statement, as it can get in the way of return-value optimization and the copy-elision rules, which the standard now demands happen automatically. See Is returning with `std::move` sensible in the case of multiple return statements?

Multi-threading always means synchronization, context switching, and function-call overhead. This all adds up and costs CPU cycles you could spend on the main task itself.
If you just have a third nested loop, you save all these steps and can do the computation inline instead of in a subroutine, where you must set up a stack, call into it, switch to a different thread, return the result and switch back to the main thread.
Multi-threading is only useful if these costs are small compared to the main task. I guess you will see better results with multi-threading when the matrix is larger than just 200x200.

In general, multi-threading is well suited to tasks that take a lot of time, most favourably because of computational complexity rather than device access. The loop you showed us takes too little time to execute for it to be effectively parallelized.
You have to remember that there is a lot of overhead in thread creation. There is also some (but significantly less) overhead in synchronization.


race condition using OpenMP atomic capture operation for 3D histogram of particles and making an index

I have this piece of code in my full program:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
    int ix,jy,kz;
    int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    int ic=GetIndex(particles[i]);
    #pragma omp atomic update
    Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    if(particles[i].ip==3)
    {
        int ic=GetIndex(particles[i]);
        if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
        int cnt=0;
        #pragma omp atomic capture
        cnt=Counter[ic]++;
        MIndex[Begin[ic]+cnt]=i;
    }
}
If I remove
#pragma omp parallel for
the code works properly and the output results are always the same.
But with this pragma there is some undefined behaviour/race condition in the code, because each time it gives different output results.
How can I fix this issue?
Update: The task is the following. There are lots of particles with some random coordinates. I need to output to the array MIndex the indices in the array particles of the particles that are in each cell (a Cartesian cube, for example 1×1×1 cm) of the coordinate system. So, at the beginning of MIndex there should be the indices in the array particles of the particles in the 1st cell of the coordinate system, then those in the 2nd, then the 3rd and so on. The order of indices within a given cell in the array MIndex is not important and may be arbitrary. If possible, I need to do this in parallel, maybe using atomic operations.
There is a straightforward way: traverse all the coordinate cells in parallel and, in each cell, check the coordinates of all the particles. But for a large number of cells and particles this seems to be slow. Is there a faster approach? Is it possible to traverse the particles array only once in parallel and fill the MIndex array using atomic operations, something like in the code piece above?
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Counter[ic] would go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
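A sketch of what that manually-threaded first pass could look like, using OpenMP only to get thread IDs, with Length laid out as [nthreads][cub3] and the ip==3 filter applied here (as Laci suggested):
#include <omp.h>
// Pass 1: per-thread histogram, each thread owning a fixed range of i values.
#pragma omp parallel
{
    int t        = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    long long lo = (long long)N *  t      / nthreads;
    long long hi = (long long)N * (t + 1) / nthreads;
    for (long long i = lo; i < hi; ++i)
        if (particles[i].ip == 3)                       // same filter as the final loop
            Length[t][GetIndex(particles[i])]++;        // non-atomic: only row t is touched
}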
Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
for (int ic = 0; ic < cub3; ic++) {
    // perhaps do this "vertical" sum into a temporary array,
    // or prefix-sum within Length before combining across threads?
    int pos = sum(Length[0..nthreads-1][ic-1]) + Begin[0][ic-1];
    Begin[0][ic] = pos;
    for (int t = 1; t < nthreads; t++) {
        pos += Length[t-1][ic];   // exclusive prefix-sum across threads for this cube bucket
        Begin[t][ic] = pos;       // where thread t's entries for this bucket start
    }
}
This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 16 KiB apart from each other. (So 4k aliasing is a possible problem, as are cache conflict misses).
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Counter[], but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
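In code, the final pass for one thread might look roughly like this (same [lo, hi) range of i values as in the first-pass sketch above, and Length built with the ip==3 filter):
// Pass 2: each store goes to a slot no other thread can compute, so no atomics.
for (long long i = lo; i < hi; ++i)
{
    if (particles[i].ip != 3) continue;
    int ic  = GetIndex(particles[i]);
    int pos = --Length[t][ic];        // counts this thread's bucket back down to zero
    MIndex[Begin[t][ic] + pos] = i;   // Begin[t][ic] = start of thread t's segment of bucket ic
}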
Don't overdo it with the number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with the number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 16 KiB (too small to be worth parallelizing the zeroing of, BTW), it's really not that much.
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for Mindex[] stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough that cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem here, too.)
Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
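A sketch of that idea with a fixed partition by p.kz (using OpenMP only for the thread IDs; the slice bounds are just an even split, which assumes reasonably uniform points):
#include <omp.h>
// Every thread scans all particles, but only processes the ones whose kz falls
// in its own slice, so its Counter[] buckets are touched by no other thread.
#pragma omp parallel
{
    int t        = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int kz_lo = -cuba + ( t      * cubn) / nthreads;   // kz spans [-cuba, cuba)
    int kz_hi = -cuba + ((t + 1) * cubn) / nthreads;
    for (int i = 0; i < N; ++i) {
        if (particles[i].ip != 3) continue;
        int kz = particles[i].kz;
        if (kz < kz_lo || kz >= kz_hi) continue;        // another thread's slice
        int ic  = GetIndex(particles[i]);
        int cnt = Counter[ic]++;                        // non-atomic: bucket owned by this thread
        MIndex[Begin[ic] + cnt] = i;
    }
}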
Re: your edit
You already loop over the whole array of points doing Length[ic]++;, so it seems redundant to do the same histogramming work again with Counter[ic]++;, but it's not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, cub3 is 16*16*16 = 4096 ints, i.e. 16 KiB per array, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native since it's integer work.
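The vertical add at the end is then just a couple of nested loops over the small arrays, e.g. accumulating into thread 0's counts (sketch):
for (int t = 1; t < nthreads; ++t) {
    #pragma omp simd
    for (int ic = 0; ic < cub3; ++ic)
        Length[0][ic] += Length[t][ic];   // combine per-thread histograms
}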
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i], and another counting up from 0 in Counter[i]++. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top
You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i;. Different iterations can write into the same location here, unless you have a mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speedup is probably going to be abysmal.
EDIT: the second line, however, is not of the right form for an atomic operation, so you have to make it critical. Which is going to make performance even worse.
Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this cannot be parallelized.

Missing small primes in C++ atomic prime sieve

I try to develop a concurrent prime sieve implementation using C++ atomics. However, when core_count is increased, more and more small primes are missing from the output.
My guess is that the producer threads overwrite each other's results before they are read by the consumer, even though the construction should protect against this by using the magic number 0 to indicate that it's ready to accept the next prime. It seems compare_exchange_weak is not really atomic in this case.
Things I've tried:
Replacing compare_exchange_weak with compare_exchange_strong
Changing the memory_order to anything else.
Swapping around the 'crossing-out' and the write.
I have tested it with Microsoft Visual Studio 2019, Clang 12.0.1 and GCC 11.1.0, but to no avail.
Any ideas on this are welcome, including some best practices I might have missed.
#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <thread>
#include <vector>

int main() {
    using namespace std;
    constexpr memory_order order = memory_order_relaxed;
    atomic<int> output{0};
    vector<atomic_bool> sieve(10000);
    for (auto& each : sieve) atomic_init(&each, false);
    atomic<unsigned> finished_worker_count{0};
    auto const worker = [&output, &sieve, &finished_worker_count]() {
        for (auto current = next(sieve.begin(), 2); current != sieve.end();) {
            current = find_if(current, sieve.end(), [](atomic_bool& value) {
                bool untrue = false;
                return value.compare_exchange_strong(untrue, true, order);
            });
            if (current == sieve.end()) break;
            int const claimed = static_cast<int>(distance(sieve.begin(), current));
            int zero = 0;
            while (!output.compare_exchange_weak(zero, claimed, order))
                ;
            for (auto product = 2 * claimed; product < static_cast<int>(sieve.size());
                 product += claimed)
                sieve[product].store(true, order);
        }
        finished_worker_count.fetch_add(1, order);
    };
    const auto core_count = thread::hardware_concurrency();
    vector<future<void>> futures;
    futures.reserve(core_count);
    generate_n(back_inserter(futures), core_count,
               [&worker]() { return async(worker); });
    vector<int> result;
    while (finished_worker_count < core_count) {
        auto current = output.exchange(0, order);
        if (current > 0) result.push_back(current);
    }
    sort(result.begin(), result.end());
    for (auto each : result) cout << each << " ";
    cout << '\n';
    return 0;
}
compare_exchange_weak will update (change) the "expected" value (the local variable zero) if the update cannot be made. This will allow overwriting one prime number with another if the main thread doesn't quickly handle the first prime.
You'll want to reset zero back to zero before rechecking:
while (!output.compare_exchange_weak(zero, claimed, order))
    zero = 0;
Even with correctness bugs fixed, I think this approach is going to be low performance with multiple threads writing to the same cache lines.
As 1201ProgramAlarm points out in their answer, the CAS retry loop needs to reset zero, but I wouldn't expect good performance anyway! Having multiple threads storing to the same cache lines will create big stalls. I'd normally write it as follows, so you only need to write the zero = 0 once, although it still happens before every CAS.
do {
    zero = 0;
} while (!output.compare_exchange_weak(zero, claimed, order));
Caleth pointed out in comments that it's also Undefined Behaviour for a predicate to modify the element (like in your find_if). That's almost certainly not a problem in practice in this case; find_if is just written in C++ in a header (in mainstream implementations) and likely in a way that there isn't actually any UB in the resulting program.
And it would be straightforward to replace the find_if with a loop. In fact probably making the code more readable, since you can just use array indexing the whole time instead of iterators; let the compiler optimize that to a pointer and then pointer-subtraction if it wants.
Scan read-only until you find a candidate to try to claim, don't try to atomic-RMW every true element until you get to a false one. Especially on x86-64, lock cmpxchg is way slower than read-only access to a few contiguous bytes. It's a full memory barrier; there's no way to do an atomic RMW on x86 that isn't seq_cst.
You might still lose the race, so you do still need to try to claim it with an RMW and keep looping on failure. And CAS is a good choice for that.
Correctness seems plausible with this strategy, but I'd avoid it for performance reasons.
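For illustration, the scan-then-claim loop might look roughly like this (a sketch reusing the question's sieve and order; publishing claimed and crossing off its multiples stay as in the original):
for (size_t i = 2; i < sieve.size(); ++i) {
    if (sieve[i].load(order))                 // read-only scan: no lock prefix, cheap
        continue;                             // already claimed or crossed off
    bool expected = false;
    if (!sieve[i].compare_exchange_strong(expected, true, order))
        continue;                             // lost the race to another thread
    int const claimed = static_cast<int>(i);
    // ... publish `claimed` to the consumer and cross off its multiples as before ...
}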
Multiple threads storing to the array will cause contention
Expect cache lines to be bouncing around between cores, with most RFOs (MESI Read For Ownership) having to wait to get the data for a cache line from another core that had it in Modified state. A core can't modify a cache line until after it gets exclusive ownership of that cache line. (Usually 64 bytes on modern systems.)
Your sieve size is only 10k bools, so 10 kB, comfortably fitting into L1d cache on modern CPUs. So a single-threaded implementation would get all L1d cache hits when looping over it (in the same thread that just initialized it all to zero).
But with other threads writing the array, at best you'll get hits in L3 cache. But since the sieve size is small, other threads won't be evicting their copies from their own L1d caches, so the RFO (read for ownership) from a core that wants to write will typically find that some other core has it Modified, so the L3 cache (or other tag directory) will have to send on a request to that core to write back or transfer directly. (Intel CPUs from Nehalem onwards use Inclusive L3 cache where the tags also keep track of which cores have the data. They changed that for server chips with Skylake and later, but client CPUs still I think have inclusive L3 cache where the tags also work as a snoop filter / coherence directory.)
With 1 whole byte per bool, and not even factoring out multiples of 2 from your sieve, crossing off multiples of a prime is very high bandwidth. For primes between 32 and 64, you touch every cache line 1 to 2 times. (And you only start at prime*2, not prime*prime, so even for large strides, you still start very near the bottom of the array and touch most of it.)
A single-threaded sieve can use most of L3 cache bandwidth, or saturate DRAM, on a large sieve, even using a bitmap instead of 1 bool per byte. (I made some benchmarks of a hand-written x86 asm version that used a bitmap version in comments on a Codereview Q&A; https://godbolt.org/z/nh39TWxxb has perf stat results in comments on a Skylake i7-6700k with DDR4-2666. My implementation also has some algorithmic improvements, like not storing the even numbers, and starting the crossing off at i*i).
Although to be fair, L3 bandwidth scales with number of cores, especially if different pairs are bouncing data between each other, like A reading lines recently written by B, and B reading lines recently written by C. Unlike with DRAM where the shared bus is the bottleneck, unless per-core bandwidth limits are lower. (Modern server chips need multiple cores to saturate their DRAM controllers, but modern client chips can nearly max out DRAM with one thread active).
You'd have to benchmark to see whether all the threads pile up in a bad way or not, like if they tend to end up close together, or if one with a larger stride can pull ahead and get some distance so its write-prefetches don't create extra contention.
The cache-miss delays in committing the store to cache can be hidden some by the store buffer and out-of-order exec (especially since it's relaxed, not seq_cst), but it's still not good.
(Using a bitmap with 8 bools per byte would require atomic RMWs for this threading strategy, which would be a performance disaster. If you're going to thread this way, 1 bool per byte is by far the least bad.)
At least if you aren't reading part of the array that's still being written, you might not be getting memory-order mis-speculation on x86. (x86's memory model disallows LoadStore and LoadLoad reordering, but real implementations speculatively load early, and have to roll back if the value they loaded has been invalidated by the time the load is architecturally allowed to happen.)
Better strategy: each thread owns a chunk of the sieve
Probably much better would be segmenting regions and handing out lists of primes to cross off, with each thread marking off multiples of primes in its own region of the output array. So each cache line of the sieve is only touched by one thread, and each thread only touches a subset of the sieve array. (A good chunk size would be half to 3/4 of the L1d or L2 cache size of a core.)
You might start with a small single-threaded sieve, or a fixed list of the first 10 or 20 primes to get the threads started, and have the thread that owns the starting chunk generate more primes. Perhaps appending them to an array of primes, and updating a valid-index (with a release store so readers can keep reading in that array up to that point, then spin-wait or use C++20 .wait() for a value other than what they last saw. But .wait would need a .notify in the writer, like maybe every 10 primes?)
If you want to move on in a larger sieve, divide up the next chunk of the full array between threads and have them each cross off the known primes from the first part of the array. No thread has to wait for any other, the first set of work already contains all the primes that need to be crossed off from an equal-sized chunk of the sieve.
Probably you don't want an actual array of atomic_int; probably all threads should be scanning the sieve to find not-yet-crossed-off positions. Especially if you can do that efficiently with SIMD, or with bit-scan like tzcnt if you use packed bitmaps for this.
(I assume there are some clever algorithmic ideas for segmented sieves; this is just what I came up with off the top of my head.)
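As a rough sketch of the per-chunk crossing-off (hypothetical names; each thread calls this on its own [lo, hi) slice of a byte-per-flag sieve, so no cache line is written by more than one thread):
#include <algorithm>
#include <vector>

// Cross off multiples of the known base primes within one thread's slice.
void cross_off_chunk(std::vector<char>& sieve,            // 1 byte per flag
                     const std::vector<int>& base_primes, // primes found so far
                     size_t lo, size_t hi)                // this thread's slice
{
    for (int p : base_primes) {
        size_t p_squared = size_t(p) * p;
        size_t first     = std::max(p_squared, (lo + p - 1) / p * p); // first multiple >= lo
        for (size_t m = first; m < hi; m += p)
            sieve[m] = 1;                                 // mark composite
    }
}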

OpenCL data parallel summation into a variable

Is it possible to use an OpenCL data-parallel kernel to sum a vector of size N, without doing the partial-sum trick?
Say you have access to 16 work items and your vector is of size 16. Wouldn't it be possible to just have a kernel do the following:
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    sum[0] += input[idx];
}
When I've tried this, the sum variable doesn't get updated, but only overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above; it does update the variable somehow, but it doesn't reproduce the correct sum.
Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands to a read of sum[0], then adds input[idx] to that before writing the result back to sum[0].
There will be a data race, as multiple threads are reading from and writing to the same shared memory location. So what might happen is:
All threads may read the value of sum[0] before any other thread writes their updated result back to sum[0], in which case the final result of sum[0] would be the value of input[idx] of the thread which executed slowest. Since this will be different each time, if you run the example multiple times you should see different results.
Or, one thread may execute slightly more slowly, in which case another thread may have already written an updated result back to sum[0] before this slow thread reads sum[0]; then there will be an addition using the values of more than one thread, but not all threads.
So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (while still incurring the costs of parallelisation, such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since you can use the parallelism most effectively, and can give O(log(N)) performance. Here is a good overview of reduction using OpenCL : Reduction Example.
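For reference, a work-group reduction kernel typically looks something like the sketch below (names are illustrative, and it assumes the work-group size is a power of two); each work-group writes one partial sum, and the host or a second kernel pass then sums the per-group results:
// Sketch of a local-memory tree reduction.
__kernel void reduce_sum(__global const float* input,
                         __global float* partial_sums,
                         __local float* scratch,
                         unsigned int n)
{
    unsigned int gid = get_global_id(0);
    unsigned int lid = get_local_id(0);
    scratch[lid] = (gid < n) ? input[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}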
Option 3 (and worst of all)
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    for(int j=0;j<N;j++)
    {
        barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
        if(idx==j)
            sum[0] += input[idx];
        else
            doOtherWorkWhileSingleCoreSums();
    }
}
Using a mainstream GPU, this would sum all of the elements about as slowly as a Pentium MMX. It is just like computing on a single core and giving the other cores other jobs, but in a slower way.
A CPU device could be better than a GPU for this kind of task.

How to set number of threads in C++

I have written the following multi-threaded program for multi-threaded sorting using std::sort. In my program grainSize is a parameter. Since grainSize (or the number of threads that can be spawned) is a system-dependent feature, I cannot work out what the optimal value to set grainSize to should be. I work on Linux.
int compare(const char*, const char*)
{
    //some complex user defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting, so that when sorting large vectors I can take advantage of the multiple cores present in today's processors. If anyone is aware of an efficient algorithm then please do share.
I don't know why the data returned by multThreadedSort() is not sorted; if you see some logical error in it, then please let me know about it.
This gives you the optimal number of threads (such as the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective number of threads is not equal to grainSize: it will depend on the list size, and will potentially be much more than grainSize.
Just replace grainSize by:
unsigned int grainSize = std::max<size_t>(items.size() / nThreads, 40);
The 40 is arbitrary but is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting a thread would be larger than the time spent sorting those few items). It may be optimized by trial and error, and is potentially larger than 40.
You have at least a bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
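Putting the two suggestions together at the call site might look like this (a sketch using the question's names; it needs <algorithm> and <thread>):
unsigned int nThreads = std::max(1u, std::thread::hardware_concurrency());
int grainSize = static_cast<int>(std::max<size_t>(items.size() / nThreads, 40));
multThreadedSort(items.begin(), static_cast<int>(items.size()), grainSize);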
Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::future should already do the job for you, without having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "tens of thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).
Ideally, you cannot assume more than 2*n threads running in your system, where n is the number of CPU cores.
Modern CPUs use hyper-threading, so one physical core can run two threads at a time.
As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency().

OpenMP and optimising vector operations

I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up and I've noticed that the slowest operation I have is my VecAdd function (it gets called thousands of times on a vector around 6000 elements wide).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
    for( int idx = 0; idx < num; idx++ )
    {
        pOut[idx] = pIn1[idx] + pIn2[idx];
    }
    return true;
}
It's a very simple loop but all the additions can be performed in parallel. My first optimisation option is to move over to using SIMD, as I can easily get nearly a 4x speed-up doing this.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However it really runs slowly. With the loop straight it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop I was assuming it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts or advice as to whether I can thread this nicely with OpenMP will be much appreciated!
Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but only operates on 6000 floats. Parallelize at the higher level, so that each thread is responsible for thousands/4 of the calls to VecAdd. Right now you have this algorithm:
serial execution
(re) start threads
do short computation
synchronize threads (at the end of the for loop)
back to serial code
Change it so that it's parallel at the highest possible level.
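For illustration, a sketch of parallelizing one level up, assuming a hypothetical call site where the thousands of independent additions are kept in arrays of pointers (numCalls, pOutputs, pInputsA and pInputsB are made-up names):
// One parallel region; each thread handles a block of whole VecAdd calls.
#pragma omp parallel for
for (int call = 0; call < numCalls; ++call)
{
    VecAdd(pOutputs[call], pInputsA[call], pInputsB[call], 6000);
}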
Memory bandwidth of course matters, but there is no way it would result in slower than serial execution.