C++11 std::async with an unknown number of available cores

My C++ code evaluates very large integrals on timeseries data (t2 >> t1). The integrals are fixed length and currently stored in an [m x 2] column array of doubles. Column 1 is time; column 2 is the signal being integrated. The code runs on a quad-core or 8-core machine.
For a machine with k cores, I want to:
Spin off k-1 worker threads (one for each of the remaining cores) to evaluate portions of the integral (trapezoidal integrations) and return their results to the waiting master thread.
Achieve the above without deep-copying portions of the original array.
Use the C++11 std::async facility for portability.
How can I achieve the above without hardcoding the number of available cores?
I am currently using VS 2012.
Update for Clarity:
For example, here's the rough pseudo-code:
data is [100000,2] double
result = MyIntegrator(data[1:50000,1:2]) + MyIntegrator(data[50001:100000, 1:2]);
I need the MyIntegrator() functions to be evaluated in separate threads. The master thread waits for the two results.
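For illustration, the pseudo-code above could look roughly like this with std::async; the two halves are described by pointers into the original [m x 2] array (column 0 = time, column 1 = signal), so nothing is deep-copied. The MyIntegrator body below is just a stand-in trapezoidal routine, not my actual implementation:
#include <cstddef>
#include <future>
// Stand-in trapezoidal integration over rows [0, m) of an [m x 2] array.
double MyIntegrator(const double (*rows)[2], std::size_t m)
{
    double total = 0.0;
    for (std::size_t i = 1; i < m; ++i)
        total += 0.5 * (rows[i][1] + rows[i - 1][1]) * (rows[i][0] - rows[i - 1][0]);
    return total;
}
double IntegrateInTwo(const double (*data)[2], std::size_t m)
{
    std::size_t half = m / 2 + 1;                // overlap one row so no interval is lost
    auto upper = std::async(std::launch::async, MyIntegrator, data + half - 1, m - half + 1);
    double lower = MyIntegrator(data, half);     // master thread integrates its own half
    return lower + upper.get();
}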

Here is source that does a multi-threaded integration of the problem.
#include <vector>
#include <memory>
#include <future>
#include <iterator>
#include <iostream>
struct sample {
    double duration;
    double value;
};
typedef std::pair<sample*, sample*> data_range;
sample* begin( data_range const& r ) { return r.first; }
sample* end( data_range const& r ) { return r.second; }
typedef std::unique_ptr< std::future< double > > todo_item;
double integrate( data_range r ) {
    double total = 0.;
    for( auto&& s:r ) {
        total += s.duration * s.value;
    }
    return total;
}
todo_item threaded_integration( data_range r ) {
    return todo_item( new std::future<double>( std::async( integrate, r )) );
}
double integrate_over_threads( data_range r, std::size_t threads ) {
    if (threads > std::size_t(r.second-r.first))
        threads = r.second-r.first;
    if (threads == 0)
        threads = 1;
    sample* begin = r.first;
    sample* end = r.second;
    std::vector< std::unique_ptr< std::future< double > > > todo_list;
    sample* highwater = begin;
    while (highwater != end) {
        sample* new_highwater = (end-highwater)/threads+highwater;
        --threads;
        todo_item item = threaded_integration( data_range(highwater, new_highwater) );
        todo_list.push_back( std::move(item) );
        highwater = new_highwater;
    }
    double total = 0.;
    for (auto&& item: todo_list) {
        total += item->get();
    }
    return total;
}
sample data[5] = {
    {1., 1.},
    {1., 2.},
    {1., 3.},
    {1., 4.},
    {1., 5.},
};
int main() {
    using std::begin; using std::end;
    double result = integrate_over_threads( data_range( begin(data), end(data) ), 2 );
    std::cout << result << "\n";
}
It requires some modification to read data in exactly the format you specified.
But you can call it with std::thread::hardware_concurrency() as the number of threads, and it should work.
(In particular, to keep it simple, I have pairs of (duration, value) rather than (time, value), but that is just a minor detail).
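For instance, a minimal variant of the main() above that takes the thread count from the runtime instead of hardcoding 2 (this also needs <thread>; the fallback value of 2 is an arbitrary choice):
int main() {
    using std::begin; using std::end;
    std::size_t threads = std::thread::hardware_concurrency();
    if (threads == 0) threads = 2;   // the hint may legitimately be zero
    double result = integrate_over_threads( data_range( begin(data), end(data) ), threads );
    std::cout << result << "\n";
}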

What about std::thread::hardware_concurrency()?

To get the number of cores available, you can usually use std::thread::hardware_concurrency().
Returns the number of concurrent threads supported by the implementation. The value should be considered only a hint.
If this returns zero, you can fall back to OS-specific queries.
This seems to be a good way to find out the number of cores.
You'll still need to do testing to determine whether multithreading will even give you tangible benefits; remember not to optimize prematurely :)

You could oversubscribe and see if it hurts your performance. Split your array into small fixed-length intervals (each computable within one scheduling quantum, ideally fitting in one cache page) and see how that compares in performance with splitting according to the number of CPUs.
Use std::packaged_task and pass it to a thread to make sure that you're not hurt by "launch" configuration.
Next step would be introducing thread pool, but that's more complicated.
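As a minimal sketch of the packaged_task idea, reusing the integrate() function and data_range type from the answer above (an assumption about where this code would live): unlike std::async with the default launch policy, an explicit std::thread always runs the task concurrently, so the "launch" configuration cannot defer it.
#include <future>
#include <thread>
double integrate_on_worker( data_range whole ) {
    sample* mid = whole.first + ( whole.second - whole.first ) / 2;
    std::packaged_task<double( data_range )> task( integrate );
    std::future<double> first_half = task.get_future();
    std::thread worker( std::move( task ), data_range( whole.first, mid ) );
    double second_half = integrate( data_range( mid, whole.second ) ); // master does its share
    worker.join();
    return first_half.get() + second_half;
}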

You could accept a command-line parameter for the number of worker threads.
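For instance, a minimal sketch of that approach (the names are illustrative, not from the question):
#include <string>
#include <thread>
int main(int argc, char* argv[]) {
    // Take the worker count from argv if given, otherwise fall back to the runtime hint.
    unsigned workers = (argc > 1) ? static_cast<unsigned>(std::stoul(argv[1]))
                                  : std::thread::hardware_concurrency();
    if (workers == 0) workers = 1;
    // ... pass `workers` to the integration routine ...
    return 0;
}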

Related

Windows threading synchronization performance issue

I have a threading issue under Windows.
I am developing a program that runs complex physical simulations for different conditions. Say one condition per hour of the year: that would be 8760 simulations. I am grouping those simulations per thread such that each thread runs a for loop of 273 simulations (on average).
I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux, all the threads sit between 98% and 100% usage, while under Windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows executions.
I did my best to make the objects immutable to avoid issues, and that provided a big gain on my old 8-thread Intel i7. However, with many more threads, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about Windows threads that produces this behaviour?
The custom parallel for code:
/**
* parallel for
* @tparam Index integer type
* @tparam Callable function type
* @param start start index of the loop
* @param end final +1 index of the loop
* @param func function to evaluate
* @param nb_threads number of threads; if zero, it is determined automatically
*/
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (Index k1, Index k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory accesses a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt it to your example code, you could replace b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I have had very good experience with is mimalloc (there are others such as Hoard or the one from TBB). You do not need to modify your source code; you just need to link against the library as described in its documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements sit next to each other in memory, allowing more efficient access (see the sketch after these notes).
But in general I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX, ...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
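As promised above, here is a generic sketch (not the question's Matrix class) of a contiguous row-major matrix: one allocation, with an indexing operator, so all elements are adjacent in memory.
#include <vector>
class DenseMatrix {
public:
    DenseMatrix(int rows, int cols)
        : rows_(rows), cols_(cols), data_(static_cast<std::size_t>(rows) * cols) {}
    // Row-major indexing: element (i, j) lives at offset i * cols + j.
    double& operator()(int i, int j)       { return data_[static_cast<std::size_t>(i) * cols_ + j]; }
    double  operator()(int i, int j) const { return data_[static_cast<std::size_t>(i) * cols_ + j]; }
    int getRows() const { return rows_; }
    int getCols() const { return cols_; }
private:
    int rows_, cols_;
    std::vector<double> data_;   // single contiguous allocation for the whole matrix
};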
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.

Poor performance with Intel Threading Building Blocks using task_group (new user)

I recently got interested in Intel Threading Building Blocks. I would like to make use of the tbb::task_group class to manage a thread pool.
My first attempt was to build a test where one vector is copied into another: I create nth tasks, each taking care of copying a contiguous slice of the vector.
However, performance decreases with the number of threads. I get the same results with another thread pool implementation. With TBB 2018 Update 5 and gcc 6.3 on Debian Stretch on an 8-core i7 box, I get the following figures for copying a vector of 1,000,000 elements:
nth real user
1 0.808s 0.807s
2 1.068s 2.105s
4 1.109s 4.282s
Maybe some of you could help me understand the issue. Here is the code:
#include<iostream>
#include<cstdlib>
#include<vector>
#include<algorithm>
#include "tbb/task_group.h"
#include "tbb/task_scheduler_init.h"
namespace mgis{
using real = double;
using size_type = size_t;
}
void my_copy(std::vector<mgis::real>& d,
const std::vector<mgis::real>& s,
const mgis::size_type b,
const mgis::size_type e){
const auto pb = s.begin()+b;
const auto pe = s.begin()+e;
const auto po = d.begin()+b;
std::copy(pb,pe,po);
}
int main(const int argc, const char* const* argv) {
using namespace mgis;
if (argc != 3) {
std::cerr << "invalid number of arguments\n";
std::exit(-1);
}
const auto ng = std::stoi(argv[1]);
const auto nth = std::stoi(argv[2]);
tbb::task_scheduler_init init(nth);
tbb::task_group g;
std::vector<real> v(ng,0);
std::vector<real> v2(ng);
for(auto i =0; i!=2000;++i){
const auto d = ng / nth;
const auto r = ng % nth;
size_type b = 0;
for (size_type i = 0; i != r; ++i) {
g.run([&v2, &v, b, d] { my_copy(v2, v, b, b + d + 1); });
b += d+1;
}
for (size_type i = r; i != nth; ++i) {
g.run([&v2, &v, b, d] { my_copy(v2, v, b, b + d); });
b += d ;
}
g.wait();
}
return EXIT_SUCCESS;
}
Such a short benchmark does not make much sense: TBB needs to create threads and get them started, and that does not happen immediately on the first call to TBB since it is a lazy, asynchronous process. That said, your user times suggest that the threads are up and running but probably don't have enough work to do.
memcpy is a poor workload for a scalability study because it does not scale beyond the number of memory controllers/channels. So it doesn't matter whether you have 4 CPUs or 24: it's unlikely you can get more than a 4x speedup even on good hardware, and yours might have fewer channels.
Instead of manually splitting the range, use tbb::parallel_for; you don't need task_group there. Moreover, invoking tasks one by one has linear complexity, while parallel_for has logarithmic complexity.
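For reference, a minimal sketch of that suggestion applied to the copy from the question (parallel_copy is an illustrative name; TBB chooses the split points and schedules the sub-ranges recursively):
#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
void parallel_copy(std::vector<double>& d, const std::vector<double>& s) {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, s.size()),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          // Each task copies only the sub-range it was handed.
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              d[i] = s[i];
                      });
}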

Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size

C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first, I thought they were just a portable way to get the size of an L1 cache line, but that is an oversimplification.
Questions:
How are these constants related to the L1 cache line size?
Is there a good example that demonstrates their use cases?
Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?
The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html
I'll quote a snippet of the rationale here for ease-of-reading:
[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.
Uses of cache-line size fall into two broad categories:
Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.
The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]
We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:
Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.
In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
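As a small illustration (not part of the quoted proposal; the struct names are made up), these constants are typically consumed through alignas and static_assert; in C++17 they live in <new>:
#include <atomic>
#include <new>
// Counters written by different threads: pad each onto its own cache line
// to avoid destructive interference (false sharing).
struct alignas(std::hardware_destructive_interference_size) padded_counter {
    std::atomic<int> value{0};
};
// Data read together by one thread: keep it small enough to share a cache line
// and benefit from constructive interference (true sharing).
struct packed_point {
    float x, y, z;
};
static_assert(sizeof(packed_point) <= std::hardware_constructive_interference_size,
              "packed_point should fit within one constructive-interference unit");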
"How are these constants related to the L1 cache line size?"
In theory, pretty directly.
Assume the compiler knows exactly what architecture you'll be running on - then these would almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)
For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)
"Is there a good example that demonstrates their use cases?"
At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.
It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in the L1 cache-line, and in the other a single element takes up the L1 cache-line. In a tight loop, a single fixed element is chosen from the array and updated repeatedly.
It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints within the pair do not fit in L1 cache-line size together, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.
Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.
I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).
Warning: the test will use all cores on your machine and allocate ~256MB of memory. Don't forget to compile with optimizations!
On my machine, the output is:
Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457
I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.
"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"
This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.
This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:
constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
return KNOWN_L1_CACHE_LINE_SIZE;
#else
return std::hardware_destructive_interference_size;
#endif
}
During compilation, if you want to assume a cache-line size just define KNOWN_L1_CACHE_LINE_SIZE.
Hope this helps!
Benchmark program:
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <random>
#include <thread>
#include <vector>
// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;
// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;
constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;
typedef unsigned useless_result_t;
typedef double elapsed_secs_t;
//////// CODE TO BE SAMPLED:
// wraps an int, default alignment allows false-sharing
struct naive_int {
int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");
// wraps an int, cache alignment prevents false-sharing
struct cache_int {
alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");
// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
int first;
char padding[hardware_constructive_interference_size];
int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");
// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
int first;
int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");
// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
Latch& latch,
unsigned thread_index,
T& vec) {
// prepare for computation
std::random_device rd;
std::mt19937 mt{ rd() };
std::uniform_int_distribution<int> dist{ 0, 4096 };
auto& element = vec[vec.size() / 2 + thread_index];
latch.count_down_and_wait();
// compute
for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
element.value = dist(mt);
}
return static_cast<useless_result_t>(element.value);
}
// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
Latch& latch,
unsigned thread_index,
T& pair) {
// prepare for computation
std::random_device rd;
std::mt19937 mt{ rd() };
std::uniform_int_distribution<int> dist{ 0, 4096 };
latch.count_down_and_wait();
// compute
for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
pair.first = dist(mt);
pair.second = dist(mt);
}
return static_cast<useless_result_t>(pair.first) +
static_cast<useless_result_t>(pair.second);
}
//////// UTILITIES:
// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
explicit threadlatch(const std::size_t count) :
count_{ count }
{}
void count_down_and_wait() {
std::unique_lock<std::mutex> lock{ mutex_ };
if (--count_ == 0) {
cv_.notify_all();
}
else {
cv_.wait(lock, [&] { return count_ == 0; });
}
}
private:
std::mutex mutex_;
std::condition_variable cv_;
std::size_t count_;
};
// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
const std::function<useless_result_t(threadlatch&, unsigned)>& func,
const unsigned num_threads) {
threadlatch latch{ num_threads + 1 };
std::vector<std::future<useless_result_t>> futures;
std::vector<std::thread> threads;
for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
std::packaged_task<useless_result_t()> task{
std::bind(func, std::ref(latch), thread_index)
};
futures.push_back(task.get_future());
threads.push_back(std::thread(std::move(task)));
}
const auto starttime = std::chrono::high_resolution_clock::now();
latch.count_down_and_wait();
for (auto& thread : threads) {
thread.join();
}
const auto endtime = std::chrono::high_resolution_clock::now();
const auto elapsed = std::chrono::duration_cast<
std::chrono::duration<double>>(
endtime - starttime
).count();
useless_result_t result = 0;
for (auto& future : futures) {
result += future.get();
}
return std::make_tuple(result, elapsed);
}
// utility: sample the time it takes to run func on N threads
void run_tests(
const std::function<useless_result_t(threadlatch&, unsigned)>& func,
const unsigned num_threads) {
useless_result_t final_result = 0;
double avgtime = 0.0;
for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
const auto result_and_elapsed = run_threads(func, num_threads);
const auto result = std::get<useless_result_t>(result_and_elapsed);
const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);
final_result += result;
avgtime = (avgtime * trial + elapsed) / (trial + 1);
}
std::cout
<< "Average time: " << avgtime
<< " seconds, useless result: " << final_result
<< std::endl;
}
int main() {
const auto cores = std::thread::hardware_concurrency();
std::cout << "Hardware concurrency: " << cores << std::endl;
std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;
{
std::cout << "Running naive_int test." << std::endl;
std::vector<naive_int> vec;
vec.resize((1u << 28) / sizeof(naive_int)); // allocate 256 mebibytes
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_array_threadfunc(latch, thread_index, vec);
}, cores);
}
{
std::cout << "Running cache_int test." << std::endl;
std::vector<cache_int> vec;
vec.resize((1u << 28) / sizeof(cache_int)); // allocate 256 mebibytes
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_array_threadfunc(latch, thread_index, vec);
}, cores);
}
{
std::cout << "Running bad_pair test." << std::endl;
bad_pair p;
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_pair_threadfunc(latch, thread_index, p);
}, cores);
}
{
std::cout << "Running good_pair test." << std::endl;
good_pair p;
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_pair_threadfunc(latch, thread_index, p);
}, cores);
}
}
I would almost always expect these values to be the same.
Regarding the above, I would like to make a minor contribution to the accepted answer. A while ago, I saw in the folly library a very good use case where these two should be defined separately. Please see the caveat about the Intel Sandy Bridge processor.
https://github.com/facebook/folly/blob/3af92dbe6849c4892a1fe1f9366306a2f5cbe6a0/folly/lang/Align.h
// Memory locations within the same cache line are subject to destructive
// interference, also known as false sharing, which is when concurrent
// accesses to these different memory locations from different cores, where at
// least one of the concurrent accesses is or involves a store operation,
// induce contention and harm performance.
//
// Microbenchmarks indicate that pairs of cache lines also see destructive
// interference under heavy use of atomic operations, as observed for atomic
// increment on Sandy Bridge.
//
// We assume a cache line size of 64, so we use a cache line pair size of 128
// to avoid destructive interference.
//
// mimic: std::hardware_destructive_interference_size, C++17
constexpr std::size_t hardware_destructive_interference_size =
kIsArchArm ? 64 : 128;
static_assert(hardware_destructive_interference_size >= max_align_v, "math?");
// Memory locations within the same cache line are subject to constructive
// interference, also known as true sharing, which is when accesses to some
// memory locations induce all memory locations within the same cache line to
// be cached, benefiting subsequent accesses to different memory locations
// within the same cache line and helping performance.
//
// mimic: std::hardware_constructive_interference_size, C++17
constexpr std::size_t hardware_constructive_interference_size = 64;
static_assert(hardware_constructive_interference_size >= max_align_v, "math?");
I've tested the above code, but I think there is a minor error preventing us from understanding the underlying functioning: to prevent false sharing, a single cache line should not be shared between two distinct atomics.
I've changed the definition of those structs.
struct naive_int {
    alignas(sizeof(int)) std::atomic<int> value;
};
struct cache_int {
    alignas(hardware_constructive_interference_size) std::atomic<int> value;
};
struct bad_pair {
    // two atomics sharing a single 64-byte cache line
    alignas(hardware_constructive_interference_size) std::atomic<int> first;
    std::atomic<int> second;
};
struct good_pair {
    // first cache line begins here
    alignas(hardware_constructive_interference_size) std::atomic<int> first;
    // that one is still in the first cache line
    std::atomic<int> first_s;
    // second cache line starts here
    alignas(hardware_constructive_interference_size) std::atomic<int> second;
    // that one is still in the second cache line
    std::atomic<int> second_s;
};
And the resulting run:
Hardware concurrency := 40
sizeof(naive_int) := 4
alignof(naive_int) := 4
sizeof(cache_int) := 64
alignof(cache_int) := 64
sizeof(bad_pair) := 64
alignof(bad_pair) := 64
sizeof(good_pair) := 128
alignof(good_pair) := 64
Running naive_int test.
Average time: 0.060303 seconds, useless result: 8212147
Running cache_int test.
Average time: 0.0109432 seconds, useless result: 8113799
Running bad_pair test.
Average time: 0.162636 seconds, useless result: 16289887
Running good_pair test.
Average time: 0.129472 seconds, useless result: 16420417
I experienced a lot of variance in the last result, but I never dedicated specific cores to this particular problem. Anyway, this ran on two Xeon 2690 v2 CPUs, and across various runs using either 64 or 128 for hardware_constructive_interference_size, I found 64 to be more than enough and 128 a very poor use of the available cache.
I suddenly realized that your question helps me understand what Jeff Preshing was talking about: it's all about the payload!

For loop performance and multithreaded performance questions

I was kind of bored, so I wanted to try out std::thread and eventually measure the performance of single-threaded and multithreaded console applications. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800,000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added a range-based and an iterator-based for loop and measured them the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug build, so I changed to release and the difference was ~1ms in favor of the index-based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference of 1ms in release mode?
I moved on to thread creation and tried to do a parallel sum of the array with this lambda (capturing everything by reference so I could use the previously declared vector of ints and a mutex), using an index-based for.
auto func = [&](int start, int total, int index)
{
int partial_sum = 0;
auto s = chrono::high_resolution_clock::now();
for (int i = start; i < start + total; ++i)
partial_sum += ints[i];
auto e = chrono::high_resolution_clock::now();
auto d = e - s;
m.lock();
cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
sum += partial_sum;
m.unlock();
};
for (int i = 0; i < 8; ++i)
threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? A solution with 2 threads ended up at ~30ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, I forgot to mention this was on Windows 7 with Visual Studio 2013 and its v120 toolset.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation turned off, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but actually called. For template code, you really need to turn on optimisation to avoid all the code being taken literally. For example, it's likely that your iterator code calls ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which in turn calls operator==, and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, switch into turbo mode, or drop into thermal protection mode. And rounding the times to milliseconds isn't really helpful: one thread could take 6.49 milliseconds and another 6.51, and they just got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls, and function calls have a certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all, so when optimizing, the compiler can simply throw the code away, resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be caused by the coarse granularity of high_resolution_clock in the standard library implementation you are using, and by differences in process scheduling during execution. I measured the index-based for being 0.04 ms slower, but that result is meaningless.
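One way to keep the optimizer from discarding the measured loop, shown here as a hedged sketch rather than the original test code, is to make the sum observable, for example through a volatile sink or by printing it:
#include <chrono>
#include <iostream>
#include <vector>
volatile long long sink = 0;   // the compiler cannot remove writes to a volatile object
void time_range_loop(const std::vector<int>& ints) {
    auto start = std::chrono::high_resolution_clock::now();
    long long sum = 0;
    for (auto& val : ints)
        sum += val;
    sink = sum;                // forces the summation to actually happen
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "range loop: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms\n";
}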
Aside from how std::thread is implemented on Windows, I would like to point your attention to your available execution units and context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading, and HT does not magically double the available computing power, no matter how it's advertised. It's a really clever system that tries to fit in instructions from an extra pipeline whenever possible, but in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check whether Visual Studio has iterator checking enabled in debugging. When it's enabled, every use of an iterator is bounds-checked, and so on. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the -openmp switch. If you enable that and apply the OpenMP #pragmas to your for loops, you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool :) ) and OpenMP's performance is pretty damn good.
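For example, a minimal OpenMP sketch of the same sum (parallel_sum is an illustrative name; compile with the OpenMP switch enabled):
#include <vector>
long long parallel_sum(const std::vector<int>& ints) {
    long long sum = 0;
    // The reduction clause gives each thread a private partial sum and combines them at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < static_cast<int>(ints.size()); ++i)
        sum += ints[i];
    return sum;
}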
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a Core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per-test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than what I'm seeing and by the 8-thread totals being so much greater, which, as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this just demonstrates how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated (I see no thread.start() or threads.StartAll() or the like). So, during the time the first thread instance becomes active, the main thread might (or might not) be preempted. There is no guarantee that the threads are spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, typically, going parallel starts to pay off when the computation time is well beyond the length of a time slice (rule of thumb).
Here is my "native" Windows implementation, which, for a large enough array, finally makes the threads win over a single-threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here the output on my machine of this code:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more of a memory I/O performance benchmark than a CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local and fit into the CPU caches, etc.
Maybe it would be interesting to experiment with the splitting of the data into ranges. What if each thread were "jumping" over the data from the start to an end with a gap of ncores? Thread 1: 0 8 16... Thread 2: 1 9 17 ... etc.? Maybe then the "locality" of the memory could gain extra speed.

What can you do to stop running out of stack space when multithreading?

I've implemented a working multithreaded merge sort in C++, but I've hit a wall.
In my implementation, I recursively split an input vector into two parts, and then thread these two parts:
void MergeSort(vector<int> *in)
{
if(in->size() < 2)
return;
vector<int>::iterator ite = in->begin();
vector<int> left = vector<int> (ite, ite + in->size()/2);
vector<int> right = vector<int> (ite + in->size()/2, in->end() );
//current thread spawns 2 threads HERE
thread t1 = thread(MergeSort, &left);
thread t2 = thread(MergeSort, &right);
t1.join();
t2.join();
vector<int> ret;
ret.reserve(in->size() );
ret = MergeSortMerge(left, right);
in->clear();
in->insert(in->begin(), ret.begin(), ret.end() );
return;
}
The code appears to be pretty, but it's one of the most vicious pieces of code I've ever written. Trying to sort an array of more than 1000 int values causes so many threads to spawn that I run out of stack space, and my computer BSODs :( Consistently.
I am well aware of the reason why this code spawns so many threads, which isn't so good, but technically (if not theoretically), is this not a proper implementation?
Based on a bit of Googling, I seem to have found the need for a threadpool. Would the use of a threadpool resolve the fundamental issue I am running into, the fact that I am trying to spawn too many threads? If so, do you have any recommendations on libraries?
Thank you for the advice and help!
As zdan explained, you should limit the number of threads. There are two things to consider when determining the limit:
The number of CPU cores. In C++11, you can use std::thread::hardware_concurrency() to determine the number of hardware cores. However, this function may return 0, meaning the program doesn't know how many cores there are; in this case you may assume a value of 2 or 4.
The amount of data to be processed. You can keep dividing the data among threads down to one element per thread, but spawning a thread for a single element costs far more than it saves. For example, you can decide that when a chunk holds fewer than 50 elements you don't want to divide any further; then the maximum number of threads required is roughly total_data_number / 50 + 1.
Then you take the minimum of case 1 and case 2 as the limit (see the sketch below).
In your case, because you are generating thread by recursion, you can try to determine the recursion depth in similar ways.
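For instance, a minimal sketch combining both limits (the 50-element floor is the arbitrary threshold from the explanation above; thread_limit is an illustrative name):
#include <algorithm>
#include <cstddef>
#include <thread>
std::size_t thread_limit(std::size_t total_data_number)
{
    std::size_t by_cores = std::thread::hardware_concurrency();
    if (by_cores == 0) by_cores = 2;                  // the hint may be unavailable
    std::size_t by_data = total_data_number / 50 + 1; // do not split below ~50 elements
    return std::min(by_cores, by_data);
}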
I don't think a thread pool is going to help you. Since your algorithm is recursive, you'll get to a point where all threads in your pool are consumed, the pool won't create any more threads, and your algorithm will block.
You could probably just limit your thread-creation recursion depth to 2 or 3 (unless you've got a LOT of CPUs, it won't make any difference in performance).
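For instance, a minimal sketch of that depth-limited idea applied to the question's MergeSort (MergeSortMerge is assumed to be the questioner's merge routine; the initial depth could be, say, log2 of std::thread::hardware_concurrency()):
#include <thread>
#include <vector>
std::vector<int> MergeSortMerge(const std::vector<int>& left,
                                const std::vector<int>& right);   // questioner's routine
void MergeSortLimited(std::vector<int>* in, int depth)
{
    if (in->size() < 2)
        return;
    auto ite = in->begin();
    std::vector<int> left(ite, ite + in->size() / 2);
    std::vector<int> right(ite + in->size() / 2, in->end());
    if (depth > 0) {
        // spawn threads only near the top of the recursion tree
        std::thread t1(MergeSortLimited, &left, depth - 1);
        std::thread t2(MergeSortLimited, &right, depth - 1);
        t1.join();
        t2.join();
    } else {
        // below the cutoff, recurse sequentially on the current thread
        MergeSortLimited(&left, 0);
        MergeSortLimited(&right, 0);
    }
    std::vector<int> ret = MergeSortMerge(left, right);
    in->assign(ret.begin(), ret.end());
}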
You can raise your limits on stack space, but it is futile. Too many threads, even with a pool, will eat it up at log2(N) * cost per thread. Go for an iterative approach and reduce your overhead; overhead is the killer.
As far as performance goes, you are going to find that using some level of overcommit beyond N threads, where N is the hardware concurrency, will probably yield the best results. There will be a good balance between overhead and work per core. If N gets very large, like on a GPU, then other options exist (e.g. bitonic sort) that make different trade-offs to reduce the communication (waiting/joining) overhead.
Assuming you have a task manager and a semaphore that is constructed to require N notifies before allowing the waiting task through:
#include <algorithm>
#include <array>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>
#include <sometaskmanager.h>
void parallel_merge( size_t N ) {
    std::array<int, 1000> ary {0};
    // fill array...
    intmax_t stride_size = std::max<intmax_t>( ary.size( )/N, 1 ); // TODO: use a larger minimum stride here
    using iterator = typename std::array<int, 1000>::iterator;
    // Partition the array into contiguous ranges of roughly stride_size elements.
    std::vector<std::pair<iterator, iterator>> ranges;
    auto last_it = ary.begin( );
    while( last_it != ary.end( ) ) {
        auto next_it = std::next( last_it, std::min<intmax_t>( std::distance( last_it, ary.end( ) ), stride_size ) );
        ranges.emplace_back( last_it, next_it );
        last_it = next_it;
    }
    // Sort every range as an independent task; each task signals the semaphore when done.
    auto semaphore = make_semaphore( ranges.size( ) );
    for( auto const & rng: ranges ) {
        add_task( [&semaphore, rng]( ) {
            std::sort( rng.first, rng.second );
            semaphore.notify( );
        });
    }
    semaphore.wait( );
    // Repeatedly merge pairs of adjacent sorted ranges until one sorted range remains.
    while( ranges.size( ) > 1 ) {
        std::vector<std::pair<iterator, iterator>> new_rng;
        semaphore = make_semaphore( ranges.size( )/2 );
        for( size_t n=0; n+1<ranges.size( ); n+=2 ) {
            auto first = ranges[n].first;
            auto last = ranges[n+1].second;
            add_task( [&semaphore, first, mid=ranges[n].second, last]( ) {
                std::inplace_merge( first, mid, last );
                semaphore.notify( );
            });
            new_rng.emplace_back( first, last );
        }
        if( ranges.size( ) % 2 != 0 ) {
            new_rng.push_back( ranges.back( ) );
        }
        semaphore.wait( );
        ranges = new_rng;
    }
}
As you can see, the bottleneck is in the merge phase, as there is a lot of coordination that must be done. Sean Parent gives a good presentation on building a task manager if you don't have one, along with a relative performance analysis, in his talk Better Code: Concurrency, http://sean-parent.stlab.cc/presentations/2016-11-16-concurrency/2016-11-16-concurrency.pdf . TBB and PPL have task managers.