Measuring bandwidth on a ccNUMA system - c++

I'm attempting to benchmark the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's Roofline plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
In order to benchmark this, I'm using my own little benchmark helper tool which performs timed experiments in a loop. The API offers an abstract class called experiment_functor
which looks like this:
class experiment_functor
{
public:
//+/////////////////
// main functionality
//+/////////////////
virtual void init() = 0;
virtual void* data(const std::size_t&) = 0;
virtual void perform_experiment() = 0;
virtual void finish() = 0;
};
The user (myself) can then define the data initialization, the work to be timed (i.e. the experiment), and the clean-up routine, so that freshly allocated data can be used for each experiment. An instance of the derived class can be provided to the API function:
perf_stats perform_experiments(experiment_functor& exp_fn, const std::size_t& data_size_in_byte, const std::size_t& exp_count)
Here's the implementation of the class for the Schönauer vector triad:
class exp_fn : public experiment_functor
{
//+/////////////////
// members
//+/////////////////
const std::size_t data_size_;
double* vec_a_ = nullptr;
double* vec_b_ = nullptr;
double* vec_c_ = nullptr;
double* vec_d_ = nullptr;
public:
//+/////////////////
// lifecycle
//+/////////////////
exp_fn(const std::size_t& data_size)
: data_size_(data_size) {}
//+/////////////////
// main functionality
//+/////////////////
void init() final
{
// allocate
const auto page_size = sysconf(_SC_PAGESIZE) / sizeof(double);
posix_memalign(reinterpret_cast<void**>(&vec_a_), page_size, data_size_ * sizeof(double));
posix_memalign(reinterpret_cast<void**>(&vec_b_), page_size, data_size_ * sizeof(double));
posix_memalign(reinterpret_cast<void**>(&vec_c_), page_size, data_size_ * sizeof(double));
posix_memalign(reinterpret_cast<void**>(&vec_d_), page_size, data_size_ * sizeof(double));
if (vec_a_ == nullptr || vec_b_ == nullptr || vec_c_ == nullptr || vec_d_ == nullptr)
{
std::cerr << "Fatal error, failed to allocate memory." << std::endl;
std::abort();
}
// apply first-touch
#pragma omp parallel for schedule(static)
for (auto index = std::size_t{}; index < data_size_; index += page_size)
{
vec_a_[index] = 0.0;
vec_b_[index] = 0.0;
vec_c_[index] = 0.0;
vec_d_[index] = 0.0;
}
}
void* data(const std::size_t&) final
{
return reinterpret_cast<void*>(vec_d_);
}
void perform_experiment() final
{
#pragma omp parallel for simd safelen(8) schedule(static)
for (auto index = std::size_t{}; index < data_size_; ++index)
{
vec_d_[index] = vec_a_[index] + vec_b_[index] * vec_c_[index]; // fp_count: 2, traffic: 4+1
}
}
void finish() final
{
std::free(vec_a_);
std::free(vec_b_);
std::free(vec_c_);
std::free(vec_d_);
}
};
Note: The function data serves a special purpose in that it tries to cancel out effects of NUMA-balancing. Every so often, in a random iteration, the function perform_experiments writes in a random fashion, using all threads, to the data provided by this function.
Question: Using this I am consistently getting a max. bandwidth of 201 GB/s. Why am I unable to achieve the stated 230 GB/s?
I am happy to provide any extra information if needed. Thanks very much in advance for your answers.
Update:
Following the suggestions made by @VictorEijkhout, I've now conducted a strong scaling experiment for the read-only bandwidth.
As you can see, the peak bandwidth is indeed 217 GB/s on average, with a maximum of 225 GB/s. It is still very puzzling to note that, at a certain point, adding CPUs actually reduces the effective bandwidth.

Bandwidth performance depends on the type of operation you do. For a mix of reads & writes you will indeed not get the peak number; if you only do reads you will get closer.
I suggest you read the documentation for the "Stream benchmark", and take a look at the posted numbers.
Further notes: I hope you tie your threads down with OMP_PROC_BIND? Also, your architecture runs out of bandwidth before it runs out of cores. Your optimal bandwidth performance may happen with fewer than the total number of cores.
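To make the read-only case concrete, here is a minimal sketch of a load-only kernel together with a GB/s conversion. The names sum_read_only, bandwidth_gb_per_s and time_seconds are made up for this illustration and are not part of the benchmark helper from the question; the pragma is simply ignored if OpenMP is not enabled at compile time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Read-only kernel: one 8-byte load per element and no stores.
double sum_read_only(const double* data, std::size_t n)
{
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum) schedule(static)
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += data[i];
    }
    return sum;
}

// Convert bytes moved and elapsed seconds to GB/s (1 GB = 1e9 bytes).
double bandwidth_gb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

// Wall-clock time of a single call to fn, in seconds.
template <typename Fn>
double time_seconds(Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For a measurement, time sum_read_only on an array much larger than the last-level cache, keep the returned sum alive (print it, for example) so the loop is not optimized away, and pass n * sizeof(double) as the byte count.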

Related

Windows threading synchronization performance issue

I have a threading issue under Windows.
I am developing a program that runs complex physical simulations for different conditions. With, say, one condition per hour of the year, that would be 8760 simulations. I am grouping those simulations per thread such that each thread runs a for loop of 273 simulations (on average).
I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux, all the threads seem to be at 98% to 100% usage, while under Windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows executions.
I did my best to make the objects immutable to avoid issues, and that provided a big gain with my old 8-thread Intel i7. However, with many more threads, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about Windows threads that produces this behaviour?
The custom parallel for code:
/**
* parallel for
* @tparam Index integer type
* @tparam Callable function type
* @param start start index of the loop
* @param end final +1 index of the loop
* @param func function to evaluate
* @param nb_threads number of threads, if zero, it is determined automatically
*/
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start + 1;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (int k1, int k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: Green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory both access a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt it to your example code, you could replace the b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I had very good experience with is mimalloc (there are others such as hoard or the one from TBB). You do not need to modify your source code, you just need to link with a specific library as described in the documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements are located behind each other in memory, allowing more efficient access.
But in general I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX,...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
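As an illustration of the contiguous-storage note above, a minimal sketch could look like the following. ContiguousMatrix is a hypothetical stand-in, not the Matrix class from the uploaded project; only the getRows/getCols/operator() names are borrowed from the code shown in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's Matrix: a single allocation
// holds all elements in row-major order.
class ContiguousMatrix
{
public:
    ContiguousMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t i, std::size_t j)
    {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }

    double operator()(std::size_t i, std::size_t j) const
    {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }

    std::size_t getRows() const { return rows_; }
    std::size_t getCols() const { return cols_; }

    // Raw view of the storage, e.g. for handing it to a BLAS-like routine.
    const double* raw() const { return data_.data(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

With this layout a matrix costs one allocation instead of one per row, and the end of one row is directly followed by the start of the next in memory.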
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.

Problem of sorting OpenMP threads into NUMA nodes by experiment

I'm attempting to create a std::vector<std::set<int>> with one set for each NUMA-node, containing the thread-ids obtained using omp_get_thread_num().
Topo:
Idea:
Create data which is larger than L3 cache,
set first touch using thread 0,
perform multiple experiments to determine the minimum access time of each thread,
extract the threads into nodes based on sorted access times and information about the topology.
Code: (Intel compiler, OpenMP)
// create data which will be shared by multiple threads
const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto size = 2 * part_size;
auto container = std::unique_ptr<double[]>(new double[size]); // array form, so that delete[] is used
// open a parallel section
auto thread_count = 0;
auto thread_id_min_duration = std::multimap<double, int>{};
#ifdef DECIDE_THREAD_COUNT
#pragma omp parallel num_threads(std::thread::hardware_concurrency())
#else
#pragma omp parallel
#endif
{
// perform first touch using thread 0
const auto thread_id = omp_get_thread_num();
if (thread_id == 0)
{
thread_count = omp_get_num_threads();
for (auto index = std::size_t{}; index < size; ++index)
{
container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
}
}
#pragma omp barrier
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// calculate the minimum access time of this thread
auto this_thread_min_duration = std::numeric_limits<double>::max();
for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
{
const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
const auto start_timestamp = omp_get_wtime();
for (auto index = std::size_t{}; index < part_size; ++index)
{
static volatile auto exceedingly_interesting_value_wink_wink = data[index];
}
const auto end_timestamp = omp_get_wtime();
const auto duration = end_timestamp - start_timestamp;
if (duration < this_thread_min_duration)
{
this_thread_min_duration = duration;
}
}
#pragma omp critical
{
thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
}
}
} // #pragma omp parallel
Not shown here is code which outputs the minimum access times sorted into the multimap.
Env. and Output
How do OMP_PLACES and OMP_PROC_BIND work?
I am attempting to not use SMT by using export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24. However, I'm getting this output:
What's puzzling me is that I'm having the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to neatly see 12 threads with access time, say, x and another 12 with access time ~2x.
Why is the above happening?
Additional Information
Even more puzzling are the following environments and their outputs:
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=26
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=48
Any help in understanding this phenomenon would be much appreciated.
In short, the benchmark is flawed.
perform multiple experiments to determine the minimum access time of each thread
The term "minimum access time" is unclear here; I assume you mean "latency". The thing is, your benchmark does not measure latency. volatile only tells the compiler that the loads and stores must actually be issued to the memory hierarchy instead of being optimized away. The processor is still free to keep the value in its caches, and x86-64 processors actually do that (like almost all modern processors).
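For reference, a common way to measure latency rather than throughput is pointer chasing: every load depends on the result of the previous one, so prefetchers and out-of-order execution cannot hide the access time. This is a different technique from the one in the question, sketched here under the assumption that a random single-cycle permutation (built with Sattolo's algorithm) is an acceptable access pattern; make_single_cycle and chase are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a permutation consisting of exactly one cycle (Sattolo's algorithm),
// so that following next[] visits every slot before returning to the start.
std::vector<std::size_t> make_single_cycle(std::size_t n, unsigned seed)
{
    assert(n > 0);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 gen(seed);
    for (std::size_t i = n - 1; i > 0; --i)
    {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(gen)]);
    }
    return next;
}

// Follow the chain for `steps` dependent loads and return the final position.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t start, std::size_t steps)
{
    std::size_t pos = start;
    for (std::size_t s = 0; s < steps; ++s)
    {
        pos = next[pos];
    }
    return pos;
}
```

To turn this into a measurement, time chase on a table far larger than the L3 cache, divide the elapsed time by the number of steps, and use the returned position so the loop cannot be removed by the compiler.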
How do OMP_PLACES and OMP_PROC_BIND work?
You can find the documentation of both here and there. In short, I strongly advise you to set OMP_PROC_BIND=TRUE and OMP_PLACES="{0},{1},{2},..." based on the values retrieved from hwloc. More specifically, you can get this from hwloc-calc, which is a really great tool (consider using --li --po, and PU rather than CORE, because this is what OpenMP runtimes expect). For example, you can query the PU identifiers of a given NUMA node. Note that some machines have very weird non-linear OS PU numbering, and OpenMP runtimes sometimes fail to map the threads correctly. IOMP (the OpenMP runtime of ICC) should use hwloc internally, but I found some bugs related to that in the past. To check that the mapping is correct, I advise you to use hwloc-ps. Note that OMP_PLACES=cores does not guarantee that threads do not migrate from one core to another (even one on a different NUMA node) unless OMP_PROC_BIND=TRUE (or a similar setting) is set. Note that you can also use numactl to control the NUMA policies of your process. For example, you can tell the OS not to use a given NUMA node or to interleave the allocations. The first-touch policy is not the only one and may not be the default on all platforms (on some Linux platforms, the OS can move pages between NUMA nodes to improve locality).
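As a small illustration of the explicit place list mentioned above, the following sketch formats a list of PU identifiers into the "{0},{1},{2},..." form. The helper name make_omp_places is made up for this example, and the identifiers passed to it are assumed to come from a tool such as hwloc-calc; nothing here queries the topology itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Format OS PU identifiers as an OMP_PLACES value with one PU per place,
// e.g. {0, 2, 4} becomes "{0},{2},{4}".
std::string make_omp_places(const std::vector<int>& pu_ids)
{
    std::string places;
    for (std::size_t i = 0; i < pu_ids.size(); ++i)
    {
        assert(pu_ids[i] >= 0);
        if (i != 0)
        {
            places += ',';
        }
        places += '{' + std::to_string(pu_ids[i]) + '}';
    }
    return places;
}
```

The resulting string would be exported as OMP_PLACES together with OMP_PROC_BIND=TRUE before the program starts, and the actual binding can then be checked with hwloc-ps.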
Why is the above happening?
The code takes 4.38 ms to read 50 MiB from memory in each thread. This means 1200 MiB are read from node 0, assuming the first-touch policy is applied. Thus the throughput should be about 267 GiB/s. While this seems fine at first glance, this is a pretty big throughput for such a processor, especially assuming only 1 NUMA node is used. This is certainly because part of the fetches are served from the L3 cache and not from RAM. Indeed, the cache can hold part of the array, and certainly does, resulting in faster fetches thanks to the cache associativity and a good cache policy. This is especially true as the cache lines are not invalidated, since the array is only read. I advise you to use a significantly bigger array to prevent this complex effect from happening.
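The arithmetic behind that figure, written out as a small sketch. It assumes 24 threads (as in OMP_NUM_THREADS=24 from the question), each reading 50 MiB in 4.38 ms; the function name is invented for this example.

```cpp
#include <cassert>

// Aggregate throughput in GiB/s when every thread reads mib_per_thread MiB
// within the same time window of `seconds`.
double aggregate_throughput_gib_per_s(int threads, double mib_per_thread, double seconds)
{
    assert(threads > 0 && seconds > 0.0);
    const double total_gib = threads * mib_per_thread / 1024.0; // 1200 MiB is about 1.17 GiB
    return total_gib / seconds;
}
```

Calling aggregate_throughput_gib_per_s(24, 50.0, 4.38e-3) gives a value between 267 and 268 GiB/s, which is where the number above comes from.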
You certainly expect one NUMA node to have a smaller throughput due to remote NUMA memory access. This is not always true in practice. In fact, this is often wrong on modern 2-socket systems, since the socket interconnect (otherwise the main source of throughput slowdown on NUMA systems) is often not a limiting factor.
NUMA effects arise on modern platforms because of unbalanced NUMA memory node saturation and non-uniform latency. The former is not a problem in your application since all the PUs use the same NUMA memory node. The latter is not a problem either because of the linear memory access pattern, CPU caches and hardware prefetchers: the latency should be completely hidden.
Even more puzzling are the following environments and their outputs
Using 26 threads on a 24-core machine means that 4 threads have to be executed on two cores. The thing is, hyper-threading should not help much in such a case. As a result, the threads sharing a core will be slowed down. Because IOMP certainly pins threads to cores and the workload is therefore unbalanced, 4 threads will be about twice as slow.
Having 48 threads causes all the threads to be slower because the workload is twice as big.
Let me address your first sentence. A C++ std::vector is different from a C malloc. Malloc'ed space is not "instantiated": only when you touch the memory does the physical-to-logical address mapping get established. This is known as "first touch". And that is why in C-OpenMP you initialize an array in parallel, so that the socket touching the part of the array gets the pages of that part. In C++, the "array" in a vector is created by a single thread, so the pages wind up on the socket of that thread.
Here's a solution:
template<typename T>
struct uninitialized {
uninitialized() {};
T val;
constexpr operator T() const {return val;};
double operator=( const T&& v ) { val = v; return val; };
};
Now you can create a vector<uninitialized<double>> and the array memory is not touched until you explicitly initialize it:
vector<uninitialized<double>> x(N),y(N);
#pragma omp parallel for
for (int i=0; i<N; i++)
y[i] = x[i] = 0.;
x[0] = 0; x[N-1] = 1.;
Now, I'm not sure how this goes if you have a vector of sets. Just thought I'd point out the issue.
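An alternative that keeps the element type a plain double is an allocator adaptor whose construct() default-initializes instead of value-initializing. This is a different technique from the wrapper struct above, shown only as a sketch: default_init_allocator, numa_vector and make_first_touched are names chosen for this example, and whether the pages really stay untouched until the parallel loop also depends on the underlying malloc (large blocks typically come fresh from the OS, small ones may be recycled).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Allocator adaptor: vector<T, default_init_allocator<T>>(n) leaves trivially
// constructible elements such as double unwritten.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A
{
    using traits = std::allocator_traits<A>;

public:
    template <typename U>
    struct rebind
    {
        using other = default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    using A::A;

    // No-argument construction: default-initialize (a no-op for double).
    template <typename U>
    void construct(U* ptr)
    {
        ::new (static_cast<void*>(ptr)) U;
    }

    // Every other construction is forwarded to the underlying allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args)
    {
        traits::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};

using numa_vector = std::vector<double, default_init_allocator<double>>;

// Allocate without writing, then let each thread touch its own static chunk.
numa_vector make_first_touched(std::size_t n, double value)
{
    numa_vector v(n);
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
    {
        v[i] = value;
    }
    return v;
}
```

Later loops over the vector should use the same static schedule so that each thread works on the pages it touched first.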
After more investigation, I note the following:
work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
memory page migration is a thing on modern NUMA systems.
Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lay. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it may lay its own first touch after allocating the memory.
// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
if (data == nullptr)
{
throw std::runtime_error("could_not_allocate_memory_error");
}
// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
if (omp_get_thread_num() == 0)
{
#pragma omp simd safelen(8)
for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
{
data[d_index] = -1.0;
}
}
} // #pragma omp parallel
mlock(data, total_size * sizeof(double)); // length is in bytes; page migration is a real thing...
// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// seed each thread's generator
generator.seed(thread_counter + 1);
// calculate the average access latency of this thread
auto this_thread_avg_latency = 0.0;
const auto experiment_count = 250;
for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
{
const auto start_timestamp = omp_get_wtime() * 1E+6;
for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
{
const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread-1)(generator);
auto& datapoint = data[thread_counter * size_per_thread + index];
datapoint += index;
}
const auto end_timestamp = omp_get_wtime() * 1E+6;
this_thread_avg_latency += end_timestamp - start_timestamp;
}
this_thread_avg_latency /= experiment_count;
#pragma omp critical
{
thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
}
}
} // #pragma omp parallel
std::free(data);
With these changes, I am noticing the difference I expected.
Further notes:
this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
there is no reliable cross-platform way of doing this (requires kernel-APIs),
OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu seem to number them (logical ID?)
The most astonishing things are that the difference in latencies is very low, and that memory page migration may happen, which begs the question, why should we care about first-touch and all the rest of the NUMA concerns at all?

OpenMP: copying vector using 'multithreading'

For a certain coding application I need to copy a vector consisting of big objects, so I want to make it more efficient. I'll give the old code below, along with an attempt to use OpenMP to make it more efficient.
std::vector<Object> Objects, NewObjects;
Objects.reserve(30);
NewObjects.reserve(30);
// old code
Objects = NewObjects;
// new code
omp_set_num_threads(30);
#pragma omp parallel{
Objects[omp_get_thread_num()] = NewObjects[omp_get_thread_num()];
}
Would this give the same result? Or are there issues since I access the vector 'Objects'? I thought it might work since I don't access the same index/Object.
omp_set_num_threads(30) does not guarantee that you obtain 30 threads; you may get fewer, and then your code will not work properly. You have to use a loop and parallelize it with OpenMP:
#pragma omp parallel for
for(size_t i=0;i<NewObjects.size(); ++i)
{
Objects[i] = NewObjects[i];
}
Note that it may not be faster than the serial version, because parallel execution has significant overheads.
If you use a C++17 compiler, the best idea is to use std::copy with a parallel execution policy:
std::copy(std::execution::par, NewObjects.begin(), NewObjects.end(), Objects.begin());
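One detail the indexed loop depends on: the destination must already contain elements. reserve() only allocates capacity, so after Objects.reserve(30) the vector is still empty and Objects[i] is out of bounds. A minimal sketch of the loop version with that taken care of follows; parallel_copy is a name made up for this example, and it assumes the element type is default-constructible so that resize() works.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise copy with an OpenMP loop. The destination is resized first,
// because indexing is only valid for elements that already exist.
template <typename T>
void parallel_copy(const std::vector<T>& src, std::vector<T>& dst)
{
    assert(&src != &dst);
    dst.resize(src.size());
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(src.size());
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
    {
        dst[static_cast<std::size_t>(i)] = src[static_cast<std::size_t>(i)];
    }
}
```

The resize happens once, before the parallel region; inside the loop each iteration writes to a distinct element, so no synchronization is needed.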
I created a benchmark to see how fast my test machine copies objects:
#include <benchmark/benchmark.h>
#include <omp.h>
#include <vector>
constexpr int operator "" _MB(unsigned long long v) { return v * 1024 * 1024; }
class CopyableBigObject
{
public:
CopyableBigObject(const size_t s) : vec(s) {}
CopyableBigObject(const CopyableBigObject& other) = default;
CopyableBigObject(CopyableBigObject&& other) = delete;
~CopyableBigObject() = default;
CopyableBigObject& operator =(const CopyableBigObject&) = default;
CopyableBigObject& operator =(CopyableBigObject&&) = delete;
char& operator [](const int index) { return vec[index]; }
size_t size() const { return vec.size(); }
private:
std::vector<char> vec;
};
// Force some work on the objects so they are not optimized away
int calculated_value(std::vector<CopyableBigObject>& vec)
{
int sum = 0;
for (int x = 0; x < vec.size(); ++x)
{
for (int index = 0; index < vec[x].size(); index += 100)
{
sum += vec[x][index];
}
}
return sum;
}
static void BM_copy_big_objects(benchmark::State& state)
{
const size_t number_of_objects = state.range(0);
const size_t data_size = state.range(1);
for (auto _ : state)
{
std::vector<CopyableBigObject> src{ number_of_objects, CopyableBigObject(data_size) };
std::vector<CopyableBigObject> dest;
state.counters["src"] = calculated_value(src);
dest = src;
state.counters["dest"] = calculated_value(dest);
}
}
static void BM_copy_big_objects_in_parallel(benchmark::State& state)
{
const size_t number_of_objects = state.range(0);
const size_t data_size = state.range(1);
const int number_of_threads = state.range(2);
for (auto _ : state)
{
std::vector<CopyableBigObject> src{ number_of_objects, CopyableBigObject(data_size) };
std::vector<CopyableBigObject> dest{ number_of_objects, CopyableBigObject(0) };
state.counters["src"] = calculated_value(src);
#pragma omp parallel num_threads(number_of_threads)
{
if (omp_get_thread_num() == 0)
{
state.counters["number_of_threads"] = omp_get_num_threads();
}
#pragma omp for
for (int x = 0; x < src.size(); ++x)
{
dest[x] = src[x];
}
}
state.counters["dest"] = calculated_value(dest);
}
}
BENCHMARK(BM_copy_big_objects)
->Unit(benchmark::kMillisecond)
->Args({ 30, 16_MB })
->Args({ 1000, 1_MB })
->Args({ 100, 8_MB });
BENCHMARK(BM_copy_big_objects_in_parallel)
->Unit(benchmark::kMillisecond)
->Args({ 100, 1_MB, 1 })
->Args({ 100, 8_MB, 1 })
->Args({ 800, 1_MB, 1 })
->Args({ 100, 8_MB, 2 })
->Args({ 100, 8_MB, 4 })
->Args({ 100, 8_MB, 8 });
BENCHMARK_MAIN();
These are results I got on my test machine, an old Xeon workstation:
Run on (4 X 2394 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 4096 KiB (x4)
L3 Unified 16384 KiB (x1)
Load Average: 0.25, 0.14, 0.10
--------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
BM_copy_big_objects/30/16777216 30.9 ms 30.5 ms 24 dest=0 src=0
BM_copy_big_objects/1000/1048576 0.352 ms 0.349 ms 1987 dest=0 src=0
BM_copy_big_objects/100/8388608 4.62 ms 4.57 ms 155 dest=0 src=0
BM_copy_big_objects_in_parallel/100/1048576/1 0.359 ms 0.355 ms 2028 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/1 4.67 ms 4.61 ms 151 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/800/1048576/1 0.357 ms 0.353 ms 1983 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/2 5.29 ms 5.23 ms 132 dest=0 number_of_threads=2 src=0
BM_copy_big_objects_in_parallel/100/8388608/4 5.32 ms 5.25 ms 133 dest=0 number_of_threads=4 src=0
BM_copy_big_objects_in_parallel/100/8388608/8 5.57 ms 3.98 ms 175 dest=0 number_of_threads=8 src=0
As I expected, parallelizing copying does not improve performance. However, copying large objects is slower than I expected.
Given you stated that you use C++14, there are a number of things you can try which could improve performance:
Move the objects using the move-constructor / move-assignment combination or unique_ptr instead of copying.
Defer making copies of member variables until you really need them by using Copy-On-Write.
This will make copying cheap until you have to update a big object.
If a large proportion of your objects are not updated after they have been copied then you should get a performance boost.
Make sure your class definitions are using the most compact representation. I have seen classes be different sizes depending on whether it is a release build or a debug build because the compiler was using padding for the release build but not the debug build.
Possibly rewrite so copying is avoided altogether.
Without knowing the specific details of your objects, it is not possible to give a specific answer. However, this should point to a full solution.
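To illustrate the copy-on-write suggestion, here is a minimal sketch built on std::shared_ptr. The class name cow and its member names are invented for this example, and the use_count() check is only meaningful when a given object is not shared between threads without extra synchronization.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Copy-on-write handle: copying a cow<T> only copies a pointer; the payload
// is duplicated on the first write while it is still shared.
template <typename T>
class cow
{
public:
    explicit cow(T value) : ptr_(std::make_shared<T>(std::move(value))) {}

    const T& read() const
    {
        assert(ptr_ != nullptr); // a moved-from handle holds nothing
        return *ptr_;
    }

    T& write()
    {
        assert(ptr_ != nullptr);
        if (ptr_.use_count() > 1)
        {
            ptr_ = std::make_shared<T>(*ptr_); // detach from the other owners
        }
        return *ptr_;
    }

    bool shares_storage_with(const cow& other) const { return ptr_ == other.ptr_; }

private:
    std::shared_ptr<T> ptr_;
};
```

A big object would keep its large members inside such a handle, so that copying the object stays cheap until one of the copies is modified.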

CPU caching understanding

I tested the following code with the Google Benchmark framework to measure memory access latency with different array sizes:
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
for (size_t id = 0; id < length; id++) {
volatile int64_t ignored = data[id];
}
return 0;
}
int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
for (size_t id = 0; id < length; id += 4) {
volatile int64_t ignored = data[id];
}
return 0;
}
And I get the following results (the results are averaged by Google Benchmark; for big arrays there are about 10 iterations, and many more for smaller ones):
A lot of different things happen in this picture and, unfortunately, I can't explain all the changes in the graph.
I tested this code on a single-core CPU with the following cache configuration:
CPU Caches:
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
In this picture we can see many changes in graph behavior:
There is a spike after an array size of 64 bytes, which can be explained by the fact that the cache line size is 64 bytes, and with an array larger than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory cache miss).
Also, the latency increases near the cache size bounds, which is also easy to explain: at this point we experience capacity cache misses.
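The cache-line reasoning in the two points above can be written down as a bit of arithmetic. This sketch assumes a 64-byte line, an array that starts on a line boundary, and a stride that divides the line size evenly; the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64; // line size stated in the question

// Number of distinct cache lines covered by an array of the given size.
std::size_t cache_lines_touched(std::size_t array_bytes)
{
    return (array_bytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// Number of loads that land in each touched line for a given element stride.
std::size_t loads_per_cache_line(std::size_t stride_elements, std::size_t element_bytes)
{
    assert(stride_elements > 0 && element_bytes > 0);
    const std::size_t stride_bytes = stride_elements * element_bytes;
    return stride_bytes >= kCacheLineBytes ? 1 : kCacheLineBytes / stride_bytes;
}
```

Under these assumptions a 65-byte array touches two lines, MemoryAccessAllElements performs eight loads per line, and MemoryAccessEvery4th performs two, so a single miss is amortized over fewer loads in the strided version.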
But there are a lot of questions about the results that I can't explain:
Why does the latency for MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?
Why can we see another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point because at this moment we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this event, and if it is, how can it be explained?
Why can we see the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such difference with MemoryAccessAllElements?
I've tried to compare my results with the results from "Gallery of Processor Cache Effects" and "What Every Programmer Should Know About Memory", but I can't fully describe my results with the reasoning from these articles.
Can someone help me to understand the internal processes of CPU caching?
UPD:
I use the following code to measure performance of memory access:
#include <benchmark/benchmark.h>
using namespace benchmark;
void InitializeWithRandomNumbers(long long *array, size_t length) {
auto random = Random(0);
for (size_t id = 0; id < length; id++) {
array[id] = static_cast<long long>(random.NextLong(0, 1LL << 60));
}
}
static void MemoryAccessAllElements_Benchmark(State &state) {
size_t size = static_cast<size_t>(state.range(0));
auto array = new long long[size];
InitializeWithRandomNumbers(array, size);
for (auto _ : state) {
DoNotOptimize(MemoryAccessAllElements(array, size));
}
delete[] array;
}
static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
for (int size = 2; size <= (1 << 24); size *= 2) {
benchmark->Arg(size);
}
}
BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
You can find slightly different examples in the repository, but the basic approach is the same as for the benchmark in the question.

Improve OpenMP/SSE parallelization effect

I'm trying to improve the performance of a routine using OpenMP (parallel for) and SSE intrinsics:
void Tester::ProcessParallel()//ProcessParallel is member of Tester class
{
//Initialize
auto OutMapLen = this->_OutMapLen;
auto KernelBatchLen = this->_KernelBatchLen;
auto OutMapHeig = this->_OutMapHeig;
auto OutMapWid = this->_OutMapWid;
auto InpMapWid = this->_InpMapWid;
auto NumInputMaps = this->_NumInputMaps;
auto InpMapLen = this->_InpMapLen;
auto KernelLen = this->_KernelLen;
auto KernelHeig = this->_KernelHeig;
auto KernelWid = this->_KernelWid;
auto input_local = this->input;
auto output_local = this->output;
auto weights_local = this->weights;
auto biases_local = this->biases;
auto klim = this->_klim;
#pragma omp parallel for firstprivate(OutMapLen,KernelBatchLen,OutMapHeig,OutMapWid,InpMapWid,NumInputMaps,InpMapLen,KernelLen,KernelHeig,KernelWid,input_local,output_local,weights_local,biases_local,klim)
for(auto i=0; i<_NumOutMaps; ++i)
{
auto output_map = output_local + i*OutMapLen;
auto kernel_batch = weights_local + i*KernelBatchLen;
auto bias = biases_local + i;
for(auto j=0; j<OutMapHeig; ++j)
{
auto output_map_row = output_map + j*OutMapWid;
auto inp_row_idx = j*InpMapWid;
for(auto k=0; k<OutMapWid; ++k)
{
auto output_nn = output_map_row + k;
*output_nn = *bias;
auto inp_cursor_idx = inp_row_idx + k;
for(int _i=0; _i<NumInputMaps; ++_i)
{
auto input_cursor = input_local + _i*InpMapLen + inp_cursor_idx;
auto kernel = kernel_batch + _i*KernelLen;
for(int _j=0; _j<KernelHeig; ++_j)
{
auto kernel_row_idx = _j*KernelWid;
auto inp_row_cur_idx = _j*InpMapWid;
int _k=0;
for(; _k<klim; _k+=4)//unroll and vectorize
{
float buf;
__m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
__m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
__m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
_mm_store_ss(&buf, prd);
*output_nn += buf;
}
for(; _k<KernelWid; ++_k)//residual loop
*output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
}
}
}
}
}
}
Pure unrolling and SSE vectorization (without OpenMP) of the innermost loop improves total performance by ~1.3 times, which is a pretty nice result. However, pure OpenMP parallelization (without unrolling/vectorization) of the outer loop gives only a ~2.1 times gain on a Core i7 2600K (4 physical cores, 8 hardware threads). Combined, SSE vectorization and OpenMP parallel for give a 2.3-2.7 times gain. How can I boost the effect of OpenMP parallelization in the code above?
Interesting: if I replace the "klim" variable (the bound of the unrolled innermost loop) with a scalar constant, say 4, the total performance gain rises to 3.5 times.
In most cases vectorisation and threading do not work orthogonally with respect to speeding up the calculations, i.e. their speed-ups do not necessarily add up. What's worse, this happens mostly in cases like yours, where data is processed in a streaming fashion. The reason is simple: finite memory bandwidth. A very simple measure of whether this is the case is the so-called computational intensity (CI), defined as the amount of data processing (usually in FLOPs) performed per byte of input data.
In your case you load two XMM registers, which makes 32 bytes of data in total, and then perform one dot product operation. Suppose your code runs on a 2 GHz Sandy Bridge CPU. Although DPPS takes a full 12 cycles to complete on SNB, the CPU is able to overlap several such instructions and retire one every 2 cycles. Therefore, at 2 GHz each core could perform 1 billion dot products per second in a tight loop, which would require 32 GB/s of memory bandwidth to keep the loop busy. The actual bandwidth needed in your case is lower since there are other instructions in the loop, but the main idea remains: the processing rate of the loop is limited by the amount of data that the memory is able to feed to the core.
As long as all the data fits into the last-level cache (LLC), performance scales more or less with the number of threads, as the LLC usually provides fairly high bandwidth (e.g. 300 GB/s on Xeon 7500's as stated here). This is no longer the case once the data grows too big to fit into the cache, as the main memory usually provides an order of magnitude less bandwidth per memory controller. In that case all cores have to share the limited memory speed, and once it is saturated, adding more threads does not increase the speed-up. Only adding more bandwidth, e.g. having a system with several CPU sockets, would result in an increased processing speed.
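The 32 GB/s estimate above is simple arithmetic: clock rate divided by cycles per operation gives operations per second, and multiplying by bytes consumed per operation gives the bandwidth demand. A minimal sketch, with a hypothetical helper name and the numbers taken from the paragraph above:

```cpp
#include <cassert>

// Bytes per second that one core must receive to sustain a loop which
// retires one operation every `cycles_per_op` cycles and consumes
// `bytes_per_op` bytes of input per operation.
double RequiredBandwidthBytesPerSec(double clock_hz,
                                    double cycles_per_op,
                                    double bytes_per_op) {
    assert(cycles_per_op > 0.0);
    const double ops_per_sec = clock_hz / cycles_per_op;
    return ops_per_sec * bytes_per_op;
}
```

For example, `RequiredBandwidthBytesPerSec(2e9, 2.0, 32.0)` gives 3.2e10 bytes/s, i.e. the 32 GB/s per core quoted above. This is an upper bound for the loop in the question, since it ignores the other instructions in the loop body.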
There is a theoretical model, called the Roofline model, that captures this in a more formal way. You can see some explanations and applications of the model in this presentation.
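In its basic form the Roofline model says that attainable performance is the smaller of the peak compute rate and the product of computational intensity and memory bandwidth. A minimal sketch of that formula follows; the function name and the example numbers are illustrative assumptions, not measurements of any real machine:

```cpp
#include <algorithm>

// Basic Roofline bound: attainable operations per second for a kernel with
// the given computational intensity (operations per byte) on a machine with
// the given peak compute rate and memory bandwidth.
double RooflineAttainable(double peak_ops_per_sec,
                          double bandwidth_bytes_per_sec,
                          double ops_per_byte) {
    const double memory_bound = ops_per_byte * bandwidth_bytes_per_sec;
    return std::min(peak_ops_per_sec, memory_bound);
}
```

With a hypothetical peak of 100 units, a bandwidth of 10 units and an intensity of 2, the kernel is memory-bound at 20; raising the intensity to 50 makes it compute-bound at 100. Adding threads or wider vectors raises only the compute peak, so a memory-bound kernel does not get faster.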
The bottom line is: both vectorisation and multiprocessing (e.g. threading) increase the performance but also increase the memory pressure. As long as the memory bandwidth is not saturated, both result in increased processing rate. Once the memory becomes the bottleneck, performance does not increase any more. There are even cases when multithreaded performance drops because of the additional pressure put by vectorisation.
Possibly an optimisation hint: the store to *output_nn might not get optimised since output_nn ultimately points inside a shared variable. Therefore you might try something like:
for(auto k=0; k<OutMapWid; ++k)
{
auto output_nn = output_map_row + k;
auto _output_nn = *bias;
auto inp_cursor_idx = inp_row_idx + k;
for(int _i=0; _i<NumInputMaps; ++_i)
{
...
for(int _j=0; _j<KernelHeig; ++_j)
{
...
for(; _k<klim; _k+=4)//unroll and vectorize
{
...
_output_nn += buf;
}
for(; _k<KernelWid; ++_k)//residual loop
_output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
}
}
*output_nn = _output_nn;
}
But I guess your compiler is smart enough to figure this out by itself. Anyway, it would only matter in the single-threaded case. Once you are in the saturated memory bandwidth region, no such optimisation would matter.
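The hint boils down to the following pattern, shown here as a self-contained scalar sketch without intrinsics. The function names are hypothetical, and the aliasing explanation in the comments is one common reason for the missed optimisation, not something verified against the code in the question:

```cpp
#include <cstddef>

// Accumulates directly through the output pointer. If the compiler cannot
// prove that `out` does not alias `a` or `b`, it may have to load and store
// *out on every iteration.
void DotThroughPointer(const float* a, const float* b, std::size_t n, float* out) {
    for (std::size_t i = 0; i < n; ++i)
        *out += a[i] * b[i];
}

// Same computation with a local accumulator: the running sum can stay in a
// register and is written back to memory exactly once.
void DotLocalAccumulator(const float* a, const float* b, std::size_t n, float* out) {
    float acc = *out;  // start from the current value, e.g. the bias
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    *out = acc;
}
```

Both functions add the products in the same order, so they produce identical results; only the number of memory accesses the compiler is forced to emit may differ.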