Improve OpenMP/SSE parallelization effect - C++

I tried to improve the performance of a routine using OpenMP (parallel for) and SSE intrinsics:
void Tester::ProcessParallel() // ProcessParallel is a member of the Tester class
{
    // Initialize
    auto OutMapLen      = this->_OutMapLen;
    auto KernelBatchLen = this->_KernelBatchLen;
    auto OutMapHeig     = this->_OutMapHeig;
    auto OutMapWid      = this->_OutMapWid;
    auto InpMapWid      = this->_InpMapWid;
    auto NumInputMaps   = this->_NumInputMaps;
    auto InpMapLen      = this->_InpMapLen;
    auto KernelLen      = this->_KernelLen;
    auto KernelHeig     = this->_KernelHeig;
    auto KernelWid      = this->_KernelWid;
    auto input_local    = this->input;
    auto output_local   = this->output;
    auto weights_local  = this->weights;
    auto biases_local   = this->biases;
    auto klim           = this->_klim;
    #pragma omp parallel for firstprivate(OutMapLen,KernelBatchLen,OutMapHeig,OutMapWid,InpMapWid,NumInputMaps,InpMapLen,KernelLen,KernelHeig,KernelWid,input_local,output_local,weights_local,biases_local,klim)
    for(auto i=0; i<_NumOutMaps; ++i)
    {
        auto output_map   = output_local  + i*OutMapLen;
        auto kernel_batch = weights_local + i*KernelBatchLen;
        auto bias         = biases_local  + i;
        for(auto j=0; j<OutMapHeig; ++j)
        {
            auto output_map_row = output_map + j*OutMapWid;
            auto inp_row_idx    = j*InpMapWid;
            for(auto k=0; k<OutMapWid; ++k)
            {
                auto output_nn = output_map_row + k;
                *output_nn = *bias;
                auto inp_cursor_idx = inp_row_idx + k;
                for(int _i=0; _i<NumInputMaps; ++_i)
                {
                    auto input_cursor = input_local + _i*InpMapLen + inp_cursor_idx;
                    auto kernel       = kernel_batch + _i*KernelLen;
                    for(int _j=0; _j<KernelHeig; ++_j)
                    {
                        auto kernel_row_idx  = _j*KernelWid;
                        auto inp_row_cur_idx = _j*InpMapWid;
                        int _k=0;
                        for(; _k<klim; _k+=4) // unroll and vectorize
                        {
                            float buf;
                            __m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
                            __m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
                            __m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
                            _mm_store_ss(&buf, prd);
                            *output_nn += buf;
                        }
                        for(; _k<KernelWid; ++_k) // residual loop
                            *output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
                    }
                }
            }
        }
    }
}
Pure unrolling and SSE vectorization (without OpenMP) of the last nested loop improves total performance about 1.3 times, which is a pretty nice result. However, pure OpenMP parallelization (without unrolling/vectorization) of the outer loop gives only about a 2.1x gain on a Core i7 2600K (4 cores / 8 threads). With both SSE vectorization and the OpenMP parallel for combined, the gain is 2.3-2.7x. How can I boost the OpenMP parallelization effect in the code above?
Interesting: if I replace the "klim" variable (the bound of the unrolled loop) with a scalar constant, say 4, the total performance gain rises to 3.5x.

Vectorisation and threading do not work orthogonally (with respect to speeding up the calculations) in most cases, i.e. their speed-ups do not necessarily add up. What's worse is that this happens mostly in cases like yours, where data is processed in a streaming fashion. The reason is simple: finite memory bandwidth. A very simple measure of whether this is the case is the so-called computational intensity (CI), defined as the amount of data processing (usually in FLOPS) performed per byte of input data. In your case you load two XMM registers, which makes 32 bytes of data in total, and then perform one dot product operation.

Say your code runs on a 2 GHz Sandy Bridge CPU. Although DPPS takes a full 12 cycles to complete on SNB, the CPU is able to overlap several such instructions and retire one every 2 cycles. Therefore, at 2 GHz each core could perform 1 billion dot products per second in a tight loop, and it would take 32 GB/s of memory bandwidth to keep such a loop busy. The actual bandwidth needed in your case is lower, since there are other instructions in the loop, but the main idea remains: the processing rate of the loop is limited by the amount of data that the memory is able to feed to the core.

As long as all the data fits into the last-level cache (LLC), performance more or less scales with the number of threads, as the LLC usually provides fairly high bandwidth (e.g. 300 GB/s on Xeon 7500's, as stated here). This is no longer the case once the data grows too big to fit into the cache, since the main memory usually provides an order of magnitude less bandwidth per memory controller. In that case all cores have to share the limited memory speed, and once it is saturated, adding more threads does not increase the speed-up. Only adding more bandwidth, e.g. a system with several CPU sockets, results in increased processing speed.
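To put rough numbers on that arithmetic: two 16-byte loads per iteration is 32 B of input; retiring one iteration every 2 cycles at 2 GHz is 10^9 iterations/s, hence the 32 GB/s per core quoted above. One 4-wide dot product is about 4 multiplies + 3 adds = 7 FLOP, so the computational intensity is roughly 7 FLOP / 32 B, i.e. ~0.22 FLOP/byte, which is very low. (These are back-of-the-envelope numbers derived from the paragraph above, not measurements.)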
There is a theoretical model, called the Roofline model, that captures this in a more formal way. You can see some explanations and applications of the model in this presentation.
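In its simplest form, the Roofline model says that attainable performance = min(peak FLOP/s, CI x peak memory bandwidth): once CI times the available bandwidth falls below the compute peak, the code is memory-bound and neither extra threads nor wider vectors help.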
The bottom line is: both vectorisation and multiprocessing (e.g. threading) increase the performance but also increase the memory pressure. As long as the memory bandwidth is not saturated, both result in increased processing rate. Once the memory becomes the bottleneck, performance does not increase any more. There are even cases when multithreaded performance drops because of the additional pressure put by vectorisation.
A possible optimisation hint: the store to *output_nn might not get kept in a register, since output_nn ultimately points into a shared array. Therefore you might try something like:
for(auto k=0; k<OutMapWid; ++k)
{
    auto output_nn = output_map_row + k;
    auto _output_nn = *bias;
    auto inp_cursor_idx = inp_row_idx + k;
    for(int _i=0; _i<NumInputMaps; ++_i)
    {
        ...
        for(int _j=0; _j<KernelHeig; ++_j)
        {
            ...
            for(; _k<klim; _k+=4) // unroll and vectorize
            {
                ...
                _output_nn += buf;
            }
            for(; _k<KernelWid; ++_k) // residual loop
                _output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
        }
    }
    *output_nn = _output_nn;
}
But I guess your compiler is smart enough to figure it out by itself. Anyway, this would only matter in the single-threaded case. Once you are in the saturated-memory-bandwidth region, no such optimisation would matter.

Related

Problem of sorting OpenMP threads into NUMA nodes by experiment

I'm attempting to create a std::vector<std::set<int>> with one set for each NUMA-node, containing the thread-ids obtained using omp_get_thread_num().
Topo: (machine topology output omitted; the machine has 2 NUMA nodes with 24 cores in total)
Idea:
Create data which is larger than L3 cache,
set first touch using thread 0,
perform multiple experiments to determine the minimum access time of each thread,
extract the threads into nodes based on sorted access times and information about the topology.
Code: (Intel compiler, OpenMP)
// create data which will be shared by multiple threads
const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto size = 2 * part_size;
auto container = std::unique_ptr<double[]>(new double[size]);
// open a parallel section
auto thread_count = 0;
auto thread_id_min_duration = std::multimap<double, int>{};
#ifdef DECIDE_THREAD_COUNT
#pragma omp parallel num_threads(std::thread::hardware_concurrency())
#else
#pragma omp parallel
#endif
{
// perform first touch using thread 0
const auto thread_id = omp_get_thread_num();
if (thread_id == 0)
{
thread_count = omp_get_num_threads();
for (auto index = std::size_t{}; index < size; ++index)
{
container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
}
}
#pragma omp barrier
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// calculate the minimum access time of this thread
auto this_thread_min_duration = std::numeric_limits<double>::max();
for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
{
const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
const auto start_timestamp = omp_get_wtime();
for (auto index = std::size_t{}; index < part_size; ++index)
{
static volatile auto exceedingly_interesting_value_wink_wink = data[index];
}
const auto end_timestamp = omp_get_wtime();
const auto duration = end_timestamp - start_timestamp;
if (duration < this_thread_min_duration)
{
this_thread_min_duration = duration;
}
}
#pragma omp critical
{
thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
}
}
} // #pragma omp parallel
Not shown here is code which outputs the minimum access times sorted into the multimap.
Env. and Output
How do OMP_PLACES and OMP_PROC_BIND work?
I am attempting not to use SMT by setting export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24. However, I'm getting the following behaviour:
I see the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to see 12 threads with some access time x and another 12 with access time ~2x.
Why is the above happening?
Additional Information
Even more puzzling are the following environments and their outputs:
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=26
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=48
Any help in understanding this phenomenon would be much appreciated.
To put it shortly, the benchmark is flawed.
perform multiple experiments to determine the minimum access time of each thread
The term "minimum access time" is unclear here. I assume you mean "latency". The thing is your benchmark does not measure the latency. volatile tell to the compiler to read store data from the memory hierarchy. The processor is free to store the value in its cache and x86-64 processors actually do that (like almost all modern processors).
How do OMP_PLACES and OMP_PROC_BIND work?
You can find the documentation of both here and there. Put shortly, I strongly advise you to set OMP_PROC_BIND=TRUE and OMP_PLACES="{0},{1},{2},..." based on the values retrieved from hwloc. More specifically, you can get them from hwloc-calc, which is a really great tool (consider using --li, --po, and PU, not CORE, because that is what OpenMP runtimes expect). For example, you can query the PU identifiers of a given NUMA node. Note that some machines have very weird, non-linear OS PU numbering, and OpenMP runtimes sometimes fail to map the threads correctly. IOMP (the OpenMP runtime of ICC) should use hwloc internally, but I have found bugs related to that in the past. To check that the mapping is correct, I advise you to use hwloc-ps. Note that OMP_PLACES=cores does not guarantee that threads do not migrate from one core to another (even one on a different NUMA node) unless OMP_PROC_BIND=TRUE (or a similar setting) is set. Note that you can also use numactl to control the NUMA policies of your process; for example, you can tell the OS not to use a given NUMA node or to interleave the allocations. First touch is not the only policy and may not be the default on all platforms (on some Linux platforms, the OS can move pages between NUMA nodes to improve locality).
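As a hypothetical illustration of that advice (the exact invocation and the PU ids below are placeholders, not taken from the original post): something along the lines of hwloc-calc --li --po --intersect PU node:0 should print the OS (physical) PU ids belonging to NUMA node 0, and those ids can then be fed into the places list, e.g. OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3}" for a four-thread run pinned to that node.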
Why is the above happening?
The code takes 4.38 ms to read 50 MiB in each thread. That amounts to 1200 MiB read from node 0, assuming the first-touch policy is applied, so the throughput should be about 267 GiB/s. While this seems fine at first glance, it is a pretty big throughput for such a processor, especially if only one NUMA node is used. This is almost certainly because part of the fetches are served from the L3 cache and not from RAM. Indeed, the cache can hold part of the array, and certainly does, resulting in faster fetches thanks to cache associativity and a good replacement policy. This is especially true since the cache lines are never invalidated, as the array is only read. I advise you to use a significantly bigger array to prevent this effect.
You certainly expect one NUMA node to have a smaller throughput due to remote NUMA memory accesses. This is not always true in practice. In fact, it is often wrong on modern 2-socket systems, since the socket interconnect is often not a limiting factor (when it is, it is the main source of throughput slowdown on NUMA systems).
NUMA effects arise on modern platforms because of unbalanced NUMA memory node saturation and non-uniform latency. The former is not a problem in your application since all the PUs use the same NUMA memory node. The latter is not a problem either because of the linear memory access pattern, the CPU caches and the hardware prefetchers: the latency should be completely hidden.
Even more puzzling are the following environments and their outputs
Using 26 threads on a 24-core machine means that 4 of the threads have to share two cores, two threads per core. The thing is, hyper-threading should not help much in such a case, so the threads sharing a core will be slowed down. Because IOMP certainly pins threads to cores and the work per thread is the same, those 4 threads will be about twice as slow, which makes the result unbalanced.
Having 48 threads causes all the threads to be slower, because every core then runs two threads and thus has twice as much work to do.
Let me address your first sentence. A C++ std::vector is different from a C malloc. Malloc'ed space is not "instantiated": only when you touch the memory does the physical-to-logical address mapping get established. This is known as "first touch", and it is why in C/OpenMP you initialize an array in parallel, so that the socket touching a given part of the array gets the pages of that part. In C++, the "array" in a vector is created by a single thread, so the pages wind up on the socket of that thread.
Here's a solution:
template<typename T>
struct uninitialized {
    uninitialized() {};
    T val;
    constexpr operator T() const { return val; };
    T operator=(const T& v) { val = v; return val; };
};
Now you can create a vector<uninitialized<double>> and the array memory is not touched until you explicitly initialize it:
vector<uninitialized<double>> x(N),y(N);
#pragma omp parallel for
for (int i=0; i<N; i++)
y[i] = x[i] = 0.;
x[0] = 0; x[N-1] = 1.;
Now, I'm not sure how this goes if you have a vector of sets. Just thought I'd point out the issue.
After more investigation, I note the following:
work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
memory page migration is a thing on modern NUMA systems.
Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lay. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it may lay its own first touch after allocating the memory.
// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
if (data == nullptr)
{
throw std::runtime_error("could_not_allocate_memory_error");
}
// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
if (omp_get_thread_num() == 0)
{
#pragma omp simd safelen(8)
for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
{
data[d_index] = -1.0;
}
}
} // #pragma omp parallel
mlock(data, total_size * sizeof(double)); // lock the pages (length is in bytes); page migration is a real thing...
// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// seed each thread's generator
generator.seed(thread_counter + 1);
// calculate the minimum access latency of this thread
auto this_thread_avg_latency = 0.0;
const auto experiment_count = 250;
for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
{
const auto start_timestamp = omp_get_wtime() * 1E+6;
for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
{
const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread-1)(generator);
auto& datapoint = data[thread_counter * size_per_thread + index];
datapoint += index;
}
const auto end_timestamp = omp_get_wtime() * 1E+6;
this_thread_avg_latency += end_timestamp - start_timestamp;
}
this_thread_avg_latency /= experiment_count;
#pragma omp critical
{
thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
}
}
} // #pragma omp parallel
std::free(data);
With these changes, I am noticing the difference I expected.
Further notes:
this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
there is no reliable cross-platform way of doing this (requires kernel-APIs),
OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu number them (logical IDs?).
The most astonishing things are that the difference in latencies is very small and that memory page migration may happen, which raises the question: why should we care about first touch and all the other NUMA concerns at all?

Large and math-heavy C++ program is ~10% faster or slower randomly

The program
I have a C++ program that looks something like the following:
<load data from disk, etc.>
// Get some buffers aligned to 4 KiB
double* const x_a = static_cast<double*>(std::aligned_alloc(......));
double* const p = static_cast<double*>(std::aligned_alloc(......));
double* const m = static_cast<double*>(std::aligned_alloc(......));
double sum = 0.0;
const auto timerstart = std::chrono::steady_clock::now();
for(uint32_t i = 0; i<reps; i++){
uint32_t pos = 0;
double factor;
if((i%2) == 0) factor = 1.0; else factor = -1.0;
for(uint32_t j = 0; j<xyzvec.size(); j++){
pos = j*basis::ndist; //ndist is a compile-time constant == 36
for(uint32_t k =0; k<basis::ndist; k++) x_a[k] = distvec[k+pos];
sum += factor*basis::energy(x_a, &coeff[0], p, m);
}
}
const auto timerstop = std::chrono::steady_clock::now();
<free memory, print stats, etc.>
where reps is a single digit number, xyzvec has ~15k elements, and a single call to basis::energy(...) takes about 100 µs to return. The energy function is huge in terms of code size (~5 MiB of source code that looks something like this, it's from a code generator).
Edit: The m array is somewhat large, ~270 KiB for this test case.
Edit 2: Source code of the two functions responsible for ~90% of execution time
All of the pointers entering energy are __restrict__-qualified and declared to be aligned via __assume_aligned(...), the object files are generated with -Ofast -march=haswell to allow the compiler to optimize and vectorize at will. Profiling suggests the function is currently frontend-bound (L1i cache miss, and fetch/decode).
energy does no dynamic memory allocation or IO; it mostly reads/writes x_a, m and p (x_a is const), all of which are aligned to 4 KiB page boundaries. Its execution time ought to be pretty consistent.
The strange timing behaviour
Running the program many times, and looking at the time elapsed between the timer start/stop calls above, I have found it to have a strange bimodal distribution.
Calls to energy are either "fast" or "slow": fast ones take ~91 µs, slow ones ~106 µs, on an Intel Skylake-X 7820X.
All calls to energy in a given process are either fast or slow; the metaphorical coin is flipped once, when the process starts.
The outcome is not quite random, and can be heavily biased towards the "fast" case by purging all kernel caches via echo 3 | sudo tee /proc/sys/vm/drop_caches immediately before execution.
The random effect may be CPU dependent. Running the same executable on a Ryzen 1700X yields both faster and much more consistent execution. The "slow" runs either don't happen or their prominence is much reduced. Both machines are running the same OS. (Ubuntu 20.04 LTS, 5.11.0-41-generic kernel, mitigations=off)
What could be the cause?
Data alignment (dubious, the arrays intensively used are aligned)
Code alignment (maybe, but I have tried printing the function pointer of energy, no correlation with speed)
Cache aliasing?
JCC erratum?
Interrupts, scheduler activity?
Some cores turbo boosting higher? (probably not, tried launching it bound to a core with taskset and tried all cores one by one, could not find one that was always "fast")
???
Edit
Zero-filling x_a, p and m before first use appears to make no difference to the timing pattern.
Replacing (i % 2) with factor *= -1.0 appears to make no difference to the timing pattern.

Throughput of trivially concurrent code does not increase with the number of threads

I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to be making a fundamental mistake: the throughput decreases instead of increasing with the number of threads, no matter what operation I try to benchmark.
Below you can see code that tries to benchmark the speed of a for loop; as such, I would expect it to scale (somewhat) linearly with the number of threads, but it doesn't (compiled on a dual-core laptop with and without the -O3 flag on g++ with C++11).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>
thread_local const int OPS = 10000;
thread_local const int TIMES = 200;
double get_tp(int THREADS)
{
double threadtime[THREADS] = {0};
//Repeat the test many times
for(int iteration = 0; iteration < TIMES; iteration++)
{
#pragma omp parallel num_threads(THREADS)
{
double start, stop;
int loc_ops = OPS/float(THREADS);
int t = omp_get_thread_num();
//Force all threads to start at the same time
#pragma omp barrier
start = omp_get_wtime();
//Do a certain kind of operations loc_ops times
for(int i = 0; i < loc_ops; i++)
{
//Here I would put the operations to benchmark
//in this case a boring for loop
int x = 0;
for(int j = 0; j < 1000; j++)
x++;
}
stop = omp_get_wtime();
threadtime[t] += stop-start;
}
}
double total_time = 0;
std::cout << "\nThread times: ";
for(int i = 0; i < THREADS; i++)
{
total_time += threadtime[i];
std::cout << threadtime[i] << ", ";
}
std::cout << "\nTotal time: " << total_time << "\n";
double mopss = float(OPS)*TIMES/total_time;
return mopss;
}
int main()
{
std::cout << "\n1 " << get_tp(1) << "ops/s\n";
std::cout << "\n2 " << get_tp(2) << "ops/s\n";
std::cout << "\n4 " << get_tp(4) << "ops/s\n";
std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
Output with -O3 on a dual-core, so we don't expect the throughput to increase beyond 2 threads; but it does not even increase when going from 1 to 2 threads, it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
To make sure that the compiler does not remove the loop, I also tried outputting "x" after measuring the time and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.
I'm not sure why you define performance as the total number of operations per total CPU time and then are surprised that it is a decreasing function of the number of threads. This will almost always and universally be the case, except when cache effects kick in. The true performance metric is the number of operations per wall-clock time.
It is easy to show with simple mathematical reasoning. Given a total amount of work W and the processing capability of each core P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the overhead per thread induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n * (H / P) = T_1 + n * (H / P). The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise; for example, launching the OpenMP threads takes time. You cannot get rid of the overhead; you can only amortise it over the lifetime of the threads by making sure that each one gets a lot to work on. Therefore T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit into the lower-level caches but that for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up. You are measuring inside the thread function, so you ignore the value of H and T_n should more or less be equal to T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely last-level cache (if any), memory bandwidth, and thermal envelope.
The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but it becomes the bottleneck when the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is sparse matrix-vector multiplication (spMVM): a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth that one can completely saturate the memory bus with as few as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will clock the cores up or down depending on how many of them are active. Therefore, while n clocked-down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.
With all this in mind, there is one last (but not least) thing to consider: timer resolution and measurement noise. Your run times are a couple of microseconds each. Unless your code is running on some specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
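A minimal sketch of the kind of measurement described above (wall-clock time around the whole parallel loop, and enough work to run for seconds; the loop body is only a placeholder for the data-structure operations actually under test):
#include <omp.h>
#include <cstdio>

int main() {
    const long long ops = 2000000000LL;   // enough work to run for seconds rather than microseconds
    long long sink = 0;                   // aggregate a result so the work cannot be thrown away
    const double start = omp_get_wtime(); // wall-clock timestamp, taken once outside the loop
    #pragma omp parallel for reduction(+:sink)
    for (long long i = 0; i < ops; ++i)
        sink += i % 7;                    // placeholder work; substitute the real operations here
    const double wall = omp_get_wtime() - start;
    std::printf("%.1f Mops/s (wall clock), sink = %lld\n", ops / wall / 1e6, sink);
    return 0;
}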
The loop is almost certainly still being optimized away, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction, since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>
int main()
{
int x = 0;
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
++x;
}
}
std::cout << x << '\n';
return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>
int main()
{
std::mt19937 r;
std::uniform_real_distribution<double> dist{0, 1};
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
dist(r);
}
}
return 0;
}
Both loops and the PRNG invocation are left intact here.

no speedup using openmp + SIMD

I am new to OpenMP and am now trying to use OpenMP + SIMD intrinsics to speed up my program, but the result is far from my expectations.
In order to simplify the case without losing much essential information, I wrote a simpler toy example:
#include <omp.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <sys/time.h>
#include "immintrin.h" // for SIMD intrinsics
int main() {
int64_t size = 160000000;
std::vector<int> src(size);
// generating random src data
for (int i = 0; i < size; ++i)
src[i] = (rand() / (float)RAND_MAX) * size;
// to store the final results, so size is the same as src
std::vector<int> dst(size);
// get pointers for vector load and store
int * src_ptr = src.data();
int * dst_ptr = dst.data();
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
omp_set_num_threads(4); // you can change thread count here
// only measure the parallel part
struct timeval one, two;
double get_time;
gettimeofday (&one, NULL);
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
// load needed data
vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
// computation part
vec_dst = _mm256_add_epi32(vec_src, vec_op);
vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
vec_dst = _mm256_slli_epi32(vec_dst, 1);
vec_dst = _mm256_add_epi32(vec_dst, vec_src);
vec_dst = _mm256_sub_epi32(vec_dst, vec_src);
// store results
_mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_dst);
}
gettimeofday(&two, NULL);
double oneD = one.tv_sec + (double)one.tv_usec * .000001;
double twoD = two.tv_sec + (double)two.tv_usec * .000001;
get_time = 1000 * (twoD - oneD);
std::cout << "took time: " << get_time << std::endl;
// output one random element in case the computation is optimized out
int64_t i = (int64_t)((rand() / (float)RAND_MAX) * size);
std::cout << i << ": " << dst[i] << std::endl;
return 0;
}
It is compiled using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp test.cpp -o test and the elapsed time of the parallel part is measured. The result is as follows (the median value is picked out of 5 runs each):
1 thread: 92.519
2 threads: 89.045
4 threads: 90.361
The computation seems embarrassingly parallel, as different threads can load their data simultaneously given different indices, and the same holds for writing the results, so why is there no speedup?
More information:
I checked the assembly code using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp -S test.cpp and found vectorized instructions are generated;
To check if it is memory bound, I commented out the computation part of the loop, and the measured time decreased to around 60 (ms), but it does not change much if I change the thread count from 1 -> 2 -> 4.
Any advice or clue is welcome.
EDIT-1:
Thank #JerryCoffin for pointing out the possible cause, so I did the Memory Access Analysis using Vtune. Here are the results:
1-thread: Memory Bound: 6.5%, L1 Bound: 0.134, L3 Latency: 0.039
2-threads: Memory Bound: 18.0%, L1 Bound: 0.115, L3 Latency: 0.015
4-threads: Memory Bound: 21.6%, L1 Bound: 0.213, L3 Latency: 0.003
It is an Intel 4770 Processor with 25.6GB/s (23GB/s measured by Vtune) max. bandwidth. The memory bound does increase, but I am still not sure if that is the cause. Any advice?
EDIT-2 (just trying to give thorough information, so the appended stuff can be long but not tedious hopefully):
Thanks for the suggestions from #PaulR and #bazza. I tried 3 ways for comparison. One thing to note is that the processor has 4 cores and 8 hardware threads. Here are the results:
(1) just initialize dst as all zeros in advance: 1 thread: 91.922; 2 threads: 93.170; 4 threads: 93.868 --- seems not effective;
(2) without (1), put the parallel part in an outer loop over 100 iterations, and measure the time of the 100 iterations: 1 thread: 9109.49; 2 threads: 4951.20; 4 threads: 2511.01; 8 threads: 2861.75 --- quite effective except for 8 threads;
(3) based on (2), put one more iteration before the 100 iterations, and measure the time of the 100 iterations: 1 thread: 9078.02; 2 threads: 4956.66; 4 threads: 2516.93; 8 threads: 2088.88 --- similar with (2) but more effective for 8 threads.
It seems more iterations can expose the advantages of OpenMP + SIMD, but the computation / memory access ratio is unchanged regardless of the loop count, and locality does not seem to be the reason either, since src and dst are too large to stay in any cache, so there is no reuse between consecutive iterations.
Any advice?
EDIT 3:
To avoid being misleading, one thing needs to be clarified: in (2) and (3), the OpenMP directive is outside the added outer loop
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int k = 0; k < 100; ++k) {
for (int64_t i = 0; i < size; i += 8) {
......
}
}
i.e. the outer loop is parallelized across threads, while the inner loop is still processed serially within each thread. So the effective speedup in (2) and (3) might come from enhanced locality among threads.
I did another experiment where the OpenMP directive is put inside the outer loop:
for (int k = 0; k < 100; ++k) {
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
......
}
}
and the speedup is still not good: 1 thread: 9074.18; 2 threads: 8809.36; 4 threads: 8936.89; 8 threads: 9098.83.
Problem still exists. :(
EDIT-4:
If I replace the vectorized part with scalar operations like this (the same calculations but in scalar way):
#pragma omp parallel for
for (int64_t i = 0; i < size; i++) { // not i += 8
int query = src[i];
int res = src[i] + 2;
res = res * query;
res = res << 1;
res = res + query;
res = res - query;
dst[i] = res;
}
The speedup is: 1 thread: 92.065; 2 threads: 89.432; 4 threads: 88.864. May I conclude that this seemingly embarrassingly parallel code is actually memory bound (the bottleneck being load / store operations)? If so, why can't the load / store operations be parallelized well?
May I conclude that this seemingly embarrassingly parallel code is actually memory bound (the bottleneck being load / store operations)? If so, why can't the load / store operations be parallelized well?
Yes, this problem is embarrassingly parallel in the sense that it is easy to parallelize due to the lack of dependencies. That doesn't imply that it will scale perfectly. You can still have a bad initialization-overhead-to-work ratio, or shared resources limiting your speedup.
In your case, you are indeed limited by memory bandwidth. A practical consideration first: when compiled with icpc (16.0.3 or 17.0.1), the "scalar" version yields better code when size is made constexpr. This is not because it optimizes away these two redundant lines:
res = res + query;
res = res - query;
It does, but that makes no difference. Mainly, the compiler uses exactly the same instructions that you do with the intrinsics, except for the store. For the store, it uses vmovntdq instead of vmovdqu, making use of sophisticated knowledge about the program, the memory and the architecture. Not only does vmovntdq require aligned memory and can therefore be more efficient; it also gives the CPU a non-temporal hint, preventing this data from being cached during the write to memory. This improves performance because writing to the cache requires loading the remainder of the cache line from memory. So while your initial SIMD version requires three memory operations (reading the source, reading the destination cache line, writing the destination), the compiler version with the non-temporal store requires only two. In fact, on my i7-4770 system, the compiler-generated version reduces the runtime at 2 threads from ~85.8 ms to 58.0 ms, an almost perfect 1.5x speedup. The lesson here is to trust your compiler unless you know the architecture and instruction set extremely well.
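For reference, a hand-written sketch of the same loop with a streaming store (an illustration, not the compiler's actual output: it assumes dst_ptr is 32-byte aligned, e.g. allocated with _mm_malloc or aligned_alloc, because _mm256_stream_si256 faults on unaligned addresses, and it reuses src_ptr, dst_ptr and size from the question's code):
// Same arithmetic as the question's loop, but the result is written with a
// non-temporal (streaming) store, which is what vmovntdq corresponds to.
#pragma omp parallel for
for (int64_t i = 0; i < size; i += 8) {
    const __m256i v_op  = _mm256_set1_epi32(2);
    const __m256i v_src = _mm256_loadu_si256((const __m256i *)(src_ptr + i));
    __m256i v_dst = _mm256_add_epi32(v_src, v_op);
    v_dst = _mm256_mullo_epi32(v_dst, v_src);
    v_dst = _mm256_slli_epi32(v_dst, 1);
    v_dst = _mm256_add_epi32(v_dst, v_src);
    v_dst = _mm256_sub_epi32(v_dst, v_src);
    _mm256_stream_si256((__m256i *)(dst_ptr + i), v_dst); // bypasses the cache: no read-for-ownership of dst
}
_mm_sfence(); // make the streaming stores visible before dst is read elsewhere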
Considering peak performance here: 58 ms for transferring 2*160000000*4 bytes corresponds to 22.07 GB/s (summing read and write), which is about the same as your VTune results (funnily enough, 85.8 ms corresponds to about the same bandwidth for two reads and one write). There isn't much more direct room for improvement.
To further improve performance, you would have to do something about the operations-per-byte ratio of your code. Remember that your processor can perform 217.6 GFLOP/s (I guess either the same or twice that for integer ops), but can only read & write 3.2 G int/s. That gives you an idea of how many operations you need to perform per element in order not to be limited by memory. So if you can, work on the data in blocks so that you can reuse data in the caches.
I cannot reproduce your results for (2) and (3). When I loop around the inner loop, the scaling behaves the same. The results look fishy, particularly in light of the results otherwise being so consistent with peak performance. Generally, I recommend doing the measurement inside the parallel region, leveraging omp_get_wtime like this:
double one, two;
#pragma omp parallel
{
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
#pragma omp master
one = omp_get_wtime();
#pragma omp barrier
for (int kk = 0; kk < 100; kk++)
#pragma omp for
for (int64_t i = 0; i < size; i += 8) {
...
}
#pragma omp master
{
two = omp_get_wtime();
std::cout << "took time: " << (two-one) * 1000 << std::endl;
}
}
A final remark: desktop processors and server processors have very different characteristics regarding memory performance. On contemporary server processors, you need many more active threads to saturate the memory bandwidth, while on desktop processors a single core can often almost saturate it.
Edit: one more thought about VTune not classifying the code as memory-bound. This may be caused by the short computation time compared to the initialization. Try to see what VTune says about the code when it runs in a loop.

For loop performance and multithreaded performance questions

I was kind of bored, so I wanted to try using std::thread and eventually measure the performance of a single- and multi-threaded console application. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800,000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug build, so I changed to release and the difference was ~1 ms in favor of the index-based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a 1 ms difference in release mode?
I moved on to thread creation and tried to do a parallel sum of the array with this lambda (capturing everything by reference so I could use the vector of ints and a mutex declared earlier), using an index-based for.
auto func = [&](int start, int total, int index)
{
int partial_sum = 0;
auto s = chrono::high_resolution_clock::now();
for (int i = start; i < start + total; ++i)
partial_sum += ints[i];
auto e = chrono::high_resolution_clock::now();
auto d = e - s;
m.lock();
cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
sum += partial_sum;
m.unlock();
};
for (int i = 0; i < 8; ++i)
threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? A solution with 2 threads ended at ~30 ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 hardware threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, I forgot to mention that this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler, or whatever it's called.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid all the code being taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which in turn calls operator==, and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, go into turbo mode, or go into heat-protection mode. And rounding the times to milliseconds isn't really helpful: one thread could take 6.49 milliseconds and another 6.51, and they got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all so when optimized, the compiler should simply choose to throw away the code resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The 1 ms difference may be produced by the coarse granularity of high_resolution_clock in the standard library implementation used, and by differences in process scheduling during execution. I measured the index-based for as being 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows, I would like to point your attention to your available execution units and context switching.
An i7 does not have 8 real cores: it's a quad-core processor with hyper-threading, and HT does not magically double the available compute, no matter how it's advertised. It's a really clever system which tries to fit in instructions from a second hardware thread whenever a core's execution units would otherwise sit idle, but in the end all instructions go through only four cores.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS (admittedly most of which are asleep) that also need time, and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check if Visual Studio has Iterator checking enabled in debugging. When it's enabled every time an iterator is used it is bounds-checked and such. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the OpenMP switch (/openmp in Visual Studio). If you enable it and apply the OpenMP #pragmas to your for loops, you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool :) ) and OpenMP's performance is pretty damn good.
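For illustration, a minimal sketch of what that looks like for the original index-based sum (assuming the same ints vector as in the question; compile with /openmp in Visual Studio or -fopenmp with g++):
// OpenMP version of the sum: no manual std::thread management needed.
// reduction(+:sum) gives every thread its own partial sum and combines them at the end.
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 800000; ++i)
    sum += ints[i];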
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated. (I see no thread.start() or threads.StartAll() or alike). So, during the time the first thread instance gets active, the main thread might (or might not) be preempted. There is no guarantee that the threads are getting spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out, that typically, it starts to pay off to go parallel when the computation times are way beyond the time of a time slice (rule of thumb).
Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here the output on my machine of this code:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more of a memory I/O benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local, fit into the CPU caches, etc.
Maybe it would be interesting to experiment with the splitting of the data into ranges. What if each thread were "jumping" over the data from start to end with a stride of ncores? Thread 1: 0, 8, 16, ...; Thread 2: 1, 9, 17, ...; etc. Maybe then the "locality" of the memory accesses could gain extra speed.