I'm writing a path tracer as a programming exercise. Yesterday I finally decided to implement multithreading - and it worked well. However, once I wrapped the test code I wrote inside main() in a separate renderer class, I noticed a significant and consistent performance drop. In short - it would seem that filling std::vector anywhere outside of main() causes threads using its elements to perform worse. I managed to isolate and reproduce the issue with simplified code, but unfortunately I still don't know why it happens or what to do in order to fix it.
Performance drop is quite visible and consistent:
97 samples - time = 28.154226s, per sample = 0.290250s, per sample/th = 1.741498
99 samples - time = 28.360723s, per sample = 0.286472s, per sample/th = 1.718832
100 samples - time = 29.335468s, per sample = 0.293355s, per sample/th = 1.760128
vs.
98 samples - time = 30.197734s, per sample = 0.308140s, per sample/th = 1.848841
99 samples - time = 30.534240s, per sample = 0.308427s, per sample/th = 1.850560
100 samples - time = 30.786519s, per sample = 0.307865s, per sample/th = 1.847191
The code I originally posted in this question can be found here: https://github.com/Jacajack/rt/tree/mt_debug or in edit history.
I created a struct foo that is supposed to mimic the behavior of my renderer class and is responsible for initialization of path tracing contexts in its constructor.
The interesting thing is, when I remove the body of foo's constructor and instead do this (initialize contexts directly from main()):
std::vector<rt::path_tracer> contexts; // Can be on stack or on heap, doesn't matter
foo F(cam, scene, bvh, width, height, render_threads, contexts); // no longer fills `contexts`
contexts.reserve(render_threads);
for (int i = 0; i < render_threads; i++)
contexts.emplace_back(cam, scene, bvh, width, height, 1000 + i);
F.run(render_threads);
the performance is back to normal. But then, if I wrap these three lines into a separate function and call it from here, it's worse again. The only pattern I can see here is that filling the contexts vector outside of main() causes the problem.
I initially thought that this was an alignment/caching issue, so I tried aligning path_tracers with Boost's aligned_allocator and TBB's cache_aligned_allocator with no result. It turns out that this problem persists even when there's only one thread running.
I suspect it must be some kind of wild compiler optimization (I'm using -O3), although that's just a guess. Do you know any possible causes of such behavior and what can be done to avoid it?
This happens on both gcc 10.1.0 and clang 10.0.0. Currently I'm only using -O3.
I managed to reproduce a similar issue in this standalone example:
#include <iostream>
#include <thread>
#include <random>
#include <algorithm>
#include <chrono>
#include <iomanip>
#include <vector>
#include <functional> // std::ref
struct foo
{
std::mt19937 rng;
std::uniform_real_distribution<float> dist;
std::vector<float> buf;
int cnt = 0;
foo(int seed, int n) :
rng(seed),
dist(0, 1),
buf(n, 0)
{
}
void do_stuff()
{
// Do whatever
for (auto &f : buf)
f = (f + 1) * dist(rng);
cnt++;
}
};
int main()
{
int N = 50000000;
int thread_count = 6;
struct bar
{
std::vector<std::thread> threads;
std::vector<foo> &foos;
bool active = true;
bar(std::vector<foo> &f, int thread_count, int n) :
foos(f)
{
/*
foos.reserve(thread_count);
for (int i = 0; i < thread_count; i++)
foos.emplace_back(1000 + i, n);
//*/
}
void run(int thread_count)
{
auto task = [this](foo &f)
{
while (this->active)
f.do_stuff();
};
threads.reserve(thread_count);
for (int i = 0; i < thread_count; i++)
threads.emplace_back(task, std::ref(foos[i]));
}
};
std::vector<foo> foos;
bar B(foos, thread_count, N);
///*
foos.reserve(thread_count);
for (int i = 0; i < thread_count; i++)
foos.emplace_back(1000 + i, N);
//*/
B.run(thread_count);
std::vector<float> buffer(N, 0);
int samples = 0, last_samples = 0;
// Start time
auto t_start = std::chrono::high_resolution_clock::now();
while (1)
{
last_samples = samples;
samples = 0;
for (auto &f : foos)
{
std::transform(
f.buf.cbegin(), f.buf.cend(),
buffer.begin(),
buffer.begin(),
std::plus<float>()
);
samples += f.cnt;
}
if (samples != last_samples)
{
auto t_now = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_total = t_now - t_start;
std::cerr << std::setw(4) << samples << " samples - time = " << std::setw(8) << std::fixed << t_total.count()
<< "s, per sample = " << std::setw(8) << std::fixed << t_total.count() / samples
<< "s, per sample/th = " << std::setw(8) << std::fixed << t_total.count() / samples * thread_count << std::endl;
}
}
}
and results:
For N = 100000000, thread_count = 6
In main():
196 samples - time = 26.789526s, per sample = 0.136681s, per sample/th = 0.820088
197 samples - time = 27.045646s, per sample = 0.137288s, per sample/th = 0.823725
200 samples - time = 27.312159s, per sample = 0.136561s, per sample/th = 0.819365
vs.
In foo::foo():
193 samples - time = 22.690566s, per sample = 0.117568s, per sample/th = 0.705406
196 samples - time = 22.972403s, per sample = 0.117206s, per sample/th = 0.703237
198 samples - time = 23.257542s, per sample = 0.117462s, per sample/th = 0.704774
200 samples - time = 23.540432s, per sample = 0.117702s, per sample/th = 0.706213
It seems that the results are the opposite of what is happening in my path tracer, but the visible difference is still here.
Thank you
There is a race condition on foo::buf - one thread stores into it while another reads it. This is undefined behaviour, but on the x86-64 platform it is harmless in this particular code.
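For completeness, a minimal sketch of how the race could be removed if one wanted to. The per-foo std::mutex vector and the lock placement below are my additions, not part of the original code, and note that locking the whole do_stuff() pass changes the timing characteristics of the benchmark:
// A vector of per-foo locks, created next to the foos in main()
// (std::mutex is neither copyable nor movable, hence a separate vector):
std::vector<std::mutex> locks(thread_count);   // needs #include <mutex>
// Writer side - the thread task, now also given its index i:
auto task = [this, &locks](foo &f, int i)
{
    while (this->active)
    {
        std::lock_guard<std::mutex> lock(locks[i]);
        f.do_stuff();
    }
};
// Reader side - the accumulation loop in main(), for foo number i:
std::lock_guard<std::mutex> lock(locks[i]);
std::transform(foos[i].buf.cbegin(), foos[i].buf.cend(),
               buffer.begin(), buffer.begin(), std::plus<float>());
samples += foos[i].cnt;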
I cannot reproduce your observations on Intel i9-9900KS, both variants print the same per sample stats.
Compiled with gcc-8.4, g++ -o release/gcc/test.o -c -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG test.cc
With int N = 50000000; each thread operates on its own array of float[N] which occupies 200MB. Such a data set doesn't fit in CPU caches and the program incurs a lot of data cache misses because it needs to fetch the data from memory:
$ perf stat -ddd ./release/gcc/test
[...]
71474.813087 task-clock (msec) # 6.860 CPUs utilized
66 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
341,942 page-faults # 0.005 M/sec
357,027,759,875 cycles # 4.995 GHz (30.76%)
991,950,515,582 instructions # 2.78 insn per cycle (38.43%)
105,609,126,987 branches # 1477.571 M/sec (38.40%)
155,426,137 branch-misses # 0.15% of all branches (38.39%)
150,832,846,580 L1-dcache-loads # 2110.294 M/sec (38.41%)
4,945,287,289 L1-dcache-load-misses # 3.28% of all L1-dcache hits (38.44%)
1,787,635,257 LLC-loads # 25.011 M/sec (30.79%)
1,103,347,596 LLC-load-misses # 61.72% of all LL-cache hits (30.81%)
<not supported> L1-icache-loads
7,457,756 L1-icache-load-misses (30.80%)
150,527,469,899 dTLB-loads # 2106.021 M/sec (30.80%)
54,966,843 dTLB-load-misses # 0.04% of all dTLB cache hits (30.80%)
26,956 iTLB-loads # 0.377 K/sec (30.80%)
415,128 iTLB-load-misses # 1540.02% of all iTLB cache hits (30.79%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
10.419122076 seconds time elapsed
If you run this application on NUMA CPUs, such as AMD Ryzen and multi-socket Intel Xeon systems, then your observations can probably be explained by adverse placement of threads onto NUMA nodes remote from the NUMA node where foo::buf is allocated. Those last-level data cache misses have to go to memory, and if that memory is on a remote NUMA node the accesses take longer.
To fix that, you may like to allocate memory in the thread that uses it (not in the main thread as the code does) and use a NUMA-aware allocator, such as TCMalloc. See NUMA aware heap memory manager for more details.
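Applied to the standalone example above, the "allocate in the thread that uses it" idea could look roughly like this. This is a sketch only: the init(n) helper is hypothetical, the foo objects are assumed to be created empty up front, and the first-touch behaviour relies on Linux's default page placement policy:
// bar::run(), modified so each worker allocates/touches its own buffer.
// foo is assumed to gain a hypothetical init(n) that resizes buf; the
// first write to those pages then happens on the worker's NUMA node.
void run(int thread_count, int n)
{
    threads.reserve(thread_count);
    for (int i = 0; i < thread_count; i++)
    {
        threads.emplace_back([this, i, n]()
        {
            foos[i].init(n);        // first touch on this thread's node
            while (this->active)
                foos[i].do_stuff();
        });
    }
}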
When running your benchmark you may like to fix the CPU frequency, so that it doesn't get dynamically adjusted during the run; on Linux you can do that with sudo cpupower frequency-set --related --governor performance.
Related
I am trying to use OpenMP to benchmark the speed of data structure that I implemented. However, I seem to make a fundamental mistake: the throughput decreases instead of increasing with the number of threads no matter what operation I try to benchmark.
Below you can see the code that tries to benchmark the speed of a for-loop; as such I would expect it to scale (somewhat) linearly with the number of threads, but it doesn't (compiled on a dual-core laptop with and without the -O3 flag on g++ with C++11).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>
thread_local const int OPS = 10000;
thread_local const int TIMES = 200;
double get_tp(int THREADS)
{
double threadtime[THREADS] = {0};
//Repeat the test many times
for(int iteration = 0; iteration < TIMES; iteration++)
{
#pragma omp parallel num_threads(THREADS)
{
double start, stop;
int loc_ops = OPS/float(THREADS);
int t = omp_get_thread_num();
//Force all threads to start at the same time
#pragma omp barrier
start = omp_get_wtime();
//Do a certain kind of operations loc_ops times
for(int i = 0; i < loc_ops; i++)
{
//Here I would put the operations to benchmark
//in this case a boring for loop
int x = 0;
for(int j = 0; j < 1000; j++)
x++;
}
stop = omp_get_wtime();
threadtime[t] += stop-start;
}
}
double total_time = 0;
std::cout << "\nThread times: ";
for(int i = 0; i < THREADS; i++)
{
total_time += threadtime[i];
std::cout << threadtime[i] << ", ";
}
std::cout << "\nTotal time: " << total_time << "\n";
double mopss = float(OPS)*TIMES/total_time;
return mopss;
}
int main()
{
std::cout << "\n1 " << get_tp(1) << "ops/s\n";
std::cout << "\n2 " << get_tp(2) << "ops/s\n";
std::cout << "\n4 " << get_tp(4) << "ops/s\n";
std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
Output with -O3 on a dual-core machine, so we don't expect the throughput to increase beyond 2 threads; but it does not even increase when going from 1 to 2 threads - it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
To make sure that the compiler does not remove the loop, I also tried outputting "x" after measuring the time and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.
I'm not sure why you are defining performance as the total number of operations per total CPU time and then are surprised that it is a decreasing function of the number of threads. This will almost always and universally be the case, except for when cache effects kick in. The true performance metric is the number of operations per wall-clock time.
It is easy to show with simple mathematical reasoning. Given total work W and processing capability of each core P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the per-thread overhead induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n (H / P) = T_1 + n (H / P). The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise. For example, launching the OpenMP threads takes time. You cannot get rid of the overhead, you can only amortise it over the lifetime of the threads by making sure that each one gets a lot to work on.
Therefore, T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit in the lower-level caches but that for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up. You are measuring inside the thread function, so you ignore the value of H and T_n should be more or less equal to T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely last-level cache (if any), memory bandwidth, and thermal envelope.
The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but it becomes the bottleneck when the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is sparse matrix-vector multiplication (spMVM) - a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth that one can completely saturate the memory bus with as few as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
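As a rough back-of-the-envelope illustration (ballpark figures, not measurements): a CSR-format spMVM reads about 8 bytes for the value plus 8 bytes for the index per non-zero and performs 2 flops, i.e. roughly 0.125 flop/byte, while a CPU with on the order of 200 GFLOP/s of peak compute and 25 GB/s of memory bandwidth needs about 8 flop/byte to keep its arithmetic units busy. That is a gap of well over an order of magnitude, which is why a couple of threads already saturate the bus.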
When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will overclock or clock down the cores depending on how many of them are active. Therefore, while n clocked down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.
With all this in mind, there is one last (but not least) thing to consider - timer resolution and measurement noise. Your run times are on the order of a few microseconds. Unless your code is running on some specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
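Putting both points together (wall-clock metric, long enough run), here is a sketch of how the measurement in the question could be restructured. It reuses OPS and TIMES from the question's code and assumes the work inside the loop is not optimised away:
double get_tp_wallclock(int THREADS)
{
    // Time the whole run from outside the parallel region: one start/stop
    // pair of wall-clock timestamps instead of summing per-thread CPU times.
    double start = omp_get_wtime();
    for (int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma omp parallel num_threads(THREADS)
        {
            int loc_ops = OPS / float(THREADS);
            for (int i = 0; i < loc_ops; i++)
            {
                // operations to benchmark go here
            }
        }
    }
    double stop = omp_get_wtime();
    // operations per second of elapsed wall-clock time
    return double(OPS) * TIMES / (stop - start);
}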
The loop is almost certainly still getting optimized away, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>
int main()
{
int x = 0;
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
++x;
}
}
std::cout << x << '\n';
return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>
int main()
{
std::mt19937 r;
std::uniform_real_distribution<double> dist{0, 1};
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
dist(r);
}
}
return 0;
}
Both loops and the PRNG invocation are left intact here.
I am doing some image processing using the OpenCV library and I discovered that the time it takes to process an image depends on the amount of time I put my thread to sleep in between image processing. I measured the execution time of several parts of my program and I discovered that the function cv::remap() seems to execute two times slower if I put my thread to sleep for more than a certain time period.
Below is a minimal code snippet which shows the strange behavior. I measure the time it takes to execute the cv::remap() function and then I send my thread to sleep for the number of milliseconds set in sleep_time.
#include <opencv2/imgproc.hpp>
#include <thread>
#include <chrono>
#include <iostream>
int main(int argc, char **argv) {
cv::Mat src = ... // Init
cv::Mat dst = ... // Init
cv::Mat1f map_x = ... // Init;
cv::Mat1f map_y = ... // Init;
for (int i = 0; i < 5; i++) {
auto t1 = std::chrono::system_clock::now();
cv::remap(src, dst, map_x, map_y, cv::INTER_NEAREST, cv::BORDER_CONSTANT, 0);
auto t2 = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_time = t2 - t1;
std::cout << "elapsed time = " << elapsed_time.count() * 1e3 << " ms" << std::endl;
int sleep_time = 0;
// int sleep_time = 20;
// int sleep_time = 100;
std::this_thread::sleep_for( std::chrono::milliseconds(sleep_time));
}
return 0;
}
If sleep_time is set to 0 the processing takes about 5 ms. Here is the output.
elapsed time = 5.94945 ms
elapsed time = 5.7458 ms
elapsed time = 5.69947 ms
elapsed time = 5.68581 ms
elapsed time = 5.7218 ms
But if I set sleep_time to 100, the processing is more than two times slower.
elapsed time = 6.09076 ms
elapsed time = 13.2568 ms
elapsed time = 13.4524 ms
elapsed time = 13.3631 ms
elapsed time = 13.3581 ms
I tried out many different values for sleep_time and it seems that the execution time doubles when sleep_time is roughly three times higher than elapsed_time (sleep_time > 3 * elapsed_time). If I increase the complexity of the computation inside the function cv::remap() (e.g. increase the size of the processed image), then sleep_time can also be set to higher values before the execution time starts to double.
I am running my program on an embedded device with an ARM i.MX6 processor running Linux, but I was able to recreate the problem on my desktop running Ubuntu 16.04. I am using the compiler arm-angstrom-linux-gnueabi-gcc (GCC) 7.3.0 and OpenCV version 3.3.0.
Does anybody have an idea what is going on?
This is probably your CPU frequency scaling kicking in.
The default frequency governor on Linux is usually "ondemand", which means clock speed is scaled down when load on CPU is low, and scaled back up when load increases. As this process takes some time, your short computation bursts fail to bring the clock speed up, and your process effectively runs on a slower CPU than you actually have.
I have tested this theory on my machine by executing
sudo cpupower frequency-set -g performance
and the effect immediately disappeared. To set the governor back, execute
sudo cpupower frequency-set -g ondemand
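If changing the governor is not an option (e.g. no root on the embedded target), a hedged workaround is to keep the core busy during the pause instead of sleeping, so the clock never ramps down. A minimal sketch, my own illustration rather than a measured fix:
#include <chrono>
// Busy-wait for the given duration instead of sleeping. Keeps the
// "ondemand" governor from lowering the clock, at the cost of burning
// a core - only sensible when latency matters more than power.
static void busy_wait_ms(int ms)
{
    auto until = std::chrono::steady_clock::now() + std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < until)
        ; // spin
}
Replacing std::this_thread::sleep_for(std::chrono::milliseconds(sleep_time)) with busy_wait_ms(sleep_time) in the snippet above should keep the measured cv::remap() times close to the sleep_time = 0 case.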
I am new to OpenMP and I am now trying to use OpenMP + SIMD intrinsics to speed up my program, but the result is far from expectation.
In order to simplify the case without losing much essential information, I wrote a simpler toy example:
#include <omp.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <sys/time.h>
#include "immintrin.h" // for SIMD intrinsics
int main() {
int64_t size = 160000000;
std::vector<int> src(size);
// generating random src data
for (int i = 0; i < size; ++i)
src[i] = (rand() / (float)RAND_MAX) * size;
// to store the final results, so size is the same as src
std::vector<int> dst(size);
// get pointers for vector load and store
int * src_ptr = src.data();
int * dst_ptr = dst.data();
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
omp_set_num_threads(4); // you can change thread count here
// only measure the parallel part
struct timeval one, two;
double get_time;
gettimeofday (&one, NULL);
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
// load needed data
vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
// computation part
vec_dst = _mm256_add_epi32(vec_src, vec_op);
vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
vec_dst = _mm256_slli_epi32(vec_dst, 1);
vec_dst = _mm256_add_epi32(vec_dst, vec_src);
vec_dst = _mm256_sub_epi32(vec_dst, vec_src);
// store results
_mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_dst);
}
gettimeofday(&two, NULL);
double oneD = one.tv_sec + (double)one.tv_usec * .000001;
double twoD = two.tv_sec + (double)two.tv_usec * .000001;
get_time = 1000 * (twoD - oneD);
std::cout << "took time: " << get_time << std::endl;
// output one random element so the computation cannot be optimized out
int64_t i = (int)((rand() / (float)RAND_MAX) * size);
std::cout << i << ": " << dst[i] << std::endl;
return 0;
}
It is compiled using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp test.cpp -o test and the elapsed time of the parallel part is measured. The result is as follows (the median value is picked out of 5 runs each):
1 thread: 92.519
2 threads: 89.045
4 threads: 90.361
The computations seem embarrassingly parallel, as different threads can load their needed data simultaneously given different indices, and the case is similar for writing the results, but why no speedups?
More information:
I checked the assembly code using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp -S test.cpp and found vectorized instructions are generated;
To check if it is memory-bound, I commented the computation part in the loop, and the measured time decreased to around 60, but it does not change much if I change the thread count from 1 -> 2 -> 4.
Any advice or clue is welcome.
EDIT-1:
Thank #JerryCoffin for pointing out the possible cause, so I did the Memory Access Analysis using Vtune. Here are the results:
1-thread: Memory Bound: 6.5%, L1 Bound: 0.134, L3 Latency: 0.039
2-threads: Memory Bound: 18.0%, L1 Bound: 0.115, L3 Latency: 0.015
4-threads: Memory Bound: 21.6%, L1 Bound: 0.213, L3 Latency: 0.003
It is an Intel 4770 Processor with 25.6GB/s (23GB/s measured by Vtune) max. bandwidth. The memory bound does increase, but I am still not sure if that is the cause. Any advice?
EDIT-2 (just trying to give thorough information, so the appended stuff can be long but not tedious hopefully):
Thanks for the suggestions from #PaulR and #bazza. I tried 3 ways for comparison. One thing to note is that the processor has 4 cores and 8 hardware threads. Here are the results:
(1) just initialize dst as all zeros in advance: 1 thread: 91.922; 2 threads: 93.170; 4 threads: 93.868 --- seems not effective;
(2) without (1), put the parallel part in an outer loop over 100 iterations, and measure the time of the 100 iterations: 1 thread: 9109.49; 2 threads: 4951.20; 4 threads: 2511.01; 8 threads: 2861.75 --- quite effective except for 8 threads;
(3) based on (2), put one more iteration before the 100 iterations, and measure the time of the 100 iterations: 1 thread: 9078.02; 2 threads: 4956.66; 4 threads: 2516.93; 8 threads: 2088.88 --- similar with (2) but more effective for 8 threads.
It seems that more iterations can expose the advantages of OpenMP + SIMD, but the computation / memory access ratio is unchanged regardless of the loop count, and locality does not seem to be the reason either, since src and dst are too large to stay in any cache, so there is no data reuse between consecutive iterations.
Any advice?
EDIT 3:
To avoid being misleading, one thing needs to be clarified: in (2) and (3), the OpenMP directive is outside the added outer loop
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int k = 0; k < 100; ++k) {
for (int64_t i = 0; i < size; i += 8) {
......
}
}
i.e. the outer loop is parallelized using multithreads, and the inner loop is still serially processed. So the effective speedup in (2) and (3) might be achieved by enhanced locality among threads.
I did another experiment where the OpenMP directive is put inside the outer loop:
for (int k = 0; k < 100; ++k) {
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
......
}
}
and the speedup is still not good: 1 thread: 9074.18; 2 threads: 8809.36; 4 threads: 8936.89.93; 8 threads: 9098.83.
Problem still exists. :(
EDIT-4:
If I replace the vectorized part with scalar operations like this (the same calculations but in a scalar way):
#pragma omp parallel for
for (int64_t i = 0; i < size; i++) { // not i += 8
int query = src[i];
int res = src[i] + 2;
res = res * query;
res = res << 1;
res = res + query;
res = res - query;
dst[i] = res;
}
The result is 1 thread: 92.065; 2 threads: 89.432; 4 threads: 88.864. May I come to the conclusion that the seemingly embarrassingly parallel loop is actually memory bound (the bottleneck being load / store operations)? If so, why can't the load / store operations be parallelized well?
May I come to the conclusion that the seemingly embarrassingly parallel loop is actually memory bound (the bottleneck being load / store operations)? If so, why can't the load / store operations be parallelized well?
Yes this problem is embarrassingly parallel in the sense that it is easy to parallelize due to the lack of dependencies. That doesn't imply that it will scale perfectly. You can still have a bad initialization overhead vs work ratio or shared resources limiting your speedup.
In your case, you are indeed limited by memory bandwidth. A practical consideration first: when compiled with icpc (16.0.3 or 17.0.1), the "scalar" version yields better code when size is made constexpr. This is not due to the fact that it optimizes away these two redundant lines:
res = res + query;
res = res - query;
It does, but that makes no difference. Mainly, the compiler uses exactly the same instructions that you do with the intrinsics, except for the store. For the store, it uses vmovntdq instead of vmovdqu, making use of sophisticated knowledge about the program, the memory and the architecture. Not only does vmovntdq require aligned memory and can therefore be more efficient; it also gives the CPU a non-temporal hint, preventing this data from being cached during the write to memory. This improves performance, because writing to the cache requires loading the remainder of the cache line from memory. So while your initial SIMD version requires three memory operations (reading the source, reading the destination cache line, writing the destination), the compiler version with the non-temporal store requires only two. In fact, on my i7-4770 system, the compiler-generated version reduces the runtime at 2 threads from ~85.8 ms to 58.0 ms - an almost perfect 1.5x speedup. The lesson here is to trust your compiler unless you know the architecture and instruction set extremely well.
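For reference, the same non-temporal store can be requested explicitly from the intrinsics version with _mm256_stream_si256 (the streaming counterpart of the store used in the question). Note that it requires a 32-byte-aligned destination, which std::vector does not guarantee by default, so the sketch below assumes dst_ptr has been allocated with suitable alignment:
// Same loop body as in the question, but with a non-temporal store
// (compiles to vmovntdq). Assumes (dst_ptr + i) is 32-byte aligned.
vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
vec_dst = _mm256_add_epi32(vec_src, vec_op);
vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
vec_dst = _mm256_slli_epi32(vec_dst, 1);
vec_dst = _mm256_add_epi32(vec_dst, vec_src);
vec_dst = _mm256_sub_epi32(vec_dst, vec_src);
_mm256_stream_si256((__m256i *)(dst_ptr + i), vec_dst);
// After the loop: _mm_sfence(); to order the streaming stores.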
Considering peak performance here, 58 ms for transferring 2*160000000*4 bytes corresponds to 22.07 GB/s (summing read and write), which is about the same as your VTune results. (Funny enough, considering 85.8 ms is about the same bandwidth for two reads and one write.) There isn't much more direct room for improvement.
To further improve performance, you would have to do something about the operation / byte ratio of your code. Remember that your processor can perform 217.6 GFLOP/s (I guess either the same or twice as much for integer ops), but can only read & write 3.2 G int/s. That gives you an idea of how many operations you need to perform to not be limited by memory. So if you can, work on the data in blocks so that you can reuse data in caches.
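To make "blocks" concrete: if the real workload makes several passes over the data, doing all passes per cache-sized chunk turns most of the traffic into cache hits instead of memory traffic. A sketch with a made-up two-pass kernel and block size, not the code from the question:
const int64_t BLOCK = 32 * 1024;  // 32K ints = 128 KiB, fits comfortably in L2
#pragma omp parallel for
for (int64_t b = 0; b < size; b += BLOCK)
{
    const int64_t end = std::min(size, b + BLOCK);   // needs <algorithm>
    for (int64_t i = b; i < end; ++i)                // pass 1 over the block
        dst_ptr[i] = src_ptr[i] + 2;
    for (int64_t i = b; i < end; ++i)                // pass 2, data still in cache
        dst_ptr[i] *= src_ptr[i];
}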
I cannot reproduce your results for (2) and (3). When I loop around the inner loop, the scaling behaves the same. The results look fishy, particularly in light of the results otherwise being so consistent with peak performance. Generally, I recommend doing the measurement inside the parallel region, leveraging omp_get_wtime, like so:
double one, two;
#pragma omp parallel
{
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
#pragma omp master
one = omp_get_wtime();
#pragma omp barrier
for (int kk = 0; kk < 100; kk++)
#pragma omp for
for (int64_t i = 0; i < size; i += 8) {
...
}
#pragma omp master
{
two = omp_get_wtime();
std::cout << "took time: " << (two-one) * 1000 << std::endl;
}
}
A final remark: desktop processors and server processors have very different characteristics regarding memory performance. On contemporary server processors, you need many more active threads to saturate the memory bandwidth, while on desktop processors a single core can often almost saturate it.
Edit: One more thought about VTune not classifying it as memory-bound. This may be caused by the short computation time compared to the initialization. Try to see what VTune says about the code in a loop.
I'm creating a performance framework tool for measuring individual message processing time in CentOS 7. I reserved one CPU for this task with the isolcpus kernel option and I run it using taskset.
OK, now the problem. I'm trying to measure the max processing time among several messages. The processing time is <= 1000 ns, but when I run many iterations I get very high results (> 10000 ns).
Here I created some simple code which does nothing interesting but shows the problem. Depending on the number of iterations, I can get results like:
max: 84 min: 23 -> for 1000 iterations
max: 68540 min: 11 -> for 100000000 iterations
I'm trying to understand where this difference comes from. I tried to run this with real-time scheduling at the highest priority. Is there some way to prevent that?
#include <iostream>
#include <limits>
#include <algorithm> // std::max, std::min
#include <cstdint>   // int64_t
#include <time.h>
const unsigned long long SEC = 1000L*1000L*1000L;
inline int64_t time_difference( const timespec &start,
const timespec &stop ) {
return ( (stop.tv_sec * SEC - start.tv_sec * SEC) +
(stop.tv_nsec - start.tv_nsec));
}
int main()
{
timespec start, stop;
int64_t max = 0, min = std::numeric_limits<int64_t>::max();
for(int i = 0; i < 100000000; ++i){
clock_gettime(CLOCK_REALTIME, &start);
clock_gettime(CLOCK_REALTIME, &stop);
int64_t time = time_difference(start, stop);
max = std::max(max, time);
min = std::min(min, time);
}
std::cout << "max: " << max << " min: " << min << std::endl;
}
You can't really reduce jitter to zero even with isolcpus, since you still have at least the following:
1) Interrupts delivered to your CPU (you may be able to reduce this by messing with IRQ affinity - but probably not to zero).
2) Clock timer interrupts are still scheduled for your process and may do a variable amount of work on the kernel side.
3) The CPU itself may pause briefly for P-state or C-state transitions, or other reasons (e.g., to let voltage levels settle after turning on AVX circuitry, etc).
Let us check the documentation...
Isolation will be effected for userspace processes - kernel threads may still get scheduled on the isolcpus isolated CPUs.
So it seems that there is no guarantee of perfect isolation, at least not from the kernel.
Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting that with two threads the code would run about twice as fast. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: now using the clock() function on the computer to tell the time.
#include <iostream>
#include <vector>
#include <thread>
#include <cstdint> // uint32_t
#include <cstdlib> // EXIT_SUCCESS
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
// Program entry point
std::cout << "Generating data..." << std::endl;
// Create a vector containing many variables
std::vector<double> data;
for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
// Calculate mean using 1 core
double mean = 0;
std::cout << "Calculating mean, 1 Thread..." << std::endl;
findmean(&data, 0, data.size(), &mean);
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Repeat, using two threads
std::vector<std::thread> thread;
std::vector<double> result;
result.push_back(0.0);
result.push_back(0.0);
std::cout << "Calculating mean, 2 Threads..." << std::endl;
// Run threads
uint32_t halfsize = data.size() / 2;
uint32_t A = 0;
uint32_t B, C, D;
// Split the data into two blocks
if(data.size() % 2 == 0)
{
B = C = D = halfsize;
}
else if(data.size() % 2 == 1)
{
B = C = halfsize;
D = halfsize + 1;
}
// Run with two threads
thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
// Join threads
thread[0].join();
thread[1].join();
// Calculate result
mean = result[0] + result[1];
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Return
return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
holding += (*datavec).at(start + i);
}
*result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds - 2 threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
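A minimal sketch of removing the false sharing at the source, independent of optimization level: give each partial result its own cache line (alignas(64) assumes the usual 64-byte cache line on x86). The padded_result type is my addition, not part of the original code:
struct alignas(64) padded_result
{
    double value = 0.0;   // one result per cache line
};
padded_result results[2];
thread.push_back(std::thread(findmean, &data, A, B, &results[0].value));
thread.push_back(std::thread(findmean, &data, C, D, &results[1].value));
thread[0].join();
thread[1].join();
mean = results[0].value + results[1].value;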
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure, your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
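As a rough sanity check against the timings in the question: 1024 * 1024 * 128 doubles is 1 GiB of data, so the ~0.31-0.38 s single-thread times correspond to roughly 2.5-3.5 GB/s of sustained reads, plausible for the single-channel machines listed, and only the dual-channel configuration has enough headroom for two threads not to get in each other's way (0.35 s vs 0.56 s with two threads).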
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other place where this is happening. The std::vector<T>::at() function (as well as std::vector<T>::operator[]()) accesses the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
*result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For trivial work like this, a single thread is probably better.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data set is 128M doubles (about 1 GiB), which is not a lot for a modern processor to process in a single loop.