Why does a slower function run faster when surrounded by other function calls? - c++

Just a little C++ experiment; I confirmed the same behavior in Java.
Below is example code that reproduces this behavior, compiled with Visual Studio 2019, Release, x64. I got:
611 ms for just incrementing an element.
631 ms for incrementing an element through the cache, so an extra 20 ms of overhead.
But when I add a heavy operation before each increment (I chose random number generation), I get:
2073 ms for just incrementing an element.
1432 ms for incrementing an element through the cache.
I have an Intel 10700K CPU and 3200 MHz RAM, if it matters.
#include <iostream>
#include <random>
#include <chrono>
#include <cstdlib>

#define ARR_SIZE (256 * 256 * 256)
#define ACCESS_SIZE (256 * 256)
#define CACHE_SIZE 1024
#define ITERATIONS 1000

using namespace std;
using chrono::high_resolution_clock;
using chrono::duration_cast;
using chrono::milliseconds;

int* arr;
int* cache;
int counter = 0;

void flushCache() {
    for (int j = 0; j < CACHE_SIZE; ++j)
    {
        ++arr[cache[j]];
    }
    counter = 0;
}

void incWithCache(int i) {
    cache[counter] = i;
    ++counter;
    if (counter == CACHE_SIZE) {
        flushCache();
    }
}

void incWithoutCache(int i) {
    ++arr[i];
}

int heavyOp() {
    return rand() % 107;
}

int main()
{
    arr = new int[ARR_SIZE];
    cache = new int[CACHE_SIZE];
    int* access = new int[ACCESS_SIZE];
    random_device rd;
    mt19937 gen(rd());
    for (int i = 0; i < ACCESS_SIZE; ++i) {
        access[i] = gen() % (ARR_SIZE);
    }
    for (int i = 0; i < ARR_SIZE; ++i) {
        arr[i] = 0;
    }

    auto t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithoutCache(access[i]);
        }
    }
    auto t2 = high_resolution_clock::now();
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithoutCache(access[i]);
        }
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache and time between " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache and time between " << ms_int.count() << "ms\n";
    return 0;
}

I think these kinds of questions are extremely hard to answer - optimizing compilers, instruction reordering, and caching all make this hard to analyze - but I do have a hypothesis.
First, the difference between incWithoutCache and incWithCache without heavyOp seems reasonable - the second one is simply doing more work.
Where it gets interesting is when you introduce heavyOp.
heavyOp + incWithoutCache: incWithoutCache requires a fetch from memory before it can write to arr. Only when that fetch is complete can it do the addition. The processor may begin the next heavyOp before the increment is complete because of pipelining.
heavyOp + incWithCache: incWithCache does not require a fetch from memory in every iteration, as it only has to write out a value. The processor can queue that write to a memory controller and continue. It does do ++counter, but there you are always accessing the same value, so I assume it can be cached better than the ++arr[i] in incWithoutCache. When the cache is flushed, the flushing loop can be heavily pipelined - each iteration is independent, so many iterations will be in flight at a time.
So I think the big difference is that the actual writes to arr cannot be pipelined as efficiently without the cache, because heavyOp is trashing your pipeline and potentially your CPU cache. heavyOp takes the same amount of time in either case, but in heavyOp + incWithoutCache the amortized cost of a write to arr is higher, because it is not overlapped with other writes to arr the way it can be in heavyOp + incWithCache.
Vectorization could theoretically be used for the flushing operation, but I did not see it on Compiler Explorer, so it is probably not the cause of the discrepancy here - though if it were being applied, it could explain a speed difference like this.
I will say I am not an expert in this and could easily be completely wrong about all of it... but it makes sense to me.
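One way to probe this hypothesis: rand() chains every call through hidden global state, so consecutive heavyOp calls can never overlap. Below is a minimal sketch of a stateless replacement (the hash constant and function name are illustrative choices, not from the question) that does comparable arithmetic but has no loop-carried dependency - if the no-cache timing improves with it, the serial chain in rand() was part of the story:
#include <cstdint>

// Hypothetical stateless heavyOp: derives its value from the loop index with
// a multiplicative hash instead of rand()'s hidden global state, so
// consecutive calls are independent and free to overlap in the pipeline.
int heavyOpIndependent(uint32_t i) {
    uint32_t x = i * 2654435761u;     // Knuth's multiplicative hash constant
    x ^= x >> 16;                     // mix high bits into the low bits
    return static_cast<int>(x % 107); // same reduction as the original heavyOp
}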

Related

memcpy beats SIMD intrinsics

I have been looking at fast ways to copy various amounts of data when NEON vector instructions are available on an ARM device.
I've done some benchmarks and have some interesting results. I'm trying to understand what I'm looking at.
I have four versions that copy data:
1. Baseline
Copies element by element:
for (int i = 0; i < size; ++i)
{
    copy[i] = orig[i];
}
2. NEON
This code loads four values into a temporary register, then copies the register to the output.
Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found a way (an unrolled variant is sketched after version 4 below).
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
    tmp = vld1q_s32(orig + i);  // load 4 elements into tmp SIMD register
    vst1q_s32(&copy2[i], tmp);  // store 4 elements from tmp SIMD register
}
3. Stepped memcpy
Uses memcpy, but copies 4 elements at a time, to compare against the NEON version.
for (int i = 0; i < size; i+=4)
{
memcpy(orig+i, copy3+i, 4);
}
4. Normal memcpy
Uses memcpy on the full amount of data.
memcpy(orig, copy4, size);
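(An aside on version 2, referenced above: the temporary register cannot really be skipped, since data must pass through a register between load and store, but unrolling with several independent registers can expose more memory-level parallelism. A minimal sketch, assuming size is a multiple of 16:)
// Four independent load/store pairs per iteration; the CPU can overlap them
// because no register is reused before its store.
for (int i = 0; i < size; i += 16)
{
    int32x4_t a = vld1q_s32(orig + i);
    int32x4_t b = vld1q_s32(orig + i + 4);
    int32x4_t c = vld1q_s32(orig + i + 8);
    int32x4_t d = vld1q_s32(orig + i + 12);
    vst1q_s32(&copy2[i],      a);
    vst1q_s32(&copy2[i + 4],  b);
    vst1q_s32(&copy2[i + 8],  c);
    vst1q_s32(&copy2[i + 12], d);
}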
My benchmark using 2^16 values gave some surprising results:
1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]
The NEON speedup is expected; however, the faster stepped-memcpy time is surprising to me, and the time for version 4 even more so.
Why is memcpy doing so well? Does it use NEON under the hood? Or are there efficient memory-copy instructions I am not aware of?
This question discussed NEON versus memcpy(). However, I don't feel the answers sufficiently explore why the ARM memcpy implementation runs so well.
The full code listing is below:
#include <arm_neon.h>
#include <vector>
#include <cinttypes>
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>
int main(int argc, char *argv[]) {
int arr_size;
if (argc==1)
{
std::cout << "Please enter an array size" << std::endl;
exit(1);
}
int size = atoi(argv[1]); // not very C++, sorry
std::int32_t* orig = new std::int32_t[size];
std::int32_t* copy = new std::int32_t[size];
std::int32_t* copy2 = new std::int32_t[size];
std::int32_t* copy3 = new std::int32_t[size];
std::int32_t* copy4 = new std::int32_t[size];
// Non-neon version
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for (int i = 0; i < size; ++i)
{
copy[i] = orig[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// NEON version
begin = std::chrono::steady_clock::now();
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}
end = std::chrono::steady_clock::now();
std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// Memcpy example
begin = std::chrono::steady_clock::now();
for (int i = 0; i < size; i+=4)
{
memcpy(orig+i, copy3+i, 4);
}
end = std::chrono::steady_clock::now();
std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// Memcpy example
begin = std::chrono::steady_clock::now();
memcpy(orig, copy4, size);
end = std::chrono::steady_clock::now();
std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
return 0;
}
Note: this code uses memcpy in the wrong direction; it should be memcpy(dest, src, num_bytes). Note also that the size arguments look like element counts rather than byte counts: memcpy(orig+i, copy3+i, 4) moves only 4 bytes (a single int32), and memcpy(orig, copy4, size) moves size bytes, i.e. only a quarter of the array.
Because the "normal memcpy" test happens last, the massive order-of-magnitude speedup vs. the other tests would be explained by dead code elimination: the optimizer saw that orig is not used after the last memcpy, so it eliminated the memcpy.
A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead code elimination.
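A minimal sketch of what that looks like (assuming Google Benchmark is installed and the binary is linked with -lbenchmark; the 1 << 16 element count matches the question):
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Benchmarks the full-buffer copy with dead code elimination defeated:
// DoNotOptimize marks the destination as used, and ClobberMemory forces
// the pending writes to be treated as observable.
static void BM_memcpy(benchmark::State& state) {
    std::vector<std::int32_t> src(1 << 16, 1), dst(1 << 16);
    for (auto _ : state) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(std::int32_t));
        benchmark::DoNotOptimize(dst.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_memcpy);
BENCHMARK_MAIN();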

Parallelizing a small array is slower than parallelizing a large array?

I wrote a small program that generates random values for two valarrays, and in a for loop the values of those arrays are added into a new one.
However, when I use a small array size (20 elements) the parallel version takes significantly longer than the serial one, and when I use large arrays (200 000 elements) it takes roughly the same amount of time (the parallel version is always a bit slower, though).
Why is this?
The only reason I can think of is that with the large array the CPU puts it in the L3 cache and shares it across all cores, whereas with the small one it has to copy it around the lower cache levels? Or am I getting this wrong?
Here is the code:
#include <valarray>
#include <iostream>
#include <ctime>
#include <omp.h>
#include <chrono>

int main()
{
    int size = 2000000;
    std::valarray<double> num1(size), num2(size), result(size);
    std::srand(std::time(nullptr));
    std::chrono::time_point<std::chrono::high_resolution_clock> start, stop;
    std::chrono::microseconds duration;
    for (int i = 0; i < size; ++i) {
        num1[i] = std::rand();
        num2[i] = std::rand();
    }

    // Parallel execution
    start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Parallel for loop executed in: " << duration.count() << " microseconds" << std::endl;

    // Serial execution
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Serial for loop executed in: " << duration.count() << " microseconds" << std::endl;
}
Output with size = 200 000
Parallel for loop executed in: 2450 microseconds
Serial for loop executed in: 2726 microseconds
Output with size = 20
Parallel for loop executed in: 4727 microseconds
Serial for loop executed in: 0 microseconds
I'm using a Xeon E3-1230 v5 and compiling with Intel's compiler at maximum optimization, with Skylake-specific optimizations as well.
I get identical results with Visual Studio's C++ compiler.
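For what it's worth, a common way to sidestep the overhead at small sizes is OpenMP's if clause, which falls back to serial execution below a cutoff. A minimal sketch reusing the question's variables (the 10000-element threshold is an arbitrary assumption, not a tuned value):
// Parallelize only when the trip count is large enough to amortize the
// cost of waking and joining the thread team; otherwise run serially.
#pragma omp parallel for num_threads(8) if(size > 10000)
for (int i = 0; i < size; ++i) {
    result[i] = num1[i] + num2[i];
}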

Copying a local array is faster than an array from arguments in c++?

While optimizing some code I discovered some things that I didn't expect.
I wrote a simple piece of code to illustrate what I found below:
#include <string.h>
#include <chrono>
#include <iostream>

using namespace std;

int globalArr[1024][1024];

void initArr(int arr[1024][1024])
{
    memset(arr, 0, 1024 * 1024 * sizeof(int));
}

void run()
{
    int arr[1024][1024];
    initArr(arr);
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}

void run2(int arr[1024][1024])
{
    initArr(arr);
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}

int main()
{
    {
        auto start = chrono::high_resolution_clock::now();
        for (int i = 0; i < 256; ++i)
        {
            run();
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }
    {
        auto start = chrono::high_resolution_clock::now();
        for (int i = 0; i < 256; ++i)
        {
            int arr[1024][1024];
            run2(arr);
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run2) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }
    return 0;
}
I built the code with g++ 6.4.0 (20180424) with the -O3 flag.
Below is the result running on a Ryzen 1700:
(run) Total time: 43493 microseconds
(run2) Total time: 134740 microseconds
I tried to inspect the assembly with godbolt.org (code split across two URLs):
https://godbolt.org/g/aKSHH6
https://godbolt.org/g/zfK14x
But I still don't understand what actually made the difference.
So my questions are:
1. What's causing the performance difference?
2. Is it possible passing array in argument with the same performance as local array?
Edit:
Just some extra info: below is the result built with -O2.
(run) Total time: 94461 microseconds
(run2) Total time: 172352 microseconds
Edit again:
Following xaxxon's comment, I tried removing the initArr call from both functions, and now run2 is actually faster than run:
(run) Total time: 45151 microseconds
(run2) Total time: 35845 microseconds
But I still don't understand the reason.
What's causing the performance difference?
The compiler has to generate code for run2 that will continue to work correctly if you call
run2(globalArr);
or (worse), pass in some overlapping but non-identical address.
If you allow your C++ compiler to inline the call, and it chooses to do so, it'll be able to generate inlined code that knows whether the parameter really aliases your global. The out-of-line codegen still has to be conservative though.
Is it possible passing array in argument with the same performance as local array?
You can certainly fix the aliasing problem in C using the restrict keyword, like:
void run2(int (* restrict globalArr2)[256])
{
    int (* restrict g)[256] = globalArr1;
    for (int i = 0; i < 32; ++i)
    {
        for (int j = 0; j < 256; ++j)
        {
            g[i][j] = globalArr2[i][j];
        }
    }
}
(or probably in C++ using the non-standard extension __restrict).
This should allow the optimizer as much freedom as it had in your original run - unless it's smart enough to elide the local entirely and simply set the global to zero.
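For completeness, a sketch of what that might look like in C++ with __restrict, using the question's 1024x1024 arrays and its initArr and globalArr (the function name is made up, and __restrict support varies by compiler):
// Promises the compiler that arr does not alias globalArr, restoring the
// freedom the optimizer has in run(); calling this with globalArr would
// break that promise and invoke undefined behavior.
void run2_restrict(int (* __restrict arr)[1024])
{
    initArr(arr);
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}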

Memory allocation optimized away by compilers

In his talk "Efficiency with Algorithms, Performance with Data Structures", Chandler Carruth talks about the need for a better allocator model in C++. The current allocator model invades the type system and, because of that, makes it almost impossible to use in many projects. On the other hand, the Bloomberg allocator model does not invade the type system but is based on virtual function calls, which makes it impossible for the compiler to 'see' the allocation and optimize it. In his talk he mentions compilers deduplicating memory allocations (1:06:47).
It took me some time to find an example of memory-allocation optimization, but I found this code sample which, compiled under Clang, optimizes away all the memory allocation and just returns 1000000 without allocating anything.
template<typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }
    return result;
}
The paper http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html (N3664) also says that allocations could be fused by compilers, and seems to suggest that some compilers already do that sort of thing.
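A minimal sketch of the simplest case this permits, easy to check on godbolt (with clang -O2 no call to operator new should remain, though this is compiler-dependent):
// Under N3664 the new/delete pair may be elided entirely, so an optimizing
// compiler is allowed to reduce this function to "return 7;".
int value_roundtrip() {
    int* p = new int(7);
    int v = *p;
    delete p;
    return v;
}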
As I am very interested in strategies for efficient memory allocation, I really want to understand why Chandler Carruth is against virtual calls in the Bloomberg model. The example above clearly shows that Clang optimizes allocations away when it can see them.
1. I would like to see some "real life" code where this optimization is useful and performed by a current compiler.
2. Do you have any example of code where different allocations are fused by any current compiler?
3. Do you understand what Chandler Carruth means when he says that compilers can "deduplicate" your allocations in his talk at 1:06:47?
I have found this amazing example which answers point 1 of the question. Points 2 and 3 don't have any answer yet.
#include <iostream>
#include <vector>
#include <chrono>

std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>( n );
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
    return v;
}

void f_ref(std::size_t i, std::vector<double>& v) {
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
}

int main (int argc, char const *argv[]) {
    const auto n = std::size_t{10};
    const auto nb_loops = std::size_t{300000000};

    // Begin: Zone 1
    {
        auto v = std::vector<double>( n, 0.0 );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            auto w = f_val(i, n);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }
    // End: Zone 1

    {
        auto v = std::vector<double>( n, 0.0 );
        auto w = std::vector<double>( n );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            f_ref(i, w);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }
    return 0;
}
where not a single memory allocation happens in the for loop that calls f_val. This only happens with Clang, though (GCC and icpc both fail at it), and when building a slightly more complicated example the optimization is not done.
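One way to test claims like this empirically (a minimal sketch, assuming a single-threaded program): replace the global operator new with a counting wrapper, so any allocation that survives optimization bumps a counter while elided allocations do not. Note the standard permits eliding calls even to a replaced operator new, so a low count is suggestive rather than proof.
#include <cstdio>
#include <cstdlib>
#include <new>

// Counts every heap allocation that actually happens at run time.
static unsigned long long g_allocs = 0;

void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }

int main() {
    int* p = new int(42); // bumps the counter unless the pair is elided
    delete p;
    std::printf("allocations: %llu\n", g_allocs);
}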

C++11 std::thread summation with atomic very slow

I wanted to learn to use C++11 std::thread with VS2012, so I wrote a very simple C++ console program with two threads that just increment a counter. I also want to measure the performance difference when two threads are used. The test program is given below:
#include <iostream>
#include <thread>
#include <chrono>
#include <conio.h>
#include <atomic>

std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;

void test_without_threds()
{
    sum = 0;
    for (unsigned int j = 0; j < 2; j++)
        for (unsigned int k = 0; k < RANGE; k++)
            sum++;
}

void call_from_thread(int tid)
{
    for (unsigned int k = 0; k < RANGE; k++)
        sum++;
}

void test_with_2_threds()
{
    std::thread t[2];
    sum = 0;
    // Launch a group of threads
    for (int i = 0; i < 2; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }
    // Join the threads with the main thread
    for (int i = 0; i < 2; ++i) {
        t[i].join();
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    chrono::time_point<chrono::system_clock> start, end;
    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";
    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";
    cout << "-----------------------------------------\n";
    cout << "test with 2_threds\n";
    start = chrono::system_clock::now();
    test_with_2_threds();
    end = chrono::system_clock::now();
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";
    _getch();
    return 0;
}
Now, when I use just a plain long long variable for the counter (the commented-out line), I get a value different from the correct one - 100000000 instead of 200000000. I am not sure why: I suppose the two threads are changing the counter at the same time, but I'm not sure how that happens, because ++ is just a very simple instruction. It seems the threads are caching the sum variable at the beginning. Performance is 110 ms with two threads vs. 200 ms with one thread.
So the correct way, according to the documentation, is to use std::atomic. However, now the performance is much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
I am not sure why: I suppose the two threads are changing the counter at the same time, but I'm not sure how that happens, because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
So the correct way, according to the documentation, is to use std::atomic. However, now the performance is much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as an unsynchronized integer, though you can get a small performance improvement by relaxing the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
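A minimal sketch of that advice applied here (reusing RANGE from the question; the function name is made up): each thread accumulates into a private local and touches the shared atomic exactly once.
#include <atomic>

std::atomic<long long> total(0);
const int RANGE = 100000000;

// Each thread counts in a private variable and publishes its subtotal with a
// single atomic add, so the threads never contend inside the hot loop.
void call_from_thread_local(int tid)
{
    long long local = 0;
    for (int k = 0; k < RANGE; ++k)
        ++local;
    total.fetch_add(local, std::memory_order_relaxed);
}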
Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code; regardless of the value you give for RANGE, it then shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel without synchronizing, then add the individual results together at the end.
Another point is to guard against false sharing. False sharing arises when two (or more) threads write to data that really is separate but has been allocated in the same cache line. In that case, access to the memory can be serialized even though (as already noted) the threads don't actually share any data.
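(On most current x86 hardware a cache line is 64 bytes, so a fully padded per-thread counter can be sketched with alignas; C++17 also offers std::hardware_destructive_interference_size where implemented. This struct is an illustration, not the code used below.)
// One counter per 64-byte cache line: no two threads' counters can share a
// line, which eliminates false sharing at the cost of some wasted space.
struct alignas(64) PaddedCounter {
    long long value = 0;
};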
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data but stops the optimizer from seeing that it can do the whole computation at compile time, so we end up comparing one thread to 4 (which reminds me: I increased the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable, though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <chrono>
#include <conio.h>
#include <atomic>
#include <numeric>

const int num_threads = 4;

struct val {
    long long sum;
    int pad[2]; // pads each element to 16 bytes, which reduces (though does
                // not fully eliminate) false sharing on 64-byte cache lines

    val &operator=(long long i) { sum = i; return *this; }
    operator long long &() { return sum; }
    operator long long() const { return sum; }
};

val sum[num_threads];

using namespace std;
const int RANGE = 100000000;

void test_without_threds()
{
    sum[0] = 0LL;
    for (unsigned int j = 0; j < num_threads; j++)
        for (unsigned int k = 0; k < RANGE; k++)
            sum[0]++;
}

void call_from_thread(int tid)
{
    for (unsigned int k = 0; k < RANGE; k++)
        sum[tid]++;
}

void test_with_threads()
{
    std::thread t[num_threads];
    std::fill_n(sum, num_threads, 0);
    // Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }
    // Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
    long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}

int main()
{
    chrono::time_point<chrono::system_clock> start, end;
    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";
    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n"; // note: prints the array's address, as the output below shows
    cout << "-----------------------------------------\n";
    cout << "test with threads\n";
    start = chrono::system_clock::now();
    test_with_threads();
    end = chrono::system_clock::now();
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";
    _getch();
    return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the sums are identical, but N threads increase speed by a factor of approximately N (up to the number of cores available).
Try using the prefix increment; for class types it can avoid creating a temporary, though for plain integers an optimizing compiler generates identical code either way.
Tested on my machine: std::memory_order_relaxed does not give any advantage.