Couldn't reproduce cache-unfriendly C++ code behavior - c++

I'm playing with some chunks of code that intentionally try to demonstrate ideas about the CPU cache, how to benefit from aligned data, and so on.
From this article, http://igoro.com/archive/gallery-of-processor-cache-effects/, which I find very interesting, I'm trying to reproduce the behavior of these two loops (example 1; the article's snippet is C#):
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
Note the stride of 16 in the second loop. Since ints are 4 bytes long on my machine (Intel i5), every element touched by the second loop falls in a different cache line, so theoretically, and according to the author, these two loops should take almost the same time to execute, because the speed here is governed by memory access, not by the integer multiplication instruction.
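For reference, here is the arithmetic behind the 16 (a small sketch of mine; it assumes 64-byte cache lines, which is typical for an Intel i5):
#include <cstddef>
// Stride in elements = cache line size / element size.
constexpr std::size_t kCacheLineBytes = 64; // assumption: 64-byte lines
constexpr std::size_t kStride = kCacheLineBytes / sizeof(int); // 64 / 4 = 16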
So I've tried to reproduce this behavior with two different compilers, MSVC 19.31.31105.0 and GNU g++ 9.4.0, and the result is the same in both cases: the second loop takes about 4% of the time of the first loop. Here is the implementation I've used to reproduce it:
#include <cstdlib>
#include <chrono>
#include <vector>
#include <string>
#include <iostream>
#define SIZE (64 * 1024 * 1024)
#define K 16
//Comment this to use C array instead of std vector
//#define USE_C_ARRAY
int main() {
#if defined(USE_C_ARRAY)
int* arr = new int[SIZE];
std::cout << "Testing with C style array\n";
#else
std::vector<int> arr(SIZE);
std::cout << "Using std::vector\n";
#endif
std::cout << "Step: " << K << " Elements \n";
std::cout << "Array Elements: " << SIZE << "\n";
std::cout << "Array Size: " << SIZE/(1024*1024) << " MBs \n";
std::cout << "Int Size: " << sizeof(int) << "\n";
auto show_duration = [&](auto title, auto duration){
std::cout << "[" << title << "] Duration: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() <<
" (millisecs) \n";
};
auto start = std::chrono::steady_clock::now();
// Loop 1
for (int i = 0; i < SIZE; i++) arr[i]++;
show_duration("FullArray",
(std::chrono::steady_clock::now() - start));
start = std::chrono::steady_clock::now();
// Loop 1
for (int i = 0; i < SIZE; i += K) arr[i]++;
show_duration(std::string("Array Step ") + std::to_string(K),
(std::chrono::steady_clock::now() - start));
#if defined(USE_C_ARRAY)
delete[] arr;
#endif
return EXIT_SUCCESS;
}
To compile it with g++:
g++ -o for_cache_demo for_cache_demo.cpp -std=c++17 -g3
Note 1: I'm compiling without any optimization flag (just -g3 for debug information), deliberately, to see the raw, unoptimized behavior.
Note 2: The only difference in the assembly generated for the two loops is the immediate operand of the add instruction that advances the loop counter, which adds 1 or K to i:
add DWORD PTR [rbp-4], 16 # When using a 16 Step
Link to test in in Compiler Explorer: https://godbolt.org/z/eqWdce35z
It would be great if you have any suggestions or ideas about what I might be missing. Thank you very much in advance!

Your mistake is not using any optimization. The overhead of the non-optimized code masks any cache effects.
When I don't use any optimization, I get the following:
$ g++ -O0 cache.cpp -o cache
$ ./cache
Using std::vector
Step: 16 Elements
Array Elements: 67108864
Array Size: 64 MBs
Int Size: 4
[FullArray] Duration: 105 (millisecs)
[Array Step 16] Duration: 23 (millisecs)
If I enable optimization, even the most conservative optimization flag, -O1, the result suddenly changes:
$ g++ -O1 cache.cpp -o cache
$ ./cache
Using std::vector
Step: 16 Elements
Array Elements: 67108864
Array Size: 64 MBs
Int Size: 4
[FullArray] Duration: 26 (millisecs)
[Array Step 16] Duration: 22 (millisecs)
(You can see that the run time of the step-16 loop has not changed much, because it was already bottlenecked by memory access, while the run time of the step-1 loop improved dramatically, since it was previously limited by instruction execution, not memory bandwidth.)
Further raising optimization levels did not change results for me.
You can compare the inner loops for different optimization levels. With -O0:
vector_step_1(std::vector<int, std::allocator<int> >&):
push rbp
mov rbp, rsp
sub rsp, 32
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-4], 0
.L7:
cmp DWORD PTR [rbp-4], 67108863
jg .L8
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
mov rax, QWORD PTR [rbp-24]
mov rsi, rdx
mov rdi, rax
call std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov rdx, rax
mov ecx, DWORD PTR [rdx]
mov eax, ecx
add eax, eax
add eax, ecx
mov DWORD PTR [rdx], eax
add DWORD PTR [rbp-4], 1
jmp .L7
.L8:
nop
leave
ret
You can see it makes a whole function call (to std::vector<int>::operator[]) on every iteration.
With -O1:
vector_step_1(std::vector<int, std::allocator<int> >&):
mov eax, 0
.L5:
mov rdx, rax
add rdx, QWORD PTR [rdi]
mov ecx, DWORD PTR [rdx]
lea ecx, [rcx+rcx*2]
mov DWORD PTR [rdx], ecx
add rax, 4
cmp rax, 268435456
jne .L5
ret
Now, there are just 8 instructions in the loop. At -O2 it becomes 6 instructions per loop.
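If you still want the unoptimized build to come closer to the article's behavior, one option (a sketch of mine, not from the question) is to bypass std::vector::operator[] and loop over the raw buffer, which removes the per-element function call at -O0:
int *p = arr.data(); // raw pointer into the vector's storage
for (int i = 0; i < SIZE; i++) p[i] *= 3; // no operator[] call per element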

Related

C++ running slower than VB6

This is my first post on Stack.
I usually develop in VB6 but have recently started doing more coding in C++, using the Dev-C++ IDE with the g++ compiler.
I'm having a problem with general program execution speeds.
This old VB6 code runs in 20 seconds.
DefLng A-Z
Private Sub Form_Load()
Dim n(10000, 10) As Long
Dim c(10000, 10) As Long
For d = 1 To 1000000
For dd = 1 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
Next
MsgBox "Done"
End Sub
This C++ code takes 57 seconds...
#include <cstdlib> // for system() and EXIT_SUCCESS
int main(int argc, char *argv[]) {
long n[10000][10];
long c[10000][10];
for (long d=1;d<1000000;d++){
for (long dd=1;dd<10000;dd++){
n[dd][1]=c[dd][2]+c[dd][3];
}
}
system("PAUSE");
return EXIT_SUCCESS; }
Most of the coding I do is AI-related and very heavy on array usage. I've tried using int rather than long, and I've tried different machines; the C++ always runs at least three times slower.
Am I being dumb? Can anyone explain what I'm doing wrong?
Cheers.
Short answer
You need to look into your compiler optimization settings. This resource might help.
Takeaway: C++ allows you to use many tricks, some generic and some dependent on your architecture; when used properly, it will be superior to VB in terms of performance.
Long answer
Keep in mind this is highly dependent on your architecture, compiler, and compiler settings. You should configure your compiler to do more aggressive optimization.
You should also write optimized code that takes memory access into account, uses the CPU cache wisely, etc.
I have done a test for you on an Ubuntu 16.04 virtual machine, using one core of an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code below, here are my times depending on the optimization level of the compiler (g++ 5.4.0).
With optimization levels 0, 1, 2, 3 and s, I obtain 36 s (completely unoptimized), 2.3 s, and then... zero:
osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0 start..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1 start..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 0 micro seconds
Note that with a more aggressive optimization level, the compiler eliminates the code completely, because the values in n[] are never used in the program.
To force the compiler to generate code, use the volatile keyword when declaring n.
With volatile added, you'll now get ~12 s with the most aggressive optimization (on my machine):
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as start..finished in 12493927 micro seconds
The code I used for the test (based on your example):
#include <iostream>
#include <sys/time.h>
using namespace std;
typedef unsigned long long u64;
u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}
int main()
{
cout<<"start"<<endl;
u64 t0 = timestamp();
volatile long n[10000][10];
long c[10000][10];
for(long d=1;d<1000000;d++)
{
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}
u64 t1 = timestamp();
cout<<"..finished in "<< (t1-t0) << " micro seconds\n";
return 0;
}
Multithreading
I have converted your code to use multithreading; using 2 threads, I am able to cut the time in half.
I am exploiting the fact that, as the code stands, the results are not used, so the inner loop does not depend on the outer one. In reality, you should find another way to split the work so that the results do not overwrite one another.
#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;
typedef unsigned long long u64;
u64 timestamp()
{
struct timeval now;
gettimeofday(&now, NULL);
return now.tv_usec + (u64)now.tv_sec*1000000;
}
int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
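// Empty parallel region: creates the OpenMP worker threads once,
// before the timed section, so thread startup cost is not measured below.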
}
cout<<"start"<<endl;
u64 t0 = timestamp();
volatile long n[10000][10];
long c[10000][10];
for(long d=1;d<1000000;d++)
{
#pragma omp parallel for
for(long dd=1;dd<10000;dd++)
{
n[dd][1]=c[dd][2]+c[dd][3];
}
}
u64 t1 = timestamp();
cout<<"..finished in "<< (t1-t0) << " micro seconds\n";
return 0;
}
osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3 start..finished in 6673741 micro seconds
UPDATE: Recent C++ compilers give much better results compared to the VB6 compiler.
The VB6 code shown above does not reflect the real case, where the access index can vary and calculations can be broken out into separate functions.
Further experiments show that VB6 pays a huge optimization penalty when the arrays used are passed as function inputs (by reference).
I tried several C++ compilers and rewrote the benchmark code, adding some random behavior to defeat the optimizer. Certainly, the code could be better; all suggestions are welcome. I used the -O3 option to maximize speed; multithreaded mode was not used.
RESULT:
g++ 8.1 gives the best final result: 5 seconds with const-index access (known to the compiler; the case shown in the OP) and 6 seconds with dynamic-index access.
VC 2017 takes only 2 seconds for const-index access but 35 seconds for the dynamic case.
VB6 S1: original version; n and c are local variables. VB6 S2: the inner loops were moved into a function; n and c are input variables passed by reference.
Test conditions: Intel Xeon W3690 / Windows 10.
The rewritten code:
VB6:
DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long
Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)
For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next
End Sub
Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub
Private Sub Form_Load()
Dim n(10001, 10) As Long
Dim c(10001, 10) As Long
Dim t0 As Long
Dim t1 As Long
Dim t2 As Long
Dim id As Long
Dim ret As Long
t0 = GetTickCount
For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2
'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next
Next
t1 = GetTickCount
For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2
'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next
t2 = GetTickCount
For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next
MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " milliseconds"
End Sub
C++ Code:
#include <iostream>
#include <time.h>
#include <string.h>
using namespace std;
#define NUM_ITERATION 1000000
#define ARR_SIZE 10000
typedef long long_arr[ARR_SIZE][10];
void calc_var(long_arr &n, long_arr &c, long id) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
}
}
void calc_const(long_arr &n, long_arr &c) {
for (long dd = 0; dd < ARR_SIZE; dd++) {
n[dd][0] = c[dd][1] + c[dd][2];
}
}
int main()
{
cout << "start ..." << endl;
time_t t0 = time(NULL);
long_arr n;
long_arr c;
memset(n, 0, sizeof(n));
memset(c, t0 & 1, sizeof(c));
for (long d = 1; d < NUM_ITERATION; d++) {
calc_var(n, c, (long)t0 & 7);
}
time_t t1 = time(NULL);
for (long d = 1; d < NUM_ITERATION; d++) {
calc_const(n, c);
}
time_t t2 = time(NULL);
long sum = 0;
for (int i = 0; i < ARR_SIZE; i++) {
sum += n[i][t0 & 7];
}
cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
cout << "with final result : " << sum << endl;
return 0;
}
The following original answer was based only on tests done with VC2008, VB6 S1 case.
ORIGINAL ANSWER:
I get the same result as Progger. I use the following code to keep the compiler from optimizing the loops away entirely:
#include <string.h> // for memset
int main(int argc, char *argv[]) {
long n[10000][10];
long c[10000][10];
long sum = 0;
memset(n, 0, sizeof(n) );
memset(c, 0, sizeof(c) );
for (long d=1;d<1000000;d++){
for (long dd=1;dd<10000;dd++){
n[dd][1]=c[dd][2]+c[dd][3];
}
}
for (long dd=1;dd<10000;dd++){
sum += n[dd][1];
}
return sum;
}
I used Visual Studio C++ 2008 to compile the code. It turns out that the compiler emits a lot of address computations for the local arrays, each mixed with an imul (multiplication) instruction, which can be very costly. For example:
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
Each indexed access takes one multiplication and one add instruction.
EDIT1: The problem is related to multidimensional array access performance in C++. More information can be found here:
Array Lookups
C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.
The VB version uses the traditional approach (flat, 1D-optimized memory); the instructions used are only register adds (eax, edx, esi, ...), which are much faster.
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
That can answer the question of speed.
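To illustrate the flat-indexing idea, here is a hypothetical rewrite of the inner loop (my sketch, given the declarations long n[10000][10] and c[10000][10]): walk the rows with pointers and a fixed stride, so no per-access multiplication is needed.
long *np = &n[1][1]; // destination column 1, starting at row dd = 1
const long *cp = &c[1][0]; // source row pointer
for (long dd = 1; dd < 10000; dd++, np += 10, cp += 10)
    *np = cp[2] + cp[3]; // n[dd][1] = c[dd][2] + c[dd][3], no imul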
Advice: you should use existing libraries (for example, uBLAS) if possible. More discussion can be found here.
Here is the machine code of C++ version :
.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp
The VB code is longer, so I post here only the main function :
.text:00401A81 loc_401A81: ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "

Why do arrays of different integer sizes have different performance?

I have the following issue:
The write times to a std::array for int8, int16, int32 and int64 double with each size increase. I can understand such behavior for an 8-bit CPU, but not for a 32/64-bit one.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
Here is my test code:
#include <iostream>
#include <array>
#include <chrono>
std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;
void PutZero()
{
auto point1 = std::chrono::high_resolution_clock::now();
for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();
std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}
int main()
{
PutZero();
std::cout << std::endl << "Press enter to exit" << std::endl;
std::cin.get();
return 0;
}
I compile it under linux with: g++ -o array_issue_1 main.cpp -O3 -std=c++14
and my results are following:
Time of processing int8 array: 9922us.
Time of processing int16 array: 37717us.
Time of processing int32 array: 76064us.
Time of processing int64 array: 146803us.
If I compile with -O2, the results are 5 times worse for int8!
You can also compile this source on Windows; you will get a similar relation between the results.
Update #1
When I compile with -O2, my results are the following:
Time of processing int8 array: 60182us.
Time of processing int16 array: 77807us.
Time of processing int32 array: 114204us.
Time of processing int64 array: 186664us.
I didn't analyze the assembler output. My main point is that I would like to write efficient code in C++, and results like these show that things like std::array can be challenging from a performance perspective and somewhat counter-intuitive.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
It doesn't. But there are 3 different issues with your benchmark that are giving you those results:
1. You're not pre-faulting the memory, so you're page-faulting the arrays during the benchmark. These page faults, along with the OS kernel interaction, are a dominant factor in the time.
2. The compiler with -O3 is completely defeating your benchmark by converting all your loops into memset().
3. Your benchmark is memory-bound, so you're measuring the speed of your memory instead of the core.
Problem 1: The Test Data is not Prefaulted
Your arrays are declared, but not used before the benchmark. Because of the way the kernel and memory allocation work, they are not mapped into memory yet. It's only when you first touch them that this happens. And when it does, it incurs a very large penalty from the kernel to map the pages.
Pre-faulting can be done by touching all the arrays before the benchmark.
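A minimal pre-faulting sketch (my own illustration, not the code behind the links below; it assumes 4 KiB pages):
#include <cstddef>
template <typename Array>
void prefault(Array &arr)
{
    // Write one byte per 4 KiB page so the kernel maps every page
    // before the timed loops run.
    volatile char *p = reinterpret_cast<volatile char *>(arr.data());
    for (std::size_t i = 0; i < sizeof(arr); i += 4096) p[i] = 0;
}
Calling prefault(int8Array); (and likewise for the other three arrays) before PutZero() takes the page faults out of the measurement.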
No Pre-Faulting: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.
With Pre-Faulting: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.
The times drop by roughly a factor of 4. In other words, your original benchmark was measuring more of the kernel than the actual code.
Problem 2: The Compiler is Defeating the Benchmark
The compiler is recognizing your pattern of writing zeros and is completely replacing all your loops with calls to memset(). So in effect, you're measuring calls to memset() with different sizes.
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 67108864
mov edi, OFFSET FLAT:int8Array
mov r14, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 134217728
mov edi, OFFSET FLAT:int16Array
mov r13, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 268435456
mov edi, OFFSET FLAT:int32Array
mov r12, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 536870912
mov edi, OFFSET FLAT:int64Array
mov rbp, rax
call memset
call std::chrono::_V2::system_clock::now()
The optimization that's doing this is -ftree-loop-distribute-patterns. Even if you turn that off, the vectorizer will give you a similar effect.
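You can check this yourself by disabling that pass (an illustrative command; as noted, the vectorizer may still produce similar code):
g++ -O3 -fno-tree-loop-distribute-patterns -Wall main.cpp && ./a.out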
With -O2, vectorization and pattern recognition are both disabled. So the compiler gives you what you write.
.L4:
mov BYTE PTR [rax], 0 ;; <<------ 1 byte at a time
add rax, 1
cmp rdx, rax
jne .L4
call std::chrono::_V2::system_clock::now()
mov rbp, rax
mov eax, OFFSET FLAT:int16Array
lea rdx, [rax+134217728]
.L5:
xor ecx, ecx
add rax, 2
mov WORD PTR [rax-2], cx ;; <<------ 2 bytes at a time
cmp rdx, rax
jne .L5
call std::chrono::_V2::system_clock::now()
mov r12, rax
mov eax, OFFSET FLAT:int32Array
lea rdx, [rax+268435456]
.L6:
mov DWORD PTR [rax], 0 ;; <<------ 4 bytes at a time
add rax, 4
cmp rax, rdx
jne .L6
call std::chrono::_V2::system_clock::now()
mov r13, rax
mov eax, OFFSET FLAT:int64Array
lea rdx, [rax+536870912]
.L7:
mov QWORD PTR [rax], 0 ;; <<------ 8 bytes at a time
add rax, 8
cmp rdx, rax
jne .L7
call std::chrono::_V2::system_clock::now()
With -O2: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.
Now it's clear that the smaller word sizes are slower. But you would expect the times to be flat if all the word sizes were the same speed. And the reason they aren't is because of memory bandwidth.
Problem 3: Memory Bandwidth
Because the benchmark (as written) is only writing zeros, it is easily saturating the memory bandwidth for the core/system. So the benchmark becomes affected by how much memory is touched.
To fix that, we need to shrink the dataset so that it fits into cache. To compensate for this, we loop over the same data multiple times.
std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;
...
auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();
Now we see timings that are a lot more flat for the different word-sizes:
http://coliru.stacked-crooked.com/a/f534f98f6d840c5c
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.
The reason why it isn't completely flat is probably that there are numerous other factors involved in the compiler optimizations. You might need to resort to loop unrolling to get any closer.
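As an illustration of that last point, a hypothetical 4x manual unroll of the int8 loop (my sketch; it assumes the array size is divisible by 4):
for (std::size_t c = 0; c < 64 * 1024; c++) {
    for (std::size_t i = 0; i < int8Array.size(); i += 4) {
        // Four independent stores per iteration reduce loop overhead
        // and expose more instruction-level parallelism.
        int8Array[i]     = 0;
        int8Array[i + 1] = 0;
        int8Array[i + 2] = 0;
        int8Array[i + 3] = 0;
    }
}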

why is std::equal much slower than a hand rolled loop for two small std::array?

I was profiling a small piece of code that is part of a larger simulation, and to my surprise the STL function std::equal is much slower than a simple for-loop comparing the two arrays element by element. I wrote a small test case, which I believe to be a fair comparison between the two, and the difference, using g++ 6.1.1 from the Debian archives, is not insignificant. I am comparing two four-element arrays of signed integers. I tested std::equal, operator==, and a small for loop. I didn't use std::chrono for exact timing, but the difference can be seen explicitly with time ./a.out.
My question is: given the sample code below, why do operator== and the overloaded function std::equal (which calls operator==, I believe) take approx 40 s to complete, while the hand-written loop takes only 8 s? I'm using a very recent Intel-based laptop. The for-loop is faster at all optimization levels: -O1, -O2, -O3, and -Ofast. I compiled the code with
g++ -std=c++14 -Ofast -march=native -mtune=native
The loop runs a huge number of times, just to make the difference clear to the naked eye. The modulo operations represent a cheap operation on one of the array elements and keep the compiler from optimizing the loop away.
#include<iostream>
#include<algorithm>
#include<array>
#include<cstdint> // int32_t
#include<iterator> // cbegin, cend
using namespace std;
using T = array<int32_t, 4>;
bool
are_equal_manual(const T& L, const T& R)
noexcept {
bool test{ true };
for(uint32_t i{0}; i < 4; ++i) { test = test && (L[i] == R[i]); }
return test;
}
bool
are_equal_alg(const T& L, const T& R)
noexcept {
bool test{ equal(cbegin(L),cend(L),cbegin(R)) };
return test;
}
int main(int argc, char** argv) {
T left{ {0,1,2,3} };
T right{ {0,1,2,3} };
cout << boolalpha << are_equal_manual(left,right) << endl;
cout << boolalpha << are_equal_alg(left,right) << endl;
cout << boolalpha << (left == right) << endl;
bool t{};
const size_t N{ 5000000000 };
for(size_t i{}; i < N; ++i) {
//t = left == right; // SLOW
//t = are_equal_manual(left,right); // FAST
t = are_equal_alg(left,right); // SLOW
left[0] = i % 10;
right[2] = i % 8;
}
cout<< boolalpha << t << endl;
return(EXIT_SUCCESS);
}
Here's the generated assembly of the for loop in main() when the are_equal_manual(left,right) function is used:
.L21:
xor esi, esi
test eax, eax
jne .L20
cmp edx, 2
sete sil
.L20:
mov rax, rcx
movzx esi, sil
mul r8
shr rdx, 3
lea rax, [rdx+rdx*4]
mov edx, ecx
add rax, rax
sub edx, eax
mov eax, edx
mov edx, ecx
add rcx, 1
and edx, 7
cmp rcx, rdi
And here's what's generated when the are_equal_alg(left,right) function is used:
.L20:
lea rsi, [rsp+16]
mov edx, 16
mov rdi, rsp
call memcmp
mov ecx, eax
mov rax, rbx
mov rdi, rbx
mul r12
shr rdx, 3
lea rax, [rdx+rdx*4]
add rax, rax
sub rdi, rax
mov eax, ebx
add rbx, 1
and eax, 7
cmp rbx, rbp
mov DWORD PTR [rsp], edi
mov DWORD PTR [rsp+24], eax
jne .L20
I'm not exactly sure what's happening in the generated code for the first case, but it's clearly not calling memcmp(). It doesn't appear to be comparing the contents of the arrays at all. While the loop is still iterated 5000000000 times, it's optimized down to doing not much at all. However, the loop that uses are_equal_alg(left,right) is still performing the comparison. Basically, the compiler is still able to optimize the manual comparison much better than the std::equal version.
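As an aside, if you want per-variant timings instead of timing the whole binary with time ./a.out, a minimal std::chrono harness around the existing loop (my sketch, reusing the names from the code above) would be:
#include <chrono>
// ... inside main(), around the hot loop:
auto t0 = std::chrono::steady_clock::now();
for(size_t i{}; i < N; ++i) {
    t = are_equal_alg(left,right);
    left[0] = i % 10;
    right[2] = i % 8;
}
auto t1 = std::chrono::steady_clock::now();
cout << chrono::duration_cast<chrono::milliseconds>(t1 - t0).count() << " ms" << endl;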

Is "for( int k = 5; k--;)" faster than "for( int k = 4; k > -1; --k)"

The question says it all: Is
for( int k = 5; k--;)
faster than
for( int k = 4; k > -1; --k)
and why?
EDIT:
I generated the assembly for debug and release in MSVC2012. But (it's my first time analyzing assembly code) I can't really make sense of it. I already added the std::cout to prevent the compiler from removing both loops during release optimization.
Can someone help me understand what the assembly means?
Debug:
; 10 : for( int k = 5; k--;){ std::cout << k; }
mov DWORD PTR _k$2[ebp], 5
$LN5#wmain:
mov eax, DWORD PTR _k$2[ebp]
mov DWORD PTR tv65[ebp], eax
mov ecx, DWORD PTR _k$2[ebp]
sub ecx, 1
mov DWORD PTR _k$2[ebp], ecx
cmp DWORD PTR tv65[ebp], 0
je SHORT $LN4#wmain
mov esi, esp
mov eax, DWORD PTR _k$2[ebp]
push eax
mov ecx, DWORD PTR __imp_?cout#std##3V?$basic_ostream#DU?$char_traits#D#std###1#A
call DWORD PTR __imp_??6?$basic_ostream#DU?$char_traits#D#std###std##QAEAAV01#H#Z
cmp esi, esp
call __RTC_CheckEsp
jmp SHORT $LN5#wmain
$LN4#wmain:
; 11 :
; 12 : for( int k = 4; k > -1; --k){ std::cout << k; }
mov DWORD PTR _k$1[ebp], 4
jmp SHORT $LN3#wmain
$LN2#wmain:
mov eax, DWORD PTR _k$1[ebp]
sub eax, 1
mov DWORD PTR _k$1[ebp], eax
$LN3#wmain:
cmp DWORD PTR _k$1[ebp], -1
jle SHORT $LN6#wmain
mov esi, esp
mov eax, DWORD PTR _k$1[ebp]
push eax
mov ecx, DWORD PTR __imp_?cout#std##3V?$basic_ostream#DU?$char_traits#D#std###1#A
call DWORD PTR __imp_??6?$basic_ostream#DU?$char_traits#D#std###std##QAEAAV01#H#Z
cmp esi, esp
call __RTC_CheckEsp
jmp SHORT $LN2#wmain
$LN6#wmain:
Release:
; 10 : for( int k = 5; k--;){ std::cout << k; }
mov esi, 5
$LL5#wmain:
mov ecx, DWORD PTR __imp_?cout#std##3V?$basic_ostream#DU?$char_traits#D#std###1#A
dec esi
push esi
call DWORD PTR __imp_??6?$basic_ostream#DU?$char_traits#D#std###std##QAEAAV01#H#Z
test esi, esi
jne SHORT $LL5#wmain
; 11 :
; 12 : for( int k = 4; k > -1; --k){ std::cout << k; }
mov esi, 4
npad 3
$LL3#wmain:
mov ecx, DWORD PTR __imp_?cout#std##3V?$basic_ostream#DU?$char_traits#D#std###1#A
push esi
call DWORD PTR __imp_??6?$basic_ostream#DU?$char_traits#D#std###std##QAEAAV01#H#Z
dec esi
cmp esi, -1
jg SHORT $LL3#wmain
[UPDATE: the question has been updated, so the two loops are no longer different.] They do different things... the first one executes the loop body for k values 4 down to 0, while the second one loops from 5 down to 1... if, say, the loop body does work related to the magnitude of the number, then they might differ in performance.
Ignoring that: on most CPUs, k-- incidentally sets the flags register, in particular the "zero" flag, so no further explicit comparison is needed before deciding whether to exit the loop. Still, an optimizer should realize that and avoid any unnecessary second comparison even with the second loop.
Generic quip: compilers are allowed to do lots of things, and the Standard certainly doesn't say anything about the relative performance of these two implementations, so ultimately the only way to know, if you have reason to care, is to use the same compiler and command-line options you want for production, then inspect the generated assembly or machine code and/or measure very carefully. The findings could differ when the executable is deployed on different hardware, built with a later version of the compiler, with different flags, with a different compiler, and so on.
Be careful, the two loops are not equivalent:
for( int k = 5; k--;) cout << k << endl;
prints 4 3 2 1 0. While
for( int k = 5; k > 0; k--) cout << k << endl;
prints 5 4 3 2 1.
From a performance point of view, you can have enough confidence in your compiler. Modern compilers know how to optimize this better than we do, in most cases.
It depends on your compiler. Probably not, but as always, one must profile to be certain. Or you could look at the generated assembly code (e.g. gcc -S) and see if it's any different. Make sure to enable optimization before you test, too!
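For example, with g++ you could dump and compare the assembly of both loops (an illustrative command; loops.cpp is a placeholder file name):
g++ -O2 -S -masm=intel loops.cpp -o loops.s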

Why access in a for loop is faster than access in a ranged-for in -O0 but not in -O3?

I'm learning about performance in C++ (and C++11), and I care about performance in both Debug and Release mode, because I spend time both debugging and executing.
I'm surprised by these two tests and by how much the results change with the different compiler optimization flags.
Test iterator 1:
Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.
Test iterator 2:
Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.
P.S.: I use the following clock code.
Test iterator 1:
void test_iterator_1()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
size_t count = v.size();
for (unsigned int i = 0; i < count; ++i) {
v[i] = 1;
}
}
Test iterator 2:
void test_iterator_2()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
for (int& i : v) {
i = 1;
}
}
UPDATE: The problem is still the same, but for the ranged-for in -O3 the difference is small. So loop 1 is the best.
UPDATE 2: Results:
With -O3:
t1: 80 units
t2: 74 units
With -O0:
t1: 287 units
t2: 538 units
UPDATE 3: The code! Compile with: g++ -std=c++11 test.cpp -O0 (and then with -O3)
Your first test is actually setting the value of each element in the vector to 1.
Your second test, as originally posted (with for (int i : v), without the reference), was setting the value of a copy of each element in the vector to 1 (the original vector stayed the same).
When you optimize, that second loop is more than likely removed entirely, as it is basically doing nothing.
If you want the second loop to actually set the value:
for (int& i : v) // notice the &
{
i = 1;
}
Once you make that change, your loops are likely to produce assembly code that is almost identical.
As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:
std::vector<int> v(SIZE, 1);
EDIT
The assembly is fairly long (100+ lines), so I won't post it all, but a couple things to note:
Version 1 will store a value for count and increment i, testing it each time. Version 2 uses iterators (basically the same as std::for_each(v.begin(), v.end(), ...)). So the code for the loop maintenance is very different (there is more setup for version 2, but less work on each iteration).
Version 1 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
push eax
lea ecx, DWORD PTR _v$[ebp]
call ??A?$vector#HV?$allocator#H#std###std##QAEAAHI#Z ; std::vector<int,std::allocator<int> >::operator[]
mov DWORD PTR [eax], 1
Version 2 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1
When they get optimized, this all changes and (other than the ordering of a few instructions), the output is almost identical.
Version 1 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov ecx, DWORD PTR _v$[ebp+4]
mov edx, DWORD PTR _v$[ebp]
sub ecx, edx
sar ecx, 2 ; this is the only differing instruction
test ecx, ecx
je SHORT $LN3#test_itera
push edi
mov eax, 1
mov edi, edx
rep stosd
pop edi
$LN3#test_itera:
test edx, edx
je SHORT $LN21#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN21#test_itera:
mov esp, ebp
pop ebp
ret 0
Version 2 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov edx, DWORD PTR _v$[ebp]
mov ecx, DWORD PTR _v$[ebp+4]
mov eax, edx
cmp edx, ecx
je SHORT $LN1#test_itera
$LL33#test_itera:
mov DWORD PTR [eax], 1
add eax, 4
cmp eax, ecx
jne SHORT $LL33#test_itera
$LN1#test_itera:
test edx, edx
je SHORT $LN47#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN47#test_itera:
mov esp, ebp
pop ebp
ret 0
Do not worry about how much time each operation takes; that falls squarely under the "premature optimization is the root of all evil" quote by Donald Knuth. Write easy-to-understand, simple programs: your time while writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think how much of your time would be required to shave off a few minutes of compute time.
Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever else resource is critical) is spent, and see how to make that better. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must read for any budding programmer.
Optimization is pattern matching: the compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes a pattern unrecognizable to the compiler, suddenly the effect of that optimization vanishes.
So, what you are witnessing is nothing more or less than this: the ranged-for loop produces more bloated code without optimization, but in this form the optimizer is able to recognize a pattern that it cannot recognize in the iterator-free (index-based) case.
In any case, if you are curious, you should take a look at the produced assembler code (compile with the -S option).