C++ running slower than VB6

this is my first post on Stack.
I usually dev in VB6 but have recently started doing more coding in C++, using the Dev-C++ IDE with the g++ compiler.
I'm having a problem with general program execution speeds.
This old VB6 code runs in 20 seconds.
DefLng A-Z
Private Sub Form_Load()
Dim n(10000, 10) As Long
Dim c(10000, 10) As Long
For d = 1 To 1000000
For dd = 1 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
Next
MsgBox "Done"
End Sub
This C++ code takes 57 seconds...
#include <cstdlib>   // for system() and EXIT_SUCCESS

int main(int argc, char *argv[]) {
    long n[10000][10];
    long c[10000][10];
    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }
    system("PAUSE");
    return EXIT_SUCCESS;
}
Most of the coding I do is AI-related and very heavy on array usage. I've tried using int rather than long, and I've tried different machines; the C++ always runs at least three times slower.
Am I being dumb? Can anyone explain what I'm doing wrong?
Cheers.

Short answer
You need to look into your compiler's optimization settings.
Takeaway: C++ lets you use many tricks, some generic and some dependent on your architecture; used properly, it will be superior to VB in terms of performance.
Long answer
Keep in mind this is highly dependent on your architecture, your compiler, and the compiler settings. You should configure your compiler to do more aggressive optimization.
You should also write optimized code that takes memory access patterns into account, using the CPU cache wisely, etc.
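For example (a minimal sketch of the cache idea, not from the original answer): iterating a 2D array row by row touches memory sequentially and makes good use of each cache line, while iterating column by column jumps by the full row stride on every access:
long a[10000][10];

// Row-major traversal: consecutive addresses, cache-friendly.
long sum_rows() {
    long s = 0;
    for (int r = 0; r < 10000; r++)
        for (int c = 0; c < 10; c++)
            s += a[r][c];
    return s;
}

// Column-major traversal: jumps a whole row (10 longs) between
// accesses, so fewer useful bytes arrive per cache line fetched.
long sum_cols() {
    long s = 0;
    for (int c = 0; c < 10; c++)
        for (int r = 0; r < 10000; r++)
            s += a[r][c];
    return s;
}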
I have done a test for you on an Ubuntu 16.04 virtual machine, using one core of an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, with g++ 5.4.0. Using the code below, here are my times depending on the compiler's optimization level.
At optimization levels 0, 1, 2, 3, and s I obtain roughly 36 s (completely unoptimized), 2.4 s, and then... zero.
osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0
start
..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1
start
..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2
start
..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3
start
..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as
start
..finished in 0 micro seconds
Note that at the more aggressive optimization levels the compiler eliminates the code completely, because the values in n[] are never used by the program.
To force the compiler to generate the code, use the volatile keyword when declaring n.
With volatile added, you'll now get ~12 s even at the most aggressive optimization levels (on my machine):
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3
start
..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as
start
..finished in 12493927 micro seconds
The code I used for the test (based on your example):
#include <iostream>
#include <sys/time.h>

using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_usec + (u64)now.tv_sec * 1000000;
}

int main()
{
    cout << "start" << endl;
    u64 t0 = timestamp();
    volatile long n[10000][10];
    long c[10000][10];
    for (long d = 1; d < 1000000; d++)
    {
        for (long dd = 1; dd < 10000; dd++)
        {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }
    u64 t1 = timestamp();
    cout << "..finished in " << (t1 - t0) << " micro seconds\n";
    return 0;
}
Multithreading
I have converted your code to use multithreading; with 2 threads I am able to cut the time roughly in half.
I am exploiting the fact that, as the code stands, the results are not used, so the inner loop is not dependent on the outer one; in a real program you should find another way to split the work so that the results do not overwrite one another.
#include <iostream>
#include <sys/time.h>
#include <omp.h>

using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_usec + (u64)now.tv_sec * 1000000;
}

int main()
{
    omp_set_num_threads(2);
    // Empty parallel region: creates the thread pool up front,
    // so thread startup is not counted in the timed section.
    #pragma omp parallel
    {
    }
    cout << "start" << endl;
    u64 t0 = timestamp();
    volatile long n[10000][10];
    long c[10000][10];
    for (long d = 1; d < 1000000; d++)
    {
        #pragma omp parallel for
        for (long dd = 1; dd < 10000; dd++)
        {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }
    u64 t1 = timestamp();
    cout << "..finished in " << (t1 - t0) << " micro seconds\n";
    return 0;
}
osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3
start
..finished in 6673741 micro seconds

UPDATE: Recent C++ compilers give much better results compared to the VB6 compiler.
The VB6 code shown above does not reflect the real-world case, where the access index can vary and the calculations can be broken out into separate functions.
Further testing shows that VB6 pays a huge optimization penalty when the arrays used are passed as function inputs (by reference).
I tried several C++ compilers and rewrote the benchmark code, adding some run-time-dependent behavior to defeat the optimizer. Certainly the code could be better; all suggestions are welcome. I used the -O3 option to maximize speed; multithreaded mode was not used.
RESULTS:
g++ 8.1 gives the best final result: 5 seconds with constant-index access (known to the compiler at compile time, the case shown in the OP) and 6 seconds with dynamic-index access.
VC 2017 takes only 2 seconds for the constant-index case but 35 seconds for the dynamic case.
VB6 S1: the original version, where n and c are local variables. VB6 S2: the inner loops moved into functions, with n and c passed in by reference.
Test conditions: Intel Xeon W3690 / Windows 10.
The rewritten code:
VB6:
DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long
Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)
For dd = 0 To 10000
n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
Next
End Sub
Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
For dd = 0 To 10000
n(dd, 1) = c(dd, 2) + c(dd, 3)
Next
End Sub
Private Sub Form_Load()
Dim n(10001, 10) As Long
Dim c(10001, 10) As Long
Dim t0 As Long
Dim t1 As Long
Dim t2 As Long
Dim id As Long
Dim ret As Long
t0 = GetTickCount
For d = 1 To 1000000
id = d And 7
Call cal_var(n, c, id) 'For VB S2
'For dd = 0 To 10000
' n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
'Next
Next
t1 = GetTickCount
For d = 1 To 1000000
Call cal_const(n, c) 'For VB S2
'For dd = 0 To 10000
' n(dd, 1) = c(dd, 2) + c(dd, 3)
'Next
Next
t2 = GetTickCount
For d = 0 To 10000
Sum = Sum + n(d, t0 And 7)
Next
MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " miliseconds"
End Sub
C++ Code:
#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
    }
}

void calc_const(long_arr &n, long_arr &c) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][0] = c[dd][1] + c[dd][2];
    }
}

int main()
{
    cout << "start ..." << endl;
    time_t t0 = time(NULL);
    long_arr n;
    long_arr c;
    memset(n, 0, sizeof(n));
    memset(c, t0 & 1, sizeof(c));       // run-time-dependent fill defeats constant folding
    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_var(n, c, (long)t0 & 7);   // dynamic index, unknown at compile time
    }
    time_t t1 = time(NULL);
    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_const(n, c);               // constant index, known at compile time
    }
    time_t t2 = time(NULL);
    long sum = 0;
    for (int i = 0; i < ARR_SIZE; i++) {
        sum += n[i][t0 & 7];            // consume the results so the loops survive
    }
    cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
    cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
    cout << "with final result : " << sum << endl;
    return 0;
}
The following original answer was based only on tests done with VC2008, in the VB6 S1 case.
ORIGINAL ANSWER:
I get the same result as Progger. I use the following code to keep the compiler from optimizing the loops away to nothing:
#include <string.h>   // for memset

int main(int argc, char *argv[]) {
    long n[10000][10];
    long c[10000][10];
    long sum = 0;
    memset(n, 0, sizeof(n));
    memset(c, 0, sizeof(c));
    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[dd][1] = c[dd][2] + c[dd][3];
        }
    }
    for (long dd = 1; dd < 10000; dd++) {
        sum += n[dd][1];   // consume the results so the loops are not removed
    }
    return sum;
}
I used Visual Studio C++ 2008 to compile the code. It turns out the compiler generates a lot of address recalculation for the local variables, mixed with imul (multiplication) instructions, which can be very costly. For example:
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
Each indexed access takes one multiplication and one add instruction.
EDIT1: The problem is related to multidimension array access performance in C++. More information can be found here:
Array Lookups
C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.
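To illustrate the flat-indexing idea on the arrays from this question, here is a minimal sketch of my own (the Flat2D helper is hypothetical, not from the quoted article): one contiguous buffer, with the row offset computed once per row instead of once per access.
#include <vector>

// Hypothetical helper: a 2D view over flat 1D storage.
struct Flat2D {
    std::vector<long> data;
    long cols;
    Flat2D(long rows, long cols_) : data(rows * cols_, 0), cols(cols_) {}
    long& at(long r, long c) { return data[r * cols + c]; }
};

void kernel(Flat2D& n, Flat2D& c) {
    for (long dd = 1; dd < 10000; dd++) {
        long base = dd * n.cols;   // row offset hoisted out of the element accesses
        n.data[base + 1] = c.data[base + 2] + c.data[base + 3];
    }
}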
The VB version uses the traditional approach (optimized flat 1D memory); the instructions used are only register adds (eax, edx, esi, ...), which are much faster.
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
That can answer the question of speed.
Advice: you should use existing libraries (for example, uBLAS) if possible. More discussion can be found here.
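As a small illustration of the uBLAS suggestion, a sketch assuming Boost is installed (element access goes through operator(), and the matrix is backed by flat 1D storage internally):
#include <boost/numeric/ublas/matrix.hpp>
namespace ub = boost::numeric::ublas;

int main() {
    ub::matrix<long> n(10000, 10);   // dense matrix over flat storage
    ub::matrix<long> c(10000, 10);
    c(5, 2) = 1;
    c(5, 3) = 2;
    n(5, 1) = c(5, 2) + c(5, 3);     // element access via operator()
    return 0;
}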
Here is the machine code of the C++ version:
.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main proc near ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514 = dword ptr -0C3514h
.text:00401000 var_C3510 = dword ptr -0C3510h
.text:00401000 var_C350C = dword ptr -0C350Ch
.text:00401000 var_C3500 = dword ptr -0C3500h
.text:00401000 var_C34FC = dword ptr -0C34FCh
.text:00401000 var_61A84 = dword ptr -61A84h
.text:00401000 var_61A7C = dword ptr -61A7Ch
.text:00401000 argc = dword ptr 8
.text:00401000 argv = dword ptr 0Ch
.text:00401000 envp = dword ptr 10h
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 mov eax, 0C3514h
.text:00401008 call __alloca_probe
.text:0040100D mov [ebp+var_61A84], 0
.text:00401017 mov [ebp+var_C350C], 1
.text:00401021 jmp short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023: ; CODE XREF: _main:loc_401097j
.text:00401023 mov eax, [ebp+var_C350C]
.text:00401029 add eax, 1
.text:0040102C mov [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032: ; CODE XREF: _main+21j
.text:00401032 cmp [ebp+var_C350C], 0F4240h
.text:0040103C jge short loc_401099
.text:0040103E mov [ebp+var_C3510], 1
.text:00401048 jmp short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A: ; CODE XREF: _main+95j
.text:0040104A mov ecx, [ebp+var_C3510]
.text:00401050 add ecx, 1
.text:00401053 mov [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059: ; CODE XREF: _main+48j
.text:00401059 cmp [ebp+var_C3510], 2710h
.text:00401063 jge short loc_401097
.text:00401065 mov edx, [ebp+var_C3510]
.text:0040106B imul edx, 28h
.text:0040106E mov eax, [ebp+var_C3510]
.text:00401074 imul eax, 28h
.text:00401077 mov ecx, [ebp+edx+var_C3500]
.text:0040107E add ecx, [ebp+eax+var_C34FC]
.text:00401085 mov edx, [ebp+var_C3510]
.text:0040108B imul edx, 28h
.text:0040108E mov [ebp+edx+var_61A7C], ecx
.text:00401095 jmp short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097: ; CODE XREF: _main+63j
.text:00401097 jmp short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099: ; CODE XREF: _main+3Cj
.text:00401099 mov [ebp+var_C3514], 1
.text:004010A3 jmp short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5: ; CODE XREF: _main+DCj
.text:004010A5 mov eax, [ebp+var_C3514]
.text:004010AB add eax, 1
.text:004010AE mov [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4: ; CODE XREF: _main+A3j
.text:004010B4 cmp [ebp+var_C3514], 2710h
.text:004010BE jge short loc_4010DE
.text:004010C0 mov ecx, [ebp+var_C3514]
.text:004010C6 imul ecx, 28h
.text:004010C9 mov edx, [ebp+var_61A84]
.text:004010CF add edx, [ebp+ecx+var_61A7C]
.text:004010D6 mov [ebp+var_61A84], edx
.text:004010DC jmp short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE: ; CODE XREF: _main+BEj
.text:004010DE mov eax, [ebp+var_61A84]
.text:004010E4 mov esp, ebp
.text:004010E6 pop ebp
.text:004010E7 retn
.text:004010E7 _main endp
The VB code is longer, so I post only the main function here:
.text:00401A81 loc_401A81: ; CODE XREF: .text:00401B18j
.text:00401A81 mov ecx, [ebp-68h]
.text:00401A84 mov eax, 0F4240h
.text:00401A89 cmp ecx, eax
.text:00401A8B jg loc_401B1D
.text:00401A91 mov edx, [ebp-18h]
.text:00401A94 mov edi, 1
.text:00401A99 add edx, 1
.text:00401A9C mov esi, edi
.text:00401A9E jo loc_401CEC
.text:00401AA4 mov [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7: ; CODE XREF: .text:00401B03j
.text:00401AA7 mov eax, 2710h
.text:00401AAC cmp esi, eax
.text:00401AAE jg short loc_401B05
.text:00401AB0 mov ebx, ds:__vbaGenerateBoundsError
.text:00401AB6 cmp esi, 2711h
.text:00401ABC jb short loc_401AC0
.text:00401ABE call ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0: ; CODE XREF: .text:00401ABCj
.text:00401AC0 cmp esi, 2711h
.text:00401AC6 jb short loc_401AD4
.text:00401AC8 call ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA cmp esi, 2711h
.text:00401AD0 jb short loc_401AD4
.text:00401AD2 call ebx
.text:00401AD4
.text:00401AD4 loc_401AD4: ; CODE XREF: .text:00401AC6j
.text:00401AD4 ; .text:00401AD0j
.text:00401AD4 mov eax, [ebp-54h]
.text:00401AD7 mov ecx, [eax+esi*4+13888h]
.text:00401ADE mov edx, [eax+esi*4+1D4CCh]
.text:00401AE5 add ecx, edx
.text:00401AE7 mov edx, [ebp-2Ch]
.text:00401AEA mov eax, edi
.text:00401AEC jo loc_401CEC
.text:00401AF2 add eax, esi
.text:00401AF4 mov [edx+esi*4+9C44h], ecx
.text:00401AFB jo loc_401CEC
.text:00401B01 mov esi, eax
.text:00401B03 jmp short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05: ; CODE XREF: .text:00401AAEj
.text:00401B05 mov ecx, [ebp-68h]
.text:00401B08 mov eax, 1
.text:00401B0D add eax, ecx
.text:00401B0F jo loc_401CEC
.text:00401B15 mov [ebp-68h], eax
.text:00401B18 jmp loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D: ; CODE XREF: .text:00401A8Bj
.text:00401B1D mov edi, 1
.text:00401B22 mov ebx, 2710h
.text:00401B27 mov esi, edi
.text:00401B29
.text:00401B29 loc_401B29: ; CODE XREF: .text:00401B5Fj
.text:00401B29 cmp esi, ebx
.text:00401B2B jg short loc_401B61
.text:00401B2D cmp esi, 2711h
.text:00401B33 jb short loc_401B3B
.text:00401B35 call ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B: ; CODE XREF: .text:00401B33j
.text:00401B3B mov ecx, [ebp-2Ch]
.text:00401B3E mov eax, [ebp-40h]
.text:00401B41 mov edx, [ecx+esi*4+9C44h]
.text:00401B48 add edx, eax
.text:00401B4A mov eax, edi
.text:00401B4C jo loc_401CEC
.text:00401B52 add eax, esi
.text:00401B54 mov [ebp-40h], edx
.text:00401B57 jo loc_401CEC
.text:00401B5D mov esi, eax
.text:00401B5F jmp short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61: ; CODE XREF: .text:00401B2Bj
.text:00401B61 mov ebx, ds:__vbaStrI4
.text:00401B67 mov ecx, 80020004h
.text:00401B6C mov [ebp-0B4h], ecx
.text:00401B72 mov [ebp-0A4h], ecx
.text:00401B78 mov [ebp-94h], ecx
.text:00401B7E mov ecx, [ebp-40h]
.text:00401B81 mov eax, 0Ah
.text:00401B86 push offset aDone ; "Done : "

Related

How do I convert C++ function to assembly(x86_64)?

This is my .CPP file
#include <iostream>
#include <cstdio>   // for printf
using namespace std;
extern "C" void KeysAsm(int arr[], int n, int thetha, int rho);
// Keep this and call it from assembler
extern "C"
void crim(int *xp, int *yp) {
int temp = *xp;
*xp = *yp;
*yp = temp+2;
}
// Translate this into Intel assembler
void KeysCpp(int arr[], int n, int thetha, int rho){
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
crim(&arr[j], &arr[j + 1]);
}
}
arr[i]= arr[i] + thetha / rho * 2 - 4;
}
}
// Function to print an array
void printArray(int arr[], int size){
int i;
for (i = 0; i < size; i++)
cout << arr[i] << "\n";
cout << endl;
}
int main() {
int gamma1[]{ 9, 270, 88, -12, 456, 80, 45, 123, 427, 999 };
int gamma2[]{ 900, 312, 542, 234, 234, 1, 566, 123, 427, 111 };
printf("Array:\n");
printArray(gamma1, 10);
KeysAsm(gamma1, 10, 5, 6);
printf("Array Result Asm:\n");
printArray(gamma1, 10);
KeysCpp(gamma2, 10, 5, 6);
printf("Array Result Cpp:\n");
printArray(gamma2, 10);
}
What I want to do is, convert the KeysCpp function into assembly language and call it from this very .CPP file. I want to keep the crim function as it is in .CPP, while only converting the KeysCpp.
Here is my .ASM file
PUBLIC KeysAsm
includelib kernel32.lib
_DATA SEGMENT
EXTERN crim:PROC
_DATA ENDS
_TEXT SEGMENT
KeysAsm PROC
push rbp
mov rbp, rsp
sub rsp, 40
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov DWORD PTR [rbp-32], edx
mov DWORD PTR [rbp-36], ecx
mov DWORD PTR [rbp-4], 0
jmp L3
L3:
mov eax, DWORD PTR [rbp-28]
sub eax, 1
cmp DWORD PTR [rbp-4], eax
jl L7
L4:
mov eax, DWORD PTR [rbp-28]
sub eax, DWORD PTR [rbp-4]
sub eax, 1
cmp DWORD PTR [rbp-8], eax
jl L6
L5:
add DWORD PTR [rbp-8], 1
L6:
mov eax, DWORD PTR [rbp-8]
cdqe
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov edx, DWORD PTR [rax]
mov eax, DWORD PTR [rbp-8]
cdqe
add rax, 1
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rcx
mov eax, DWORD PTR [rax]
cmp edx, eax
jle L5
mov eax, DWORD PTR [rbp-8]
cdqe
add rax, 1
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rdx, rax
mov eax, DWORD PTR [rbp-8]
cdqe
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rcx
mov rsi, rdx
mov rdi, rax
call crim
L7:
mov DWORD PTR [rbp-8], 0
jmp L4
KeysAsm ENDP
_TEXT ENDS
END
I am using Visual Studio 2017 to run this project.
I am getting next error when I run this code.
Unhandled exception at 0x00007FF74B0E429C in MatrixMultiplication.exe: Stack cookie instrumentation code detected a stack-based buffer overrun. occurred
Your asm looks like it's expecting the x86-64 System V calling convention, with args in RDI, ESI, EDX, ECX. But you said you're compiling with Visual Studio, so the compiler-generated code will use the Windows x64 calling convention: RCX, EDX, R8D, R9D.
And when you call crim, it can use shadow space (32 bytes above its return address, which you didn't reserve space for).
It looks like you got this asm from un-optimized compiler output, probably from https://godbolt.org/z/ea4MPh81r using GCC for Linux, without using -mabi=ms to override the default -mabi=sysv when compiling for non-Windows targets. And then you modified it to make the loop infinite, with a jmp at the bottom instead of a ret? Maybe a different GCC version than 12.2 since the label numbers and code don't match exactly.
(The signs of being un-optimized compiler output are all the reloads from [rbp-whatever], and redoing sign-extension before using an int to index an array with cdqe. A human would know the int must be non-negative. And being GCC specifically, the numbered label like .L1: etc. where you just removed the ., and of heavily using RAX for as much as possible in a debug build. And choices like lea rdx, [0+rax*4] to copy-and-shift, and the exact syntax it used to print that instruction in Intel syntax match GCC.)
To compile a single function for Windows x64, isolate it and give the compiler only prototypes for anything it calls:
extern "C" void crim(int *xp, int *yp); // prototype only
void KeysCpp(int arr[], int n, int thetha, int rho){
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
crim(&arr[j], &arr[j + 1]);
}
}
arr[i]= arr[i] + thetha / rho * 2 - 4;
}
}
Then on Godbolt, use gcc -O3 -mabi=ms, or use MSVC which always targets Windows. https://godbolt.org/z/Mj5Gb54b5 shows both GCC and MSVC with optimization enabled.
KeysCpp(int*, int, int, int): ; demangled name
cmp edx, 1
jle .L11 ; "shrink wrap" optimization: early-out on n<=1 before saving regs
push r15 ; save some call-preserved regs
push r14
lea r14, [rcx+4] ; arr + 1
push r13
mov r13, rcx
Unfortunately GCC fails to hoist the thetha / rho * 2 - 4 loop-invariant, instead redoing idiv every time through the loop. Seems like an obvious optimization since those are local vars whose address hasn't been taken at all, and it keeps thetha (typo for theta?) and rho in registers. So MSVC is much more efficient here. Clang also misses this optimization.
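If you control the source, a simple workaround (my sketch, not from the answer above) is to hoist the invariant by hand so no compiler has to prove it:
extern "C" void crim(int *xp, int *yp);  // prototype only

void KeysCpp(int arr[], int n, int thetha, int rho) {
    const int adjust = thetha / rho * 2 - 4;  // loop-invariant: one idiv total
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j + 1]) {
                crim(&arr[j], &arr[j + 1]);
            }
        }
        arr[i] = arr[i] + adjust;
    }
}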

Couldn't Reproduce non-friendly C++ cache code

I'm playing with some chunks of code that intentionally try to demonstrate ideas about the CPU cache, how to benefit from aligned data, and so on.
From the article http://igoro.com/archive/gallery-of-processor-cache-effects/, which I find very interesting, I'm trying to reproduce the behavior of these two loops (example 1):
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
Note the factor 16 in the second loop. As ints are 4 bytes long on my machine (Intel i5), every element accessed in the second loop sits in a different cache line, so theoretically, according to the author, these two loops should take almost the same time to execute, because the speed here is governed by memory access, not by the integer multiplication instruction.
So I've tried to reproduce this behavior with two different compilers, MSVC 19.31.31105.0 and GNU g++ 9.4.0, and the result is the same in both cases: the second loop takes about 4% of the time of the first loop. Here is the implementation I've used to reproduce it:
#include <cstdlib>
#include <chrono>
#include <iostream>   // needed for std::cout
#include <vector>
#include <string>

#define SIZE 64 * 1024 * 1024
#define K 16

// Comment this to use a C array instead of std::vector
//#define USE_C_ARRAY

int main() {
#if defined(USE_C_ARRAY)
    int* arr = new int[SIZE];
    std::cout << "Testing with C style array\n";
#else
    std::vector<int> arr(SIZE);
    std::cout << "Using std::vector\n";
#endif
    std::cout << "Step: " << K << " Elements \n";
    std::cout << "Array Elements: " << SIZE << "\n";
    std::cout << "Array Size: " << SIZE / (1024 * 1024) << " MBs \n";
    std::cout << "Int Size: " << sizeof(int) << "\n";
    auto show_duration = [&](auto title, auto duration) {
        std::cout << "[" << title << "] Duration: " <<
            std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() <<
            " (milisecs) \n";
    };
    auto start = std::chrono::steady_clock::now();
    // Loop 1
    for (int i = 0; i < SIZE; i++) arr[i]++;
    show_duration("FullArray",
        (std::chrono::steady_clock::now() - start));
    start = std::chrono::steady_clock::now();
    // Loop 2
    for (int i = 0; i < SIZE; i += K) arr[i]++;
    show_duration(std::string("Array Step ") + std::to_string(K),
        (std::chrono::steady_clock::now() - start));
#if defined(USE_C_ARRAY)
    delete[] arr;
#endif
    return EXIT_SUCCESS;
}
To compile it with g++:
g++ -o for_cache_demo for_cache_demo.cpp -std=c++17 -g3
Note 1: I'm using -g3 to explicitly show that I'm avoiding any compiler optimization just to see the raw behavior.
Note 2: The only difference in the assembly generated for the two loops is the operand of the add instruction that advances the loop counter, adding either 1 or K to i
add DWORD PTR [rbp-4], 16 # When using a 16 Step
Link to test in in Compiler Explorer: https://godbolt.org/z/eqWdce35z
It will be great if you have any suggestions or ideas about what I might be missing, thank you very much in advance!
Your mistake is not using any optimization. The overhead of the non-optimized code masks any cache effects.
When I don't use any optimization, I get the following:
$ g++ -O0 cache.cpp -o cache
$ ./cache
Using std::vector
Step: 16 Elements
Array Elements: 67108864
Array Size: 64 MBs
Int Size: 4
[FullArray] Duration: 105 (milisecs)
[Array Step 16] Duration: 23 (milisecs)
If I use an optimization, even the most conservative optimization flag, -O1, suddenly the result changes:
$ g++ -O1 cache.cpp -o cache
$ ./cache
Using std::vector
Step: 16 Elements
Array Elements: 67108864
Array Size: 64 MBs
Int Size: 4
[FullArray] Duration: 26 (milisecs)
[Array Step 16] Duration: 22 (milisecs)
(You can see that the run time of the step-16 version has not changed much, because it was already bottlenecked by memory access, while the run time of the step-1 version improved dramatically, since it had been limited by instruction execution, not memory bandwidth.)
Further raising optimization levels did not change results for me.
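For reference, judging from the listings below, the function being compared is roughly this (a reconstruction; only the assembly appears in the original answer):
#include <vector>

// Step-1 loop whose assembly is shown below: multiply every element by 3.
void vector_step_1(std::vector<int>& arr) {
    for (int i = 0; i < 64 * 1024 * 1024; i++) arr[i] *= 3;
}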
You can compare the inner loops for different optimization levels. With -O0:
vector_step_1(std::vector<int, std::allocator<int> >&):
push rbp
mov rbp, rsp
sub rsp, 32
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-4], 0
.L7:
cmp DWORD PTR [rbp-4], 67108863
jg .L8
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
mov rax, QWORD PTR [rbp-24]
mov rsi, rdx
mov rdi, rax
call std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov rdx, rax
mov ecx, DWORD PTR [rdx]
mov eax, ecx
add eax, eax
add eax, ecx
mov DWORD PTR [rdx], eax
add DWORD PTR [rbp-4], 1
jmp .L7
.L8:
nop
leave
ret
You can see it makes a whole function call on every iteration.
With -O1:
vector_step_1(std::vector<int, std::allocator<int> >&):
mov eax, 0
.L5:
mov rdx, rax
add rdx, QWORD PTR [rdi]
mov ecx, DWORD PTR [rdx]
lea ecx, [rcx+rcx*2]
mov DWORD PTR [rdx], ecx
add rax, 4
cmp rax, 268435456
jne .L5
ret
Now, there are just 8 instructions in the loop. At -O2 it becomes 6 instructions per loop.

why is std::equal much slower than a hand rolled loop for two small std::array?

I was profiling a small piece of code that is part of a larger simulation, and to my surprise, the STL function equal (std::equal) is much slower than a simple for-loop, comparing the two arrays element by element. I wrote a small test case, which I believe to be a fair comparison between the two, and the difference, using g++ 6.1.1 from the Debian archives is not insignificant. I am comparing two, four-element arrays of signed integers. I tested std::equal, operator==, and a small for loop. I didn't use std::chrono for an exact timing, but the difference can be seen explicitly with time ./a.out.
My question is: given the sample code below, why do operator== and the overloaded function std::equal (which, I believe, calls operator==) take approximately 40 s to complete, while the hand-written loop takes only 8 s? I'm using a very recent Intel-based laptop. The for-loop is faster at all optimization levels: -O1, -O2, -O3, and -Ofast. I compiled with
g++ -std=c++14 -Ofast -march=native -mtune=native
and ran the code.
The loop runs a huge number of times, just to make the difference clear to the naked eye. The modulo operators represent a cheap operation on one of the array elements, and serve to keep the compiler from optimizing out of the loop.
#include <iostream>
#include <algorithm>
#include <array>
#include <cstdint>   // int32_t, uint32_t
#include <cstdlib>   // EXIT_SUCCESS

using namespace std;
using T = array<int32_t, 4>;

bool are_equal_manual(const T& L, const T& R) noexcept {
    bool test{ true };
    for (uint32_t i{0}; i < 4; ++i) { test = test && (L[i] == R[i]); }
    return test;
}

bool are_equal_alg(const T& L, const T& R) noexcept {
    bool test{ equal(cbegin(L), cend(L), cbegin(R)) };
    return test;
}

int main(int argc, char** argv) {
    T left{ {0,1,2,3} };
    T right{ {0,1,2,3} };
    cout << boolalpha << are_equal_manual(left, right) << endl;
    cout << boolalpha << are_equal_alg(left, right) << endl;
    cout << boolalpha << (left == right) << endl;
    bool t{};
    const size_t N{ 5000000000 };
    for (size_t i{}; i < N; ++i) {
        //t = left == right;                 // SLOW
        //t = are_equal_manual(left,right);  // FAST
        t = are_equal_alg(left, right);      // SLOW
        left[0] = i % 10;
        right[2] = i % 8;
    }
    cout << boolalpha << t << endl;
    return EXIT_SUCCESS;
}
Here's the generated assembly of the for loop in main() when the are_equal_manual(left,right) function is used:
.L21:
xor esi, esi
test eax, eax
jne .L20
cmp edx, 2
sete sil
.L20:
mov rax, rcx
movzx esi, sil
mul r8
shr rdx, 3
lea rax, [rdx+rdx*4]
mov edx, ecx
add rax, rax
sub edx, eax
mov eax, edx
mov edx, ecx
add rcx, 1
and edx, 7
cmp rcx, rdi
And here's what's generated when the are_equal_alg(left,right) function is used:
.L20:
lea rsi, [rsp+16]
mov edx, 16
mov rdi, rsp
call memcmp
mov ecx, eax
mov rax, rbx
mov rdi, rbx
mul r12
shr rdx, 3
lea rax, [rdx+rdx*4]
add rax, rax
sub rdi, rax
mov eax, ebx
add rbx, 1
and eax, 7
cmp rbx, rbp
mov DWORD PTR [rsp], edi
mov DWORD PTR [rsp+24], eax
jne .L20
I'm not exactly sure what's happening in the generated code for the first case, but it's clearly not calling memcmp(). It doesn't appear to be comparing the contents of the arrays at all. While the loop is still iterated 5000000000 times, it's optimized to doing not much of anything. However, the loop that uses are_equal_alg(left,right) is still performing the comparison. Basically, the compiler is able to optimize the manual comparison much better than the std::equal template.
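Incidentally, the 16-byte memcmp that the library call was lowered to can also be written directly; a sketch (valid here because std::array stores its int32_t elements contiguously without padding, and integer equality is bitwise):
#include <array>
#include <cstdint>
#include <cstring>

using T = std::array<int32_t, 4>;

// Explicit version of the comparison seen in the second listing:
// a single 16-byte memcmp over both arrays.
bool are_equal_memcmp(const T& L, const T& R) noexcept {
    return std::memcmp(L.data(), R.data(), sizeof(T)) == 0;
}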

Is "for( int k = 5; k--;)" faster than "for( int k = 4; k > -1; --k)"

The question says it all: Is
for( int k = 5; k--;)
faster than
for( int k = 4; k > -1; --k)
and why?
EDIT:
I generated the assembly for debug and release in MSVC2012, but (it's my first time analyzing assembly code) I can't really make sense of it. I already added the std::cout to prevent the compiler from removing both loops during release optimization.
Can someone help me understand what the assembly means?
Debug:
; 10 : for( int k = 5; k--;){ std::cout << k; }
mov DWORD PTR _k$2[ebp], 5
$LN5@wmain:
mov eax, DWORD PTR _k$2[ebp]
mov DWORD PTR tv65[ebp], eax
mov ecx, DWORD PTR _k$2[ebp]
sub ecx, 1
mov DWORD PTR _k$2[ebp], ecx
cmp DWORD PTR tv65[ebp], 0
je SHORT $LN4@wmain
mov esi, esp
mov eax, DWORD PTR _k$2[ebp]
push eax
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
cmp esi, esp
call __RTC_CheckEsp
jmp SHORT $LN5@wmain
$LN4@wmain:
; 11 :
; 12 : for( int k = 4; k > -1; --k){ std::cout << k; }
mov DWORD PTR _k$1[ebp], 4
jmp SHORT $LN3@wmain
$LN2@wmain:
mov eax, DWORD PTR _k$1[ebp]
sub eax, 1
mov DWORD PTR _k$1[ebp], eax
$LN3@wmain:
cmp DWORD PTR _k$1[ebp], -1
jle SHORT $LN6@wmain
mov esi, esp
mov eax, DWORD PTR _k$1[ebp]
push eax
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
cmp esi, esp
call __RTC_CheckEsp
jmp SHORT $LN2@wmain
$LN6@wmain:
Release:
; 10 : for( int k = 5; k--;){ std::cout << k; }
mov esi, 5
$LL5@wmain:
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
dec esi
push esi
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
test esi, esi
jne SHORT $LL5@wmain
; 11 :
; 12 : for( int k = 4; k > -1; --k){ std::cout << k; }
mov esi, 4
npad 3
$LL3@wmain:
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
push esi
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
dec esi
cmp esi, -1
jg SHORT $LL3@wmain
[UPDATE: the question has been edited, so the two loops are no longer different.] As originally posted, they did different things: the first executed the body for k values 4 down to 0, while the second looped from 5 down to 1. If, say, the loop body does work related to the magnitude of the number, then they might differ in performance.
Ignoring that, on most CPUs the decrement in k-- incidentally sets the zero flag in the CPU's flags register, so no further explicit comparison is needed before deciding whether to exit. Still, an optimiser should realise that and avoid any unnecessary second comparison even for the second loop.
Generic quip: compilers are allowed to do lots of things, and the Standard certainly doesn't say anything about the relative performance of these two implementations, so ultimately the only way to know, if you have reason to care, is to use the same compiler and command-line options you want for production and then inspect the generated assembly or machine code and/or measure very carefully. The findings could differ when the executable is deployed on different hardware, rebuilt with a later compiler version, with different flags, with a different compiler, and so on.
Be careful, the two loops are not equivalent:
for( int k = 5; k--;) cout << k << endl;
prints 4 3 2 1 0. While
for( int k = 5; k > 0; k--) cout << k << endl;
prints 5 4 3 2 1.
From a performance point of view, you can have enough confidence in your compiler: modern compilers know how to optimize this better than we do, in most cases.
It depends on your compiler. Probably not, but as always, one must profile to be certain. Or you could look at the generated assembly code (e.g. gcc -S) and see if it's any different. Make sure to enable optimization before you test, too!

Why access in a for loop is faster than access in a ranged-for in -O0 but not in -O3?

I'm learning about performance in C++ (and C++11), and I need performance in both Debug and Release mode, because I spend time both debugging and executing.
I'm surprised by these two tests and by how much the results change with different compiler optimization flags.
Test iterator 1:
Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.
Test iterator 2:
Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.
P.S.: I am using this clock code for the timing.
Test iterator 1:
void test_iterator_1()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
size_t count = v.size();
for (unsigned int i = 0; i < count; ++i) {
v[i] = 1;
}
}
Test iterator 2:
void test_iterator_2()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
for (int& i : v) {
i = 1;
}
}
UPDATE: The problem is still the same, but with -O3 the difference for the ranged-for is small. So loop 1 is the best.
UPDATE 2: Results:
With -O3:
t1: 80 units
t2: 74 units
With -O0:
t1: 287 units
t2: 538 units
UPDATE 3: The code! Compile with: g++ -std=c++11 test.cpp -O0 (and then with -O3)
Your first test is actually setting the value of each element in the vector to 1.
Your second test is setting the value of a copy of each element in the vector to 1 (the original vector is the same).
When you optimize, the second loop more than likely is removed entirely as it is basically doing nothing.
If you want the second loop to actually set the value:
for (int& i : v) // notice the &
{
i = 1;
}
Once you make that change, your loops are likely to produce assembly code that is almost identical.
As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:
std::vector<int> v(SIZE, 1);
EDIT
The assembly is fairly long (100+ lines), so I won't post it all, but a couple things to note:
Version 1 stores a value for count and increments i, testing it each time; Version 2 uses iterators (basically the same as std::for_each(v.begin(), v.end(), ...)). So the code for the loop maintenance is very different (there is more setup for version 2, but less work each iteration).
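For reference, a ranged-for over a vector desugars to roughly this iterator loop (a sketch of what the language specifies):
// Approximate desugaring of: for (int& i : v) { i = 1; }
for (auto it = v.begin(), end = v.end(); it != end; ++it) {
    int& i = *it;
    i = 1;
}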
Version 1 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
push eax
lea ecx, DWORD PTR _v$[ebp]
call ??A?$vector@HV?$allocator@H@std@@@std@@QAEAAHI@Z ; std::vector<int,std::allocator<int> >::operator[](unsigned long)
mov DWORD PTR [eax], 1
Version 2 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1
When they get optimized, this all changes and (other than the ordering of a few instructions), the output is almost identical.
Version 1 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize@?$vector@HV?$allocator@H@std@@@std@@QAEXI@Z ; std::vector<int,std::allocator<int> >::resize
mov ecx, DWORD PTR _v$[ebp+4]
mov edx, DWORD PTR _v$[ebp]
sub ecx, edx
sar ecx, 2 ; this is the only differing instruction
test ecx, ecx
je SHORT $LN3@test_itera
push edi
mov eax, 1
mov edi, edx
rep stosd
pop edi
$LN3@test_itera:
test edx, edx
je SHORT $LN21@test_itera
push edx
call DWORD PTR __imp_??3@YAXPAX@Z
add esp, 4
$LN21@test_itera:
mov esp, ebp
pop ebp
ret 0
Version 2 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize@?$vector@HV?$allocator@H@std@@@std@@QAEXI@Z ; std::vector<int,std::allocator<int> >::resize
mov edx, DWORD PTR _v$[ebp]
mov ecx, DWORD PTR _v$[ebp+4]
mov eax, edx
cmp edx, ecx
je SHORT $LN1@test_itera
$LL33@test_itera:
mov DWORD PTR [eax], 1
add eax, 4
cmp eax, ecx
jne SHORT $LL33@test_itera
$LN1@test_itera:
test edx, edx
je SHORT $LN47@test_itera
push edx
call DWORD PTR __imp_??3@YAXPAX@Z
add esp, 4
$LN47@test_itera:
mov esp, ebp
pop ebp
ret 0
Do not worry about how much time each operation takes; that falls squarely under the "premature optimization is the root of all evil" quote from Donald Knuth. Write easy-to-understand, simple programs: your time while writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think how much of your time is required to shave off a few minutes of compute time.
Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever else resource is critical) is spent, and see how to make that better. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must read for any budding programmer.
Optimization is pattern matching: The compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes the pattern unrecognizable to the compiler, suddenly the effect of your optimization vanishes.
So what you are witnessing is simply that the ranged-for loop produces more bloated code without optimization, but that in this form the optimizer is able to recognize a pattern that it cannot recognize in the other case.
In any case, if you are curious, you should take a look at the produced assembler code (compile with the -S option).