Increased access time with large arrays - C++

Problem
I am currently writing a program with large arrays, and I am very confused about the processing time of the different arrays. On the one hand I have 7 "smaller" arrays (<= 65536 elements), and on the other hand I have 7 large arrays (65536 < elements <= 67108864). All of them are integer arrays.
I only want to increment elements in a loop. In every iteration I increment the element at the same index in each array, so all 14 arrays are incremented at the same index per iteration. This takes 21 seconds. If I only increment the 7 smaller arrays it takes just 2 seconds, and if I only increment the 7 larger arrays it takes 18 seconds!
For better illustration I have written some example code:
#include <iostream>
#include <stdio.h>   // getchar
#include <time.h>
using namespace std;

int main(int argc, char* argv[])
{
    int i;

    int* ExampleArray1 = new int [16384];    // 3 "small" arrays (2^14)
    int* ExampleArray2 = new int [32768];    // (2^15)
    int* ExampleArray3 = new int [65536];    // (2^16)
    for(i=0 ; i<16384 ; i++) {ExampleArray1[i] = 0;}
    for(i=0 ; i<32768 ; i++) {ExampleArray2[i] = 0;}
    for(i=0 ; i<65536 ; i++) {ExampleArray3[i] = 0;}

    int* ExampleArray4 = new int [16777216]; // 3 large arrays (2^24)
    int* ExampleArray5 = new int [33554432]; // (2^25)
    int* ExampleArray6 = new int [67108864]; // (2^26)
    for(i=0 ; i<16777216 ; i++) {ExampleArray4[i] = 0;}
    for(i=0 ; i<33554432 ; i++) {ExampleArray5[i] = 0;}
    for(i=0 ; i<67108864 ; i++) {ExampleArray6[i] = 0;}

    int until;
    clock_t start, stop;
    cout << "Until: "; cin >> until;

    start = clock();
    for(i=0 ; i<until ; i++)
    {
        ExampleArray1[1]++;
        ExampleArray2[1]++;
        ExampleArray3[1]++;
    }
    stop = clock();
    cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";

    start = clock();
    for(i=0 ; i<until ; i++)
    {
        ExampleArray4[1]++;
        ExampleArray5[1]++;
        ExampleArray6[1]++;
    }
    stop = clock();
    cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";

    delete[] ExampleArray1; ExampleArray1 = NULL;
    delete[] ExampleArray2; ExampleArray2 = NULL;
    delete[] ExampleArray3; ExampleArray3 = NULL;
    delete[] ExampleArray4; ExampleArray4 = NULL;
    delete[] ExampleArray5; ExampleArray5 = NULL;
    delete[] ExampleArray6; ExampleArray6 = NULL;

    getchar();
    return 0;
}
Another confusing point is the time difference between the build configurations.
For "until" I entered 1 billion.
Output -> Debug-Mode (standard):
Time: 6 sec.
Time: 26 sec.
Output -> Release-Mode (Full Optimization enabled [/Ox]; I believe this is the default):
Time: 34 sec.
Time: 47 sec.
Output -> Release-Mode (Optimization disabled in the property sheet [/Od]):
Time: 25 sec.
Time: 25 sec.
But if I change the code a little bit and replace the increment with an assignment, the output is a bit different:
[...]
int until;
int value;
clock_t start, stop;
cout << "Until: "; cin >> until;
cout << "Value: "; cin >> value;

start = clock();
for(i=0 ; i<until ; i++)
{
    ExampleArray1[1] = value;
    ExampleArray2[1] = value;
    ExampleArray3[1] = value;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";

start = clock();
for(i=0 ; i<until ; i++)
{
    ExampleArray4[1] = value;
    ExampleArray5[1] = value;
    ExampleArray6[1] = value;
}
stop = clock();
cout << "Time: " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";
[...]
Output -> Debug-Mode (standard):
Time: 4 sec.
Time: 28 sec.
Output -> Release-Mode (Full Optimization enabled [/Ox]):
Time: 38 sec.
Time: 38 sec.
Output -> Release-Mode (Optimization disabled[/Od]):
Time: 24 sec.
Time: 28 sec.
I thought that the access time of an array is always the same.
I'm using Visual Studio 2012 32-Bit Express and Windows 7 64-Bit.
I have tested everything several times. The times were approximately the same (varying by up to +/- 6 seconds) and I have taken a rounded average.
Questions
Why do the times differ so much with the various array sizes? Is this avoidable without splitting the large arrays?
Why does debug mode require less time than release mode?
What could cause release mode without optimization to be faster than with optimization?
Thanks in advance.
Edit:
System data:
Processor: AMD Phenom II X6 1045T (6 x 2.7 GHz)
L1 Cache: 768 KB (2x6x64 KB) 2-way | L2 Cache: 3072 KB (6x512 KB) 16-way | L3 Cache: 6 MB 48-way associative
RAM: 2x 4GB DDR3
SSD hard drive
Disassembly (of the first code with incrementation)
Disassembly - VS 2012 (Express)
The disassembly in debug mode is the same in both cases (small and large arrays).
Here, for example, is ExampleArray1 and the header of the first loop:
for (i = 0; i<until; i++)
01275FEA mov dword ptr [i],0
01275FF1 jmp main+21Ch (01275FFCh)
01275FF3 mov eax,dword ptr [i]
01275FF6 add eax,1
01275FF9 mov dword ptr [i],eax
01275FFC mov eax,dword ptr [i]
01275FFF cmp eax,dword ptr [until]
01276002 jge main+283h (01276063h)
{
ExampleArray1[1]++;
01276004 mov eax,4
01276009 shl eax,0
0127600C mov ecx,dword ptr [ExampleArray1]
0127600F mov edx,dword ptr [ecx+eax]
01276012 add edx,1
01276015 mov eax,4
0127601A shl eax,0
0127601D mov ecx,dword ptr [ExampleArray1]
01276020 mov dword ptr [ecx+eax],edx
In release mode (with optimization) the code of the first loop is shorter than that of the second, and the second loop's header appears twice in the disassembly (could this be the culprit?). First loop:
for(i=0 ; i<until; i++)
00AD10D9 xor ecx,ecx
00AD10DB mov dword ptr [esp+24h],eax
00AD10DF cmp dword ptr [esp+10h],ecx
00AD10E3 jle main+0F5h (0AD10F5h)
{
ExampleArray1[1]++;
00AD10E5 inc dword ptr [esi+4]
ExampleArray2[1]++;
00AD10E8 inc dword ptr [ebx+4]
ExampleArray3[1]++;
00AD10EB inc dword ptr [edi+4]
00AD10EE inc ecx
00AD10EF cmp ecx,dword ptr [esp+10h]
00AD10F3 jl main+0E5h (0AD10E5h)
Second loop:
for(i=0 ; i<until; i++)
003B13B4 xor ecx,ecx
003B13B6 mov dword ptr [esp+24h],eax
003B13BA cmp dword ptr [esp+10h],ecx
003B13BE jle main+174h (03B13E4h)
for(i=0 ; i<until; i++)
003B13C0 mov eax,dword ptr [esp+1Ch]
003B13C4 mov edx,dword ptr [esp+18h]
003B13C8 mov edi,dword ptr [esp+20h]
003B13CC lea esp,[esp]
{
ExampleArray4[1]++;
003B13D0 inc dword ptr [eax+4]
ExampleArray5[1]++;
003B13D3 inc dword ptr [edx+4]
ExampleArray6[1]++;
003B13D6 inc dword ptr [edi+4]
003B13D9 inc ecx
003B13DA cmp ecx,dword ptr [esp+10h]
003B13DE jl main+160h (03B13D0h)
003B13E0 mov edi,dword ptr [esp+14h]
}
That's the disassembly of release mode without optimization -> First loop:
for(i=0 ; i<until; i++)
001A17A2 mov dword ptr [i],0
001A17A9 jmp main+1C4h (01A17B4h)
001A17AB mov edx,dword ptr [i]
001A17AE add edx,1
001A17B1 mov dword ptr [i],edx
001A17B4 mov eax,dword ptr [i]
001A17B7 cmp eax,dword ptr [until]
001A17BA jge main+22Bh (01A181Bh)
{
ExampleArray1[1]++;
001A17BC mov ecx,4
001A17C1 shl ecx,0
001A17C4 mov edx,dword ptr [ExampleArray1]
001A17C7 mov eax,dword ptr [edx+ecx]
001A17CA add eax,1
001A17CD mov ecx,4
001A17D2 shl ecx,0
001A17D5 mov edx,dword ptr [ExampleArray1]
001A17D8 mov dword ptr [edx+ecx],eax
ExampleArray2[1]++;
001A17DB mov eax,4
001A17E0 shl eax,0
001A17E3 mov ecx,dword ptr [ExampleArray2]
001A17E6 mov edx,dword ptr [ecx+eax]
001A17E9 add edx,1
001A17EC mov eax,4
001A17F1 shl eax,0
001A17F4 mov ecx,dword ptr [ExampleArray2]
001A17F7 mov dword ptr [ecx+eax],edx
ExampleArray3[1]++;
001A17FA mov edx,4
001A17FF shl edx,0
001A1802 mov eax,dword ptr [ExampleArray3]
001A1805 mov ecx,dword ptr [eax+edx]
001A1808 add ecx,1
001A180B mov edx,4
001A1810 shl edx,0
001A1813 mov eax,dword ptr [ExampleArray3]
001A1816 mov dword ptr [eax+edx],ecx
Second loop:
for(i=0 ; i<until; i++)
001A186F mov dword ptr [i],0
001A1876 jmp main+291h (01A1881h)
001A1878 mov eax,dword ptr [i]
001A187B add eax,1
001A187E mov dword ptr [i],eax
001A1881 mov ecx,dword ptr [i]
001A1884 cmp ecx,dword ptr [until]
001A1887 jge main+2F8h (01A18E8h)
{
ExampleArray4[1]++;
001A1889 mov edx,4
001A188E shl edx,0
001A1891 mov eax,dword ptr [ExampleArray4]
001A1894 mov ecx,dword ptr [eax+edx]
001A1897 add ecx,1
001A189A mov edx,4
001A189F shl edx,0
001A18A2 mov eax,dword ptr [ExampleArray4]
{
ExampleArray4[1]++;
001A18A5 mov dword ptr [eax+edx],ecx
ExampleArray5[1]++;
001A18A8 mov ecx,4
001A18AD shl ecx,0
001A18B0 mov edx,dword ptr [ExampleArray5]
001A18B3 mov eax,dword ptr [edx+ecx]
001A18B6 add eax,1
001A18B9 mov ecx,4
001A18BE shl ecx,0
001A18C1 mov edx,dword ptr [ExampleArray5]
001A18C4 mov dword ptr [edx+ecx],eax
ExampleArray6[1]++;
001A18C7 mov eax,4
001A18CC shl eax,0
001A18CF mov ecx,dword ptr [ExampleArray6]
001A18D2 mov edx,dword ptr [ecx+eax]
001A18D5 add edx,1
001A18D8 mov eax,4
001A18DD shl eax,0
001A18E0 mov ecx,dword ptr [ExampleArray6]
001A18E3 mov dword ptr [ecx+eax],edx
}
Disassembly - VS 2013 (Express)
Debug mode: -same as in VS 2012-
In release mode (with optimization) it looks a bit different ->
First loop:
for (i = 0; i<until; i++)
01391384 xor ecx,ecx
01391386 mov dword ptr [esp+10h],eax
0139138A cmp dword ptr [esp+20h],ecx
0139138E jle main+100h (013913A0h)
{
ExampleArray1[1]++;
01391390 inc dword ptr [esi+4]
01391393 inc ecx
ExampleArray2[1]++;
01391394 inc dword ptr [ebx+4]
ExampleArray3[1]++;
01391397 inc dword ptr [edi+4]
0139139A cmp ecx,dword ptr [esp+20h]
0139139E jl main+0F0h (01391390h)
}
Second loop (the loop header is larger?):
for (i = 0; i<until; i++)
013913EF xor ecx,ecx
013913F1 mov dword ptr [esp+10h],eax
013913F5 cmp dword ptr [esp+20h],ecx
013913F9 jle main+17Bh (0139141Bh)
013913FB mov eax,dword ptr [esp+18h]
013913FF mov edx,dword ptr [esp+0Ch]
01391403 mov edi,dword ptr [esp+1Ch]
{
ExampleArray4[1]++;
01391407 inc dword ptr [eax+4]
0139140A inc ecx
ExampleArray5[1]++;
0139140B inc dword ptr [edx+4]
ExampleArray6[1]++;
0139140E inc dword ptr [edi+4]
01391411 cmp ecx,dword ptr [esp+20h]
01391415 jl main+167h (01391407h)
01391417 mov edi,dword ptr [esp+14h]
}
Release mode (without optimization) -> First loop:
for (i = 0; i<until; i++)
00DC182C mov dword ptr [i],0
00DC1833 jmp main+1CEh (0DC183Eh)
00DC1835 mov edx,dword ptr [i]
00DC1838 add edx,1
00DC183B mov dword ptr [i],edx
00DC183E mov eax,dword ptr [i]
00DC1841 cmp eax,dword ptr [until]
00DC1844 jge main+235h (0DC18A5h)
{
ExampleArray1[1]++;
00DC1846 mov ecx,4
00DC184B shl ecx,0
00DC184E mov edx,dword ptr [ExampleArray1]
00DC1851 mov eax,dword ptr [edx+ecx]
00DC1854 add eax,1
00DC1857 mov ecx,4
00DC185C shl ecx,0
00DC185F mov edx,dword ptr [ExampleArray1]
00DC1862 mov dword ptr [edx+ecx],eax
ExampleArray2[1]++;
00DC1865 mov eax,4
00DC186A shl eax,0
00DC186D mov ecx,dword ptr [ExampleArray2]
00DC1870 mov edx,dword ptr [ecx+eax]
00DC1873 add edx,1
00DC1876 mov eax,4
00DC187B shl eax,0
00DC187E mov ecx,dword ptr [ExampleArray2]
00DC1881 mov dword ptr [ecx+eax],edx
ExampleArray3[1]++;
00DC1884 mov edx,4
00DC1889 shl edx,0
00DC188C mov eax,dword ptr [ExampleArray3]
00DC188F mov ecx,dword ptr [eax+edx]
00DC1892 add ecx,1
00DC1895 mov edx,4
00DC189A shl edx,0
00DC189D mov eax,dword ptr [ExampleArray3]
00DC18A0 mov dword ptr [eax+edx],ecx
}
Second loop:
for (i = 0; i<until; i++)
00DC18F9 mov dword ptr [i],0
00DC1900 jmp main+29Bh (0DC190Bh)
00DC1902 mov eax,dword ptr [i]
00DC1905 add eax,1
00DC1908 mov dword ptr [i],eax
00DC190B mov ecx,dword ptr [i]
00DC190E cmp ecx,dword ptr [until]
00DC1911 jge main+302h (0DC1972h)
{
ExampleArray4[1]++;
00DC1913 mov edx,4
{
ExampleArray4[1]++;
00DC1918 shl edx,0
00DC191B mov eax,dword ptr [ExampleArray4]
00DC191E mov ecx,dword ptr [eax+edx]
00DC1921 add ecx,1
00DC1924 mov edx,4
00DC1929 shl edx,0
00DC192C mov eax,dword ptr [ExampleArray4]
00DC192F mov dword ptr [eax+edx],ecx
ExampleArray5[1]++;
00DC1932 mov ecx,4
00DC1937 shl ecx,0
00DC193A mov edx,dword ptr [ExampleArray5]
00DC193D mov eax,dword ptr [edx+ecx]
00DC1940 add eax,1
00DC1943 mov ecx,4
00DC1948 shl ecx,0
00DC194B mov edx,dword ptr [ExampleArray5]
00DC194E mov dword ptr [edx+ecx],eax
ExampleArray6[1]++;
00DC1951 mov eax,4
00DC1956 shl eax,0
00DC1959 mov ecx,dword ptr [ExampleArray6]
00DC195C mov edx,dword ptr [ecx+eax]
00DC195F add edx,1
00DC1962 mov eax,4
00DC1967 shl eax,0
00DC196A mov ecx,dword ptr [ExampleArray6]
00DC196D mov dword ptr [ecx+eax],edx
}
Edit: Well, I've done what I can, and I'm more confused than before. To see the disassembly I set a breakpoint in my code and chose right mouse button -> Go To Disassembly. In all of these listings I set the breakpoint at the statement "ExampleArray2[1]++" (line 34 in my file).
Now I have noticed that the disassembly changes if I set the breakpoint at another array.
Why? Is Visual Studio broken or something like that, or is this normal?
This question is already really big, so I have uploaded the disassembly of the clock() calls and the hard-coded "until" to pastebin: http://pastebin.com/pxgegzZ6
IMPORTANT: I have made a hopefully helpful discovery in release mode. If I right-click in the second loop and choose "Run To Cursor", or set the cursor in the second loop and press Ctrl+F10, then the first loop only requires 2.6 seconds! If I set the cursor at the end of the code, the first loop needs 2.5-2.7 seconds but the second one is as slow as before. In debug mode the times don't change.

This looks to me like a cache issue.
If, instead of
for(i=0 ; i<until; i++)
{
    ExampleArray4[1]++;
    ExampleArray5[1]++;
    ExampleArray6[1]++;
}
you did
for(i=0 ; i<until; i++) ExampleArray4[1]++;
for(i=0 ; i<until; i++) ExampleArray5[1]++;
for(i=0 ; i<until; i++) ExampleArray6[1]++;
I predict it would be much faster.
The reason the cache is an issue is that every time you switch from working on one array to the next, a different location must be pulled into the cache. By switching arrays so often you get a lot of cache misses. A similar issue occurs if you walk down the columns of a 2D array rather than along its rows. Presumably this doesn't affect your smaller arrays because they can all fit in the cache.
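If you want to check this prediction, a minimal sketch of the measurement (my addition, not part of the original answer) can be dropped into the question's main() after the existing timing blocks, reusing its variables and its clock()-based timing:

start = clock();
for(i=0 ; i<until ; i++) ExampleArray4[1]++;
for(i=0 ; i<until ; i++) ExampleArray5[1]++;
for(i=0 ; i<until ; i++) ExampleArray6[1]++;
stop = clock();
cout << "Time (split loops): " << static_cast<float>(stop-start)/CLOCKS_PER_SEC << " sec.\n";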

Related

c++ for loop efficiency: body vs. afterthought

I found something interesting on LeetCode and hope someone can help explain the cause:
I was basically doing merge sort and used the fast/slow pointer technique to find the middle node. Here are two versions of the code snippet:
1. update in afterthought
for (ListNode* fast=head;
     fast->next && fast->next->next;
     fast = fast->next->next, slow = slow->next) { }
2. update in body
for (ListNode* fast=head; fast->next && fast->next->next; ) {
    fast = fast->next->next;
    slow = slow->next;
}
Why is version 2 faster than the first one?
Compiler: g++ 4.9.2
It is unlikely that the comma operator can significantly reduce the speed of a for-loop.
I compiled both variants and looked at the disassembly (in Visual Studio 2012) to see the difference.
The first variant looks like this:
for (ListNode* fast = head;
0022545E mov eax,dword ptr [head]
00225461 mov dword ptr [ebp-2Ch],eax
fast->next && fast->next->next;
00225464 jmp main+17Bh (022547Bh)
fast = fast->next->next, slow = slow->next) {
00225466 mov eax,dword ptr [ebp-2Ch]
00225469 mov ecx,dword ptr [eax+4]
0022546C mov edx,dword ptr [ecx+4]
0022546F mov dword ptr [ebp-2Ch],edx
00225472 mov eax,dword ptr [slow]
00225475 mov ecx,dword ptr [eax+4]
00225478 mov dword ptr [slow],ecx
0022547B mov eax,dword ptr [ebp-2Ch]
0022547E cmp dword ptr [eax+4],0
00225482 je main+192h (0225492h)
00225484 mov eax,dword ptr [ebp-2Ch]
00225487 mov ecx,dword ptr [eax+4]
0022548A cmp dword ptr [ecx+4],0
0022548E je main+192h (0225492h)
}
The second variant is:
for (ListNode* fast = head; fast->next && fast->next->next;) {
0024545E mov eax,dword ptr [head]
00245461 mov dword ptr [ebp-2Ch],eax
00245464 mov eax,dword ptr [ebp-2Ch]
00245467 cmp dword ptr [eax+4],0
0024546B je main+190h (0245490h)
0024546D mov eax,dword ptr [ebp-2Ch]
00245470 mov ecx,dword ptr [eax+4]
00245473 cmp dword ptr [ecx+4],0
00245477 je main+190h (0245490h)
fast = fast->next->next;
00245479 mov eax,dword ptr [ebp-2Ch]
0024547C mov ecx,dword ptr [eax+4]
0024547F mov edx,dword ptr [ecx+4]
00245482 mov dword ptr [ebp-2Ch],edx
slow = slow->next;
00245485 mov eax,dword ptr [slow]
00245488 mov ecx,dword ptr [eax+4]
0024548B mov dword ptr [slow],ecx
}
The only distinction is a single jmp.
Sorry, but I cannot see any significant difference, so perhaps the performance problem does not lie in those two statements.
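If you want to measure the two loop forms directly rather than eyeballing the disassembly, here is a small self-contained benchmark sketch (my addition, not part of the original answer; ListNode is a minimal stand-in for the LeetCode type):

#include <chrono>
#include <cstdio>
#include <vector>

struct ListNode { ListNode* next; ListNode() : next(0) {} };

int main()
{
    const int N = 10000000;
    std::vector<ListNode> nodes(N);                  // build a long singly linked list
    for (int i = 0; i + 1 < N; ++i) nodes[i].next = &nodes[i + 1];
    ListNode* head = &nodes[0];

    // Variant 1: update in the afterthought
    ListNode* slow1 = head;
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (ListNode* fast = head;
         fast->next && fast->next->next;
         fast = fast->next->next, slow1 = slow1->next) { }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    // Variant 2: update in the body
    ListNode* slow2 = head;
    for (ListNode* fast = head; fast->next && fast->next->next; ) {
        fast = fast->next->next;
        slow2 = slow2->next;
    }
    std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();

    std::printf("afterthought: %lld us, body: %lld us, same mid node: %d\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
        slow1 == slow2);
    return 0;
}

With optimization enabled I would expect both variants to compile to essentially the same pointer-chasing loop, which is consistent with the conclusion above.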

Why can only some of the variables be watched within a debugger in Release mode?

Consider the following code:
#include <string>
#include <vector>
std::basic_string<char> sBasicString = "basic_string";
char* buffer = new char[1000];
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
{
    char c;
    c = sBasicString[i];
    buffer[i] = c;
}
(Please ignore the memory leak - it is not relevant)
I'm compiling it in VS2012 64bit in both Release and Debug (default configuration).
When I'm running the debugger in Debug mode, I can watch the sBasicString and buffer variables as expected (query their value, etc.).
But when I'm running the debugger in Release mode, I can still watch sBasicString but not buffer.
Why?
Since Release mode has optimization set to "Full Optimization" (the default value) and "Generate Debug Info" set to "Yes", I would expect that either both variables can be watched or neither.
EDIT
After adding a proper use of the buffer variable (to keep the compiler from optimizing it away), I still get the same behavior.
EDIT 2: Adding the 64-bit disassembly output of the Release-mode compilation
int main()
{
000000013F091000 mov rax,rsp
000000013F091003 push rbx
000000013F091004 sub rsp,50h
000000013F091008 mov qword ptr [rax-38h],0FFFFFFFFFFFFFFFEh
std::basic_string<char> sBasicString = "basic_string";
000000013F091010 xor ebx,ebx
000000013F091012 mov qword ptr [rax-20h],rbx
000000013F091016 mov qword ptr [rax-18h],rbx
000000013F09101A mov qword ptr [rax-18h],0Fh
000000013F091022 mov qword ptr [rax-20h],rbx
000000013F091026 mov byte ptr [rax-30h],bl
000000013F091029 lea r8d,[rbx+0Ch]
000000013F09102D lea rdx,[__xi_z+40h (013F093238h)]
000000013F091034 lea rcx,[rax-30h]
000000013F091038 call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign (013F0916A0h)
000000013F09103D nop
char* buffer = new char[1000];
000000013F09103E mov ecx,3E8h
000000013F091043 call operator new[] (013F091AD8h)
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
000000013F091048 mov edx,ebx
000000013F09104A cmp qword ptr [rsp+38h],rbx
000000013F09104F jbe main+73h (013F091073h)
{
char c;
c = sBasicString[i];
000000013F091051 lea rcx,[sBasicString]
000000013F091056 cmp qword ptr [rsp+40h],10h
000000013F09105C cmovae rcx,qword ptr [sBasicString]
buffer[i] = c;
000000013F091062 movzx ecx,byte ptr [rcx+rdx]
buffer[i] = c;
000000013F091066 mov byte ptr [rdx+rax],cl
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
000000013F091069 inc rdx
000000013F09106C cmp rdx,qword ptr [rsp+38h]
000000013F091071 jb main+51h (013F091051h)
}
std::cout << buffer << std::endl;
000000013F091073 mov rdx,rax
000000013F091076 mov rcx,qword ptr [__imp_std::cout (013F093068h)]
000000013F09107D call std::operator<<<std::char_traits<char> > (013F0910C0h)
000000013F091082 mov rcx,rax
000000013F091085 mov rdx,qword ptr [__imp_std::endl (013F093060h)]
000000013F09108C call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (013F093098h)]
000000013F091092 nop
return 0;
000000013F091093 cmp qword ptr [rsp+40h],10h
000000013F091099 jb main+0A5h (013F0910A5h)
000000013F09109B mov rcx,qword ptr [sBasicString]
000000013F0910A0 call operator delete (013F091AEAh)
000000013F0910A5 mov qword ptr [rsp+40h],0Fh
000000013F0910AE mov qword ptr [rsp+38h],rbx
000000013F0910B3 mov byte ptr [sBasicString],0
return 0;
000000013F0910B8 xor eax,eax
}
000000013F0910BA add rsp,50h
000000013F0910BE pop rbx
000000013F0910BF ret
EDIT 3: Adding the 32-bit disassembly output
int main()
{
013B1000 push 0FFFFFFFFh
013B1002 push 13B2558h
013B1007 mov eax,dword ptr fs:[00000000h]
013B100D push eax
013B100E mov dword ptr fs:[0],esp
013B1015 sub esp,18h
013B1018 push esi
std::basic_string<char> sBasicString = "basic_string";
013B1019 push 0Ch
013B101B mov dword ptr [esp+18h],0
013B1023 mov dword ptr [esp+1Ch],0
013B102B push 13B3158h
013B1030 lea ecx,[esp+0Ch]
013B1034 mov dword ptr [esp+20h],0Fh
013B103C mov dword ptr [esp+1Ch],0
013B1044 mov byte ptr [esp+0Ch],0
013B1049 call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign (013B17B0h)
013B104E mov dword ptr [esp+24h],0
char* buffer = new char[1000];
013B1056 push 3E8h
013B105B call operator new[] (013B1BD6h)
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
013B1060 xor edx,edx
013B1062 add esp,4
013B1065 mov esi,eax
013B1067 cmp dword ptr [esp+14h],edx
013B106B jbe main+8Dh (013B108Dh)
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
013B106D lea ecx,[ecx]
{
char c;
c = sBasicString[i];
013B1070 cmp dword ptr [esp+18h],10h
013B1075 lea ecx,[esp+4]
013B1079 cmovae ecx,dword ptr [esp+4]
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
013B107E inc edx
buffer[i] = c;
013B107F mov al,byte ptr [ecx+edx-1]
013B1083 mov byte ptr [edx+esi-1],al
013B1087 cmp edx,dword ptr [esp+14h]
013B108B jb main+70h (013B1070h)
}
std::cout << buffer << std::endl;
013B108D push dword ptr ds:[13B3030h]
013B1093 push esi
013B1094 push dword ptr ds:[13B3034h]
013B109A call std::operator<<<std::char_traits<char> > (013B10F0h)
013B109F add esp,8
013B10A2 mov ecx,eax
013B10A4 call dword ptr ds:[13B3028h]
return 0;
013B10AA mov dword ptr [esp+24h],0FFFFFFFFh
013B10B2 cmp dword ptr [esp+18h],10h
013B10B7 pop esi
013B10B8 jb main+0C5h (013B10C5h)
013B10BA push dword ptr [esp]
013B10BD call operator delete (013B1BECh)
013B10C2 add esp,4
}
013B10C5 mov ecx,dword ptr [esp+18h]
return 0;
013B10C9 mov dword ptr [esp+14h],0Fh
013B10D1 mov dword ptr [esp+10h],0
013B10D9 mov byte ptr [esp],0
013B10DD xor eax,eax
}
013B10DF mov dword ptr fs:[0],ecx
013B10E6 add esp,24h
013B10E9 ret
Looking at the x86 assembly posted in this question, I can apply my rudimentary assembly knowledge to work out where the buffer variable is hidden:
char* buffer = new char[1000];
013B105B call operator new[] (013B1BD6h)
013B1065 mov esi,eax
My candidate is the esi register: operator new returned the result in eax, and it is moved to esi. Let's follow this register:
for (size_t i = 0 ; i < sBasicString.size() ; ++i)
013B107E inc edx
buffer[i] = c;
013B107F mov al,byte ptr [ecx+edx-1]
013B1083 mov byte ptr [edx+esi-1],al
The last line stores the char value in al into the buffer. edx is obviously the loop counter, see inc edx. So esi points to the buffer allocated by operator new. And finally:
013B1093 push esi
013B1094 push dword ptr ds:[13B3034h]
013B109A call std::operator<<<std::char_traits<char> > (013B10F0h)
Here esi is printed. So, the answer to your question: the buffer variable is kept in the esi CPU register. You can add the line delete[] buffer; to the program and see how operator delete is applied to esi in the assembly.
Since the whole loop doesn't contain function calls that could change CPU registers, the optimized code produced by the compiler simply keeps the buffer in a register. The debugger doesn't know this and cannot display it.
The x64 assembly works the same way, but it is more complicated and requires more time to understand. I hope you now have an idea of what happens.
The compiler is smart enough to see that you are not doing anything with the buffer, so it simply optimized it away in Release mode.
std::string, on the other hand, comes from a library, and it is harder to detect that assigning to or reading from it has no side effects. That's why the compiler didn't remove it.
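If you do need to watch such a pointer in an optimized build, a common workaround (my suggestion, not part of either answer) is to force the pointer itself into memory, for example by declaring the pointer object volatile, so the compiler cannot keep it purely in a register:

#include <iostream>
#include <string>

int main()
{
    std::basic_string<char> sBasicString = "basic_string";
    // 'volatile' applies to the pointer object itself, so the compiler must keep it
    // in memory and the debugger can watch it even in a Release build.
    char* volatile buffer = new char[1000];
    for (size_t i = 0; i < sBasicString.size(); ++i)
        buffer[i] = sBasicString[i];
    std::cout << buffer[0] << std::endl;   // use the data so it is not optimized away
    delete[] buffer;
    return 0;
}

Disabling optimization for the file you are debugging (/Od) achieves the same thing without touching the code, at the cost of slower generated code.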

C++ inline assembler help (error C2400)

I'm using Visual Studio 2010. I wrote a program that does a simple binary search. I'm trying to convert it into assembly code: I used the disassembler to get the assembly, and I'm trying to paste it into an _asm block. I've tried so many ways and it's just not working.
I tried
_asm(" . . . .");
_asm( );
_asm{ } <--- currently going with this one for C++; it seems to work best.
I've seen people online saying to put '\' at the end of each line. That hasn't worked for me.
Here's the code. I'll comment where the errors are. Well, I have 13 as of now; some I won't list because they're the same as other errors. Once I fix one or two I should be able to fix them all. The original C++ code for the function is also in there; it's commented out.
bool binarySearch(int searchNum,int myArray[],int size){
_asm{
push ebp
mov ebp,esp
sub esp,0F0h
push ebx
push esi
push edi
lea edi,[ebp-0F0h]
mov ecx,3Ch
mov eax,0CCCCCCCCh
rep stos dword ptr es:[edi]
// 217: int first=0,last=size-1,middle;
mov dword ptr [first],0
mov eax,dword ptr [size] // ERROR! Error 2 error C2400: inline assembler syntax error in 'second operand'; found ']'
sub eax,1
mov dword ptr [last],eax
// 218: bool found = false;
mov byte ptr [found],0
// 220: while (first <= last)
mov eax,dword ptr [first]
cmp eax,dword ptr [last]
jg binarySearch+80h (0B51970h) //ERROR! Error 4 error C2400: inline assembler syntax error in 'second operand'; found '('
// 222: middle = (first + last)/2;
mov eax,dword ptr [first] ; // Error 5 error C2400: inline assembler syntax error in 'opcode'; found '('
add eax,dword ptr [last] ;
cdq
sub eax,edx
sar eax,1
mov dword ptr [middle],eax
// 224: if(searchNum > myArray[middle])
mov eax,dword ptr [middle]
mov ecx,dword ptr [myArray]
mov edx,dword ptr [searchNum]
cmp edx,dword ptr [ecx+eax*4]
jle binarySearch+61h (0B51951h) // Error 8 error C2400: inline assembler syntax error in 'opcode'; found '('
// 226: first = middle +1;
mov eax,dword ptr [middle]
add eax,1
mov dword ptr [first],eax
jmp binarySearch+7Eh (0B5196Eh)
// 228: else if (searchNum < myArray[middle])
mov eax,dword ptr [middle]
mov ecx,dword ptr [myArray]
mov edx,dword ptr [searchNum]
cmp edx,dword ptr [ecx+eax*4]
jge binarySearch+7Ah (0B5196Ah)
// 230: last = middle -1;
mov eax,dword ptr [middle]
sub eax,1
mov dword ptr [last],eax
// 232: else
jmp binarySearch+7Eh (0B5196Eh) // Error 18 error C2400: inline assembler syntax error in 'second operand'; found '('
// 233: return true;
mov al,1 // Error 19 error C2400: inline assembler syntax error in 'opcode'; found '('
jmp binarySearch+82h (0B51972h)
jmp binarySearch+32h (0B51922h) // Error 22 error C2400: inline assembler syntax error in 'opcode'; found '('
// 236: return false;
xor al,al
pop edi
pop esi
pop ebx
mov esp,ebp
pop ebp
ret
};
/*
int first=0,last=size-1,middle;
bool found = false;
while (first <= last)
{
middle = (first + last)/2;
if(searchNum > myArray[middle])
{
first = middle +1;
}
else if (searchNum < myArray[middle])
{
last = middle -1;
}
else
return true;
}
return false;
*/
}
Here is the (almost 1:1) working version of the code you posted, both as an inline-assembly function and as a standalone assembly file.
binsearch.cpp
extern "C"
{
bool BinSearch(int searchNum, int myArray[], int arraySize);
};
// This is the inlined version.
bool BinSearchInline(int searchNum, int myArray[], int arraySize)
{
int middle;
int first;
int last;
char found;
_asm
{
push ebx
push esi
push edi
mov first,0
mov eax, arraySize
sub eax,1
mov last ,eax
mov found,0
LocalLoop:
mov eax, first
cmp eax, last
jg NotFound
mov eax, first
add eax, last
cdq
sub eax,edx
sar eax,1
mov middle,eax
mov eax,middle
mov ecx,myArray
mov edx,searchNum
cmp edx, dword ptr [ecx+eax*4]
jle MaybeLower
mov eax, middle
add eax,1
mov first, eax
jmp WhileLoop
MaybeLower:
mov eax, middle
mov ecx, myArray
mov edx, searchNum
cmp edx,dword ptr [ecx+eax*4]
jge Found
mov eax, middle
sub eax,1
mov last, eax
jmp WhileLoop
Found:
mov al,1
jmp Done
WhileLoop:
jmp LocalLoop
NotFound:
xor al,al
Done:
pop edi
pop esi
pop ebx
};
}
int main(int argc, char* arg[])
{
    int testvalues[7];
    for(int i = 0; i < 7; i++)
        testvalues[i] = i;
    bool b = BinSearch(8, testvalues, 7);      // false, value not in array
    b = BinSearch(3, testvalues, 7);           // true, value is in array.
    b = BinSearchInline(8, testvalues, 7);     // false
    b = BinSearchInline(3, testvalues, 7);     // true
    return 0;
}
binsearch.asm
.486
.model flat, C
option casemap :none
.code
BinSearch PROC, searchNum:DWORD, myArray:PTR DWORD, arraySize:DWORD
LOCAL first:DWORD
LOCAL middle:DWORD
LOCAL last:DWORD
LOCAL found:BYTE
push ebx
push esi
push edi
; This block is only for debugging stack errors and should be removed.
; lea edi,[ebp-0F0h]
; mov ecx,3Ch
; mov eax,0CCCCCCCCh
; rep stos dword ptr es:[edi]
mov dword ptr [first],0
mov eax,dword ptr [arraySize]
sub eax,1
mov dword ptr [last],eax
mov byte ptr [found],0 ; not even used.
##Loop:
mov eax,dword ptr [first]
cmp eax,dword ptr [last]
jg ##NotFound
mov eax,dword ptr [first]
add eax,dword ptr [last]
cdq
sub eax,edx
sar eax,1
mov dword ptr [middle],eax
mov eax,dword ptr [middle]
mov ecx,dword ptr [myArray]
mov edx,dword ptr [searchNum]
cmp edx,dword ptr [ecx+eax*4]
jle ##MaybeLower
mov eax,dword ptr [middle]
add eax,1
mov dword ptr [first],eax
jmp ##WhileLoop
##MaybeLower:
mov eax,dword ptr [middle]
mov ecx,dword ptr [myArray]
mov edx,dword ptr [searchNum]
cmp edx,dword ptr [ecx+eax*4]
jge ##Found
mov eax,dword ptr [middle]
sub eax,1
mov dword ptr [last],eax
jmp ##WhileLoop
##Found:
mov al,1
jmp ##Done
##WhileLoop:
jmp ##Loop
##NotFound:
xor al,al
##Done:
pop edi
pop esi
pop ebx
ret
BinSearch ENDP
END

Why does adding extra check in loop make big difference on some machines, and small difference on others?

I have been doing some testing to see how much of a difference additional bounds checking makes in loops. This is prompted by thinking about the cost of implicit bounds checking inserted by languages such as C#, Java etc, when you access arrays.
Update: I have tried the same executable on several additional computers, which throws a lot more light on what is happening. I've listed the original computer first and my modern laptop second. On my modern laptop, adding additional checks in the loop adds only between 1 and 4% to the time taken, compared to between 3 and 30% for the original hardware.
Processor x86 Family 6 Model 30 Stepping 5 GenuineIntel ~2793 Mhz
Ratio 2 checks : 1 check = 1.0310
Ratio 3 checks : 1 check = 1.2769
Processor Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz, 2301 Mhz, 4 Core(s), 8 Logical Processor(s)
Ratio 2 checks : 1 check = 1.0090
Ratio 3 checks : 1 check = 1.0393
Processor Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 Core(s)
Ratio 2 checks : 1 check = 1.0035
Ratio 3 checks : 1 check = 1.0639
Processor Intel(R) Core(TM)2 Duo CPU T9300 @ 2.50GHz, 2501 Mhz, 2 Core(s), 2 Logical Processor(s)
Ratio 2 checks : 1 check = 1.1195
Ratio 3 checks : 1 check = 1.3597
Processor x86 Family 15 Model 43 Stepping 1 AuthenticAMD ~2010 Mhz
Ratio 2 checks : 1 check = 1.0776
Ratio 3 checks : 1 check = 1.1451
In the test program below, the first function checks just one bound, the second function checks two, and the third checks three (in the calling code, n1=n2=n3). I found that the ratio of two checks to one was about 1.03, and the ratio of three checks to one was about 1.3. I was surprised that adding one more check made such a difference to performance. I got an interesting answer concerning the low cost of bounds checking on modern processors to my original question, which may throw some light on the differences observed here.
Note that it's important to compile the program without whole program optimization turned on; otherwise the compiler can simply remove the additional bounds checking.
// dotprod.cpp
#include "dotprod.h"
double SumProduct(const double* v1, const double* v2, int n)
{
    double sum=0;
    for(int i=0;
        i<n;
        ++i)
        sum += v1[i]*v2[i];
    return sum;
}
double SumProduct(const double* v1, const double* v2, int n1, int n2)
{
    double sum=0;
    for(int i=0;
        i<n1 && i <n2;
        ++i)
        sum += v1[i]*v2[i];
    return sum;
}
double SumProduct(const double* v1, const double* v2, int n1, int n2, int n3)
{
    double sum=0;
    for(int i=0;
        i<n1 && i <n2 && i <n3;
        ++i)
        sum += v1[i]*v2[i];
    return sum;
}
This code was originally built using Visual Studio 2010, Release, Win32 (I've added the 'C' tag because the reasoning behind the difference in speed is not likely to be C++ specific, and may not be Windows specific). Can anyone explain it?
Rest of the code below, for information. This has some C++ specific stuff in it.
Header file
// dotprod.h
double SumProduct(const double*, const double*, int n);
double SumProduct(const double*, const double*, int n1, int n2);
double SumProduct(const double*, const double*, int n1, int n2, int n3);
Test harness
// main.cpp
#include <stdio.h>
#include <math.h>
#include <numeric>
#include <vector>
#include <windows.h>
#include "../dotprod/dotprod.h" // separate lib
typedef __int64 timecount_t;
inline timecount_t GetTimeCount()
{
    LARGE_INTEGER li;
    if (!QueryPerformanceCounter(&li)) {
        exit(1);
    }
    return li.QuadPart;
}
int main()
{
    typedef std::vector<double> dvec;
    const int N = 100 * 1000;
    // Initialize
    dvec v1(N);
    dvec v2(N);
    dvec dp1(N);
    dvec dp2(N);
    dvec dp3(N);
    for(int i=0; i<N; ++i) {
        v1[i] = i;
        v2[i] = log(static_cast<double>(i+1));
    }
    const timecount_t t0 = GetTimeCount();
    // Check cost with one bound
    for(int n=0; n<N; ++n) {
        dp1[n] = SumProduct(&(v1[0]),&(v2[0]),n);
    }
    const timecount_t t1 = GetTimeCount();
    // Check cost with two bounds
    for(int n=0; n<N; ++n) {
        dp2[n] = SumProduct(&(v1[0]),&(v2[0]),n,n);
    }
    const timecount_t t2 = GetTimeCount();
    // Check cost with three bounds
    for(int n=0; n<N; ++n) {
        dp3[n] = SumProduct(&(v1[0]),&(v2[0]),n,n,n);
    }
    const timecount_t t3 = GetTimeCount();
    // Check results
    const double sumSumProducts1 = std::accumulate(dp1.begin(), dp1.end(), 0.0);
    const double sumSumProducts2 = std::accumulate(dp2.begin(), dp2.end(), 0.0);
    const double sumSumProducts3 = std::accumulate(dp3.begin(), dp3.end(), 0.0);
    printf("Sums of dot products: %.1f, %.1f, %.1f\n", sumSumProducts1, sumSumProducts2, sumSumProducts3);
    // Output timings
    const timecount_t elapsed1 = t1-t0;
    const timecount_t elapsed2 = t2-t1;
    const timecount_t elapsed3 = t3-t2;
    printf("Elapsed: %.0f, %.0f, %.0f\n",
        static_cast<double>(elapsed1),
        static_cast<double>(elapsed2),
        static_cast<double>(elapsed3));
    const double ratio2to1 = elapsed2 / static_cast<double>(elapsed1);
    const double ratio3to1 = elapsed3 / static_cast<double>(elapsed1);
    printf("Ratio 2:1=%.2f\n", ratio2to1);
    printf("Ratio 3:1=%.2f\n", ratio3to1);
    return 0;
}
In order to produce assembly, I took the advice in this answer (case 2, turning off whole program optimization), producing the following asm file.
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\dev\TestSpeed\dotprod\dotprod.cpp
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB OLDNAMES
PUBLIC __real#0000000000000000
PUBLIC ?SumProduct##YANPBN0HHH#Z ; SumProduct
EXTRN __fltused:DWORD
; COMDAT __real#0000000000000000
; File c:\dev\testspeed\dotprod\dotprod.cpp
CONST SEGMENT
__real#0000000000000000 DQ 00000000000000000r ; 0
; Function compile flags: /Ogtp
CONST ENDS
; COMDAT ?SumProduct##YANPBN0HHH#Z
_TEXT SEGMENT
tv491 = -4 ; size = 4
_v1$ = 8 ; size = 4
_v2$ = 12 ; size = 4
_n1$ = 16 ; size = 4
_n2$ = 20 ; size = 4
_n3$ = 24 ; size = 4
?SumProduct##YANPBN0HHH#Z PROC ; SumProduct, COMDAT
; 25 : {
push ebp
mov ebp, esp
push ecx
; 26 : double sum=0;
fldz
push ebx
mov ebx, DWORD PTR _v2$[ebp]
push esi
push edi
mov edi, DWORD PTR _n1$[ebp]
; 27 : for(int i=0;
xor ecx, ecx
; 28 : i<n1 && i <n2 && i <n3;
; 29 : ++i)
cmp edi, 4
jl $LC8#SumProduct
; 26 : double sum=0;
mov edi, DWORD PTR _v1$[ebp]
lea esi, DWORD PTR [edi+24]
; 30 : sum += v1[i]*v2[i];
sub edi, ebx
lea edx, DWORD PTR [ecx+2]
lea eax, DWORD PTR [ebx+8]
mov DWORD PTR tv491[ebp], edi
$LN15#SumProduct:
; 28 : i<n1 && i <n2 && i <n3;
; 29 : ++i)
mov ebx, DWORD PTR _n2$[ebp]
cmp ecx, ebx
jge $LN9#SumProduct
cmp ecx, DWORD PTR _n3$[ebp]
jge $LN9#SumProduct
; 30 : sum += v1[i]*v2[i];
fld QWORD PTR [eax-8]
lea edi, DWORD PTR [edx-1]
fmul QWORD PTR [esi-24]
faddp ST(1), ST(0)
cmp edi, ebx
jge SHORT $LN9#SumProduct
; 28 : i<n1 && i <n2 && i <n3;
; 29 : ++i)
cmp edi, DWORD PTR _n3$[ebp]
jge SHORT $LN9#SumProduct
; 30 : sum += v1[i]*v2[i];
mov edi, DWORD PTR tv491[ebp]
fld QWORD PTR [edi+eax]
fmul QWORD PTR [eax]
faddp ST(1), ST(0)
cmp edx, ebx
jge SHORT $LN9#SumProduct
; 28 : i<n1 && i <n2 && i <n3;
; 29 : ++i)
cmp edx, DWORD PTR _n3$[ebp]
jge SHORT $LN9#SumProduct
; 30 : sum += v1[i]*v2[i];
fld QWORD PTR [eax+8]
lea edi, DWORD PTR [edx+1]
fmul QWORD PTR [esi-8]
faddp ST(1), ST(0)
cmp edi, ebx
jge SHORT $LN9#SumProduct
; 28 : i<n1 && i <n2 && i <n3;
; 29 : ++i)
cmp edi, DWORD PTR _n3$[ebp]
jge SHORT $LN9#SumProduct
; 30 : sum += v1[i]*v2[i];
fld QWORD PTR [eax+16]
mov edi, DWORD PTR _n1$[ebp]
fmul QWORD PTR [esi]
add ecx, 4
lea ebx, DWORD PTR [edi-3]
add eax, 32 ; 00000020H
add esi, 32 ; 00000020H
faddp ST(1), ST(0)
add edx, 4
cmp ecx, ebx
jl SHORT $LN15#SumProduct
mov ebx, DWORD PTR _v2$[ebp]
$LC8#SumProduct:
; 28 : i<n1 && i <n2 && i <n3;
; 29 : ++i)
cmp ecx, edi
jge SHORT $LN9#SumProduct
mov edx, DWORD PTR _v1$[ebp]
lea eax, DWORD PTR [ebx+ecx*8]
sub edx, ebx
$LC3#SumProduct:
cmp ecx, DWORD PTR _n2$[ebp]
jge SHORT $LN9#SumProduct
cmp ecx, DWORD PTR _n3$[ebp]
jge SHORT $LN9#SumProduct
; 30 : sum += v1[i]*v2[i];
fld QWORD PTR [eax+edx]
inc ecx
fmul QWORD PTR [eax]
add eax, 8
faddp ST(1), ST(0)
cmp ecx, edi
jl SHORT $LC3#SumProduct
$LN9#SumProduct:
; 31 : return sum;
; 32 : }
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
?SumProduct##YANPBN0HHH#Z ENDP ; SumProduct
_TEXT ENDS
PUBLIC ?SumProduct##YANPBN0HH#Z ; SumProduct
; Function compile flags: /Ogtp
; COMDAT ?SumProduct##YANPBN0HH#Z
_TEXT SEGMENT
tv448 = -4 ; size = 4
_v1$ = 8 ; size = 4
_v2$ = 12 ; size = 4
_n1$ = 16 ; size = 4
_n2$ = 20 ; size = 4
?SumProduct##YANPBN0HH#Z PROC ; SumProduct, COMDAT
; 15 : {
push ebp
mov ebp, esp
push ecx
; 16 : double sum=0;
fldz
push ebx
mov ebx, DWORD PTR _v2$[ebp]
push esi
push edi
mov edi, DWORD PTR _n1$[ebp]
; 17 : for(int i=0;
xor ecx, ecx
; 18 : i<n1 && i <n2;
; 19 : ++i)
cmp edi, 4
jl SHORT $LC8#SumProduct#2
; 16 : double sum=0;
mov edi, DWORD PTR _v1$[ebp]
lea edx, DWORD PTR [edi+24]
; 20 : sum += v1[i]*v2[i];
sub edi, ebx
lea esi, DWORD PTR [ecx+2]
lea eax, DWORD PTR [ebx+8]
mov DWORD PTR tv448[ebp], edi
$LN19#SumProduct#2:
mov edi, DWORD PTR _n2$[ebp]
cmp ecx, edi
jge SHORT $LN9#SumProduct#2
fld QWORD PTR [eax-8]
lea ebx, DWORD PTR [esi-1]
fmul QWORD PTR [edx-24]
faddp ST(1), ST(0)
cmp ebx, edi
jge SHORT $LN9#SumProduct#2
mov ebx, DWORD PTR tv448[ebp]
fld QWORD PTR [ebx+eax]
fmul QWORD PTR [eax]
faddp ST(1), ST(0)
cmp esi, edi
jge SHORT $LN9#SumProduct#2
fld QWORD PTR [eax+8]
lea ebx, DWORD PTR [esi+1]
fmul QWORD PTR [edx-8]
faddp ST(1), ST(0)
cmp ebx, edi
jge SHORT $LN9#SumProduct#2
fld QWORD PTR [eax+16]
mov edi, DWORD PTR _n1$[ebp]
fmul QWORD PTR [edx]
add ecx, 4
lea ebx, DWORD PTR [edi-3]
add eax, 32 ; 00000020H
add edx, 32 ; 00000020H
faddp ST(1), ST(0)
add esi, 4
cmp ecx, ebx
jl SHORT $LN19#SumProduct#2
mov ebx, DWORD PTR _v2$[ebp]
$LC8#SumProduct#2:
; 18 : i<n1 && i <n2;
; 19 : ++i)
cmp ecx, edi
jge SHORT $LN9#SumProduct#2
mov edx, DWORD PTR _v1$[ebp]
lea eax, DWORD PTR [ebx+ecx*8]
sub edx, ebx
$LC3#SumProduct#2:
cmp ecx, DWORD PTR _n2$[ebp]
jge SHORT $LN9#SumProduct#2
; 20 : sum += v1[i]*v2[i];
fld QWORD PTR [eax+edx]
inc ecx
fmul QWORD PTR [eax]
add eax, 8
faddp ST(1), ST(0)
cmp ecx, edi
jl SHORT $LC3#SumProduct#2
$LN9#SumProduct#2:
; 21 : return sum;
; 22 : }
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
?SumProduct##YANPBN0HH#Z ENDP ; SumProduct
_TEXT ENDS
PUBLIC ?SumProduct##YANPBN0H#Z ; SumProduct
; Function compile flags: /Ogtp
; COMDAT ?SumProduct##YANPBN0H#Z
_TEXT SEGMENT
_v1$ = 8 ; size = 4
_v2$ = 12 ; size = 4
?SumProduct##YANPBN0H#Z PROC ; SumProduct, COMDAT
; _n$ = eax
; 5 : {
push ebp
mov ebp, esp
mov edx, DWORD PTR _v2$[ebp]
; 6 : double sum=0;
fldz
push ebx
push esi
mov esi, eax
; 7 : for(int i=0;
xor ebx, ebx
push edi
mov edi, DWORD PTR _v1$[ebp]
; 8 : i<n;
; 9 : ++i)
cmp esi, 4
jl SHORT $LC9#SumProduct#3
; 6 : double sum=0;
lea eax, DWORD PTR [edx+8]
lea ecx, DWORD PTR [edi+24]
; 10 : sum += v1[i]*v2[i];
sub edi, edx
lea edx, DWORD PTR [esi-4]
shr edx, 2
inc edx
lea ebx, DWORD PTR [edx*4]
$LN10#SumProduct#3:
fld QWORD PTR [eax-8]
add eax, 32 ; 00000020H
fmul QWORD PTR [ecx-24]
add ecx, 32 ; 00000020H
dec edx
faddp ST(1), ST(0)
fld QWORD PTR [edi+eax-32]
fmul QWORD PTR [eax-32]
faddp ST(1), ST(0)
fld QWORD PTR [eax-24]
fmul QWORD PTR [ecx-40]
faddp ST(1), ST(0)
fld QWORD PTR [eax-16]
fmul QWORD PTR [ecx-32]
faddp ST(1), ST(0)
jne SHORT $LN10#SumProduct#3
; 6 : double sum=0;
mov edx, DWORD PTR _v2$[ebp]
mov edi, DWORD PTR _v1$[ebp]
$LC9#SumProduct#3:
; 8 : i<n;
; 9 : ++i)
cmp ebx, esi
jge SHORT $LN8#SumProduct#3
sub edi, edx
lea eax, DWORD PTR [edx+ebx*8]
sub esi, ebx
$LC3#SumProduct#3:
; 10 : sum += v1[i]*v2[i];
fld QWORD PTR [eax+edi]
add eax, 8
dec esi
fmul QWORD PTR [eax-8]
faddp ST(1), ST(0)
jne SHORT $LC3#SumProduct#3
$LN8#SumProduct#3:
; 11 : return sum;
; 12 : }
pop edi
pop esi
pop ebx
pop ebp
ret 0
?SumProduct##YANPBN0H#Z ENDP ; SumProduct
_TEXT ENDS
END
One big difference between CPUs is pipeline optimization.
The CPU can execute several instructions in parallel until it reaches a conditional branch. From that point, instead of waiting until all the instructions have been executed, the CPU can speculatively continue down one branch in parallel until the condition is available and ready to be evaluated. If the assumption was correct, we have a gain; otherwise the CPU has to go back and take the other branch.
So the tricky part for a CPU is to make the best assumptions and to execute as many instructions in parallel as possible.
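To connect that back to the code in the question: since the bounds never change inside the loop, the extra compares and branches can be folded into a single limit before the loop, leaving only one conditional branch per iteration for the pipeline to predict. A sketch of that idea (my illustration, not from the original answer):

#include <algorithm>

// Behaves like the three-bound SumProduct from the question, but the three
// limits are reduced to one combined bound outside the loop.
double SumProductHoisted(const double* v1, const double* v2, int n1, int n2, int n3)
{
    double sum = 0;
    const int n = std::min(n1, std::min(n2, n3));   // fold the bounds once, up front
    for (int i = 0; i < n; ++i)                     // one compare/branch per iteration
        sum += v1[i] * v2[i];
    return sum;
}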

Fully Array alternative

When I tried to create my own alternative to a classic array, I saw that one additional instruction appeared in the disassembly: mov edx,dword ptr [myarray]. Why was this additional instruction added?
I want to use the functionality of my alternative, but I do not want to lose performance! How can I resolve this? Every processor cycle is important for this application.
For example:
for (unsigned i = 0; i < 10; ++i)
{
    array1[i] = i;
    array2[i] = 10 - i;
}
Assembly (classic int arrays):
mov edx, dword ptr [ebp-480h]
mov eax, dword ptr [ebp-480h]
mov dword ptr array1[edx*4], eax
mov ecx, 10
sub ecx, dword ptr [ebp-480h]
mov edx, dword ptr [ebp-480h]
mov dword ptr array2[edx*4], ecx
Assembly (my class):
mov edx,dword ptr [array1]
mov eax,dword ptr [ebp-43Ch]
mov ecx,dword ptr [ebp-43Ch]
mov dword ptr [edx+eax*4], ecx
mov edx, 10
sub edx, dword ptr [ebp-43Ch]
mov eax, dword ptr [array2]
mov ecx, dword ptr [ebp-43Ch]
mov dword ptr [eax+ecx*4], edx
One extra instruction is not a loss of performance on today's processors. I would not worry about it, and instead suggest you read Coding Horror's article on micro-optimization.
However, that instruction is just loading the array's base address (the value stored in myarray, i.e. the address of element 0) into edx so it can be used.
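To illustrate where that load comes from (my sketch, not part of the original answer): a wrapper that holds a pointer to heap storage forces the compiler to load that pointer before it can index the data, while a wrapper with inline storage, in the spirit of std::array, indexes directly off the object just like a classic array:

#include <cstddef>

// Pointer-holding wrapper: operator[] must first load 'data' from the object
// (the extra "mov edx,dword ptr [myarray]"), then index through it.
struct HeapArray {
    int* data;
    explicit HeapArray(std::size_t n) : data(new int[n]) {}
    ~HeapArray() { delete[] data; }
    int& operator[](std::size_t i) { return data[i]; }
};

// Inline-storage wrapper: the elements live inside the object itself,
// so indexing needs no extra pointer load.
template <std::size_t N>
struct InlineArray {
    int data[N];
    int& operator[](std::size_t i) { return data[i]; }
};

In an optimized build the compiler will usually hoist the pointer load out of the loop anyway, so the difference mainly shows up in unoptimized listings like the ones above.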