Local variable vs. array access - C++

Which of these would be more computationally efficient, and why?
A) Repeated array access:
for(i=0; i<numbers.length; i++) {
    result[i] = numbers[i] * numbers[i] * numbers[i];
}
B) Setting a local variable:
for(i=0; i<numbers.length; i++) {
    int n = numbers[i];
    result[i] = n * n * n;
}
Wouldn't the repeated-array-access version have to compute the address each time (via pointer arithmetic), making the first option slower because it is effectively doing this?
for(i=0; i<numbers.length; i++) {
    result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}

Any sufficiently sophisticated compiler will generate the same code for all three versions. I turned them into a small C program (with one minor adjustment: I changed the numbers.length access into a macro invocation that yields the length of an array):
#include <stddef.h>

size_t i;
static const int numbers[] = { 0, 1, 2, 4, 5, 6, 7, 8, 9 };
#define ARRAYLEN(x) (sizeof((x)) / sizeof(*(x)))
static int result[ARRAYLEN(numbers)];

void versionA(void)
{
    for(i=0; i<ARRAYLEN(numbers); i++) {
        result[i] = numbers[i] * numbers[i] * numbers[i];
    }
}

void versionB(void)
{
    for(i=0; i<ARRAYLEN(numbers); i++) {
        int n = numbers[i];
        result[i] = n * n * n;
    }
}

void versionC(void)
{
    for(i=0; i<ARRAYLEN(numbers); i++) {
        result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
    }
}
I then compiled it using optimizations (and debug symbols, for prettier disassembly) with Visual Studio 2012:
C:\Temp>cl /Zi /O2 /Wall /c so19244189.c
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
so19244189.c
Finally, here's the disassembly:
C:\Temp>dumpbin /disasm so19244189.obj
[..]
_versionA:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 00 mov dword ptr _result[eax*4],edx
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionB:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 00 mov dword ptr _result[eax*4],edx
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionC:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 00 mov dword ptr _result[eax*4],edx
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
Note how the assembly is exactly the same in all three cases. So the correct answer to your question
Which of these would be more computationally efficient, and why?
for this compiler is: mu. The question cannot be answered as posed, because it rests on an incorrect assumption: none of the versions is faster than any other.

The theoretical answer:
A reasonably good optimizing compiler should convert version A to version B, and perform only one load from memory. There should be no performance difference if optimization is enabled.
If optimization is disabled, version A will be slower, because the address must be computed 3 times and there are 3 memory loads (2 of them are cached and very fast, but it's still slower than reusing a register).
In practice, the answer will depend on your compiler, and you should check this by benchmarking.
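If you do want to measure it, a minimal timing harness might look like the sketch below (my addition, not part of the original answer). It assumes it is appended to the C listing above, so that versionA and versionB are visible; the iteration count is an arbitrary large value:
#include <stdio.h>
#include <time.h>

static double time_it(void (*f)(void), long iterations)
{
    clock_t start = clock();
    for (long k = 0; k < iterations; k++)
        f();                              /* run the version under test */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    const long iters = 10000000L;         /* arbitrary; large enough to measure */
    printf("versionA: %f s\n", time_it(versionA, iters));
    printf("versionB: %f s\n", time_it(versionB, iters));
    return 0;
}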

It depends on the compiler, but all three should end up the same.
First, let's look at case B: a smart compiler will generate code that loads the value into a register only once, so it doesn't matter whether you use an extra variable or not; the compiler emits one mov and keeps the value in a register. So B is the same as A.
Now let's compare A and C. We should look at how operator [] is defined: a[b] is by definition *(a + b), so *(numbers + i) is the same as numbers[i], which means cases A and C are the same.
So we have (A==B) && (A==C), all in all (A==B==C), if you know what I mean :).
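A small illustration of that subscript identity (my addition, not part of the original answer):
#include <assert.h>

int main(void)
{
    static const int numbers[] = { 10, 20, 30 };
    assert(numbers[1] == *(numbers + 1)); /* subscripting is pointer arithmetic... */
    assert(numbers[1] == 1[numbers]);     /* ...and therefore even commutes */
    return 0;
}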

Related

Compiler choice of not using REP MOVSB instruction for a byte array move

I'm checking the Release build of my project, compiled with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to build the following code snippet:
// ncbSzBuffDataUsed is of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: First up, there's an intrinsic for rep movsb which, Peter Cordes tells us, would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to use it, see __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
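For illustration, a minimal sketch of using it (my code, assuming MSVC and <intrin.h>; the pointer names mirror the question):
#include <intrin.h>
#include <stddef.h>

void copyWithMovsb(unsigned char* pDst, const unsigned char* pSrc, size_t count)
{
    __movsb(pDst, pSrc, count); /* compiles down to rep movsb */
}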
As to why the compiler didn't do this for you: in the absence of any other ideas, the answer might be register pressure. To use rep movsb the compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it would have used up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function, it will; and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) is also likely to affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
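For completeness, a minimal sketch (my addition, not the answer's) of how one could rule out overlap explicitly; __restrict is the MSVC spelling, and GCC/Clang accept it too. With that no-aliasing promise, the compiler is free to vectorise the copy:
#include <stddef.h>

void copyBytes(unsigned char* __restrict pDst,
               const unsigned char* __restrict pSrc,
               size_t n)
{
    /* With no possible aliasing, the compiler may turn this loop
       into SIMD moves or a memcpy call. */
    for (size_t i = 0; i < n; i++)
        pDst[i] = pSrc[i];
}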
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64

#pragma pack(push)
#pragma pack(1)
typedef struct {
    ULONG64 ui0;
    USHORT w0;
    USHORT w1;
    // Followed by:
    //   PCALLBACK[] 'array' - variable-size array
} DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
    pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned elsewhere in the answers on this page, the compiler was clearly starved of registers, producing the following monstrosity (see how it reuses rax for the loop end limit, or the movzx eax,word ptr [r13] instruction that could clearly have been hoisted out of the loop):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
    pDstClbks < pEndDstClbks;
    pScrClbks++, pDstClbks++)
{
    *pDstClbks = *pScrClbks;
}
this produces a more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w

Memory allocation for local array when only one entry/index is initialized?

I'm learning C++ from the basics (using Visual Studio Community 2015).
While working on arrays I came across the following:
int main()
{
    int i[10] = {};
}
The assembly code for this is:
18: int i[10] = {};
008519CE 33 C0 xor eax,eax
008519D0 89 45 D4 mov dword ptr [ebp-2Ch],eax
008519D3 89 45 D8 mov dword ptr [ebp-28h],eax
008519D6 89 45 DC mov dword ptr [ebp-24h],eax
008519D9 89 45 E0 mov dword ptr [ebp-20h],eax
008519DC 89 45 E4 mov dword ptr [ebp-1Ch],eax
008519DF 89 45 E8 mov dword ptr [ebp-18h],eax
008519E2 89 45 EC mov dword ptr [ebp-14h],eax
008519E5 89 45 F0 mov dword ptr [ebp-10h],eax
008519E8 89 45 F4 mov dword ptr [ebp-0Ch],eax
008519EB 89 45 F8 mov dword ptr [ebp-8],eax
Here, since an initialiser is used, every int is initialised to 0 (via xor eax,eax). This is clear.
From what I have learnt, a variable is allocated memory only if it is used (at least with modern compilers), and if any one element of an array is initialised, the complete array should be allocated memory, as follows:
int main()
{
    int i[10];
    i[0] = 20;
    int j = 20;
}
Assembly generated:
18: int i[10];
19: i[0] = 20;
00A319CE B8 04 00 00 00 mov eax,4
00A319D3 6B C8 00 imul ecx,eax,0
00A319D6 C7 44 0D D4 14 00 00 00 mov dword ptr [ebp+ecx-2Ch],14h
20: int j = 20;
00A319DE C7 45 C8 14 00 00 00 mov dword ptr [ebp-38h],14h
Here, the compiler used 4 bytes (to copy the value 20 to i[0]), but from what I have learnt the memory for the entire array should be allocated at line 19. Yet the compiler hasn't produced any machine code for that. So where does it store the information that the remaining memory (for the other nine elements, i[1] to i[9]) cannot be used by other variables?
Please help!

declaring a variable many times in the same code in Rcpp (C++)

For an R user who has begun using Rcpp, declaring variables is a new thing. My question is: what actually happens when a variable with the same name is declared many times? In many examples, I see the indices of for loops declared each time:
cppFunction('
int add1( const int n ){
    int y = 0;
    for(int i=0; i<n; i++){
        for(int j=0; j<n; j++) y++;
        for(int j=0; j<(n*2); j++) y++;
    }
    return y ;
}
')
instead of
cppFunction('
int add2( const int n ){
    int y = 0;
    int i, j;
    for(i=0; i<n; i++){
        for(j=0; j<n; j++) y++;
        for(j=0; j<(n*2); j++) y++;
    }
    return y ;
}
')
Both seem to give the same answer. But is it generally OK to declare a variable (of the same name) many times in the same program? If not, when is it not OK? Or maybe I don't understand what 'declare' means, and, e.g., the two functions above are identical (nothing is declared many times even in the first function).
Overview
Alrighty, let's take a looksie at the assembly code after a compiler has transformed both statements. The compiler in this case should ideally provide the same optimization (we may want to run with the -O2 flag).
Test Case
I've written up your file using pure C++. That is, I've opted to compile directly from the terminal rather than rely on the Rcpp black magic that slips in #include <Rcpp.h> during every compilation.
test.cpp
#include <iostream>

int add2( const int n ){
    int y = 0;
    int i, j;
    for(i=0; i<n; i++){
        for(j=0; j<n; j++) y++;
        for(j=0; j<(n*2); j++) y++;
    }
    return y ;
}

int add1( const int n ){
    int y = 0;
    for(int i=0; i<n; i++){
        for(int j=0; j<n; j++) y++;
        for(int j=0; j<(n*2); j++) y++;
    }
    return y ;
}

int main(){
    std::cout << add1(2) << std::endl;
    std::cout << add2(2) << std::endl;
}
Decomposing the Binary
To see how the C++ code was translated into assembly, I've opted to use objdump over the built-in otool on macOS. (Someone is more than welcome to provide that output as well.)
In macOS, I did:
gcc -g -c test.cpp
# brew install binutils # required for (g)objdump
gobjdump -d -M intel -S test.o
This gives the following annotated output that I've chunked at the end of the post. In a nutshell, the assembly for both versions is exactly the same.
Benchmarks are King
Another way to verify would be to do a simple microbenchmark. If there was significant difference between the two, that would provide evidence to suggest different optimizations.
# install.packages("microbenchmark")
library("microbenchmark")
microbenchmark(a = add1(100L), b = add2(100L))
Gives:
Unit: microseconds
 expr    min     lq     mean median      uq     max neval
    a 53.081 53.268 55.35613 53.576 53.8825  92.078   100
    b 53.069 53.261 56.28195 53.431 53.6795 169.841   100
Switching the order:
microbenchmark(b = add2(100L), a = add1(100L))
Gives:
Unit: microseconds
 expr    min      lq     mean  median      uq     max neval
    b 53.112 53.3215 60.14641 55.0575 60.7685 196.865   100
    a 53.130 53.6850 58.72041 55.2845 60.6005  93.401   100
In essence, the benchmarks themselves indicate no significant difference between either method.
Appendix
Long Output
Long output add1
int add1( const int n ){
a0: 55 push rbp
a1: 48 89 e5 mov rbp,rsp
a4: 89 7d fc mov DWORD PTR [rbp-0x4],edi
int y = 0;
a7: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0
for(int i=0; i<n; i++){
ae: c7 45 f4 00 00 00 00 mov DWORD PTR [rbp-0xc],0x0
b5: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
b8: 3b 45 fc cmp eax,DWORD PTR [rbp-0x4]
bb: 0f 8d 76 00 00 00 jge 137 <__Z4add1i+0x97>
for(int j=0; j<n; j++) y++;
c1: c7 45 f0 00 00 00 00 mov DWORD PTR [rbp-0x10],0x0
c8: 8b 45 f0 mov eax,DWORD PTR [rbp-0x10]
cb: 3b 45 fc cmp eax,DWORD PTR [rbp-0x4]
ce: 0f 8d 1b 00 00 00 jge ef <__Z4add1i+0x4f>
d4: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
d7: 05 01 00 00 00 add eax,0x1
dc: 89 45 f8 mov DWORD PTR [rbp-0x8],eax
df: 8b 45 f0 mov eax,DWORD PTR [rbp-0x10]
e2: 05 01 00 00 00 add eax,0x1
e7: 89 45 f0 mov DWORD PTR [rbp-0x10],eax
ea: e9 d9 ff ff ff jmp c8 <__Z4add1i+0x28>
for(int j=0; j<(n*2); j++) y++;
ef: c7 45 ec 00 00 00 00 mov DWORD PTR [rbp-0x14],0x0
f6: 8b 45 ec mov eax,DWORD PTR [rbp-0x14]
f9: 8b 4d fc mov ecx,DWORD PTR [rbp-0x4]
fc: c1 e1 01 shl ecx,0x1
ff: 39 c8 cmp eax,ecx
101: 0f 8d 1b 00 00 00 jge 122 <__Z4add1i+0x82>
107: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
10a: 05 01 00 00 00 add eax,0x1
10f: 89 45 f8 mov DWORD PTR [rbp-0x8],eax
112: 8b 45 ec mov eax,DWORD PTR [rbp-0x14]
115: 05 01 00 00 00 add eax,0x1
11a: 89 45 ec mov DWORD PTR [rbp-0x14],eax
11d: e9 d4 ff ff ff jmp f6 <__Z4add1i+0x56>
}
122: e9 00 00 00 00 jmp 127 <__Z4add1i+0x87>
return y ;
}
Long Output for add2
int add2( const int n ){
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 89 7d fc mov DWORD PTR [rbp-0x4],edi
int y = 0;
7: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0
int i, j;
for(i=0; i<n; i++){
e: c7 45 f4 00 00 00 00 mov DWORD PTR [rbp-0xc],0x0
15: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
18: 3b 45 fc cmp eax,DWORD PTR [rbp-0x4]
1b: 0f 8d 76 00 00 00 jge 97 <__Z4add2i+0x97>
for(j=0; j<n; j++) y++;
21: c7 45 f0 00 00 00 00 mov DWORD PTR [rbp-0x10],0x0
28: 8b 45 f0 mov eax,DWORD PTR [rbp-0x10]
2b: 3b 45 fc cmp eax,DWORD PTR [rbp-0x4]
2e: 0f 8d 1b 00 00 00 jge 4f <__Z4add2i+0x4f>
34: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
37: 05 01 00 00 00 add eax,0x1
3c: 89 45 f8 mov DWORD PTR [rbp-0x8],eax
3f: 8b 45 f0 mov eax,DWORD PTR [rbp-0x10]
42: 05 01 00 00 00 add eax,0x1
47: 89 45 f0 mov DWORD PTR [rbp-0x10],eax
4a: e9 d9 ff ff ff jmp 28 <__Z4add2i+0x28>
for(j=0; j<(n*2); j++) y++;
4f: c7 45 f0 00 00 00 00 mov DWORD PTR [rbp-0x10],0x0
56: 8b 45 f0 mov eax,DWORD PTR [rbp-0x10]
59: 8b 4d fc mov ecx,DWORD PTR [rbp-0x4]
5c: c1 e1 01 shl ecx,0x1
5f: 39 c8 cmp eax,ecx
61: 0f 8d 1b 00 00 00 jge 82 <__Z4add2i+0x82>
67: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
6a: 05 01 00 00 00 add eax,0x1
6f: 89 45 f8 mov DWORD PTR [rbp-0x8],eax
72: 8b 45 f0 mov eax,DWORD PTR [rbp-0x10]
75: 05 01 00 00 00 add eax,0x1
7a: 89 45 f0 mov DWORD PTR [rbp-0x10],eax
7d: e9 d4 ff ff ff jmp 56 <__Z4add2i+0x56>
}
82: e9 00 00 00 00 jmp 87 <__Z4add2i+0x87>
Short Output
Short output for add1
int add1( const int n ){
int y = 0;
for(int i=0; i<n; i++){
127: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
12a: 05 01 00 00 00 add eax,0x1
12f: 89 45 f4 mov DWORD PTR [rbp-0xc],eax
132: e9 7e ff ff ff jmp b5 <__Z4add1i+0x15>
for(int j=0; j<n; j++) y++;
for(int j=0; j<(n*2); j++) y++;
}
return y ;
137: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
13a: 5d pop rbp
13b: c3 ret
13c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
0000000000000140 <_main>:
}
Short output for add2
int add2( const int n ){
int y = 0;
int i, j;
for(i=0; i<n; i++){
87: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
8a: 05 01 00 00 00 add eax,0x1
8f: 89 45 f4 mov DWORD PTR [rbp-0xc],eax
92: e9 7e ff ff ff jmp 15 <__Z4add2i+0x15>
for(j=0; j<n; j++) y++;
for(j=0; j<(n*2); j++) y++;
}
return y ;
97: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
9a: 5d pop rbp
9b: c3 ret
9c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
00000000000000a0 <__Z4add1i>:
}
In the 2 examples you give, it will not make much difference which one you choose - the compiler is almost certain to optimise them identically.
Both are perfectly legal. The case where variables are declared inside the loop header is fine because each variable is confined to the scope of its for loop.
Personally, I will always write my loops like in your second example unless the index of the loop is related to some other pre-existing variable. I think this is a neater solution and complies with the idea of declaring variables where you need them.
C/C++ will allow you to do something which is not completely intuitive - it will allow you to redefine the same variable name in a nested scope and then things can start to get messy:
for (int i = 0; i < 10; i++) {
    for (int i = 10; i < 100; i++) {
        // Be careful what you do here!
    }
}
In the inner loop, any reference to i refers to the i declared in the inner loop - the outer loop's i is now inaccessible. I have seen many bugs caused by this, and they can be hard to spot because the shadowing is almost never a deliberate choice by the programmer.
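As an aside (my addition, not from the original answer), GCC and Clang can flag this pattern for you with the -Wshadow warning:
// shadow.cpp - compile with: g++ -Wshadow -c shadow.cpp
int shadow_demo()
{
    int total = 0;
    for (int i = 0; i < 10; i++) {
        for (int i = 10; i < 100; i++) { // warning: declaration shadows a local variable
            total += i;                  // always the inner i; the outer one is hidden
        }
    }
    return total;
}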
It is because of scoping (each block creates its own scope).
What you can try is:
for(int i = 0; i < 5; i++) { std::cout << i << std::endl; }
std::cout << i << std::endl; // error: i is not declared in this scope
This code does not compile, because i is only declared inside the for loop. Likewise, if you have an if statement, every variable declared inside it exists only within that statement and no longer exists outside it. Braces { } also delimit a scope.
What you can't do is:
int i = 5;
int i = 4;
Here you are trying to declare the same variable twice in the same scope, which gives you a compile error.

Allocating a temporary variable or compute the expression twice [duplicate]

This question already has answers here:
C++ performance of accessing member variables versus local variables
(11 answers)
Closed 2 years ago.
I am trying to write code that is as efficient as possible, and I encountered the following situation:
int foo(int a, int b, int c)
{
    return (a + b) % c;
}
All good! But what if I want to check whether the result of the expression differs from a constant, let's say myConst? Let's say I can afford a temporary variable.
Which of the following methods is fastest:
int foo(int a, int b, int c)
{
    return (((a + b) % c) != myConst) ? (a + b) % c : myException;
}
or
int foo(int a, int b, int c)
{
    int tmp = (a + b) % c;
    return (tmp != myConst) ? tmp : myException;
}
I can't decide. Where is the line beyond which recalculating the expression becomes more expensive than allocating and deallocating a temporary variable, or vice versa?
Don't worry about it, write concise code and leave micro-optimizations to the compiler.
In your example, writing the same calculation twice is error-prone - so do not do that. And in this specific case, the compiler is more than likely to avoid creating a temporary on the stack at all!
Your example can (and on my compiler does) produce the following assembly (I have replaced myConst with constexpr 42 and myException with 0):
foo(int, int, int):
    leal (%rdi,%rsi), %eax  # adds a and b, puts the sum in eax
    movl %edx, %ecx         # loads c
    cltd                    # sign-extends eax into edx:eax for the division
    idivl %ecx              # divides; the remainder lands in edx
    movl $0, %eax           # prepares to return the exception value
    cmpl $42, %edx          # compares the remainder with the magic constant
    cmovne %edx, %eax       # overwrites the exception value if all is cool
    ret
As you see, there is no temporary anywhere in sight!
Use the latter:
You're not computing the same value twice.
The code is clearer.
Creating local variables on the stack doesn't take any significant amount of time.
Check the assembler code this generates for both versions; you most likely want the highest optimization settings for your compiler.
You may very well find that the compiler itself figures out the intermediate value is used twice, but only inside the function, so it is safe to keep in a register.
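For example (my suggestion; it assumes g++ and the two versions saved in separate files, but most compilers have an equivalent assembly-output flag), you can compare the generated code directly:
g++ -O2 -S -o with_tmp.s with_tmp.cpp
g++ -O2 -S -o without_tmp.s without_tmp.cpp
diff with_tmp.s without_tmp.s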
To add to what has already been posted: ease of debugging is at least as important as code efficiency (if there is any effect on efficiency at all, which, as others have posted, is unlikely with optimization on).
Go with whichever version is easiest to follow, test and debug.
Use a temp var.
If more developers used simpler, non-compound expressions and more temp vars, there would be far fewer 'Help - I cannot debug my code!' posts to SO.
Hard-coded values
The following applies only to hard-coded values (even if they aren't const or constexpr).
In the following example, on MSVC 2015 both calls were optimized away completely and replaced with just mov edx,1 (1 being the computed result in this example):
#include <iostream>
#include <exception>

int myConst{4};
int myException{2};

int foo1(int a,int b,int c)
{
    return (((a + b) % c) != myConst) ? (a + b) % c : myException;
}

int foo2(int a,int b,int c)
{
    int tmp = (a + b) % c;
    return (tmp != myConst) ? tmp : myException;
}
int main()
{
00007FF71F0E1000 48 83 EC 28 sub rsp,28h
auto test1{foo1(5,2,3)};
auto test2{foo2(5,2,3)};
std::cout << test1 <<'\n';
00007FF71F0E1004 48 8B 0D 75 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF71F0E3080h)]
00007FF71F0E100B BA 01 00 00 00 mov edx,1
00007FF71F0E1010 FF 15 72 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF71F0E3088h)]
00007FF71F0E1016 48 8B C8 mov rcx,rax
00007FF71F0E1019 E8 B2 00 00 00 call std::operator<<<std::char_traits<char> > (07FF71F0E10D0h)
std::cout << test2 <<'\n';
00007FF71F0E101E 48 8B 0D 5B 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF71F0E3080h)]
00007FF71F0E1025 BA 01 00 00 00 mov edx,1
00007FF71F0E102A FF 15 58 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF71F0E3088h)]
00007FF71F0E1030 48 8B C8 mov rcx,rax
00007FF71F0E1033 E8 98 00 00 00 call std::operator<<<std::char_traits<char> > (07FF71F0E10D0h)
return 0;
00007FF71F0E1038 33 C0 xor eax,eax
}
00007FF71F0E103A 48 83 C4 28 add rsp,28h
00007FF71F0E103E C3 ret
At this point others have pointed out that the optimization would not happen if the values were passed in at runtime, or if the code lived in separate files. Yet it seems that even with the code in a separate compilation unit the optimization is still done, and we get no instructions for these functions:
#include <iostream>
#include <exception>
#include "Header.h"
int main()
{
00007FF667BF1000 48 83 EC 28 sub rsp,28h
int var1{5},var2{2},var3{3};
auto test1{foo1(var1,var2,var3)};
auto test2{foo2(var1,var2,var3)};
std::cout << test1 <<'\n';
00007FF667BF1004 48 8B 0D 75 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF667BF3080h)]
00007FF667BF100B BA 01 00 00 00 mov edx,1
00007FF667BF1010 FF 15 72 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF667BF3088h)]
00007FF667BF1016 48 8B C8 mov rcx,rax
00007FF667BF1019 E8 B2 00 00 00 call std::operator<<<std::char_traits<char> > (07FF667BF10D0h)
std::cout << test2 <<'\n';
00007FF667BF101E 48 8B 0D 5B 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF667BF3080h)]
00007FF667BF1025 BA 01 00 00 00 mov edx,1
00007FF667BF102A FF 15 58 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF667BF3088h)]
00007FF667BF1030 48 8B C8 mov rcx,rax
00007FF667BF1033 E8 98 00 00 00 call std::operator<<<std::char_traits<char> > (07FF667BF10D0h)
return 0;
00007FF667BF1038 33 C0 xor eax,eax
}
00007FF667BF103A 48 83 C4 28 add rsp,28h
00007FF667BF103E C3 ret

C++ Performance: Two shorts vs int bit operations

Which of these options has the best performance:
using two shorts for two pieces of information, or using one int and bit operations to retrieve half of it for each piece?
This may vary depending on architecture and compiler, but generally the one using an int and bit operations on it will perform slightly worse. The difference is so minimal, though, that to date I haven't written code that requires that level of optimization; I depend on the compiler to do these kinds of optimizations for me.
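For reference, a minimal sketch of the packing scheme in question (my addition; the names pack, unpack_hi and unpack_lo are illustrative, not from the post):
#include <cstdint>

inline std::uint32_t pack(std::uint16_t hi, std::uint16_t lo)
{
    return (static_cast<std::uint32_t>(hi) << 16) | lo; // hi in the top half, lo in the bottom
}
inline std::uint16_t unpack_hi(std::uint32_t v) { return static_cast<std::uint16_t>(v >> 16); }
inline std::uint16_t unpack_lo(std::uint32_t v) { return static_cast<std::uint16_t>(v & 0xFFFF); }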
Now let us check the following C++ code, which simulates the behaviour:
int main()
{
    int x = 100;
    short a = 255;
    short b = 127;
    short p = x >> 16;
    short q = x & 0xffff;
    short y = a;
    short z = b;
    return 0;
}
The corresponding assembly code on an x86_64 system (from GNU g++) is shown below:
00000000004004ed <main>:
int main()
{
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
int x = 100;
4004f1: c7 45 fc 64 00 00 00 movl $0x64,-0x4(%rbp)
short a = 255;
4004f8: 66 c7 45 f0 ff 00 movw $0xff,-0x10(%rbp)
short b = 127;
4004fe: 66 c7 45 f2 7f 00 movw $0x7f,-0xe(%rbp)
short p = x >> 16;
400504: 8b 45 fc mov -0x4(%rbp),%eax
400507: c1 f8 10 sar $0x10,%eax
40050a: 66 89 45 f4 mov %ax,-0xc(%rbp)
short q = x & 0xffff;
40050e: 8b 45 fc mov -0x4(%rbp),%eax
400511: 66 89 45 f6 mov %ax,-0xa(%rbp)
short y = a;
400515: 0f b7 45 f0 movzwl -0x10(%rbp),%eax
400519: 66 89 45 f8 mov %ax,-0x8(%rbp)
short z = b;
40051d: 0f b7 45 f2 movzwl -0xe(%rbp),%eax
400521: 66 89 45 fa mov %ax,-0x6(%rbp)
return 0;
400525: b8 00 00 00 00 mov $0x0,%eax
}
As we see, short p = x >> 16 is the slowest assignment, as it needs an extra right-shift instruction (sar), while all the other assignments are equal in cost.