C and C++: Array element access pointer vs int

Is there a performance difference if you either do myarray[ i ] or store the address of myarray[ i ] in a pointer?
Edit: The pointers are all calculated during an unimportant step in my program, where performance is not a criterion. During the critical parts the pointers remain static and are not modified. The question is whether these static pointers are faster than using myarray[ i ] all the time.

For this code:
int main() {
    int a[100], b[100];
    int * p = b;
    for ( unsigned int i = 0; i < 100; i++ ) {
        a[i] = i;
        *p++ = i;
    }
    return a[1] + b[2];
}
when built with -O3 optimisation in g++, the statement:
a[i] = i;
produced the assembly output:
mov %eax,(%ecx,%eax,4)
and this statement:
*p++ = i;
produced:
mov %eax,(%edx,%eax,4)
So in this case there was no difference between the two. However, this is not and cannot be a general rule - the optimiser might well generate completely different code for even a slightly different input.

It will probably make no difference at all. The compiler will usually be smart enough to know when you are using an expression more than once and create a temporary itself, if appropriate.
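For instance (a hypothetical pair of functions, not from the question), these two typically compile to identical code at -O2, because the compiler hoists &myarray[i] itself:
int read_twice_indexed(int* myarray, int i) {
    return myarray[i] * 2 + myarray[i] / 3; // index written out each time
}

int read_twice_pointer(int* myarray, int i) {
    int* p = &myarray[i];                   // manual temporary
    return *p * 2 + *p / 3;                 // usually the same codegen
}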

Compilers can do surprising optimizations; the only way to know is to read the generated assembly code.
With GCC, use -S, with -masm=intel for Intel syntax.
With VC++, use /FA (IIRC).
You should also enable optimizations: -O2 or -O3 with GCC, and /O2 with VC++.

I prefer using myarray[ i ] since it is clearer, and the compiler has an easier time compiling it to optimized code.
When you use pointers, it is harder for the compiler to optimize the code, since it is harder to know exactly what you're doing with the pointer.

There should not be much difference, but by using indexing you avoid all sorts of pitfalls that the compiler's optimizer is prone to (aliasing being the most important one), so I'd say the indexing case should be easier for the compiler to handle. That doesn't mean you can ignore such things outside the loop, but pointers in a loop generally just add to the complexity.

Yes. With a pointer, the address doesn't have to be recalculated from the array's base address; the element is accessed directly. So you get a small performance improvement if you save the address in a pointer.
But the compiler will usually optimize the code and use a pointer in both cases (if you have static arrays).
For dynamic arrays (created with new), a pointer may offer more of an advantage, as the compiler cannot always optimize the array accesses at compile time.

There will be no substantial difference. Premature optimization is the root of all evil - get a profiler before checking micro-optimizations like this. Also, myarray[i] is more portable to custom types, such as a std::vector.

Okay, so your question is: what's faster?
#include <stdio.h>

int main(int argc, char **argv)
{
    int array[20];
    array[0] = 0;
    array[1] = 1;
    int *value_1 = &array[1];
    printf("%d", *value_1);
    printf("%d", array[1]);
    printf("%d", *(array + 1));
    return 0;
}
Like someone else already pointed out, compilers can do clever optimization. Of course this depends on where an expression is used, but normally you shouldn't care about such subtle differences; all your assumptions can be proven wrong by the compiler. Today you shouldn't need to care about such differences.
For example, the above code produces the following (only a snippet):
mov [ebp+var_54], 1 #store 1
lea eax, [ebp+var_58] # load the address of array[0]
add eax, 4 # add 4 (size of int)
mov [ebp+var_5C], eax
mov eax, [ebp+var_5C]
mov eax, [eax]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000 # points to %d
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf

Short answer: the only way to know for sure is to code up both versions and compare performance. I would personally be surprised if there was a measurable difference unless you were doing a lot of array accesses in a really tight loop. If this is something that happens once or twice over the lifetime of the program, or depends on user input, it's not worth worrying about.
Remember that the expression a[i] is evaluated as *(a+i), which is an addition plus a dereference, whereas *p is just a dereference. Depending on how the code is structured, though, it may not make a difference. Assume the following:
int a[N]; // for any arbitrary N > 1
int *p = a;
size_t i;
for (i = 0; i < N; i++)
    printf("a[%zu] = %d\n", i, a[i]);
for (i = 0; i < N; i++) {
    void *addr = p; // take the address before the increment: reading and
                    // modifying p in the same argument list is unsequenced
    printf("*(%p) = %d\n", addr, *p++);
}
Now we're comparing a[i] to *p++, which is a dereference plus a postincrement (in addition to the i++ in the loop control); that may turn out to be a more expensive operation than the array subscript. Not to mention we've introduced another variable that's not strictly necessary; we're trading a little space for what may or may not be an improvement in speed. It really depends on the compiler, the structure of the code, optimization settings, OS, and CPU.
Worry about correctness first, then worry about readability/maintainability, then worry about safety/reliability, then worry about performance. Unless you're failing to meet a hard performance requirement, focus on making your intent clear and easy to understand. It doesn't matter how fast your code is if it gives you the wrong answer or performs the wrong action, or if it crashes horribly at the first hint of bad input, or if you can't fix bugs or add new features without breaking something.

Yes - storing the address of myarray[i] in a pointer will perform better (if used at large scale...).
Why?
It saves you an addition, and maybe a multiplication (or a shift).
Many compilers may optimize that for you in the case of static memory allocation.
If you are using dynamic memory allocation, the compiler cannot always optimize it, because that happens at runtime!

Related

Why don't GCC & MSVC optimize based on uninitialized variables? Can I force them to?

Consider some dead-simple code (or a more complicated one; see footnote 1 below) that uses an uninitialized stack variable, e.g.:
int main() { int x; return 17 / x; }
Here's what GCC emits (-O3):
mov eax, 17
xor ecx, ecx
cdq
idiv ecx
ret
Here's what MSVC emits (-O2):
mov eax, 17
cdq
idiv DWORD PTR [rsp]
ret 0
For reference, here's what Clang emits (-O3):
ret
The thing is, all three compilers detect that this is an uninitialized variable just fine (-Wall), but only one of them actually performs any optimizations based on it.
This is kind of stumping me... I thought all the decades of fighting over undefined behavior was to allow compiler optimizations, and yet I'm seeing only one compiler cares to optimize even the most basic cases of UB.
Why is this? What do I do if I want compilers other than Clang to optimize such cases of UB? Is there any way for me to actually get the benefits of UB instead of just the downsides with either compiler?
Footnotes
1 Apparently this was too much of an SSCCE for some folks to appreciate the actual issue. If you want a more complicated example of this problem that isn't undefined on every execution of the program, just massage it a bit. e.g.:
int main(int argc, char *[])
{
    int x;
    if (argc) { x = 100 + (argc * argc + x); }
    return x;
}
On GCC you get:
main:
xor eax, eax
test edi, edi
je .L1
imul edi, edi
lea eax, [rdi+100]
.L1:
ret
On Clang you get:
main:
ret
Same issue, just more complicated.
Optimizing for actually reading uninitialized data is not the point.
Optimizing on the assumption that the data you read must have been initialized is.
So if you have some variable that can only ever be written as 3 or 1, the compiler can assume it is odd.
Or, if you add a positive signed constant to a signed value, the compiler can assume the result is larger than the original signed value (this makes some loops faster).
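As a rough sketch of the "3 or 1" case (hypothetical function; whether a given compiler actually performs this fold varies):
int parity(int k) {
    int v;                      // only ever written with the odd values 3 or 1
    if (k > 0)      v = 3;
    else if (k < 0) v = 1;
    // For k == 0, v is read uninitialized, which is UB; the compiler may
    // therefore assume that path never happens and fold v % 2 to 1.
    return v % 2;
}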
What the optimizer does when it proves an uninitialized value is read isn't important; making UB or indeterminate value calculation faster is not the point. Well behaved programs don't do that on purpose, spending effort making it faster (or slower, or caring) is a waste of compiler writers time.
It may fall out of other efforts. Or it may not.
Consider this example:
int foo(bool x) {
    int y;
    if (x) y = 3;
    return y;
}
GCC realizes that the only way the function can return something well defined is when x is true. Hence, when optimizations are turned on, there is no branch:
foo(bool):
mov eax, 3
ret
Calling foo(true) is not undefined behavior. Calling foo(false) is undefined behavior. There is nothing in the standard that specifies why foo(false) returns 3. There is also nothing in the standard that mandates that foo(false) does not return 3. Compilers do not optimize code that has undefined behavior; rather, they can optimize the code that has no UB (e.g. remove the branch in foo), because it is not specified what happens when there is UB.
What do I do if I want compilers other than Clang to optimize such cases of UB?
Compilers do that by default. GCC is no different from Clang in that respect.
In your example of
int main() { int x; return 17 / x; }
there is no missed optimization, because it is not defined what the code will do in the first place.
Your second example can be considered a missed opportunity for optimization. Though, again: UB grants opportunities to optimize the code that does not have UB. The idea is not that you introduce UB in your code to gain optimizations. Your second example can (and should) be rewritten as
int main(int argc, char *[])
{
    int x = 100 + (argc * argc + x);
    return x;
}
It isn't a big issue in practice that gcc doesn't bother to remove the branch in your version. If you don't need the branch, you don't have to write it and then expect the compiler to remove it.
The Standard uses the term "Undefined Behavior" to refer to actions which in some contexts might be non-portable but correct, but in other contexts would be erroneous, without making any effort to distinguish into when a particular action should be viewed one way or the other.
In C89 and C99, if it would be possible for a type's underlying storage to hold an invalid bit pattern, attempting to use an uninitialized automatic-duration or malloc-allocated object of that type would invoke Undefined Behavior, but if all possible bit patterns would be valid, accessing such an object would simply yield an Unspecified Value of that type. This meant, for example, that a program could do something like:
#include <stdint.h>

struct ushorts256 { uint16_t dat[256]; } x,y;
void test(void)
{
    struct ushorts256 temp;
    int i;
    for (i=0; i<86; i++)
        temp.dat[i*3]=i;
    x=temp;
    y=temp;
}
and if the callers only cared about what was in the multiple-of-3 elements of the structures, there would be no need for the code to worry about the other 170 values of temp.
C11 changed the rules so that compiler writers wouldn't have to follow the C89 and C99 behavior if they felt there was something more useful they could do. For example, depending upon what calling code would do with the arrays, it might be more efficient to simply have the code write every third item of x and every third item of y, leaving the remaining items alone. A consequence of this would be that non-multiple-of-3 items of x might not match the corresponding items of y, but people seeking to sell compilers were expected to be able to judge their particular customers' needs better than the Committee ever could.
Some compilers treat uninitialized objects in a manner consistent with C89 and C99. Some may exploit the freedom to have the values behave non-deterministically (as in the example above) but without disrupting program behavior. Some may opt to treat any program that accesses uninitialized variables in gratuitously meaningless fashion. Portable programs may not rely upon any particular treatment, but the authors of the Standard have expressly stated they did not wish to "demean" useful programs that happened to be non-portable (see http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf page 13).

Is it possible to write to an array second element by overflowing the first element in C?

In low-level languages it is possible to mov a dword (32-bit) into the first array element; it will overflow so that it also writes the second, third and fourth elements. Or you can mov a word (16-bit) into the first element and it will overflow into the second.
How can I achieve the same effect in C? When trying, for example:
char txt[] = {0, 0};
txt[0] = 0x4142;
it gives a warning [-Woverflow],
and the value of txt[1] doesn't change while txt[0] is set to 0x42.
How can I get the same behavior as in assembly:
mov word [txt], 0x4142
the previous assembly instruction will set the first element [txt+0] to 0x42 and the second element [txt+1] to 0x41.
EDIT
What about this suggestion?
Define the array as a single variable:
uint16_t txt;
txt = 0x4142;
and accessing the elements with ((uint8_t*) &txt)[0] for the first element and ((uint8_t*) &txt)[1] for the second element.
If you are totally sure this will not cause a segmentation fault (and you must be sure), you can use memcpy():
uint16_t n = 0x4142;
memcpy((void *)txt, (void *)&n, sizeof(uint16_t));
By using void pointers, this is the most versatile solution, generalizable to all the cases beyond this example.
txt[0] = 0x4142; is an assignment to a char object, so the right-hand side is implicitly converted to char after being evaluated.
The NASM equivalent is mov byte [rsp-4], 'BA'. Assembling that with NASM gives you the same warning as your C compiler:
foo.asm:1: warning: byte data exceeds bounds [-w+number-overflow]
Also, modern C is not a high-level assembler. C has types, NASM doesn't (operand-size is on a per-instruction basis only). Don't expect C to work like NASM.
C is defined in terms of an "abstract machine", and the compiler's job is to make asm for the target CPU which produces the same observable results as if the C was running directly on the C abstract machine. Unless you use volatile, actually storing to memory doesn't count as an observable side-effect. This is why C compilers can keep variables in registers.
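As a small sketch of that point (hypothetical function): without volatile, the store and reload below could be optimized away entirely; with volatile, each access is an observable side-effect the compiler must keep.
int square(int x) {
    volatile int t = x * x; // the store to t must actually happen
    return t;               // ...and t must be re-read from memory here
}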
And more importantly, things that are undefined behaviour according to the ISO C standard may still be undefined when compiling for x86. For example, x86 asm has well-defined behaviour for signed overflow: it wraps around. But in C, it's undefined behaviour, so compilers can exploit this to make more efficient code for for (int i=0 ; i<=len ;i++) arr[i] *= 2; without worrying that i<=len might always be true, giving an infinite loop. See What Every C Programmer Should Know About Undefined Behavior.
Type-punning by pointer-casting other than to char* or unsigned char* (or __m128i* and other Intel SSE/AVX intrinsic types, because they're also defined as may_alias types) violates the strict-aliasing rule. txt is a char array, but I think it's still a strict-aliasing violation to write it through a uint16_t* and then read it back via txt[0] and txt[1].
Some compilers may define the behaviour of *(uint16_t*)txt = 0x4142, or happen to produce the code you expect in some cases, but you shouldn't count on it always working and being safe when other code also reads and writes txt[].
Compilers (i.e. C implementations, to use the terminology of the ISO standard) are allowed to define behaviour that the C standard leaves undefined. But in a quest for higher performance, they choose to leave a lot of stuff undefined. This is why compiling C for x86 is not similar to writing in asm directly.
Many people consider modern C compilers to be actively hostile to the programmer, looking for excuses to "miscompile" your code. See the 2nd half of this answer on gcc, strict-aliasing, and horror stories, and also the comments. (The example in that answer is safe with a proper memcpy; the problem was a custom implementation of memcpy that copied using long*.)
Here's a real-life example of a misaligned pointer leading to a fault on x86 (because gcc's auto-vectorization strategy assumed that some whole number of elements would reach a 16-byte alignment boundary. i.e. it depended on the uint16_t* being aligned.)
Obviously if you want your C to be portable (including to non-x86), you must use well-defined ways to type-pun. In ISO C99 and later, writing one union member and reading another is well-defined. (And in GNU C++, and GNU C89).
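For instance, a minimal sketch of the union approach (well-defined in C99/C11 and, as an extension, in GNU C++; the byte order you observe is implementation-defined):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    union { uint16_t word; unsigned char bytes[2]; } u;
    u.word = 0x4142;  /* write one member... */
    /* ...and read the other; prints "42 41" on a little-endian machine */
    printf("%02x %02x\n", u.bytes[0], u.bytes[1]);
    return 0;
}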
In ISO C++, the only well-defined way to type-pun is with memcpy or other char* accesses, to copy object representations.
Modern compilers know how to optimize away memcpy for small compile-time constant sizes.
#include <string.h>
#include <stdint.h>

void set2bytes_safe(char *p) {
    uint16_t src = 0x4142;
    memcpy(p, &src, sizeof(src));
}

void set2bytes_alias(char *p) {
    *(uint16_t*)p = 0x4142;
}
Both functions compile to the same code with gcc, clang, and ICC for x86-64 System V ABI:
# clang++6.0 -O3 -march=sandybridge
set2bytes_safe(char*):
mov word ptr [rdi], 16706
ret
Sandybridge-family doesn't have LCP decode stalls for 16-bit mov immediate, only for 16-bit immediates with ALU instructions. This is an improvement over Nehalem (See Agner Fog's microarch guide), but apparently gcc8.1 -march=sandybridge doesn't know about it because it still likes to:
# gcc and ICC
mov eax, 16706
mov WORD PTR [rdi], ax
ret
define the array as a single variable.
... and accessing the elements with ((uint8_t*) &txt)[0]
Yes, that's fine, assuming that uint8_t is unsigned char, because char* is allowed to alias anything.
This is the case on almost any implementation that supports uint8_t at all, but it's theoretically possible to build one where it's not, and char is a 16 or 32-bit type, and uint8_t is implemented with a more expensive read/modify/write of the containing word.
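If you want to verify that assumption on your implementation, a quick compile-time check (C++11) is:
#include <cstdint>
#include <type_traits>

// Fails to compile on the (exotic) implementations where uint8_t is not
// unsigned char and the cast would not get char-type aliasing guarantees.
static_assert(std::is_same<std::uint8_t, unsigned char>::value,
              "uint8_t is unsigned char, so uint8_t* follows char aliasing rules");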
One option is to Trust Your Compiler(tm) and just write proper code.
With this test code:
#include <iostream>

int main() {
    char txt[] = {0, 0};
    txt[0] = 0x41;
    txt[1] = 0x42;
    // Caveat: after these writes txt is no longer null-terminated, so
    // inserting it into cout reads past the end of the array.
    std::cout << txt;
}
Clang 6.0 produces:
int main() {
00E91020 push ebp
00E91021 mov ebp,esp
00E91023 push eax
00E91024 lea eax,[ebp-2]
char txt[] = {0, 0};
00E91027 mov word ptr [ebp-2],4241h <-- Combined write, without any tricks!
txt[0] = 0x41;
txt[1] = 0x42;
std::cout << txt;
00E9102D push eax
00E9102E push offset cout (0E99540h)
00E91033 call std::operator<<<std::char_traits<char> > (0E91050h)
00E91038 add esp,8
}
00E9103B xor eax,eax
00E9103D add esp,4
00E91040 pop ebp
00E91041 ret
You're looking to do a deep copy, which you'll need a loop to accomplish (or a function that does the loop for you internally: memcpy).
Simply assigning 0x4142 to a char means the value has to be truncated to fit in the char. This should produce a warning, as the outcome is implementation-specific, but typically the least significant bits are retained.
In any case, if you know the numbers you want to assign you could just construct with them: const char txt[] = { '\x41', '\x42' };
I'd suggest doing this with an initializer list; obviously it's on you to make sure the initializer list is at least as long as size(txt). For example:
copy_n(begin({ '\x41', '\x42' }), size(txt), begin(txt));

How does C++ keep the const value? [duplicate]

This question already has answers here: Two different values at the same memory address
#include <iostream>
using namespace std;

int main() {
    const int a = 10;
    auto *b = const_cast<int *>(&a);
    *b = 20;
    cout << a << " " << *b;
    cout << endl << &a << " " << b;
}
The output looks like:
10 20
0x7ffeeb1d396c 0x7ffeeb1d396c
a and *b are at the same address, so why do they have different values?
Most probably this is caused by optimization.
As molbdnilo already said in his comment: "Compilers trust you blindly. If you lie to them, strange things can happen."
So when optimization is enabled, the compiler finds the declaration
const int a = 10;
and thinks "ah, this will never change, so we don't need a "real" variable and can replace all appearances of a in the code simply with 10". This behavior is called constant folding.
Now in the next line you "cheat" the compiler:
auto *b = const_cast<int *>(&a);
*b = 20;
and change a, although you have promised not to do so.
The result is confusion.
Like a lot of people have already mentioned and Christian Hackl has thoroughly analyzed in his excellent in-depth answer, it's undefined behavior. The compiler is generally allowed to apply constant folding on constants, which are explicitly declared const.
Your question is a very good example (I don't know how anyone can vote it down!) of why const_cast is very dangerous (and even more so when combined with raw pointers). It should only be used if absolutely necessary, and such uses should at least be thoroughly commented, explaining why they were unavoidable. Comments are important in that case because it's not only the compiler you are "cheating":
Your co-workers will also rely on the const and trust that the value won't change. If it DOES change, and you didn't inform them, didn't comment, and did not know exactly what you were doing, you'll have a bad day at work :)
Try it out: Perhaps your program will even behave different in debug build (not optimized) and release build (optimized), and those bugs are usually pretty annoying to fix.
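A typical "genuinely unavoidable" case looks something like this (legacy_api is a hypothetical old C-style function; the cast is only acceptable because the callee never writes):
#include <string>

void legacy_api(char *s) { /* reads s, never modifies it */ }

void call_it(const std::string &s) {
    // OK only because legacy_api never writes through s; comment why!
    legacy_api(const_cast<char *>(s.c_str()));
}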
I think it helps to view const like this:
If you declare something as const, it is actually not the compiler that ensures you don't change it, but rather the opposite: you make a promise to the compiler that you won't change it.
The language helps you a bit to keep your promise, e.g. you cannot do:
const int a = 5;
a = 6; // error
but as you discovered, you can indeed attempt to modify something that you declared const. However, then you have broken your promise, and as always in C++, once you break the rules you are on your own (aka undefined behaviour).
The a and *b are at the same address, why do they have different values?
They don't, not according to C++.
How can that be? The answer is that the C++ language does not define the behaviour of your program because you modify a const object. That's not allowed, and there are no rules for what will happen in such a case.
You are only allowed to use const_cast in order to modify something which wasn't const in the first place:
int a = 123;
int const* ptr1 = &a;
// *ptr1 = 456; // compilation error
int* ptr2 = const_cast<int*>(ptr1);
*ptr2 = 456; // OK
So you've ended up with a program whose behaviour is not defined by C++. The behaviour is instead defined by compiler writers, compiler settings and pure chance.
That's also what's wrong with your question. It's impossible to give you a correct answer without knowing exactly how you compiled this and on which system it runs, and perhaps even then there are random factors involved.
You can inspect your binary to find out what the compiler thought it was doing; there are even online tools like https://gcc.godbolt.org/ for that. But try to use printf instead of std::cout for such an analysis; it will produce easier-to-read assembly code. You may also use easier-to-spot numbers, like 123 and 456.
For example, assuming that you even use GCC, compare these two programs; the only difference is the const modifier for a:
#include <stdio.h>

int main() {
    int const a = 123;
    auto *b = const_cast<int *>(&a);
    *b = 456;
    printf("%d", a);
    printf("%d", *b);
}

#include <stdio.h>

int main() {
    int a = 123;
    auto *b = const_cast<int *>(&a);
    *b = 456;
    printf("%d", a);
    printf("%d", *b);
}
Look at what the printf("%d", a); call becomes. In the const version, it becomes:
mov esi, 123
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
In the non-const version:
mov eax, DWORD PTR [rbp-12]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
Without going too much into the details of assembly code, one can see that the compiler optimised the const variable such that the value 123 is pushed directly to printf. It doesn't really treat a as a variable for the printf call, but it does do so for the pointer initialisation.
Such chaos is the nature of undefined behaviour. The moral of the story is that you should always avoid undefined behaviour, and thinking twice before casting is an important part of allowing the compiler to help you in that endeavor.
Compiling your code once with const and once without const using gcc shows that either no space is allocated for a, or its contents are ignored in the const version (even with optimization level 0). The only difference in the assembly output is the line where you print a: the compiler simply prints 10 instead of referring to memory and fetching the contents of a (left side is the const int a version, right side is the int a version):
movl $10, %esi | movl -36(%rbp), %eax
> movl %eax, %esi
As you can see, there is an extra memory access for the non-const version, which we know is expensive in terms of time. So the compiler decided to trust your code, concluded that a const would never change, and replaced its value wherever the variable was referenced.
Compiler options: --std=c++11 -O0 -S

How to quickly initialize a really big array with 1

I have an enormous array:
int* arr = new int[BIGNUMBER];
How can I fill it with one number really fast? Normally I would do
for(int i = 0; i < BIGNUMBER; i++)
    arr[i] = 1;
but I think it would take long.
Can I use memcpy or similar?
You could try using the standard function std::uninitialized_fill_n:
#include <memory>
// ...
std::uninitialized_fill_n(arr, BIGNUMBER, 1);
In any case, when it comes to performance, the rule is to always make measurements to back up your assumptions - especially if you are going to abandon a clear, simple design to embrace a more complex one because of an alleged performance improvement.
EDIT:
Notice that - as Benjamin Lindley mentioned in the comments - for trivial types std::uninitialized_fill_n does not bring any advantage over the more obvious std::fill_n. The advantage would exist for non-trivial types, since std::uninitialized_fill would allow you to allocate a memory region and then construct objects in place.
However, one should not fall into the trap of calling std::uninitialized_fill_n for a memory region that is not uninitialized. The following, for instance, would give undefined behavior:
my_object* v = new my_object[BIGNUMBER];
std::uninitialized_fill_n(v, BIGNUMBER, my_object(42)); // UB: the elements were already constructed!
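For the trivial int case from the question, the obvious std::fill_n version would be a sketch like this (the BIGNUMBER value is a stand-in):
#include <algorithm>
#include <cstddef>

int main() {
    const std::size_t BIGNUMBER = 100000000; // stand-in for the question's constant
    int* arr = new int[BIGNUMBER];
    std::fill_n(arr, BIGNUMBER, 1);          // set every element to 1
    delete[] arr;
}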
An alternative to a dynamic array is std::vector<int> with the constructor that accepts an initial value for each element:
std::vector<int> v(BIGNUMBER, 1); // 'BIGNUMBER' elements, all with value 1.
As already stated, performance would need to be measured. This approach provides the additional benefit that the memory will be freed automatically.
Some possible alternatives to Andy Prowl's std::uninitialized_fill_n() solution, just for posterity:
If you are lucky and your value is composed of all the same bytes, memset will do the trick.
Some implementations offer a 16-bit version memsetw, but that's not everywhere.
GCC has an extension for Designated Initializers that can fill ranges.
I've worked with a few ARM systems whose libraries had accelerated CPU and DMA variants of word-fill, hand-coded in assembly -- you might look and see if your platform offers any of this, if you aren't terribly concerned about portability.
Depending on your processor, even looking into loops around SIMD intrinsics may provide a boost; some of the SIMD units have load/store pipelines that are optimized for moving data around like this. On the other hand you may take severe penalties for moving between register types.
Last but definitely not least, to echo some of the commenters: you should test and see. Compilers tend to be pretty good at recognizing and optimizing patterns like this -- you probably are just trading off portability or readability with anything other than the simple loop or uninitialized_fill_n.
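As a rough sketch of the SIMD idea (SSE2 intrinsics; the function name is made up, and any real gain needs measuring since such fills are often memory-bound):
#include <emmintrin.h> // SSE2
#include <cstddef>

void fill_ones_sse2(int* arr, std::size_t n) {
    const __m128i ones = _mm_set1_epi32(1); // four 32-bit lanes holding 1
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)              // one 16-byte store per iteration
        _mm_storeu_si128(reinterpret_cast<__m128i*>(arr + i), ones);
    for (; i < n; ++i)                      // scalar tail for the remainder
        arr[i] = 1;
}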
You may be interested in prior questions:
Is there memset() that accepts integers larger than char?
initializing an array of ints
How to initialize all members of an array to the same value?
Under Linux/x86 gcc with optimizations turned on, your code will compile to the following:
rax = arr
rdi = BIGNUMBER
400690: c7 04 90 01 00 00 00 movl $0x1,(%rax,%rdx,4)
Store immediate int 1 at address rax + rdx*4
400697: 48 83 c2 01 add $0x1,%rdx
Increment register rdx
40069b: 48 39 fa cmp %rdi,%rdx
Compare rdx to rdi
40069e: 75 f0 jne 400690 <main+0xa0>
If BIGNUMBER has not been reached yet, jump back to the start.
It takes about 1 second per gigabyte on my machine, but most of that I bet is paging in physical memory to back the uninitialized allocation.
Just unroll the loop by, say, 8 or 16 times. Functions like memcpy are fast, but they're really there for convenience, not to be faster than anything you could possibly write:
int i;
for (i = 0; i < BIGNUMBER-8; i += 8){
    a[i+0] = 1; // this gets rid of the test against BIGNUMBER, and the increment, on 7 out of 8 items.
    a[i+1] = 1; // the compiler should be able to see that a[i] is being calculated repeatedly here
    ...
    a[i+7] = 1;
}
for (; i < BIGNUMBER; i++) a[i] = 1;
The compiler might be able to unroll the loop for you, but why take the chance?
Use memset or memcpy:
memset(arr, 0, sizeof(int) * BIGNUMBER);
Try using memset?
memset(arr, 1, sizeof(int) * BIGNUMBER);
http://www.cplusplus.com/reference/cstring/memset/
Beware that memset sets individual bytes, not ints (and its size argument is in bytes, hence the sizeof(int) factor): the second call sets every int to 0x01010101, not 1. It only fills correctly with values whose bytes are all identical, such as 0 or -1.

optimization of access to members in c++

I'm running into an inconsistent optimization behavior with different compilers for the following code:
class tester
{
public:
    tester(int* arr_, int sz_)
        : arr(arr_), sz(sz_)
    {}

    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n)
        {
            for (int i = 0; i < sz; ++i)
            {
                sm += arr[i];
            }
        }
        return sm;
    }

protected:
    int* arr;
    int sz;
    int sm;
};
The doadd function simulates some intensive access to members (ignore the overflows in addition for this question). Compared with similar code implemented as a function:
int arradd(int* arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}
The doadd method runs about 1.5 times slower than the arradd function when compiled in Release mode with Visual C++ 2008. When I modify the doadd method to be as follows (aliasing all members with locals):
int doadd()
{
    int mysm = 0;
    int* myarr = arr;
    int mysz = sz;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < mysz; ++i)
        {
            mysm += myarr[i];
        }
    }
    sm = mysm;
    return sm;
}
Runtimes become roughly the same. Am I right in concluding that this is a missed optimization by the Visual C++ compiler? g++ seems to do better and runs both the member function and the normal function at the same speed when compiling with -O2 or -O3.
The benchmarking is done by invoking the doadd member and arradd function on some sufficiently large array (a few millions of integers in size).
EDIT: Some fine-grained testing shows that the main culprit is the sm member. Replacing all the others with local versions still leaves the runtime long, but once I replace sm with mysm the runtime becomes equal to the function version.
Resolution
Disappointed with the answers (sorry guys), I shook off my laziness and dove into the disassembly listings for this code. My answer below summarizes the findings. In short: it has nothing to do with aliasing; it has everything to do with loop unrolling, and with some strange heuristics MSVC applies when deciding which loops to unroll.
It may be an aliasing issue - the compiler cannot know that the instance variable sm will never be pointed at by arr, so it has to treat sm as if it were effectively volatile and store it back on every iteration. You could make sm a different type to test this hypothesis, or just use a temporary local sum (which will get cached in a register) and assign it to sm at the end.
MSVC is correct, in that it is the only one that, given the code we've seen, is guaranteed to work correctly. GCC employs optimizations that are probably safe in this specific instance, but that can only be verified by seeing more of the program.
Because sm is not a local variable, MSVC apparently assumes that it might alias arr.
That's a fairly reasonable assumption: because arr is protected, a derived class might set it to point to sm, so arr could alias sm.
GCC sees that it doesn't actually alias arr, and so it doesn't write sm back to memory until after the loop, which is much faster.
It's certainly possible to instantiate the class so that arr points to sm, which MSVC would handle, but GCC wouldn't.
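To illustrate (a hypothetical derived class, not from the question): because the members are protected, nothing stops code like this, and the compiler must account for it unless it can see all uses:
struct evil_tester : tester
{
    evil_tester() : tester(nullptr, 1)
    {
        arr = &sm; // now arr[0] and sm are the same object
    }
};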
Assuming that sz > 1, GCC's optimization is permissible in general.
Because the function loops over arr, treating it as an array of sz elements, calling it with sz > 1 while arr points at sm would yield undefined behavior anyway, so GCC can safely assume that they don't alias. But if sz == 1, or if the compiler can't be sure what sz's value might be, then it runs the risk that sz might be 1, in which case arr and sm could alias perfectly legally, and GCC's code would break.
So most likely, GCC simply gets away with it by inlining the whole thing, and seeing that in this case, they don't alias.
I disassembled the code with MSVC to better understand what's going on. Turns out aliasing wasn't a problem at all, and neither was some kind of paranoid thread safety.
Here is the interesting part of the disassembled arradd function:
for (int n = 0; n < 10; ++n)
{
for (int i = 0; i < sz; ++i)
013C101C mov ecx,ebp
013C101E mov ebx,29B9270h
{
sm += arr[i];
013C1023 add eax,dword ptr [ecx-8]
013C1026 add edx,dword ptr [ecx-4]
013C1029 add esi,dword ptr [ecx]
013C102B add edi,dword ptr [ecx+4]
013C102E add ecx,10h
013C1031 sub ebx,1
013C1034 jne arradd+23h (13C1023h)
013C1036 add edi,esi
013C1038 add edi,edx
013C103A add eax,edi
013C103C sub dword ptr [esp+10h],1
013C1041 jne arradd+16h (13C1016h)
013C1043 pop edi
013C1044 pop esi
013C1045 pop ebp
013C1046 pop ebx
ecx points to the array, and we can see that the internal loop is unrolled x4 here - note the four consecutive add instructions from following addresses, and ecx being advanced by 16 bytes (4 words) at a time inside the loop.
For the unoptimized version of the member function, doadd:
int tester::doadd()
{
    sm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}
The disassembly is (it's harder to find since the compiler inlined it into main):
int tr_result = tr.doadd();
013C114A xor edi,edi
013C114C lea ecx,[edi+0Ah]
013C114F nop
013C1150 xor eax,eax
013C1152 add edi,dword ptr [esi+eax*4]
013C1155 inc eax
013C1156 cmp eax,0A6E49C0h
013C115B jl main+102h (13C1152h)
013C115D sub ecx,1
013C1160 jne main+100h (13C1150h)
Note 2 things:
The sum is stored in a register - edi. Hence, no aliasing "care" is taken here; the value of sm isn't re-read all the time. edi is initialized just once and then used as a temporary. You don't see it returned, since the compiler optimized that away and used edi directly as the return value of the inlined code.
The loop is not unrolled. Why? No good reason.
Finally, here's an "optimized" version of the member function, with mysm keeping the sum local manually:
int tester::doadd_opt()
{
    sm = 0;
    int mysm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            mysm += arr[i];
        }
    }
    sm = mysm;
    return sm;
}
The (again, inlined) disassembly is:
int tr_result_opt = tr_opt.doadd_opt();
013C11F6 xor edi,edi
013C11F8 lea ebp,[edi+0Ah]
013C11FB jmp main+1B0h (13C1200h)
013C11FD lea ecx,[ecx]
013C1200 xor ecx,ecx
013C1202 xor edx,edx
013C1204 xor eax,eax
013C1206 add ecx,dword ptr [esi+eax*4]
013C1209 add edx,dword ptr [esi+eax*4+4]
013C120D add eax,2
013C1210 cmp eax,0A6E49BFh
013C1215 jl main+1B6h (13C1206h)
013C1217 cmp eax,0A6E49C0h
013C121C jge main+1D1h (13C1221h)
013C121E add edi,dword ptr [esi+eax*4]
013C1221 add ecx,edx
013C1223 add edi,ecx
013C1225 sub ebp,1
013C1228 jne main+1B0h (13C1200h)
The loop here is unrolled, but just x2.
This explains my speed-difference observations quite well. For a 175e6-element array, the function runs in ~1.2 secs, the unoptimized member in ~1.5 secs, and the optimized member in ~1.3 secs. (Note that this may differ for you; on another machine I got closer runtimes for all 3 versions.)
What about gcc? When the code was compiled with it, all 3 versions ran at ~1.5 secs. Suspecting the lack of unrolling, I looked at gcc's disassembly, and indeed: gcc doesn't unroll any of the versions.
As Paul wrote, it is probably because the sm member is really updated in "real" memory every time, while the local sum in the function can be accumulated in a register (after compiler optimization).
You can get similar issues when passing in pointer arguments. If you like getting your hands dirty, you may find the restrict keyword useful in future.
http://developers.sun.com/solaris/articles/cc_restrict.html
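A rough sketch of what that looks like (C99 restrict; most C++ compilers accept __restrict as an extension; the function name is made up):
/* restrict promises the compiler that out and arr never point at the same
   memory, so it may keep *out in a register across the whole loop. */
void accumulate(int * restrict out, const int * restrict arr, int sz)
{
    for (int i = 0; i < sz; ++i)
        *out += arr[i];
}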
This isn't really the same code at all. If you put the sm, arr and sz variables inside the class instead of making them local, the compiler can't (easily) guess that some other class won't inherit from the tester class and access these members, doing something like arr = &sm; doadd();. Hence, access to these variables can't be optimized away the way it can when they are local to the function.
In the end the reason is basically the one Paul pointed out: sm is updated in real memory when it is a class member, but can be stored in a register when it is local to a function. The memory reads from the adds shouldn't change the resulting time much, as memory must be read anyway to get the values.
In this case, if tester is not exported to another module, is not aliased even indirectly to something exported, and there is no aliasing like the above, the compiler could optimize away the intermediate writes to sm. Some compilers like gcc seem to optimize aggressively enough to detect such cases (would they also if the tester class were exported?). But these are really hard guesses. There are still much simpler optimizations that compilers do not yet perform (like inlining across different compilation units).
The key is probably that doadd is written like this if you make the member accesses explicit with this:
int doadd()
{
    this->sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < this->sz; ++i)
        {
            this->sm += this->arr[i];
        }
    }
    return this->sm;
}
Therein lies the problem: all class members are accessed via the this pointer, whereas arradd has all its variables on the stack. To speed it up, you found that moving all the members onto the stack as local variables makes the speed match arradd. So this indicates that the this indirection is responsible for the performance loss.
Why might that be? As I understand it, this is usually stored in a register, so I don't think it's ultimately any slower than accessing the stack (which is an offset from the stack pointer as well). As other answers point out, it's probably the aliasing problem that generates less optimal code: the compiler can't tell whether any of the memory addresses overlap. Updating sm could in theory also change the contents of arr, so it decides to write the value of sm out to main memory every time rather than tracking it in a register. When variables are on the stack, the compiler can assume they're all at different memory addresses. The compiler doesn't see the program as clearly as you do: it can tell what's on the stack (because you declared it like that), but everything else is just arbitrary memory addresses that could be anything, anywhere, overlapping any other pointer.
I'm not surprised the optimisation in your question (using local variables) isn't made automatically - not only would the compiler have to prove the memory of arr does not overlap anything pointed to by this, but also that not updating the member variables until the end of the function is equivalent to the un-optimised version updating them throughout the function. That can be a lot trickier to determine than you might imagine, especially if you take concurrency into account.