Can C++ compilers optimize calls to at()? - c++

Since regular array accesses via the [] operator are unchecked, it's not fun to hit the headlines when your program has a remote code execution exploit or data leakage due to a buffer overflow.
Most standard array containers provide an at() method that allows bounds-checked access to the array elements. This makes out-of-bounds array accesses well defined (an exception is thrown) instead of undefined behavior.
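For example (an illustrative snippet, not from the original question):
#include <vector>
#include <stdexcept>
#include <cstdio>

int main()
{
    std::vector<int> v{1, 2, 3};
    try {
        return v.at(3);          // index 3 is out of bounds: throws, no UB
    } catch (const std::out_of_range& e) {
        std::puts(e.what());     // implementation-defined message
    }
    return 0;
}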
This basically eliminates buffer-overflow arbitrary-code-execution exploits, and there is also a clang-tidy check that warns that you should use at() when the index is non-constant. So I changed it in quite a few places.
Most managed languages have checked arrays and their compilers can eliminate the checks when they can.
I know C++ compilers can do awesome optimizations. The question is can C++ compilers do this to eliminate calls to at() when they see it can't overflow?

Here is a classic case that would be subject to bounds check elimination in managed languages: iterating up to the size.
#include <vector>
int test(std::vector<int> &v)
{
    int sum = 0;
    for (size_t i = 0; i < v.size(); i++)
        sum += v.at(i);
    return sum;
}
This is not as trivial to optimize as when both the index and the size are constants (which constant propagation could solve); it requires more advanced reasoning about the relationships between values.
As seen on Godbolt, GCC (9.2), Clang (9.0.0) and even MSVC (v19.22) can handle such code reasonably. GCC and Clang autovectorize. MSVC just generates a basic loop:
$LL4#test:
add eax, DWORD PTR [r9+rdx*4]
inc rdx
cmp rdx, r8
jb SHORT $LL4#test
Which is not that bad. But given that MSVC does vectorize a similar loop that uses [] instead of .at(), I have to conclude that yes, there is a significant cost to using at() even in some basic cases where we might expect otherwise (especially given that the generated loop contains no range check, so the auto-vectorization step got scared for seemingly no reason). If you choose to target only GCC and Clang, there is less of an issue. In more tricky cases, GCC and Clang can also be "sufficiently confused", for example when the indexes are passed through a data structure (unlikely code, but the point is that range information can sometimes be lost), as in the sketch below.
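For instance, in this hypothetical variant (mine, not from the answer) the indexes take a round trip through another container, so no compiler can prove idx < v.size() and the check in at() stays:
#include <vector>
#include <cstddef>

int sum_indirect(const std::vector<int>& v, const std::vector<std::size_t>& indexes)
{
    int sum = 0;
    for (std::size_t idx : indexes)
        sum += v.at(idx);   // range of idx is unknown: the bounds check remains
    return sum;
}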

Related

Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Consider the following simple program:
#include <cstring>
#include <cstdio>
#include <cstdlib>
void replace(char *str, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str[i] == '/') {
            str[i] = '_';
        }
    }
}
const char *global_str = "the quick brown fox jumps over the lazy dog";
int main(int argc, char **argv) {
    const char *str = argc > 1 ? argv[1] : global_str;
    replace(const_cast<char *>(str), std::strlen(str));
    puts(str);
    return EXIT_SUCCESS;
}
It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the replace function¹. For example, a.out foo/bar prints:
foo_bar
Elementary stuff so far, right?
If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.
Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.
gcc and clang compile a binary that has the expected behavior, with or without passing a string on the command line. icc, however, produces a binary that crashes when you don't provide a string:
icc -xcore-avx2 char_replace.cpp && ./a.out
Segmentation fault (core dumped)
The underlying cause is the main loop of replace, which looks like this:
400c0c: vmovdqu ymm2,YMMWORD PTR [rsi]
400c10: add rbx,0x20
400c14: vpcmpeqb ymm3,ymm0,ymm2
400c18: vpblendvb ymm4,ymm2,ymm1,ymm3
400c1e: vmovdqu YMMWORD PTR [rsi],ymm4
400c22: add rsi,0x20
400c26: cmp rbx,rcx
400c29: jb 400c0c <main+0xfc>
It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.
This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!
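In intrinsics form, the loop corresponds roughly to this (an illustrative reconstruction, not actual ICC output):
#include <immintrin.h>
#include <stddef.h>

void replace_vectorized(char* str, size_t len)
{
    const __m256i slash = _mm256_set1_epi8('/');
    const __m256i under = _mm256_set1_epi8('_');
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v    = _mm256_loadu_si256((const __m256i*)(str + i));
        __m256i mask = _mm256_cmpeq_epi8(v, slash);        // 0xFF where byte == '/'
        __m256i r    = _mm256_blendv_epi8(v, under, mask); // pick '_' at matches
        _mm256_storeu_si256((__m256i*)(str + i), r);       // unconditional store
    }
    for (; i < len; i++)                                   // scalar tail
        if (str[i] == '/') str[i] = '_';
}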
Is my code legal C++, and should I therefore blame icc for miscompiling this, or am I wading into the UB swamp somewhere?
¹ The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.
Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false) can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.
Compilers aren't in general allowed to invent writes with if-conversion to branchless code.
Casting away const is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.
ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.
We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.
Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.
Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)
#include <stddef.h>
void replace(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str1[i] == '/') {
            str2[i] = '_';
        }
    }
}
It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:
..B1.15: # Preds ..B1.15 ..B1.14 //do{
vmovdqu ymm2, YMMWORD PTR [rsi+r8] #6.13 // load from str2
vpcmpeqb ymm3, ymm0, YMMWORD PTR [rdi+r8] #5.24 // compare vs. str1
vpblendvb ymm4, ymm2, ymm1, ymm3 #6.13 // blend
vmovdqu YMMWORD PTR [r8+rsi], ymm4 #6.13 // store to str2
add r8, 32 #4.5 // i+=32
cmp r8, rax #4.5
jb ..B1.15 # Prob 82% #4.5 // }while(i<len);
For thread-safety, it's well known that inventing writes via a non-atomic read/rewrite is unsafe.
The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.
But even worse than that, str2 can be nullptr. Or it can point close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page will happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref it). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.
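For example, this hypothetical call is well-defined in the abstract machine, yet ICC's invented read/rewrite of str2 would dereference a null pointer once len is large enough for the vectorized loop to run:
#include <string.h>

void replace(const char *str1, char *str2, size_t len); // the version above

int main()
{
    // str1 contains no '/', so the abstract machine never evaluates
    // str2[i] at all; passing a null pointer is therefore legal.
    const char *clean = "the quick brown fox jumps over the lazy dog";
    replace(clean, NULL, strlen(clean));
    return 0;
}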
Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)
(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)
As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.
(However, I'm pretty confident that Intel compiler devs would consider this a bug.)
Some less-plausible (to see in a real program) examples that this would also break:
Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).
Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).
For future readers: If you want to let compilers auto-vectorize this way:
You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.
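A minimal sketch of that formulation, following the pattern just described (same function shape as the two-pointer version above):
#include <stddef.h>

// Always stores to str2[i] in the abstract machine, so an auto-vectorizer
// may legally turn the loop into unconditional load/blend/store.
void replace_always_write(const char *str1, char *str2, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        str2[i] = (str1[i] == '/') ? '_' : str2[i];   // unconditional write
    }
}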
In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).
Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).
vpcmpeqb k1, zmm0, [rdi] ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1 ; masked store that only writes elements where the mask is true
ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512, although with ymm vectors, because 512-bit vectors lower max turbo too much to be worth it unless your whole program makes heavy use of AVX512 (on Skylake Xeons, anyway).
So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.
AVX1 has masked stores (vmaskmovps/pd) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q.

Why don't C++ compilers optimize this conditional boolean assignment as an unconditional assignment?

Consider the following function:
void func(bool& flag)
{
    if(!flag) flag=true;
}
It seems to me that if flag has a valid boolean value, this would be equivalent to unconditional setting it to true, like this:
void func(bool& flag)
{
    flag=true;
}
Yet neither gcc nor clang optimize it this way — both generate the following at -O3 optimization level:
_Z4funcRb:
.LFB0:
.cfi_startproc
cmp BYTE PTR [rdi], 0
jne .L1
mov BYTE PTR [rdi], 1
.L1:
rep ret
My question is: is it just that the code is too special-case to be worth optimizing, or are there good reasons why such an optimization would be undesired, given that flag is not a reference to volatile? It seems the only possible reason is that flag could somehow have a non-true-or-false value without undefined behavior at the point of reading it, but I'm not sure whether this is possible.
This may negatively impact the performance of the program due to cache coherence considerations. Writing to flag each time func() is called would dirty the containing cache line. This will happen regardless of the fact that the value being written exactly matches the bits found at the destination address before the write.
EDIT
hvd has provided another good reason that prevents such an optimization. It is a more compelling argument against the proposed optimization, since it may result in undefined behavior, whereas my (original) answer only addressed performance aspects.
After a little more reflection, I can propose one more example of why compilers should be strictly forbidden (unless they can prove that the transformation is safe for a particular context) from introducing the unconditional write. Consider this code:
const bool foo = true;
int main()
{
    func(const_cast<bool&>(foo));
}
With an unconditional write in func(), this definitely triggers undefined behavior (writing to read-only memory will terminate the program, even though the effect of the write would otherwise be a no-op).
Aside from Leon's answer on performance:
Suppose flag is true. Suppose two threads are constantly calling func(flag). The function as written, in that case, does not store anything to flag, so this should be thread-safe. Two threads do access the same memory, but only to read it. Unconditionally setting flag to true means two different threads would be writing to the same memory. This is not safe, this is unsafe even if the data being written is identical to the data that's already there.
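Concretely, a sketch of that scenario (a hypothetical harness, not from the answer):
#include <thread>

void func(bool& flag)
{
    if (!flag) flag = true;   // as written: no store once flag is true
}

int main()
{
    bool flag = true;
    // flag is already true, so both threads only ever read it: no data
    // race. With an unconditional flag = true; both threads would write
    // the same location concurrently, which is a data race (UB).
    std::thread t1([&flag] { for (int i = 0; i < 1000000; ++i) func(flag); });
    std::thread t2([&flag] { for (int i = 0; i < 1000000; ++i) func(flag); });
    t1.join();
    t2.join();
    return 0;
}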
I am not sure about the behaviour of C++ here, but in C the memory might change: if the memory contains a non-zero value other than 1, it would remain unchanged with the check, but would be changed to 1 without it.
But as I am not very fluent in C++, I don't know if this situation is even possible.

Why don't modern C++ compilers optimize away simple loops like this? (Clang, MSVC)

When I compile and run this code with Clang (-O3) or MSVC (/O2)...
#include <stdio.h>
#include <time.h>
static int const N = 0x8000;
int main()
{
    clock_t const start = clock();
    for (int i = 0; i < N; ++i)
    {
        int a[N]; // Never used outside of this block, but not optimized away
        for (int j = 0; j < N; ++j)
        {
            ++a[j]; // This is undefined behavior (due to possible
                    // signed integer overflow), but Clang doesn't see it
        }
    }
    clock_t const finish = clock();
    fprintf(stderr, "%u ms\n",
            static_cast<unsigned int>((finish - start) * 1000 / CLOCKS_PER_SEC));
    return 0;
}
... the loop doesn't get optimized away.
Furthermore, neither Clang 3.6 nor Visual C++ 2013 nor GCC 4.8.1 tells me that the variable is uninitialized!
Now I realize that the lack of an optimization isn't a bug per se, but I find this astonishing given how compilers are supposed to be pretty smart nowadays. This seems like such a simple piece of code that even liveness analysis techniques from a decade ago should be able to take care of optimizing away the variable a and therefore the whole loop -- never mind the fact that incrementing the variable is already undefined behavior.
Yet only GCC is able to figure out that it's a no-op, and none of the compilers tells me that this is an uninitialized variable.
Why is this? What's preventing simple liveness analysis from telling the compiler that a is unused? Moreover, why isn't the compiler detecting that a[j] is uninitialized in the first place? Why can't the existing uninitialized-variable-detectors in all of those compilers catch this obvious error?
The undefined behavior is irrelevant here. Replacing the inner loop with:
for (int j = 1; j < N; ++j)
{
    a[j-1] = a[j];
    a[j] = j;
}
... has the same effect, at least with Clang.
The issue is that the inner loop both loads from a[j] (for some j) and stores to a[j] (for some j). None of the stores can be removed, because the compiler believes they may be visible to later loads, and none of the loads can be removed, because their values are used (as input to the later stores). As a result, the loop still has side-effects on memory, so the compiler doesn't see that it can be deleted.
Contrary to n.m.'s answer, replacing int with unsigned does not make the problem go away. The code generated by Clang 3.4.1 using int and using unsigned int is identical.
It's an interesting issue with regards to optimizing. I would expect that in most cases, the compiler would treat each element of the array as an individual variable when doing dead-code analysis. And 0x8000 makes too many individual variables to track, so the compiler doesn't try. The fact that a[j] doesn't always access the same object could cause problems for the optimizer as well.
Obviously, different compilers use different heuristics; a compiler could treat the array as a single object, and detect that it never affected output (observable behavior). Some compilers may choose not to, however, on the grounds that typically it's a lot of work for very little gain: how often would such optimizations be applicable in real code?
++a[j]; // This is undefined behavior too, but Clang doesn't see it
Are you saying this is undefined behavior because the array elements are uninitialized?
If so, although this is a common interpretation of clause 4.1/1 in the standard I believe it is incorrect. The elements are 'uninitialized' in the sense that programmers usually use this term, but I do not believe this corresponds exactly to the C++ specification's use of the term.
In particular C++11 8.5/11 states that these objects are in fact default initialized, and this seems to me to be mutually exclusive with being uninitialized. The standard also states that for some objects being default initialized means that 'no initialization is performed'. Some might assume this means that they are uninitialized, but this is not specified, and I simply take it to mean that no such initialization is required.
The spec does make clear that the array elements will have indeterminate values. C++ specifies, by reference to the C standard, that indeterminate values can be either valid representations, legal to access normally, or trap representations. If the particular indeterminate values of the array elements all happen to be valid representations (and none are INT_MAX, avoiding overflow), then the above line does not trigger any undefined behavior in C++11.
Since these array elements could be trap representations, it would be perfectly conformant for clang to act as though they are guaranteed to be trap representations, effectively choosing to make the code UB in order to create an optimization opportunity.
Even if clang doesn't do that it could still choose to optimize based on the dataflow. Clang does know how to do that, as demonstrated by the fact that if the inner loop is changed slightly then the loops do get removed.
So then why does the (optional) presence of UB seem to stymie optimization, when UB is usually taken as an opportunity for more optimization?
What may be going on is that clang has decided that users want int trapping based on the hardware's behavior. And so rather than taking traps as an optimization opportunity, clang has to generate code which faithfully reproduces the program behavior in hardware. This means that the loops cannot be eliminated based on dataflow, because doing so might eliminate hardware traps.
C++14 updates the behavior such that accessing indeterminate values is itself undefined behavior, independent of whether one considers the variable uninitialized or not: https://stackoverflow.com/a/23415662/365496
That is indeed very interesting. I tried your example with MSVC 2013.
My first idea was that the somewhat undefined ++a[j] is the reason why the loop is not removed, because removing it would definitely change the meaning of the program from undefined/incorrect semantics to something meaningful, so I tried initializing the values beforehand, but the loops still did not disappear.
Afterwards I replaced the ++a[j]; with a[j] = 0;, which then produced an output without any loop, so everything between the two calls to clock() was removed. I can only guess at the reason. Perhaps the optimizer is not able to prove that operator++ has no side effects for some reason.
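For comparison, a store-only variant (hypothetical) has no loads feeding its stores, so the compiler can prove the array is dead, matching the observation above that a[j] = 0; makes the loops disappear:
static int const N = 0x8000;

void store_only()
{
    for (int i = 0; i < N; ++i)
    {
        int a[N];
        for (int j = 0; j < N; ++j)
        {
            a[j] = 0; // pure store, never read back: the whole nest is dead
        }
    }
}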

What are the differences between using array offsets vs pointer incrementation?

Given two functions, which should be faster, if there is any difference at all? Assume that the input data is very large.
void iterate1(const char* pIn, int Size)
{
    for ( int offset = 0; offset < Size; ++offset )
    {
        doSomething( pIn[offset] );
    }
}
vs
void iterate2(const char* pIn, int Size)
{
    const char* pEnd = pIn+Size;
    while(pIn != pEnd)
    {
        doSomething( *pIn++ );
    }
}
Are there other issues to be considered with either approach?
Chances are, your compiler's optimizer will create a loop induction variable for the first case to turn it into the second. I'd expect no difference after optimizations so I tend to prefer the first style because I find it clearer to read.
Boojum is correct - IF your compiler has a good optimizer and you have it enabled. If that's not the case, or your use of arrays isn't sequential and liable to optimization, using array offsets can be far, far slower.
Here's an example. Back about 1988, we were implementing a window with a simple teletype interface on a Mac II. This consisted of 24 lines of 80 characters. When you got a new line in from the ticker, you scrolled up the top 23 lines and displayed the new one on the bottom. When there was something on the teletype, which wasn't all the time, it came in at 300 baud, which with the serial protocol overhead was about 30 characters per second. So we're not talking something that should have taxed a 16 MHz 68020 at all!
But the guy who wrote this did it like:
char screen[24][80];
and used 2-D array offsets to scroll the characters like this:
int i, j;
for (i = 0; i < 23; i++)
    for (j = 0; j < 80; j++)
        screen[i][j] = screen[i+1][j];
Six windows like this brought the machine to its knees!
Why? Because compilers were stupid in those days, so in machine language, every instance of the inner-loop assignment, screen[i][j] = screen[i+1][j], looked kind of like this (Ax and Dx are CPU registers):
Fetch the base address of screen from memory into the A1 register
Fetch i from stack memory into the D1 register
Multiply D1 by a constant 80
Fetch j from stack memory and add it to D1
Add D1 to A1
Fetch the base address of screen from memory into the A2 register
Fetch i from stack memory into the D1 register
Add 1 to D1
Multiply D1 by a constant 80
Fetch j from stack memory and add it to D1
Add D1 to A2
Fetch the value from the memory address pointed to by A2 into D1
Store the value in D1 into the memory address pointed to by A1
So we're talking 13 machine language instructions for each of the 23x80=1840 inner loop iterations, for a total of 23920 instructions, including 3680 CPU-intensive integer multiplies.
We made a few changes to the C source code, so then it looked like this:
int i, j;
register char *a, *b;
for (i = 0; i < 23; i++)
{
    a = screen[i];
    b = screen[i+1];
    for (j = 0; j < 80; j++)
        *a++ = *b++;
}
There are still two machine-language multiplies, but they're in the outer loop, so there are only 46 integer multiplies instead of 3680. And the inner loop *a++ = *b++ statement only consisted of two machine-language operations.
Fetch the value from the memory address pointed to by A2 into D1, and post-increment A2
Store the value in D1 into the memory address pointed to by A1, and post-increment A1.
Given there are 1840 inner loop iterations, that's a total of 3680 CPU-cheap instructions - 6.5 times fewer - and NO integer multiplies. After this, instead of dying at six teletype windows, we never were able to pull up enough to bog the machine down - we ran out of teletype data sources first. And there are ways to optimize this much, much further, as well.
Now, modern compilers will do that kind of optimization for you - IF you ask them to do it, and IF your code is structured in a way that permits it.
But there are still circumstances where compilers can't do that for you - for instance, if you're doing non-sequential operations in the array.
So I've found it's served me well to use pointers instead of array references whenever possible. The performance is certainly never worse, and frequently much, much better.
With modern compiler there shouldn't be any difference in performance between the two, especially in such simplistic easily recognizable examples. Moreover, even if the compiler does not recognize their equivalence, i.e. translates each code "literally", there still shouldn't be any noticeable performance difference on a typical modern hardware platform. (Of course, there might be more specialized platforms out there where the difference might be noticeable.)
As for other considerations... Conceptually, when you implement an algorithm using the index access you impose a random-access requirement on the underlying data structure. When you use a pointer ("iterator") access, you only impose a sequential-access requirement on the underlying data structure. Random-access is a stronger requirement than sequential-access. For this reason I, for one, in my code prefer to stick to pointer access whenever possible, and use index access only when necessary.
More generally, if an algorithm can be implemented efficiently through sequential access, it is better to do it that way, without involving the unnecessary stronger requirement of random-access. This might prove useful in the future, should a need arise to refactor the code or to change the algorithm.
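In current C++ the same sequential-access preference is usually expressed with iterators; a sketch (my generalization of iterate2, not from the answer):
template <typename InputIt, typename Func>
void iterate3(InputIt first, InputIt last, Func doSomething)
{
    // Only sequential access is required, so this works for any input
    // range (arrays, lists, streams), not just contiguous memory.
    for (; first != last; ++first)
        doSomething(*first);
}
Calling iterate3(pIn, pIn + Size, doSomething) reproduces iterate2.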
They are almost identical. Both solutions involve a temporary variable, an increment of a word on your system (int or ptr), and a logical check which should take one assembly instruction.
The only difference I see is that the array lookup
arr[idx]
might require pointer arithmetic and then a fetch, while the dereference
*ptr
requires just a fetch.
My advice is that if it really matters, implement both and see if there's any savings.
To be sure, you must profile in your intended target environment.
That said, my guess is that any modern compiler is going to optimize them both down to very similar (if not identical) code.
If you didn't have an optimizer, the second has a chance of being faster, because you aren't re-computing the pointer on every iteration. But unless Size is a VERY large number (or the routine is called quite often), the difference isn't going to matter to your program's overall execution speed.
The pointer op used to be much faster. Now it's a bit faster, but the compiler may optimize it for you
Historically it was much faster to iterate via *p++ than p[i]; that was part of the motivation for having pointers in the language.
Plus, p[i] often required a slower multiply op, or at least a shift, so the optimization of replacing multiplies in a loop with adds to a pointer was sufficiently important to have a specific name: strength reduction. The subscript form also tended to produce bigger code.
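Illustratively (hypothetical helpers, with doSomething assumed as in the question), strength reduction rewrites the indexed form into the pointer form:
void doSomething(int); // assumed to exist, as in the question's examples

// Before strength reduction: each iteration computes the address p + i,
// which implies a multiply (or shift) by sizeof(int).
void before(const int* p, int n)
{
    for (int i = 0; i < n; i++)
        doSomething(p[i]);
}

// After strength reduction: the multiply is gone; a pointer is carried
// across iterations and simply incremented.
void after(const int* p, int n)
{
    for (const int* q = p, * const end = p + n; q != end; ++q)
        doSomething(*q);
}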
However, two things have changed: one is that compilers are much more sophisticated and are generally capable of doing this optimization for you.
The other is that the relative difference between an op and a memory access has increased. When *p++ was invented memory and cpu op times were similar. Today, a random desktop machine can do 3 billion integer ops / second, but only about 10 or 20 million random DRAM reads. Cache accesses are faster, and the system will prefetch and stream sequential memory accesses as you step through an array, but it still costs a lot to hit memory, and a bit of subscript fiddling isn't such a big deal.
Several years ago I asked this exact question. Someone in an interview was failing a candidate for picking the array notation because it was supposedly obviously slower. At that point I compiled both versions and looked at the disassembly. There was one opcode extra in the array notation. This was with Visual C++ (.net?). Based on what I saw I concluded that there is no appreciable difference.
Doing this again, here is what I found:
iterate1(arr, 400); // array notation
011C1027 mov edi,dword ptr [__imp__printf (11C20A0h)]
011C102D add esp,0Ch
011C1030 xor esi,esi
011C1032 movsx ecx,byte ptr [esp+esi+8] <-- Loop starts here
011C1037 push ecx
011C1038 push offset string "%c" (11C20F4h)
011C103D call edi
011C103F inc esi
011C1040 add esp,8
011C1043 cmp esi,190h
011C1049 jl main+32h (11C1032h)
iterate2(arr, 400); // pointer offset notation
011C104B lea esi,[esp+8]
011C104F nop
011C1050 movsx edx,byte ptr [esi] <-- Loop starts here
011C1053 push edx
011C1054 push offset string "%c" (11C20F4h)
011C1059 call edi
011C105B inc esi
011C105C lea eax,[esp+1A0h]
011C1063 add esp,8
011C1066 cmp esi,eax
011C1068 jne main+50h (11C1050h)
Why don't you try both and time them? My guess would be that they are optimized by the compiler into basically the same code. Just remember to turn on optimizations when comparing (-O3).
In the "other considerations" column, I'd say approach one is more clear. That's just my opinion though.
You're asking the wrong question. Should a developer aim for readability or performance first?
The first version is idiomatic for processing array, and your intent will be clear to anyone who has worked with arrays before, whereas the second relies heavily on the equivalence between array names and pointers, forcing someone reading the code to switch metaphors several times.
Cue the comments saying that the second version is crystal clear to any developer worth his keyboard.
If you wrote your program, and it's running slow, and you have profiled to the point where you have identified this loop as the bottleneck, then it would make sense to pop the hood and look at which of these is faster. But get something clear up and running first using well-known idiomatic language constructs.
Performance questions aside, it strikes me that the while loop variant has potential maintainability issues, as a programmer coming along to add some new bells and whistles has to remember to put the array increment in the right place, whereas the for loop variant puts it safely out of the body of the loop.

GCC: program doesn't work with compilation option -O3

I'm writing a C++ program that doesn't work (I get a segmentation fault) when I compile it with optimizations (options -O1, -O2, -O3, etc.), but it works just fine when I compile it without optimizations.
Is there any chance that the error is in my code, or should I assume that this is a bug in GCC?
My GCC version is 3.4.6.
Is there any known workaround for this kind of problem?
There is a big difference in speed between the optimized and unoptimized version of my program, so I really need to use optimizations.
This is my original functor. The one that works fine with no levels of optimizations and throws a segmentation fault with any level of optimization:
struct distanceToPointSort{
    indexedDocument* point ;
    distanceToPointSort(indexedDocument* p): point(p) {}
    bool operator() (indexedDocument* p1,indexedDocument* p2){
        return distance(point,p1) < distance(point,p2) ;
    }
} ;
And this one works flawlessly with any level of optimization:
struct distanceToPointSort{
    indexedDocument* point ;
    distanceToPointSort(indexedDocument* p): point(p) {}
    bool operator() (indexedDocument* p1,indexedDocument* p2){
        float d1=distance(point,p1) ;
        float d2=distance(point,p2) ;
        std::cout << "" ; //without this line, I get a segmentation fault anyways
        return d1 < d2 ;
    }
} ;
Unfortunately, this problem is hard to reproduce because it happens with some specific values. I get the segmentation fault upon sorting just one out of more than a thousand vectors, so it really depends on the specific combination of values each vector has.
Now that you have posted the code fragment, and a working workaround has been found (see Windows programmer's answer), I can say that perhaps what you are looking for is -ffloat-store.
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
Source: http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Optimize-Options.html
I would assume your code is wrong first.
Though it is hard to tell.
Does your code compile with 0 warnings?
g++ -Wall -Wextra -pedantic -ansi
Here's some code that seems to work, until you hit -O3...
#include <stdio.h>
int main()
{
    int i = 0, j = 1, k = 2;
    printf("%d %d %d\n", *(&j-1), *(&j), *(&j+1));
    return 0;
}
Without optimisations, I get "2 1 0"; with optimisations I get "40 1 2293680". Why? Because i and k got optimised out!
But I was taking the address of j and going out of the memory region allocated to j. That's not allowed by the standard. It's most likely that your problem is caused by a similar deviation from the standard.
I find valgrind is often helpful at times like these.
EDIT: Some commenters are under the impression that the standard allows arbitrary pointer arithmetic. It does not. Remember that some architectures have funny addressing schemes, alignment may be important, and you may get problems if you overflow certain registers!
The words of the [draft] standard, on adding/subtracting an integer to/from a pointer (emphasis added):
"If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."
Seeing as &j doesn't even point to an array object, &j-1 and &j+1 can hardly point to part of the same array object. So simply evaluating &j+1 (let alone dereferencing it) is undefined behaviour.
On x86 we can be pretty confident that adding one to a pointer is fairly safe and just takes us to the next memory location. In the code above, the problem occurs when we make assumptions about what that memory contains, which of course the standard doesn't go near.
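For contrast, here is what the quoted rule does allow (an illustrative sketch): a one-past-the-end pointer into a real array may be formed and compared, but never dereferenced:
#include <stdio.h>

int main()
{
    int arr[4] = {0, 1, 2, 3};
    int *end = arr + 4;                // one past the last element: legal
    for (int *p = arr; p != end; ++p)  // legal to compute and compare
        printf("%d\n", *p);
    // *end would be undefined behavior; even just forming arr + 5 is undefined.
    return 0;
}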
As an experiment, try to see if this will force the compiler to round everything consistently.
volatile float d1=distance(point,p1) ;
volatile float d2=distance(point,p2) ;
return d1 < d2 ;
The error is in your code. It's likely you're doing something that invokes undefined behavior according to the C standard which just happens to work with no optimizations, but when GCC makes certain assumptions for performing its optimizations, the code breaks when those assumptions aren't true. Make sure to compile with the -Wall option, and the -Wextra might also be a good idea, and see if you get any warnings. You could also try -ansi or -pedantic, but those are likely to result in false positives.
You may be running into an aliasing problem (or it could be a million other things). Look up the -fstrict-aliasing option.
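A classic shape of an aliasing violation that -fstrict-aliasing (on by default at -O2 and above) can expose, plus the well-defined alternative (illustrative, not the asker's code):
#include <cstring>

// UB: accessing a float object through an unsigned* lvalue violates the
// strict aliasing rule, so the optimizer may reorder or drop accesses.
unsigned bits_bad(float f)
{
    return *(unsigned*)&f;
}

// Well-defined: copy the object representation instead; compilers
// typically optimize the memcpy down to a single register move.
unsigned bits_ok(float f)
{
    unsigned u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}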
This kind of question is impossible to answer properly without more information.
It is very seldom the compiler's fault; compilers do have bugs in them, but those often manifest themselves at different optimization levels (if there is a bug in an optimization pass, for example).
In general when reporting programming problems: provide a minimal code sample to demonstrate the issue, such that people can just save the code to a file, compile and run it. Make it as easy as possible to reproduce your problem.
Also, try different versions of GCC (compiling your own GCC is very easy, especially on Linux). If possible, try with another compiler. Intel C has a compiler which is more or less GCC compatible (and free for non-commercial use, I think). This will help pinpointing the problem.
It's almost (almost) never the compiler.
First, make sure you're compiling warning-free, with -Wall.
If that didn't give you a "eureka" moment, attach a debugger to the least optimized version of your executable that crashes and see what it's doing and where it goes.
5 will get you 10 that you've fixed the problem by this point.
Ran into the same problem a few days ago; in my case it was aliasing. And GCC does it differently, but not wrongly, compared to other compilers. GCC has become what some might call a rules lawyer of the C++ standard, and its implementation is correct, but you also have to be really correct in your C++, or it'll over-optimize some things, which is a pain. But you get speed, so you can't complain.
I expect to get some downvotes here after reading some of the comments, but in the console game programming world, it's rather common knowledge that the higher optimization levels can sometimes generate incorrect code in weird edge cases. It might very well be that edge cases can be fixed with subtle changes to the code, though.
Alright...
This is one of the weirdest problems I've ever had.
I don't think I have enough proof to state it's a GCC bug, but honestly... It really looks like one.
Wow, I didn't expect answers so quickly, and so many...
The error occurs upon sorting a std::vector of pointers using std::sort()
I provide the strict-weak-ordering functor.
But I know the functor I provide is correct because I've used it a lot and it works fine.
Plus, the error cannot be some invalid pointer in the vector because the error occurs just when I sort the vector. If I iterate through the vector without applying std::sort first, the program works fine.
I just used GDB to try to find out what's going on. The error occurs when std::sort invokes my functor. Apparently std::sort is passing an invalid pointer to my functor. (Of course this happens with the optimized version only, at any level of optimization: -O, -O2, -O3.)
As others have pointed out, probably strict aliasing.
Turn it off at -O3 and try again. My guess is that you are doing some pointer tricks in your functor (fast float-as-int compare? object type in the lower 2 bits?) that fail across inlined template functions.
Warnings do not help to catch this case ("if the compiler could detect all strict aliasing problems it could just as well avoid them"); just changing an unrelated line of code may make the problem appear or go away, as it changes register allocation.
As the updated question shows ;), the problem exists with a std::vector<T*>. One common error with vectors is reserve()ing what should have been resize()d. As a result, you'd be writing outside array bounds. An optimizer may discard those writes.
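The reserve()/resize() mix-up mentioned looks like this (a hypothetical example):
#include <vector>

int main()
{
    static int x = 42;
    std::vector<int*> v;
    v.reserve(100);  // allocates capacity; size() is still 0
    v[5] = &x;       // UB: element 5 doesn't exist; the optimizer may
                     // discard or reorder this out-of-bounds write
    // Correct: v.resize(100); creates the elements, then v[5] = &x; is fine.
    return 0;
}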
Post the code of distance! It probably does some pointer magic, see my previous post. Doing an intermediate assignment just hides the bug in your code by changing register allocation. Even more telling is that the output statement changes things!
The true answer is hidden somewhere inside all the comments in this thread. First of all: it is not a bug in the compiler.
The problem has to do with floating-point precision. distanceToPointSort should be a function that never returns true for both argument orders (a,b) and (b,a), but that is exactly what can happen when the compiler decides to use higher precision for some data paths. The problem is especially likely on, but by no means limited to, x86 without -mfpmath=sse. If the comparator behaves that way, the sort function can become confused, and the segmentation fault is not surprising.
I consider -ffloat-store the best solution here (already suggested by CesarB).
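For reference, here is a sketch of the asker's functor with the volatile workaround folded in (equivalent in spirit to -ffloat-store for just this comparison; illustrative, not tested against the original crash):
struct distanceToPointSort {
    indexedDocument* point;
    distanceToPointSort(indexedDocument* p) : point(p) {}
    bool operator()(indexedDocument* p1, indexedDocument* p2) const {
        // volatile forces both results through memory, rounding them to
        // true float precision, so cmp(a,b) and cmp(b,a) can't both be true.
        volatile float d1 = distance(point, p1);
        volatile float d2 = distance(point, p2);
        return d1 < d2;
    }
};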