compiler reordering and load operation - c++

I'm starting out with lock-free programming and I'm running into some difficulties with the basics. I found the following example:
#define COMPILER_BARRIER() asm volatile("" ::: "memory")
int Value;
int IsPublished = 0;
void sendValue(int x)
{
Value = x;
COMPILER_BARRIER(); // prevent reordering of stores
IsPublished = 1;
}
int tryRecvValue()
{
if (IsPublished)
{
COMPILER_BARRIER(); // prevent reordering of loads
return Value;
}
return -1; // or some other value to mean not yet received
}
What kind of reordering can a compiler perform in the tryRecvValue function?

Answer
Without the inline asm*, the compiler is free to load Value first, then IsPublished, and only then perform the check and return.
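To make that concrete, here is a sketch (the function name is made up for illustration) of a transformation the compiler would be allowed to perform on tryRecvValue without the barrier:
int tryRecvValue_reordered()
{
    int tmp = Value;    // load of Value hoisted above the flag check
    if (IsPublished)
        return tmp;     // tmp may predate the write to Value, even though IsPublished reads 1
    return -1;
}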
Always check the assembly output if you're not sure what the compiler is doing. This is especially true of lock free programming techniques.
More Information
* NOTE: Inline asm is not a required part of the C++ standard, but is implemented by all major compilers. It is only mentioned in section 7.4:
The asm declaration is conditionally-supported; its meaning is implementation-defined. [ Note: Typically it
is used to pass information through the implementation to an assembler. — end note ]
Generally speaking, compilers cannot reorder reads and writes around inline assembly because the compiler cannot make assumptions about what the assembly does.
The use of COMPILER_BARRIER() above will not act as any kind of hardware read/write barrier; it just has the side effect of preventing the compiler from reordering memory accesses across the statement.
Assuming the above code is called from different threads, it will only work as-is on x86, since that architecture guarantees that stores are never (observably) reordered with other stores by the CPU. (reference: 8.2.3.2 in Volume 3A of the Intel manuals)
For use on other architectures with relaxed memory models (like PowerPC and ARM), you'll need hardware barriers that prevent this kind of reordering. For a good breakdown, check out Jeff Preshing's articles.
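In portable C++11 and later, the same publish/consume pattern is usually written with std::atomic release/acquire operations, which give you both the compiler ordering and the hardware ordering. A minimal sketch (same names as the question, not the original article's code):
#include <atomic>

int Value;
std::atomic<int> IsPublished{0};

void sendValue(int x)
{
    Value = x;
    IsPublished.store(1, std::memory_order_release); // publish only after Value is written
}

int tryRecvValue()
{
    if (IsPublished.load(std::memory_order_acquire)) // pairs with the release store
        return Value;
    return -1; // not yet received
}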

Related

Input Operand `"m"(var)` and Output Operand `"=m"(var)` in GNU C inline asm? Used with no instructions as barriers?

I wonder what input operand "m"(var) and output operand "=m"(var) in asm do:
asm volatile("" : "=m"(blk) : :);
asm volatile("" : : "m"(write_idx), "m"(blk) :);
I ran into the 2 lines above here, in a SPMC queue.
And what are the side effects? The lines above contain no asm instruction, so I believe the author was trying to exploit some well-defined side effects (e.g., in the second line, forcing the values of write_idx and blk held in registers to be flushed back to memory?).
asm volatile("" : "=m"(blk) : :);
After this statement, the compiler believes that some random assembly code has written to blk. As an output operand was given, the compiler assumes that the previous value of blk doesn't matter and can discard code that assigned a value to blk that nobody read before this statement. It also must assume that blk now holds some other value and read that value back from memory instead of e.g. from a copy in registers.
In short, this looks like an attempt to force the compiler into treating blk as if it had been assigned from a source unknown to the compiler. This is usually better achieved by qualifying blk as volatile.
Note that as the asm statement is qualified volatile, the statement additionally implies an ordering requirement to the compiler, preventing certain code rearrangements.
Similar code (though without the volatile) could also be used to tell the compiler that you want an object to assume an unspecified value as an optimisation. I recommend against this sort of trick though.
asm volatile("" : : "m"(write_idx), "m"(blk) :);
The compiler assumes that this statement reads from the variables write_idx and blk. Therefore, it must materialise the current contents of these variables into memory before it can execute the statement. This looks like a crude approximation of a memory barrier but does not actually effect a memory barrier as was likely intended.
Aside from this, the code you have shown must certainly be defective: C++ objects may not, unless specified otherwise, be modified concurrently without synchronisation. However, this code cannot have such synchronisation, as it does not include any headers with types (such as std::atomic) or facilities (such as mutexes) for synchronisation.
Without reading the code in detail, I suppose that the author just uses ordinary objects for synchronisation, believing concurrent mutation to be well-defined. Observing that it is not, the author likely believed that the compiler was to blame and added these asm statements as kludges to make it generate code that seems to work. However, this is incorrect though it may appear to work on architectures with sufficiently strict memory models (such as x86) and with sufficiently stupid compilers or by sheer luck.
Do not program like this. Use atomics or high level synchronisation facilities instead.
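To make the distinction concrete, here is a minimal sketch (GCC/Clang syntax; the helper names are made up) contrasting a compiler-only barrier with a real fence:
#include <atomic>

// Compiler barrier only: stops the compiler from moving memory accesses across
// this point, but emits no instruction and does not constrain the CPU.
inline void compiler_barrier() { asm volatile("" ::: "memory"); }

// Full fence: also orders the CPU's memory operations. On x86 this typically
// emits an mfence or a locked instruction; on ARM/POWER a dmb/sync.
inline void full_fence() { std::atomic_thread_fence(std::memory_order_seq_cst); }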
The first one sounds like a bug to me. An asm output operand must always be written to. The compiler will assume any stores to blk before that can be optimized out because the asm block will set a new value. Example:
extern int foo;
int bar()
{
auto& blk = foo;
asm volatile("" : "=m"(blk) : :);
return blk;
}
int baz()
{
foo = 42;
return bar();
}
Compiling with gcc 12.2 at -O2 will produce:
baz():
movl foo(%rip), %eax
ret
As you can see, the foo = 42; has been optimized out; this is just the return blk;. Commenting out the asm, for comparison:
baz():
movl $42, foo(%rip)
movl $42, %eax
ret
(The calls to bar have been inlined in both cases.)
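If the intent really was "this asm both reads and writes blk", the usual way to express that in GNU C is a read-write "+m" operand, which keeps earlier stores alive. A hedged sketch, reusing the bar/baz example above:
extern int foo;
int bar()
{
    auto& blk = foo;
    // "+m" marks blk as both read and written by the asm, so the compiler must
    // materialise prior stores to blk before the statement and reload it afterwards.
    asm volatile("" : "+m"(blk));
    return blk;
}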

C++ Statement Reordering

This is a question about Chandler's answer here (I didn't have a high enough rep to comment): Enforcing statement order in C++
In his answer, suppose foo() has no input or output. It's a black box that does work that is observable eventually, but won't be needed immediately (e.g. executes some callback). So we don't have input/output data locally handy to tell the compiler not to optimize. But I know that foo() will modify the memory somewhere, and the result will be observable eventually. Will the following prevent statement reordering and get the correct timing in this case?
#include <chrono>
#include <iostream>
//I believe this tells the compiler that all memory everywhere will be clobbered?
//(from his cppcon talk: https://youtu.be/nXaxk27zwlk?t=2441)
__attribute__((always_inline)) inline void DoNotOptimize() {
asm volatile("" : : : "memory");
}
// The compiler has full knowledge of the implementation.
static int ugly_global = 1; //we print this to screen sometime later
static void foo(void) { ugly_global *= 2; }
auto time_foo() {
using Clock = std::chrono::high_resolution_clock;
auto t1 = Clock::now(); // Statement 1
DoNotOptimize();
foo(); // Statement 2
DoNotOptimize();
auto t2 = Clock::now(); // Statement 3
return t2 - t1;
}
Will the following prevent statement reordering and get the correct timing in this case?
It should not be necessary because the calls to Clock::now should, at the language-definition level, enforce enough ordering. (That is, the C++11 standard says that the high resolution clock ought to get as much information as the system can give here, in the way that is most useful here. See "secondary question" below.)
But there is a more general case. It's worth thinking about the question: How does whoever provides the C++ library implementation actually write this function? Or, take C++ itself out of the equation. Given a language standard, how does an implementor—a person or group writing an implementation of that language—get you what you need? Fundamentally, we need to make a distinction between what the language standard requires and how an implementation provider goes about implementing the requirements.
The language itself may be expressed in terms of an abstract machine, and the C and C++ languages are. This abstract machine is pretty loosely defined: it executes some kind of instructions, which access data, but in many cases we don't know how it does these things, or even how big the various data items are (with some exceptions for fixed-size integers like int64_t), and so on. The machine may or may not have "registers" that hold things in ways that cannot be addressed, as well as memory that can be addressed and whose addresses can be recorded in pointers:
p = &var
stores into p (in memory or a register) a value such that using *p accesses the value stored in var (in memory or a register; some machines, especially back in the olden days, had addressable registers).1
Nonetheless, despite all of this abstraction, we want to run real code on real machines. Real machines have real constraints: some instructions might require particular values in particular registers (think about all the bizarre stuff in the x86 instruction sets, or wide-result integer multipliers and dividers that use special-purpose registers, as on some MIPS processors), or cause CPU synchronizations, or whatever.
GCC in particular invented a system of constraints to express what you could or could not do on the machine itself, using the machine's instruction set. Over time, this evolved into user-accessible asm constructs with input, output, and clobber sections. The particular one you show:
__attribute__((always_inline)) inline void DoNotOptimize() {
asm volatile("" : : : "memory");
}
expresses the idea that "this instruction" (asm; the actual provided instruction is blank) "cannot be moved" (volatile) "and clobbers all of the computer's memory, but no registers" ("memory" as the clobber section).
This is not part of either C or C++ as a language. It's just a compiler construction, supported by GCC and now supported by clang as well. But it suffices to force the compiler to issue all stores-to-memory before the asm, and reload values from memory as needed after the asm, in case they changed when the computer executed the (nonexistent) instruction included in the asm line. There's no guarantee that this will work, or even compile at all, in some other compiler, but as long as we're the implementor, we choose the compiler we're implementing for/with.
C++ as a language now has support for ordered memory operations, which an implementor must implement. The implementor can use these asm volatile constructs to achieve the right result, provided they do actually achieve the right result. For instance, if we need to cause the machine itself to synchronize—to emit a memory barrier—we can stick the appropriate machine instruction, such as mfence or membar #sync or whatever it may be, in the asm's instruction-section clause. See also compiler reordering vs memory reordering as Klaus mentioned in a comment.
It is up to the implementor to find an appropriately effective trick, compiler-specific or not, to get the right semantics while minimizing any runtime slowdown: for instance, we might want to use lfence rather than mfence if that's sufficient, or membar #LoadLoad, or whatever the right thing is for the machine. If our implementation of Clock::now requires some sort of fancy inline asm, we write one. If not, we don't. We make sure that we produce what's required—and then all users of the system can just use it, without needing to know what sort of grubby implementation tricks we had to invoke.
There's a secondary question here: does the language specification really constrain the implementor the way we think/hope it does? Chris Dodd's comment says he thinks so, and he's usually right on these kinds of questions. A couple of other commenters think otherwise, but I'm with Chris Dodd on this one. I think it is not necessary. You can always compile to assembly, or disassemble the compiled program, to check, though!
If the compiler didn't do the right thing, that asm would force it to do the right thing, in GCC and clang. It probably wouldn't work in other compilers.
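For reference, the pair of helpers Chandler shows in that talk is usually written along these lines (a sketch from memory, not an exact transcription): escape() tells the compiler that the pointed-to object may be read or modified by unknown code, and clobber() is the DoNotOptimize() from the question.
inline void escape(void* p) {
    // The compiler must assume the asm may read or write *p (and any other memory).
    asm volatile("" : : "g"(p) : "memory");
}

inline void clobber() {
    // The compiler must assume all memory may have been read or written.
    asm volatile("" : : : "memory");
}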
1On the KA-10 in particular, the registers were just the first sixteen words of memory. As the Wikipedia page notes, this meant you could put instructions into there and call them. Because the first 16 words were the registers, these instructions ran much faster than other instructions.

Clang/GCC Compiler Intrinsics without corresponding compiler flag

I know there are similar questions to this, but compiling different files with different flags is not an acceptable solution here since it would complicate the codebase very quickly. An answer of "No, it is not possible" will do.
Is it possible, in any version of Clang or GCC, to compile intrinsic functions for SSE2/SSE3/SSSE3/SSE4.1 while only allowing the compiler to use the SSE instruction set for its own optimization?
EDIT: For example, I want the compiler to turn _mm_load_si128() into movdqa, but the compiler must not emit this instruction anywhere other than in this intrinsic, similar to how the MSVC compiler works.
EDIT2: I have a dynamic dispatcher in place and several versions of a single function, written using intrinsics, for different instruction sets. Using multiple files would make this much harder to maintain, as the same version of code would span multiple files, and there are a lot of functions of this type.
EDIT3: Example source code as requested: https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/filters/resample.cpp or most file in that folder really.
Here is an approach using gcc that might be acceptable. All source code goes into a single source file. The single source file is divided into sections. One section generates code according to the command line options used. Functions like main() and processor feature detection go in this section. Another section generates code according to a target override pragma. Intrinsic functions supported by the target override value can be used. Functions in this section should be called only after processor feature detection has confirmed the needed processor features are present. This example has a single override section for AVX2 code. Multiple override sections can be used when writing functions optimized for multiple targets.
// temporarily switch target so that all x64 intrinsic functions will be available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
#include <intrin.h>
// restore the target selection
#pragma GCC pop_options
//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
//----------------------------------------------------------------------------
int dummy1 (int a) {return a;}
//----------------------------------------------------------------------------
// the following functions will be compiled using core-avx2 code generation
// all x64 intrinsic functions are available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
//----------------------------------------------------------------------------
static __m256i bitShiftLeft256ymm (__m256i *data, int count)
{
__m256i innerCarry, carryOut, rotate;
innerCarry = _mm256_srli_epi64 (*data, 64 - count); // carry outs in bit 0 of each qword
rotate = _mm256_permute4x64_epi64 (innerCarry, 0x93); // rotate ymm left 64 bits
innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC); // clear lower qword
*data = _mm256_slli_epi64 (*data, count); // shift all qwords left
*data = _mm256_or_si256 (*data, innerCarry); // propagate carrys from low qwords
carryOut = _mm256_xor_si256 (innerCarry, rotate); // clear all except lower qword
return carryOut;
}
//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
#pragma GCC pop_options
//----------------------------------------------------------------------------
int main (void)
{
return 0;
}
//----------------------------------------------------------------------------
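A related mechanism, available in both GCC and Clang, is the per-function target attribute, which enables an instruction set for a single function without changing code generation for the rest of the translation unit. A minimal sketch (the function name is made up; reasonably recent compilers expose the intrinsics headers unconditionally, older ones may not):
#include <immintrin.h>

// AVX2 code generation is enabled for this function only; call it only after
// runtime dispatch has confirmed the CPU supports AVX2.
__attribute__((target("avx2")))
void add_epi32_avx2(int* dst, const int* a, const int* b)
{
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), _mm256_add_epi32(va, vb));
}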
There is no way to control the instruction set used by the compiler other than through the switches on the compiler itself. In other words, there are no pragmas or other features for this, just the overall compiler flags.
This means that the only viable solution for achieving what you want is to use the -msseX switches and split your source into multiple files (of course, you can always use various clever #include tricks etc. to keep one single text file as the main source, and just include the same file in multiple places).
Of course, the source code of the compiler is available. I'm sure the maintainers of GCC and Clang/LLVM will happily take patches that improve on this. But bear in mind that the path from "parsing the source" to "emitting instructions" is quite long and complicated. What should happen if we do this:
#pragma use_sse=1
void func()
{
... some code goes here ...
}
#pragma use_sse=3
void func2()
{
...
func();
...
}
Now, func is short enough to be inlined; should the compiler inline it? If so, should it use SSE1 or SSE3 instructions for func()?
I understand that YOU may not care about that sort of difficulty, but the maintainers of Clang and GCC will indeed have to deal with this in some way.
Edit:
In the header files declaring the SSE intrinsics (and many other intrinsics), a typical function looks something like this:
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
The __builtin_ia32_addss builtin is only available in the compiler when you have enabled the -msse option. So if you convince the compiler to still allow you to use _mm_add_ss() when you have -mno-sse, it will give you an error along the lines of "__builtin_ia32_addss is not declared in this scope" (I just tried).
It would probably not be very hard to change this particular behaviour - there are probably only a few places where the code does the "introduce builtin functions". However, I'm not convinced that there are further issues in the code, later on when it comes to actually issuing instructions in the compiler.
I have done some work with "builtin functions" in a Clang-based compiler, and unfortunately, there are several steps involved in getting from the "parser" to the "code generation", where the builtin function gets involved.
Edit2:
Compared to GCC, solving this for Clang is even more complex, in that the compiler itself has understanding of SSE instructions, so it simply has this in the header file:
static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_add_ps(__m128 __a, __m128 __b)
{
return __a + __b;
}
The compiler will then know that to add a couple of __m128, it needs to produce the correct SSE instruction. I have just downloaded Clang (I'm at home, my work on Clang is at work, and not related to SSE at all, just builtin functions in general - and I haven't really done much of the changes to Clang as such, but it was enough to understand roughly how builtin functions work).
However, from your perspective, the fact that it's not a builtin function makes it worse, because the operator+ translation is much more complicated. I'm pretty sure the compiler just makes it into an "add these two things", and then pass it to LLVM for further work - LLVM will be the part that understands SSE instructions etc. But for your purposes, this makes it worse, because the fact that this is an "intrinsic function" is now pretty much lost, and the compiler just deals with it just as if you'd written a + b, with the side effect of a and b being types that are 128 bits long. It makes it even more complicated to deal with generating "the right instructions" and yet keeping "all other" instructions at a different SSE level.

How to store a C++ variable in a register

I would like some clarification regarding a point about the storage of register variables:
Is there a way to ensure that if we have declared a register variable in our code, that it will ONLY be stored in a register?
#include<iostream>
using namespace std;
int main()
{
register int i = 10;// how can we ensure this will store in register only.
i++;
cout << i << endl;
return 0;
}
You can't. It is only a hint to the compiler that suggests that the variable is heavily used. Here's the C99 wording:
A declaration of an identifier for an object with storage-class specifier register suggests that access to the object be as fast as possible. The extent to which such suggestions are effective is implementation-defined.
And here's the C++11 wording:
A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ Note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated (see D.2). —end note ]
In fact, the register storage class specifier is deprecated in C++11 (Annex D.2):
The use of the register keyword as a storage-class-specifier (7.1.1) is deprecated.
Note that you cannot take the address of a register variable in C because registers do not have an address. This restriction is removed in C++ and taking the address is pretty much guaranteed to ensure the variable won't end up in a register.
Many modern compilers simply ignore the register keyword in C++ (unless it is used in an invalid way, of course). They are simply much better at optimizing than they were when the register keyword was useful. I'd expect compilers for niche target platforms to treat it more seriously.
The register keyword has different meanings in C and C++. In C++ it is in fact redundant and seems even to be deprecated nowadays.
In C it is different. First, don't take the name of the keyword literally; it does not necessarily have anything to do with a "hardware register" on a modern CPU. The restriction imposed on register variables is that you can't take their address; the & operation is not allowed. This allows you to mark a variable for optimization and ensures that the compiler will shout at you if you try to take its address. In particular, a register variable that is also const qualified can never alias, so it is a good candidate for optimization.
Using register as in C systematically forces you to think of every place where you take the address of a variable. This is probably nothing you would want to do in C++, which heavily relies on references to objects and things like that. This might be a reason why C++ didn't copy this property of register variables from C.
Generally it's impossible. Specifically, one can take certain measures to increase the probability:
Use a proper optimization level, e.g. -O2
Keep the number of the variables small
register int a,b,c,d,e,f,g,h,i, ... z; // can also produce an error
// results in _spilling_ a register to stack
// as the CPU runs out of physical registers
Do not take an address of the register variable.
register int a;
int *b = &a; /* this would be an error in most compilers, but
especially in the embedded world the compilers
release the restrictions */
In some compilers, you can suggest
register int a asm ("eax"); // to put a variable to a specific register
Generally, C++ compilers (g++) perform quite a few optimizations on the code. So when you declare a register variable, it is not guaranteed that the compiler will store that value directly in a register; i.e. the code 'register int x' may not result in the compiler storing that int directly in a register. But if we can force the compiler to do so, we may be successful.
For example, if we use the following piece of code, we may force the compiler to do what we desire. If compilation errors out, that indicates the int is actually being stored directly in the register.
int main() {
volatile register int x asm ("eax");
int y = *(&x);
return 0;
}
For me, g++ compiler is throwing the following error in this case.
[nsidde#nsidde-lnx cpp]$ g++ register_vars.cpp
register_vars.cpp: In function ‘int main()’:
register_vars.cpp:3: error: address of explicit register variable ‘x’ requested
The line 'volatile register int x asm ("eax")' instructs the compiler to store the integer x in the 'eax' register and, in doing so, not to perform any optimizations on it. This makes sure that the value is stored in the register directly. That is why taking the address of the variable throws an error.
Alternatively, the C compiler (gcc), may error out with the following code itself.
int main() {
register int a=10;
int c = *(&a);
return 0;
}
For me, the gcc compiler is throwing the following error in this case.
[nsidde#nsidde-lnx cpp]$ gcc register.c
register.c: In function ‘main’:
register.c:5: error: address of register variable ‘a’ requested
It's just a hint to the compiler; you can't force it to place the variable in a register. In any event, the compiler writer probably has much better knowledge of the target architecture than the application programmer, and is therefore better placed to write code that makes register allocation decisions. In other words, you are unlikely to achieve anything by using register.
The "register" keyword is a remnant of the time when compilers had to fit on machines with 2MB of RAM (shared between 18 terminals with a user logged in on each). Or PC/Home computers with 128-256KB of RAM. At that point, the compiler couldn't really run through a large function to figure out which register to use for which variable, to use the registers most effectively. So if the programmer gave a "hint" with register, the compiler would put that in a register (if possible).
Modern compilers no longer need to fit in 2MB of RAM, and they are much more clever at assigning variables to registers. In the example given, I find it very unlikely that the compiler wouldn't put it in a register. Obviously, registers are limited in number, and given a sufficiently complex piece of code, some variables will not fit in registers. But for such a simple example, a modern compiler will make i a register, and it will probably not touch memory until somewhere inside ostream& ostream::operator<<(ostream& os, int x).
The only way to ensure that you are using a register, is to use inline assembly. But, even if you do this, you are not guaranteed that the compiler won't store your value outside of the inline assembly block. And, of course, your OS may decide to interrupt your program at any point, storing all your registers to memory, in order to give the CPU to another process.
So, unless you write assembler code within the kernel with all interrupts disabled, there is absolutely no way to ensure that your variable will never hit memory.
Of course, that is only relevant if you are concerned about safety. From a performance perspective, compiling with -O3 is usually enough, the compiler usually does quite a good job at determining which variables to hold in registers. Anyway, storing variables in registers is only one small aspect of performance tuning, the much more important aspect is to ensure that no superfluous or expensive work gets done in the inner loop.
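For completeness, here is a sketch (GCC/Clang extended asm on x86-64; the function name is made up) of the inline-assembly approach: a register constraint at least guarantees that the value is in the named register at the point of the asm statement.
int add_one_via_eax(int x)
{
    int result = x + 1;
    // "+a" ties 'result' to EAX/RAX for this (empty) asm statement; outside of
    // it the compiler is still free to keep the value wherever it likes.
    asm volatile("" : "+a"(result));
    return result;
}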
Here you can use volatile register int i = 10 in C++ to try to ensure that i is stored in a register; the volatile keyword will not allow the compiler to optimize the variable i away.

"Observable behaviour" and compiler freedom to eliminate/transform pieces c++ code

After reading this discussion I realized that I almost totally misunderstand the matter :)
As the description of the C++ abstract machine is not rigorous enough (compared, for instance, with the JVM specification), and if a precise answer isn't possible, I would rather get informal clarification about the rules that a reasonable, "good" (non-malicious) implementation should follow.
The key concept of part 1.9 of the Standard addressing implementation freedom is the so-called as-if rule:
an implementation is free to disregard any requirement of this
Standard as long as the result is as if the requirement had been
obeyed, as far as can be determined from the observable behavior of
the program.
The term "observable behavior", according to the standard (I cite n3092), means the following:
— Access to volatile objects are evaluated strictly according to the
rules of the abstract machine.
— At program termination, all data written into files shall be
identical to one of the possible results that execution of the program
according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take
place in such a fashion that prompting output is actually delivered
before a program waits for input. What constitutes an interactive
device is implementation-defined.
So, roughly speaking, the order and operands of volatile accesses and I/O operations should be preserved; the implementation may make arbitrary changes to the program which preserve these invariants (relative to some allowed behaviour of the abstract C++ machine).
Is it reasonable to expect that a non-malicious implementation treats I/O operations broadly enough (for instance, that any system call from user code is treated as such an operation)? (E.g. an RAII mutex lock/unlock wouldn't be thrown away by the compiler just because the RAII wrapper contains no volatiles.)
How deeply should this "behavioural observation" descend from the level of the user-written C++ program into library/system calls? The question is, of course, only about library calls that are not intended to perform I/O or volatile access from the user's viewpoint (e.g. new/delete operations) but may (and usually do) access volatiles or perform I/O in the library/system implementation. Should the compiler treat such calls from the user's viewpoint (and consider their side effects as not observable) or from the "library" viewpoint (and consider the side effects as observable)?
If I need to prevent some code from being eliminated by the compiler, is it good practice to skip all the questions above and simply add (possibly fake) volatile accesses (wrap the needed actions in volatile methods and call them on volatile instances of my own classes) in any case that seems suspicious?
Or am I totally wrong, and is the compiler disallowed from removing any C++ code except in cases explicitly mentioned by the standard (such as copy elision)?
The important bit is that the compiler must be able to prove that the code has no side effects before it can remove it (or determine which side effects it has and replace it with some equivalent piece of code). In general, and because of the separate compilation model, that means that the compiler is somehow limited as to what library calls have observable behavior and can be eliminated.
As to the deepness of it, it depends on the library implementation. In gcc, the C standard library uses compiler attributes to inform the compiler of potential side effects (or absence of them). For example, strlen is tagged with a pure attribute that allows the compiler to transform this code:
char p[] = "Hi there\n";
for ( int i = 0; i < strlen(p); ++i ) std::cout << p[i];
into
char p[] = "Hi there\n";
int __length = strlen(p);
for ( int i = 0; i < __length; ++i ) std::cout << p[i];
But without the pure attribute the compiler cannot know whether the function has side effects or not (unless it is inlining it, and gets to see inside the function), and cannot perform the above optimization.
That is, in general, the compiler will not remove code unless it can prove that it has no side effects, i.e. will not affect the outcome of the program. Note that this does not only relate to volatile and io, since any variable change might have observable behavior at a later time.
As to question 3, the compiler will only remove your code if the program behaves exactly as if the code was present (copy elision being an exception), so you should not even care whether the compiler removes it or not. Regarding question 4, the as-if rule stands: If the outcome of the implicit refactor made by the compiler yields the same result, then it is free to perform the change. Consider:
unsigned int fact = 1;
for ( unsigned int i = 1; i < 5; ++i ) fact *= i;
The compiler can freely replace that code with:
unsigned int fact = 24; // the value the loop above computes (1*2*3*4)
The loop is gone, but the behavior is the same: each loop iteration does not affect the outcome of the program, and the variable has the correct value at the end of the loop, i.e. if it is later used in some observable operation, the result will be as if the loop had been executed.
Don't worry too much on what observable behavior and the as-if rule mean, they basically mean that the compiler must yield the output that you programmed in your code, even if it is free to get to that outcome by a different path.
EDIT
@Konrad raises a really good point regarding the initial example I had with strlen: how can the compiler know that strlen calls can be elided? And the answer is that in the original example it cannot, and thus it could not elide the calls. There is nothing telling the compiler that the pointer returned from the get_string() function does not refer to memory that is being modified elsewhere. I have corrected the example to use a local array.
In the modified example, the array is local, and the compiler can verify that there are no other pointers that refer to the same memory. strlen takes a const pointer and so it promises not to modify the contained memory, and the function is pure so it promises not to modify any other state. The array is not modified inside the loop construct, and gathering all that information the compiler can determine that a single call to strlen suffices. Without the pure specifier, the compiler cannot know whether the result of strlen will differ in different invocations and has to call it.
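As a concrete illustration of the pure attribute, here is a minimal hedged sketch (the function names are made up; this is a GCC/Clang extension, not standard C++):
#include <cstddef>

// Promise to the compiler: the result depends only on the argument (and the
// memory it points to), and the call has no side effects.
std::size_t count_spaces(const char* s) __attribute__((pure));

std::size_t count_spaces(const char* s)
{
    std::size_t n = 0;
    for (; *s; ++s)
        if (*s == ' ') ++n;
    return n;
}

std::size_t twice(const char* s)
{
    // With the pure attribute the compiler may evaluate count_spaces(s) once and
    // reuse the result; without it, it must assume two separate calls are needed.
    return count_spaces(s) + count_spaces(s);
}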
The abstract machine defined by the standard will, given a specific
input, produce one of a set of specific outputs. In general, all that is
guaranteed is that for that specific input, the compiled code will
produce one of the possible specific outputs. The devil is in the
details, however, and there are a number of points to keep in mind.
The most important of these is probably the fact that if the program has
undefined behavior, the compiler can do absolutely anything. All bets
are off. Compilers can and do use potential undefined behavior for
optimizing: for example, if the code contains something like *p = (*q) ++,
the compiler can conclude that p and q aren't aliases to the same
variable.
Unspecified behavior can have similar effects: the actual behavior may
depend on the level of optimization. All that is required is that the
actual output correspond to one of the possible outputs of the abstract
machine.
With regards to volatile, the standard does say that access to
volatile objects is observable behavior, but it leaves the meaning of
"access" up to the implementation. In practice, you can't really count
much on volatile these days; actual accesses to volatile objects may
appear to an outside observer in a different order than they occur in
the program. (This is arguably in violation of the intent of the
standard, at the very least. It is, however, the actual situation with
most modern compilers, running on a modern architecture.)
Most implementations treat all system calls as “IO”. With
regards to mutexes, of course: as far as C++03 is concerned, as soon as
you start a second thread, you've got undefined behavior (from the C++
point of view—Posix or Windows do define it), and in C++11,
synchronization primitives are part of the language, and constrain the
set of possible outputs. (The compiler can, of course, eliminate the
synchronizations if it can prove that they weren't necessary.)
The new and delete operators are special cases. They can be
replaced by user defined versions, and those user defined versions may
clearly have observable behavior. The compiler can only remove them if
it has some means of knowing either that they haven't been replaced, or
that the replacements have no observable behavior. In most systems,
replacement is defined at link time, after the compiler has finished its
work, so no changes are allowed.
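A small sketch of why replacement matters (a user-provided global operator new; the printf is an observable side effect, so calls to it cannot simply be dropped unless the implementation knows no such replacement exists):
#include <cstdio>
#include <cstdlib>
#include <new>

void* operator new(std::size_t n)
{
    std::printf("allocating %zu bytes\n", n); // observable side effect
    if (void* p = std::malloc(n))
        return p;
    throw std::bad_alloc{};
}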
With regards to your third question: I think you're looking at it from
the wrong angle. Compilers don't “eliminate” code, and no
particular statement in a program is bound to a particular block of
code. Your program (the complete program) defines a particular
semantics, and the compiler must do something which produces an
executable program having those semantics. The most obvious solution
for the compiler writer is to take each statement separately, and
generate code for it, but that's the compiler writer's point of view,
not yours. You put source code in, and get an executable out; but lots
of statements don't result in any code, and even for those that do,
there isn't necessarily a one to one relationship. In this sense, the
idea of “preventing some code elimination” doesn't make
sense: your program has a semantics, specified by the standard, and all
you can ask for (and all that you should be interested in) is that the
final executable have those semantics. (Your fourth point is similar:
the compiler doesn't “remove” any code.)
I can't speak for what the compilers should do, but here's what some compilers actually do:
#include <array>
int main()
{
std::array<int, 5> a;
for(size_t p = 0; p<5; ++p)
a[p] = 2*p;
}
assembly output with gcc 4.5.2:
main:
xorl %eax, %eax
ret
replacing array with vector shows that new/delete are not subject to elimination:
#include <vector>
int main()
{
std::vector<int> a(5);
for(size_t p = 0; p<5; ++p)
a[p] = 2*p;
}
assembly output with gcc 4.5.2:
main:
subq $8, %rsp
movl $20, %edi
call _Znwm # operator new(unsigned long)
movl $0, (%rax)
movl $2, 4(%rax)
movq %rax, %rdi
movl $4, 8(%rax)
movl $6, 12(%rax)
movl $8, 16(%rax)
call _ZdlPv # operator delete(void*)
xorl %eax, %eax
addq $8, %rsp
ret
My best guess is that if the implementation of a function call is not available to the compiler, it has to treat it as possibly having observable side-effects.
1. Is it reasonable to expect that non-malicious implementation treates io operations wide enough
Yes. Assuming side-effects is the default. Beyond default, compilers must prove things (except for copy-elimination).
2. How deeply the "behavioral observation" should immerse from user-defined c++ program level into library/system calls?
As deep as it can. Using current standard C++, the compiler can't look behind a library (in the sense of a static library, i.e. calls that target a function inside some ".a" or ".lib" file), so side effects are assumed.
Using the traditional compilation model with multiple object files, the compiler is even unable to look behind extern calls. Optimizations across compilation units may be done at link time, though.
Btw, some compilers have an extension to tell it about pure functions. From the gcc documentation:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure. For example,
int square (int) __attribute__ ((pure));
says that the hypothetical function square is safe to call fewer times than the program says.
Some of common examples of pure functions are strlen or memcmp. Interesting non-pure functions
are functions with infinite loops or those depending on volatile memory or other system resource,
that may change between two consecutive calls (such as feof in a multithreading environment).
Thinking about this poses an interesting question to me: if some chunk of code mutates a non-local variable and calls an un-introspectible function, will the compiler assume that this extern function might depend on that non-local variable?
compilation-unit A:
int foo() {
extern int x;
return x;
}
compilation-unit B:
#include <iostream>

int x;
int foo(); // defined in compilation-unit A

void bar() {
    for (x = 0; x < 10; ++x) {
        std::cout << foo() << '\n';
    }
}
The current standard has a notion of sequence points. I guess if a compiler does not see enough,
it can only optimize as far as to not break the ordering of dependent sequence points.
3. If I need to prevent some code from elimination by compiler
Except by looking at the object-dump, how could you judge whether something was removed?
And if you can't judge, then is this not equivalent to the impossibility of writing code that depends on its (non-)removal?
In that respect, compiler extensions (like for example OpenMP) help you in being able to judge. Some builtin mechanisms exist, too, like volatile variables.
Does a tree exist if nobody can observe it? And just like that, we are at quantum mechanics.
4. Or I'm totally wrong and the compiler is disallowed to remove any c++ code except of cases explicitly mentioned by the standard (as copy elimination)
No, it is perfectly allowed to. It is also allowed to transform the code as if it were a piece of slime.
(With the exception of copy elision, you couldn't judge anyway.)
One difference is that Java is designed to run on one platform only, the JVM. That makes it much easier to be "rigorous enough" in the specification, as there is only the platform to consider and you can document exactly how it works.
C++ is designed to run on a wide selection of platforms and to do so natively, without an intervening abstraction layer, using the underlying hardware functionality directly. Therefore it has chosen to allow the functionality that actually exists on different platforms. For example, the result of some shift operations like int(1) << 33 is allowed to be different on different systems, because that's the way the hardware works.
The C++ standard describes the result you can expect from your program, not the way it has to be achieved. In some cases it says that you have to check your particular implementation, because the results may differ but still be what is expected there.
For example, on an IBM mainframe nobody expects floating point to be IEEE compatible, because the mainframe series is much older than the IEEE standard. Still, C++ allows the use of the underlying hardware while Java does not. Is that an advantage or a disadvantage for either language? It depends!
Within the restrictions and allowances of the language, a reasonable implementation must behave as if it did what you coded in your program. If you make system calls such as locking a mutex, the compiler either doesn't know what the calls do and therefore cannot remove them, or knows exactly what they do and therefore also knows whether they can be removed or not. The result is the same!
If you make calls to the standard library, the compiler can very well know exactly what the call does, as this is described in the standard. It then has the option of really calling the function, replacing it with some other code, or skipping it entirely if it has no effect. For example, std::strlen("Hello world!") can be replaced by 12. Some compilers do that, and you will not notice.
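For instance, a minimal sketch of that folding from the source side (whether it actually happens depends on the compiler and optimization level, so check the generated assembly if it matters):
#include <cstring>

std::size_t hello_length()
{
    // GCC and Clang treat strlen as a builtin, so with optimization enabled this
    // call on a string literal is typically folded to the constant 12.
    return std::strlen("Hello world!");
}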