C++ Statement Reordering - c++

This is a question about Chandler's answer here (I didn't have a high enough rep to comment): Enforcing statement order in C++
In his answer, suppose foo() has no input or output. It's a black box that does work that is observable eventually, but won't be needed immediately (e.g. executes some callback). So we don't have input/output data locally handy to tell the compiler not to optimize. But I know that foo() will modify the memory somewhere, and the result will be observable eventually. Will the following prevent statement reordering and get the correct timing in this case?
#include <chrono>
#include <iostream>
//I believe this tells the compiler that all memory everywhere will be clobbered?
//(from his cppcon talk: https://youtu.be/nXaxk27zwlk?t=2441)
__attribute__((always_inline)) inline void DoNotOptimize() {
asm volatile("" : : : "memory");
}
// The compiler has full knowledge of the implementation.
static int ugly_global = 1; //we print this to screen sometime later
static void foo(void) { ugly_global *= 2; }
auto time_foo() {
using Clock = std::chrono::high_resolution_clock;
auto t1 = Clock::now(); // Statement 1
DoNotOptimize();
foo(); // Statement 2
DoNotOptimize();
auto t2 = Clock::now(); // Statement 3
return t2 - t1;
}

Will the following prevent statement reordering and get the correct timing in this case?
It should not be necessary because the calls to Clock::now should, at the language-definition level, enforce enough ordering. (That is, the C++11 standard says that the high resolution clock ought to get as much information as the system can give here, in the way that is most useful here. See "secondary question" below.)
But there is a more general case. It's worth thinking about the question: How does whoever provides the C++ library implementation actually write this function? Or, take C++ itself out of the equation. Given a language standard, how does an implementor—a person or group writing an implementation of that language—get you what you need? Fundamentally, we need to make a distinction between what the language standard requires and how an implementation provider goes about implementing the requirements.
The language itself may be expressed in terms of an abstract machine, and the C and C++ languages are. This abstract machine is pretty loosely defined: it executes some kind of instructions, which access data, but in many cases we don't know how it does these things, or even how big the various data items are (with some exceptions for fixed-size integers like int64_t), and os on. The machine may or may not have "registers" that hold things in ways that cannot be addressed as well as memory that can be addressed and whose addresses can be recorded in pointers:
p = &var
makes the value store in p (in memory or a register) such that using *p accesses the value stored in var (in memory or a register—some machines, especially back in the olden days, have / had addressable registers).1
Nonetheless, despite all of this abstraction, we want to run real code on real machines. Real machines have real constraints: some instructions might require particular values in particular registers (think about all the bizarre stuff in the x86 instruction sets, or wide-result integer multipliers and dividers that use special-purpose registers, as on some MIPS processors), or cause CPU sychronizations, or whatever.
GCC in particular invented a system of constraints to express what you could or could not do on the machine itself, using the machine's instruction set. Over time, this evolved into user-accessible asm constructs with input, output, and clobber sections. The particular one you show:
__attribute__((always_inline)) inline void DoNotOptimize() {
asm volatile("" : : : "memory");
}
expresses the idea that "this instruction" (asm; the actual provided instruction is blank) "cannot be moved" (volatile) "and clobbers all of the computer's memory, but no registers" ("memory" as the clobber section).
This is not part of either C or C++ as a language. It's just a compiler construction, supported by GCC and now supported by clang as well. But it suffices to force the compiler to issue all stores-to-memory before the asm, and reload values from memory as needed after the asm, in case they changed when the computer executed the (nonexistent) instruction included in the asm line. There's no guarantee that this will work, or even compile at all, in some other compiler, but as long as we're the implementor, we choose the compiler we're implementing for/with.
C++ as a language now has support for ordered memory operations, which an implementor must implement. The implementor can use these asm volatile constructs to achieve the right result, provided they do actually achieve the right result. For instance, if we need to cause the machine itself to synchronize—to emit a memory barrier—we can stick the appropriate machine instruction, such as mfence or membar #sync or whatever it may be, in the asm's instruction-section clause. See also compiler reordering vs memory reordering as Klaus mentioned in a comment.
It is up to the implementor to find an appropriately effective trick, compiler-specific or not, to get the right semantics while minimizing any runtime slowdown: for instance, we might want to use lfence rather than mfence if that's sufficient, or membar #LoadLoad, or whatever the right thing is for the machine. If our implementation of Clock::now requires some sort of fancy inline asm, we write one. If not, we don't. We make sure that we produce what's required—and then all users of the system can just use it, without needing to know what sort of grubby implementation tricks we had to invoke.
There's a secondary question here: does the language specification really constrain the implementor the way we think/hope it does? Chris Dodd's comment says he thinks so, and he's usually right on these kinds of questions. A couple of other commenters think otherwise, but I'm with Chris Dodd on this one. I think it is not necessary. You can always compile to assembly, or disassemble the compiled program, to check, though!
If the compiler didn't do the right thing, that asm would force it to do the right thing, in GCC and clang. It probably wouldn't work in other compilers.
1On the KA-10 in particular, the registers were just the first sixteen words of memory. As the Wikipedia page notes, this meant you could put instructions into there and call them. Because the first 16 words were the registers, these instructions ran much faster than other instructions.

Related

Practical consequences of breaking strict aliasing between int and float

I once had a very subtle bug in one of the projects I had to maintain. Essentially it was doing something like this:
union Value {
int64_t int64;
int32_t int32[2];
int16_t shorts[4];
int8_t chars[8];
float floats[2];
double float64;
};
Value v;
// in one place (not sure about exact code, it could be just memcpy):
v.shorts[0] = <some short value>;
v.shorts[1] = <some other short value>;
// in another place:
float f = v.floats[0];
Now, as far as the standard is concerned, this is simply UB. In practice, this could mean anything, but I can hardly imagine a reasonable implementation that would cause the code above to start World War III or disintegrate my PC. In real life, I can only imagine two things happening:
The compiler may screw up something with optimizations, not realizing that it deals with the same memory here. Pretty unlikely in this case, since writes and reads happened in totally different places.
Nothing bad really happens, and the float value is simply read bit-by-bit.
In practice, it was almost always case 2, except for once. After running the program compiled with MSVC 2010 in release mode on about 100-150 input files, in one of the files it generated an incorrect value that differed in exactly one bit from what it was supposed to be according to the common sense. That was a pretty significant bit, too, so instead of, say, 1.5, I got something like 117.9. I was able to trace it down to that exact read, and after fixing the code to adhere to strict aliasing rules, everything worked fine.
Now the question is, purely from low-level point of view, what could have caused that? Some peculiarities in CPU handling floating-point values? Hardware caching specifics? Compiler quirks? Why only one value was wrong?
The hardware was some old 2-core 64-bit Intel CPU running a 32-bit Windows 7, if that is of any help. The program is a single-threaded console application, nothing fancy. The problem was 100% reproducible, the same input files always produced the same output, and it's always the same value that was wrong.
From the point of view of the Standard, the code v.shorts[0] = something; takes a pointer value of type "short*", adds zero, and uses the resulting pointer to store a value. I think the authors of C89 intended that quality implementations where aliasing would be useful would recognize aliasing in this case, but nothing in the letter of the Standard would require it. Note that when the rules were included in C89, compilers were only expected to apply them on a very local level; further, the rules generally don't pose severe problems for programmers unless they're applied at a more far-reaching level. Unfortunately, some compilers are aggressively seek to extend the range of the rules as far as possible.
If you were to place each array within a separate structure within the union and then do something like:
v.floats.arr[0] = value;
v.floats.arr[1] = value;
v.floats = v.floats; // Compiler knows that float* may alter float members,
// and that writing member of union may alter other
// members
... now use other stuff
a compiler should hopefully recognize that the assignment to v.floats need
not generate any code but a conforming compiler must must still regard it as proper notice that other members of the union may have been altered. Note that the pattern does not seem to be reliable in gcc as of 6.2, however; in some cases when an assignment is not required to generate any code, the compiler will ignore the assignment--including its aliasing implications--altogether. I don't see any reason to bend over backward working around gcc's broken behavior, however--simply use -fno-strict-aliasing guilt-free unless or until gcc's aliasing logic gets fixed.

compiler reordering and load operation

I'm starting lock free programming and I encounter some difficulties for basic stuff. I have found the following example :
#define COMPILER_BARRIER() asm volatile("" ::: "memory")
int Value;
int IsPublished = 0;
void sendValue(int x)
{
Value = x;
COMPILER_BARRIER(); // prevent reordering of stores
IsPublished = 1;
}
int tryRecvValue()
{
if (IsPublished)
{
COMPILER_BARRIER(); // prevent reordering of loads
return Value;
}
return -1; // or some other value to mean not yet received
}
What kind of reordering a compiler can perform in tryRecvValue function ?
Answer
Without the inline asm*, the compiler is able to load the value from Value first, then IsPublished, then perform the check and return.
Always check the assembly output if you're not sure what the compiler is doing. This is especially true of lock free programming techniques.
More Information
* NOTE: Inline asm is not a required part of the C++ standard, but is implemented by all major compilers. It is only mentioned in section 7.4:
The asm declaration is conditionally-supported; its meaning is implementation-defined. [ Note: Typically it
is used to pass information through the implementation to an assembler. — end note ]
Generally speaking, compilers cannot reorder reads and writes around inline assembly because the compiler cannot make assumptions about what the assembly does.
The use of COMPILER_BARRIER() above will not act as any kind of read/write barrier, it just has the side affect of not allowing reordering of assembly instructions around the statement.
Assuming the above code is called from different threads, it will only work as-is on x86, since the architecture guarantees that writes are never (observably) reordered by the cpu. (reference: 8.2.3.2 in volume 3A of the Intel Manuals)
For use on other architectures with relaxed memory models, (like PowerPc and ARM) you'll need hardware barriers that prevent these types of reorders. For a good breakdown, check out Jeff Preshing's articles.

Is there a reason why not to use link-time optimization (LTO)?

GCC, MSVC, LLVM, and probably other toolchains have support for link-time (whole program) optimization to allow optimization of calls among compilation units.
Is there a reason not to enable this option when compiling production software?
I assume that by "production software" you mean software that you ship to the customers / goes into production. The answers at Why not always use compiler optimization? (kindly pointed out by Mankarse) mostly apply to situations in which you want to debug your code (so the software is still in the development phase -- not in production).
6 years have passed since I wrote this answer, and an update is necessary. Back in 2014, the issues were:
Link time optimization occasionally introduced subtle bugs, see for example Link-time optimization for the kernel. I assume this is less of an issue as of 2020. Safeguard against these kinds of compiler and linker bugs: Have appropriate tests to check the correctness of your software that you are about to ship.
Increased compile time. There are claims that the situation has significantly improved since 2014, for example thanks to slim objects.
Large memory usage. This post claims that the situation has drastically improved in recent years, thanks to partitioning.
As of 2020, I would try to use LTO by default on any of my projects.
This recent question raises another possible (but rather specific) case in which LTO may have undesirable effects: if the code in question is instrumented for timing, and separate compilation units have been used to try to preserve the relative ordering of the instrumented and instrumenting statements, then LTO has a good chance of destroying the necessary ordering.
I did say it was specific.
If you have well written code, it should only be advantageous. You may hit a compiler/linker bug, but this goes for all types of optimisation, this is rare.
Biggest downside is it drastically increases link time.
Apart from to this,
Consider a typical example from embedded system,
void function1(void) { /*Do something*/} //located at address 0x1000
void function2(void) { /*Do something*/} //located at address 0x1100
void function3(void) { /*Do something*/} //located at address 0x1200
With predefined addressed functions can be called through relative addresses like below,
(*0x1000)(); //expected to call function2
(*0x1100)(); //expected to call function2
(*0x1200)(); //expected to call function3
LTO can lead to unexpected behavior.
updated:
In automotive embedded SW development,Multiple parts of SW are compiled and flashed on to a separate sections.
Boot-loader, Application/s, Application-Configurations are independently flash-able units. Boot-loader has special capabilities to update Application and Application-configuration. At every power-on cycle boot-loader ensures the SW application and application-configuration's compatibility and consistence via Hard-coded location for SW-Versions and CRC and many more parameters. Linker-definition files are used to hard-code the variable location and some function location.
Given that the code is implemented correctly, then link time optimization should not have any impact on the functionality. However, there are scenarios where not 100% correct code will typically just work without link time optimization, but with link time optimization the incorrect code will stop working. There are similar situations when switching to higher optimization levels, like, from -O2 to -O3 with gcc.
That is, depending on your specific context (like, age of the code base, size of the code base, depth of tests, are you starting your project or are you close to final release, ...) you would have to judge the risk of such a change.
One scenario where link-time-optimization can lead to unexpected behavior for wrong code is the following:
Imagine you have two source files read.c and client.c which you compile into separate object files. In the file read.c there is a function read that does nothing else than reading from a specific memory address. The content at this address, however, should be marked as volatile, but unfortunately that was forgotten. From client.c the function read is called several times from the same function. Since read only performs one single read from the address and there is no optimization beyond the boundaries of the read function, read will always when called access the respective memory location. Consequently, every time when read is called from client.c, the code in client.c gets a freshly read value from the address, just as if volatile had been used.
Now, with link-time-optimization, the tiny function read from read.c is likely to be inlined whereever it is called from client.c. Due to the missing volatile, the compiler will now realize that the code reads several times from the same address, and may therefore optimize away the memory accesses. Consequently, the code starts to behave differently.
Rather than mandating that all implementations support the semantics necessary to accomplish all tasks, the Standard allows implementations intended to be suitable for various tasks to extend the language by defining semantics in corner cases beyond those mandated by the C Standard, in ways that would be useful for those tasks.
An extremely popular extension of this form is to specify that cross-module function calls will be processed in a fashion consistent with the platform's Application Binary Interface without regard for whether the C Standard would require such treatment.
Thus, if one makes a cross-module call to a function like:
uint32_t read_uint32_bits(void *p)
{
return *(uint32_t*)p;
}
the generated code would read the bit pattern in a 32-bit chunk of storage at address p, and interpret it as a uint32_t value using the platform's native 32-bit integer format, without regard for how that chunk of storage came to hold that bit pattern. Likewise, if a compiler were given something like:
uint32_t read_uint32_bits(void *p);
uint32_t f1bits, f2bits;
void test(void)
{
float f;
f = 1.0f;
f1bits = read_uint32_bits(&f);
f = 2.0f;
f2bits = read_uint32_bits(&f);
}
the compiler would reserve storage for f on the stack, store the bit pattern for 1.0f to that storage, call read_uint32_bits and store the returned value, store the bit pattern for 2.0f to that storage, call read_uint32_bits and store that returned value.
The Standard provides no syntax to indicate that the called function might read the storage whose address it receives using type uint32_t, nor to indicate that the pointer the function was given might have been written using type float, because implementations intended for low-level programming already extended the language to supported such semantics without using special syntax.
Unfortunately, adding in Link Time Optimization will break any code that relies upon that popular extension. Some people may view such code as broken, but if one recognizes the Spirit of C principle "Don't prevent programmers from doing what needs to be done", the Standard's failure to mandate support for a popular extension cannot be viewed as intending to deprecate its usage if the Standard fails to provide any reasonable alternative.
LTO could also reveal edge-case bugs in code-signing algorithms. Consider a code-signing algorithm based on certain expectations about the TEXT portion of some object or module. Now LTO optimizes the TEXT portion away, or inlines stuff into it in a way the code-signing algorithm was not designed to handle. Worst case scenario, it only affects one particular distribution pipeline but not another, due to a subtle difference in which encryption algorithm was used on each pipeline. Good luck figuring out why the app won't launch when distributed from pipeline A but not B.
LTO support is buggy and LTO related issues has lowest priority for compiler developers. For example: mingw-w64-x86_64-gcc-10.2.0-5 works fine with lto, mingw-w64-x86_64-gcc-10.2.0-6 segfauls with bogus address. We have just noticed that windows CI stopped working.
Please refer the following issue as an example.

"Observable behaviour" and compiler freedom to eliminate/transform pieces c++ code

After reading this discussion I realized that I almost totally misunderstand the matter :)
As the description of C++ abstract machine is not rigorous enough(comparing, for instance, with JVM specification), and if a precise answer isn't possible I would rather want to get informal clarifications about rules that reasonable "good" (non-malicious) implementation should follow.
The key concept of part 1.9 of the Standard addressing implementation freedom is so called as-if rule:
an implementation is free to disregard any requirement of this
Standard as long as the result is as if the requirement had been
obeyed, as far as can be determined from the observable behavior of
the program.
The term "observable behavior", according to the standard (I cite n3092), means the following:
— Access to volatile objects are evaluated strictly according to the
rules of the abstract machine.
— At program termination, all data written into files shall be
identical to one of the possible results that execution of the program
according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take
place in such a fashion that prompting output is actually delivered
before a program waits for input. What constitutes an interactive
device is implementation-defined.
So, roughly speaking, the order and operands of volatile access operations and io operations should be preserved; implementation may make arbitrary changes in the program which preserve these invariants (comparing to some allowed behaviour of the abstract c++ machine)
Is it reasonable to expect that non-malicious implementation treates io operations wide enough (for instance, any system call from user code is treated as such operation)? (E.g. RAII mutex lock/unlock wouldn't be thrown away by compiler in case RAII wrapper contains no volatiles)
How deeply the "behavioral observation" should immerse from user-defined c++ program level into library/system calls? The question is, of course, only about library calls that not intended to have io/volatile access from the user viewpoint (e.g. as new/delete operations) but may (and usually does) access volatiles or io in the library/system implementation. Should the compiler treat such calls from the user viewpoint (and consider such side effects as not observable) or from "library" viewpoint (and consider the side effects as observable) ?
If I need to prevent some code from elimination by compiler, is it a good practice not to ask all the questions above and simply add (possibly fake) volatile access operations (wrap the actions needed to volatile methods and call them on volatile instances of my own classes) in any case that seems suspicious?
Or I'm totally wrong and the compiler is disallowed to remove any c++ code except of cases explicitly mentioned by the standard (as copy elimination)
The important bit is that the compiler must be able to prove that the code has no side effects before it can remove it (or determine which side effects it has and replace it with some equivalent piece of code). In general, and because of the separate compilation model, that means that the compiler is somehow limited as to what library calls have observable behavior and can be eliminated.
As to the deepness of it, it depends on the library implementation. In gcc, the C standard library uses compiler attributes to inform the compiler of potential side effects (or absence of them). For example, strlen is tagged with a pure attribute that allows the compiler to transform this code:
char p[] = "Hi there\n";
for ( int i = 0; i < strlen(p); ++i ) std::cout << p[i];
into
char * p = get_string();
int __length = strlen(p);
for ( int i = 0; i < __length; ++i ) std::cout << p[i];
But without the pure attribute the compiler cannot know whether the function has side effects or not (unless it is inlining it, and gets to see inside the function), and cannot perform the above optimization.
That is, in general, the compiler will not remove code unless it can prove that it has no side effects, i.e. will not affect the outcome of the program. Note that this does not only relate to volatile and io, since any variable change might have observable behavior at a later time.
As to question 3, the compiler will only remove your code if the program behaves exactly as if the code was present (copy elision being an exception), so you should not even care whether the compiler removes it or not. Regarding question 4, the as-if rule stands: If the outcome of the implicit refactor made by the compiler yields the same result, then it is free to perform the change. Consider:
unsigned int fact = 1;
for ( unsigned int i = 1; i < 5; ++i ) fact *= i;
The compiler can freely replace that code with:
unsigned int fact = 120; // I think the math is correct... imagine it is
The loop is gone, but the behavior is the same: each loop interaction does not affect the outcome of the program, and the variable has the correct value at the end of the loop, i.e. if it is later used in some observable operation, the result will be as-if the loop had been executed.
Don't worry too much on what observable behavior and the as-if rule mean, they basically mean that the compiler must yield the output that you programmed in your code, even if it is free to get to that outcome by a different path.
EDIT
#Konrad raises a really good point regarding the initial example I had with strlen: how can the compiler know that strlen calls can be elided? And the answer is that in the original example it cannot, and thus it could not elide the calls. There is nothing telling the compiler that the pointer returned from the get_string() function does not refer to memory that is being modified elsewhere. I have corrected the example to use a local array.
In the modified example, the array is local, and the compiler can verify that there are no other pointers that refer to the same memory. strlen takes a const pointer and so it promises not to modify the contained memory, and the function is pure so it promises not to modify any other state. The array is not modified inside the loop construct, and gathering all that information the compiler can determine that a single call to strlen suffices. Without the pure specifier, the compiler cannot know whether the result of strlen will differ in different invocations and has to call it.
The abstract machine defined by the standard will, given a specific
input, produce one of a set of specific output. In general, all that is
guaranteed is that for that specific input, the compiled code will
produce one of the possible specific output. The devil is in the
details, however, and there are a number of points to keep in mind.
The most important of these is probably the fact that if the program has
undefined behavior, the compiler can do absolutely anything. All bets
are off. Compilers can and do use potential undefined behavior for
optimizing: for example, if the code contains something like *p = (*q) ++,
the compiler can conclude that p and q aren't aliases to the same
variable.
Unspecified behavior can have similar effects: the actual behavior may
depend on the level of optimization. All that is requires is that the
actual output correspond to one of the possible outputs of the abstract
machine.
With regards to volatile, the stadnard does say that access to
volatile objects is observable behavior, but it leaves the meaning of
"access" up to the implementation. In practice, you can't really count
much on volatile these days; actual accesses to volatile objects may
appear to an outside observer in a different order than they occur in
the program. (This is arguably in violation of the intent of the
standard, at the very least. It is, however, the actual situation with
most modern compilers, running on a modern architecture.)
Most implementations treat all system calls as “IO”. With
regards to mutexes, of course: as far as C++03 is concerned, as soon as
you start a second thread, you've got undefined behavior (from the C++
point of view—Posix or Windows do define it), and in C++11,
synchronization primatives are part of the language, and constrain the
set of possible outputs. (The compiler can, of course, elimiate the
synchronizations if it can prove that they weren't necessary.)
The new and delete operators are special cases. They can be
replaced by user defined versions, and those user defined versions may
clearly have observable behavior. The compiler can only remove them if
it has some means of knowing either that they haven't been replaced, of
that the replacements have no observable behavior. In most systems,
replacement is defined at link time, after the compiler has finished its
work, so no changes are allowed.
With regards to your third question: I think you're looking at it from
the wrong angle. Compilers don't “eliminate” code, and no
particular statement in a program is bound to a particular block of
code. Your program (the complete program) defines a particular
semantics, and the compiler must do something which produces an
executable program having those semantics. The most obvious solution
for the compiler writer is to take each statement separately, and
generate code for it, but that's the compiler writer's point of view,
not yours. You put source code in, and get an executable out; but lots
of statements don't result in any code, and even for those that do,
there isn't necessarily a one to one relationship. In this sense, the
idea of “preventing some code elimination” doesn't make
sense: your program has a semantics, specified by the standard, and all
you can ask for (and all that you should be interested in) is that the
final executable have those semantics. (Your fourth point is similar:
the compiler doesn't “remove” any code.)
I can't speak for what the compilers should do, but here's what some compilers actually do
#include <array>
int main()
{
std::array<int, 5> a;
for(size_t p = 0; p<5; ++p)
a[p] = 2*p;
}
assembly output with gcc 4.5.2:
main:
xorl %eax, %eax
ret
replacing array with vector shows that new/delete are not subject to elimination:
#include <vector>
int main()
{
std::vector<int> a(5);
for(size_t p = 0; p<5; ++p)
a[p] = 2*p;
}
assembly output with gcc 4.5.2:
main:
subq $8, %rsp
movl $20, %edi
call _Znwm # operator new(unsigned long)
movl $0, (%rax)
movl $2, 4(%rax)
movq %rax, %rdi
movl $4, 8(%rax)
movl $6, 12(%rax)
movl $8, 16(%rax)
call _ZdlPv # operator delete(void*)
xorl %eax, %eax
addq $8, %rsp
ret
My best guess is that if the implementation of a function call is not available to the compiler, it has to treat it as possibly having observable side-effects.
1. Is it reasonable to expect that non-malicious implementation treates io operations wide enough
Yes. Assuming side-effects is the default. Beyond default, compilers must prove things (except for copy-elimination).
2. How deeply the "behavioral observation" should immerse from user-defined c++ program level into library/system calls?
As deep as it can. Using current standard C++ the compiler can't look behind library with meaning of static library, i.e. calls that target a function inside some ".a-" or ".lib file" calls, so side effects are assumed.
Using the traditional compilation model with multiple object files, the compiler is even unable to look behind extern calls. Optimizations accross
units of compilation may be done at link-time though.
Btw, some compilers have an extension to tell it about pure functions. From the gcc documentation:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure. For example,
int square (int) __attribute__ ((pure));
says that the hypothetical function square is safe to call fewer times than the program says.
Some of common examples of pure functions are strlen or memcmp. Interesting non-pure functions
are functions with infinite loops or those depending on volatile memory or other system resource,
that may change between two consecutive calls (such as feof in a multithreading environment).
Thinking about poses an interesting question to me: If some chunk of code mutates a non-local variable, and calls an un-introspectible function,
will it assume that this extern function might depend on that non-local variable?
compilation-unit A:
int foo() {
extern int x;
return x;
}
compilation-unit B:
int x;
int bar() {
for (x=0; x<10; ++x) {
std::cout << foo() << '\n';
}
}
The current standard has a notion of sequence points. I guess if a compiler does not see enough,
it can only optimize as far as to not break the ordering of dependent sequence points.
3. If I need to prevent some code from elimination by compiler
Except by looking at the object-dump, how could you judge whether something was removed?
And if you can't judge, than is this not equivalent to the impossibility of writing code that depends on its (non-)removal?
In that respect, compiler extensions (like for example OpenMP) help you in being able to judge. Some builtin mechanisms exist, too, like volatile variables.
Does a tree exist if nobody can observe it? Et hop, we are at quantum mechanics.
4. Or I'm totally wrong and the compiler is disallowed to remove any c++ code except of cases explicitly mentioned by the standard (as copy elimination)
No, it is perfectly allowed so. It is also allowed to transform code like it's a piece of slime.
(with the exception of copy elimination, you couldn't judge anyways).
One difference is that Java is designed to run on one platform only, the JVM. That makes it much easier to be "rigorous enough" in the specification, as there is only the platform to consider and you can document exactly how it works.
C++ is designed to be able to run on a wide selection of platforms and do that natively, without an intervening abstraction layer, but use the underlying hardware functionality directly. Therefore it has chosen to allow the functionality that actually exist on different platforms. For example, the result of some shift operations like int(1) << 33 is allowed to be different on different system, because that's the way the hardware works.
The C++ standard describes the result you can expect from your program, not the way it has to be achieved. In some cases it says that you have to check you particular implementation, because the results may differ but still be what is expected there.
For example, on an IBM mainframe nobody expects floating point to be IEEE compatible because the mainframe series is much older that the IEEE standard. Still C++ allows the use of the underlying hardware while Java does not. Is that an advantage or a disavantage for either language? It depends!
Within the restrictions and allowances of the language, a reasonable implementation must behave as if it did like you have coded in your program. If you do system calls like locking a mutex, the compiler has the options of not knowing what the calls do and therefore cannot remove them, or do know exactly what they do and therefore also know if they can be removed or not. The result is the same!
If you do calls to the standard library, the compiler can very well know exactly what the call does, as this is described in the standard. It then has the option of really calling a function, replace it with some other code, or skip it entirely if it has no effect. For example, std::strlen("Hello world!") can be replaced by 12. Some compilers do that, and you will not notice.

can compiler reorganize instructions over sleep call?

Is there a difference if it is the first use of the variable or not. For example are a and b treated differently?
void f(bool&a, bool& b)
{
...
a=false;
boost::this_thread::sleep...//1 sec sleep
a=true;
b=true;
...
}
EDIT: people asked why I want to know this.
1. I would like to have some way to tell the compiler not to optimize(swap the order of the execution of the instructions) in some function, and using atomic and or mutexes is much more complicated than using sleep(and in my case sleeping is not a performance problem).
2. Like I said this is generally important to know.
We can't really tell. On scenario could be that the compiler has full introspection to your function at the calling site (and possibly does inline it), in which case it can jumble your function with the caller, and then do optimizations appropriately.
It could then e.g. completely optimize away a and b because there is no code that depends on a and b. Or it might see that you violate aliasing rules so that a and b refer to the same entity, and then merge them according to your program flow.
But it could also be that you tell the compiler to not optimize at all, e.g. with g++'s -O0 flag, in which case not much will happen.
The only proof for your particular platform *, can be made by looking at the generated assembly, or by telling the compiler to please output some log about what it optimizes (g++ has many flags for that).
* compiler+flags used to compile compiler+version+add-ons, hardware, operating system; even the weather might be relevant if your compiler omits some optimizations if it takes to long [which would actually be cool feature for debug builds, imho]
They are not local (because they are references), so it can't, because it can't tell whether the called function sees them or not and has to assume that it does. If they were local variables, it could, because local variables are not visible to the called function unless pointer or reference to them was created.