Enforcing statement order in C++

Suppose I have a number of statements that I want to execute in
a fixed order. I want to use g++ with optimization level 2, so some
statements could be reordered. What tools does one have to enforce a certain ordering of statements?
Consider the following example.
using Clock = std::chrono::high_resolution_clock;
auto t1 = Clock::now(); // Statement 1
foo(); // Statement 2
auto t2 = Clock::now(); // Statement 3
auto elapsedTime = t2 - t1;
In this example it is important that the statements 1-3 are executed in
the given order. However, can't the compiler think statement 2 is
independent of 1 and 3 and execute the code as follows?
using Clock=std::chrono::high_resolution_clock;
foo(); // Statement 2
auto t1 = Clock::now(); // Statement 1
auto t2 = Clock::now(); // Statement 3
auto elapsedTime = t2 - t1;

I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.
Fundamentally, there is no way to use a barrier or some operation in the sequence to prevent these transformations. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and is always free to move them around.
We could try to prevent this, but it would have extremely negative results and would ultimately fail.
First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this then would preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of what operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.
But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators: the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.
The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would dramatically constrain not only the compiler but also the hardware.
But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon2015: https://youtu.be/nXaxk27zwlk
The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization
The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208
#include <chrono>
template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
asm volatile("" : "+m"(const_cast<T &>(value)));
}
// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }
auto time_foo() {
using Clock = std::chrono::high_resolution_clock;
auto input = 42;
auto t1 = Clock::now(); // Statement 1
DoNotOptimize(input);
auto output = foo(input); // Statement 2
DoNotOptimize(output);
auto t2 = Clock::now(); // Statement 3
return t2 - t1;
}
Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:
% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
.text
.file "so.cpp"
.globl _Z8time_foov
.p2align 4, 0x90
.type _Z8time_foov,@function
_Z8time_foov: # @_Z8time_foov
.cfi_startproc
# BB#0: # %entry
pushq %rbx
.Ltmp0:
.cfi_def_cfa_offset 16
subq $16, %rsp
.Ltmp1:
.cfi_def_cfa_offset 32
.Ltmp2:
.cfi_offset %rbx, -16
movl $42, 8(%rsp)
callq _ZNSt6chrono3_V212system_clock3nowEv
movq %rax, %rbx
#APP
#NO_APP
movl 8(%rsp), %eax
addl %eax, %eax # This is "foo"!
movl %eax, 12(%rsp)
#APP
#NO_APP
callq _ZNSt6chrono3_V212system_clock3nowEv
subq %rbx, %rax
addq $16, %rsp
popq %rbx
retq
.Lfunc_end0:
.size _Z8time_foov, .Lfunc_end0-_Z8time_foov
.cfi_endproc
.ident "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
.section ".note.GNU-stack","",@progbits
Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.
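The same pincer can be wrapped in a small helper so that any callable can be timed this way. What follows is only a minimal sketch, assuming GCC or Clang (the inline asm is non-portable, as noted above). The name time_call is hypothetical, and std::chrono::steady_clock is my substitution for high_resolution_clock so that the elapsed time cannot come out negative.

```cpp
#include <chrono>
#include <utility>

// Same non-portable trick as DoNotOptimize above (GCC/Clang only): the empty
// asm statement claims to read and write `value` in memory.
template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// Hypothetical helper: returns {result of f(input), elapsed time}.
template <class F, class T>
auto time_call(F f, T input) {
  using Clock = std::chrono::steady_clock;  // assumption: a monotonic clock
  auto t1 = Clock::now();
  DoNotOptimize(input);    // the optimizer can no longer assume input's value
  auto output = f(input);  // still free to be inlined and optimized
  DoNotOptimize(output);   // the result has to be materialized here
  auto t2 = Clock::now();
  return std::make_pair(output, t2 - t1);
}
```

A call such as time_call([](int x) { return x * 2; }, 21) yields the result in .first and a std::chrono duration in .second. It is still worth inspecting the generated assembly for the function you actually care about.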
Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.

Summary:
There seems to be no guaranteed way to prevent reordering, but as long as link-time/full-program optimisation is not enabled, locating the called function in a separate compilation unit seems a fairly good bet. (At least with GCC, although logic would suggest that this is likely with other compilers too.) This comes at the cost of the function call - inlined code is by definition in the same compilation unit and open to reordering.
Original answer:
GCC reorders the calls under -O2 optimisation:
#include <chrono>
static int foo(int x) // 'static' or not here doesn't affect ordering.
{
return x*2;
}
int fred(int x)
{
auto t1 = std::chrono::high_resolution_clock::now();
int y = foo(x);
auto t2 = std::chrono::high_resolution_clock::now();
return y;
}
GCC 5.3.0:
g++ -S --std=c++11 -O0 fred.cpp :
_ZL3fooi:
pushq %rbp
movq %rsp, %rbp
movl %ecx, 16(%rbp)
movl 16(%rbp), %eax
addl %eax, %eax
popq %rbp
ret
_Z4fredi:
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
movl %ecx, 16(%rbp)
call _ZNSt6chrono3_V212system_clock3nowEv
movq %rax, -16(%rbp)
movl 16(%rbp), %ecx
call _ZL3fooi
movl %eax, -4(%rbp)
call _ZNSt6chrono3_V212system_clock3nowEv
movq %rax, -32(%rbp)
movl -4(%rbp), %eax
addq $64, %rsp
popq %rbp
ret
But:
g++ -S --std=c++11 -O2 fred.cpp :
_Z4fredi:
pushq %rbx
subq $32, %rsp
movl %ecx, %ebx
call _ZNSt6chrono3_V212system_clock3nowEv
call _ZNSt6chrono3_V212system_clock3nowEv
leal (%rbx,%rbx), %eax
addq $32, %rsp
popq %rbx
ret
Now, with foo() as an extern function:
#include <chrono>
int foo(int x);
int fred(int x)
{
auto t1 = std::chrono::high_resolution_clock::now();
int y = foo(x);
auto t2 = std::chrono::high_resolution_clock::now();
return y;
}
g++ -S --std=c++11 -O2 fred.cpp :
_Z4fredi:
pushq %rbx
subq $32, %rsp
movl %ecx, %ebx
call _ZNSt6chrono3_V212system_clock3nowEv
movl %ebx, %ecx
call _Z3fooi
movl %eax, %ebx
call _ZNSt6chrono3_V212system_clock3nowEv
movl %ebx, %eax
addq $32, %rsp
popq %rbx
ret
BUT, if this is linked with -flto (link-time optimisation):
0000000100401710 <main>:
100401710: 53 push %rbx
100401711: 48 83 ec 20 sub $0x20,%rsp
100401715: 89 cb mov %ecx,%ebx
100401717: e8 e4 ff ff ff callq 100401700 <__main>
10040171c: e8 bf f9 ff ff callq 1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
100401721: e8 ba f9 ff ff callq 1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
100401726: 8d 04 1b lea (%rbx,%rbx,1),%eax
100401729: 48 83 c4 20 add $0x20,%rsp
10040172d: 5b pop %rbx
10040172e: c3 retq

Reordering may be done by the compiler, or by the processor.
Most compilers offer a platform-specific method to prevent reordering of read-write instructions. On gcc, this is
asm volatile("" ::: "memory");
(More information here)
Note that this only indirectly prevents reordering operations, as long as they depend on the reads / writes.
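To make the "only indirectly" caveat concrete, here is a rough sketch (GCC/Clang syntax, names such as compiler_barrier, sink and time_fill_ns are made up for illustration). The barrier constrains the stores to the global array because they are memory accesses; a computation that lives purely in registers would not be pinned by it.

```cpp
#include <chrono>

// Compiler-only barrier: emits no instructions, but the optimizer must
// assume that any memory may be read or written at this point.
inline void compiler_barrier() { asm volatile("" ::: "memory"); }

// Hypothetical workload target: global memory that the barrier covers.
int sink[64];

long long time_fill_ns(int seed) {
  using Clock = std::chrono::steady_clock;  // assumption: a monotonic clock
  auto t1 = Clock::now();
  compiler_barrier();
  for (int i = 0; i < 64; ++i) sink[i] = seed + i;  // stores to global memory
  compiler_barrier();  // the compiler cannot sink the stores below this point
  auto t2 = Clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
```

This says nothing about what the processor does at run time; it only restricts the compiler.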
In practice I haven't yet seen a system where the system call in Clock::now() does have the same effect as such a barrier. You could inspect the resulting assembly to be sure.
It is not uncommon, however, that the function under test gets evaluated during compile time. To enforce "realistic" execution, you may need to derive input for foo() from I/O or a volatile read.
Another option would be to disable inlining for foo() - again, this is compiler specific and usually not portable, but would have the same effect.
On gcc, this would be __attribute__ ((noinline))
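A small sketch of those two suggestions combined, assuming GCC or Clang for the attribute. The names runtime_input and timed_foo are hypothetical. Neither measure is a formal ordering guarantee, so checking the assembly is still advisable.

```cpp
#include <chrono>

// GCC/Clang attribute: keep foo as a real out-of-line call.
__attribute__((noinline)) int foo(int x) { return x * 2; }

// Every access to a volatile object is observable, so the value read from
// it cannot be treated as a compile-time constant.
volatile int runtime_input = 21;

int timed_foo(long long &elapsed_ns) {
  using Clock = std::chrono::steady_clock;  // assumption: a monotonic clock
  int input = runtime_input;  // volatile read: blocks constant folding of foo
  auto t1 = Clock::now();
  int output = foo(input);
  auto t2 = Clock::now();
  elapsed_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
  return output;
}
```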
@Ruslan brings up a fundamental issue: How realistic is this measurement?
Execution time is affected by many factors: one is the actual hardware we are running on, the other is concurrent access to shared resources like cache, memory, disk and CPU cores.
So what we usually do to get comparable timings: make sure they are reproducible with a low error margin. This makes them somewhat artificial.
"hot cache" vs. "cold cache" execution performance can easily differ by an order of magnitude - but in reality, it will be something inbetween ("lukewarm"?)

The C++ language defines what is observable in a number of ways.
If foo() does nothing observable, then it can be eliminated completely. If foo() only does a computation that stores values in "local" state (be it on the stack or in an object somewhere), and the compiler can prove that no safely-derived pointer can get into the Clock::now() code, then there are no observable consequences to moving the Clock::now() calls.
If foo() interacted with a file or the display, and the compiler cannot prove that Clock::now() does not interact with the file or the display, then reordering cannot be done, because interaction with a file or display is observable behavior.
While you can use compiler-specific hacks to force code not to move around (like inline assembly), another approach is to attempt to outsmart your compiler.
Create a dynamically loaded library. Load it prior to the code in question.
That library exposes one thing:
namespace details {
void execute( void(*)(void*), void *);
}
and wraps it like this:
template<class F>
void execute( F f ) {
struct bundle_t {
F f;
} bundle = {std::forward<F>(f)};
auto tmp_f = [](void* ptr)->void {
auto* pb = static_cast<bundle_t*>(ptr);
(pb->f)();
};
details::execute( tmp_f, &bundle );
}
which packs up a nullary lambda and uses the dynamic library to run it in a context that the compiler cannot understand.
Inside the dynamic library, we do:
void details::execute( void(*f)(void*), void *p) {
f(p);
}
which is pretty simple.
Now to reorder the calls to execute, it must understand the dynamic library, which it cannot while compiling your test code.
It can still eliminate foo()s with zero side effects, but you win some, you lose some.
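For reference, the pieces above fit together roughly as follows. This is a single-file sketch only: details::execute is defined in the same file so that it compiles on its own, which defeats the purpose of the trick. In the real scheme that one function would be built into a separate shared library and loaded at run time.

```cpp
#include <utility>

namespace details {
// Assumption for this sketch only: in the scheme described above, this
// definition lives in a separately built dynamic library.
void execute(void (*f)(void *), void *p) { f(p); }
}  // namespace details

template <class F>
void execute(F f) {
  struct bundle_t {
    F f;
  } bundle = {std::move(f)};
  // A captureless lambda converts to a plain function pointer, which is
  // what the C-style interface of the library expects.
  auto tmp_f = [](void *ptr) -> void {
    auto *pb = static_cast<bundle_t *>(ptr);
    (pb->f)();
  };
  details::execute(tmp_f, &bundle);
}
```

Usage would look like execute([&] { result = foo(input); }); between the two Clock::now() calls.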

No it can't. According to the C++ standard [intro.execution]:
14 Every value computation and side effect associated with a
full-expression is sequenced before every value computation and side
effect associated with the next full-expression to be evaluated.
A full-expression is basically a statement terminated by a semicolon. As you can see the above rule stipulates statements must be executed in order. It is within statements that the compiler is allowed more free rein (i.e. it is under some circumstances allowed to evaluate expressions that make up a statement in orders other than left-to-right or anything else specific).
Note the conditions for the as-if rule to apply are not met here. It is unreasonable to think that any compiler would be able to prove that reordering calls to get the system time would not affect observable program behaviour. If there was a circumstance in which two calls to get the time could be reordered without changing observed behaviour, it would be extremely inefficient to actually produce a compiler that analyses a program with enough understanding to be able to infer this with certainty.

No.
Sometimes, by the "as-if" rule, statements may be re-ordered. This is not because they are logically independent of each other, but because that independence allows such a re-ordering to occur without changing the semantics of the program.
Moving a system call that obtains the current time obviously does not satisfy that condition. A compiler that knowingly or unknowingly does so is non-compliant and really silly.
In general, I wouldn't expect any expression that results in a system call to be "second-guessed" by even an aggressively optimizing compiler. It just doesn't know enough about what that system call does.

noinline function + inline assembly black box + full data dependencies
This is based on https://stackoverflow.com/a/38025837/895245 but because I didn't see any clear justification of why the ::now() cannot be reordered there, I would rather be paranoid and put it inside a noinline function together with the asm.
This way I'm pretty sure the reordering cannot happen, since the noinline "ties" the ::now and the data dependency.
main.cpp
#include <chrono>
#include <iostream>
#include <string>
// noinline ensures that the ::now() cannot be split from the __asm__
template <class T>
__attribute__((noinline)) auto get_clock(T& value) {
// Make the compiler think we actually use / modify the value.
// It can't "see" what is going on inside the assembly string.
__asm__ __volatile__ ("" : "+g" (value));
return std::chrono::high_resolution_clock::now();
}
template <class T>
static T foo(T niters) {
T result = 42;
for (T i = 0; i < niters; ++i) {
result = (result * result) - (3 * result) + 1;
}
return result;
}
int main(int argc, char **argv) {
unsigned long long input;
if (argc > 1) {
input = std::stoull(argv[1], NULL, 0);
} else {
input = 1;
}
// Must come before because it could modify input
// which is passed as a reference.
auto t1 = get_clock(input);
auto output = foo(input);
// Must come after as it could use the output.
auto t2 = get_clock(output);
std::cout << "output " << output << std::endl;
std::cout << "time (ns) "
<< std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
<< std::endl;
}
GitHub upstream.
Compile and run:
g++ -ggdb3 -O3 -std=c++14 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out 1000
./main.out 10000
./main.out 100000
The only minor downside of this method is that we add one extra callq instruction over an inline method. objdump -CD shows that main contains:
11b5: e8 26 03 00 00 callq 14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
11ba: 48 8b 34 24 mov (%rsp),%rsi
11be: 48 89 c5 mov %rax,%rbp
11c1: b8 2a 00 00 00 mov $0x2a,%eax
11c6: 48 85 f6 test %rsi,%rsi
11c9: 74 1a je 11e5 <main+0x65>
11cb: 31 d2 xor %edx,%edx
11cd: 0f 1f 00 nopl (%rax)
11d0: 48 8d 48 fd lea -0x3(%rax),%rcx
11d4: 48 83 c2 01 add $0x1,%rdx
11d8: 48 0f af c1 imul %rcx,%rax
11dc: 48 83 c0 01 add $0x1,%rax
11e0: 48 39 d6 cmp %rdx,%rsi
11e3: 75 eb jne 11d0 <main+0x50>
11e5: 48 89 df mov %rbx,%rdi
11e8: 48 89 44 24 08 mov %rax,0x8(%rsp)
11ed: e8 ee 02 00 00 callq 14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
so we see that foo was inlined, but the get_clock calls were not, and they surround it.
get_clock itself however is extremely efficient, consisting of a single leaf call optimized instruction that doesn't even touch the stack:
00000000000014e0 <auto get_clock<unsigned long long>(unsigned long long&)>:
14e0: e9 5b fb ff ff jmpq 1040 <std::chrono::_V2::system_clock::now()@plt>
Since the clock precision is itself limited, I think it is unlikely that you will be able to notice the timing effects of one extra jmpq. Note that one call is required regardless since ::now() is in a shared library.
Call ::now() from inline assembly with a data dependency
This would be the most efficient solution possible, overcoming even the extra jmpq mentioned above.
This is unfortunately extremely hard to do correctly as shown at: Calling printf in extended inline ASM
If your time measurement can be done directly in inline assembly without a call however, then this technique can be used. This is the case for example for gem5 magic instrumentation instructions, x86 RDTSC (not sure if this is representative anymore) and possibly other performance counters.
Related threads:
Is it legal for a C++ optimizer to reorder calls to clock()?
Tested with GCC 8.3.0, Ubuntu 19.04.

Related

Is returning a private class member slower than using a struct and accessing that variable directly?

Suppose you have a class that has private members which are accessed a lot in a program (such as in a loop which has to be fast). Imagine I have defined something like this:
class Foo
{
public:
Foo(unsigned set)
: vari(set)
{}
const unsigned& read_vari() const { return vari; }
private:
unsigned vari;
};
The reason I would like to do it this way is because, once the class is created, "vari" shouldn't be changed anymore. Thus, to minimize bug occurrence, "it seemed like a good idea at the time".
However, if I now need to call this function millions of times, I was wondering if there is an overhead and a slowdown instead of simply using:
struct Foo
{
unsigned vari;
};
So, was my first impulse right in using a class, to avoid anyone mistakenly changing the value of the variable after it has been set by the constructor?
Also, does this introduce a "penalty" in the form of a function call overhead. (Assuming I use optimization flags in the compiler, such as -O2 in GCC)?
They should come out to be the same. Remember that frustrating time you tried to use the operator[] on a vector and gdb just replied optimized out? This is what will happen here. The compiler will not create a function call here but it will rather access the variable directly.
Let's have a look at the following code
struct foo{
int x;
int& get_x(){
return x;
}
};
int direct(foo& f){
return f.x;
}
int fnc(foo& f){
return f.get_x();
}
Which was compiled with g++ test.cpp -o test.s -S -O2. The -S flag tells the compiler to "Stop after the stage of compilation proper; do not assemble" (quote from the g++ manpage). This is what the compiler gives us:
_Z6directR3foo:
.LFB1026:
.cfi_startproc
movl (%rdi), %eax
ret
and
_Z3fncR3foo:
.LFB1027:
.cfi_startproc
movl (%rdi), %eax
ret
as you can see, no function call was made in the second case and they are both the same. Meaning there is no performance overhead in using the accessor method.
bonus: what happens if optimizations are turned off? same code, here are the results:
_Z6directR3foo:
.LFB1022:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movl (%rax), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
and
_Z3fncR3foo:
.LFB1023:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi
call _ZN3foo5get_xEv #<<<call to foo.get_x()
movl (%rax), %eax
leave
.cfi_def_cfa 7, 8
ret
As you can see without optimizations, the struct is faster than the accessor, but who ships code without optimizations?
You can expect identical performance. A great many C++ classes rely on this - for example, C++11's list::size() const can be expected to trivially return a data member. (Which contrasts with vector, where the implementations I've looked at calculate size() as the difference between pointer data members corresponding to begin() and end(), ensuring typical iterator usage is as fast as possible at the cost of potentially slower indexed iteration, if the optimiser can't determine that size() is constant across loop iterations).
There's typically no particular reason to return by const reference for a type like unsigned that should fit in a CPU register anyway, but as it's inlined the compiler doesn't have to take that literally (for an out-of-line version it would likely be implemented by returning a pointer that has to be dereferenced). (The atypical reason is to allow taking the address of the variable, which is why say vector::operator[](size_t) const needs to return a const T& rather than a T, even if T is small enough to fit in a register.)
There is only one way to tell with certainty which one is faster in your particular program built with your particular tools with your particular optimisation flags on your particular platform — by measuring both variants.
Having said that, chances are good that the binaries will be identical, instruction for instruction.
As others have said, optimizers these days are relied on to boil out abstraction (especially in C++, which is more or less built to take advantage of that) and they're very, very good.
But you might not need the getter for this.
struct Foo {
Foo(unsigned set) : vari(set) {}
unsigned const vari;
};
const doesn't forbid initialization.
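One consequence of the const member is worth knowing before choosing it over the getter: the object can still be constructed and copied, but it can no longer be assigned. A short sketch that checks this at compile time (the helper twice is just a made-up reader of the member):

```cpp
#include <type_traits>

struct Foo {
  Foo(unsigned set) : vari(set) {}
  unsigned const vari;  // initialized once in the constructor, then read-only
};

// The const data member makes the implicit copy assignment operator deleted.
static_assert(!std::is_copy_assignable<Foo>::value,
              "a const data member disables assignment");
// Copy construction is unaffected.
static_assert(std::is_copy_constructible<Foo>::value,
              "copy construction still works");

// Hypothetical reader: direct member access, no accessor call needed.
unsigned twice(const Foo &f) { return f.vari * 2u; }
```

If Foo objects need to be reassigned (for instance when sorting a container of them), the private member plus getter from the question is the more flexible choice.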

Why is this no-op loop not optimized away?

The following code does some copying from one array of zeroes interpreted as floats to another one, and prints timing of this operation. As I've seen many cases where no-op loops are just optimized away by compilers, including gcc, I expected that at some point while changing my copy-arrays program it would stop doing the copying.
#include <iostream>
#include <cstring>
#include <sys/time.h>
static inline long double currentTime()
{
timespec ts;
clock_gettime(CLOCK_MONOTONIC,&ts);
return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
}
int main()
{
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
memset(data1,0,W*H*sizeof(float));
memset(data2,0,W*H*sizeof(float));
long double time1=currentTime();
for(int q=0;q<16;++q) // take more time
for(int k=0;k<W*H;++k)
data2[k]=data1[k];
long double time2=currentTime();
std::cout << (time2-time1)*1e+3 << " ms\n";
delete[] data1;
delete[] data2;
}
I compiled this with g++ 4.8.1 command g++ main.cpp -o test -std=c++0x -O3 -lrt. This program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 for it to not crash.)
I also tried changing creation of arrays with new to automatic VLAs, removing memsets, but this doesn't change g++ behavior (apart from changing timings by several times).
It seems the compiler could prove that this code won't do anything sensible, so why didn't it optimize the loop away?
Anyway it isn't impossible (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me... and looking at the assembly language:
...
callq clock_gettime
movq 56(%rsp), %r14
movq 64(%rsp), %rbx
leaq 56(%rsp), %rsi
movl $1, %edi
callq clock_gettime
...
while for g++:
...
call clock_gettime
fildq 32(%rsp)
movl $16, %eax
fildq 40(%rsp)
fmull .LC0(%rip)
faddp %st, %st(1)
.p2align 4,,10
.p2align 3
.L2:
movl $1, %ecx
xorl %edx, %edx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movq %rcx, %rdx
movq %rsi, %rcx
.L5:
leaq 1(%rcx), %rsi
movss 0(%rbp,%rdx,4), %xmm0
movss %xmm0, (%rbx,%rdx,4)
cmpq $200000001, %rsi
jne .L3
subl $1, %eax
jne .L2
fstpt 16(%rsp)
leaq 32(%rsp), %rsi
movl $1, %edi
call clock_gettime
...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
...
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
...
SOURCE CODE - MODIFIED VERSION (2)
...
const size_t W=20000;
const size_t H=10000;
float data1[W*H];
float data2[W*H];
...
Now the case that isn't optimized is (1) + g++
The code in this question has changed quite a bit, invalidating correct answers. This answer applies to the 5th version: as the code currently attempts to read uninitialized memory, an optimizer may reasonably assume that unexpected things are happening.
Many optimization steps have a similar pattern: there's a pattern of instructions that's matched to the current state of compilation. If the pattern matches at some point, the matched pattern is (parametrically) replaced by a more efficient version. A very simple example of such a pattern is the definition of a variable that's not subsequently used; the replacement in this case is simply a deletion.
These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization, the second case may lead to totally unpredictable results (certainly if the modified code is further optimized).
Why do you expect the compiler to optimise this? It’s generally really hard to prove that writes to arbitrary memory addresses are a “no-op”. In your case it would be possible, but it would require the compiler to trace the heap memory addresses through new (which is once again hard since these addresses are generated at runtime) and there really is no incentive for doing this.
After all, you tell the compiler explicitly that you want to allocate memory and write to it. How is the poor compiler to know that you’ve been lying to it?
In particular, the problem is that the heap memory could be aliased to lots of other stuff. It happens to be private to your process but like I said above, proving this is a lot of work for the compiler, unlike for function local memory.
The only way in which the compiler could know that this is a no-op is if it knew what memset does. In order for that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated as a special intrinsic by the compiler. But barring those tricks, the compiler just sees a call to an unknown function which could have side effects and do different things for each of the two calls.

What does the compiler do in assembly when optimizing code? ie -O2 flag

So when you add an optimization flag when compiling your C++, it runs faster, but how does this work? Could someone explain what really goes on in the assembly?
It means you're making the compiler do extra work / analysis at compile time, so you can reap the rewards of a few extra precious cpu cycles at runtime. Might be best to explain with an example.
Consider a loop like this:
const int n = 5;
for (int i = 0; i < n; ++i)
cout << "bleh" << endl;
If you compile this without optimizations, the compiler will not do any extra work for you -- assembly generated for this code snippet will likely be a literal translation into compare and jump instructions. (which isn't the fastest, just the most straightforward)
However, if you compile WITH optimizations, the compiler can easily unroll this loop since it knows the upper bound can't ever change because n is const. (i.e. it can copy the repeated code 5 times directly instead of comparing / checking for the terminating loop condition).
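Copying the loop body once per iteration, as described above, is usually called loop unrolling. Here is a hand-written sketch of what that amounts to at the source level. The function names are made up, and writing into a std::ostringstream with '\n' instead of cout and endl is my substitution so the two versions can be compared. Whether a given compiler actually unrolls such a loop is up to its heuristics.

```cpp
#include <sstream>
#include <string>

// The loop as written: a counter, a compare and a jump on every iteration.
std::string looped() {
  std::ostringstream out;
  const int n = 5;
  for (int i = 0; i < n; ++i)
    out << "bleh" << '\n';
  return out.str();
}

// Hand-unrolled equivalent: the body repeated 5 times, no loop control left.
std::string unrolled() {
  std::ostringstream out;
  out << "bleh" << '\n';
  out << "bleh" << '\n';
  out << "bleh" << '\n';
  out << "bleh" << '\n';
  out << "bleh" << '\n';
  return out.str();
}
```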
Here's another example with an optimized function call. Below is my whole program:
#include <stdio.h>
static int foo(int a, int b) {
return a * b;
}
int main(int argc, char** argv) {
fprintf(stderr, "%d\n", foo(10, 15));
return 0;
}
If I compile this code without optimizations using gcc foo.c on my x86 machine, my assembly looks like this:
movq %rsi, %rax
movl %edi, -4(%rbp)
movq %rax, -16(%rbp)
movl $10, %eax ; these are my parameters to
movl $15, %ecx ; the foo function
movl %eax, %edi
movl %ecx, %esi
callq _foo
; .. about 20 other instructions ..
callq _fprintf
Here, it's not optimizing anything. It's loading the registers with my constant values and calling my foo function. But look if I recompile with the -O2 flag:
movq (%rax), %rdi
leaq L_.str(%rip), %rsi
movl $150, %edx
xorb %al, %al
callq _fprintf
The compiler is so smart that it doesn't even call foo anymore. It just inlines its return value.
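The folding of foo(10, 15) into the constant 150 can be made visible at the source level. Note that this sketch uses C++ constexpr, which is a language guarantee and a different mechanism from the optimizer pass at work in the C program above; it is only meant to show that the value is fully computable before the program runs. The wrapper name folded is made up.

```cpp
// constexpr allows (and in a static_assert, requires) compile-time evaluation.
constexpr int foo(int a, int b) { return a * b; }

// Checked entirely by the compiler: the same 150 that shows up as
// "movl $150, %edx" in the optimized listing.
static_assert(foo(10, 15) == 150, "foo(10, 15) folds to 150");

// Hypothetical wrapper: an optimizing build can return the constant directly.
int folded() { return foo(10, 15); }
```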
Most of the optimization happens in the compiler's intermediate representation before the assembly is generated. You should definitely check out Agner Fog's Software optimization resources. Chapter 8 of the 1st manual describes optimizations performed by the compiler with examples.

Which is faster : if (bool) or if(int)?

Which value is better to use? Boolean true or Integer 1?
The above topic made me do some experiments with bool and int in if condition. So just out of curiosity I wrote this program:
int f(int i)
{
if ( i ) return 99; //if(int)
else return -99;
}
int g(bool b)
{
if ( b ) return 99; //if(bool)
else return -99;
}
int main(){}
g++ intbool.cpp -S generates asm code for each functions as follows:
asm code for f(int)
__Z1fi:
LFB0:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
cmpl $0, 8(%ebp)
je L2
movl $99, %eax
jmp L3
L2:
movl $-99, %eax
L3:
leave
LCFI2:
ret
asm code for g(bool)
__Z1gb:
LFB1:
pushl %ebp
LCFI3:
movl %esp, %ebp
LCFI4:
subl $4, %esp
LCFI5:
movl 8(%ebp), %eax
movb %al, -4(%ebp)
cmpb $0, -4(%ebp)
je L5
movl $99, %eax
jmp L6
L5:
movl $-99, %eax
L6:
leave
LCFI6:
ret
Surprisingly, g(bool) generates more asm instructions! Does it mean that if(bool) is a little slower than if(int)? I used to think bool is especially designed to be used in conditional statements such as if, so I was expecting g(bool) to generate fewer asm instructions, thereby making g(bool) more efficient and fast.
EDIT:
I'm not using any optimization flag as of now. But even in its absence, why it generates more asm for g(bool) is a question for which I'm looking for a reasonable answer. I should also tell you that the -O2 optimization flag generates exactly the same asm. But that isn't the question. The question is what I've asked.
Makes sense to me. Your compiler apparently defines a bool as an 8-bit value, and your system ABI requires it to "promote" small (< 32-bit) integer arguments to 32-bit when pushing them onto the call stack. So to compare a bool, the compiler generates code to isolate the least significant byte of the 32-bit argument that g receives, and compares it with cmpb. In the first example, the int argument uses the full 32 bits that were pushed onto the stack, so it simply compares against the whole thing with cmpl.
Compiling with -O3 gives the following for me:
f:
pushl %ebp
movl %esp, %ebp
cmpl $1, 8(%ebp)
popl %ebp
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
g:
pushl %ebp
movl %esp, %ebp
cmpb $1, 8(%ebp)
popl %ebp
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.. so it compiles to essentially the same code, except for cmpl vs cmpb.
This means that the difference, if there is any, doesn't matter. Judging by unoptimized code is not fair.
Edit to clarify my point. Unoptimized code is for simple debugging, not for speed. Comparing the speed of unoptimized code is senseless.
When I compile this with a sane set of options (specifically -O3), here's what I get:
For f():
.type _Z1fi, @function
_Z1fi:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
cmpl $1, %edi
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.cfi_endproc
For g():
.type _Z1gb, @function
_Z1gb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
cmpb $1, %dil
sbbl %eax, %eax
andb $58, %al
addl $99, %eax
ret
.cfi_endproc
They still use different instructions for the comparison (cmpb for boolean vs. cmpl for int), but otherwise the bodies are identical. A quick look at the Intel manuals tells me: ... not much of anything. There's no such thing as cmpb or cmpl in the Intel manuals. They're all cmp and I can't find the timing tables at the moment. I'm guessing, however, that there's no clock difference between comparing a byte immediate vs. comparing a long immediate, so for all practical purposes the code is identical.
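For anyone puzzled by how cmp / sbb / and / add produces 99 or -99 without a branch, here is a C++ emulation of the four instructions in the f() listing. It is only a sketch of the arithmetic, with a made-up function name, and it assumes a 32-bit int.

```cpp
#include <cstdint>
#include <cstring>

int select_branchless(int i) {
  // cmpl $1, %edi ; sbbl %eax, %eax
  // The compare sets the carry flag only when i == 0 (unsigned i < 1);
  // sbb then leaves all ones in eax if carry is set, otherwise zero.
  std::uint32_t eax = (static_cast<std::uint32_t>(i) < 1u) ? 0xFFFFFFFFu : 0u;
  // andb $58, %al : masks only the low byte, the upper 24 bits are kept.
  eax = (eax & 0xFFFFFF00u) | (eax & 58u);
  // addl $99, %eax : 0 + 99 = 99, or 0xFFFFFF3A + 99 = 0xFFFFFF9D = -99.
  eax += 99u;
  std::int32_t result;
  std::memcpy(&result, &eax, sizeof result);  // reinterpret the bits as signed
  return result;
}
```

The g() listing does the same thing, only with a byte-wide compare on the argument.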
edited to add the following based on your addition
The reason the code is different in the unoptimized case is that it is unoptimized. (Yes, it's circular, I know.) When the compiler walks the AST and generates code directly, it doesn't "know" anything except what's at the immediate point of the AST it's in. At that point it lacks all contextual information needed to know that at this specific point it can treat the declared type bool as an int. A boolean is obviously by default treated as a byte and when manipulating bytes in the Intel world you have to do things like sign-extend to bring it to certain widths to put it on the stack, etc. (You can't push a byte.)
When the optimizer views the AST and does its magic, however, it looks at surrounding context and "knows" when it can replace code with something more efficient without changing semantics. So it "knows" it can use an integer in the parameter and thereby lose the unnecessary conversions and widening.
With GCC 4.5 on Linux and Windows at least, sizeof(bool) == 1. On x86 and x86_64, you can't pass less than a general-purpose register's worth to a function (whether via the stack or a register, depending on the calling convention etc.).
So the code for bool, when un-optimized, actually goes to some length to extract that bool value from the argument stack (using another stack slot to save that byte). It's more complicated than just pulling a native register-sized variable.
Yeah, the discussion's fun. But just test it:
Test code:
#include <stdio.h>
#include <string.h>
int testi(int);
int testb(bool);
int main (int argc, char* argv[]){
bool valb;
int vali;
int loops;
if( argc < 2 ){
return 2;
}
valb = (0 != (strcmp(argv[1], "0")));
vali = strcmp(argv[1], "0");
printf("Arg1: %s\n", argv[1]);
printf("BArg1: %i\n", valb ? 1 : 0);
printf("IArg1: %i\n", vali);
for(loops=30000000; loops>0; loops--){
//printf("%i: %i\n", loops, testb(valb=!valb));
printf("%i: %i\n", loops, testi(vali=!vali));
}
return valb;
}
int testi(int val){
if( val ){
return 1;
}
return 0;
}
int testb(bool val){
if( val ){
return 1;
}
return 0;
}
Compiled on a 64-bit Ubuntu 10.10 laptop with:
g++ -O3 -o /tmp/test_i /tmp/test_i.cpp
Integer-based comparison:
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.203s
user 0m8.170s
sys 0m0.010s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.056s
user 0m8.020s
sys 0m0.000s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.116s
user 0m8.100s
sys 0m0.000s
Boolean test / print uncommented (and integer commented):
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.254s
user 0m8.240s
sys 0m0.000s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m8.028s
user 0m8.000s
sys 0m0.010s
sauer@trogdor:/tmp$ time /tmp/test_i 1 > /dev/null
real 0m7.981s
user 0m7.900s
sys 0m0.050s
They're the same with 1 assignment and 2 comparisons each loop over 30 million loops. Find something else to optimize. For example, don't use strcmp unnecessarily. ;)
At the machine level there is no such thing as bool
Very few instruction set architectures define any sort of boolean operand type, although there are often instructions that trigger an action on non-zero values. To the CPU, usually, everything is one of the scalar types or a string of them.
A given compiler and a given ABI will need to choose specific sizes for int and bool and when, like in your case, these are different sizes they may generate slightly different code, and at some levels of optimization one may be slightly faster.
Why is bool one byte on many systems?
It's safer to choose a char type for bool because someone might make a really large array of them.
Update: by "safer", I mean: for the compiler and library implementors. I'm not saying people need to reimplement the system type.
It will mostly depend on the compiler and the optimization. There's an interesting discussion (language agnostic) here:
Does "if ([bool] == true)" require one more step than "if ([bool])"?
Also, take a look at this post: http://www.linuxquestions.org/questions/programming-9/c-compiler-handling-of-boolean-variables-290996/
Approaching your question in two different ways:
If you are specifically talking about C++, or any programming language that compiles down to assembly for that matter, we are bound to whatever code the compiler generates. We are also bound to the representation of true and false in C++. An integer will have to be stored in 32 bits, whereas a single byte is enough to store the boolean. Asm snippets for conditional statements:
For the integer:
mov eax,dword ptr[esp] ;Store integer
cmp eax,0 ;Compare to 0
je false ;If int is 0, it's false
;Do what has to be done when true
false:
;Do what has to be done when false
For the bool:
mov al,1 ;Anything that is not 0 is true
test al,1 ;See if the lowest bit is set
jz false ;Not set, so it's false
;Do what has to be done when true
false:
;Do what has to be done when false
So, that's why the speed comparison is so compiler-dependent. In the case above, the bool would be slightly faster, since cmp implies a subtraction to set the flags, while test only performs a bitwise AND. It also contradicts what your compiler generated.
Another approach, a much simpler one, is to look at the logic of the expression on its own and try not to worry about how the compiler will translate your code, and I think this is a much healthier way of thinking. I still believe, ultimately, that the code being generated by the compiler is actually trying to give a truthful resolution. What I mean is that, maybe if you increase the number of test cases in the if statement and stick with boolean on one side and integer on the other, the compiler will make the generated code execute faster with boolean expressions at the machine level.
I'm treating this as a conceptual question, so I'll give a conceptual answer. This discussion reminds me of discussions I commonly have about whether code efficiency translates to fewer lines of assembly. It seems that this idea is generally accepted as true. Considering that keeping track of how fast the ALU will handle each statement is not viable, the second option is to focus on jumps and compares in assembly. When that is the case, the distinction between boolean and integer expressions in the code you presented becomes rather representative. The result of an expression in C++ is a value that is then given a representation. In assembly, on the other hand, the jumps and comparisons are based on numeric values regardless of what type of expression was being evaluated back in your C++ if statement. It is important in these questions to remember that purely logical statements such as these end up with a huge computational overhead, even though a single bit would be capable of the same thing.

Atomic 64 bit writes with GCC

I've gotten myself into a confused mess regarding multithreaded programming and was hoping someone could come and slap some understanding in me.
After doing quite a bit of reading, I've come to the understanding that I should be able to set the value of a 64 bit int atomically on a 64 bit system [1].
I found a lot of this reading difficult though, so thought I would try to make a test to verify this. So I wrote a simple program with one thread which would set a variable into one of two values:
bool switcher = false;
while(true)
{
if (switcher)
foo = a;
else
foo = b;
switcher = !switcher;
}
And another thread which would check the value of foo:
while (true)
{
__uint64_t blah = foo;
if ((blah != a) && (blah != b))
{
cout << "Not atomic! " << blah << endl;
}
}
I set a = 1844674407370955161; and b = 1144644202170355111;. I run this program and get no output warning me that blah is not a or b.
Great, looks like it probably is an atomic write...but then, I changed the first thread to set a and b directly, like so:
bool switcher = false;
while(true)
{
if (switcher)
foo = 1844674407370955161;
else
foo = 1144644202170355111;
switcher = !switcher;
}
I re-run, and suddenly:
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
What's changed? Either way I'm assigning a large number to foo - does the compiler handle a constant number differently, or have I misunderstood everything?
Thanks!
1: Intel CPU documentation, section 8.1, Guaranteed Atomic Operations
2: GCC Development list discussing that GCC doesn't guarantee it in the documentation, but the kernel and other programs rely on it
Disassembling the loop, I get the following code with gcc:
.globl _switcher
_switcher:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $0, -4(%rbp)
L2:
cmpl $0, -4(%rbp)
je L3
movq _foo@GOTPCREL(%rip), %rax
movl $-1717986919, (%rax)
movl $429496729, 4(%rax)
jmp L5
L3:
movq _foo@GOTPCREL(%rip), %rax
movl $1486032295, (%rax)
movl $266508246, 4(%rax)
L5:
cmpl $0, -4(%rbp)
sete %al
movzbl %al, %eax
movl %eax, -4(%rbp)
jmp L2
LFE2:
So it would appear that gcc uses two 32-bit movl instructions with 32-bit immediate values. There is a movq instruction that can move a 64-bit register to memory (or memory to a 64-bit register), but it does not seem to be able to move a 64-bit immediate value to a memory address, so the compiler is forced either to use a temporary register and then move the value to memory, or to use two movl instructions. You can try to force it to use a register by using a temporary variable, but this may not work.
References:
mov
movq
http://www.x86-64.org/documentation/assembly.html
immediate values inside instructions remain 32 bits.
There is no way for the compiler to assign a 64-bit constant atomically, except by first filling a register and then moving that register to the variable. That is probably more costly than assigning directly to the variable, and as atomicity is not required by the language, the atomic solution is not chosen.
The Intel CPU documentation is right: aligned 8-byte reads/writes are always atomic on recent hardware (even on 32-bit operating systems).
What you don't tell us is whether you are using 64-bit hardware with a 32-bit system. If so, the 8-byte write will most likely be split into two 4-byte writes by the compiler.
Just have a look at the relevant section in the object code.