The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:
int num_times_done_it; // global
void doit() {
++num_times_done_it;
// do something
}
void report_stats() {
printf("called doit %i times\n", num_times_done_it);
// and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:
Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.
Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
atomic<int> num_times_done_it;
void doit() {
num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
memory_order_relaxed);
// as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Count for each thread separately and sum up after the threads have joined. For intermediate results, you may also sum up in between, though the result might be off. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywhere if you use it often.
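A minimal sketch of that per-thread pattern, assuming C++11 (the Counter/registry names are mine, and the per-thread values are relaxed atomics so that summing them from another thread is not a data race):
// A minimal sketch of the per-thread variant, assuming C++11. Each thread owns
// its counter; doit only touches thread-private state, and report_stats sums
// the per-thread values plus the counts of already-exited threads.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <mutex>
#include <vector>

struct Counter {
    std::atomic<long> value{0};
    Counter();
    ~Counter();
};

static std::mutex registry_mutex;
static std::vector<Counter*> registry;   // live per-thread counters
static long retired = 0;                 // counts from threads that have exited

Counter::Counter() {
    std::lock_guard<std::mutex> lock(registry_mutex);
    registry.push_back(this);
}

Counter::~Counter() {
    std::lock_guard<std::mutex> lock(registry_mutex);
    retired += value.load(std::memory_order_relaxed);
    registry.erase(std::find(registry.begin(), registry.end(), this));
}

thread_local Counter tls_counter;        // one counter per thread

void doit() {
    // Only this thread writes its own counter, so a relaxed
    // load-increment-store loses no updates and is an ordinary add on x86.
    tls_counter.value.store(
        tls_counter.value.load(std::memory_order_relaxed) + 1,
        std::memory_order_relaxed);
    // do something
}

void report_stats() {
    std::lock_guard<std::mutex> lock(registry_mutex);
    long total = retired;
    for (Counter* c : registry)
        total += c->value.load(std::memory_order_relaxed);
    std::printf("called doit %ld times\n", total);
}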
And - depending on compiler and platform - atomics aren't that expensive (see Herb Sutter's "atomic Weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create problems with the caches, so it's not advisable.
It seems that the memory_order_relaxed trick is the right way to do this.
This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.
I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
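For completeness, an alternative that isn't in the question is a relaxed fetch_add, which cannot lose concurrent increments; note, though, that on x86 it compiles to a lock-prefixed add, so it gives up some of the cheapness the question was after. A minimal sketch:
// Sketch (not from the question): an atomic read-modify-write with relaxed ordering.
#include <atomic>
#include <cstdio>

std::atomic<int> num_times_done_it{0};

void doit() {
    num_times_done_it.fetch_add(1, std::memory_order_relaxed);
    // as before
}

void report_stats() {
    std::printf("called doit %d times\n",
                num_times_done_it.load(std::memory_order_relaxed));
}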
Related
I'm using _CrtMemCheckpoint and _CrtMemDumpAllObjectsSince to track possible memory leaks in my dll.
In DllMain, when DLL_PROCESS_ATTACH is detected, an init function is called which calls _CrtMemCheckpoint(&startState) on the global _CrtMemState variable startState. When DLL_PROCESS_DETACH is detected, an exit function is called that calls _CrtMemDumpAllObjectsSince(&startState). This returns:
ExitInstance()Dumping objects ->
{8706} normal block at 0x07088200, 8 bytes long.
Data: <p v > 70 FF 76 07 01 01 CD CD
{8705} normal block at 0x07084D28, 40 bytes long.
Data: < > 00 00 00 10 FF FF FF FF FF FF FF FF 00 00 00 00
{4577} normal block at 0x070845F0, 40 bytes long.
Data: <dbV > 64 62 56 0F 01 00 00 00 FF FF FF FF FF FF FF FF
{166} normal block at 0x028DD4B8, 40 bytes long.
Data: <dbV > 64 62 56 0F 01 00 00 00 FF FF FF FF FF FF FF FF
{87} normal block at 0x02889BA8, 12 bytes long.
Data: < P > DC 50 90 02 00 00 00 00 01 00 00 00
So far so good, except the last three entries (4577, 166 and 87) are also in startState, i.e. if I run _CrtDumpMemoryLeaks() in my Init function and in my Exit function, those entries are in both lists.
The documentation says this:
_CrtMemDumpAllObjectsSince uses the value of the state parameter to determine where to initiate the dump operation. To begin dumping from
a specified heap state, the state parameter must be a pointer to a
_CrtMemState structure that has been filled in by _CrtMemCheckpoint before _CrtMemDumpAllObjectsSince was called.
Which makes me believe that items tracked in startState would be excluded from the output. At the end of the Init function where _CrtMemCheckpoint is called there have been about 4700 allocation calls. Shouldn't _CrtMemDumpAllObjectsSince only dump objects allocated after that checkpoint call?
What have I missed?
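For reference, the wiring described above looks roughly like this (a sketch; InitMemTracking/ExitMemTracking stand in for my actual init and exit functions, and the debug CRT must be in use for any of this to report anything):
#include <windows.h>
#include <crtdbg.h>

static _CrtMemState startState;

static void InitMemTracking() { _CrtMemCheckpoint(&startState); }
static void ExitMemTracking() { _CrtMemDumpAllObjectsSince(&startState); }

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
{
    switch (fdwReason)
    {
    case DLL_PROCESS_ATTACH: InitMemTracking(); break;  // take the checkpoint
    case DLL_PROCESS_DETACH: ExitMemTracking(); break;  // dump everything "since"
    }
    return TRUE;
}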
Short
This only appears strange; it does the job (in part), but in an overzealous way.
These functions are decades old: not buggy, but not completely well designed.
The truth is that something in the "old" state changes after your "since" state.
So the question is: "yes, it does reflect a change since the checkpoint, but is it a lethal leak?"
This is frequent, and amplified by delayed initialisation in DLLs.
It is also amplified by complex objects like map/string/array/list, which delay allocation of their internal buffers.
The bad news is that nearly all complex objects declared "static" are in fact initialised on first use.
So these changes ought to show up in _CrtMemDumpAllObjectsSince, because their memory allocations changed.
Unfortunately the display is so crude and unfiltered that it also shows too many irrelevant (unmodified) blocks.
The typical biggest culprit is "realloc", which changes the state of an old allocation.
It may even look stranger, because entries can disappear:
for example, when a genuine malloc is made after the state snapshot, it performs a kind of 'reset' of the low-water marker used for the dump, setting it to a higher level, and that magically makes a bunch of your "extra" entries disappear.
Behaviour is even more erratic if you are doing multithreading, as it easily becomes non-repeatable.
Note:
The fact that it doesn't show a file name and line number is a sign that it is dealing with pre-initialisation here.
So the culprits are most likely complex static objects initialised BEFORE main() (or InitInstance()).
Long:
_CrtMemDumpAllObjectsSince is painful!
In fact, it can be so cluttered with non-useful information that it defeats the purpose of simple day-to-day use of _CrtMemDumpAllObjectsSince.
(Despite the good idea behind it.)
Workaround
There is no simple one!
You may try to do a malloc followed by a free AFTER you take your "since" state snapshot, in order to nudge this marker.
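Something like this, right after taking the snapshot (just a sketch of that nudge; any small allocation will do):
_CrtMemCheckpoint(&startState);
void* nudge = malloc(16);   // a genuine allocation right after the snapshot...
free(nudge);                // ...its only purpose is to advance the internal marker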
But to be safer and more in control, I unfortunately saw no way around writing my own _MyCrtMemDumpAllObjectsSince that dumps from the original MS structure.
This was inspired by static void __cdecl dump_all_object_since_nolock(_CrtMemState const* const state) throw()
(see "debug_heap.cpp", Copyright Microsoft).
The code is offered "as is", more for inspiration.
But first, some explanation of the way _CrtMemState works:
A _CrtMemState has a pointer 'pBlockHeader', which is usually a link in a doubly linked list of '_CrtMemBlockHeader*'.
This list is in fact more than a snapshot of the moment it is built; it is a selection (unclear how) of all the memory blocks in use, arranged in such a way that the "current state" is pointed to directly by 'pBlockHeader'.
So, following:
-> _block_header_prev allows the exploration of older blocks
-> _block_header_next allows the exploration of newer blocks (the juice you are looking for in the concept of "since", but very dangerous, as there is no end marker)
The TRICKY part:
MS maintains a vital internal static _CrtMemBlockHeader* called __acrt_first_block.
This __acrt_first_block continuously changes during alloc and realloc.
However, the _MyCrtMemDumpAllObjectsSince dump starts from this __acrt_first_block and goes forward (using _block_header_next) until it finds a NULL pointer.
The first block handled is decided by this __acrt_first_block, and the 'state' you passed in is nothing more than a STOP marker for the dump.
In other words, _CrtMemDumpAllObjectsSince doesn't really dump starting from the "since" state;
it dumps from __acrt_first_block as it currently is, down to your "since" state.
The 'for' loop overdoes it by showing blocks from a 'start' (since) to an 'end' (the oldest block modified since).
This makes sense, but it also ends up dumping blocks that have NOT been modified, showing things we don't care about.
The MS structure is clever and can be used directly, although it is not guaranteed that Microsoft will keep the vital _CrtMemBlockHeader structure the same in the future.
But over the last 15 years I haven't seen it change a bit (nor do I foresee any reason they would change such a strategic and critical structure).
I dislike copy/pasting MS code and resolving linker issues for my piggyback code.
So the workaround I used is based on the ability to intercept the text messages sent to the "Output" window, decoding and storing ALL the information in my own store.
Structurally, the code below gives an idea of the interception, using a static struct under a lock to store all the information:
_CrtSetReportHook2(_CRT_RPTHOOK_INSTALL,MyReportHookDumpFilter);
_CrtMemDumpAllObjectsSince(state); // CONTRARY to what it says, this thing seems to dump everything until old_state
_CrtSetReportHook2(_CRT_RPTHOOK_REMOVE,MyReportHookDumpFilter);
_MyReportHookDumpFilterCommand(_CUST_SORT,NULL);
_MyReportHookDumpFilterCommand(_CUST_DUMP,NULL);
The `_MyReportHookDumpFilterCommand` checks for pre-existing blocks that have NOT been modified at all and avoids displaying those during its dump phase.
Take it as inspiration for code to make the display easier to deal with.
If anybody has a simpler way to do this, please share!
I'm a beginner in C++ and I've just read that macros work by replacing text whenever needed. In this case, does this mean that it makes the .exe run faster? And how is this different than an inline function?
For example, if I have the following macro :
#define SQUARE(x) ((x) * (x))
and normal function :
int Square(const int& x)
{
return x*x;
}
and inline function :
inline int Square(const int& x)
{
return x*x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.
You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
PS: as a matter of style, avoid using const Type& for parameter types that are fundamental, like int or double. Simply use the type itself, in other words, use
int Square(int x)
since passing a copy won't hurt performance (a const reference can even make it worse); see e.g. this question for more details.
Macros translate to: stupid replacement of pattern A with pattern B. This means: everything happens before the compiler kicks in. Sometimes they come in handy, but in general they should be avoided, because you can do a lot of things with them and then, later on in the debugger, have no idea what is going on.
Besides: your approach to performance is, to put it kindly, naive. First you learn the language (which is hard for modern C++, because there are a ton of important concepts and things one absolutely needs to know and understand). Then you practice, practice, practice. And then, when you really come to a point where your existing application has performance problems, you do profiling to understand the real issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (like: potential bottlenecks), configuration (in the sense of latency between different nodes in your system), and so on. Of course, you should apply common sense; and not write code that is obviously wasting memory or CPU cycles. But sometimes a piece of code that runs 50% slower ... might be 500% easier to read and maintain. And if execution time is then 500ms, and not 250ms; that might be totally OK (unless that specific part is called a thousand times per minute).
The difference between a macro and an inlined function is that a macro is dealt with before the compiler sees it.
On my compiler (clang++) without optimisation flags the square function won't be inlined. The code it generates looks like this
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
The imul is the assembly instruction doing the work; the rest is moving data around.
The code that calls it looks like
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, that imul shows up in the main function, where the function is called from in the C++ code:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to do to get a basic handle on assembly language for your machine and use gcc -S on your source, or objdump -D on your binary (as I did here) to see exactly what is going on.
Using the macro instead of the inlined function gets something very similar
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers here with macros: what does this do ?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? Nope, 42 here (and strictly speaking it's undefined behaviour, since x is modified twice without sequencing). It becomes
std::cout << ((++x) * (++x)) << std::endl;
which on this compiler evaluates as 6 * 7
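For comparison, a small sketch of the same call through an inline function (pass-by-value, as suggested earlier); the argument is evaluated exactly once:
#include <iostream>

inline int Square(int x) { return x * x; }

int main() {
    int x = 5;
    std::cout << Square(++x) << std::endl;   // ++x evaluated once: prints 36
    return 0;
}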
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.
Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get using macros, is the same as can be achieved with inline functions - at the expense of increasing chances of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros they are typesafe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.
Using such a macro only makes sense if its argument is itself a #define'd constant, as the computation can then be folded at compile time. Even then, double-check that the result is the expected one.
When working on classic variables, the (inlined) function form should be preferred as:
It is type-safe;
It will handle expressions used as an argument in a consistent way. This not only covers the case of pre/post increments quoted by Peter, but also the case where the argument is itself some computation-intensive expression: the macro form forces that argument to be evaluated twice (which may not necessarily yield the same value both times, by the way) vs. only once for the function.
I have to admit that I used to write such macros for quick prototyping of apparently simple functions, but the time they have cost me over the years finally changed my mind!
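For what it's worth, a constexpr function also covers the compile-time-constant case, which removes that last reason for the macro; a minimal sketch, not taken from the question:
#include <array>

constexpr int Square(int x) { return x * x; }

std::array<int, Square(4)> buffer;           // size computed at compile time: 16 elements

int twice(int v) { return Square(v + v); }   // ordinary type-safe call; argument evaluated once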
I've been reading about bit twiddling hacks and thought, are compilers able to avoid branching in the following code:
constexpr int min(const int lhs, const int rhs) noexcept {
if (lhs < rhs) {
return lhs;
}
return rhs;
}
by replacing it with (explanation):
constexpr int min(const int lhs, const int rhs) noexcept {
return rhs ^ ((lhs ^ rhs) & -(lhs < rhs));
}
Are compilers able to...: Yes, definitely!
Can I rely on those optimizations?: No, you can't rely on any optimization. There may always be some strange conditions under which a compiler chooses not to implement a certain optimization for some non-obvious reason or just fails to see the possibility. Also in general, I've made the observation that compilers sometimes are a lot dumber than people think (Or the people (including me) are dumber than they think).(1)
Not asked, but a very important aspect: Can I rely on this actually being an optimization? NO! For one, (especially on x86) performance always depends on the surrounding code, and there are a lot of different optimizations that interact. Also, some architectures might even offer instructions that implement the operation even more efficiently.
Should I use bit-twiddling optimizations?: In general: No - especially not without verifying that they actually give you any benefit! Even when they do improve performance, they make your code harder to read and review, and they make some architecture- and compiler-specific assumptions (representation of integers, execution time of instructions, penalty for branch misprediction ...) that might lead to worse performance when you port your code to another architecture or - in the worst case - even lead to incorrect results.
My advice:
If you need to get the last bit of performance for a specific system, then just try both variants and measure (and re-verify the result every time you update your CPU and/or compiler). For any other case, assume that the compiler is at least as good at making low-level optimizations as you are. I'd also suggest that you first learn about all the optimization-related compiler flags and set up a proper benchmark BEFORE starting to use low-level optimizations in any case.
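A minimal sketch of such a measurement, assuming the two variants from the question (the harness, data pattern and sizes here are my own arbitrary choices; branch predictability of the input strongly affects the branching version, so measure with data resembling your real workload):
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int min_branch(int lhs, int rhs) { return lhs < rhs ? lhs : rhs; }
int min_bits(int lhs, int rhs)   { return rhs ^ ((lhs ^ rhs) & -(lhs < rhs)); }

template <typename F>
long long time_it(F f, const std::vector<int>& data) {
    auto start = std::chrono::steady_clock::now();
    unsigned acc = 0;
    for (std::size_t i = 1; i < data.size(); ++i)
        acc += static_cast<unsigned>(f(data[i - 1], data[i]));  // consume every result
    auto stop = std::chrono::steady_clock::now();
    std::printf("(acc=%u) ", acc);                              // keep acc observable so the loop isn't elided
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main() {
    std::vector<int> data(1 << 22);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = static_cast<int>(i * 2654435761u);            // scrambled but deterministic input
    std::printf("branching: %lld us\n", time_it(min_branch, data));
    std::printf("twiddling: %lld us\n", time_it(min_bits, data));
}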
I think the only area where hand optimizations are still sometimes beneficial is if you want to make optimal use of vector units. Modern compilers can auto-vectorize many things, but this is still a relatively new field, and there are certain things the compiler is just not allowed to do because it would violate some guarantees from the standard (especially where floating-point operations are concerned).
(1) Some people seem to think that, independent of what their code looks like, the compiler will always produce the optimal code that provides the same semantics. For one, there are limits to what a compiler can do in a limited amount of time (there are a lot of heuristics that work MOST of the time, but not always). Second, in many cases the C++ standard requires the compiler to give certain guarantees that you are not actually interested in at the moment, but that still prevent optimizations.
clang++ (3.5.2-1) seems to be smart enough at -O3 (I'm not using C++11 or C++14; constexpr and noexcept were removed from the source):
08048760 <_Z3minii>:
8048760: 8b 44 24 08 mov 0x8(%esp),%eax
8048764: 8b 4c 24 04 mov 0x4(%esp),%ecx
8048768: 39 c1 cmp %eax,%ecx
804876a: 0f 4e c1 cmovle %ecx,%eax
804876d: c3 ret
gcc (4.9.3) (-O3) instead does a branch with jle:
08048740 <_Z3minii>:
8048740: 8b 54 24 08 mov 0x8(%esp),%edx
8048744: 8b 44 24 04 mov 0x4(%esp),%eax
8048748: 39 d0 cmp %edx,%eax
804874a: 7e 02 jle 804874e <_Z3minii+0xe>
804874c: 89 d0 mov %edx,%eax
804874e: f3 c3 repz ret
(x86 32bit)
This min2 (mangled) is the bit alternative (from gcc):
08048750 <_Z4min2ii>:
8048750: 8b 44 24 08 mov 0x8(%esp),%eax
8048754: 8b 54 24 04 mov 0x4(%esp),%edx
8048758: 31 c9 xor %ecx,%ecx
804875a: 39 c2 cmp %eax,%edx
804875c: 0f 9c c1 setl %cl
804875f: 31 c2 xor %eax,%edx
8048761: f7 d9 neg %ecx
8048763: 21 ca and %ecx,%edx
8048765: 31 d0 xor %edx,%eax
8048767: c3 ret
It's possible for a compiler to detect this pattern and replace it with your proposal.
However, neither clang++ nor g++ does this optimization; see for instance g++ 5.2.0's assembly output.
I decide I want to benchmark a particular function, so I naïvely write code like this:
#include <ctime>
#include <iostream>
int SlowCalculation(int input) { ... }
int main() {
std::cout << "Benchmark running..." << std::endl;
std::clock_t start = std::clock();
int answer = SlowCalculation(42);
std::clock_t stop = std::clock();
double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
std::cout << "Benchmark took " << delta << " seconds, and the answer was "
<< answer << '.' << std::endl;
return 0;
}
A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:
std::clock_t start = std::clock();
std::clock_t stop = std::clock();
int answer = SlowCalculation(42);
At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.
But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.
My question is:
What is the best way to prevent code reordering in benchmarking code like this?
My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.
Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to Maxim.
I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.
For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the code with and without volatile. Inside the loop, everything is kept in registers. The only significant difference appears to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.
A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering.
Sorry, but your colleague is wrong.
The compiler does not reorder calls to functions whose definitions are not available at compile time. Simply imagine the hilarity that would ensue if the compiler reordered calls such as fork and exec, or moved code around them.
In other words, any function with no definition is a compile time memory barrier, that is, the compiler does not move subsequent statements before the call or prior statements after the call.
In your code calls to std::clock end up calling a function whose definition is not available.
I cannot recommend highly enough watching atomic Weapons: The C++ Memory Model and Modern Hardware, because it discusses misconceptions about (compile-time) memory barriers and volatile, among many other useful things.
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms
Not sure if volatile is to blame here.
The reported run-time depends on how the benchmark is run. Make sure you disable CPU frequency scaling so that it does not turn on turbo mode or switch frequency in the middle of the run. Also, micro-benchmarks should be run as real-time priority processes to avoid scheduling noise. It could be that during another run some background file indexer starts competing with your benchmark for CPU time. See this for more details.
A good practice is to measure the time it takes to execute the function a number of times and report min/avg/median/max/stdev/total time numbers. A high standard deviation may indicate that the above preparations were not performed. The first run is often the longest because the CPU cache may be cold, and it may take many cache misses and page faults, and also resolve dynamic symbols from shared libraries on the first call (lazy symbol resolution is the default run-time linking mode on Linux, for example), while subsequent calls are going to execute with much less overhead.
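A minimal sketch of that practice, assuming the SlowCalculation from the question (only min/avg/max are reported here):
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

int SlowCalculation(int input);      // defined elsewhere, as in the question

int main() {
    const int runs = 100;
    std::vector<double> samples;
    int answer = 0;
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        answer = SlowCalculation(42);
        auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    double total = 0;
    for (double s : samples) total += s;
    std::printf("min %.3f ms, avg %.3f ms, max %.3f ms (answer %d)\n",
                *std::min_element(samples.begin(), samples.end()),
                total / runs,
                *std::max_element(samples.begin(), samples.end()),
                answer);   // using answer keeps the call from being optimized away
}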
The usual way to prevent reordering is a compile barrier, i.e. asm volatile ("":::"memory"); (with gcc). This is an asm statement that emits no instructions, but we tell the compiler it clobbers memory, so the compiler is not permitted to reorder code across it. The cost of this is only the actual cost of preventing the reordering, which is obviously not the case for changing the optimisation level etc. as suggested elsewhere.
I believe _ReadWriteBarrier is equivalent for Microsoft stuff.
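A sketch of where the barrier would go in the benchmark from the question (gcc/clang inline-asm syntax; under MSVC, _ReadWriteBarrier would take its place):
std::clock_t start = std::clock();
asm volatile ("" ::: "memory");   // compile barrier: no memory access may be moved across it
int answer = SlowCalculation(42);
asm volatile ("" ::: "memory");
std::clock_t stop = std::clock();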
Per Maxim Yegorushkin's answer, reordering is unlikely to be the cause of your issues though.
Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
In order to prevent compiler reordering you may use so called compiler fences.
MSVC includes 3 compiler fences:
_ReadWriteBarrier() - full fence
_ReadBarrier() - two-sided fence for loads
_WriteBarrier() - two-sided fence for stores
ICC includes __memory_barrier() full fence.
Full fences are usually the best choice because there is no need for finer granularity at this level (compiler fences are basically costless at run-time).
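For example, around the timed call it would look roughly like this (a sketch; note that recent MSVC documentation marks _ReadWriteBarrier as deprecated in favour of C++11 atomics):
#include <intrin.h>

std::clock_t start = std::clock();
_ReadWriteBarrier();              // compiler-only fence: memory accesses are not reordered across it
int answer = SlowCalculation(42);
_ReadWriteBarrier();
std::clock_t stop = std::clock();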
Statement reordering (which most compilers do when optimization is enabled) is also the main reason why certain programs fail to operate correctly when compiled with compiler optimization.
I suggest reading http://preshing.com/20120625/memory-ordering-at-compile-time to see the potential issues we can face with compiler reordering, etc.
Related problem: how to stop the compiler from hoisting a tiny repeated calculation out of a loop
I couldn't find this anywhere - so adding my own answer 11 years after the question was asked ;).
Using volatile on variables is not what you want for that. It will cause the compiler to load and store those variables from and to RAM every single time (it must assume each access has a side effect that has to be preserved, which is what makes volatile good for I/O registers). When you are benchmarking you are not interested in measuring how long it takes to get something from memory, or write it there. Often you just want your variable to be in CPU registers.
volatile is usable if you assign to it once, outside a loop that doesn't get optimized away (like summing an array), as an alternative to printing the result. (Like the long-running function in the question.) But not inside a tiny loop; that will introduce store/reload instructions and store-forwarding latency.
I think the ONLY way to force your compiler into not optimizing your benchmark code to hell is by using asm. This allows you to fool the compiler into thinking it doesn't know anything about your variables' content or usage, so it has to do everything every single time, as often as your loop asks it to.
For example, if I wanted to benchmark m & -m where m is some uint64_t, I could try:
uint64_t const m = 0x0000080e70100000UL;
for (int i = 0; i < loopsize; ++i)
{
uint64_t result = m & -m;
}
The compiler would obviously say: I'm not even going to calculate that;
since you're not using the result. Aka, it would actually do:
for (int i = 0; i < loopsize; ++i)
{
}
Then you can try:
uint64_t const m = 0x0000080e70100000UL;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
result = m & -m;
}
and the compiler says, ok - so you want me to write to result every time
and do
uint64_t const m = 0x0000080e70100000UL;
uint64_t tmp = m & -m;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
result = tmp;
}
Spending a lot of time writing to the memory address of result loopsize times, just as you asked.
Finally you could also make m volatile, but the result would look like this in assembly:
507b: ba e8 03 00 00 mov $0x3e8,%edx
# top of loop
5080: 48 8b 05 89 ef 20 00 mov 0x20ef89(%rip),%rax # 214010 <m_test>
5087: 48 8b 0d 82 ef 20 00 mov 0x20ef82(%rip),%rcx # 214010 <m_test>
508e: 48 f7 d8 neg %rax
5091: 48 21 c8 and %rcx,%rax
5094: 48 89 44 24 28 mov %rax,0x28(%rsp)
5099: 83 ea 01 sub $0x1,%edx
509c: 75 e2 jne 5080 <main+0x120>
Reading from memory twice and writing to it once, besides the requested calculation with registers.
The correct way to do this is therefore:
for (int i = 0; i < loopsize; ++i)
{
uint64_t result = m & -m;
asm volatile ("" : "+r" (m) : "r" (result));
}
which results in the assembly code (from gcc8.2 on the Godbolt compiler explorer):
# gcc8.2 -O3 -fverbose-asm
movabsq $8858102661120, %rax #, m
movl $1000, %ecx #, ivtmp_9 # induction variable tmp_9
.L2:
mov %rax, %rdx # m, tmp91
neg %rdx # tmp91
and %rax, %rdx # m, result
# asm statement here, m=%rax result=%rdx
subl $1, %ecx #, ivtmp_9
jne .L2
ret
Doing exactly the three requested assembly instructions inside the loop, plus a sub and jne for the loop overhead.
The trick here is that we use asm volatile (see footnote 1) and tell the compiler:
"r" input operand: it uses the value of result as input so the compiler has to materialize it in a register.
"+r" input/output operand: m stays in the same register but is (potentially) modified.
volatile: it has some mysterious side effect and/or is not a pure function of the inputs; the compiler must execute it as many times as the source does. This forces the compiler to leave your test snippet alone and inside the loop. See the gcc manual's Extended Asm#Volatile section.
footnote 1: The volatile is required here or the compiler will turn this into an empty loop. Non-volatile asm (with any output operands) is considered a pure function of its inputs that can be optimized away if the result is unused. Or CSEd to only run once if used multiple times with the same input.
Everything below is not mine-- and I do not necessarily agree with it. --Carlo Wood
If you had used asm volatile ("" : "=r" (m) : "r" (result)); (with an "=r" write-only output), the compiler might choose the same register for m and result, creating a loop-carried dependency chain that tests the latency, not throughput, of the calculation.
From that, you'd get this asm:
5077: ba e8 03 00 00 mov $0x3e8,%edx
507c: 0f 1f 40 00 nopl 0x0(%rax) # alignment padding
# top of loop
5080: 48 89 e8 mov %rbp,%rax # copy m
5083: 48 f7 d8 neg %rax # -m
5086: 48 21 c5 and %rax,%rbp # m &= -m instead of using the tmp as the destination.
5089: 83 ea 01 sub $0x1,%edx
508c: 75 f2 jne 5080 <main+0x120>
This will run at 1 iteration per 2 or 3 cycles (depending on whether your CPU has mov-elimination or not.) The version without a loop-carried dependency can run at 1 per clock cycle on Haswell and later, and Ryzen. Those CPUs have the ALU throughput to run at least 4 uops per clock cycle.
This asm corresponds to C++ that looks like this:
for (int i = 0; i < loopsize; ++i)
{
m = m & -m;
}
By misleading the compiler with a write-only output constraint, we've created asm that doesn't look like the source (which looked like it was computing a new result from a constant every iteration, not using result as an input to the next iteration..)
You might want to microbenchmark latency, so you can more easily detect the benefit of compiling with -mbmi or -march=haswell to let the compiler use blsi %rax, %rax and calculate m &= -m; in one instruction. But it's easier to keep track of what you're doing if the C++ source has the same dependency as the asm, instead of fooling the compiler into introducing a new dependency.
You could make two C files, SlowCalculation compiled with g++ -O3 (high level of optimization), and the benchmark one compiled with g++ -O1 (lower level, still optimized - that may be sufficient for that benchmarking part).
According to the man page, reordering of code happens during -O2 and -O3 optimizations levels.
Since optimization happens during compilation, not linkage, the benchmark side should not be affected by code reordering.
Assuming you are using g++ - but there should be something equivalent in another compiler.
The correct way to do this in C++ is to use a class, e.g. something like
class Timer
{
std::clock_t startTime;
std::clock_t* targetTime;
public:
Timer(std::clock_t* target) : targetTime(target) { startTime = std::clock(); }
~Timer() { *targetTime = std::clock() - startTime; }
};
and use it like this:
std::clock_t slowTime;
{
Timer timer(&slowTime);
int answer = SlowCalculation(42);
}
Mind you, I don't actually believe your compiler will ever re-order like this.
There are a couple of ways that I can think of. The idea is to create compile time barriers so that compiler does not reorder a set of instructions.
One possible way to avoid reordering would be to enforce dependency among instructions that cannot be resolved by compiler (e.g. passing a pointer to the function and using that pointer in later instruction). I'm not sure how that would affect the performance of the actual code that you are interested in benchmarking.
Another possibility is to make the function SlowCalculation(42); an extern function (define this function in a separate .c/.cpp file and link the file to your main program) and declare start and stop as global variables. I do not know what are the optimizations offered by the link-time/inter-procedural optimizer of your compiler.
Also, if you compile at O1 or O0, most probably the compiler would not bother reordering instructions.
Your question is somewhat related to (Compile time barriers - compiler code reordering - gcc and pthreads)
The reordering described by your colleague simply breaks 1.9/13:
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single thread (1.10), which induces a partial order among those evaluations. Given any two evaluations A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. If A is not sequenced before B and B is not sequenced before A, then A and B are unsequenced. [ Note: The execution of unsequenced evaluations can overlap. —end note ] Evaluations A and B are indeterminately sequenced when either A is sequenced before B or B is sequenced before A, but it is unspecified which. [ Note: Indeterminately sequenced evaluations cannot overlap, but either could be executed first. —end note ]
So basically you should not think about reordering as long as you don't use threads.
I have a class:
class Vector {
public:
element* get(int i);
private:
element* getIfExists(int i);
};
get invokes getIfExists; if the element exists, it is returned; if not, some action is performed. getIfExists can signal that element i is not present
either by throwing an exception or by returning NULL.
Question: would there be any difference in performance? In one case, get will need to check == NULL; in the other, try...catch.
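For concreteness, the two variants being compared look roughly like this (a sketch; handleMissing, the exception type, and the lookup bodies are placeholders, since the question doesn't show them):
#include <cstdio>
#include <stdexcept>

struct element { int value; };

class Vector {
public:
    // Variant A: getIfExists returns NULL, get checks it.
    element* get_null(int i) {
        element* e = getIfExistsNull(i);
        if (e == NULL)
            handleMissing(i);                   // "some action is performed"
        return e;
    }
    // Variant B: getIfExists throws, get catches.
    element* get_throw(int i) {
        try {
            return getIfExistsThrow(i);
        } catch (const std::out_of_range&) {    // assumed exception type
            handleMissing(i);
            return NULL;
        }
    }
private:
    element* getIfExistsNull(int i)  { return i == 0 ? &e0 : NULL; }
    element* getIfExistsThrow(int i) {
        if (i != 0) throw std::out_of_range("no such element");
        return &e0;
    }
    void handleMissing(int i) { std::printf("element %d missing\n", i); }
    element e0{42};
};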
It's a matter of design, not performance. If it's an exceptional situation - like in your get function - then throw an exception; or even better, fire an assert, since violation of a function precondition is a programming error. If it's an expected case - like in your getIfExists function - then don't throw an exception.
Regarding performance, zero-cost exception implementations exist (although not all compilers use that strategy). This means that the overhead is only paid when an exception is thrown, which should happen... well... exceptionally.
Modern compilers implement 'zero cost' exceptions - they only incur cost when thrown, and the cost is proportional to the cleanup plus the cache miss to fetch the list of things to clean up. Therefore, if exceptions are exceptional, they can indeed be faster than return codes. And if they are unexceptional, they may be slower. And if your error occurs deep inside a chain of nested function calls, throwing can actually do much less work than propagating return codes back up. The details are fascinating and well worth googling.
But the cost is very marginal. In a tight loop it might make a difference, but generally not.
You should write the code that is easiest to reason about and maintain, and then profile it and revisit your decision only if it's a bottleneck.
(See comments!)
Without a doubt, the return-NULL variant has better performance.
You should almost never use exceptions when returning a value works too.
Since the method is named get, I assume NULL won't be a valid result value, so returning NULL should be the best solution.
If the caller does not test the result value, it dereferences a null pointer, producing a SIGSEGV, which is appropriate too.
If the method is rarely called, you should not care about micro-optimizations at all.
Which of the two translated methods looks simpler to you?
$ g++ -Os -c test.cpp
#include <cstddef>
void *get_null(int i) throw ();
void *get_throwing(int i) throw (void*);
int one(int i) {
void *res = get_null(i);
if(res != NULL) {
return 1;
}
return 0;
}
int two(int i) {
try {
void *res = get_throwing(i);
return 1;
} catch(void *ex) {
return 0;
}
}
$ objdump -dC test.o
0000000000000000 <one(int)>:
0: 50 push %rax
1: e8 00 00 00 00 callq 6 <one(int)+0x6>
6: 48 85 c0 test %rax,%rax
9: 0f 95 c0 setne %al
c: 0f b6 c0 movzbl %al,%eax
f: 5a pop %rdx
10: c3 retq
0000000000000011 <two(int)>:
11: 56 push %rsi
12: e8 00 00 00 00 callq 17 <two(int)+0x6>
17: b8 01 00 00 00 mov $0x1,%eax
1c: 59 pop %rcx
1d: c3 retq
1e: 48 ff ca dec %rdx
21: 48 89 c7 mov %rax,%rdi
24: 74 05 je 2b <two(int)+0x1a>
26: e8 00 00 00 00 callq 2b <two(int)+0x1a>
2b: e8 00 00 00 00 callq 30 <two(int)+0x1f>
30: e8 00 00 00 00 callq 35 <two(int)+0x24>
35: 31 c0 xor %eax,%eax
37: eb e3 jmp 1c <two(int)+0xb>
There will certainly be a difference in performance (maybe even a very big one if you give Vector::getIfExists a throw() specification, but I'm speculating a bit here). But IMO that's missing the forest for the trees.
The money question is: are you going to call this method so many times with an out-of-bounds parameter? And if yes, why?
Yes, there would be a difference in performance: returning NULL is less expensive than throwing an exception, and checking for NULL is less expensive than catching an exception.
Addendum: But performance is only relevant if you expect that this case will happen frequently, in which case it's probably not an exceptional case anyway. In C++, it's considered bad style to use exceptions to implement normal program logic, which this seems to be: I'm assuming that the point of get is to auto-extend the vector when necessary?
If the caller is going to be expecting to deal with the possibility of an item not existing, you should return in a way that indicates that without throwing an exception. If the caller is not going to be prepared, you should throw an exception. Of course, the called routine isn't likely to magically know whether the caller is prepared for trouble. A few approaches to consider:
Microsoft's pattern is to have a Get() method, which returns the object if it exists and throws an exception if it doesn't, and a TryGet() method, which returns a Boolean indicating whether the object existed, and stores the object (if it exists) to a Ref parameter. My big complaint with this pattern is that interfaces using it cannot be covariant.
A variation, which I often prefer for collections of reference types, is to have Get and TryGet methods, and have TryGet return null for non-existent items. Interface covariance works much better this way.
A slight variation on the above, which works even for value types or unconstrained generics, is to have the TryGet method accept a Boolean by reference, and store into that Boolean a success/fail indicator. In case of failure, the code can return an unspecified object of the appropriate type (most likely default(T)).
Another approach, which is particularly suitable for private methods, is to pass either a Boolean or an enumerated type specifying whether a routine should return null or throw an exception in case of failure. This approach can improve the quality of generated exceptions while minimizing duplicated code. For example, if one is trying to get a packet of data from a communications pipe and the caller isn't prepared for failure, and an error occurs in a routine that reads the packet header, an exception should probably be thrown by the packet-header-read routine. If, however, the caller will be prepared not to receive a packet, the packet-header-read routine should indicate failure without throwing. The cleanest way to allow for both possibilities would be for the read-packet routine to pass an "errors will be dealt with by caller" flag to the read-packet-header routine.
In some contexts, it may be useful for a routine's caller to pass a delegate to be invoked in case anticipated problems arise. The delegate could attempt to resolve the problem, and do something to indicate whether the operation should be retried, the caller should return with an error code, an exception should be raised, or something else entirely should happen. This can sometimes be the best approach, but it's hard to figure out what data should be passed to the error delegate and how it should be given control over the error handling.
In practice, I tend to use #2 a lot. I dislike #1, since I feel that the return value of a function should correspond with its primary purpose.
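Applied to the Vector from the question, approach #2 might look roughly like this in C++ (a sketch; the names and the exception type are my own choices):
#include <cstddef>
#include <stdexcept>

struct element { int value; };

class Vector {
public:
    // For callers not prepared to handle absence: throws.
    element& Get(int i) {
        element* e = TryGet(i);
        if (e == NULL)
            throw std::out_of_range("no such element");
        return *e;
    }
    // For callers that will check: returns NULL when the element is absent.
    element* TryGet(int i);   // the lookup itself, defined elsewhere
};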