Do macros in C++ improve performance?

I'm a beginner in C++ and I've just read that macros work by replacing text wherever they are used. Does this mean that using them makes the .exe run faster? And how is this different from an inline function?
For example, if I have the following macro:
#define SQUARE(x) ((x) * (x))
and a normal function:
int Square(const int& x)
{
    return x * x;
}
and an inline function:
inline int Square(const int& x)
{
    return x * x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.

You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
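To make the type-safety point concrete, here is a minimal sketch (the std::string misuse and the arithmetic are invented for illustration): the macro blindly pastes text, while the function goes through normal overload resolution and type checking.
#include <string>
#define SQUARE(x) ((x) * (x))
inline int Square(int x) { return x * x; }
int main()
{
    std::string s = "oops";
    // SQUARE(s); // expands to ((s) * (s)): a confusing operator* error deep inside the expansion
    // Square(s); // a clear error: no conversion from std::string to int
    return Square(5) - SQUARE(5); // both compute 25, so this returns 0
}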
PS: as a matter of style, avoid using const Type& for parameter types that are fundamental, like int or double. Simply use the type itself, in other words, use
int Square(int x)
since passing a copy won't hurt performance for a fundamental type (the reference may even make it worse); see e.g. this question for more details.

Macros boil down to a dumb replacement of pattern A with pattern B. This means everything happens before the compiler kicks in. Sometimes they come in handy, but in general they should be avoided: you can do a lot of things with them, and later on, in the debugger, you have no idea what is going on.
Besides: your approach to performance is, well, naive, to put it kindly. First you learn the language (which is hard for modern C++, because there are a ton of important concepts and things one absolutely needs to know and understand). Then you practice, practice, practice. And then, when you really reach a point where your existing application has performance problems, you do profiling to understand the real issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (like potential bottlenecks) and configuration (in the sense of latency between different nodes in your system), and so on. Of course, you should apply common sense and not write code that obviously wastes memory or CPU cycles. But sometimes a piece of code that runs 50% slower might be 500% easier to read and maintain. And if execution time is then 500 ms instead of 250 ms, that might be totally OK (unless that specific part is called a thousand times per minute).

The difference between a macro and an inlined function is that a macro is expanded by the preprocessor before the compiler proper ever sees it.
On my compiler (clang++) without optimisation flags, the square function won't be inlined. The code it generates looks like this:
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
The imul is the assembly instruction doing the work; the rest is moving data around.
Code that calls it looks like this:
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, that imul shows up in the main function, where the function is called from in the C++ code:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to get a basic handle on the assembly language for your machine, and to use gcc -S on your source or objdump -D on your binary (as I did here) to see exactly what is going on.
Using the macro instead of the inlined function produces something very similar:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers here with macros: what does this do?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? Nope, 42. It becomes
std::cout << ++x * ++x << std::endl;
which becomes 6 * 7. (Strictly speaking, modifying x twice in one expression like this is undefined behaviour, so even that result isn't guaranteed.)
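For contrast, a quick sketch of what the inline function does with the same call; the argument is evaluated exactly once:
#include <iostream>
inline int Square(int x) { return x * x; }
int main()
{
    int x = 5;
    std::cout << Square(++x) << std::endl; // ++x is evaluated once: x becomes 6 and this prints 36
}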
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.

Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get from macros is the same as can be achieved with inline functions - at the expense of a greater chance of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros they are type safe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.

Using such a macro only makes sense if its argument is itself a #define'd constant, as the whole expression can then be folded into a constant at compile time. Even then, double-check that the result is the expected one.
When working on classic variables, the (inlined) function form should be preferred as:
It is type-safe;
It will handle expressions used as an argument in a consistent way. This not only covers the pre/post-increment case quoted by Peter: when the argument is itself some computation-intensive expression, the macro form forces that argument to be evaluated twice (and the two evaluations may not even yield the same value, by the way), versus only once for the function; see the sketch below.
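A minimal sketch of that double-evaluation point, with a hypothetical expensive() standing in for any costly or side-effecting computation:
#define SQUARE(x) ((x) * (x))
inline int Square(int x) { return x * x; }
int expensive() { return 7; } // stand-in for a slow or side-effecting computation
int macro_version()  { return SQUARE(expensive()); } // expands to expensive() * expensive(): two calls
int inline_version() { return Square(expensive()); } // expensive() is called exactly once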
I have to admit that I used to write such macros for quick prototyping of apparently simple functions, but the time they have cost me over the years finally changed my mind!

Related

Are compilers able to avoid branching instructions?

I've been reading about bit twiddling hacks and wondered: are compilers able to avoid branching in the following code:
constexpr int min(const int lhs, const int rhs) noexcept {
    if (lhs < rhs) {
        return lhs;
    }
    return rhs;
}
by replacing it with (explanation):
constexpr int min(const int lhs, const int rhs) noexcept {
    return rhs ^ ((lhs ^ rhs) & -(lhs < rhs));
}
Are compilers able to...: Yes, definitely!
Can I rely on those optimizations?: No, you can't rely on any optimization. There may always be some strange condition under which a compiler chooses not to apply a certain optimization for some non-obvious reason, or simply fails to see the possibility. Also, in general, I've observed that compilers are sometimes a lot dumber than people think (or the people, including me, are dumber than they think).(1)
Not asked, but a very important aspect: Can I rely on this actually being an optimization? NO! For one, (especially on x86) performance always depends on the surrounding code, and there are a lot of different optimizations that interact. Also, some architectures might even offer instructions that implement the operation even more efficiently.
Should I use bit-twiddling optimizations?: In general: no - especially not without verifying that they actually give you any benefit! Even when they do improve performance, they make your code harder to read and review, and they bake in architecture- and compiler-specific assumptions (representation of integers, execution time of instructions, penalty for branch misprediction ...) that might lead to worse performance when you port your code to another architecture, or - in the worst case - even to wrong results.
My advice:
If you need to get the last bit of performance for a specific system, then just try both variants and measure (and verify the result every time you update your CPU and/or compiler). For any other case, assume that the compiler is at least as good at making low-level optimizations as you are. I'd also suggest that you first learn about all optimization-related compiler flags and set up a proper benchmark BEFORE starting to use low-level optimizations in any case.
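A minimal sketch of such a benchmark using <chrono> (the two min variants and the iteration count are placeholders; a serious benchmark also needs warm-up runs, repetitions, and care to keep the optimizer from deleting the work):
#include <chrono>
#include <cstdio>
int min_branch(int a, int b) { return a < b ? a : b; }
int min_bits(int a, int b)   { return b ^ ((a ^ b) & -(a < b)); }
template <class F>
long long time_it(F f)
{
    volatile int sink = 0; // volatile keeps the loop from being optimized away entirely
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i)
        sink = f(sink, i);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
int main()
{
    std::printf("branching min:     %lld ms\n", time_it(min_branch));
    std::printf("bit-twiddling min: %lld ms\n", time_it(min_bits));
}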
The only area where hand optimizations are still sometimes beneficial is, I think, making optimal use of vector units. Modern compilers can auto-vectorize many things, but this is still a relatively new field, and there are certain things the compiler is simply not allowed to do because they would violate guarantees from the standard (especially where floating-point operations are concerned).
(1) Some people seem to think that, independent of what their code looks like, the compiler will always produce the optimal code with the same semantics. For one, there are limits to what a compiler can do in a limited amount of time (there are a lot of heuristics that work MOST of the time, but not always). Second, in many cases the C++ standard requires the compiler to give certain guarantees that you are not actually interested in at the moment but that still prevent optimizations.
clang++ (3.5.2-1) seems to be smart enough at -O3 (I'm not using C++11 or C++14; constexpr and noexcept were removed from the source):
08048760 <_Z3minii>:
8048760: 8b 44 24 08 mov 0x8(%esp),%eax
8048764: 8b 4c 24 04 mov 0x4(%esp),%ecx
8048768: 39 c1 cmp %eax,%ecx
804876a: 0f 4e c1 cmovle %ecx,%eax
804876d: c3 ret
gcc (4.9.3) at -O3 instead does branch with jle:
08048740 <_Z3minii>:
8048740: 8b 54 24 08 mov 0x8(%esp),%edx
8048744: 8b 44 24 04 mov 0x4(%esp),%eax
8048748: 39 d0 cmp %edx,%eax
804874a: 7e 02 jle 804874e <_Z3minii+0xe>
804874c: 89 d0 mov %edx,%eax
804874e: f3 c3 repz ret
(x86, 32-bit)
This is min2 (mangled name _Z4min2ii), the bit-twiddling alternative, from gcc:
08048750 <_Z4min2ii>:
8048750: 8b 44 24 08 mov 0x8(%esp),%eax
8048754: 8b 54 24 04 mov 0x4(%esp),%edx
8048758: 31 c9 xor %ecx,%ecx
804875a: 39 c2 cmp %eax,%edx
804875c: 0f 9c c1 setl %cl
804875f: 31 c2 xor %eax,%edx
8048761: f7 d9 neg %ecx
8048763: 21 ca and %ecx,%edx
8048765: 31 d0 xor %edx,%eax
8048767: c3 ret
It's possible for a compiler to detect this pattern and replace it with your proposal.
However, neither clang++ nor g++ performs this optimization; see for instance g++ 5.2.0's assembly output.

Data races, UB, and counters in C++11

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:
int num_times_done_it; // global
void doit() {
    ++num_times_done_it;
    // do something
}
void report_stats() {
    printf("called doit %i times\n", num_times_done_it);
    // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:
Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.
Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
atomic<int> num_times_done_it;
void doit() {
    num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                            memory_order_relaxed);
    // as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Count for each thread separately and sum up after the threads have joined. For intermediate results, you may also sum up in between, though your result might be off. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywhere, if you use it often.
And, depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create problems with the caches, so it's not advisable.
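A rough sketch of that per-thread helper class (all names are invented; a real version would pad each slot onto its own cache line to avoid false sharing):
#include <cstdio>
#include <thread>
#include <vector>
struct PerThreadCounter {
    explicit PerThreadCounter(unsigned nthreads) : slots(nthreads, 0) {}
    void bump(unsigned tid) { ++slots[tid]; } // each thread writes only its own slot: no data race
    long long total() const { // safe once the threads have joined
        long long sum = 0;
        for (long long v : slots) sum += v;
        return sum;
    }
    std::vector<long long> slots;
};
int main()
{
    const unsigned nthreads = 4;
    PerThreadCounter counter(nthreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&counter, t] {
            for (int i = 0; i < 1000000; ++i)
                counter.bump(t); // the body of doit() would go here
        });
    for (auto& w : workers) w.join();
    std::printf("called doit %lld times\n", counter.total());
}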
It seems that the memory_order_relaxed trick is the right way to do this.
This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.
I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.

What's the performance of the "address of" operator &?

I have to pass a lot of pointers in my code in the midst of a big loop (so I have lots of expressions like foo(&x, &y, ...)). I was wondering whether I should store the pointers as separate variables (i.e. cache them) for performance, at the cost of introducing more variables and clutter in my code?
(Doing lots of matrix mult. and the CUBLAS library insists on pointers...)
No -- the address-of operator is about as inexpensive/fast as anything you can hope for. It's possible to overload it, and such an overload could be slower, but overloading it at all is fairly unusual.
std::addressof incurs no penalty at all. When it boils down to assembly code, we only refer to objects through their addresses anyway so the information is already at hand.
Concerning operator&, it depends whether it has been overloaded or not. The original, non-overloaded version behaves exactly like std::addressof. However if operator& has been overloaded (which is a very bad idea anyway, and quite frowned upon), all bets are off since we can't guess what the overloaded implementation will be.
So the answer to your question is: no need to store the pointers separately, you can just use std::addressof or operator& whenever you need them, even if you have to repeat it.
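A tiny illustration of that last point (the Wrapped type and its deliberately unhelpful operator& are made up): std::addressof still yields the real address even when operator& has been overloaded.
#include <iostream>
#include <memory>
struct Wrapped {
    int value = 0;
    int* operator&() { return nullptr; } // a deliberately broken overload, purely for illustration
};
int main()
{
    Wrapped w;
    std::cout << &w << '\n';                // calls the overload: prints a null pointer value
    std::cout << std::addressof(w) << '\n'; // bypasses the overload: prints the real address of w
}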
In C++ there are references; what you are describing sounds more like C-style code than C++.
That kind of signature for C++ functions is usually used to avoid copying: by default, when you call a function and pass arguments to it, those arguments are copied into locals, and passing by reference is the technique that lets you avoid that copying overhead.
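Roughly what this answer is getting at, as a sketch (foo_ptr and foo_ref are invented names):
void foo_ptr(const double* x, const double* y); // C-style: the caller writes foo_ptr(&x, &y)
void foo_ref(const double& x, const double& y); // C++-style: the caller writes foo_ref(x, y) and still avoids any copy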
It would be very efficient. If you are using Linux, you can check by using objdump. The following is what fooP(&var1, &var2) looks like on the x86 architecture. It is nothing more than an lea instruction.
fooP(&var1, &var2);
8048642: 8d 44 24 1c lea 0x1c(%esp),%eax -> this is to get the address of var2 to eax
8048646: 89 44 24 04 mov %eax,0x4(%esp) -> this is to save the address to the stack for later fooP call
804864a: 8d 44 24 18 lea 0x18(%esp),%eax -> address of var1
804864e: 89 04 24 mov %eax,(%esp)
8048651: e8 be ff ff ff call 8048614 <_Z4fooPPiS_>
A reference in this case actually compiles to the same thing as the above.

NULL vs throw and performance

I have a class:
class Vector {
public:
    element* get(int i);
private:
    element* getIfExists(int i);
};
get invokes getIfExists; if the element exists, it is returned, and if not, some action is performed. getIfExists can signal that some element i is not present either by throwing an exception or by returning NULL.
Question: would there be any difference in performance? In one case, get will need to check == NULL; in the other, try ... catch.
It's a matter of design, not performance. If it's an exceptional situation - like in your get function - then throw an exception, or even better fire an assert, since violation of a function precondition is a programming error. If it's an expected case - like in your getIfExists function - then don't throw an exception.
Regarding performance, zero-cost exception implementations exist (although not all compilers use that strategy). This means that the overhead is only paid when an exception is thrown, which should happen... well... exceptionally.
Modern compilers implement 'zero cost' exceptions - they only incur cost when thrown, and the cost is proportional to the cleanup plus the cache miss to fetch the list of things to clean up. Therefore, if exceptions are exceptional, they can indeed be faster than return codes. And if they are unexceptional, they may be slower. And if your error occurs deep in a function in a function in a function call, throwing may actually do much less work. The details are fascinating and well worth googling.
But the cost is very marginal. In a tight loop it might make a difference, but generally not.
You should write the code that is easiest to reason about and maintain, and then profile it and revisit your decision only if it's a bottleneck.
(See comments!)
Without a doubt the return NULL variant has better performance.
You should generally not use exceptions when using return values is possible.
Since the method is named get, I assume NULL won't be a valid result value, so returning NULL should be the best solution.
If the caller does not test the result value, it dereferences a null pointer, producing a SIGSEGV, which is appropriate too.
If the method is rarely called, you should not care about micro-optimizations at all.
Which translated method looks easier to you?
$ g++ -Os -c test.cpp
#include <cstddef>
void *get_null(int i) throw ();
void *get_throwing(int i) throw (void*);
int one(int i) {
    void *res = get_null(i);
    if(res != NULL) {
        return 1;
    }
    return 0;
}
int two(int i) {
    try {
        void *res = get_throwing(i);
        return 1;
    } catch(void *ex) {
        return 0;
    }
}
$ objdump -dC test.o
0000000000000000 <one(int)>:
0: 50 push %rax
1: e8 00 00 00 00 callq 6 <one(int)+0x6>
6: 48 85 c0 test %rax,%rax
9: 0f 95 c0 setne %al
c: 0f b6 c0 movzbl %al,%eax
f: 5a pop %rdx
10: c3 retq
0000000000000011 <two(int)>:
11: 56 push %rsi
12: e8 00 00 00 00 callq 17 <two(int)+0x6>
17: b8 01 00 00 00 mov $0x1,%eax
1c: 59 pop %rcx
1d: c3 retq
1e: 48 ff ca dec %rdx
21: 48 89 c7 mov %rax,%rdi
24: 74 05 je 2b <two(int)+0x1a>
26: e8 00 00 00 00 callq 2b <two(int)+0x1a>
2b: e8 00 00 00 00 callq 30 <two(int)+0x1f>
30: e8 00 00 00 00 callq 35 <two(int)+0x24>
35: 31 c0 xor %eax,%eax
37: eb e3 jmp 1c <two(int)+0xb>
There will certainly be a difference in performance (maybe even a very big one if you give Vector::getIfExists a throw() specification, but I'm speculating a bit here). But IMO that's missing the forest for the trees.
The money question is: are you going to call this method so many times with an out-of-bounds parameter? And if yes, why?
Yes, there would be a difference in performance: returning NULL is less expensive than throwing an exception, and checking for NULL is less expensive than catching an exception.
Addendum: But performance is only relevant if you expect that this case will happen frequently, in which case it's probably not an exceptional case anyway. In C++, it's considered bad style to use exceptions to implement normal program logic, which this seems to be: I'm assuming that the point of get is to auto-extend the vector when necessary?
If the caller is going to be expecting to deal with the possibility of an item not existing, you should return in a way that indicates that without throwing an exception. If the caller is not going to be prepared, you should throw an exception. Of course, the called routine isn't likely to magically know whether the caller is prepared for trouble. A few approaches to consider:
Microsoft's pattern is to have a Get() method, which returns the object if it exists and throws an exception if it doesn't, and a TryGet() method, which returns a Boolean indicating whether the object existed, and stores the object (if it exists) to a Ref parameter. My big complaint with this pattern is that interfaces using it cannot be covariant.
A variation, which I often prefer for collections of reference types, is to have Get and TryGet methods, and have TryGet return null for non-existent items. Interface covariance works much better this way.
A slight variation on the above, which works even for value types or unconstrained generics, is to have the TryGet method accept a Boolean by reference and store a success/fail indicator to that Boolean. In case of failure, the code can return an unspecified object of the appropriate type (most likely default<T>).
Another approach, which is particularly suitable for private methods, is to pass either a Boolean or an enumerated type specifying whether a routine should return null or throw an exception in case of failure. This approach can improve the quality of generated exceptions while minimizing duplicated code. For example, if one is trying to get a packet of data from a communications pipe and the caller isn't prepared for failure, and an error occurs in a routine that reads the packet header, an exception should probably be thrown by the packet-header-read routine. If, however, the caller will be prepared not to receive a packet, the packet-header-read routine should indicate failure without throwing. The cleanest way to allow for both possibilities would be for the read-packet routine to pass an "errors will be dealt with by caller" flag to the read-packet-header routine.
In some contexts, it may be useful for a routine's caller to pass a delegate to be invoked in case anticipated problems arise. The delegate could attempt to resolve the problem, and do something to indicate whether the operation should be retried, the caller should return with an error code, an exception should be raised, or something else entirely should happen. This can sometimes be the best approach, but it's hard to figure out what data should be passed to the error delegate and how it should be given control over the error handling.
In practice, I tend to use #2 a lot. I dislike #1, since I feel that the return value of a function should correspond with its primary purpose.
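A loose C++ rendering of pattern #2 for the Vector class from the question (the element contents and the bounds logic are made up for illustration):
#include <stdexcept>
#include <vector>
struct element { int value; };
class Vector {
public:
    element* get(int i) { // for callers not prepared to handle a missing element
        element* e = tryGet(i);
        if (e == nullptr) throw std::out_of_range("Vector::get");
        return e;
    }
    element* tryGet(int i) { // for callers that check the returned pointer themselves
        if (i < 0 || i >= static_cast<int>(data.size())) return nullptr;
        return &data[i];
    }
private:
    std::vector<element> data;
};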

GCC function padding value

Whenever I compile C or C++ code with optimizations enabled, GCC aligns functions to a 16-byte boundary (on IA-32). If the function is shorter than 16 bytes, GCC pads it with some bytes, which don't seem to be random at all:
19: c3 ret
1a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
It always seems to be either 8d b6 00 00 00 00 ... or 8d 74 26 00.
Do function padding bytes have any significance?
The padding is created by the assembler, not by gcc. It merely sees a .align directive (or equivalent) and doesn't know whether the space to be padded is inside a function (e.g. loop alignment) or between functions, so it must insert NOPs of some sort. Modern x86 assemblers use the largest possible NOP opcodes with the intention of spending as few cycles as possible if the padding is for loop alignment.
Personally, I'm extremely skeptical of alignment as an optimization technique. I've never seen it help much, and it can definitely hurt by increasing the total code size (and cache utilization) tremendously. If you use the -Os optimization level, it's off by default, so there's nothing to worry about. Otherwise you can disable all the alignments with the proper -f options.
The assembler first sees an .align directive. Since it doesn't know whether this address is within a function body or not, it cannot output zero bytes (0x00) and must generate NOPs (0x90) instead.
However:
lea esi,[esi+0x0] ; does nothing, pseudocode: ESI = ESI + 0
executes in fewer clock cycles than
nop
nop
nop
nop
nop
nop
If this code happened to fall within a function body (for instance, loop alignment), the lea version would be much faster, while still "doing nothing."
The instruction lea 0x0(%esi),%esi just loads the value of %esi back into %esi - it's a no-operation (NOP), which means that if it's executed it will have no effect.
This just happens to be a single-instruction, 6-byte NOP. 8d 74 26 00 is a 4-byte encoding of the same instruction.