What's the performance of the "address of" operator &? - c++

I have to pass a lot of pointers in my code in the midst of a big loop (so I have lots of expressions like foo(&x, &y, ...)). I was wondering whether I should store the pointers in separate variables (i.e. cache them) for performance, at the cost of introducing more variables and clutter into my code?
(Doing lots of matrix mult. and the CUBLAS library insists on pointers...)

No -- the address-of operator is about as inexpensive/fast as anything you can hope for. It's possible to overload it, and such an overload could be slower, but overloading it at all is fairly unusual.

std::addressof incurs no penalty at all. When it boils down to assembly code, we only refer to objects through their addresses anyway so the information is already at hand.
Concerning operator&, it depends whether it has been overloaded or not. The original, non-overloaded version behaves exactly like std::addressof. However if operator& has been overloaded (which is a very bad idea anyway, and quite frowned upon), all bets are off since we can't guess what the overloaded implementation will be.
So the answer to your question is: no need to store the pointers separately, you can just use std::addressof or operator& whenever you need them, even if you have to repeat it.
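For instance, a minimal sketch of what this means for the question's use case (gemm_like is a made-up stand-in for a CUBLAS-style routine, not the real cuBLAS API): taking the addresses inside the loop costs nothing compared to caching them in separate pointer variables.
#include <memory>   // std::addressof

// Hypothetical stand-in for a CUBLAS-style routine that insists on pointers.
void gemm_like(const double* a, const double* b, double* c) {
    *c += *a * *b;
}

int main() {
    double x = 2.0, y = 3.0, z = 0.0;
    for (int i = 0; i < 1000; ++i) {
        // Both forms compile to the same address computation; no need to
        // cache &x, &y, &z in separate variables before the loop.
        gemm_like(&x, &y, &z);
        gemm_like(std::addressof(x), std::addressof(y), std::addressof(z));
    }
    return z > 0.0 ? 0 : 1;
}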

In C++ there are references; what you are describing sounds more like C than C++.
That kind of signature is usually used to avoid copying: by default, when you call a function and pass arguments to it, each argument is copied into a local. Passing by reference is the technique that avoids that copying overhead.
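To illustrate the point, here is a minimal sketch (not taken from the question's code):
#include <vector>

// Pass by value: the whole vector is copied on every call.
double sum_by_value(std::vector<double> v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Pass by const reference: only an address is passed under the hood,
// so the (possibly huge) vector is not copied.
double sum_by_ref(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}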

It is very efficient. If you are using Linux, you can check with objdump. The following is what fooP(&var1, &var2) looks like on x86; taking each address is nothing more than a lea instruction.
fooP(&var1, &var2);
8048642: 8d 44 24 1c lea 0x1c(%esp),%eax -> this is to get the address of var2 to eax
8048646: 89 44 24 04 mov %eax,0x4(%esp) -> this is to save the address to the stack for later fooP call
804864a: 8d 44 24 18 lea 0x18(%esp),%eax -> address of var1
804864e: 89 04 24 mov %eax,(%esp)
8048651: e8 be ff ff ff call 8048614 <_Z4fooPPiS_>
A reference in this case compiles to exactly the same code as the above.
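For comparison, a reference-taking version of the same call (a small sketch, not from the original question) lets the compiler pass the very same addresses implicitly; the generated lea/mov sequence is essentially identical.
void fooP(int* a, int* b) { *a += *b; }   // pointer version, as disassembled above
void fooR(int& a, int& b) { a += b; }     // reference version

int main() {
    int var1 = 1, var2 = 2;
    fooP(&var1, &var2);   // explicit address-of
    fooR(var1, var2);     // the same addresses are passed, just implicitly
    return var1;
}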


How are rvalues assigned to lvalues in assembly?

First question here. In a few weeks/months I will need to write procedural code in which functions assign big (I mean really big) sets of data directly through pointers. Here is an example of the kind of code I will be writing:
void MyFunction(string* str)
{
    *str = "some data in a string";
}
As it surely is important: I am on Windows 10, in Visual Studio 2019, compiling with the default C++ compiler in Release x86.
Imagine something like this, but with strings that can contain several million characters, or with int/float arrays that also have several million elements.
So, this is a single operation assigning an rvalue through a pointer, so the destination is on the heap. Of course, if I create a local variable containing the data, it will be more than 1 MB and therefore will cause a stack overflow, right?
As I understand it, since the data only exists as an rvalue here, it doesn't have a memory existence, but I would like to know: how is the rvalue assigned through the pointer? Like, how is it done in assembly? I must say I have never done any assembly; I have a few (very few) notions, but I'd like to get into it when I have time.
Is it temporarily created on the stack or heap before being put in the final memory address? My guess is that the memory the pointer refers to is directly filled with the data, like, bit by bit, so the rvalue never exists in memory on its own.
If I'm correct, the only things that exist on the stack here are: the function call, the pointer copy, then the instruction, which should be something like "assign rvalue X to lvalue Y", and the size of the instruction doesn't depend on the size of the rvalue and lvalue, so there should not be any problem regarding the stack here.
So, if I'm correct, this code should not cause any problem, no matter how big the rvalue is, but I would still like to know how it is done exactly, assembly-wise. Note that I am not only looking for an answer, but also for some references, books, or docs that explain this in detail. I guess what I am looking for won't be in a C++ book but rather in an assembly book; this might be a good starting point to get myself into it!
Although a specific OS and compiler were mentioned, the example assembly in this answer will probably differ from what the querent's compiler would output, because I don't have a Windows 10 machine available at the time of writing and used a different environment having forgotten about Godbolt. However, this topic is general enough in my opinion that it shouldn't really matter in this specific case.
What even is a value on the right side of an assignment operator? What does assignment look like at the assembly level? Here's a simple example.
void assign_thing(int *p) {
*p = 42;
}
movl $42, (%rdi)
retq
"Move the 32-bit integer 42 into the memory location to which rdi is pointing." %rdi here represents p, and (%rdi) means *p. For something dead simple like an integer, it's pretty much that simple. How about a simple structure?
struct stuff {
int id;
float value;
char text[8];
};
void assign_thing(stuff *p) {
*p = {42, 1.5, "Hello!"};
}
movabsq $4593671619917905962, %rax
movq %rax, (%rdi)
movabsq $36762444129608, %rax
movq %rax, 8(%rdi)
retq
A little harder to read at first glance, but pretty much the same idea. The compiler was smart and packed the integer and float values 42 and 1.5 into a single 64-bit value and stuffed that directly into (%rdi). Likewise with the string "Hello!", which is short enough to fit into a single 64-bit value and gets stuffed into 8(%rdi) (8 bytes past p is the offset of text).
So far, none of the rvalues actually exist in memory when they get assigned. They're just part of the instructions. What if it's something a lot bigger, like a string?
#include <cstring>   // for strcpy

// Overflow checking omitted for brevity.
void assign_thing(char *p) {
    // Assignment with = doesn't actually do what you'd want here,
    // so this'll have to do.
    strcpy(p, "What if it's something a lot bigger, like a string?");
}
vmovups -5484(%rip), %ymm0
vmovups %ymm0, 20(%rdi) ; I'm guessing the disassembler meant to say 0x20
vmovups -5517(%rip), %ymm0
vmovups %ymm0, (%rdi)
vzeroupper
retq
Now, the rvalue does reside in memory when it gets assigned. Do note that this is not because strcpy was used instead of =, but because the compiler decided that it would be better to store that "rvalue" string somewhere in a read-only area like .rodata and just copy it over. If I had used a much shorter string, any reasonably modern compiler would probably optimize it into a few mov or movabsq instructions like in the second example. Unless p points to a buffer on the stack and your strcpy ends up overflowing it, you won't get a stack overflow here.
Now what about your example? I'm guessing that your string type is really std::string, and that's not a trivial type. So what happens there? In C++, the assignment operator = is overloadable, and std::string indeed has its own overloads, so instead of directly stuffing or copying values into the object, a special member function operator= is called. That is to say, your *str = "some data in a string" is really a str->operator=("some data in a string"). How your rvalue string gets copied is up to the implementation of std::string::operator=, but it'll most likely be optimized into something like my last example. The actual string data of an std::string resides on the heap, so stack overflow still isn't a problem here.
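To make that equivalence concrete, here is a minimal sketch; both functions end up calling the same std::string::operator=(const char*) overload.
#include <string>

void assign_sugar(std::string* str) {
    *str = "some data in a string";             // what the question writes
}

void assign_explicit(std::string* str) {
    str->operator=("some data in a string");    // what it actually calls
}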
tl;dr (this answer + the comments, compressed into a few sentences)
If your string is small enough, it probably won't exist in memory during assignment. If it's big enough, it'll sit in a read-only area somewhere and get copied over when needed. The stack is often not even involved, so don't worry about overflow.

How can I determine in C++ where one instruction (in bytes) ends and another starts? [duplicate]

This question already has an answer here:
X86 Assembly - How to calculate instruction opcodes length in bytes [closed]
How to tell length of an x86-64 instruction opcode using CPU itself?
For example, at the address 0x762C51, there is the instruction call sub_E91E50.
In bytes, this is E8 FA F1 72 00.
Next, at the address 0x762C56, there is the instruction push 0. In bytes, this is 6A 00.
Now, when it comes to C++ reading a function like this, it would only see the bytes: E8 FA F1 72 00 6A 00
How can I determine where the first instruction ends and the next one begins?
For variable-length instruction sets you can't really do this in general. Many tools will try, but it is often trivial to fool them if you set out to; compiler-generated code works better.
The best way, which won't necessarily result in a complete disassembly, is to go in execution order. You need a known-correct entry point; from there you follow the execution paths, looking for collisions and setting those aside for a human to figure out.
Simulation is even better, and it might give you better coverage in some areas, but it will also leave gaps where an execution-order disassembler wouldn't.
Even the GNU assembler messes up with variable-length instruction sets (next time, please specify the target/ISA as stated in the assembly tag). So whatever a "full disassembler" is, it can't possibly do this either, in general.
If someone has handed you a known entry point (say it's a class assignment), or you compile a function to an object file and, based on the label in the disassembly, feel confident that it is a valid entry point, then you can start disassembling there in execution order. Understand that in an object file there will be incomplete instructions (relocations) to be filled in later; once linked, if you assume that a function label's address is an entry point, you can disassemble from there in execution order.
C++ really has nothing to do with this: if you have a sequence of bytes and you are sure you know the entry point, it doesn't much matter how that code was created. Unless it intentionally contains anti-disassembly tricks (hand-written or compiler/tool created), this is generally not a problem, but technically a tool could do that, and it is trivial for a human to do it by hand.
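As a rough illustration of the "follow the execution paths" idea (not working code for real x86: decode_at is a hypothetical placeholder for a full instruction decoder, which is exactly the hard part):
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// What a decoder would have to tell us about one instruction.
struct Insn {
    std::size_t length;                   // bytes this instruction occupies
    std::vector<std::size_t> successors;  // branch/call targets, if any
    bool stops_flow;                      // ret or unconditional jmp
};

// Hypothetical placeholder: a real implementation needs a complete
// variable-length instruction decoder for the target ISA.
Insn decode_at(const std::uint8_t* /*code*/, std::size_t /*offset*/) {
    return {1, {}, true};
}

// Execution-order sweep: start at a known entry point and follow the flow.
// Offsets never reached from the entry point stay unclassified (code? data?),
// which is why a complete static disassembly is impossible in general.
std::set<std::size_t> instruction_starts(const std::uint8_t* code, std::size_t entry) {
    std::set<std::size_t> starts;
    std::vector<std::size_t> work{entry};
    while (!work.empty()) {
        std::size_t off = work.back();
        work.pop_back();
        if (!starts.insert(off).second) continue;          // already decoded
        Insn insn = decode_at(code, off);
        if (!insn.stops_flow) work.push_back(off + insn.length);
        for (std::size_t t : insn.successors) work.push_back(t);
    }
    return starts;
}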

Do macros in C++ improve performance?

I'm a beginner in C++ and I've just read that macros work by replacing text whenever needed. In this case, does this mean that it makes the .exe run faster? And how is this different than an inline function?
For example, if I have the following macro:
#define SQUARE(x) ((x) * (x))
and a normal function:
int Square(const int& x)
{
    return x * x;
}
and an inline function:
inline int Square(const int& x)
{
    return x * x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.
You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
PS: as a matter of style, avoid using const Type& for parameters of fundamental types like int or double. Simply use the type itself, in other words, use
int Square(int x)
since the copy of an int costs nothing (the reference can even make performance worse); see e.g. this question for more details.
Macros translate to a blind replacement of pattern A with pattern B. This means everything happens before the compiler kicks in. Sometimes they come in handy, but in general they should be avoided, because you can do a lot of things with them and, later on, in the debugger, you have no idea what is going on.
Besides: your approach to performance is, to put it kindly, naive. First you learn the language (which is hard for modern C++, because there are a ton of important concepts and things one absolutely needs to know and understand). Then you practice, practice, practice. And then, when you really reach the point where your existing application has performance problems, you do profiling to understand the real issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (potential bottlenecks) and configuration (in the sense of latency between different nodes in your system), and so on. Of course, you should apply common sense and not write code that obviously wastes memory or CPU cycles. But sometimes a piece of code that runs 50% slower might be 500% easier to read and maintain. And if execution time is then 500 ms instead of 250 ms, that might be totally OK (unless that specific part is called a thousand times per minute).
The difference between a macro and an inlined function is that a macro is dealt with before the compiler sees it.
On my compiler (clang++) without optimisation flags the square function won't be inlined. The code it generates looks like this
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
The imul is the assembly instruction doing the work; the rest is moving data around.
Code that calls it looks like this:
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, that imul shows up in the main function, where the function is called from in the C++ code:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to do to get a basic handle on assembly language for your machine and to use gcc -S on your source, or objdump -D on your binary (as I did here), to see exactly what is going on.
Using the macro instead of the inlined function gives something very similar:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers here with macros: what does this do?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? Nope, 42 (on my machine). It becomes
std::cout << ++x * ++x << std::endl;
which here evaluated as 6 * 7. Strictly speaking, the two unsequenced modifications of x make this undefined behaviour, so any result is possible.
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.
Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get using macros, is the same as can be achieved with inline functions - at the expense of increasing chances of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros they are typesafe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.
Using such a macro only makes sense if its argument is itself a #define'd constant, as the computation will then be folded at compile time. Even then, double-check that the result is the expected one.
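A small sketch of that constant-only use (DIM is just a made-up example constant); the expansion is a constant expression, so nothing is computed at run time:
#define SQUARE(x) ((x) * (x))
#define DIM 8

char buffer[SQUARE(DIM)];              // expands to ((8) * (8)) -> 64 bytes

// The type-safe C++ alternative gives the same compile-time result:
constexpr int square(int x) { return x * x; }
char buffer2[square(DIM)];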
When working on classic variables, the (inlined) function form should be preferred as:
It is type-safe;
It will handle expressions used as an argument in a consistent way. This not only covers the pre/post-increment case quoted by Peter, but also the case where the argument is itself some computation-intensive expression: the macro form forces that argument to be evaluated twice (and the two evaluations may not even yield the same value, by the way), versus only once for the function.
I have to admit that I used to write such macros for quick prototyping of apparently simple functions, but the time they have cost me over the years finally changed my mind!

Are compilers able to avoid branching instructions?

I've been reading about bit twiddling hacks and wondered: are compilers able to avoid branching in the following code
constexpr int min(const int lhs, const int rhs) noexcept {
    if (lhs < rhs) {
        return lhs;
    }
    return rhs;
}
by replacing it with (explanation):
constexpr int min(const int lhs, const int rhs) noexcept {
    return rhs ^ ((lhs ^ rhs) & -(lhs < rhs));
}
Are compilers able to...: Yes, definitely!
Can I rely on those optimizations?: No, you can't rely on any optimization. There may always be some strange condition under which a compiler chooses not to apply a certain optimization for some non-obvious reason, or just fails to see the possibility. Also, in general, I've made the observation that compilers are sometimes a lot dumber than people think (or the people, including me, are dumber than they think).(1)
Not asked, but a very important aspect: Can I rely on this actually being an optimization? NO! For one, (especially on x86) performance always depends on the surrounding code, and there are a lot of different optimizations that interact. Also, some architectures might offer instructions that implement the operation even more efficiently.
Should I use bit-twiddling optimizations?: In general: no -- especially not without verifying that they actually give you any benefit! Even when they do improve performance, they make your code harder to read and review, and they bake in architecture- and compiler-specific assumptions (representation of integers, execution time of instructions, penalty for branch misprediction, ...) that might lead to worse performance when you port your code to another architecture or, in the worst case, even lead to incorrect results.
My advice:
If you need to get the last bit of performance for a specific system, then just try both variants and measure (and re-verify the result every time you update your CPU and/or compiler). For any other case, assume that the compiler is at least as good at making low-level optimizations as you are. I'd also suggest that you first learn about all the optimization-related compiler flags and set up a proper benchmark BEFORE starting to use low-level optimizations in any case.
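For what it's worth, here is a crude measurement sketch along those lines (just a rough template: a serious benchmark also needs warm-up, repetitions, and care that the compiler doesn't optimise the loop away):
#include <chrono>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

int min_branch(int a, int b) { return a < b ? a : b; }
int min_bits(int a, int b)   { return b ^ ((a ^ b) & -(a < b)); }

// Times one variant over a vector of random pairs and returns microseconds.
template <typename F>
long long run(F f, const std::vector<std::pair<int, int>>& data) {
    auto t0 = std::chrono::steady_clock::now();
    long long sink = 0;
    for (const auto& p : data) sink += f(p.first, p.second);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sink=%lld\n", sink);   // keep the result observable
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> d(-1000000, 1000000);
    std::vector<std::pair<int, int>> data(10000000);
    for (auto& p : data) p = {d(gen), d(gen)};
    std::printf("branch: %lld us\n", run(min_branch, data));
    std::printf("bits:   %lld us\n", run(min_bits, data));
    return 0;
}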
I think the only area where hand optimizations are still sometimes beneficial is if you want to make optimal use of vector units. Modern compilers can auto-vectorize many things, but this is still a relatively new field, and there are certain things the compiler is just not allowed to do because they would violate guarantees from the standard (especially where floating-point operations are concerned).
(1) Some people seem to think that, independent of what their code looks like, the compiler will always produce the optimal code that provides the same semantics. For one, there are limits to what a compiler can do in a limited amount of time (there are a lot of heuristics that work MOST of the time, but not always). Second, in many cases the C++ standard requires the compiler to give certain guarantees that you are not actually interested in at the moment, but which still prevent optimizations.
clang++ (3.5.2-1) with -O3 seems to be smart enough (I'm not using C++11 or C++14; constexpr and noexcept removed from the source):
08048760 <_Z3minii>:
8048760: 8b 44 24 08 mov 0x8(%esp),%eax
8048764: 8b 4c 24 04 mov 0x4(%esp),%ecx
8048768: 39 c1 cmp %eax,%ecx
804876a: 0f 4e c1 cmovle %ecx,%eax
804876d: c3 ret
gcc (4.9.3) with -O3 instead branches, with jle:
08048740 <_Z3minii>:
8048740: 8b 54 24 08 mov 0x8(%esp),%edx
8048744: 8b 44 24 04 mov 0x4(%esp),%eax
8048748: 39 d0 cmp %edx,%eax
804874a: 7e 02 jle 804874e <_Z3minii+0xe>
804874c: 89 d0 mov %edx,%eax
804874e: f3 c3 repz ret
(x86 32bit)
This min2 (name mangled in the listing) is the bit-twiddling alternative (from gcc):
08048750 <_Z4min2ii>:
8048750: 8b 44 24 08 mov 0x8(%esp),%eax
8048754: 8b 54 24 04 mov 0x4(%esp),%edx
8048758: 31 c9 xor %ecx,%ecx
804875a: 39 c2 cmp %eax,%edx
804875c: 0f 9c c1 setl %cl
804875f: 31 c2 xor %eax,%edx
8048761: f7 d9 neg %ecx
8048763: 21 ca and %ecx,%edx
8048765: 31 d0 xor %edx,%eax
8048767: c3 ret
It's possible for a compiler to detect this pattern and replace it with your proposal.
However, neither clang++ nor g++ does this optimization; see for instance g++ 5.2.0's assembly output.

Data races, UB, and counters in C++11

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:
int num_times_done_it; // global
void doit() {
++num_times_done_it;
// do something
}
void report_stats() {
printf("called doit %i times\n", num_times_done_it);
// and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:
Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.
Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
atomic<int> num_times_done_it;
void doit() {
num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
memory_order_relaxed);
// as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Count for each thread separately and sum up after the threads have joined. For intermediate results, you may also sum up in between, though the result might be off. This pattern is also faster. You might embed it in a basic helper class for your threads so you have it everywhere, if you use it often.
And, depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic<> Weapons" talk: http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create problems with the caches, so they're not advisable.
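A minimal sketch of that per-thread-counter helper (names made up): the relaxed loads and stores keep it free of data races in the C++11 sense, while each slot is only ever written by its own thread, so no locked read-modify-write instruction is needed.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Counters {
    explicit Counters(std::size_t n) : slots(n) {}

    // Only thread `tid` writes slots[tid], so a relaxed load + store suffices.
    void bump(std::size_t tid) {
        slots[tid].store(slots[tid].load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
    }

    // May be slightly stale if called while threads are still running,
    // but it is well-defined (no data race).
    long total() const {
        long sum = 0;
        for (const auto& s : slots) sum += s.load(std::memory_order_relaxed);
        return sum;
    }

    std::vector<std::atomic<long>> slots;
};

int main() {
    const std::size_t kThreads = 4;
    Counters counters(kThreads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < kThreads; ++t)
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i) counters.bump(t);
        });
    for (auto& w : workers) w.join();
    std::printf("called doit %ld times\n", counters.total());
    return 0;
}
In a real implementation each slot would also be padded (e.g. with alignas(64)) to avoid the cache-line ping-pong mentioned above.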
It seems that the memory_order_relaxed trick is the right way to do this.
This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.
I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
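For reference, the two variants discussed in this thread side by side (a sketch of the trade-off, not a recommendation of one over the other):
#include <atomic>

std::atomic<int> num_times_done_it{0};

// Exact count: an atomic read-modify-write. No data race, no lost updates,
// but on x86 this compiles to a lock xadd, which is noticeably more expensive
// than a plain increment.
void doit_exact() {
    num_times_done_it.fetch_add(1, std::memory_order_relaxed);
}

// "Mostly right" count, as in the question's edit: a relaxed load followed by
// a relaxed store. Both accesses are atomic, so there is no data race (and
// hence no undefined behaviour), but concurrent increments can overwrite each
// other, so the total may undercount.
void doit_mostly_right() {
    num_times_done_it.store(num_times_done_it.load(std::memory_order_relaxed) + 1,
                            std::memory_order_relaxed);
}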