If I define a static instance of a class, is there an optimization in compilers (particularly g++/clang) to omit the base register (for thiscalls) when data members are accessed directly or indirectly (I mean the [base + index * scale + displacement] formula), and just use a single displacement constant for all of them? All member functions may become static (in the case of a sole instance of the class this is reasonable).
I can't check this, because on godbolt.org the compiler aggressively optimizes the following code to xor eax, eax; ret:
struct A
{
    int i;
    void f()
    {
        ++i;
    }
};

static A a;

int main(int argc, char * argv[])
{
    a.i = argc;
}
Short answer: Maybe.
Long answer: A modern compiler certainly has the ability to optimize away fetching the this pointer, and using complex addressing modes is definitely within the reach of all modern compilers that I'm aware of (including, but not limited to: gcc, clang and MS Visual C).
Whether a particular compiler chooses to do so on a specific construct is down to how well the compiler "understands" the code presented to it. As you've just experienced, the compiler removes all of your code, because it doesn't actually "do" anything. You're just assigning a member of a global struct, which is never again used, so the compiler can reason that "well, you never use it again, so I won't do that". Remove static, and it's plausible that the compiler can't know that it's not used elsewhere. Or print the value of a.i, or pass it to an external function that can't be inlined, etc, etc.
In your example, I would really just expect the compiler to store the value of argc into the address of a.i, and that can probably be done in two instructions, move argc from stack into a register, and move that register into the memory calculated for a.i - which is probably a constant address according to the compiler. So no fancy addressing modes needed in this case.
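To actually see the addressing, keep the store observable. A minimal sketch, assuming you simply drop the static so that a gets external linkage and the compiler must assume other translation units read it:

struct A
{
    int i;
    void f()
    {
        ++i;
    }
};

A a; // external linkage: stores to a.i can no longer be discarded

int main(int argc, char * argv[])
{
    a.i = argc;
    a.f(); // inlined; no this pointer materializes
}

On x86-64, gcc and clang then typically fold the store and the increment into a single access of the form mov DWORD PTR a[rip], eax - a fixed (RIP-relative) displacement and no base register for this, which is exactly the optimization asked about.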
Related
In this question Will a static variable always use up memory? it is stated that compilers are allowed to optimize away a static variable if the address is never taken, e.g. like the following:
#include <cstdio>

void f() {
    static int i = 3;
    printf( "%d", i );
}
If there exists a function which takes its arguments by reference, is the compiler still allowed to optimize away the variable, e.g. as in
void ref( int & i ) {
    printf( "%d", i );
}

void f() {
    static int i = 3;
    ref( i );
}
Is the situation different for the "perfect forwarding" case? Here the function body is empty on purpose:
template< typename T >
void fwd( T && i ) {
}

void f() {
    static int i = 3;
    fwd( i );
}
Furthermore, would the compiler be allowed to optimize the call in the following case? (The function body is empty on purpose again.)
void ptr( int * i ) {
}

void f() {
    static int i = 3;
    ptr( &i );
}
My questions arise from the fact that references are not pointers according to the standard - but they are usually implemented as pointers.
Apart from "is the compiler allowed to?", I am actually more interested in whether compilers actually do this kind of optimization.
that compilers are allowed to optimize away a static variable if the address is never taken
You seem to have concentrated on the wrong part of the answer. The answer states:
the compiler can do anything it wants to your code so long as the observable behavior is the same
The end. You can take the address or not take it, calculate the meaning of life and calculate how to heal cancer - the only thing that matters is the observable effect. As long as you don't actually heal cancer (or output the results of the calculations...), all the calculations are just no-ops.
If there exists a function which takes its arguments by reference, is the compiler still allowed to optimize away the variable
Yes. The code is just putchar('3').
Is the situation different for the "perfect forwarding" case
No. Since fwd has an empty body here, this f() is just a no-op - the variable can still be optimized away.
would the compiler be allowed to optimize the call in the following case
Yes. This code has no observable effect, just like the fwd case. The call to f() can just be removed.
in whether compilers do this kind of optimization?
Copy your code to https://godbolt.org/ and inspect the assembly code. Even with no experience in assembly code, you will see differences with different code and compilers.
Choose x86 gcc (trunk) and remember to enable optimizations -O. Copy code with static, then remove static - did the code change? Repeat for all code snippets.
Compilers are allowed to optimize out variables under the "as-if" rule, meaning that the compiler is allowed to do any optimization that doesn't alter the observable behaviour of the program. Whether the optimization actually occurs depends on how good the compiler's optimizer is, what optimization level you request, and whether the optimization belongs to a class of optimizations that actually improve performance (humans are not very good at predicting this).
In all of the examples you gave, the as-if rule gives the compiler latitude to eliminate the static variable.
In example 1, the definition of f is equivalent to void f() { printf("%d", 3); }. Since this has the exact same observable behaviour as the f you wrote, the compiler is allowed to replace one by the other, optimizing out the variable.
In example 2, since fwd does nothing, the definition of f is equivalent to void f() {}. Again, the as-if rule allows the compiler to replace the f you wrote with this empty function.
Example 3 is very similar to example 2 in terms of the implications of the as-if rule.
If you want to see whether a compiler will actually perform these optimizations, Godbolt is very useful. For example, if you look here, you'll see that at -O2, both GCC and Clang will perform the optimization described for example 1. They probably do this by first inlining ref into f.
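For contrast, here is a minimal sketch (my own, not from the question) where the as-if rule does not permit removing the static, because its address escapes to a function the compiler can't see into; sink is a hypothetical function defined in another translation unit:

void sink(int *); // defined in another translation unit

void f() {
    static int i = 3;
    sink( &i ); // the callee may store the pointer and read or write i
                // during a later call, so i needs real, persistent storage
}

Here neither GCC nor Clang can fold i away (unless link-time optimization lets them see into sink).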
Writes to volatile variables are side effects in C++ and generally can't be optimized out under the as-if rule. In practice, this usually means that on inspection of the assembly you'll see one store for each volatile store performed by the abstract machine[1].
However, it isn't clear to me if the stores must be performed in the following case where the underlying object is not volatile, but the stores are done through a pointer-to-volatile:
void vtest() {
    int buf[1];
    volatile int * vptr = buf;
    *vptr = 0;
    *vptr = 1;
    *vptr = 2;
}
Here, gcc does in fact optimize out all of the stores. Clang does not. Oddly, the behavior depends on the buffer size: with buf[3] gcc emits the stores, but with buf[4] it doesn't and so on.
Is gcc's behavior here legal?
[1] With some small variations; e.g., some compilers will use a single read-modify-write instruction on x86 to implement something like v++ where v is volatile.
While it would be useful for the C and C++ Standards to recognize a category of implementations where volatile loads and stores have particular semantics, and to report via predefined macros, intrinsics, or other such means what semantics a particular implementation is using, neither Standard presently does so. Given a loop:
void hang_if_nonzero(int mode)
{
    // the unary + forces an lvalue-to-rvalue conversion, i.e. an actual volatile read
    do { +*(volatile int*)0x1234; } while(mode);
}
a compiler would be required to generate code that will block program execution if mode is non-zero, because the volatile read is defined as being a side effect in and of itself, regardless of whether there is any means by which the effect of executing it could be distinguished from that of skipping it. There would be no requirement, however, that the compiler actually generate any load instructions for the volatile access. If the compiler specified that it was only for use on hardware platforms where the effect of reading address 0x1234 would be indistinguishable from the effect of skipping the read, it would be allowed to skip the read.
In cases where an object's address is taken, but a compiler can account for all the ways in which the address is used and code never inspects the representation of the address, a compiler would not be required to allocate "normal" addressable storage but may at its leisure allocate a register or other form of storage which wouldn't be accessed via normal loads and stores. It may even pretend to allocate storage without actually doing so if it can tell what value an object would contain when accessed. If e.g. a program were to do something like:
int test(int mode)
{
    int a[2] = {1,2};
    int *p = a;
    return p[mode & 1] + p[mode & 1];
}
a compiler wouldn't be required to actually allocate any storage for a, but could instead at its leisure generate code equivalent to return (1+(mode & 1)) << 1;. Even if p were declared as int volatile *p = a;, that would not create a need for the compiler to allocate addressable storage for a, since a compiler could still account for everything done through pointers to a and would thus have no obligation to keep a in addressable storage. A compiler would thus be allowed to treat a read of a[mode & 1] as equivalent to evaluating the expression (1+(mode & 1)). If the read were done through a volatile pointer, then it would need to be treated as a side effect for purposes of determining whether a loop may be omitted, but there would be no requirement that the read itself actually do anything.
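For contrast, a sketch (mine, for godbolt inspection) of the case where the object itself is volatile: all three stores are then side effects of the abstract machine, and every conforming compiler must emit them, unlike the pointer-to-volatile case above where gcc and clang disagree:

void v_obj() {
    volatile int x; // the object itself is volatile
    x = 0;          // each of these three stores must appear in the asm
    x = 1;
    x = 2;
}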
This is related to Why can't GCC generate an optimal operator== for a struct of two int32s?. I was playing around with the code from that question at godbolt.org and noticed this odd behavior.
struct Point {
    int x, y;
};

bool nonzero_ptr(Point const* a) {
    return a->x || a->y;
}

bool nonzero_ref(Point const& a) {
    return a.x || a.y;
}
https://godbolt.org/z/e49h6d
For nonzero_ptr, clang -O3 (all versions) produces this or similar code:
mov al, 1
cmp dword ptr [rdi], 0
je .LBB0_1
ret
.LBB0_1:
cmp dword ptr [rdi + 4], 0
setne al
ret
This strictly implements the short-circuiting behavior of the C++ function, loading the y field only if the x field is zero.
For nonzero_ref, clang 3.6 and earlier generate the same code as they do for nonzero_ptr, but clang 3.7 through 11.0.1 produce
mov eax, dword ptr [rdi + 4]
or eax, dword ptr [rdi]
setne al
ret
which loads y unconditionally. No version of clang is willing to do that when the parameter is a pointer. Why?
The only situation I can think of (on the x64 platform) where the behavior of the branching code would be observably different is when there's no memory mapped at [rdi+4], but I'm still unsure why clang would consider that case important for pointers and not references. My best guess is that there is some language-lawyery argument that references must be to "full objects" and pointers needn't be:
char* p = alloc_4k_page_surrounded_by_guard_pages();
int* pi = reinterpret_cast<int*>(p + 4096 - sizeof(int));
Point* ppt = reinterpret_cast<Point*>(pi); // ok???
ppt->x = 42; // ok???
Point& rpt = *ppt; // UB???
But if the spec implies that, I'm not seeing how.
This is a missed optimization; the branchless code is safe for both C++ source versions.
In Why is gcc allowed to speculatively load from a struct? GCC actually is speculatively loading both struct members through a pointer even though the C source only references one or the other. So at least GCC developers have decided that this optimization is 100% safe, in their interpretation of the C and C++ standards (I think that's intentional, not a bug). Clang generates a 0 or 1 index to choose which int to load, so clang is still just as reluctant as in your case to invent a load. (C vs C++: same asm with or without -xc, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)
The obvious difference in your asm is that the pointer version avoids access to a->y if a->x != 0, and that this only matters for correctness¹ if a->y was in an unmapped page; you're right about that being the relevant corner case.
But ISO C++ doesn't allow partial objects. The page-boundary setup in your example is I'm pretty sure undefined behaviour. In a path of execution that reads a->x, the compiler can assume it's safe to also read a->y.
This would of course not be the case for int *p; and p[0] || p[1], because it's totally valid to have an implicit-length 0-terminated array that happens to be 1 element long, in the last 4 bytes of a page.
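A sketch of that array case (my own), where the branch really is required:

bool nonzero_arr(const int *p) {
    // p may legitimately point to a lone int in the last 4 bytes of a
    // mapped page, so p[1] must not be touched when p[0] is nonzero
    return p[0] || p[1];
}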
As #Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it does internally transform to something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers.
It can always do it for reference args because references are guaranteed non-NULL. It would be "even more" UB for the caller to do nonzero_ref(*ppt), like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.
An experiment: deref the pointer to get a full tmp object
bool nonzero_ptr_full_deref(Point const* pa) {
    Point a = *pa;
    return a.x || a.y;
}
https://godbolt.org/z/ejrn9h - compiles branchlessly, same as nonzero_ref. Not sure what / how much this tells us. This is what I expected, given that it makes access to a->y effectively unconditional in the C++ source.
Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.
Making asm like this doesn't "introduce data-race UB" because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from [rdi+4] so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe, unlike writes, and is allowed because it's not volatile so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid Point object.
Part of data races (on non-atomic objects) being Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another is to allow compilers to assume that it's safe to reload something they accessed once, and expect the same value unless there's an acquire or seq_cst load between the two points. Even making code that would crash if the 2nd load differed from the first. That's irrelevant in this case because we're not talking about turning 1 access into 2 (instead 0 into 1 whose value may not matter), but is why roll-your-own atomics (e.g. in the Linux kernel) need to use volatile* casts for ACCESS_ONCE (https://lwn.net/Articles/793253/#Invented%20Loads).
I believe that from the point of view of standard C++, the compiler could emit the same code for both, since there is no provision in the standard for "partial objects" like the one you've constructed. The fact that it doesn't could simply be a missed optimization.
One could compare code like a->x || b->y where the compiler really does have to emit a branch, since the caller could legally pass a null or invalid pointer for b so long as a->x is nonzero. On the other hand, if a,b are references, then a.x || b.y should not need a branch according to the standard, since they must always be references to valid objects. So the "missed optimization" in your nonzero_ptr could just be the compiler not noticing that it can take advantage of the fact that the pointers in a->x and a->y are the same pointer.
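For illustration, a sketch of that two-pointer case (mine, reusing Point from the question):

bool nonzero_two(Point const* a, Point const* b) {
    // b may be null or dangling whenever a->x is nonzero, so the
    // short-circuit must be preserved: b->y can't be loaded unconditionally
    return a->x || b->y;
}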
Alternatively, it's possible that clang is, as an extension, trying to produce code that will still work when you use non-standard features to create objects in which only some members can be accessed. The fact that this works for pointers but not for references could be a bug or limitation of that extension, but I don't think it's any sort of conformance violation.
I learned that pointer aliasing may hurt performance, and that a __restrict__ attribute (in GCC, or equivalent attributes in other implementations) may help keep track of which pointers should or should not be aliased. Meanwhile, I also learned that GCC's implementation of valarray stores a __restrict__'ed pointer (line 517 in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.1/valarray-source.html), which I think hints to the compiler (and responsible users) that the private pointer can be assumed not to be aliased anywhere in valarray's methods.
But if we alias a pointer to a valarray object, for example:
#include <valarray>

int main() {
    std::valarray<double> *a = new std::valarray<double>(10);
    std::valarray<double> *b = a;
    return 0;
}
is it valid to say that the member pointer of a is aliased too? And would the very existence of b hurt any optimizations that valarray methods could benefit otherwise? (Is it bad practice to point to optimized pointer containers?)
Let's first understand how aliasing hurts optimization.
Consider this code,
void
process_data(float *in, float *out, float gain, int nsamps)
{
    int i;
    for (i = 0; i < nsamps; i++) {
        out[i] = in[i] * gain;
    }
}
In C or C++, it is legal for the parameters in and out to point to overlapping regions in memory.... When the compiler optimizes the function, it does not in general know whether in and out are aliases. It must therefore assume that any store through out can affect the memory pointed to by in, which severely limits its ability to reorder or parallelize the code (For some simple cases, the compiler could analyze the entire program to determine that two pointers cannot be aliases. But in general, it is impossible for the compiler to determine whether or not two pointers are aliases, so to be safe, it must assume that they are).
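For comparison, here is a sketch of the same function with GCC's __restrict__ extension, which promises the compiler that in and out do not overlap and so frees it to reorder and vectorize the loop:

void
process_data_restrict(float * __restrict__ in, float * __restrict__ out,
                      float gain, int nsamps)
{
    int i;
    for (i = 0; i < nsamps; i++) {
        out[i] = in[i] * gain; // loads from in[] cannot alias stores to out[]
    }
}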
Coming to your code,
#include <valarray>

int main() {
    std::valarray<double> *a = new std::valarray<double>(10);
    std::valarray<double> *b = a;
    return 0;
}
Since a and b are aliases, the underlying storage used by valarray will also be aliased (I think it uses an array; I'm not very sure about this). So, any part of your code that uses a and b in a fashion similar to that shown above will not benefit from compiler optimizations like parallelization and reordering. Note that JUST the existence of b will not hurt optimization - what matters is how you use it.
Credits:
The quoted part and the code are taken from here. This should serve as a good source for more information about the topic as well.
is it valid to say that the member pointer of a is aliased too?
Yes. For example, (*a)[0] and (*b)[0] reference the same object. That's aliasing.
And would the very existence of b hurt any optimizations that valarray methods could benefit otherwise?
No.
You haven't done anything with b in your sample code. Suppose you have a function much larger than this sample code that starts with the same construct. There's usually no problem if the first several lines of that function use a but never b, and the remaining lines use b but never a. Usually. (Optimizing compilers do rearrange lines of code, however.)
If on the other hand you intermingle uses of a and b, you aren't hurting the optimizations. You are doing something much worse: You are invoking undefined behavior. "Don't do it" is the best solution to the undefined behavior problem.
Addendum
The C restrict and gcc __restrict__ keywords are not constraints on the developers of the compiler or the standard library. Those keywords are promises to the compiler/library that restricted data do not overlap other data. The compiler/library doesn't check whether the programmer violated this promise. If this promise enables certain optimizations that might otherwise be invalid with overlapping data, the compiler/library is free to apply those optimizations.
What this means is that restrict (or __restrict__) is a restriction on you, not the compiler. You can violate those restrictions even without your b pointer. For example, consider
*a = (*a)[std::slice(a->size() - 1, a->size(), -1)];
This is undefined behavior.
Just wondering: When I add restrict to a pointer, I tell the compiler that the pointer is not an alias for another pointer. Let's assume I have a function like:
// Constructed example
void foo (float* result, const float* a, const float* b, const size_t size)
{
    for (size_t i = 0; i < size; ++i)
    {
        result [i] = a [0] * b [i];
    }
}
If the compiler has to assume that result might overlap with a, it has to refetch a[0] each time. But, as a is marked const, the compiler could also assume that a[0] is fixed, and hence fetching it once is ok.
Question is, in a situation like this, what is the recommend way to work with restrict? I surely don't want the compiler to refetch a each time, but I couldn't find good information about how restrict is supposed to work here.
Your pointer is const, telling anyone calling your function that you won't touch the data which is pointed at through that variable. Unfortunately, the compiler still won't know if result is an alias of the const pointers. You can always use a non-const pointer as a const pointer. For example, a lot of functions take a const char (i.e. string) pointer as a parameter, but you can, if you wish, pass them a non-const pointer; the function is merely making you a promise that it won't use that particular pointer to change anything.
Basically, to get closer to your question, you'd need to add restrict to a and b in order to 'promise' the compiler that whoever uses this function won't pass in result as an alias to a or b. Assuming, of course, you're able to make such a promise.
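Concretely, that would look like the following sketch, using the extension spelling __restrict__ (restrict itself is a C-only keyword, as a later answer notes):

#include <cstddef>

void foo (float* __restrict__ result, const float* __restrict__ a,
          const float* __restrict__ b, const size_t size)
{
    for (size_t i = 0; i < size; ++i)
    {
        // a[0] may now be loaded once before the loop: no store
        // through result can alias it
        result [i] = a [0] * b [i];
    }
}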
Everyone here seems very confused. There's not a single example of a const pointer in any answer so far.
The declaration const float* a is not a const pointer, it's const storage. The pointer is still mutable. float *const a is a const pointer to a mutable float.
So the question should be, is there any point in float *const restrict a (or const float *const restrict a if you prefer).
Yes, you need restrict. Pointer-to-const doesn't mean that nothing can change the data, only that you can't change it through that pointer.
const is mostly just a mechanism to ask the compiler to help you keep track of which stuff you want functions to be allowed to modify. const is not a promise to the compiler that a function really won't modify data.
Unlike restrict, using pointer-to-const to mutable data is basically a promise to other humans, not to the compiler. Casting away const all over the place won't lead to wrong behaviour from the optimizer (AFAIK), unless you try to modify something that the compiler put in read-only memory (see below about static const variables). If the compiler can't see the definition of a function when optimizing, it has to assume that it casts away const and modifies data through that pointer (i.e. that the function doesn't respect the constness of its pointer args).
The compiler does know that static const int foo = 15; can't change, though, and will reliably inline the value even if you pass its address to unknown functions. (This is why static const int foo = 15; is not slower than #define foo 15 for an optimizing compiler. Good compilers will optimize it like a constexpr whenever possible.)
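A sketch of that point (opaque is a hypothetical external function):

static const int foo = 15;

void opaque(const int *); // definition not visible to the optimizer

int use_foo() {
    opaque(&foo);   // the address escapes...
    return foo * 2; // ...yet this still folds to 30: modifying a const
                    // object is undefined, so foo can't have changed
}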
Remember that restrict is a promise to the compiler that things you access through that pointer don't overlap with anything else. If that's not true, your function won't necessarily do what you expect. e.g. don't call foo_restrict(buf, buf, buf) to operate in-place.
In my experience (with gcc and clang), restrict is mainly useful on pointers that you store through. It doesn't hurt to put restrict on your source pointers, too, but usually you get all the asm improvement possible from putting it on just the destination pointer(s), if all the stores your function does are through restrict pointers.
If you have any function calls in your loop, restrict on a source pointer does let clang (but not gcc) avoid a reload. See these test-cases on the Godbolt compiler explorer, specifically this one:
void value_only(int); // a function the compiler can't inline

int arg_pointer_valonly(const int *__restrict__ src)
{
    // the compiler needs to load `*src` to pass it as a function arg
    value_only(*src);
    // and then needs it again here to calculate the return value
    return 5 + *src; // clang: no reload because of __restrict__
}
gcc6.3 (targeting the x86-64 SysV ABI) decides to keep src (the pointer) in a call-preserved register across the function call, and reload *src after the call. Either gcc's algorithms didn't spot that optimization possibility, or decided it wasn't worth it, or the gcc devs on purpose didn't implement it because they think it's not safe. IDK which. But since clang does it, I'm guessing it's probably legal according to the C11 standard.
clang4.0 optimizes this to only load *src once, and keep the value in a call-preserved register across the function call. Without restrict, it doesn't do this, because the called function might (as a side-effect) modify *src through another pointer.
The caller of this function might have passed the address of a global variable, for example. But any modification of *src other than through the src pointer would violate the promise that restrict made to the compiler. Since we don't pass src to valonly(), the compiler can assume it doesn't modify the value.
The GNU dialect of C allows using __attribute__((pure)) or __attribute__((const)) to declare that a function has no side-effects, allowing this optimization without restrict, but there's no portable equivalent in ISO C11 (AFAIK). Of course, allowing the function to inline (by putting it in a header file or using LTO) also allows this kind of optimization, and is much better for small functions especially if called inside loops.
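For example, a sketch using that GNU attribute (lookup is a hypothetical function):

int lookup(int) __attribute__((pure)); // may read memory, never writes it

int no_restrict_needed(const int *src)
{
    int v = lookup(*src); // a pure call cannot modify *src...
    return v + *src;      // ...so no reload of *src is needed here
}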
Compilers are generally pretty aggressive about doing optimizations that the standard allows, even if they're surprising to some programmers and break some existing unsafe code which happened to work. (C is so portable that many things are undefined behaviour in the base standard; most nice implementations do define the behaviour of lots of things that the standard leaves as UB.) C is not a language where it's safe to throw code at the compiler until it does what you want, without checking that you're doing it the right way (without signed-integer overflows, etc.)
If you look at the x86-64 asm output for compiling your function (from the question), you can easily see the difference. I put it on the Godbolt compiler explorer.
In this case, putting restrict on a is sufficient to let clang hoist the load of a[0], but not gcc.
With float *restrict result, both clang and gcc will hoist the load.
e.g.
# gcc6.3, for foo with no restrict, or with just const float *restrict a
.L5:
vmovss xmm0, DWORD PTR [rsi]
vmulss xmm0, xmm0, DWORD PTR [rdx+rax*4]
vmovss DWORD PTR [rdi+rax*4], xmm0
add rax, 1
cmp rcx, rax
jne .L5
vs.
# gcc 6.3 with float *__restrict__ result
# clang is similar with const float *__restrict__ a but not on result.
vmovss xmm1, DWORD PTR [rsi] # outside the loop
.L11:
vmulss xmm0, xmm1, DWORD PTR [rdx+rax*4]
vmovss DWORD PTR [rdi+rax*4], xmm0
add rax, 1
cmp rcx, rax
jne .L11
So in summary, put __restrict__ on all pointers that are guaranteed not to overlap with something else.
BTW, restrict is only a keyword in C. Some C++ compilers support __restrict__ or __restrict as an extension, so you should #ifdef it away on unknown compilers.
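A common shim looks like this sketch (the macro name RESTRICT is my own choice):

#if defined(__GNUC__)
#define RESTRICT __restrict__
#elif defined(_MSC_VER)
#define RESTRICT __restrict
#else
#define RESTRICT /* unknown compiler: expand to nothing */
#endif

void scale(float * RESTRICT out, const float * RESTRICT in, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f; // out and in promised not to overlap
}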
Since in the C99 Standard (ISO/IEC 9899:1999 (E)) there are examples of const * restrict, e.g., in section 7.8.2.3:
The strtoimax and strtoumax functions
Synopsis
#include <inttypes.h>
intmax_t strtoimax(const char * restrict nptr,
char ** restrict endptr, int base);
--- snip ---
Therefore, if one assumes that the standard would not provide such an example if const * were redundant to * restrict, then they are, indeed, not redundant.
As the previous answer stated, you need to add restrict.
I also wanted to comment on your scenario that "result might overlap with a". That is not the only reason the compiler must assume that "a" could change. It could also be changed by another thread that has a pointer to "a". Thus, even if your function did not change any values, the compiler will still assume that "a" could change.