I was looking at the assembly Visual Studio generated for this simple x64 program:
struct Point {
    int a, b;
    Point() {
        a = 0; b = 1;
    }
};

int main(int argc, char* argv[])
{
    Point arr[3];
    arr[0].b = 2;
    return 0;
}
And when it reaches arr[0].b = 2, it generates this:
mov eax, 8
imul rax, rax, 0
mov dword ptr [rbp+rax+4],2
Why does it do imul rax, rax, 0 instead of a simple mov rax, 0, or even xor rax, rax? How is imul more efficient, if at all?
The reason is that the assembly is calculating both the offset of the Point object in the array, which happens to be on the stack, and the offset of the member b.
The Intel documentation for imul with three (3) operands states:
Three-operand form — This form requires a destination operand (the
first operand) and two source operands (the second and the third
operands). Here, the first source operand (which can be a
general-purpose register or a memory location) is multiplied by the
second source operand (an immediate value). The intermediate product
(twice the size of the first source operand) is truncated and stored
in the destination operand (a general-purpose register).
In your case it is calculating the offset of the object in the array which results in addressing the first (zeroth) Point location on the stack. Having that resolved it is then adding the offset for .b which is the +4. So broken down:
mov eax,8 ; prepare to offset into the Point array
imul rax, rax, 0 ; Calculate which Point object is being referred to
mov dword ptr [rbp+rax+4],2 ; Add the offset to b and move value 2 in
All of which resolves to arr[0].b = 2.
I take it you did not compile with aggressive optimization. In a straight compile (no optimization, debug on, etc.) the compiler makes no assumptions with respect to addressing.
Comparison to clang
On OS X (El Capitan) with clang 3.9.0 and no optimization flags, once the Point objects are instantiated in the array, the assignment of .b = 2 is simply:
mov dword ptr [rbp - 44], 2
In this case, clang is pretty smart about the offsets and resolves addressing during the default optimization.
Related
This is the code in question:
struct Cell
{
    Cell* U;
    Cell* D;
    void Detach();
};

void Cell::Detach()
{
    U->D = D;
    D->U = U;
}
clang-14 -O3 produces:
mov rax, qword ptr [rdi] <-- rax = U
mov rcx, qword ptr [rdi + 8] <-- rcx = D
mov qword ptr [rax + 8], rcx <-- U->D = D
mov rcx, qword ptr [rdi + 8] <-- this queries the D field again
mov qword ptr [rcx], rax <-- D->U = U
gcc 11.2 -O3 produces almost the same, but leaves out one mov:
mov rdx, QWORD PTR [rdi]
mov rax, QWORD PTR [rdi+8]
mov QWORD PTR [rdx+8], rax
mov QWORD PTR [rax], rdx
Clang reads the D field twice, while GCC reads it only once and re-uses it. Apparently GCC is not afraid of the first assignment changing anything that has an impact on the second assignment. I'm trying to understand if/when this is allowed.
Checking correctness gets a bit complicated when U or D point at themselves, each other and/or the same target.
My understanding is that the shorter code of GCC is correct if it is guaranteed that the pointers point at the beginning of a Cell (never inside it), regardless of which Cell it is.
Following this line of thought further, this is the case when a) Cells are always aligned to their size, and b) no custom manipulation of such a pointer occurs (referencing and arithmetic are fine).
I suspect case a) is guaranteed by the compiler, and case b) would require invoking undefined behavior of some sort, and as such can be ignored.
This would explain why GCC allows itself this optimization.
Is my reasoning correct? If so, why does clang not make the same optimization?
There are many potential optimizations in C and C++ that are usually safe, but aren't quite sound. If one regards the -> operator as being usable to build a standard-layout object without having to use placement new on it first (an abstraction model that is relied upon by a lot of code, whether or not the Standard mandates support), removing the if (mode) in the following C and C++ functions would be such an optimization.
C version:
struct s { int x,y; }; /* Assume int is 4 bytes, and struct is 8 */
void test(struct s *p1, struct s *p2, int mode)
{
    p1->y = 1;
    p2->x = 2;
    if (mode)
        p1->y = 1;
}
C++ version:
#include <new>
struct s { int x,y; };
void test(void *vp1, void *vp2, int mode)
{
    if (1)
    {
        struct s* p1 = new (vp1) struct s;
        p1->y = 1;
    }
    if (1)
    {
        struct s* p2 = new (vp2) struct s;
        p2->x = 2;
    }
    if (mode)
    {
        struct s* p3 = new (vp1) struct s;
        p3->y = 1;
    }
}
The optimization would be correct unless the address in p2 is four bytes higher than p1. Under the "traditional" abstraction model used in C or C++, if the address of p1 happens to be 0x1000 and that of p2 happens to be 0x1004, the first assignment would cause addresses 0x1000-0x1007 to hold a struct s, if it didn't already, whose second member (at address 0x1004) would equal 1. The second assignment, by overwriting that object, would end its lifetime and cause addresses 0x1004 to 0x100B to hold a struct s whose first member would equal 2. The third assignment, if executed, would end the lifetime of that second object and re-create the first.
If the third assignment is executed, there would be an object at address 0x1000 whose second field (at address 0x1004) would hold the readable value 1. If the assignment is skipped, there would be an object at address 0x1004 whose first field would hold the value 2. Behavior would be defined in both cases, and a compiler that didn't know which case would apply would have to accommodate both of them by making the value at 0x1004 depend upon mode.
As it happens, the authors of clang do not seem to have provided for that corner case, and thus omit the conditional check. While I think the Standard should use an abstraction model that would allow such optimization, while also supporting the common structure-creation pattern in situations that don't involve weird aliasing corner cases, I don't see any way of interpreting the Standard that would allow for such optimization without allowing compilers to arbitrarily break a large amount of existing code.
I don't think there's any general way of knowing when a decision by gcc or clang not to impose a particular optimization represents a recognition of potential corner cases where the optimization would be incorrect, and an inability to prove that none of them apply, and when it simply represents an oversight which may be "corrected" so as to replace correct behavior with an unsound optimization.
I have to implement a "bandpass" filter. Let a and b denote two integers that induce a half-open interval [a, b). If some argument x lies within this interval (i.e., a <= x < b), I return a pointer to a C string const char* high, otherwise I return a pointer const char* low. The vanilla implementation of this function looks like
const char* vanilla_bandpass(int a, int b, int x, const char* low,
                             const char* high)
{
    const bool withinInterval { (a <= x) && (x < b) };
    return (withinInterval ? high : low);
}
which when compiled with -O3 -march=znver2 on Godbolt gives the following assembly code
vanilla_bandpass(int, int, int, char const*, char const*):
mov rax, r8
cmp edi, edx
jg .L4
cmp edx, esi
jge .L4
ret
.L4:
mov rax, rcx
ret
Now, I've looked into creating a version without a jump/branch, which looks like this
#include <cstdint>
const char* funky_bandpass(int a, int b, int x, const char* low,
                           const char* high)
{
    const bool withinInterval { (a <= x) && (x < b) };
    const auto low_ptr = reinterpret_cast<uintptr_t>(low) * (!withinInterval);
    const auto high_ptr = reinterpret_cast<uintptr_t>(high) * withinInterval;
    const auto ptr_sum = low_ptr + high_ptr;
    const auto* result = reinterpret_cast<const char*>(ptr_sum);
    return result;
}
which is ultimately just a "chord" between two pointers. Using the same options as before, this code compiles to
funky_bandpass(int, int, int, char const*, char const*):
mov r9d, esi
cmp edi, edx
mov esi, edx
setle dl
cmp esi, r9d
setl al
and edx, eax
mov eax, edx
and edx, 1
xor eax, 1
imul rdx, r8
movzx eax, al
imul rcx, rax
lea rax, [rcx+rdx]
ret
While at first glance, this function has more instructions, careful benchmarking shows that it's 1.8x to 1.9x faster than the vanilla_bandpass implementation.
Is this use of uintptr_t valid and free of undefined behavior? I'm well aware that the language around uintptr_t is vague and ambiguous to say the least, and that anything that isn't explicitly specified in the standard (like arithmetic on uintptr_t) is generally considered undefined behavior. On the other hand, in many cases, the standard explicitly calls out when something has undefined behavior, which it also doesn't do in this case. I'm aware that the "blending" that happens when adding together low_ptr and high_ptr touches on topics just as pointer provenance, which is a murky topic in and of itself.
Is this use of uintptr_t valid and free of undefined behavior?
Yes. Conversion from pointer to integer (of sufficient size such as uintptr_t) is well defined, and integer arithmetic is well defined.
Another thing to be wary about is whether converting a modified uintptr_t back to a pointer gives back what you want. The only guarantee the standard gives is that a pointer converted to an integer and back yields the same address. Luckily, this guarantee is sufficient for you, because you always use the exact value of a converted pointer.
If you were using something other than a pointer to a narrow character type, I think you would need to use std::launder on the result of the conversion.
The Standard doesn't require that implementations process uintptr_t-to-pointer conversions in useful fashion even in cases where the uintptr_t values are produced from pointer-to-integer conversions. Given e.g.
extern int x[5],y[5];
int *px5 = x+5, *py0 = y;
the pointers px5 and py0 might compare equal, and regardless of whether they do or not, code may use px5[-1] to access x[4], or py0[0] to access y[0], but may not access px5[0] nor py0[-1]. If the pointers happen to be equal, and code attempts to access ((int*)(uintptr_t)px5)[-1], a compiler could replace (int*)(uintptr_t)px5 with py0, since that pointer would compare equal to px5, but then jump the rails when attempting to access py0[-1]. Likewise if code tries to access ((int*)(uintptr_t)py0)[0], a compiler could replace (int*)(uintptr_t)py0 with px5, and then jump the rails when attempting to access px5[0].
While it may seem obtuse for a compiler to do such a thing, clang gets even crazier. Consider:
#include <stdint.h>
extern int x[],y[];
int test(int i)
{
    y[0] = 1;
    uintptr_t px = (uintptr_t)(x+5);
    uintptr_t py = (uintptr_t)(y+i);
    int flag = (px==py);
    if (flag)
        y[i] = 2;
    return y[0];
}
If px and py are coincidentally equal and i is zero, that will cause clang to set y[0] to 2 but return 1. See https://godbolt.org/z/7Sa_KZ for generated code ("mov eax,1 / ret" means "return 1").
Consider the following two useless C++ functions.
Compiled with GCC (4.9.2, 32- or 64-bit), both functions return the same value, as expected.
Compiled with Visual Studio 2010 or Visual Studio 2017 (unmanaged code), the two functions return different values.
What I've tried:
brackets, brackets, brackets
explicit casts to char
sizeof(char) is evaluated to 1
debug / release version
32- / 64-bit
What's going on here? It seems to be a fundamental bug in VS.
char test1()
{
    char buf[] = "The quick brown fox...", *pbuf = buf;
    char value = (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0);
    return value;
}

char test2()
{
    char buf[] = "The quick brown fox...", *pbuf = buf;
    char a = *(pbuf++) & 0x0F;
    char b = *(pbuf++) & 0xF0;
    char value = a | b;
    return value;
}
Edit:
It's not an attempt to blame VS (as mentioned in the posts).
It's not a matter of signed or unsigned.
It's not a matter of the order of evaluation of the left and right sides of the or-operator. Changing the order of the assignments of a and b in test2() yields a third result.
But the simultaneity is a good point. It seems the order of evaluation is indeed undefined. In a first step, the generated code evaluates the complete expression in test1() without incrementing any pointer. In a second step the pointers are incremented. Since the incrementation has no effect and the data remains unchanged after this specific operation, the optimizer removes the code.
Sorry for the inconvenience, but this is not what I would expect. In any language.
For completeness, here the disassembled code of test1():
0028102A mov ecx,dword ptr [ebp-8]
0028102D movsx edx,byte ptr [ecx]
00281030 and edx,0Fh
00281033 mov eax,dword ptr [ebp-8]
00281036 movsx ecx,byte ptr [eax]
00281039 and ecx,0F0h
0028103F or edx,ecx
00281041 mov byte ptr [ebp-1],dl
00281044 mov edx,dword ptr [ebp-8]
00281047 add edx,1
0028104A mov dword ptr [ebp-8],edx
0028104D mov eax,dword ptr [ebp-8]
00281050 add eax,1
00281053 mov dword ptr [ebp-8],eax
The behaviour of (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0); is undefined. | (unlike ||) is not a sequence point, and so you have simultaneous reads and writes of pbuf in the same program step.
Not a VS bug therefore. (Such things rarely are: a golden rule is not to blame the compiler.)
(Note also that char can be either signed or unsigned. That can introduce differences in code like yours.)
Moving a member variable to a local variable reduces the number of writes in this loop despite the presence of the __restrict keyword. This is using GCC -O3. Clang and MSVC optimise the writes in both cases. [Note that since this question was posted we observed that adding __restrict to the calling function caused GCC to also move the store out of the loop. See the godbolt link below and the comments]
class X
{
public:
    void process(float * __restrict d, int size)
    {
        for (int i = 0; i < size; ++i)
        {
            d[i] = v * c + d[i];
            v = d[i];
        }
    }

    void processFaster(float * __restrict d, int size)
    {
        float lv = v;
        for (int i = 0; i < size; ++i)
        {
            d[i] = lv * c + d[i];
            lv = d[i];
        }
        v = lv;
    }

    float c{0.0f};
    float v{0.0f};
};
With gcc -O3 the first one has an inner loop that looks like:
.L3:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rax, rdi
movss DWORD PTR x[rip+4], xmm0 ;<<< the extra store
jne .L3
.L1:
rep ret
The second here:
.L8:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rdi, rax
jne .L8
.L7:
movss DWORD PTR x[rip+4], xmm0
ret
See https://godbolt.org/g/a9nCP2 for the complete code.
Why does the compiler not perform the lv optimisation here?
I'm assuming the 3 memory accesses per loop are worse than the 2 (assuming size is not a small number), though I've not measured this yet.
Am I right to make that assumption?
I think the observable behaviour should be the same in both cases.
This seems to be caused by the missing __restrict qualifier on the f_original function. __restrict is a GCC extension; it is not quite clear how it is expected to behave in C++. Maybe it is a compiler bug (a missed optimization) that the effect appears to disappear after inlining.
The two methods are not identical. In the first, the value of v is updated multiple times during the execution. That may be or may not be what you want, but it is not the same as the second method, so it is not something the compiler can decide for itself as a possible optimization.
The restrict keyword says there is no aliasing with anything else, in effect same as if the value had been local (and no local had any references to it).
In the second case there is no external visible effect of v so it doesn't need to store it.
In the first case there is a potential that some external observer might see it. The compiler doesn't know at this point that there will be no threads that could change it, but it does know that it doesn't have to read the value back, as it is neither atomic nor volatile. And the change of d[], another externally visible object, makes the store necessary.
If the compiler writers reasoned "neither d nor v is volatile or atomic, so we can do it all under the as-if rule", then the compiler would have to be sure no one else can touch v at all. I'm pretty sure this will come in one of the newer versions, as there is no synchronisation before the return, and that will be the case in 99+% of all code anyway. Programmers would then have to put either volatile or atomic on variables that are changed concurrently, which I think I could live with.
I am trying to understand assembly a little better, so I've been paying attention to the assembly output from CDB when I debug my code. My platform is an Intel Xeon with Windows 7.
The following C++ code:
int main()
{
    int a = 30;
    int b = 0;
    b = ++a;
    return 0;
}
produces the following assembly for the line with the increment operator:
b = ++a;
0x13f441023 <+0x0013> mov eax,dword ptr [rsp]
0x13f441026 <+0x0016> inc eax
0x13f441028 <+0x0018> mov dword ptr [rsp],eax //Move eax to some memory address
0x13f44102b <+0x001b> mov eax,dword ptr [rsp] //Move it back to eax?
0x13f44102e <+0x001e> mov dword ptr [rsp+4],eax
My question is, what is the purpose of moving the value in eax to memory, then immediately moving the same value back into eax, as indicated by the comments? Is this for thread safety, or just some artifact of a debug build?
The compiler initially translates your statements into an intermediate form using static single assignment (SSA), meaning that each operation gets a temporary value to store its result. Only at a later backend stage are these values mapped onto machine registers according to your target machine, and possibly onto memory locations if necessary (explicitly required, or spilled due to lack of registers).
Between these stages, the optimizer may eliminate partial values, but initially ++a is one operation, and assigning a (after the increment) into b is a second operation. Since a and b are both local variables, they will be stored on the stack (and must be visible there, in case you step through with a debugger, for example): a resides in [rsp] and b in [rsp+4].
So your compiler, at some point, probably has (in some intermediate representation):
value1 = a
value2 = value1 + 1
a = value2 //self increment
b = a
Or something similar. a and b must be memory resident, but the operations will usually be done on registers, so at first the compiler does -
value1 = a
value2 = value1 + 1
0x13f441023 <+0x0013> mov eax,dword ptr [rsp]
0x13f441026 <+0x0016> inc eax
a = value2
0x13f441028 <+0x0018> mov dword ptr [rsp],eax
b = a
0x13f44102b <+0x001b> mov eax,dword ptr [rsp]
0x13f44102e <+0x001e> mov dword ptr [rsp+4],eax
Note that the intermediate values were kept in a register - in a normal (optimized) compilation, they would probably have been eliminated altogether by one of the optimization passes (prior to register assignment and code generation).
just some artifact of a debug build?
Yes, just some artifact of a debug build (actually from an unoptimized build)