Consider the following two useless C++ functions.
Compiled with GCC (4.9.2, 32- or 64-bit), both functions return the same value, as expected.
Compiled with Visual Studio 2010 or Visual Studio 2017 (unmanaged code), the two functions return different values.
What I've tried:
brackets, brackets, brackets
explicit casts to char
sizeof(char) is evaluated to 1
debug / release version
32- / 64-bit
What's going on here? It seems to be a fundamental bug in VS.
char test1()
{
    char buf[] = "The quick brown fox...", *pbuf = buf;
    char value = (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0);
    return value;
}

char test2()
{
    char buf[] = "The quick brown fox...", *pbuf = buf;
    char a = *(pbuf++) & 0x0F;
    char b = *(pbuf++) & 0xF0;
    char value = a | b;
    return value;
}
Edit:
It's not an attempt to blame VS (as mentioned in the posts).
It's not a matter of signed or unsigned.
It's not a matter of the order of evaluation of the left and right sides of the OR operator. Changing the order of the assignments to a and b in test2() yields a third result.
But the sequencing is a good point. It seems the order of evaluation here is indeed undefined. In a first step, the generated code evaluates the complete expression in test1() without incrementing any pointer. In a second step, the pointers are incremented. Since the increments have no visible effect and the data remains unchanged after this specific operation, the optimizer removes that code.
Sorry for the inconvenience, but this is not what I would expect, in any language.
For completeness, here is the disassembled code of test1():
0028102A mov ecx,dword ptr [ebp-8]
0028102D movsx edx,byte ptr [ecx]
00281030 and edx,0Fh
00281033 mov eax,dword ptr [ebp-8]
00281036 movsx ecx,byte ptr [eax]
00281039 and ecx,0F0h
0028103F or edx,ecx
00281041 mov byte ptr [ebp-1],dl
00281044 mov edx,dword ptr [ebp-8]
00281047 add edx,1
0028104A mov dword ptr [ebp-8],edx
0028104D mov eax,dword ptr [ebp-8]
00281050 add eax,1
00281053 mov dword ptr [ebp-8],eax
The behaviour of (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0) is undefined. | (unlike ||) is not a sequence point, so you have unsequenced reads and writes of pbuf within the same expression.
Not a VS bug therefore. (Such things rarely are: a golden rule is not to blame the compiler.)
(Note also that char can be either signed or unsigned. That can introduce differences in code like yours.)
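For reference, a sketch of one well-defined rewrite (test1_fixed is a hypothetical name; this assumes the intent was to combine the low nibble of the first character with the high nibble of the second, which is what test2 computes):

char test1_fixed()
{
    char buf[] = "The quick brown fox...";
    // Index explicitly instead of modifying pbuf twice in one expression;
    // every access is now sequenced, so the result is the same on every
    // conforming compiler.
    return (buf[0] & 0x0F) | (buf[1] & 0xF0);
}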
Related
When I run the following program, it always prints "yes". However, when I change SOME_CONSTANT to -2 it always prints "no". Why is that? I am using the Visual Studio 2019 compiler with optimizations disabled.
#include <cstdio>

#define SOME_CONSTANT -3

void func() {
    static int i = 2;
    int j = SOME_CONSTANT;
    i += j;
}

int main() {
    if (((bool(*)())func)()) {
        printf("yes\n");
    }
    else {
        printf("no\n");
    }
}
EDIT: Here is the output assembly of func (IDA Pro 7.2):
sub rsp, 18h
mov [rsp+18h+var_18], 0FFFFFFFEh
mov eax, [rsp+18h+var_18]
mov ecx, cs:i
add ecx, eax
mov eax, ecx
mov cs:i, eax
add rsp, 18h
retn
Here is the first part of main:
sub rsp, 628h
mov rax, cs:__security_cookie
xor rax, rsp
mov [rsp+628h+var_18], rax
call ?func@@YAXXZ ; func(void)
test eax, eax
jz short loc_1400012B0
Here is main decompiled:
int __cdecl main(int argc, const char **argv, const char **envp)
{
    int v3; // eax

    func();
    if ( v3 )
        printf("yes\n");
    else
        printf("no\n");
    return 0;
}
((bool(*)())func)()
This expression takes a pointer to func, casts the pointer to a different function type, then invokes it. Invoking a function through a pointer-to-function whose signature does not match the original function is undefined behavior, which means that anything at all might happen. From the moment this function call happens, the behavior of the program cannot be reasoned about. You cannot predict what will happen with any certainty. Behavior might be different on different optimization levels, different compilers, different versions of the same compiler, or when targeting different architectures.
This is simply because the compiler is allowed to assume that you won't do this. When the compiler's assumptions and reality come into conflict, the result is a vacuum into which the compiler can insert whatever it likes.
The simple answer to your question "why is that?" is, quite simply: because it can. But tomorrow it might do something else.
What apparently happened is:
mov ecx, cs:i
add ecx, eax
mov eax, ecx ; <- final value of i is stored in eax
mov cs:i, eax ; and then also stored in i itself
Different registers could have been used; it just happened to work this way. There is nothing about the code that forces eax to be chosen. That mov eax, ecx is really redundant; ecx could have been stored straight to i. But it happened to work this way.
And in main:
call ?func@@YAXXZ ; func(void)
test eax, eax
jz short loc_1400012B0
rax (or part of it, like eax or al) is used for the return value for integer-ish types (such as booleans) in the WIN64 ABI, so that makes sense. That means the final value of i happens to be used as the return value, by accident.
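For contrast, a minimal well-defined sketch (assuming the goal was simply to expose the final value of i as a truth value; the bool return type is my addition) would be:

#include <cstdio>

// Hypothetical rewrite: the value main tests now travels through an
// actual return value instead of whatever happens to be left in eax.
bool func() {
    static int i = 2;
    i += -3;          // SOME_CONSTANT
    return i != 0;
}

int main() {
    printf(func() ? "yes\n" : "no\n");
}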
I always get "no" printed out, so it must vary from compiler to compiler; hence the best answer is UB (Undefined Behavior).
I have to implement a "bandpass" filter. Let a and b denote two integers that induce a half-open interval [a, b). If some argument x lies within this interval (i.e., a <= x < b), I return a pointer to a C string const char* high; otherwise I return a pointer const char* low. The vanilla implementation of this function looks like
const char* vanilla_bandpass(int a, int b, int x, const char* low,
                             const char* high)
{
    const bool withinInterval { (a <= x) && (x < b) };
    return (withinInterval ? high : low);
}
which when compiled with -O3 -march=znver2 on Godbolt gives the following assembly code
vanilla_bandpass(int, int, int, char const*, char const*):
        mov     rax, r8
        cmp     edi, edx
        jg      .L4
        cmp     edx, esi
        jge     .L4
        ret
.L4:
        mov     rax, rcx
        ret
Now, I've looked into creating a version without a jump/branch, which looks like this
#include <cstdint>

const char* funky_bandpass(int a, int b, int x, const char* low,
                           const char* high)
{
    const bool withinInterval { (a <= x) && (x < b) };
    const auto low_ptr = reinterpret_cast<uintptr_t>(low) * (!withinInterval);
    const auto high_ptr = reinterpret_cast<uintptr_t>(high) * withinInterval;
    const auto ptr_sum = low_ptr + high_ptr;
    const auto* result = reinterpret_cast<const char*>(ptr_sum);
    return result;
}
which is ultimately just a "chord" between two pointers. Using the same options as before, this code compiles to
funky_bandpass(int, int, int, char const*, char const*):
        mov     r9d, esi
        cmp     edi, edx
        mov     esi, edx
        setle   dl
        cmp     esi, r9d
        setl    al
        and     edx, eax
        mov     eax, edx
        and     edx, 1
        xor     eax, 1
        imul    rdx, r8
        movzx   eax, al
        imul    rcx, rax
        lea     rax, [rcx+rdx]
        ret
While at first glance this function has more instructions, careful benchmarking shows that it's 1.8x to 1.9x faster than the vanilla_bandpass implementation.
Is this use of uintptr_t valid and free of undefined behavior? I'm well aware that the language around uintptr_t is vague and ambiguous to say the least, and that anything that isn't explicitly specified in the standard (like arithmetic on uintptr_t) is generally considered undefined behavior. On the other hand, in many cases the standard explicitly calls out when something has undefined behavior, which it also doesn't do in this case. I'm aware that the "blending" that happens when adding together low_ptr and high_ptr touches on topics such as pointer provenance, which is a murky topic in and of itself.
Is this use of uintptr_t valid and free of undefined behavior?
Yes. Conversion from pointer to integer (of sufficient size such as uintptr_t) is well defined, and integer arithmetic is well defined.
Another thing to be wary about is whether converting a modified uintptr_t back to a pointer gives back what you want. The only guarantee given by the standard is that a pointer converted to an integer and back gives the same address. Luckily, this guarantee is sufficient for you, because the arithmetic always produces a value that is exactly equal to one of the two converted pointers.
If you were using something other than a pointer to narrow character type, I think you would need to use std::launder on the result of the conversion.
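If the goal is branchless selection without the pointer-to-integer round trip at all, one common alternative (a sketch, not benchmarked here) is to index into a two-element array of the candidate pointers; compilers typically turn this into an indexed load or a conditional move:

const char* table_bandpass(int a, int b, int x, const char* low,
                           const char* high)
{
    const bool withinInterval { (a <= x) && (x < b) };
    // Select one of the two pointers by index; no reinterpret_cast needed,
    // and pointer provenance is never in question.
    const char* const table[2] = { low, high };
    return table[withinInterval];
}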
The Standard doesn't require that implementations process uintptr_t-to-pointer conversions in a useful fashion even in cases where the uintptr_t values are produced from pointer-to-integer conversions. Given e.g.
extern int x[5],y[5];
int *px5 = x+5, *py0 = y;
the pointers px5 and py0 might compare equal, and regardless of whether they do or not, code may use px5[-1] to access x[4], or py0[0] to access y[0], but may not access px5[0] nor py0[-1]. If the pointers happen to be equal, and code attempts to access ((int*)(uintptr_t)px5)[-1], a compiler could replace (int*)(uintptr_t)px5 with py0, since that pointer would compare equal to px5, but then jump the rails when attempting to access py0[-1]. Likewise, if code tries to access ((int*)(uintptr_t)py0)[0], a compiler could replace (int*)(uintptr_t)py0 with px5, and then jump the rails when attempting to access px5[0].
While it may seem obtuse for a compiler to do such a thing, clang gets even crazier. Consider:
#include <stdint.h>

extern int x[], y[];

int test(int i)
{
    y[0] = 1;
    uintptr_t px = (uintptr_t)(x+5);
    uintptr_t py = (uintptr_t)(y+i);
    int flag = (px==py);
    if (flag)
        y[i] = 2;
    return y[0];
}
If px and py are coincidentally equal and i is zero, that will cause clang to set y[0] to 2 but return 1. See https://godbolt.org/z/7Sa_KZ for generated code ("mov eax,1 / ret" means "return 1").
I was looking at the assembly Visual Studio generated for this simple x64 program:
struct Point {
    int a, b;
    Point() {
        a = 0; b = 1;
    }
};

int main(int argc, char* argv[])
{
    Point arr[3];
    arr[0].b = 2;
    return 0;
}
And when it reaches arr[0].b = 2, it generates this:
mov eax, 8
imul rax, rax, 0
mov dword ptr [rbp+rax+4],2
Why does it do imul rax, rax, 0 instead of a simple mov rax, 0, or even xor rax, rax? How is imul more efficient, if at all?
The reason is that the assembly is calculating two offsets: the offset of the Point object within the array (which happens to be on the stack), and the offset of the member b within that object.
The Intel documentation for imul with three operands states:
Three-operand form — This form requires a destination operand (the first operand) and two source operands (the second and the third operands). Here, the first source operand (which can be a general-purpose register or a memory location) is multiplied by the second source operand (an immediate value). The intermediate product (twice the size of the first source operand) is truncated and stored in the destination operand (a general-purpose register).
In your case it is calculating the offset of the object in the array, which resolves to addressing the first (zeroth) Point location on the stack. Having resolved that, it then adds the offset for .b, which is the +4. So, broken down:
mov eax, 8                   ; prepare to offset into the Point array (sizeof(Point) is 8)
imul rax, rax, 0             ; calculate which Point object is being referred to (index 0)
mov dword ptr [rbp+rax+4],2  ; add the offset of b and store the value 2
All of which resolves to arr[0].b = 2.
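The addressing math can be mirrored in C++ (a hypothetical illustration, not the compiler's actual output, assuming 4-byte int as on MSVC x64):

#include <cstddef> // offsetof

struct Point { int a, b; };

// The address of arr[i].b is computed as
//   base_of_arr + i * sizeof(Point) + offsetof(Point, b)
// which is exactly what the three instructions above spell out:
// 8 is sizeof(Point), the imul supplies i = 0, and +4 is offsetof(Point, b).
static_assert(sizeof(Point) == 8, "two 4-byte ints");
static_assert(offsetof(Point, b) == 4, "b sits right after a");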
I take it you did not compile with aggressive optimization. With a straight compile (no optimization, debug on, etc.) the compiler makes no assumptions with respect to addressing.
Comparison to clang
On OS X (El Capitan) with clang 3.9.0 and no optimization flags, once the Point objects are instantiated in the array, the assignment of .b = 2 is simply:
mov dword ptr [rbp - 44], 2
In this case, clang is pretty smart about the offsets and resolves the addressing fully even at the default optimization level.
Background
I have a VS2013 solution containing many projects and numerous source files.
In my sources, I use the same macro thousands of times, in many different locations.
Something like:
#define MyMacro(X) X
where X is a const char*.
I have a DLL project that, with the above macro definition, results in an 800KB output DLL.
Problem
In some scenarios or modes, I wish to change my macro definition to the following:
#define MyMacro(X) Utils::MyFunc(X)
This change had a very unpleasant side effect: the DLL output file size increased by 100KB.
Notes
Utils::MyFunc() is used here for the first time, so naturally I expect the binary to grow (a little), since new code is introduced.
Utils::MyFunc() does not include large headers or libs.
Utils::MyFunc() does allocate a string object.
All projects are compiled using settings that favor small code.
Artificial example
#define M1(X) X
#define M2(X) ReturnString1(X)
#define M3(X) ReturnString2(X)
string ReturnString1(const char* c)
{
    return string(c);
}

string ReturnString2(const string& s)
{
    return string(s);
}

int _tmain(int argc, _TCHAR* argv[])
{
    M3("TEST");
    M3("TEST");
    .
    . // 5000 times
    .
    M3("TEST");
    return 1;
}
In the above example, I've generated a small EXE project to try to mimic the problem I'm facing.
Using M1 exclusively in _tmain - compilation was instantaneous and the output was an 88KB EXE.
Using M2 exclusively in _tmain - compilation took minutes and the output was a 239KB EXE.
Using M3 exclusively in _tmain - compilation took a lot longer and the output was a 587KB EXE.
I used IDA to compare the binaries and extracted the function names from them.
In M2 & M3, I see a lot more of the following functions than I see in M1:
... $basic_string@DU?$char_traits@D@std@@V?$allocator@...
I'm not too surprised by that, since in M2 & M3 I'm allocating a string object.
But is it enough to justify 151KB and 499KB increases?
Question
Is it expected for string allocation to have such a substantial impact on the output file size?
Here is another "artificial" example:
int main()
{
    const char* p = M1("TEST");
    std::cout << p;
    string s = M3("TEST");
    std::cout << s;
    return 1;
}
I commented out one section at a time and looked at the generated ASM. For the M1 macro, I got:
012B1000 mov ecx,dword ptr [__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (012B204Ch)]
012B1006 call std::operator<<<std::char_traits<char> > (012B1020h)
012B100B mov eax,1
While for M3:
00DC1068 push 4
00DC106A push ecx
00DC106B lea ecx,[ebp-40h]
00DC106E mov dword ptr [ebp-2Ch],0Fh
00DC1075 mov dword ptr [ebp-30h],0
00DC107C mov byte ptr [ebp-40h],0
00DC1080 call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign (0DC1820h)
00DC1085 lea edx,[ebp-40h]
00DC1088 mov dword ptr [ebp-4],0
00DC108F lea ecx,[s]
00DC1092 call ReturnString2 (0DC1000h)
00DC1097 mov byte ptr [ebp-4],2
00DC109B mov eax,dword ptr [ebp-2Ch]
00DC109E cmp eax,10h
00DC10A1 jb main+6Dh (0DC10ADh)
00DC10A3 inc eax
00DC10A4 push eax
00DC10A5 push dword ptr [ebp-40h]
00DC10A8 call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10AD mov ecx,dword ptr [__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (0DC3050h)]
00DC10B3 lea edx,[s]
00DC10B6 mov dword ptr [ebp-2Ch],0Fh
00DC10BD mov dword ptr [ebp-30h],0
00DC10C4 mov byte ptr [ebp-40h],0
00DC10C8 call std::operator<<<char,std::char_traits<char>,std::allocator<char> > (0DC1100h)
00DC10CD mov eax,dword ptr [ebp-14h]
00DC10D0 cmp eax,10h
00DC10D3 jb main+9Fh (0DC10DFh)
00DC10D5 inc eax
00DC10D6 push eax
00DC10D7 push dword ptr [s]
00DC10DA call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10DF mov eax,1
Looking at the first column (the addresses), the M1 code size is 12 bytes, while M3's is 119.
I will leave it as an exercise for the reader to figure out the difference between 5,000 * 12 and 5,000 * 119 :)
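(Roughly: 5,000 * 12 = 60,000 bytes, about 59KB of inlined code, versus 5,000 * 119 = 595,000 bytes, about 581KB, which is in the right ballpark for the 88KB and 587KB figures measured above.)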
Let's take two cases in a simple example:
int _tmain()
{
    "TEST";
    std::string("TEST");
}
The first statement has no effect and is trivially optimized away.
The second statement constructs a string, which requires a function call. But what function is called? Maybe it's the string constructor, but if that's inlined, it might actually be that malloc(), strlen(), and memcpy() are called directly from main (not explicitly, but those three functions might plausibly be used by a string constructor which could be inline).
Now if you have this:
std::string("TEST");
std::string("TEST");
std::string("TEST");
You can see it's not 3 function calls but 9 (in our hypothetical). You could get it back to 3 by making sure the function you're calling is not inlined (either using __declspec(noinline) or by defining it in a separate translation unit, aka a .cpp file).
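For instance, a minimal sketch of the out-of-line approach (assuming MSVC, since __declspec(noinline) is MSVC-specific):

#include <string>

// Kept out of line on purpose: each of the 5,000 call sites now costs
// a single call instruction instead of a fully inlined constructor body.
__declspec(noinline) std::string ReturnString1(const char* c)
{
    return std::string(c);
}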
You may find that enabling full optimizations (Release build) lets the compiler figure out that these strings are never used, and get rid of them. Maybe.
I am trying to understand assembly a little better, so I've been paying attention to the assembly output from CDB when I debug my code. My platform is an Intel Xeon with Windows 7.
The following C++ code:
int main()
{
    int a = 30;
    int b = 0;
    b = ++a;
    return 0;
}
produces the following assembly for the line with the increment operator:
b = ++a;
0x13f441023 <+0x0013> mov eax,dword ptr [rsp]
0x13f441026 <+0x0016> inc eax
0x13f441028 <+0x0018> mov dword ptr [rsp],eax //Move eax to some memory address
0x13f44102b <+0x001b> mov eax,dword ptr [rsp] //Move it back to eax?
0x13f44102e <+0x001e> mov dword ptr [rsp+4],eax
My question is, what is the purpose of moving the value in eax to memory, then immediately moving the same value back into eax, as indicated by the comments? Is this for thread safety, or just some artifact of a debug build?
The compiler initially translates your code into static single assignment (SSA) form, meaning that each operation gets a temporary value to store its result. Only at a later backend stage are these values mapped onto machine registers for your target machine, and possibly onto memory locations where necessary (explicitly required, or spilled due to lack of registers).
Between these stages, the optimizer may eliminate partial values, but initially ++a is one operation, and assigning the incremented a into b is a second operation. Since a and b are both local variables, they are stored on the stack (and must be visible there, in case you step through with a debugger, for example); a resides in [rsp] and b in [rsp+4].
So your compiler, at some point, probably has something like this (in some intermediate representation):
value1 = a
value2 = value1 + 1
a = value2 //self increment
b = a
Or something similar. a and b must be memory resident, but the operations will usually be done in registers, so at first the compiler does:
value1 = a
value2 = value1 + 1
0x13f441023 <+0x0013> mov eax,dword ptr [rsp]
0x13f441026 <+0x0016> inc eax
a = value2
0x13f441028 <+0x0018> mov dword ptr [rsp],eax
b = a
0x13f44102b <+0x001b> mov eax,dword ptr [rsp]
0x13f44102e <+0x001e> mov dword ptr [rsp+4],eax
Note that the intermediate values were kept in a register; in a normal (optimized) compilation, they would probably have been eliminated altogether by one of the optimization passes (prior to register allocation and code generation).
just some artifact of a debug build?
Yes, just an artifact of a debug build (or, more precisely, of an unoptimized build).