Compiler optimization of static constexpr - c++

Given the following C++ code:
#include <stdio.h>

static constexpr int x = 1;

void testfn() {
    if (x == 2)
        printf("This is test.\n");
}

int main() {
    for (int a = 0; a < 10; a++)
        testfn();
    return 0;
}
Visual Studio 2019 produces the following Debug build assembly (viewed using Approach 1 of the accepted answer at: How to view the assembly behind the code using Visual C++?):
int main() {
00EC1870 push ebp
00EC1871 mov ebp,esp
00EC1873 sub esp,0CCh
00EC1879 push ebx
00EC187A push esi
00EC187B push edi
00EC187C lea edi,[ebp-0CCh]
00EC1882 mov ecx,33h
00EC1887 mov eax,0CCCCCCCCh
00EC188C rep stos dword ptr es:[edi]
00EC188E mov ecx,offset _6D4A0457_how_compiler_treats_staticconstexpr@cpp (0ECC003h)
00EC1893 call @__CheckForDebuggerJustMyCode@4 (0EC120Dh)
for (int a = 0; a < 10; a++)
00EC1898 mov dword ptr [ebp-8],0
00EC189F jmp main+3Ah (0EC18AAh)
00EC18A1 mov eax,dword ptr [ebp-8]
00EC18A4 add eax,1
00EC18A7 mov dword ptr [ebp-8],eax
00EC18AA cmp dword ptr [ebp-8],0Ah
00EC18AE jge main+47h (0EC18B7h)
testfn();
00EC18B0 call testfn (0EC135Ch)
00EC18B5 jmp main+31h (0EC18A1h)
return 0;
00EC18B7 xor eax,eax
}
As can be seen in the assembly, possibly because this is a Debug build, there are pointless references to the for loop and testfn in main. I would have hoped that they would not appear in the assembly at all, given that the printf in testfn can never be hit since static constexpr int x = 1.
I have 2 questions:
(1) Perhaps in the Release build, the for loop is optimized away. How can I check this? Viewing the Release build assembly does not work for me even when using Approach 2 specified at: How to view the assembly behind the code using Visual C++?. The file with the assembly code is not produced at all.
(2) When using static constexpr int/double/char as opposed to #defines, under what circumstances is one guaranteed that the former does not involve any unnecessary overhead (runtime computations/evaluations)? #defines, though much maligned, seem to offer a much stronger guarantee than static constexpr in this regard.

The issue here is that you are compiling the code as a debug build. If you want sanity in the asm, compile as release instead. The point of a debug build is that the debugger is used to help confirm the logic of the underlying code. The logic of your code is that it should call testfn() 10 times, so you should be able to place a breakpoint on that function and hit it at the correct point in the execution. In a release build, that breakpoint would never be hit (because the call would have been optimised away).
In your case however, it's entirely incorrect to say that the constexpr is being ignored. You may notice that there are no calls to printf() in the generated asm, so the compiler has correctly identified that if (x == 2) can never be true, and has removed it. However, if the compiler removed the call to testfn() completely, your breakpoint would never be hit, and the debugger would basically be useless.
Don't look at the output of a debug build and imagine it tells you anything useful about the code or compiler. You should expect the code to be deliberately de-optimised.
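Two follow-ups on the questions themselves. For (1), you don't have to rely on the IDE to see the release assembly: from a developer command prompt, cl /O2 /FA main.cpp writes a main.asm listing next to the object file (/FAs interleaves the source lines); the file name here is just illustrative. For (2), if you want a language-level guarantee that the dead branch produces no code even in a debug build, C++17's if constexpr discards the untaken branch at compile time, before code generation. A minimal sketch:
#include <cstdio>

static constexpr int x = 1;

void testfn() {
    // With "if constexpr", the untaken branch is discarded during
    // compilation, so no comparison and no printf call is emitted,
    // even with the optimizer off (requires C++17, e.g. /std:c++17).
    if constexpr (x == 2)
        std::printf("This is test.\n");
}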

Related

Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as
#include <atomic>

int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
mov edx, DWORD PTR [rcx]
npad 1
mov eax, DWORD PTR [rcx]
npad 1
add eax, edx
ret 0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?
p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the flag /volatileMetadata which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.
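For reference, a command line that reproduces this (the file name is illustrative; /volatileMetadata applies to x64 targets, so use an x64 developer prompt):
cl /O2 /FA /volatileMetadata- load.cpp
The /FA assembly listing is where you can confirm that the npad padding is gone.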

How to measure the number of increments per second

I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
#include <chrono>
using namespace std;

auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i) // N is the iteration count, e.g. 10^9
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)
This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right, because it imposes too many constraints that limit the compiler's freedom to implement optimal code. volatile forces the compiler to write the value back to memory each time the variable is modified, and to read it back each time it is used.
eg, this code:
void foo(int N) {
    for (volatile int i = 0; i < N; ++i)
    {
    }
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # If i < N, jump back to L3.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu
You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile(""), but MSVC's inline assembly has always been different, and it is disabled completely for 64-bit programs.
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
    sub ecx, 1      ; decrement the counter (assumed to arrive in ecx)
    jnz _testfun    ; keep looping until it reaches zero
    ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.
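A sketch of how you might drive and time it from C++ (note that sub ecx, 1 assumes the argument arrives in ecx, which matches __fastcall on x86 or the default x64 convention; the numbers below are illustrative):
#include <chrono>
#include <cstdio>

extern "C" void testfun(unsigned N); // the MASM loop above

int main() {
    const unsigned N = 1000000000u; // 10^9 iterations
    auto start = std::chrono::steady_clock::now();
    testfun(N);
    auto end = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(end - start).count();
    std::printf("%u iterations in %.3f s (%.2e per second)\n", N, sec, N / sec);
}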
Try volatile int i = 0 in your for loop. The volatile keyword tells the compiler that the variable could change at any time, due to outside events or threads, so it cannot make the same assumptions about what the variable will hold in the future.
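A minimal sketch of that suggestion with the timing added (bear in mind the caveat from the other answer: with volatile you are timing the loads and stores as much as the increments):
#include <chrono>
#include <cstdio>

int main() {
    const int N = 1000000000; // 10^9
    auto start = std::chrono::steady_clock::now();
    for (volatile int i = 0; i < N; ++i) {
        // empty body: volatile i keeps the compiler from removing the loop
    }
    auto end = std::chrono::steady_clock::now();
    std::printf("%.3f s\n", std::chrono::duration<double>(end - start).count());
}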

Is it beneficial to make library functions templated to avoid compiler instructions?

Let's say I'm creating my own library in namespace l. Would it be beneficial to make as many members of the namespace as possible templated? This would encourage the compiler to generate instructions only for the members that are actually called by the user of the library. To make my point clear, I'll demonstrate it here:
1)
namespace l {
    template <typename = void>
    int f() { return 3; }
}
vs.
2)
namespace l {
    int f() { return 3; }
}
To show the difference, f is deliberately not called in main:
int main() { return EXIT_SUCCESS; }
Version 1) does not emit any instructions for l::f():
main:
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], 0
mov eax, 0
pop rbp
ret
Version 2) does emit instructions for l::f() (even though, again, l::f() is never called):
l::f()
push rbp
mov rbp, rsp
mov eax, 3
pop rbp
ret
main:
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], 0
mov eax, 0
pop rbp
ret
tl;dr
Is it beneficial to make library functions templated to avoid compiler instructions?
No. Emitting dead code isn't the expensive part of compiling. File access, parsing and optimization (not necessarily in that order) take time, and this idea forces library clients to read & parse more code than in the regular header + library model.
Templates are usually blamed for slowing builds, not speeding them up.
It also means you can't build your library ahead of time, so each user needs to compile whichever parts they use from scratch, in every translation unit where they're used.
The total time spent compiling will probably be greater with the templated version. You'd have to profile to be sure (and I suspect this f is so small as to be immeasurable either way) but I have a hard time seeing this as a useful improvement.
Your comparison isn't representative anyway - a good compiler will discard dead code at link time. Some will also be able to inline code from static libraries, so there's no reliable effect on either compile-time or runtime performance.
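To illustrate that last point (the flags are the standard documented ones; file names are illustrative): compile with function-level linkage and ask the linker to strip unreferenced code, and l::f() disappears from the binary even in the non-template version.
cl /c /Gy library.cpp
cl /c main.cpp
link /OPT:REF main.obj library.obj
With GCC or Clang, -ffunction-sections at compile time plus -Wl,--gc-sections at link time plays the same role.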

Macro replace itself using undesired braces

I have a problem with macros: they expand to include braces.
Since I will need to compile for different operating systems [WINDOWS, OSX, ANDROID, iOS], I'm trying to use typedefs for the basic C++ types, so I can replace them easily and test performance.
Since I'm doing lots of static_casts, I thought I could use a macro that performs the cast only when it's needed (CPU is critical in my software). That way, the static_cast is only performed when the types actually differ, instead of doing weird things like this:
const int tv = 8;
const int tvc = static_cast<int>(8);
So, whether or not FORCE_USE32 is enabled, the appropriate version gets chosen.
Visual Studio 2017, using the default compiler, gives me an error when I do something like this:
#ifndef FORCE_USE32
#define FORCE_USE32 0
#endif
#if FORCE_USE32
typedef int s08;
#define Cs08(v) {v}
#else
typedef char s08;
#define Cs08(v) {static_cast<s08>(v)}
#endif
// this line give me an error because Cs08 is replaced by {static_cast<s08>(1)} instead just static_cast<s08>(1)
std::array<s08, 3> myArray{Cs08(1), 0, 0};
I know I could solve it easily by creating a variable before defining the array, something like this:
const s08 tempVar = Cs08(1);
std::array<s08, 3> myArray{tempVar, 0, 0};
But I do not understand the reason, and I want to keep my code as clean as possible. Is there any way to include the macro inside the array definition?
You are trying to solve a non-problem
const int tvc = static_cast<int>(8);
Will not use any CPU cycles here. How dumb do you think compilers are nowadays? Even with no optimizations the above cast is a no-op (no operation). There won't be any additional instructions generated for the cast.
auto test(int a) -> int
{
return a;
}
auto test_cast(int a) -> int
{
return static_cast<int>(a);
}
With no optimization enabled the two functions generate identical code:
test(int): # #test(int)
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
pop rbp
ret
test_cast(int): # #test_cast(int)
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
pop rbp
ret
With -O3 they get:
test(int): # #test(int)
mov eax, edi
ret
test_cast(int): # #test_cast(int)
mov eax, edi
ret
Coming back to how smart compilers (or rather their optimization algorithms) are: with optimizations enabled, a compiler can do crazy things, like unrolling loops, converting a recursive function into an iterative one, or removing entire blocks of redundant code. What you are doing is premature optimization. If your code is performance critical, then you need a decent understanding of assembly, compiler optimizations and system architecture. And then you don't just blindly optimize what you think is slow: you write code for readability first, and then you profile.
Answering your macro problem: just remove the {} from the macro:
#define Cs08(v) v
#define Cs08(v) static_cast<s08>(v)
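Putting it together, a sketch of the corrected header plus the original array definition (the extra parentheses are just the usual macro hygiene, not part of the fix):
#include <array>

#ifndef FORCE_USE32
#define FORCE_USE32 0
#endif

#if FORCE_USE32
typedef int s08;
#define Cs08(v) (v)
#else
typedef char s08;
#define Cs08(v) (static_cast<s08>(v))
#endif

std::array<s08, 3> myArray{Cs08(1), 0, 0}; // compiles: no stray braces in the initializer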

What is the impact of an object allocation on the binary size?

Background
I have a VS2013 solution containing many projects and numerous sources.
In my sources, I use the same macro thousands of times in different locations in the sources.
Something like:
#define MyMacro(X) X
where X is a const char*.
I have a DLL project that, with the above macro definition, results in an 800KB DLL.
Problem
In some scenarios or modes, I wish to change my macro definition to the following:
#define MyMacro(X) Utils::MyFunc(X)
This change had a very unpleasant side effect: the DLL output file size increased by 100KB.
Notes
Utils::MyFunc() is used here for the first time, so naturally I expect the binary to grow a little, since new code is introduced.
Utils::MyFunc() does not include large headers or libs.
Utils::MyFunc() does allocate a string object.
All projects are compiled with definitions that favor small code.
Artificial example
#define M1(X) X
#define M2(X) ReturnString1(X)
#define M3(X) ReturnString2(X)

string ReturnString1(const char* c)
{
    return string(c);
}

string ReturnString2(const string& s)
{
    return string(s);
}

int _tmain(int argc, _TCHAR* argv[])
{
    M3("TEST");
    M3("TEST");
    .
    . // 5000 times
    .
    M3("TEST");
    return 1;
}
In the above example, I've generated a small EXE project to try to mimic the problem I'm facing.
Using M1 exclusively in _tmain - compilation was instantaneous and the output was an 88KB EXE.
Using M2 exclusively in _tmain - compilation took minutes and the output was a 239KB EXE.
Using M3 exclusively in _tmain - compilation took a lot longer and the output was a 587KB EXE.
I used IDA to compare the binaries and extracted the function names from them.
In M2 & M3, I see a lot more of the following functions than I see in M1:
... $basic_string@DU?$char_traits@D@std@@V?$allocator@...
I'm not too surprised by that, since in M2 & M3 I'm allocating a string object.
But is that enough to justify the 151KB & 499KB increases?
Question
Is it expected for a string allocation to have such a substantial impact on the output file size?
Here is another "artificial" example:
int main()
{
    const char* p = M1("TEST");
    std::cout << p;
    string s = M3("TEST");
    std::cout << s;
    return 1;
}
I commented out one section at a time and looked at the generated ASM. For the M1 macro, I got:
012B1000 mov ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (012B204Ch)]
012B1006 call std::operator<<<std::char_traits<char> > (012B1020h)
012B100B mov eax,1
While for M3:
00DC1068 push 4
00DC106A push ecx
00DC106B lea ecx,[ebp-40h]
00DC106E mov dword ptr [ebp-2Ch],0Fh
00DC1075 mov dword ptr [ebp-30h],0
00DC107C mov byte ptr [ebp-40h],0
00DC1080 call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign (0DC1820h)
00DC1085 lea edx,[ebp-40h]
00DC1088 mov dword ptr [ebp-4],0
00DC108F lea ecx,[s]
00DC1092 call ReturnString2 (0DC1000h)
00DC1097 mov byte ptr [ebp-4],2
00DC109B mov eax,dword ptr [ebp-2Ch]
00DC109E cmp eax,10h
00DC10A1 jb main+6Dh (0DC10ADh)
00DC10A3 inc eax
00DC10A4 push eax
00DC10A5 push dword ptr [ebp-40h]
00DC10A8 call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10AD mov ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (0DC3050h)]
00DC10B3 lea edx,[s]
00DC10B6 mov dword ptr [ebp-2Ch],0Fh
00DC10BD mov dword ptr [ebp-30h],0
00DC10C4 mov byte ptr [ebp-40h],0
00DC10C8 call std::operator<<<char,std::char_traits<char>,std::allocator<char> > (0DC1100h)
00DC10CD mov eax,dword ptr [ebp-14h]
00DC10D0 cmp eax,10h
00DC10D3 jb main+9Fh (0DC10DFh)
00DC10D5 inc eax
00DC10D6 push eax
00DC10D7 push dword ptr [s]
00DC10DA call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10DF mov eax,1
Looking at the first column (the addresses), the M1 code size is 12 bytes, while M3's is 119 bytes.
I will leave it as an exercise for the reader to figure out the difference between 5,000 * 12 and 5,000 * 119 :)
Let's take two cases in a simple example:
int _tmain()
{
"TEST";
std::string("TEST");
}
The first statement has no effect and is trivially optimized away.
The second statement constructs a string, which requires a function call. But what function is called? Maybe it's the string constructor, but if that's inlined, it might actually be that malloc(), strlen(), and memcpy() are called directly from main (not explicitly, but those three functions might plausibly be used by a string constructor which could be inline).
Now if you have this:
std::string("TEST");
std::string("TEST");
std::string("TEST");
You can see it's not 3 function calls, but 9 (in our hypothetical). You could get it back to 3 if you make sure the function you're calling is not inline (either using __declspec(noinline) or by defining it in a separate translation unit, aka .cpp file).
You may find that enabling full optimizations (Release build) lets the compiler figure out that these strings are never used, and get rid of them. Maybe.
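To make the noinline route concrete, a minimal sketch (the helper name is made up; __declspec(noinline) is the MSVC spelling):
#include <string>

// Out-of-line on purpose: each of the thousands of call sites then
// costs one call instruction instead of the inlined construct/destroy
// sequence shown in the other answer.
__declspec(noinline) std::string MakeString(const char* c)
{
    return std::string(c);
}

int main()
{
    std::string s = MakeString("TEST");
    return static_cast<int>(s.size()); // use s so it isn't optimized away
}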