Does C++ optimisation level affect SWIG Python module performance?

I have a large SWIG Python module. The C++ wrapper ends up being about 320,000 LoC (including headers, I guess). I currently compile this with -O1, and g++ produces a binary that is 44 MiB in size, taking about 3 minutes to compile.
If I turn off optimisation (-O0), the binary comes out at 40 MiB, and it takes 44 s to compile.
Is compiling the wrapper with -O0 going to hurt the performance of the Python module significantly? Before I go and profile the performance of the module at different optimisation levels, has anyone done this sort of analysis before, or have any insight into whether it matters?

-O0 disables all of the optimizations performed by GCC, and optimizations matter.
So, without knowing much about your application, I would expect this to hurt its performance.
A usually safe optimization level to use is -O2.
You can check the kinds of optimizations performed by GCC at:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
But in the end, if you want to know for sure, you should compile at the different levels and profile.

This is bad irrespective of whether SWIG modules are involved. There are many optimisations that happen even with gcc -O1 which you will miss if you prevent them from happening.
You can check the difference by inspecting the asm generated by your compiler of choice. Of these, the ones I know off-hand will be detrimental to SWIG's generated wrappers are:
Dead code elimination:
void foo() {
int a = 1;
a = 0;
}
With -O1 this entirely pointless code gets totally removed:
foo:
pushl %ebp
movl %esp, %ebp
popl %ebp
ret
whereas with -O0 it becomes:
foo:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $1, -4(%ebp)
movl $0, -4(%ebp)
leave
ret
Register allocation will be detrimentally impacted in functions with lots of local variables - most SWIG wrapper functions will see a hit from this. It's hard to show a concise example of this though.
Another example: the output from gcc compiling the SWIG wrapper for the prototype:
int foo(unsigned int a, unsigned int b, unsigned int c, unsigned int d);
Generates with -O0:
Java_testJNI_foo:
pushl %ebp
movl %esp, %ebp
subl $88, %esp
movl 16(%ebp), %eax
movl %eax, -48(%ebp)
movl 20(%ebp), %eax
movl %eax, -44(%ebp)
movl 24(%ebp), %eax
movl %eax, -56(%ebp)
movl 28(%ebp), %eax
movl %eax, -52(%ebp)
movl 32(%ebp), %eax
movl %eax, -64(%ebp)
movl 36(%ebp), %eax
movl %eax, -60(%ebp)
movl 40(%ebp), %eax
movl %eax, -72(%ebp)
movl 44(%ebp), %eax
movl %eax, -68(%ebp)
movl $0, -32(%ebp)
movl -48(%ebp), %eax
movl %eax, -28(%ebp)
movl -56(%ebp), %eax
movl %eax, -24(%ebp)
movl -64(%ebp), %eax
movl %eax, -20(%ebp)
movl -72(%ebp), %eax
movl %eax, -16(%ebp)
movl -16(%ebp), %eax
movl %eax, 12(%esp)
movl -20(%ebp), %eax
movl %eax, 8(%esp)
movl -24(%ebp), %eax
movl %eax, 4(%esp)
movl -28(%ebp), %eax
movl %eax, (%esp)
call foo
movl %eax, -12(%ebp)
movl -12(%ebp), %eax
movl %eax, -32(%ebp)
movl -32(%ebp), %eax
leave
ret
Compared to -O1 which generates just:
Java_testJNI_foo:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl 40(%ebp), %eax
movl %eax, 12(%esp)
movl 32(%ebp), %eax
movl %eax, 8(%esp)
movl 24(%ebp), %eax
movl %eax, 4(%esp)
movl 16(%ebp), %eax
movl %eax, (%esp)
call foo
leave
ret
With -O1 g++ can generate far smarter code for:
%module test
%{
int value() { return 100; }
%}
%feature("compactdefaultargs") foo;
%inline %{
int foo(int a=value(), int b=value(), int c=value()) {
return 0;
}
%}
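To see why, here is a rough C++ sketch of the shape of the dispatch code that compactdefaultargs produces (an assumed illustration, not SWIG's actual output): each argument the caller omitted is filled in by a call to value(), and with -O1 g++ can inline value() and constant-fold all three defaults, while -O0 leaves three real calls in place.

```cpp
inline int value() { return 100; }

int foo(int a, int b, int c) { return 0; }

// Hypothetical illustration of the compactdefaultargs dispatch shape:
// arguments the caller omitted are filled in from value(). At -O1 the
// calls to value() are inlined and folded to the constant 100; at -O0
// three real calls remain.
int foo_wrapper(int argc, const int args[]) {
    int a = (argc > 0) ? args[0] : value();
    int b = (argc > 1) ? args[1] : value();
    int c = (argc > 2) ? args[2] : value();
    return foo(a, b, c);
}
```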
The short answer is that with optimisations disabled completely, GCC generates extremely naive code. This is true of SWIG wrappers as much as of any other program, if not more so, given the style of the automatically generated code.

Related

Increment variable by N inside array index

Could someone please tell me whether or not such a construction is valid (i.e. not UB) in C++? I have some segfaults because of it and spent a couple of days trying to figure out what is going on.
// Synthetic example
#include <iostream>

int main(int argc, char** argv)
{
int array[2] = {99, 99};
/*
The point is here. Is it legal? Does it have defined behaviour?
Will it increment first and then access the element, or vice versa?
*/
std::cout << array[argc += 7]; // Use argc just to avoid some optimisations
}
So, of course, I did some analysis: both GCC (5/7) and Clang (3.8) generate the same code. First the add, then the access.
Clang (3.8): clang++ -O3 -S test.cpp
leal 7(%rdi), %ebx
movl .L_ZZ4mainE5array+28(,%rax,4), %esi
movl $_ZSt4cout, %edi
callq _ZNSolsEi
movl $.L.str, %esi
movl $1, %edx
movq %rax, %rdi
callq _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
GCC (5/7): g++-7 -O3 -S test.cpp
leal 7(%rdi), %ebx
movl $_ZSt4cout, %edi
subq $16, %rsp
.cfi_def_cfa_offset 32
movq %fs:40, %rax
movq %rax, 8(%rsp)
xorl %eax, %eax
movabsq $425201762403, %rax
movq %rax, (%rsp)
movslq %ebx, %rax
movl (%rsp,%rax,4), %esi
call _ZNSolsEi
movl $.LC0, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl %ebx, %esi
So, can I assume such behaviour is standard?
By itself array[argc += 7] is OK, the result of argc + 7 will be used as an index to array.
However, in your example array has just 2 elements, and argc is never negative, so your code will always result in UB due to an out-of-bounds array access.
In the case of a[i += N], the expression i += N will always be evaluated before the indexing happens. But the example you provided invokes UB, as your array contains only two elements and thus you are accessing out of the bounds of the array.
Your case is clearly undefined behaviour, since you will exceed the array bounds, for the following reasons:
First, the expression array[argc += 7] is equivalent to *((array)+(argc+=7)), and the values of the operands will be evaluated before + is evaluated (cf. here). Operator += is an assignment (and not a side effect), and the value of an assignment is the value of argc (in this case) after the assignment (cf. here). Hence, the += 7 takes effect before the subscripting.
Second, argc is defined in C++ to never be negative (cf. here), so argc += 7 will always be >= 7 (or a signed integer overflow in very unrealistic scenarios, but still UB then).
Hence, UB.
It's normal behaviour. The name of an array decays to a pointer to its first element, and array[n] is the same as *(array + n).
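A minimal sketch confirming the evaluation order (a hypothetical demo function, with the array sized so the incremented index stays in bounds): the assignment happens first, and its result is used as the subscript.

```cpp
// Hypothetical demo: i += 7 is evaluated first, and its result (the new
// value of i) is used as the index. The array is large enough that the
// access stays in bounds, so there is no UB here.
int demo(int i) {            // e.g. i = 1, as argc would be with one argument
    int array[16] = {};      // zero-initialised, big enough for i + 7
    array[8] = 42;
    return array[i += 7];    // i becomes 8 before the subscript is applied
}
```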

Why does GCC 4.8.2 not propagate 'unused but set' optimization?

If a variable is never read from, it is obviously optimized out. However, the only store operation on that variable uses the result of the only read operation on another variable. So that second variable should also be optimized out. Why is this not happening?
#include <sys/time.h>

void foo(); // junk function, defined elsewhere

int main() {
timeval a,b,c;
// First and only logical use of a
gettimeofday(&a,NULL);
// Junk function
foo();
// First and only logical use of b
gettimeofday(&b,NULL);
// This gets optimized out as c is never read from.
c.tv_sec = a.tv_sec - b.tv_sec;
//std::cout << c;
}
Assembly (gcc 4.8.2 with -O3):
subq $40, %rsp
xorl %esi, %esi
movq %rsp, %rdi
call gettimeofday
call foo()
leaq 16(%rsp), %rdi
xorl %esi, %esi
call gettimeofday
xorl %eax, %eax
addq $40, %rsp
ret
subq $8, %rsp
Edit: the results are the same when using rand().
There's no store operation! There are two calls to gettimeofday, yes, but those are visible effects. And visible effects are precisely the things that may not be optimized away.
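A small sketch of that distinction (an assumed illustration): the two gettimeofday calls write through pointers into an opaque libc function, so they must stay; the store feeding c is genuinely dead and goes away. Making the result observable, e.g. by returning it, forces the subtraction to stay too.

```cpp
#include <sys/time.h>

// The two gettimeofday calls are visible effects (writes through a pointer
// into an opaque library call), so the compiler keeps both. A store whose
// result is never read is dead and gets removed; returning the difference
// instead makes it observable, so the subtraction must be kept.
long elapsed_seconds() {
    timeval a{}, b{};
    gettimeofday(&a, nullptr);   // kept: visible effect
    gettimeofday(&b, nullptr);   // kept: visible effect
    return b.tv_sec - a.tv_sec;  // observable result: the subtraction stays
}
```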

Does GCC optimize std::tie used only for readability?

Suppose I have a std::tuple:
std::tuple<int,int,int,int> t = {1,2,3,4};
and I want to use std::tie just for readability purpose like that:
int a, b, c, d; // in real context these names would be meaningful
std::tie(a, b, c, d) = t;
vs. just using std::get<0>(t), etc.
Would GCC optimize away the memory used by this tuple, or would it allocate additional space for the a, b, c, d variables?
In this case I don't see any reason why not: under the as-if rule the compiler only has to emulate the observable behavior of the program. A quick experiment on godbolt:
#include <tuple>
#include <cstdio>
void func( int x1, int x2,int x3, int x4)
{
std::tuple<int,int,int,int> t{x1,x2,x3,x4};
int a, b, c, d; // in real context these names would be meaningful
std::tie(a, b, c, d) = t;
printf( "%d %d %d %d\n", a, b, c, d ) ;
}
shows that gcc does indeed optimize it away:
func(int, int, int, int):
movl %ecx, %r8d
xorl %eax, %eax
movl %edx, %ecx
movl %esi, %edx
movl %edi, %esi
movl $.LC0, %edi
jmp printf
On the other hand, if you take the address of t and print it out, we now have observable behavior which relies on t existing (see it live):
printf( "%p\n", static_cast<void*>(&t) );
and we can see that gcc no longer optimizes t away:
movl %esi, 12(%rsp)
leaq 16(%rsp), %rsi
movd 12(%rsp), %xmm1
movl %edi, 12(%rsp)
movl $.LC0, %edi
movd 12(%rsp), %xmm2
movl %ecx, 12(%rsp)
movd 12(%rsp), %xmm0
movl %edx, 12(%rsp)
movd 12(%rsp), %xmm3
punpckldq %xmm2, %xmm1
punpckldq %xmm3, %xmm0
punpcklqdq %xmm1, %xmm0
At the end of the day you need to look at what the compiler generates and profile your code; in more complicated cases it may surprise you. Just because the compiler is allowed to do certain optimizations does not mean it will. I have looked at more complicated cases where the compiler does not do what I would expect with std::tuple. godbolt is a very helpful tool here; I cannot count how many optimization assumptions of mine have been upended by plugging simple examples into it.
Note: I typically use printf in these examples because iostreams generate a lot of code that gets in the way of the example.
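For what it's worth, C++17 structured bindings express the same intent without pre-declaring the variables, and the as-if rule applies to them just the same. A minimal sketch (assuming a C++17 compiler):

```cpp
#include <tuple>

// C++17 structured bindings: same readability as std::tie, without the
// separate declarations. The compiler remains free to optimise the tuple
// away under the as-if rule as long as its address is never observed.
int sum_fields() {
    std::tuple<int, int, int, int> t{1, 2, 3, 4};
    auto [a, b, c, d] = t;
    return a + b + c + d;
}
```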

how do static variables inside functions work?

In the following code:
int count(){
static int n(5);
n = n + 1;
return n;
}
the variable n is initialized only once, at the first call to the function.
There should be a flag or something so that it initializes the variable only once. I tried to look at the assembly code generated by gcc, but didn't have any clue.
How does the compiler handle this?
This is, of course, compiler-specific.
The reason you didn't see any checks in the generated assembly is that, since n is an int variable, g++ simply treats it as a global variable pre-initialized to 5.
Let's see what happens if we do the same with a std::string:
#include <string>
void count() {
static std::string str;
str += ' ';
}
The generated assembly goes like this:
_Z5countv:
.LFB544:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA544
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
pushq %r13
pushq %r12
pushq %rbx
subq $8, %rsp
movl $_ZGVZ5countvE3str, %eax
movzbl (%rax), %eax
testb %al, %al
jne .L2 ; <======= bypass initialization
.cfi_offset 3, -40
.cfi_offset 12, -32
.cfi_offset 13, -24
movl $_ZGVZ5countvE3str, %edi
call __cxa_guard_acquire ; acquire the lock
testl %eax, %eax
setne %al
testb %al, %al
je .L2 ; check again
movl $0, %ebx
movl $_ZZ5countvE3str, %edi
.LEHB0:
call _ZNSsC1Ev ; call the constructor
.LEHE0:
movl $_ZGVZ5countvE3str, %edi
call __cxa_guard_release ; release the lock
movl $_ZNSsD1Ev, %eax
movl $__dso_handle, %edx
movl $_ZZ5countvE3str, %esi
movq %rax, %rdi
call __cxa_atexit ; schedule the destructor to be called at exit
jmp .L2
.L7:
.L3:
movl %edx, %r12d
movq %rax, %r13
testb %bl, %bl
jne .L5
.L4:
movl $_ZGVZ5countvE3str, %edi
call __cxa_guard_abort
.L5:
movq %r13, %rax
movslq %r12d,%rdx
movq %rax, %rdi
.LEHB1:
call _Unwind_Resume
.L2:
movl $32, %esi
movl $_ZZ5countvE3str, %edi
call _ZNSspLEc
.LEHE1:
addq $8, %rsp
popq %rbx
popq %r12
popq %r13
leave
ret
.cfi_endproc
The line I've marked with the bypass initialization comment is the conditional jump that skips the construction if the guard variable indicates the object has already been constructed.
This is entirely up to the implementation; the language standard says nothing about that.
In practice, the compiler will usually include a hidden flag variable somewhere that indicates whether the static variable has already been initialized or not. The static variable and the flag will probably be in the static storage area of the program (e.g. the data segment, not the stack segment), not in the function-scope memory, so you may have to look around a bit in the assembly. (The variable can't go on the call stack, for obvious reasons, so it's really like a global variable. "Static allocation" really covers all sorts of static variables!)
Update: As #aix points out, if the static variable is initialized to a constant expression, you may not even need a flag, because the initialization can be performed at load time rather than at the first function call. In C++11 you should be able to take advantage of that better than in C++03 thanks to the wider availability of constant expressions.
It's quite likely that this variable will be handled just like an ordinary global variable by gcc. That means the initial value will be stored directly in the binary.
This is possible because you initialize it with a constant. If you initialized it, e.g., with another function's return value, the compiler would add a flag and skip the initialization based on that flag.
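A minimal sketch of that second case (with a hypothetical next_value() initializer): because the initializer is not a constant expression, the compiler must emit the guard machinery shown above and call next_value() exactly once, on the first call.

```cpp
// Hypothetical non-constant initializer: the compiler cannot fold this at
// load time, so it emits a guard flag (the __cxa_guard_* calls above) and
// runs next_value() once, on the first call to count().
int next_value() { return 5; }

int count() {
    static int n = next_value();  // initialized exactly once
    n = n + 1;
    return n;
}
```

Each subsequent call skips the initialization and just increments, so the first call returns 6, the second 7, and so on.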

Mapping class to instance

I'm trying to build a simple entity/component system in C++ based on the second answer to this question: Best way to organize entities in a game?
Now, I would like to have a static std::map that returns an instance for a given component class (and even creates one automatically, if possible).
What I am thinking is something along the lines of:
PositionComponent *pos = systems[PositionComponent].instances[myEntityID];
What would be the best way to achieve that?
Maybe this?
std::map<const std::type_info*, Something> systems;
Then you can do:
Something sth = systems[ &typeid(PositionComponent) ];
Just out of curiosity, I checked the assembler code of this C++ code
#include <typeinfo>
#include <cstdio>
class Foo {
virtual ~Foo() {}
};
int main() {
printf("%p\n", &typeid(Foo));
}
to be sure that it is really a constant. Assembler (stripped) output by GCC (without any optimisations):
.globl _main
_main:
LFB27:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
pushl %ebx
LCFI2:
subl $20, %esp
LCFI3:
call L3
"L00000000001$pb":
L3:
popl %ebx
leal L__ZTI3Foo$non_lazy_ptr-"L00000000001$pb"(%ebx), %eax
movl (%eax), %eax
movl %eax, 4(%esp)
leal LC0-"L00000000001$pb"(%ebx), %eax
movl %eax, (%esp)
call _printf
movl $0, %eax
addl $20, %esp
popl %ebx
leave
ret
So it actually has to read the L__ZTI3Foo$non_lazy_ptr symbol (I am surprised, though, that this is not constant; maybe with other compiler options, or with other compilers, it is). So a constant may be slightly faster (if the compiler sees the constant at compile time), because you save a read.
You should create some constants (e.g. POSITION_COMPONENT = 1) and then map those integers to instances.
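An alternative that keeps typeid-based lookup but avoids raw std::type_info pointers is std::type_index (C++11), which wraps the reference typeid yields and provides the ordering and hashing a map key needs. A rough sketch with hypothetical Component types:

```cpp
#include <map>
#include <typeindex>

struct Component { virtual ~Component() = default; };
struct PositionComponent : Component { int x = 0, y = 0; };

// Hypothetical registry keyed by std::type_index (C++11): it wraps the
// std::type_info reference that typeid yields, with well-defined
// comparison semantics, so it is safe to use as a map key.
std::map<std::type_index, Component*> systems;

template <typename T>
T* get_component() {
    // operator[] default-inserts nullptr for unregistered types.
    return static_cast<T*>(systems[std::type_index(typeid(T))]);
}
```

Usage would be systems[std::type_index(typeid(PositionComponent))] = &somePosition; followed by get_component<PositionComponent>() to retrieve it.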