How to disassemble compiler-generated code? - c++

I would like to see the disassembled code in the same order that the compiler emits it after instruction rescheduling. By the way, I am using GDB, and when I give the command disas /m FunctionName it gives me the disassembled code in source-code order. I am trying to look at the effectiveness of instruction rescheduling by my compiler (GCC 4.1) and would like to see how the instructions are rescheduled.
Thanks!
EDIT:
After looking at disassembled code for a line of code:
double w_n = (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]) ;
I could see that it is 83 bytes of instructions. But after unrolling it by 2 iterations:
double w_n[2] = { (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11]), (A_n_2[2] * x[0] + A_n_2[5] * y + A_n_2[8] * z + A_n_2[11]) };
The block of code is 226 bytes, and there is an enormous increase in instruction count. Could anyone tell me why this is happening? I can also see from VTune that the number of instructions retired has increased after unrolling.
A possible reason I can think of: with the increased block size, the compiler gets enough opportunity to generate simple instructions so as to maximize the throughput of the instruction prefetch and decode units.
Any help is greatly appreciated. Thanks!!

If rescheduling has been done by the compiler, you really should see that when disassembling in gdb. Note that disas /m groups the instructions by source line; a plain disas FunctionName (without /m) lists them in address order, which is the order the compiler actually emitted them in.
Otherwise you can use objdump directly on the command line; that's my preferred way of seeing the code in an ELF binary:
$ objdump --disassemble a.out | less
It doesn't reference the source at all, so it should really show what's in the binary itself.
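If you only care about one function, a convenient variation (assuming your objdump supports -C for demangling C++ names) is:
$ objdump --disassemble -C a.out | less
and then searching for the function name inside less with /FunctionName.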

In the step in which you compile the code into an object file, you could also simply tell the GCC driver (gcc) that you want to get assembly code:
gcc -S -c file.c
gcc -O2 -S -c file.c
gcc -S -masm=intel -c file.c
(the last one generates Intel- instead of AT&T-syntax assembly)
You can even feed that assembly code to the assembler (GNU as) later on to get an object file which can be linked.
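For example, assuming a file named file.c, the round trip might look like this:
$ gcc -O2 -S file.c         # writes file.s with the scheduled instructions
$ gcc -c file.s -o file.o   # assemble it back into an object file
$ gcc file.o -o prog        # link as usual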
As to why the code is bigger: there are a number of reasons. The heuristics we humans used when hand-optimizing assembly haven't held true for quite some time. One big goal is pipelining, another is vectorization. All in all, it's about parallelizing as much as possible and invalidating as little as possible of the (already read) cache at any given time, in order to speed up execution.
Even though it seems counter-intuitive, this can lead to bigger, yet faster, code.
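If you want to see this size-versus-speed trade-off on a toy example, one rough sketch (the file name sumloop.cpp and the flags below are only illustrative) is to compare object-file sizes at different optimization settings:
// sumloop.cpp -- a trivial reduction loop the compiler may unroll and/or vectorize
double sum(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
$ g++ -O2 -c sumloop.cpp && size sumloop.o
$ g++ -O3 -funroll-loops -c sumloop.cpp && size sumloop.o
The text size reported by size typically grows at the more aggressive setting, even when the unrolled loop runs faster.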

Related

Does a compile time logb/ilogb exist?

My question stems from trying to find a compile-time (constexpr) way to get the exponent of a floating-point number (the reason why is not the topic). logb/ilogb is the best runtime way (other than bit fiddling or casts/unions). Looking at the disassembly of Visual Studio's implementation doesn't help without any idea of what they are even doing in the first place. I was hoping there's a formula (of some sort) for getting the exponent of a float, or something to point me in the right direction.
What I'm trying to achieve:
constexpr int exponent = ilogb(123.45f);
No, compile time logarithms do not portably exist.
However, some compilers can compute it at compile time (using the as-if rule). GCC does so, presumably by using __builtin_ilogb internally.
Don't forget a simpler approach: a C++ file can be generated at build time (you'll configure your build automation tool, e.g. make or ninja, to do that). So you could write a small script (in Python, awk or some other scripting language), or even another C++ program of your own, that generates that constant.
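As a minimal sketch of that idea (the file and constant names here are just illustrative), a tiny generator can be compiled and run as a build step, and its output #included where the constant is needed:
// gen_exponent.cpp -- build-time generator (illustrative)
#include <cmath>
#include <cstdio>

int main() {
    // The exponent is computed when the generator runs, and emitted
    // as a compile-time constant for the real program.
    std::printf("constexpr int kExponent = %d;\n", std::ilogb(123.45f));
    return 0;
}
A possible build step: g++ gen_exponent.cpp -o gen_exponent && ./gen_exponent > exponent.h, then #include "exponent.h" in the translation units that need the constant.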
BTW, with GCC 7.3 on Linux/x86-64, the following file jaran.cc
#include <cmath>
int f(void) {
    constexpr int e = ilogb(123.45f);
    return e;
}
is compiled into a constant function (as seen with g++ -S -O -fverbose-asm)
.type _Z1fv, #function
_Z1fv:
.LFB253:
.cfi_startproc
# /tmp/jarann.cc:4: }
movl $6, %eax #,
ret
.cfi_endproc
Generating C++ code is a "portable" approach, but of course it requires configuring your build for that.
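If the values you care about are ordinary positive, finite floats, yet another workaround is a hand-rolled constexpr loop. The following is only a C++14 sketch; it deliberately ignores zero, infinities, NaNs and subnormals, so it is not a drop-in replacement for std::ilogb:
// Minimal constexpr sketch: exponent of a finite x > 0 (C++14).
constexpr int constexpr_ilogb(float x) {
    int e = 0;
    while (x >= 2.0f) { x *= 0.5f; ++e; }   // halving a float is exact
    while (x < 1.0f)  { x *= 2.0f; --e; }   // doubling is exact as well (no overflow here)
    return e;
}

static_assert(constexpr_ilogb(123.45f) == 6, "2^6 <= 123.45 < 2^7");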

When should I use DO CONCURRENT and when OpenMP?

I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?
My problem is simple:
I have a DO loop that has elements that may be run concurrently. Which method do I use ?
Below is code to generate particles on a simple cubic lattice.
npart is the number of particles
npart_edge & npart_face are the numbers of particles along an edge and on a face, respectively
space is the lattice spacing
Rx, Ry, Rz are the position arrays
x, y, z are temporary variables used to decide the position on the lattice
Note the difference that x, y and z have to be arrays in the DO CONCURRENT case, but not in the OpenMP case, because there they can be declared PRIVATE.
So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD):
DO CONCURRENT (i = 1, npart)
  x(i) = MODULO(i-1, npart_edge)
  Rx(i) = space*x(i)
  y(i) = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y(i)
  z(i) = (i-1) / npart_face
  Rz(i) = space*z(i)
END DO
Or do I use OpenMP?
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x,y,z)
!$OMP DO
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO( ( (i-1) / npart_edge ), npart_edge)
  Ry(i) = space*y
  z = (i-1) / npart_face
  Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL
My tests:
Placing 64 particles in a box of side 10:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 6.870000000000001E-003
Real time = 3.600000000000000E-003
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 6.699999999999979E-005
Real time = 0.000000000000000E+000
Placing 100000 particles in a box of side 100:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 8.213300000000000E-002
Real time = 1.280000000000000E-002
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 2.385000000000000E-003
Real time = 2.400000000000000E-003
Using the DO CONCURRENT construct seems to be giving me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of concurrency seems to decrease with increasing size.
DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads, you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD if it would have used it for the regular DO.
OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
You should try, measure and see. There is no single definitive answer. Not even for a given compiler, let alone for all compilers.
If you would not otherwise use OpenMP, I would give DO CONCURRENT a try to see whether the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code is already using OpenMP, I do not see any point in introducing DO CONCURRENT.
My practice is to use OpenMP and try to make sure the compiler vectorizes (SIMD) what it can. Especially because I use OpenMP all over my program anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced, yet, but some GPU examples look promising - however, real codes are often much more complex.
Your specific examples and the performance measurement:
Too little code is given, and there are subtle points in any benchmark. I wrote some simple code around your loops and did my own tests. I was careful NOT to include the thread creation in the timed block; you should not include !$omp parallel in your timing. I also took the minimum real time over multiple runs, because the first run is sometimes longer (certainly with DO CONCURRENT), and the CPU has various throttle modes and may need some time to spin up. I also added SCHEDULE(STATIC).
npart=10000000
ifort -O3 concurrent.f90: 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel: 5.044600000000000E-002
ifort -O3 concurrent_omp.f90: 2.419600000000000E-002
npart=10000, default 8 threads (hyper-threading)
ifort -O3 concurrent.f90: 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel: 8.899999999999999E-005
ifort -O3 concurrent_omp.f90: 1.890000000000000E-004
npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading)
ifort -O3 concurrent.f90: 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel: 9.200000000000000E-005
ifort -O3 concurrent_omp.f90: 1.070000000000000E-004
Here, DO CONCURRENT seems to be somewhat faster for the small case, but not too much if we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization.

Extent of G++ compiler optimization on non-commutative operations

I am concerned about the G++ optimizer's effect on arithmetic operations, specifically integer operations that are not necessarily commutative, e.g. * and /. This concern arose when I looked at a simple function in gdb that had been compiled with the -O3 flag set; it was, all in all, a better function, but its form was completely different from what it had been with no optimization: operations had been removed, and some had been relocated. Here is a simple function with which I will demonstrate the crux of my concern:
int ClipLower(int num, int dig){
    int Mult10 = 1;
    while (dig != 0){
        Mult10 *= 10, dig--;
    }
    return ((num / Mult10) * Mult10);
}
This function simply clips off the base-10 digits below digit 'dig'. My concern is: does the compiler take into account that integer division truncates, so that division and multiplication do not simply cancel out? In other words, will it try to reduce (num / Mult10) * Mult10 to num * 1 and, of course, discard the one?
I am aware that volatile will avoid this situation, but I would still like my code optimized as much as possible. So in essence I am asking whether the GNU optimizer understands this, and furthermore how much of a concern optimization-gone-awry really is.
Also, here is the disassembly for the function at -O4; as you can see, the order of operations is fine:
13 return ((num / Mult10) * Mult10);
cltd
idiv %ecx
imul %ecx,%eax
ret
Amusingly, the compiler generated a load of no-ops following the function, presumably as padding because it ended up so small.
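For reference, here is a quick sanity check of the required semantics (a minimal sketch; main is only illustrative): integer division truncates, so folding (num / Mult10) * Mult10 into num would change the result:
#include <cassert>

int ClipLower(int num, int dig) {
    int Mult10 = 1;
    while (dig != 0) {
        Mult10 *= 10, dig--;
    }
    return ((num / Mult10) * Mult10);
}

int main() {
    // If the compiler (illegally) folded (num / Mult10) * Mult10 into num,
    // this would yield 12345 instead of 12300.
    assert(ClipLower(12345, 2) == 12300);
    return 0;
}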
Here is the list of flags that -O3 in g++ is equivalent to: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Now if you look carefully, there is also -Ofast, which is defined as -O3 plus some other flags, most notably -ffast-math. In the description of -ffast-math you can read:
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
This is done precisely to ensure that the default optimization levels do not violate rounding behaviour and other floating-point standard specifications.
There is also a related question on SO about why compilers don't optimize a*a*a*a*a*a to (a*a*a)^2; the answer is the same. (I cannot find the link atm =/)
Btw, Mult10 *= 10, dig--; are you trying to lose people following your code? =D
EDIT: Another aside: going above -O3 has no effect, except that some people say you might overflow some internal variable. I didn't test that, but I'm fairly sure -O4 and -O100 are equivalent to -O3 at the time of writing.
Try it and look at the assembly
Optimization should not affect the output, only the speed. Rounding should be maintained. But bugs can occur, although much more rarely nowadays.
Generally, issues are more likely with floating point; 2/7 with floats might vary slightly.
With ints it should always be 0, no matter what the optimization level, even if it is then multiplied by 7.

Tool for detecting pointer aliasing problems in C / C++

Is there a tool that can do alias analysis on a program and tell you where gcc / g++ are having to generate sub-optimal instruction sequences due to potential pointer aliasing?
I don't know of anything that gives "100 %" coverage, but for vectorizing code (which aliasing often prevents) use the -ftree-vectorizer-verbose=n option, where n is an integer between 1 and 6. This prints out some information about why a loop couldn't be vectorized.
For instance, with g++ 4.1, the code
//#define RSTR __restrict__
#define RSTR
void addvec(float* RSTR a, float* b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
results in
$ g++ -ftree-vectorizer-verbose=1 -ftree-vectorize -O3 -c aliastest.cpp
aliastest.cpp:6: note: vectorized 0 loops in function.
Now, switch to the other definition for RSTR and you get
$ g++ -ftree-vectorizer-verbose=1 -ftree-vectorize -O3 -c aliastest.cpp
aliastest.cpp:6: note: LOOP VECTORIZED.
aliastest.cpp:6: note: vectorized 1 loops in function.
Interestingly, if one switches to g++ 4.4, it can vectorize the first non-restrict case by versioning and a runtime check:
$ g++44 -ftree-vectorizer-verbose=1 -O3 -c aliastest.cpp
aliastest.cpp:6: note: created 1 versioning for alias checks.
aliastest.cpp:6: note: LOOP VECTORIZED.
aliastest.cpp:4: note: vectorized 1 loops in function.
And this is done for both of the RSTR definitions.
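(As an aside, on recent GCC releases the -ftree-vectorizer-verbose flag has been superseded by the -fopt-info family, so the rough equivalent of the runs above would be something like:
$ g++ -O3 -fopt-info-vec-missed -c aliastest.cpp
which reports the loops that could not be vectorized and why.)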
In the past I've tracked down aliasing slowdowns with some help from a profiler. Some of the game-console profilers will highlight parts of the code that are causing lots of load-hit-store penalties; these can often occur because the compiler assumes some pointers are aliased and has to generate extra load instructions. Once you know which part of the code they're occurring in, you can backtrack from the assembly to the source to see what might be considered aliased, and add "restrict" as needed (or use other tricks to avoid the extra loads).
I'm not sure if there are any freely available profilers that will let you get into this level of detail, however.
The side benefit of this approach is that you only spend your time examining cases that actually slow your code down.

Which compiles to faster code: "n * 3" or "n+(n*2)"?

Which compiles to faster code: "ans = n * 3" or "ans = n+(n*2)"?
Assuming that n is either an int or a long, and it is running on a modern Win32 Intel box.
Would this be different if there was some dereferencing involved, that is, which of these would be faster?
long a;
long *pn;
long ans;
...
*pn = some_number;
ans = *pn * 3;
Or
ans = *pn+(*pn*2);
Or, is it something one need not worry about as optimizing compilers are likely to account for this in any case?
IMO such micro-optimization is not necessary unless you work with some exotic compiler. I would put readability first.
It doesn't matter. Modern processors can execute an integer MUL with single-cycle throughput, unlike older processors, which needed to perform a series of shifts and adds internally and therefore used multiple cycles. I would bet that
MUL EAX,3
executes faster than
MOV EBX,EAX
SHL EAX,1
ADD EAX,EBX
The last processor where this sort of optimization might have been useful was probably the 486. (yes, this is biased to intel processors, but is probably representative of other architectures as well).
In any event, any reasonable compiler should be able to generate the smallest/fastest code. So always go with readability first.
As it's easy to measure yourself, why not do that? (Using gcc and time from Cygwin.)
/* test1.c */
int main()
{
    int result = 0;
    int times = 1000000000;
    while (--times)
        result = result * 3;
    return result;
}
machine:~$ gcc -O2 test1.c -o test1
machine:~$ time ./test1.exe
real 0m0.673s
user 0m0.608s
sys 0m0.000s
Run the test a couple of times and repeat for the other case.
If you want to peek at the assembly code, use gcc -S -O2 test1.c.
This would depend on the compiler, its configuration and the surrounding code.
You should not try and guess whether things are 'faster' without taking measurements.
In general you should not worry about this kind of nanoscale optimisation stuff nowadays - it's almost always a complete irrelevance, and if you were genuinely working in a domain where it mattered, you would already be using a profiler and looking at the assembly language output of the compiler.
It's not difficult to find out what the compiler is doing with your code (I'm using DevStudio 2005 here). Write a simple program with the following code:
int i = 45, j, k;
j = i * 3;
k = i + (i * 2);
Place a breakpoint on the middle line and run the code under the debugger. When the breakpoint is triggered, right-click on the source file and select "Go To Disassembly". You will now have a window with the code the CPU is executing. You will notice in this case that the last two lines produce exactly the same instructions, namely "lea eax,[ebx+ebx*2]" (no bit shifting and adding in this particular case). On a modern IA-32 CPU, it's probably more efficient to do a straight MUL rather than bit shifting, due to the pipelined nature of the CPU, which incurs a penalty when a freshly modified value is used too soon.
This demonstrates what aku is talking about, namely, compilers are clever enough to pick the best instructions for your code.
It does depend on the compiler you are actually using, but very probably they translate to the same code.
You can check it by yourself by creating a small test program and checking its disassembly.
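For instance, a minimal sketch (the file name is only illustrative) that puts both formulations side by side:
// times3.cpp -- compare the two formulations
long times_mul(long n) { return n * 3; }
long times_add(long n) { return n + (n * 2); }
Then g++ -O2 -S times3.cpp and a look at times3.s will, on x86-64 with a reasonably recent g++, typically show the same single lea instruction for both functions.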
Most compilers are smart enough to decompose an integer multiplication into a series of bit shifts and adds when that is profitable. I don't know about Windows compilers, but at least with gcc you can get it to spit out the assembly, and if you look at that you will probably see identical assembly for both ways of writing it.
It doesn't matter. I think there are more important things to optimize. How much time have you invested in thinking about and writing this question instead of coding and testing it yourself?
:-)
As long as you're using a decent optimising compiler, just write code that's easy for the compiler to understand. This makes it easier for the compiler to perform clever optimisations.
The fact that you are asking this question indicates that an optimising compiler knows more about optimisation than you do. So trust the compiler and use n * 3.
Have a look at this answer as well.
Compilers are good at optimising code such as yours. Any modern compiler would produce the same code for both cases and additionally replace * 2 by a left shift.
Trust your compiler to optimize little pieces of code like that. Readability is much more important at the code level. True optimization should come at a higher level.