I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to see what is going on under the hood, but the output is too cryptic for me. What techniques do people use to tackle this problem, and are there any good references on how to interpret the assembly listings of optimized code, or articles specific to the GCC toolchain that talk about this problem?
GCC's optimization passes work on an intermediate representation of your code in a format called GIMPLE.
Using the -fdump-* family of options, you can ask GCC to output intermediate states of the tree.
For example, feed this to gcc -c -fdump-tree-all -O3
unsigned fib(unsigned n) {
    if (n < 2) return n;
    return fib(n - 2) + fib(n - 1);
}
and watch as it gradually transforms from a simple exponential algorithm into a complex polynomial algorithm. (Really!)
A useful technique is to run the code under a good sampling profiler, e.g. Zoom under Linux or Instruments (with Time Profiler instrument) under Mac OS X. These profilers not only show you the hotspots in your code but also map source code to disassembled object code. Highlighting a source line shows the (not necessarily contiguous) lines of generated code that map to the source line (and vice versa). Online opcode references and optimization tips are a nice bonus.
Instruments: developer.apple.com
Zoom: www.rotateright.com
Not gcc, but when debugging in Visual Studio you have the option to intersperse assembly and source, which gives a good idea of what has been generated for what statement. But sometimes it's not quite aligned correctly.
The output of the gcc tool chain and objdump -dS isn't at the same granularity. This article on getting gcc to output source and assembly has the same options as you are using.
Adding the -L option (e.g., gcc -L -ahl) may provide slightly more intelligible listings.
The equivalent MSVC option is /FAcs (and it's a little better because it intersperses the source, machine language, and binary, and includes some helpful comments).
About one third of my job consists of doing just what you're doing: juggling C code around and then looking at the assembly output to make sure it's been optimized correctly (which is preferred to just writing inline assembly all over the place).
Game-development blogs and articles can be a good resource for the topic since games are effectively real-time applications in constant memory -- I have some notes on it, so does Mike Acton, and others. I usually like to keep Intel's instruction set reference up in a window while going through listings.
The most helpful thing is to get a good ground-level understanding of assembly programming generally first -- not because you want to write assembly code, but because having done so makes reading disassembly much easier. I've had a hard time finding a good modern textbook though.
To have GCC report the optimizations it applied, use:
-fopt-info-optimized
To see those that were not applied:
-fopt-info-missed
Beware that this output is sent to the standard error stream, so to capture or pipe it you have to redirect it (hint: 2>&1).
Here is an example:
g++ -O3 -std=c++11 -march=native -mtune=native \
    -fopt-info-optimized h2d.cpp -o h2d 2>&1
h2d.cpp:225:3: note: loop vectorized
h2d.cpp:213:3: note: loop vectorized
h2d.cpp:198:3: note: loop vectorized
h2d.cpp:186:3: note: loop vectorized
You can also check the interleaved source and assembly, after compiling with -g, using objdump -dS | c++filt, but that will not get you very far. Enjoy!
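To try the report on something small, a loop like the one below is a reasonable candidate. This is only a sketch: the file name saxpy.cpp and the function are made up for illustration, and whether GCC actually vectorizes the loop depends on the compiler version and target flags.

```cpp
// saxpy.cpp (hypothetical file name): a small loop that GCC's
// auto-vectorizer can usually handle.
// Assumed build line:
//   g++ -O3 -march=native -fopt-info-vec-optimized -c saxpy.cpp
// Any "loop vectorized" notes are printed on stderr.
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] over the common prefix of x and y.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    const float* xp = x.data();
    float* yp = y.data();
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Swapping -fopt-info-vec-optimized for -fopt-info-vec-missed should list the loops the vectorizer gave up on instead.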
Zoom from RotateRight ( http://rotateright.com ) is mentioned in another answer, but to expand on that: it shows you the mapping of source to assembly in what they call the "code browser". It's incredibly handy even if you're not an asm expert because they have also integrated assembly documentation into the app. And the assembly listing is annotated with comments and timing for several CPU types.
You can just open your object or executable file with Zoom and take a look at what the compiler has done with your code.
Victor, in your case the optimization you are looking for is just a smaller allocation of local memory on the stack. You should see a smaller allocation at function entry and a smaller deallocation at function exit if the space used by the empty class is optimized away.
As for the general question, I've been reading (and writing) assembly language for more than (gulp!) 30 years and all I can say is that it takes practice, especially to read the output of a compiler.
Instead of trying to read through an assembler dump, run your program inside a debugger. You can pause execution, single-step through instructions, set breakpoints on the code you want to check, etc. Many debuggers can display your original C code alongside the generated assembly so you can more easily see what the compiler did to optimize your code.
Also, if you are trying to test a specific compiler optimization you can create a short dummy function that contains the type of code that fits the optimization you are interested in (and not much else, the simpler it is the easier the assembly is to read). Compile the program once with optimizations on and once with them off; comparing the generated assembly code for the dummy function between builds should show you what the compiler's optimizers did.
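A minimal sketch of such a dummy function, assuming the optimization of interest is constant folding of a simple loop. The file name probe.cpp and the function name are invented for the example.

```cpp
// probe.cpp (hypothetical file name): one tiny function, nothing else.
// Assumed commands for the comparison:
//   g++ -O0 -S probe.cpp -o probe_O0.s
//   g++ -O2 -S probe.cpp -o probe_O2.s
//   diff probe_O0.s probe_O2.s
// extern "C" keeps the symbol name unmangled so it is easy to find
// in the listing.
extern "C" int sum_to_ten() {
    int total = 0;
    for (int i = 1; i <= 10; ++i) {
        total += i;  // 1 + 2 + ... + 10
    }
    return total;
}
```

In the -O0 listing you should find the loop spelled out with loads and stores to the stack. At -O2 an optimizing compiler will typically replace the whole body with a single instruction that loads the constant 55, though the exact output depends on the compiler and target.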
-- snipped from chat.so --
I am stuck with gcc 4.6.2 on a certain project and after profiling with intel VTune
i noticed that very insignificant functions were not being inlined (or at least showed up under hotspots, which I assumed meant a failed inline)
an example function is a reinterpret cast, 2 numeric additions, and a ternary statement
i BELIEVE these are being inlined in Windows, but due to the profiling, think they are not being inlined in linux under gcc 4.6.2
I am attempting to get an ICC build working in linux (works in windows), but that'll take a little time
until then, does anyone know if GCC 4.6.2 is that different from VS2010 in terms of relatively simple compiler optimizations? I've turned on -O3 in GCC
what led me to this is that this is a rewrite of a significant section of code, and on Windows, the performance is approximately equal or a little slower, while on Linux it is at least 2x as slow.
The most informative answer would help me understand the steps required to verify inlining across platforms and how best to approach this situation as I understand these things are extremely situation-specific.
EDIT: Also, assuming that business-specific reasons force me to stick with GCC 4.6.2, what can I do about this without rewriting the code to make it less maintainable?
Thanks!
First the super-obvious for completeness: Are you absolutely sure that all the files doing the probably non-inlined calls were compiled with -O3?
The gcc and VS compiler and tool chains are sufficiently different that it wouldn't surprise me at all if their optimizers behaved rather differently.
Next let me observe that the ternary operator can be very deceiving. Ternary operators are almost certainly going to create a branch and potentially constructor calls, conversions, etc. Don't assume that just because it's a terse operator in C++ the compiler will be able to generate a tiny amount of code for it. This could potentially inhibit the compiler from optimizing it. In fact, you could try reworking the ternary code into a normal if statement and see if that helps your performance at all.
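Rather than guessing, you can test this directly. The two functions below are a made-up pair (the names and the file name select.cpp are assumptions) that do the same thing with a ternary and with an if statement, so their generated code can be compared side by side.

```cpp
// select.cpp (hypothetical file name).
// Assumed command: g++ -O3 -S select.cpp, then compare the two
// function bodies in select.s.

// Clamp v to an upper limit using the ternary operator.
extern "C" int clamp_ternary(int v, int limit) {
    return v > limit ? limit : v;
}

// The same logic written as an ordinary if statement.
extern "C" int clamp_if(int v, int limit) {
    if (v > limit) {
        return limit;
    }
    return v;
}
```

For plain integers like these, optimizing compilers often emit the same instructions for both forms, but that is not guaranteed and may differ once class types or conversions are involved, so it is worth checking with your own types.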
Then once you've moved on to further diagnostics, an easy thing to try is to use strings <binary> | grep function and see if the function name shows up in the binary at all. If it doesn't then it's definitely being inlined (although even if it shows up it could be strictly debug information and not actual code). There are other tools such as nm, readelf, elfdump, and dump that can introspect binaries for symbols as well. You would need to see which tools are available on your platform and then try to use them to find the function(s) in question.
Another idea is to load the compiled binary into gdb, and ask it to disassemble the code at the file and line at the point where the function call is made. Then you can read the disassembly code to see what the compiler did. Most of the code should actually be fairly obvious. You will likely see something like a call instruction if an actual function call was made.
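If you want a known reference for what an inlined versus a non-inlined call looks like in the disassembly, a small probe like this can help. It is only a sketch: the function and file names are hypothetical, and the attributes are GCC/Clang extensions rather than standard C++.

```cpp
// inline_probe.cpp (hypothetical file name).
// Assumed commands:
//   g++ -O3 -c inline_probe.cpp
//   objdump -d inline_probe.o | c++filt

// Ask the compiler never to inline this one.
__attribute__((noinline)) int add_offset_call(int x) {
    return x + 42;
}

// Ask the compiler to inline this one even where its heuristics
// would otherwise decline.
__attribute__((always_inline)) inline int add_offset_inline(int x) {
    return x + 42;
}

// Uses both helpers so the two cases appear in one function body.
int use_both(int x) {
    return add_offset_call(x) + add_offset_inline(x);
}
```

In the disassembly of use_both you would normally expect one call instruction (to add_offset_call) and no call for add_offset_inline. The always_inline attribute is also one way to force inlining of small hot functions on a compiler whose heuristics refuse, without restructuring the code.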
Is it possible to somehow convert simple C or C++ code (by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language? If there isn't, I'll just do it manually, but since I'm going to be doing it for processors like the Intel 8080, it just seemed a bit tedious. Can you somehow automate the process?
Also, if there is a way, how good (as in: elegant) would the output assembly file source code be when compared to just translating it manually?
Most compilers will let you produce assembly output. For a couple of obvious examples, Clang and gcc/g++ use the -S flag, and MS VC++ uses the -Fa flag to do so.
A few compilers don't support this directly (e.g., if memory serves Watcom didn't). The ones I've seen like this had you produce an object file, and then included a disassembler that would produce an assembly language file from the object file. I don't remember for sure, but it wouldn't surprise me if this is what you'd need to do with the Digital Mars compiler.
To somebody who's accustomed to writing assembly language, the output from most compilers typically tends to look at least somewhat inelegant, especially on a CPU like an x86 that has quite a few registers that are now really general purpose, but have historically had more specific meanings. For example, if some piece of code needs both a pointer and a counter, a person would probably put the pointer in ESI or EDI, and the counter in ECX. The compiler might easily reverse those. That'll work fine, but an experienced assembly language programmer will undoubtedly find it more readable using ESI for the pointer and ECX for the counter.
Take a look at gcc -S:
gcc -S hello.c # outputs hello.s file
Other compilers that maintain at least partial gcc compatibility may also accept this flag. LLVM's clang, for example, does.
Well, yes there is such a program. It's called "Compiler"
To answer your edit: The elegance of the output depends on the optimization of your compiler. Usually compilers do not generate code we humans would call "elegant".
Most folks here are right, but seem to have missed the note about 8080 (no wonder, it's not in the title :). However, Google comes to the rescue as always - looking for compiler for 8080 produces some nice results like these:
http://www.bdsoft.com/resources/bdsc.html
http://tack.sourceforge.net/
Most of these are pretty old and might be poorly maintained. You might also try the 8085, which is fairly similar.
"(by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language?"
Looking at the output of an x86 compiler is not going to be very helpful, since inputting and outputting are done by a C or C++ library. With an 8080 there is no such library so you have to develop your own I/O routines for some particular hardware. That's lots and lots of additional work.
I have a task to create optimized C++ source code and give it to a friend for compilation. This means that I do not control the final compilation; I just write the source code of the C++ program.
I know that I can enable optimization during compilation with GCC's -O1 (and -O2 and other) options. But how can I get this optimized source code instead of a compiled program? I am not able to configure the parameters of my friend's compiler, which is why I need to produce good source on my side.
The optimizations performed by GCC are low-level, which means you won't get C++ code back, only assembly code at best. There is no way to convert that back into optimized C++ source.
In sum: Optimize the source code on code level, not on object level.
You could ask GCC to dump its internal (GIMPLE, ...) representations at various "stages". The middle-end of GCC is made of hundreds of passes, and you could ask GCC to dump them with arguments like -fdump-tree-all or -fdump-rtl-all; beware that you can get hundreds of dump files for a single compilation!
However, GCC internal representations are quite low level, and you should not expect to understand them without reading a lot of material.
The dump options I am mentioning are mostly useful to those working inside GCC, or extending it through plugins coded in C or extensions coded in MELT (a high-level domain specific language to extend GCC). I am not sure they will be very useful to your friend. However, they can help you understand that optimization passes do a lot of complex processing.
And don't forget that premature optimization is evil: you should first make your program run correctly, then benchmark and profile it, and only then optimize the few parts that are worth the effort. You probably won't be able to write correct and efficient programs without testing and running them yourself before giving them to your friend.
Easy - choose the best algorithm possible, let the rest be handled by the optimizer.
Optimizing the source code is different than optimizing the binary. You optimize the source code, the compiler will optimize the binary.
For anything more than algorithm choice, you'll need to do some profiling. Sure, there are practices that can speed up code, but some make the code less readable. Only optimize when you have to, and after you measure.
I'm considering picking up some very rudimentary understanding of assembly. My current goal is simple: VERY BASIC understanding of GCC assembler output when compiling C/C++ with the -S switch for x86/x86-64.
Just enough to do simple things such as looking at a single function and verifying whether GCC optimizes away things I expect to disappear.
Does anyone have/know of a truly concise introduction to assembly, relevant to GCC and specifically for the purpose of reading, and a list of the most important instructions anyone casually reading assembly should know?
You should use GCC's -fverbose-asm option. It makes the compiler output additional information (in the form of comments) that makes it easier to understand the assembly code's relationship to the original C/C++ code.
If you're using gcc or clang, the -masm=intel argument tells the compiler to generate assembly with Intel syntax rather than AT&T syntax, and the --save-temps argument tells the compiler to save temporary files (preprocessed source, assembly output, unlinked object file) in the directory GCC is called from.
Getting a superficial understanding of x86 assembly should be easy with all the resources out there. Here's one such resource: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html .
You can also just use a disassembler (for example objdump -d) and gdb to see what a compiled program is doing.
I usually hunt down the processor documentation when faced with a new device, and then just look up the opcodes as I encounter ones I don't know.
On Intel, thankfully the opcodes are somewhat sensible. PowerPC not so much in my opinion. MIPS was my favorite. For MIPS I borrowed my neighbor's little reference book, and for PPC I had some IBM documentation in a PDF that was handy to search through. (And for Intel, mostly I guess and then watch the registers to make sure I'm guessing right! heh)
Basically, the assembly itself is easy. It basically does three things: move data between memory and registers, operate on data in registers, and change the program counter. Mapping between your language of choice and the assembly will require some study (e.g. learning how to recognize a virtual function call), and for this an "integrated" source and disassembly view (like you can get in Visual Studio) is very useful.
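As a practice case for the virtual function call mentioned above, something like the following can be compiled and inspected. The class and function names are invented for the example, and what the disassembly looks like will vary by compiler, ABI, and optimization level.

```cpp
// virtual_probe.cpp (hypothetical file name).
// Assumed command: g++ -O2 -S -fverbose-asm virtual_probe.cpp
struct Shape {
    virtual ~Shape() {}
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

// Only the static type Shape is known here, so the compiler generally
// has to load the vtable pointer from the object, load the function
// address from the vtable, and make an indirect call or jump.
int count_sides(const Shape& s) {
    return s.sides();
}

// Here the dynamic type is visible, so an optimizing compiler may
// devirtualize the call and even fold the result to a constant.
int triangle_sides() {
    Triangle t;
    return count_sides(t);
}
```

Comparing the bodies of count_sides and triangle_sides in the listing gives you one example of each pattern: an indirect call through a register or memory operand versus a direct (or eliminated) call.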
"casually reading assembly" lol (nicely)
I would start by following along in gdb at run time; you get a better feel for what's happening. But then maybe that's just me. It will disassemble a function for you (disass func), and then you can single-step through it.
If you are doing this solely to check the optimizations - do not worry.
a) the compiler does a good job
b) you won't be able to understand what it is doing anyway (nobody can)
Unlike higher-level languages, there's really not much (if any) difference between being able to read assembly and being able to write it. Instructions have a one-to-one relationship with CPU opcodes -- there's no complexity to skip over while still retaining an understanding of what the line of code does. (It's not like a higher-level language where you can see a line that says "print $var" and not need to know or care about how it goes about outputting it to screen.)
If you still want to learn assembly, try the book Assembly Language Step-by-Step: Programming with Linux, by Jeff Duntemann.
I'm sure there are introductory books and web sites out there, but a pretty efficient way of learning it is actually to get the Intel references and then try to do simple stuff (like integer math and Boolean logic) in your favorite high-level language and then look what the resulting binary code is.
Is there a method to automatically find the best compiler options (on a given machine), which result in the fastest possible executable?
Naturally, I use g++ -O3, but there are additional flags that may make the code run faster, e.g. -ffast-math and others, some of which are hardware-dependent.
Does anyone know some code I can put in my configure.ac file (GNU autotools), so that the flags will be added to the Makefile automatically by the ./configure command?
In addition to automatically determining the best flags, I would be interested in some useful compiler flags that are good to use as a default for most optimized executables.
Update: Most people suggest to just try different flags and select the best ones empirically. For that method, I'd have a follow-up question: Is there a utility that lists all compiler flags that are possible for the machine I'm running on (e.g. tests if SSE instructions are available etc.)?
I don't think you can do this at configure-time, but there is at least one program which attempts to optimize gcc option flags given a particular executable and machine. See http://www.coyotegulch.com/products/acovea/ for example.
You might be able to use this with some knowledge of your target machine(s) to find a good set of options for your code.
Um - yes. This is possible. Look into profile-guided optimization.
Some compilers provide a "-fast" option to automatically select the most aggressive optimization for the given compilation host. http://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler
Unfortunately, g++ does not provide similar flags.
As a follow-up to your next question: for g++ you can use the -mtune option together with -O3, which will give you reasonably fast defaults. The challenge then is to find the processor type of your compilation host. You may want to look at the autoconf macro archive to see whether somebody has written the necessary tests. Otherwise, assuming Linux, you have to parse /proc/cpuinfo to get the processor type.
After some googling, I found this script: gcccpuopt.
On one of my machines (32bit), it outputs:
-march=pentium4 -mfpmath=sse
On another machine (64bit) it outputs:
$ ./gcccpuopt
Warning: The optimum *32 bit* architecture is reported
-m32 -march=core2 -mfpmath=sse
So, it's not perfect, but might be helpful.
See also the -march=native and -mtune=native gcc options (-mcpu=native on some non-x86 targets).
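One way to see what a given set of flags actually enables is to look at the macros the compiler predefines. The helper below is a sketch with a made-up name; it only checks a handful of well-known GCC/Clang feature macros and is not a complete list.

```cpp
// simd_probe.cpp (hypothetical file name).
// Compile it once with plain -O3 and once with -O3 -march=native and
// compare what the function reports. The full macro list can be dumped
// with: g++ -march=native -dM -E - < /dev/null
#include <string>

// Returns a space-separated list of SIMD extensions the compiler was
// told it may use for this translation unit.
std::string enabled_simd_features() {
    std::string out;
#ifdef __SSE2__
    out += "sse2 ";
#endif
#ifdef __SSE4_2__
    out += "sse4.2 ";
#endif
#ifdef __AVX__
    out += "avx ";
#endif
#ifdef __AVX2__
    out += "avx2 ";
#endif
#ifdef __ARM_NEON
    out += "neon ";
#endif
    if (out.empty()) {
        out = "none-detected";
    }
    return out;
}
```

Note that this reflects what the compiler was allowed to generate, not what the CPU running the program supports at that moment; a binary built with -march=native on one machine can fail on an older one.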
Is there a method to automatically find the best compiler options (on a given machine), which result in the fastest possible executable?
No.
You could compile your program with a large assortment of compiler options, then benchmark each and every version, then select the one that is "fastest," but that's hardly reliable and probably not useful for your program.
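If you do go the brute-force route, the timing part can be as simple as the sketch below. Everything here is an assumption for illustration: the workload is a stand-in for your real hot path, and the names are invented. You would build the same file once per flag combination and compare the numbers.

```cpp
// flag_bench.cpp (hypothetical file name).
#include <chrono>
#include <cstdint>

// Stand-in workload; replace with the code you actually care about.
std::uint64_t workload(std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        acc += (i * i) % 7;
    }
    return acc;
}

// Runs the workload `repeats` times and returns the best (smallest)
// wall-clock time in seconds, or -1.0 if repeats is not positive.
double best_seconds(std::uint64_t n, int repeats) {
    using clock = std::chrono::steady_clock;
    // Reading the input through a volatile and storing the result to a
    // volatile discourages the optimizer from hoisting or deleting the
    // call we are trying to time.
    volatile std::uint64_t input = n;
    volatile std::uint64_t sink = 0;
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        sink = workload(input);
        const std::chrono::duration<double> elapsed = clock::now() - start;
        if (best < 0.0 || elapsed.count() < best) {
            best = elapsed.count();
        }
    }
    (void)sink;
    return best;
}
```

Taking the minimum over several runs reduces noise from the rest of the system, but small differences between flag sets can still be within measurement error, which is part of why this approach is hard to trust.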
This is a solution that works for me, but it does take a little while to set up. In "Python Scripting for Computational Science" by Hans Petter Langtangen (an excellent book in my opinion), an example is given of using a short python script to do numerical experiments to determine the best compiler options for your C/Fortran/... program. This is described in Chapter 1.1.11 on "Nested Heterogeneous Data Structures".
Source code for examples from the book are freely available at http://folk.uio.no/hpl/scripting/index.html (I'm not sure of the license, so will not reproduce any code here), and in particular you can find code for a similar numerical test in the code in TCSE3-3rd-examples.tar.gz in the file src/app/wavesim2D/F77/compile.py , which you could use as a base for writing a script which is appropriate for a particular system/language (C++ in your case).
Optimizing your app is mainly your job, not the compiler's.
Here's an example of what I'm talking about.
Once you've done that, IF your app is compute-bound, with hotspots in your code (not in library code) THEN the compiler optimizations for speed will make some difference, so you can try different flag combinations.