Do optimizers (generally speaking here) take my c/c++ code and write better c/c++ code or do they translate it straight into assembly and then optimize that. Or is it a combo?
EDIT:
I am using gcc (but I would like to know what others do also)
Optimizers can be at different levels, but usually they won't generate new readable code (although sometimes this happens with other languages, like JavaScript for example.)
GCC generates an intermediate representation:
http://www.tldp.org/HOWTO/GCC-Frontend-HOWTO-4.html
Optimizations are then applied to this structure. See more here, for example:
https://gcc.gnu.org/onlinedocs/gccint/Tree-SSA.html
From there, the backend translates it to final machine code (although I believe this part also involves optimizations, as well.)
Do optimizers ...
Well, optimizers (or better optimization strategies) come with particular compiler implementations.
There's no general answer for your question
and write better c/c++ code or do they translate it straight into assembly
No, their job is to optimize the backend code, which might be target assembly or whatever intermediate machine code. Thus there's no intermediate optimized c++ code to be expected.
Optimizer don't rewrite c/c++ code.
The compiler does a lexical analysis, and then makes a semantic analysis using some kind of internal graph representation of your code. The optimizer first works on this internal representation to identify and optimize the flow of execution (for example constant propagation).
Once the code generation can start, the optimizer intervenes again, to make macine dependent optimization (register allocation, special instruction sets such as intel's MMX, etc...)
Only at the end does it generate assembler code.
-- snipped from chat.so --
I am stuck with gcc 4.6.2 on a certain project and after profiling with intel VTune
i noticed that very insignificant functions were not being inlined (or at least showed up under hotspots, which I assumed meant a failed inline)
an example function is a reinterpret cast, 2 numeric additions, and a ternary statement
i BELIEVE these are being inlined in Windows, but due to the profiling, think they are not being inlined in linux under gcc 4.6.2
I am attempting to get an ICC build working in linux (works in windows), but that'll take a little time
until then, does anyone know if GCC 4.6.2 is that different from VS2010 in terms of relatively simple compiler optimizations? I've turned on -O3 in GCC
what led me to this is that this is a rewrite of a significant section of code, and on Windows, the performance is approximately equal or a little slower, while on Linux it is at least 2x as slow.
The most informative answer would help me understand the steps required to verify inlining across platforms and how best to approach this situation as I understand these things are extremely situation-specific.
EDIT: Also, assuming that business-specific reasons force me to stick with GCC 4.6.2, what can I do about this without rewriting the code to make it less maintainable?
Thanks!
First the super-obvious for completeness: Are you absolutely sure that all the files doing the probably non-inlined calls were compiled with -O3?
The gcc and VS compiler and tool chains are sufficiently different that it wouldn't surprise me at all if their optimizers behaved rather differently.
Next let me observe that the ternary operator can be very deceiving. Ternary operators are almost certainly going to create a branch and potentially constructor calls, conversions, etc. Don't assume that just because it's a terse operator in C++ the compiler will be able generate a tiny amount of code for it. This could potentially inhibit the compiler from optmizing it. In fact, you could try reworking the ternary code into a normal if statement and see if that helps your performance at all.
Then once you've moved on to further diagnostics, an easy thing to try is to use strings <binary> | grep function and see if the function name shows up in the binary at all. If it doesn't then it's definitely being inlined (although even if it shows up it could be strictly debug information and not actual code). There are other tools such as nm, readelf, elfdump, and dump that can introspect binaries for symbols as well. You would need to see which tools are available on your platform and then try to use them to find the function(s) in question.
Another idea is to load the compiled binary into gdb, and ask it to disassemble the code at the file and line at the point where the function call is made. Then you can read the disassembly code to see what the compiler did. Most of the code should actually be fairly obvious. You will likely see something like a call instruction if an actual function call was made.
Consider this simple case scenario:
I download the pre-built binaries of a C++ compiler (say CLang or GCC or anything else) for my generic OS (that is not windows). I compile my code which consists of some computationally expensive mathematical calculation with optimization flag -O3 and I have an execution time of T1.
On a different attempt, this time instead of using pre-built binaries I download the source code and build the compiler by myself on my generic machine. I compile the same code with the same optimization flag, achieving execution time T2?
Will T2 < T1 or they will be more or less the same?
In other words, is the execution time independent from the way that compiler is built?
The compiler's optimization of your code is the result of the behavior of the compiler, not the performance of the compiler.
As long as the compiler has the same behavioral design, it will produce exactly the same output.
Generally the same compiler version should generate the same assembler code given the same C or C++ code input. However there are certain things that might further affect the code that is being execute when you run the compiler.
A distro might have backported (or even created own) patches from other versions.
Modern compilers often have library depenencies (e.g. cloog) that may have different behaviour in different versions, causing the compiler to make code generation decisions based on essentially other data
These libraries may (in some compiler versions) be optional at compile time (might need to give --enable switches to configure, or configure tries to autodetect them).
Compiler switches like -march=native will look on what hardware you compile and try to optimize accordingly.
a time limit in the compilers optimizer triggers, essentially making better optimizations on better machines; or the same for memory (I don't think thats to be found in modern compilers anymore though)
That said, even the same assembler might perform different on yours and their machine, e.g. because one is optimized for AMD, the other for intel.
In my opinion, and in theory, compilation speed can be faster, since you can say to "compiler which compile the compiler", "please target to my computer, and you can use my computer's processor's own machine code to optimize".
But I think compiler's optimization cannot be faster.. To make compiler's optimization faster, I think we need put something like new technology into compiler, not just re-compile.
That depends on how that compiler is implemented and on your platform, but the answer will be most likely "no".
If your platform provides specific functionality that can improve the performance of your program, the optimizer in your compiler might use that functionality to produce a faster program. The optimizer can do so only if the compiler writer was aware of the functionality and has implemented special treatment for your platform in the optimizer. If that is the case, the detection might be done dynamically in the optimizer, meaning any build of the optimizer can detect the platform and optimize your code. Only if the detection has to occur at compiletime of the optimizer for some reason, recompiling it on your platform could give that advantage. But if such a better build for your plaform exists, the compiler vendor most likely has provided binaries for it.
So, with all these ifs, it's unlikely that your program will be any faster when you recompile the compiler on your platform. There is a chance, however, that the compiler will be a bit faster if it is optimized to your platform rather than a generic binary, resulting on shorter compiletimes.
I remember reading somewhere that to really optimize & speed up certain section of the code, programmers write that section in Assembly language. My questions are -
Is this practice still done? and How does one do this?
Isn't writing in Assembly Language a bit too cumbersome & archaic?
When we compile C code (with or without -O3 flag), the compiler does some code optimization & links all libraries & converts the code to binary object file. So when we run the program it is already in its most basic form i.e. binary. So how does inducing 'Assembly Language' help?
I am trying to understand this concept & any help or links is much appreciated.
UPDATE: Rephrasing point 3 as requested by dbemerlin- Because you might be able to write more effective assembly code than the compiler generates but unless you are an assembler expert your code will propably run slower because often the compiler optimizes the code better than most humans can.
The only time it's useful to revert to assembly language is when
the CPU instructions don't have functional equivalents in C++ (e.g. single-instruction-multiple-data instructions, BCD or decimal arithmetic operations)
AND the compiler doesn't provide extra functions to wrap these operations (e.g. C++11 Standard has atomic operations including compare-and-swap, <cstdlib> has div/ldiv et al for getting quotient and remainder efficiently)
AND there isn't a good third-party library (e.g. http://mitpress.mit.edu/catalog/item/default.asp?tid=3952&ttype=2)
OR
for some inexplicable reason - the optimiser is failing to use the best CPU instructions
...AND...
the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.
Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:
the compiler knows how to do this equally well
to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code
you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everythingt through specific registers would serialise it
in fairness, GCC has ways to express needs for specific types of registers without constraining the CPU to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that addresses this
if a new CPU model comes out next year with another instruction that's 1000% faster for that same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are (or whomever's maintaining the software then is)
the compiler will select an optimal approach for the target architecture its told about: if you hardcode one solution then it will need to be a lowest-common-denominator or #ifdef-ed for your platforms
assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
other programmers may not know or be comfortable with assembly
One perspective that I think's worth keeping in mind is that when C was introduced it had to win over a lot of hardcore assembly language programmers who fussed over the machine code generated. Machines had less CPU power and RAM back then and you can bet people fussed over the tiniest thing. Optimisers became very sophisticated and have continued to improve, whereas the assembly languages of processors like the x86 have become increasingly complicated, as have their execution pipelines, caches and other factors involved in their performance. You can't just add values from a table of cycles-per-instruction any more. Compiler writers spend time considering all those subtle factors (especially those working for CPU manufacturers, but that ups the pressure on other compilers too). It's now impractical for assembly programmers to average - over any non-trivial application - significantly better efficiency of code than that generated by a good optimising compiler, and they're overwhelmingly likely to do worse. So, use of assembly should be limited to times it really makes a measurable and useful difference, worth the coupling and maintenance costs.
First of all, you need to profile your program. Then you optimize the most used paths in C or C++ code. Unless advantages are clear you don't rewrite in assembler. Using assembler makes your code harder to maintain and much less portable - it is not worth it except in very rare situations.
(1) Yes, the easiest way to try this out is to use inline assembly, this is compiler dependent but usually looks something like this:
__asm
{
mov eax, ebx
}
(2) This is highly subjective
(3) Because you might be able to write more effective assembly code than the compiler generates.
You should read the classic book Zen of Code Optimization and the followup Zen of Graphics Programming by Michael Abrash.
Summarily in the first book he explained how to use assembly programming pushed to the limits. In the followup he explained that programmers should rather use some higher level language like C and only try to optimize very specific spots using assembly, if necessary at all.
One motivation of this change of mind was that he saw that highly optimized programs for one generation of processor could become (somewhat) slow in the next generation of the same processor familly compared to code compiled from a high level language (maybe compiler using new instructions for instance, or performance and behavior of existing ones changing from a processor generation to another).
Another reason is that compilers are quite good and optimize aggressively nowaday, there is usually much more performance to gain working on algorithms that converting C code to assembly. Even for GPU (Graphic Cards processors) programming you can do it with C using cuda or OpenCL.
There are still some (rare) cases when you should/have to use assembly, usually to get very fine control on the hardware. But even in OS kernel code it's usually very small parts and not that much code.
There's very few reasons to use assembly language these days, even low-level constructs like SSE and the older MMX have built-in intrinsics in both gcc and MSVC (icc too I bet but I never used it).
Honestly, optimizers these days are so insanely aggressive that most people couldn't match even half their performance writing code in assembly. You can change how data is ordered in memory (for locality) or tell the compiler more about your code (through #pragma), but actually writing assembly code... doubt you'll get anything extra from it.
#VJo, note that using intrinsics in high level C code would let you do the same optimizations, without using a single assembly instruction.
And for what it's worth, there have been discussions about the next Microsoft C++ compiler, and how they'll drop inline assembly from it. That speaks volumes about the need for it.
I dont think you specified the processor. Different answers depending on the processor and the environment. The general answer is yes it is still done, it is not archaic certainly. The general reason is the compilers, sometimes they do a good job at optimizing in general but not really well for specific targets. Some are really good at one target and not so good at others. Most of the time it is good enough, most of the time you want portable C code and not non-portable assembler. But you still find that C libraries will still hand optimize memcpy and other routines that the compiler simply cannot figure out that there is a very fast way to implement it. In part because that corner case is not worth spending time on making the compiler optimize for, just solve it in assembler and the build system has a lot of if this target then use C if that target use C if that target use asm, if that target use asm. So it still occurs, and I argue must continue forever in some areas.
X86 is is own beast with a lot of history, we are at a point where you really cannot in a practical manner write one blob of assembler that is always faster, you can definitely optimize routines for a specific processor on a specific machine on a specific day, and out perform the compiler. Other than for some specific cases it is generally futile. Educational but overall not worth the time. Also note the processor is no longer the bottleneck, so a sloppy generic C compiler is good enough, find the performance elsewhere.
Other platforms which often means embedded, arm, mips, avr, msp430, pic, etc. You may or may not be running an operating system, you may or may not be running with a cache or other such things that your desktop has. So the weaknesses of the compiler will show. Also note that programming languages continue to evolve away from processors instead of toward them. Even in the case of C considered perhaps to be a low level language, it doesnt match the instruction set. There will always be times where you can produce segments of assembler that outperform the compiler. Not necessarily the segment that is your bottleneck but across the entire program you can often make improvements here and there. You still have to check the value of doing that. In an embedded environment it can and does make the difference between success and failure of a product. If your product has $25 per unit invested in more power hungry, board real estate, higher speed processors so you dont have to use assembler, but your competitor spends $10 or less per unit and is willing to mix asm with C to use smaller memories, use less power, cheaper parts, etc. Well so long as the NRE is recovered then the mixed with asm solution will in the long run.
True embedded is a specialized market with specialized engineers. Another embedded market, your embedded linux roku, tivo, etc. Embedded phones, etc all need to have portable operating systems to survive because you need third party developers. So the platform has to be more like a desktop than an embedded system. Buried in the C library as mentioned or the operating system there may be some assembler optimizations, but as with the desktop you want to try to throw more hardware at so the software can be portable instead of hand optimized. And your product line or embedded operating system will fail if assembler is required for third party success.
The biggest concern I have is that this knowledge is being lost at an alarming rate. Because nobody inspects the assembler, because nobody writes in assembler, etc. Nobody is noticing that the compilers have not been improving when it comes to the code being produced. Developers often think they have to buy more hardware instead of realizing that by either knowing the compiler or how to program better they can improve their performance by 5 to several hundred percent with the same compiler, sometimes with the same source code. 5-10% usually with the same source code and compiler. gcc 4 does not always produce better code than gcc 3, I keep both around because sometimes gcc3 does better. Target specific compilers can (not always do) run circles around gcc, you can see a few hundred percent improvement sometimes with the same source code different compiler. Where does all of this come from? The folks that still bother to look and/or use assembler. Some of those folks work on the compiler backends. The front end and middle are fun and educational certainly, but the backend is where you make or break quality and performance of the resulting program. Even if you never write assembler but only look at the output from the compiler from time to time (gcc -O2 -s myprog.c) it will make you a better high level programmer and will retain some of this knowledge. If nobody is willing to know and write assembler then by definition we have given up in writing and maintaining compilers for high level languages and software in general will cease to exist.
Understand that with gcc for example the output of the compiler is assembly that is passed to an assembler which turns it into object code. The C compiler does not normally produce binaries. The objects when combined into the final binary, are done by the linker, yet another program that is called by the compiler and not part of the compiler. The compiler turns C or C++ or ADA or whatever into assembler then the assembler and linker tools take it the rest of the way. Dynamic recompilers, like tcc for example, must be able to generate binaries on the fly somehow, but I see that as the exception not the rule. LLVM has its own runtime solution as well as quite visibly showing the high level to internal code to target code to binary path if you use it as a cross compiler.
So back to the point, yes it is done, more often than you think. Mostly has to do with the language not comparing directly to the instruction set, and then the compiler not always producing fast enough code. If you can get say dozens of times improvement on heavily used functions like malloc or memcpy. Or want to have a HD video player on your phone without hardware support, balance the pros and cons of assembler. Truly embedded markets still use assembler quite a bit, sometimes it is all C but sometimes the software is completely coded in assembler. For desktop x86, the processor is not the bottleneck. The processors are microcoded. Even if you make beautiful looking assembler on the surface it wont run really fast on all families x86 processors, sloppy, good enough code is more likely to run about the same across the board.
I highly recommend learning assembler for non-x86 ISAs like arm, thumb/thumb2, mips, msp430, avr. Targets that have compilers, particularly ones with gcc or llvm compiler support. Learn the assembler, learn to understand the output of the C compiler, and prove that you can do better by actually modifying that output and testing it. This knowledge will help make your desktop high level code much better without assembler, faster and more reliable.
It depends. It is (still) being done in some situations, but for the most part, it is not worth it. Modern CPUs are insanely complex, and it is equally complex to write efficient assembly code for them. So most of the time, the assembly you write by hand will end up slower than what the compiler can generate for you.
Assuming a decent compiler released within the last couple of years, you can usually tweak your C/C++ code to gain the same performance benefit as you would using assembly.
A lot of people in the comments and answers here are talking about the "N times speedup" they gained rewriting something in assembly, but that by itself doesn't mean too much. I got a 13 times speedup from rewriting a C function evaluating fluid dynamics equations in C, by applying many of the same optimizations as you would if you were to write it in assembly, by knowing the hardware, and by profiling. At the end, it got close enough to the theoretical peak performance of the CPU that there would be no point in rewriting it in assembly. Usually, it's not the language that's the limiting factor, but the actual code you've written. As long as you're not using "special" instructions that the compiler has difficulty with, it's hard to beat well-written C++ code.
Assembly isn't magically faster. It just takes the compiler out of the loop. That is often a bad thing, unless you really know what you're doing, since the compiler performs a lot of optimizations that are really really painful to do manually. But in rare cases, the compiler just doesn't understand your code, and can't generate efficient assembly for it, and then, it might be useful to write some assembly yourself. Other than driver development or the like (where you need to manipulate the hardware directly), the only place I can think of where writing assembly may be worth it is if you're stuck with a compiler that can't generate efficient SSE code from intrinsics (such as MSVC). Even there, I'd still start out using intrinsics in C++, and profile it and try to tweak it as much as possible, but because the compiler just isn't very good at this, it might eventually be worth it to rewrite that code in assembly.
Take a look here, where the guy improved performances 6 times using assembly code. So, the answer is : it is still being done, but the compiler is doing pretty good job.
"Is this practice still done?"
--> It is done in image processing, signal processing, AI (eg. efficient matrix multiplication), and other. I would bet the processing of the scroll gesture on my macbook trackpad is also partially assembly code because it is immediate.
--> It is even done in C# applications (see https://blogs.msdn.microsoft.com/winsdk/2015/02/09/c-and-fastcall-how-to-make-them-work-together-without-ccli-shellcode/)
"Isn't writing in Assembly Language a bit too cumbersome & archaic?"
--> It is a tool like a hammer or a screwdriver and some tasks require a watchmaker screwdriver.
"When we compile C code (with or without -O3 flag), the compiler does some code optimization ... So how does inducing 'Assembly Language' help?"
--> I like what #jalf said, that writing C code in a way you would write assembly will already lead to efficient code. However to do this you must think how you would write the code in assembly language, so eg. understand all places where data is copied (and feel pain each time it is unnecessary).
With assembly language you can be sure which instructions are generated. Even if your C code is efficient there is no guarantee that the resulting assembly will be efficient with every compiler. (see https://lucasmeijer.com/posts/cpp_unity/)
--> With assembly language, when you distribute a binary, you can test for the cpu and make different branches depending on the cpu features as optimized for for AVX or just for SSE, but you only need to distribute one binary. With intrinsics this is also possible in C++ or .NET Core 3. (see https://devblogs.microsoft.com/dotnet/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/)
On my work, I used assembly on embedded target (micro controller) for low level access.
But for a PC software, I don't think it is very usefull.
I have an example of assembly optimization I've done, but again it's on an embedded target. You can see some examples of assembly programming for PCs too, and it creates really small and fast programs, but usually not worth the effort (Look for "assembly for windows", you can find some very small and pretty programs).
My example was when I was writing a printer controller, and there was a function that was supposed to be called every 50 micro-seconds. It has to do reshuffling of bits, more or less. Using C I've been able to do it in about 35microseconds, and with assembly I've done it in about 8 microseconds. It's a very specific procedure but still, something real and necessary.
On some embedded devices (phones and PDAs), it's useful because the compilers are not terribly mature, and can generate extremely slow and even incorrect code. I have personally had to work around, or write assembly code to fix, the buggy output of several different compilers for ARM-based embedded platforms.
Yes. Use either inline assembly or link assembly object modules. Which method you should use depends on how much assembly code you need to write. Usually it's OK to use inline assembly for a couple of lines and switch to separate object modules once if it's more than one function.
Definitely, but sometimes it's necessary. The prominent example here would be programming an operating system.
Most compilers today optimize the code you write in a high-level language much better than anyone could ever write assembly code. People mostly use it to write code that would otherwise be impossible to write in a high-level language like C. If someone uses it for anything else means he is either better at optimization than a modern compiler (I doubt that) or just plain stupid, e.g. he doesn't know what compiler flags or function attributes to use.
use this:
__asm__ __volatile__(/*assembly code goes here*/);
the __asm__ can also just be asm.
The __volatile__ stops the compiler from making further optimizations.
Code duplication is usually bad and often quite easy to spot. I suppose that compilers could detect it automatically in easiest cases - they already parse the text and get the intermediate representation that they analyze in various ways - detect suspicious patterns like uninitialized variables, optimize emitted code, etc. I guess they could often detect functionally duplicate code this way as well and account for it while emitting machine code.
Are there C++ compilers that can detect duplicate code and only emit corresponding machine code once instead of for each duplicate in the source text?
Some do, some don't.
From the LLVM optimization's page: -mergefunc (MergeFunctions pass, how it works)
The functions are separated in small blocks in the LLVM Intermediate Representation, this optimization pass tries to merge similar blocks. It's not guaranteed to succeed though.
You'll find plenty of other optimizations on this page, even though some of them may appear cryptic at first glance.
I would add a note though, that duplicate code isn't so bad for the compiler / executable, it's bad from a maintenance point of view, and there is nothing a compiler can do about it.
I think the question makes the false assumption that compilers would always want to eliminate code duplication. code duplication is bad for readability/maintainability of source code not necesarily performance of compiled code, indeed one could consider loop unrolling as a compiler adding duplicate code to increase speed. compiled code does not need to follow the same principles as source code and generally doesn't as it is for the machine not for humans to read.
generally compilers are busy compiling not transforming source code, of course IDEs may allow both.
From my knowledge, the code elimination does not usually happen across the functions. So if you write some duplicate piece of code in two different functions there are very less chances(close to none) that piece of code will be eliminated.
There are some optimizations like return value optimization, function inlining which can happen across functions. However most of the optimization is done within the function itself.This is not usually done at the higher language level, by this i mean that the compiler wont look at the C++ code and start optimizing it. Compilers mostly have an intermediary representation, between high level language(C++) and machine language. This intermediary representation(IR) is somewhat similar to machine language but is not exactly the machine language of the system on which code is compiled. Refer to the wiki page http://en.wikipedia.org/wiki/Compiler_optimization, it lists some of those optimizations
Visual C++ does this if you specify 'minimize code size' (/O1). The function provided is described in the docs for /Og, which is deprecated in favour of simpler catch-all options to favor size or favor speed (/O2).