I have heard that big projects with a large number of warnings build significantly slower than ones with only a few warnings (assuming, of course, that the compiler is set to a high warning level).
Is there any reasonable explanation for this, or can anyone share their experience on the topic?
With the GCC compilers (gcc for C, g++ for C++), warnings do take a small amount of CPU time. Use e.g. gcc -ftime-report if you want a detailed report of compiler timing. Warning diagnostics also depend upon the optimization level.
But optimizations (especially at a high level, like -O2 or more) take much more time than warnings. Empirically, optimized compilation time is proportional to the size of the compilation unit and to the square of the size of the biggest function (measured e.g. in number of Gimple instructions, or in lines of C code). So if you have huge functions (e.g. a function of ten thousand lines in some generated C code) you may want to split them into smaller pieces.
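If you want to reproduce that kind of measurement yourself, here is a minimal, hypothetical generator (nothing to do with MELT's actual code) that emits one function with N statements; compiling its output with gcc -O2 -ftime-report for increasing N is one way to observe how compile time grows with the size of a single function:

#include <stdio.h>
#include <stdlib.h>

/* Emit a single C function with n statements to standard output. */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 10000;
    printf("int big_function(int s)\n{\n");
    for (int i = 0; i < n; i++)
        printf("  s = s * %d + %d;\n", i + 1, i);
    printf("  return s;\n}\n");
    return 0;
}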
In the early days of MELT (a GCC plugin and experimental GCC branch, GPLv3+ licensed, implementing a DSL to extend GCC, which I developed and am still working on), it generated huge initialization functions in C (today this is less the case; the initialization is split into many C++ functions; see e.g. gcc/melt/generated/warmelt-base.cc from the MELT branch of GCC as an example). At that time, I plotted the -O2 compilation time against the length of that initialization function. You could also experiment with the manydl.c code. Again, the square of the biggest function's length is an empirical observation, but it might be explained by register allocation issues. Also, J. Pitrat observed that huge generated C functions - produced by his interesting CAIA system - exhaust the compiler.
Also, warnings are output, and sometimes, the IDE or the terminal reading the compiler output may be slowed down if you have a lot of warnings.
Of course, as commented several times, compiler warnings are your friends (so always compile with e.g. gcc -Wall). So please improve your code until it produces no warnings at all. (In particular, initialize most of your local variables - I usually initialize all of them; the compiler can remove an initialization if it can prove it is useless.)
BTW, you could customize GCC with e.g. MELT to add your own customized warnings (e.g. to check some coding rules conformance).
Also, in C++ with weird templates, you could write a few dozen lines of code that take many hours to compile (or even crash the compiler due to lack of memory, see this question).
NB. In 2019, GCC MELT is dead; its domain gcc-melt.org has disappeared, but the web pages are archived here.
It depends a lot on what the warnings actually are.
For example, if there are lots of "variable is unused" warnings and "condition in 'if' is always true/false" warnings, that may mean there is a lot of unnecessary code that the compiler has to parse and then remove during optimisation.
For other warnings there may be other consequences. For example, consider a "variable is self initialising" warning caused by something like int i = i;. I'd imagine this could add a whole pile of complications/overhead (where the compiler attempts to determine if the variable is "live" or can be optimised out).
This will likely depend extensively on the compiler, and how it is implemented.
That being said, there are two sure sources of slow-down:
Printing the warnings themselves is a non-trivial task: it requires extensive formatting, potentially reading the source file back, plus all those notes (macro expansions, template instantiations), and finally pushing all of that to an I/O device.
Emitting said warnings, with all that macro expansion and template instantiation data, might be non-trivial too. Furthermore, if they are collected first and only emitted at the end of the compilation process (instead of being streamed as they are produced), the growing memory consumption will also slow you down (requiring more pages to be provided by the OS, ...).
In general, in terms of engineering, I do not expect compiler writers to worry much about the cost of emitting diagnostics; as long as the cost is reasonable, there is little incentive to optimize away a couple of milliseconds when human intervention is going to be required anyway.
Who cares if your build takes 10% longer due to the compiler printing tons of warnings? The problem is the code you're being warned about, not the extra time it takes. Besides, 10% is probably a huge overestimate for the overhead of printing even very large numbers of warnings.
Related
I was involved in a debugging situation where I had no PDBs at all (unfortunately this still happens). In that particular case I was investigating a stack corruption and tried to do a manual stack walk. However, there was a strong mismatch between the ESP and EBP registers, so I think the code was compiled with the /Oy optimization (frame pointer omission) turned on. In that case, I had to give up.
My question now is: of the Visual Studio 2015 C++ compiler switches for optimization, which ones will make debugging hard? A short explanation of why it becomes hard would be great.
To limit the scope of the question, answers should consider x86 (32 bit) only.
The available options can be found on MSDN; they are:
O1 - creates small code
Os - favors small code
O2 - creates fast code
Ot - favors fast code
Ob - controls inline expansion
Oi - generate intrinsic functions
The following ones do not need to be considered:
Od - disables optimization. This will obviously cause the least trouble
Og - is deprecated
Ox - is just a combination of others. This will obviously cause the sum of troubles of the individual switches
Oy - omit frame pointers. I already know about this one. It makes stack walking really difficult; it becomes mostly guesswork.
Wow, there are a lot of different types of code optimisation (way more than I have detailed knowledge of), but I will try to describe the optimisations that significantly affect the debugging experience; knowing what an optimisation does can help you work out which compiler switches will enable it.
Reordering instructions to prevent the CPU idling: this generally produces more code, and the debugger will appear to jump around the code rather than executing it linearly.
Reducing code to a compile-time constant: smaller and faster code; this code will be skipped over when debugging.
Omitting frame pointers produces smaller code but makes it more difficult to walk the stack.
Efficient register usage: this will cause variables to be unreadable or wrong before they go out of scope, because the compiler has decided the variable can be safely retired early and its register freed up for use elsewhere, instead of needlessly pushing its value to the stack. This produces smaller and faster code.
Inlining: inlining produces fatter but faster code, and the inlined functions will not appear in the call stack. (A small example of these effects follows this list.)
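As a rough illustration of the constant folding, dead-code removal and inlining mentioned above, consider this tiny program (purely illustrative). Built with /O2 or -O2, the debugger will typically report tmp and unused as optimized away and appear to skip these lines entirely:

static int square(int x) { return x * x; }   /* likely inlined at /O2 or -O2 */

int main(void)
{
    int tmp = square(21);   /* probably folded to the constant 441         */
    int unused = tmp + 1;   /* dead store, probably removed entirely       */
    (void)unused;
    return tmp;             /* may compile down to "return 441;"           */
}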
The blanket optimisation flags are frowned upon these days; Profile Guided Optimisation is by far the preferable way to optimise a release build. If you want to debug release-ish code, you should use the /Zo flag, which produces better PDBs that give you more information about inlined functions and variables held in registers.
GCC and Clang have a per-optimisation set of flags; the -O flags are just amalgamations of those flags. Looking at how GCC distributes the optimisations across levels, and at what each optimisation does, will further your understanding of all the different optimisations compilers do in general.
EDIT: Also, if you want to turn on individual compiler flags and see what they do to the compiled code for GCC and Clang, Godbolt is really useful: https://gcc.godbolt.org
As far as I know, debugging problems are mostly caused by the omission of information from the compiled binary.
Some of the causes are:
frame pointers omission: /Oy, as you already found out
inlining: /Ob2 is in effect when /O1, /O2 or /Ox is used
code reduction: hard to know how this maps exactly to /O VC++ compiler options
better register usage
(to some degree) intrinsic functions: /Oi
My personal opinion is that a binary without debugging information (a .PDB file) is usually "hard enough" to debug that it is not worth my time :-)
My understanding is that compilers follow certain semantics that decide whether or not a function should be expanded inline. For example, if the callee unconditionally returns a value (no if/else-if around the return), it may be expanded in the caller itself. Similarly, function call overhead can also guide this expansion. (I may be completely wrong.)
Similarly, the hardware parameters like cache-usage may also play a role in expansion.
As a programmer, I want to understand these semantics and the algorithms which guide inline expansion. Ultimately, I should be able to write (or recognize) code that surely will (or will not) be inlined. I don't mean to override the compiler, nor do I think I could write better code than the compiler itself. The question is rather about understanding the internals of compilers.
EDIT: Since I use gcc/g++ in my work, we can limit the scope to these two alone. Though I was of the opinion that there would be several things common across compilers in this context.
You don't need to understand the inlining (or other optimization) criteria, because by definition (assuming the optimizing compiler is not buggy in that respect), inlined code should behave the same as non-inlined code.
Your first example (callee unconditionally returning a value) is in practice certainly wrong, in the sense that several compilers are able to inline conditional returns.
For example, consider this f.c file:
static int fact (int n) {
  if (n <= 0) return 1;
  else
    return n * fact (n - 1);
}

int foo () {
  return fact (10);
}
Compile it with gcc -O3 -fverbose-asm -S f.c; the resulting f.s assembly file contains only one function (foo). The fact function is completely gone, and the call fact(10) has been inlined (recursively) and replaced, by constant folding, with 3628800.
With GCC (the current version is GCC 5.2 as of July 2015), assuming you ask it to optimize (e.g. compile with gcc -O2 or g++ -O2, or -O3), the inlining decision is not easy to understand. The compiler will very probably make better inlining decisions than you could. There are many internal heuristics guiding it (so there are no simple guiding principles, but rather some heuristics in favour of inlining, others against it, and probably some meta-heuristics to choose between them). Read about the optimize options (-finline-limit=...) and function attributes.
You might use the always_inline and gnu_inline and noinline (and also noclone) function attributes, but I don't recommend doing that in general.
You could disable inlining with noinline, but very often the resulting code would be slower. So don't do that...
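For reference, here is a minimal sketch of what those GCC function attributes look like in practice (illustrative code only, not a recommendation to use them; the helper names are made up):

/* Ask GCC to inline this at every call site. */
__attribute__((always_inline)) static inline int small_helper(int x)
{
    return x + 1;
}

/* Forbid inlining (and cloning), e.g. to keep the function visible in
   profiles and backtraces. */
__attribute__((noinline, noclone)) static int big_helper(int x)
{
    return x * x + small_helper(x);
}

int use_helpers(int x)
{
    return big_helper(x);
}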
The key point is that the compiler is better optimizing and inlining than what you reasonably can, so trust it to inline and optimize well.
Optimizing compilers (see also this) can (and do) inline functions even without you knowing that, e.g. they are sometimes inlining functions not marked inline or not inlining some functions marked inline.
So no, you don't want to "understand these semantics and the algorithms which guide inline expansion"; they are too difficult ... and vary from one compiler to another (even from one version to another). If you really want to understand why GCC is inlining (this means spending months of work, and I believe you should not waste your time on that), use -fdump-tree-all and other dump flags, instrument the compiler using MELT (which I am developing), and dive into the source code (since GCC is free software).
You'll need more than your lifetime, or at least several dozen years, to understand all of GCC (more than ten million lines of source code) and how it optimizes. By the time you have understood something, the GCC community will have worked on new optimizations, etc...
BTW, if you compile and link an entire application or library with gcc -flto -O3 (e.g. with make CC='gcc -flto -O3') the GCC compiler would do link-time optimization and inline some calls accross translation units (e.g. in f1.c you call foo defined in f2.c, and some of the calls to foo in f1.c would got inlined).
The compiler optimizations do take cache sizes into account (for decisions about inlining, unrolling, register allocation & spilling, and other optimizations), in particular when compiling with gcc -mtune=native -O3.
Unless you force the compiler (e.g. by using the noinline or always_inline function attributes in GCC, which is often wrong and would produce worse code), you'll never be able in practice to guess that a given chunk of code will certainly be inlined. Even people working on the GCC middle-end optimizations cannot guess that reliably! So you cannot reliably understand or predict the compiler's behavior in practice; don't even waste your time trying.
Look also into MILEPOST GCC; by using machine learning techniques to tune some GCC parameters, they have been able to sometimes get astonishing performance improvements, but they certainly cannot explain or understand them.
If you need to understand your particular compiler while coding some C or C++, your code is probably wrong (e.g. it probably has some undefined behavior). You should code against some language specification (either the C11 or C++14 standards, or the particular GCC dialect, e.g. -std=gnu11, documented and implemented by your GCC compiler) and trust your compiler to be faithful w.r.t. that specification.
Inlining is like copy-paste. There aren't so many gotchas that will prevent it from working, but it should be used judiciously. If it gets out of control, the program will become bloated.
Most compilers use a heuristic based on the "size" of the function. Since this is usually done before any code generation pass, the number of AST nodes may be used as a proxy for size. A function that includes inlined calls needs to include them in its own size, or inlining can go totally out of control. However, AST nodes that will not generate instructions should not prevent inlining. It can be difficult to tell what will generate a "move" instruction and what will generate nothing.
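A toy sketch of such a size heuristic (purely illustrative; the node structure, threshold and function names are made up and do not come from any real compiler):

#include <stdbool.h>

#define INLINE_NODE_LIMIT 40          /* hypothetical size threshold */

struct ast_node {
    struct ast_node *kids[4];
    int nkids;
    bool emits_code;                  /* will this node generate instructions? */
};

/* Count only the nodes expected to generate instructions. */
static int code_size(const struct ast_node *n)
{
    int total = n->emits_code ? 1 : 0;
    for (int i = 0; i < n->nkids; i++)
        total += code_size(n->kids[i]);
    return total;
}

/* A callee is inlined when its instruction-generating size is small enough;
   calls already inlined into its body are automatically counted here. */
static bool should_inline(const struct ast_node *callee_body)
{
    return code_size(callee_body) <= INLINE_NODE_LIMIT;
}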
Since modern C++ tends to involve lots of functions that perform conceptual rearrangement with no underlying instructions, the difficulty is telling the difference between no instructions, "just a few" moves, and enough move instructions to cause a problem. The only way to tell for a particular instance is to run the program in a debugger and/or read the disassembly.
Mostly in typical C++ code, we just assume that the inliner is working hard enough. For performance-critical situations, you can't just eyeball it or assume that anything is working optimally. Detailed performance analysis at the disassembly level is essential.
So we've all heard the don't-use-register line, the reasoning being that trying to out-optimize a compiler is a fool's errand.
register, from what I know, doesn't actually state anything about CPU registers, just that a given variable can't be referenced indirectly. I'll hazard a guess that it's often referred to as obsolete because compilers can detect a lack of addressing automatically thus making such optimizations transparent.
But if we're firm on that argument, can't it be levelled at every optimization-driven keyword in C? Why do we use inline and C99's restrict for example?
I suppose that some things like aliasing make deducing some optimizations hard or even impossible, so where is the line drawn before we start venturing into Sufficiently Smart Compiler territory?
Where should the line be drawn in C and C++ between spoon-feeding a compiler optimization information and assuming it knows what it's doing?
EDIT: Jens Gustedt pointed out that my conflating of C and C++ isn't right since two of the keywords have semantic differences and one doesn't exist in standard C++. I had a good link about register in C++ which I'll add if I find it...
I would agree that register and inline are somewhat similar in this respect. If the compiler can see the body of the callee while compiling a call site, it should be able to make a good decision on inlining. The use of the inline keyword in both C and C++ has more to do with the mechanics of making the body of the function visible than with anything else.
restrict, however, is different. When compiling a function, the compiler has no idea of what the call sites are going to be. Being able to assume no aliasing can enable optimizations that would otherwise be impossible.
inline is used in the scenario where you implement a non-templated function within the header then include it from multiple compilation units.
This ensures that the compiler should create just one instance of the function as though it were inlined, so you do not get a link error for a multiply defined symbol. It does not, however, require the compiler to actually inline it.
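A minimal sketch of that scenario (hypothetical header; note that in C++ a plain inline definition in a header suffices, while in C the common idiom is static inline, and neither form forces actual inlining):

/* clamp.h - may be included from many translation units without causing
   a multiply-defined-symbol link error. */
#ifndef CLAMP_H
#define CLAMP_H

static inline int clamp01(int x)
{
    return x < 0 ? 0 : (x > 1 ? 1 : x);
}

#endif /* CLAMP_H */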
There are GNU flags, force-inline or something similar I think, but that is a language extension.
register doesn't even say that you can't reference the variable indirectly (at least in C++). It said that in the original C, but that has been dropped.
Whether trying to out-optimize the compiler is a fool's errand depends on the optimization. Not many compilers, for example, will convert sin(x) * sin(x) + cos(x) * cos(x) into 1.
Today, most compilers ignore register, and no one uses it, because compilers have become good enough at register allocation to do a better job than you can with register. In fact, respecting register would typically make the generated code slower. This is not the case for inline or restrict: in both cases, there exist techniques, at least theoretically, which could result in the compiler doing a better job than you can. Such techniques are not widespread, however, and (as far as I know, at least) have a very high compile-time overhead, with compile times that in some cases grow exponentially with the size of the program, which makes them more or less unusable on most real programs (compile times measured in years really aren't acceptable).
As to where to draw the line... it changes over time. When I first started programming in C, register made a significant difference, and was widely used. Today, no. I imagine that in time, the same may happen with inline or restrict; some experimental compilers are very close with inline already.
This is a flame-bait question but I will dive in anyway.
Compilers are a lot better at optimising than your average programmer. There was a time when I programmed on a 25 MHz 68030 and got some advantage from the use of register because the compiler's optimizer was so poor. But that was back in 1990.
I see inline as just as bad as register.
In general, measure first before you modify. If you find that your code performs so poorly that you want to use register or inline, take a deep breath, stand back and look for a better algorithm first.
In recent times (i.e. the last 5 years) I have gone through code bases and removed inline functions galore with no perceptible change in performance. Code size, however, always benefits from the removal of inline methods. That isn't a big issue for your standard x86-style monster multicore marvel of the modern age, but it does matter if you work in the embedded space.
It is a moving target, because compiler technology is improving. (Well, sometimes it is more changing than improving, but that has some of the same effect of rendering your optimization attempts moot, or worse.)
Generally, you should not guess at whether an optimization keyword or other optimization technique is good or not. One has to learn quite a bit about how computers work, including the particular platform you are targeting, and how compilers work.
So a rule about using various optimization techniques is to ask: do I know the compiler will not do the best job here? And am I willing to commit to that for a while? Will the compiler remain stable while this code is in use, and am I willing to rewrite the code when the compiler changes this situation? Typically, you have to be an experienced and knowledgeable software engineer to know when you can do better than the compiler. It also helps if you can talk to the compiler developers.
This means people cannot give you an answer here that has a definite guideline. It depends on what compiler you are using, what your project is, what your resources are, and what your goals are, and so on.
Although some people say not to try to out-optimize the compiler, there are various areas of software engineering where people do better than a compiler and in which it is worth the expense of paying people for this.
The difference is as follows:
register is a very local optimization (i.e. inside one function). Register allocation is a relatively solved problem, both thanks to smarter compilers and thanks to a larger number of registers (mostly the former, but x86-64, say, has more registers than x86, and both have more than, say, an 8-bit processor).
inline is harder, as it is an inter-procedural optimization. However, since it involves a relatively small depth of recursion and a small number of procedures (if the inlined procedure is too big there is no sense in inlining it), it may be safely left to the compiler.
restrict is much harder. To fully know that two pointers don't alias you would need to analyse the whole program (including libraries, the system, plug-ins, etc.) - and even then you would run into problems. However, the information is clearer for the programmer AND it is part of the specification.
Consider this very simple code:
void my_memcpy(void *dst, const void *src, size_t size) {
    for (size_t i = 0; i < size; i++) {
        ((char *)dst)[i] = ((const char *)src)[i];
    }
}
Is there a benefit to making this code efficient? Yes - memcpy tends to be very useful (say, for copying during GC). Can this code be vectorized (here - moved by words, say 128 bits at a time instead of 8)? The compiler would have to deduce that dst and src do not alias in any way and that the regions pointed to by them are independent. size may depend on user input or runtime behaviour or other elements, which makes the analysis practically impossible - similar problems to the Halting Problem - in general we cannot analyse everything without running it. Or it might be part of a C library (I assume shared libraries) and called by arbitrary programs, so not all call sites are even known at compile time. Without such analysis, the program could exhibit different behaviour with optimization on. On the other hand, the programmer can ensure that they are different objects simply by knowing the (even higher-level) design, instead of needing a bottom-up analysis.
restrict can also be part of the documentation, as it might be that the programmer wrote the procedure in a way that cannot handle two aliasing pointers. For example, if we want to copy memory between aliasing locations, the my_memcpy code above is incorrect.
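For contrast, here is a sketch of the same loop with the C99 restrict qualifier added; this is exactly the non-aliasing promise the compiler cannot deduce on its own, and what makes vectorizing the copy legal:

#include <stddef.h>

void my_memcpy_restrict(void *restrict dst, const void *restrict src,
                        size_t size) {
    char *d = dst;                 /* restrict promise: d and s never alias */
    const char *s = src;
    for (size_t i = 0; i < size; i++)
        d[i] = s[i];
}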
So to sum up - a Sufficiently Smart Compiler would not be able to deduce restrict (unless we move to compilers understanding the meaning of code) without knowing the whole program. Even then it would be close to undecidable. However, for local optimization the compilers are already sufficiently smart. My guess is that a Sufficiently Smart Compiler with whole-program analysis would nevertheless be able to deduce it in many interesting cases.
PS. By local I mean single function. So local optimization cannot assume anything about arguments, global variables etc.
One thing that hasn't been mentioned is that many non-x86 compilers aren't nearly as good at optimizing as gcc and other "modern" C-compilers are.
For instance, the compilers for PIC are absolutely terrible at optimizing. Also, the optimizer for cicc (the CUDA compiler), though much better, still seems to miss a lot of fairly simple optimizations.
For these cases, I've found optimization hints like register, inline, and #pragma unroll to be extremely useful.
From what I saw back in the days when I was more involved with C/C++, these are merely orders given directly to the compiler. The compiler may try to inline a function even if it is not given a direct order to do so. That really depends on the compiler and may even raise some cross-compiler issues. As an example, Visual Studio provides different levels of optimization, which correspond to different levels of "intelligence" of the compiler. I have read that all member functions defined in the class body are implicitly inline, to give the compiler a hint to minimize function call overhead. In any case, these directives are extremely helpful when you are using a less intelligent compiler, while for intelligent compilers they may simply describe optimizations the compiler would find obvious anyway.
Also, be assured that these keywords are safe to use. Some compiler optimizations may not work well with some libraries, such as OpenGL (as I have seen myself). So in cases where you feel that a compiler optimization may be harmful, you can use these keywords to make sure things are done the way you want.
Compilers such as g++ optimize code very well these days. You might do better to look for optimization elsewhere, perhaps in the methods and algorithms you use, or by using TBB or CUDA to make your code parallel.
I remember reading somewhere that to really optimize & speed up certain sections of code, programmers write those sections in Assembly language. My questions are -
Is this practice still done? and How does one do this?
Isn't writing in Assembly Language a bit too cumbersome & archaic?
When we compile C code (with or without the -O3 flag), the compiler does some code optimization & links all libraries & converts the code to a binary object file. So when we run the program it is already in its most basic form, i.e. binary. So how does introducing 'Assembly Language' help?
I am trying to understand this concept & any help or links is much appreciated.
UPDATE: Rephrasing point 3 as requested by dbemerlin: Because you might be able to write more effective assembly code than the compiler generates, but unless you are an assembler expert your code will probably run slower, because often the compiler optimizes the code better than most humans can.
The only time it's useful to revert to assembly language is when
the CPU instructions don't have functional equivalents in C++ (e.g. single-instruction-multiple-data instructions, BCD or decimal arithmetic operations)
AND the compiler doesn't provide extra functions to wrap these operations (e.g. C++11 Standard has atomic operations including compare-and-swap, <cstdlib> has div/ldiv et al for getting quotient and remainder efficiently)
AND there isn't a good third-party library (e.g. http://mitpress.mit.edu/catalog/item/default.asp?tid=3952&ttype=2)
OR
for some inexplicable reason - the optimiser is failing to use the best CPU instructions
...AND...
the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.
Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:
the compiler knows how to do this equally well
to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code
you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everything through specific registers would serialise it
in fairness, GCC has ways to express needs for specific types of registers without constraining the CPU to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that addresses this (see the sketch after this list)
if a new CPU model comes out next year with another instruction that's 1000% faster for the same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are (or whoever's maintaining the software by then)
the compiler will select an optimal approach for the target architecture it's told about: if you hardcode one solution then it will need to be a lowest common denominator or #ifdef-ed for your platforms
assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
other programmers may not know or be comfortable with assembly
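As an illustration of the register-constraint point above, here is a minimal sketch of GCC/Clang extended inline asm for x86 (purely illustrative): the generic "r" constraints let the register allocator pick whichever registers suit it, instead of hardcoding eax or ebx.

static inline int add_asm(int a, int b)
{
    __asm__ ("addl %1, %0"   /* a += b, AT&T syntax          */
             : "+r"(a)       /* in/out: any general register */
             : "r"(b));      /* input:  any general register */
    return a;
}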
One perspective that I think's worth keeping in mind is that when C was introduced it had to win over a lot of hardcore assembly language programmers who fussed over the machine code generated. Machines had less CPU power and RAM back then and you can bet people fussed over the tiniest thing. Optimisers became very sophisticated and have continued to improve, whereas the assembly languages of processors like the x86 have become increasingly complicated, as have their execution pipelines, caches and other factors involved in their performance. You can't just add values from a table of cycles-per-instruction any more. Compiler writers spend time considering all those subtle factors (especially those working for CPU manufacturers, but that ups the pressure on other compilers too). It's now impractical for assembly programmers to average - over any non-trivial application - significantly better efficiency of code than that generated by a good optimising compiler, and they're overwhelmingly likely to do worse. So, use of assembly should be limited to times it really makes a measurable and useful difference, worth the coupling and maintenance costs.
First of all, you need to profile your program. Then you optimize the most used paths in C or C++ code. Unless advantages are clear you don't rewrite in assembler. Using assembler makes your code harder to maintain and much less portable - it is not worth it except in very rare situations.
(1) Yes. The easiest way to try this out is to use inline assembly. This is compiler-dependent but usually looks something like this:
__asm
{
mov eax, ebx
}
(2) This is highly subjective
(3) Because you might be able to write more effective assembly code than the compiler generates.
You should read the classic book Zen of Code Optimization and the followup Zen of Graphics Programming by Michael Abrash.
In summary, in the first book he explained how to push assembly programming to its limits. In the follow-up he explained that programmers should rather use a higher-level language like C and only try to optimize very specific spots using assembly, if necessary at all.
One motivation for this change of mind was that he saw that programs highly optimized for one generation of processor could become (somewhat) slow on the next generation of the same processor family, compared with code compiled from a high-level language (perhaps because the compiler uses new instructions, or because the performance and behavior of existing instructions changes from one processor generation to another).
Another reason is that compilers are quite good and optimize aggressively nowadays; there is usually much more performance to be gained by working on algorithms than by converting C code to assembly. Even for GPU (graphics card processor) programming you can do it in C using CUDA or OpenCL.
There are still some (rare) cases when you should/have to use assembly, usually to get very fine control over the hardware. But even in OS kernel code it's usually very small parts, and not that much code.
There are very few reasons to use assembly language these days; even low-level constructs like SSE and the older MMX have built-in intrinsics in both gcc and MSVC (icc too, I bet, but I've never used it).
Honestly, optimizers these days are so insanely aggressive that most people couldn't match even half their performance writing code in assembly. You can change how data is ordered in memory (for locality), or tell the compiler more about your code (through #pragma), but actually writing assembly code... I doubt you'll get anything extra from it.
@VJo, note that using intrinsics in high-level C code lets you do the same optimizations without using a single assembly instruction.
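For instance, a minimal sketch of what such intrinsics look like (SSE, from <xmmintrin.h>; assumes an SSE-capable x86 and that each pointer addresses at least four floats):

#include <xmmintrin.h>

/* dst[0..3] = a[0..3] + b[0..3], four lanes at once, no assembly written. */
void add4(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));  /* add lane-wise and store    */
}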
And for what it's worth, there have been discussions about the next Microsoft C++ compiler, and how they'll drop inline assembly from it. That speaks volumes about the need for it.
I don't think you specified the processor. Different answers apply depending on the processor and the environment. The general answer is yes, it is still done; it is certainly not archaic. The general reason is the compilers: sometimes they do a good job of optimizing in general but not really well for specific targets. Some are really good at one target and not so good at others. Most of the time it is good enough; most of the time you want portable C code and not non-portable assembler. But you still find that C libraries will hand-optimize memcpy and other routines where the compiler simply cannot figure out that there is a very fast way to implement them, in part because that corner case is not worth spending time making the compiler optimize for; it is easier to just solve it in assembler, and the build system ends up with a lot of "if this target use C, if that target use asm" logic. So it still occurs, and I argue it must continue forever in some areas.
X86 is its own beast with a lot of history. We are at a point where you really cannot, in a practical manner, write one blob of assembler that is always faster; you can definitely optimize routines for a specific processor on a specific machine on a specific day and outperform the compiler, but other than for some specific cases it is generally futile. Educational, but overall not worth the time. Also note the processor is no longer the bottleneck, so a sloppy generic C compiler is good enough; find the performance elsewhere.
Other platforms often means embedded: ARM, MIPS, AVR, MSP430, PIC, etc. You may or may not be running an operating system, and you may or may not have a cache or the other such things your desktop has. So the weaknesses of the compiler will show. Also note that programming languages continue to evolve away from processors instead of toward them; even C, considered perhaps a low-level language, doesn't match the instruction set. There will always be times when you can produce segments of assembler that outperform the compiler - not necessarily the segment that is your bottleneck, but across the entire program you can often make improvements here and there. You still have to check whether doing that is worth it. In an embedded environment it can and does make the difference between the success and failure of a product. If your product needs $25 per unit in more power-hungry, larger board real estate, higher-speed processors so you don't have to use assembler, while your competitor spends $10 or less per unit and is willing to mix asm with C to use smaller memories, less power, cheaper parts, etc., then so long as the NRE is recovered, the mixed-with-asm solution will win in the long run.
True embedded is a specialized market with specialized engineers. Another embedded market - your embedded Linux Roku, TiVo, etc., and embedded phones - needs portable operating systems to survive, because you need third-party developers. So the platform has to be more like a desktop than an embedded system. Buried in the C library, as mentioned, or in the operating system, there may be some assembler optimizations, but as with the desktop you want to throw more hardware at the problem so that the software can be portable instead of hand-optimized. And your product line or embedded operating system will fail if assembler is required for third-party success.
The biggest concern I have is that this knowledge is being lost at an alarming rate. Because nobody inspects the assembler, because nobody writes in assembler, etc., nobody is noticing that the compilers have not been improving when it comes to the code being produced. Developers often think they have to buy more hardware instead of realizing that, by knowing either the compiler or how to program better, they can improve their performance by 5 to several hundred percent with the same compiler, sometimes with the same source code (5-10% usually with the same source code and compiler). gcc 4 does not always produce better code than gcc 3; I keep both around because sometimes gcc 3 does better. Target-specific compilers can (though they do not always) run circles around gcc; you can sometimes see a few hundred percent improvement with the same source code and a different compiler. Where does all of this come from? The folks who still bother to look at and/or use assembler. Some of those folks work on the compiler backends. The front end and middle end are fun and educational, certainly, but the backend is where you make or break the quality and performance of the resulting program. Even if you never write assembler but only look at the output from the compiler from time to time (gcc -O2 -S myprog.c), it will make you a better high-level programmer and will retain some of this knowledge. If nobody is willing to know and write assembler, then by definition we have given up on writing and maintaining compilers for high-level languages, and software in general will cease to exist.
Understand that with gcc, for example, the output of the compiler is assembly that is passed to an assembler, which turns it into object code. The C compiler does not normally produce binaries. The objects, when combined into the final binary, are handled by the linker, yet another program that is called by the compiler driver and is not part of the compiler itself. The compiler turns C or C++ or Ada or whatever into assembler, and then the assembler and linker tools take it the rest of the way. Dynamic recompilers, like tcc for example, must be able to generate binaries on the fly somehow, but I see that as the exception, not the rule. LLVM has its own runtime solution, as well as quite visibly showing the high-level-to-internal-code-to-target-code-to-binary path if you use it as a cross compiler.
So back to the point: yes, it is done, more often than you think. It mostly has to do with the language not mapping directly to the instruction set, and with the compiler not always producing fast enough code. If you can get, say, a dozens-of-times improvement on heavily used functions like malloc or memcpy, or want an HD video player on your phone without hardware support, then you balance the pros and cons of assembler. Truly embedded markets still use assembler quite a bit; sometimes it is all C, but sometimes the software is completely coded in assembler. For desktop x86 the processor is not the bottleneck; the processors are microcoded. Even if you write beautiful-looking assembler, it won't run really fast across all families of x86 processors; sloppy, good-enough code is more likely to run about the same across the board.
I highly recommend learning assembler for non-x86 ISAs like ARM, thumb/thumb2, MIPS, MSP430 and AVR - targets that have compilers, particularly ones with gcc or LLVM support. Learn the assembler, learn to understand the output of the C compiler, and prove that you can do better by actually modifying that output and testing it. This knowledge will help make your desktop high-level code much better without assembler: faster and more reliable.
It depends. It is (still) being done in some situations, but for the most part, it is not worth it. Modern CPUs are insanely complex, and it is equally complex to write efficient assembly code for them. So most of the time, the assembly you write by hand will end up slower than what the compiler can generate for you.
Assuming a decent compiler released within the last couple of years, you can usually tweak your C/C++ code to gain the same performance benefit as you would using assembly.
A lot of people in the comments and answers here are talking about the "N times speedup" they gained rewriting something in assembly, but that by itself doesn't mean much. I got a 13x speedup from rewriting a C function evaluating fluid dynamics equations in C, by applying many of the same optimizations as I would have if I were writing it in assembly: by knowing the hardware, and by profiling. In the end, it got close enough to the theoretical peak performance of the CPU that there was no point in rewriting it in assembly. Usually, it's not the language that's the limiting factor, but the actual code you've written. As long as you're not using "special" instructions that the compiler has difficulty with, it's hard to beat well-written C++ code.
Assembly isn't magically faster. It just takes the compiler out of the loop. That is often a bad thing, unless you really know what you're doing, since the compiler performs a lot of optimizations that are really really painful to do manually. But in rare cases, the compiler just doesn't understand your code, and can't generate efficient assembly for it, and then, it might be useful to write some assembly yourself. Other than driver development or the like (where you need to manipulate the hardware directly), the only place I can think of where writing assembly may be worth it is if you're stuck with a compiler that can't generate efficient SSE code from intrinsics (such as MSVC). Even there, I'd still start out using intrinsics in C++, and profile it and try to tweak it as much as possible, but because the compiler just isn't very good at this, it might eventually be worth it to rewrite that code in assembly.
Take a look here, where the guy improved performance six-fold using assembly code. So, the answer is: it is still being done, but the compiler is doing a pretty good job.
"Is this practice still done?"
--> It is done in image processing, signal processing, AI (e.g. efficient matrix multiplication), and other areas. I would bet the processing of the scroll gesture on my MacBook trackpad is also partially assembly code, because it is immediate.
--> It is even done in C# applications (see https://blogs.msdn.microsoft.com/winsdk/2015/02/09/c-and-fastcall-how-to-make-them-work-together-without-ccli-shellcode/)
"Isn't writing in Assembly Language a bit too cumbersome & archaic?"
--> It is a tool like a hammer or a screwdriver and some tasks require a watchmaker screwdriver.
"When we compile C code (with or without -O3 flag), the compiler does some code optimization ... So how does inducing 'Assembly Language' help?"
--> I like what @jalf said: writing C code in the way you would write assembly will already lead to efficient code. However, to do this you must think about how you would write the code in assembly language, so e.g. understand all the places where data is copied (and feel pain each time it is unnecessary).
With assembly language you can be sure which instructions are generated. Even if your C code is efficient there is no guarantee that the resulting assembly will be efficient with every compiler. (see https://lucasmeijer.com/posts/cpp_unity/)
--> With assembly language, when you distribute a binary, you can test for the CPU and take different branches depending on the CPU features, e.g. one path optimized for AVX and another just for SSE, while distributing only one binary. With intrinsics this is also possible in C++ or .NET Core 3. (see https://devblogs.microsoft.com/dotnet/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/)
At work, I used assembly on an embedded target (a microcontroller) for low-level access.
But for PC software, I don't think it is very useful.
I have an example of assembly optimization I've done, but again it's on an embedded target. You can see some examples of assembly programming for PCs too, and they create really small and fast programs, but it's usually not worth the effort (look for "assembly for Windows"; you can find some very small and pretty programs).
My example was when I was writing a printer controller, and there was a function that had to be called every 50 microseconds. It had to do some reshuffling of bits, more or less. Using C I was able to do it in about 35 microseconds, and with assembly I did it in about 8 microseconds. It's a very specific procedure, but still, something real and necessary.
On some embedded devices (phones and PDAs), it's useful because the compilers are not terribly mature, and can generate extremely slow and even incorrect code. I have personally had to work around, or write assembly code to fix, the buggy output of several different compilers for ARM-based embedded platforms.
Yes. Use either inline assembly or link in assembly object modules. Which method you should use depends on how much assembly code you need to write. Usually it's OK to use inline assembly for a couple of lines, and to switch to separate object modules if it's more than one function.
Definitely, but sometimes it's necessary. The prominent example here would be programming an operating system.
Most compilers today optimize the code you write in a high-level language much better than anyone could ever write assembly code. People mostly use assembly to write code that would otherwise be impossible to write in a high-level language like C. If someone uses it for anything else, it means they are either better at optimization than a modern compiler (I doubt that) or just plain stupid, e.g. they don't know what compiler flags or function attributes to use.
use this:
__asm__ __volatile__(/*assembly code goes here*/);
the __asm__ can also just be asm.
The __volatile__ keeps the compiler from optimizing the asm statement away or reordering it freely, in case it has side effects the compiler cannot see.
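A small sketch of where __volatile__ matters (x86, GCC/Clang extended asm; purely illustrative): reading the timestamp counter has a side effect the compiler cannot see, so we do not want the statement removed or hoisted away.

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    /* RDTSC puts the counter in EDX:EAX; "=a" and "=d" pin the outputs. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}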
Is anyone using JIT tricks to improve the runtime performance of statically compiled languages such as C++? It seems like hotspot analysis and branch prediction based on observations made during runtime could improve the performance of any code, but maybe there's some fundamental strategic reason why making such observations and implementing changes during runtime are only possible in virtual machines. I distinctly recall overhearing C++ compiler writers mutter "you can do that for programs written in C++ too" while listening to dynamic language enthusiasts talk about collecting statistics and rearranging code, but my web searches for evidence to support this memory have come up dry.
Profile guided optimization is different than runtime optimization. The optimization is still done offline, based on profiling information, but once the binary is shipped there is no ongoing optimization, so if the usage patterns of the profile-guided optimization phase don't accurately reflect real-world usage then the results will be imperfect, and the program also won't adapt to different usage patterns.
You may be interested in looking for information on HP's Dynamo, although that system focused on native binary -> native binary translation; since C++ is almost exclusively compiled to native code, I suppose that's exactly what you are looking for.
You may also want to take a look at LLVM, which is a compiler framework and intermediate representation that supports JIT compilation and runtime optimization, although I'm not sure if there are actually any LLVM-based runtimes that can compile C++ and execute + runtime optimize it yet.
I did that kind of optimization quite a lot in the last few years, for a graphics rendering API that I implemented. Since the API defined several thousand different drawing modes, a general-purpose function was way too slow.
I ended up writing my own little JIT compiler for a domain-specific language (very close to asm, but with some high-level control structures and local variables thrown in).
The performance improvement I got was between a factor of 10 and 60 (depending on the complexity of the compiled code), so the extra work paid off big time.
On the PC I would not start writing my own JIT compiler, but would use either LIBJIT or LLVM for the JIT compilation. That wasn't possible in my case, because I was working on a non-mainstream embedded processor not supported by LIBJIT/LLVM, so I had to invent my own.
The answer is more likely: no one has done more than PGO for C++ because the benefits would likely be unnoticeable.
Let me elaborate: JIT engines/runtimes have both blessings and drawbacks from their developer's point of view: they have more information at runtime, but much less time to analyze it.
Some optimizations are really expensive, and you are unlikely to see them in a JIT without a huge impact on start-up time: loop unrolling, auto-vectorization (which in most cases is also based on loop unrolling), instruction selection (using SSE4.1 only on CPUs that support SSE4.1) combined with instruction scheduling and reordering (to make better use of super-scalar CPUs). This kind of optimization combines well with C-like code (which is accessible from C++).
The only full-blown compiler architecture doing this kind of advanced compilation (as far as I know) is Java HotSpot, along with architectures built on similar principles using tiered compilation (Azul's Java systems, and the JaegerMonkey JS engine, popular at the time of writing).
But one of the biggest runtime optimizations is the following:
Polymorphic inline caching: if the first iterations of a loop run with certain types, the loop's code gets specialized for the types seen so far; the JIT puts a guard in place, makes the specialized, inlined types the default branch, and on that specialized form (using an SSA-based engine) applies constant folding/propagation, inlining and dead-code-elimination optimizations; depending on how "advanced" the JIT is, it will do a better or worse job of CPU register assignment.
As you may notice, the JIT (HotSpot) will mostly improve branchy code, and with runtime information it will do better than C++ code; but a static compiler, having on its side the time to do analysis and instruction reordering, will likely get slightly better performance for simple loops. Also, typically, the areas of C++ code that need to be fast tend not to be OOP-heavy, so the information used by the JIT optimizations will not bring such an amazing improvement.
Another advantage of a JIT is that it works across assemblies, so it has more information available if it wants to do inlining.
Let me elaborate: let's say you have a base class A and just one implementation of it, namely B, in another package/assembly/gem/etc. that is loaded dynamically.
The JIT, since it sees that B is the only implementation of A, can replace the calls to A with B's code everywhere in its internal representation, and the method calls will not do a dispatch (look in the vtable) but will be direct calls. Those direct calls may be inlined as well. For example, if B has a method getLength() which returns 2, all calls to getLength() may be reduced to the constant 2 throughout. In the end, C++ code will not be able to skip the virtual call to B when B comes from another DLL.
Some implementations of C++ do not support optimizing across multiple .cpp files (even today, though recent versions of GCC have the -flto flag that makes this possible). But if you are a C++ developer concerned about speed, you will likely put all the performance-sensitive classes in the same static library, or even in the same file, so the compiler can inline them nicely; the extra information that the JIT has by design is thus provided by the developer, so there is no performance loss.
Visual Studio has an option for doing runtime profiling that can then be used for optimizing code:
"Profile Guided Optimization"
Microsoft Visual Studio calls this "profile guided optimization"; you can learn more about it at MSDN. Basically, you run the program a bunch of times with a profiler attached to record its hotspots and other performance characteristics, and then you can feed the profiler's output into the compiler to get appropriate optimizations.
I believe LLVM attempts to do some of this. It attempts to optimize across the whole lifetime of the program (compile-time, link-time, and run-time).
Reasonable question - but with a doubtful premise.
As in Nils' answer, sometimes "optimization" means "low-level optimization", which is a nice subject in its own right.
However, it is based on the concept of a "hot-spot", which has nowhere near the relevance it is commonly given.
Definition: a hot-spot is a small region of code where a process's program counter spends a large percentage of its time.
If there is a hot-spot, such as a tight inner loop occupying a lot of time, it is worth trying to optimize at the low level, if it is in code that you control (i.e. not in a third-party library).
Now suppose that inner loop contains a call to a function, any function. Now the program counter is not likely to be found there, because it is more likely to be in the function. So while the code may be wasteful, it is no longer a hot-spot.
There are many common ways to make software slow, of which hot-spots are one. However, in my experience, that is the only one of which most programmers are aware, and the only one to which low-level optimization applies.
See this.