Why C++ compilers have many optimization levels - c++

I was just wondering why C++ compilers have many optimization levels like O1, O2, etc. Why can't everything be part of just one optimization level O?
I searched online a lot but didn't find a convincing answer.

Off the top of my head: optimizing takes time (more optimization means slower compilation), debugging optimized code can be more difficult, more aggressive optimization can reveal bugs, you can optimize for different things (program size, speed, etc.)…

While compilers are smart enough to optimize several common commands (O1) they may try with not-so-frequent strategies (O2) and even with corner cases (O3).
Whether the resulting code is more optimized depends a lot on the original code, and sometimes on the CPU hardware. The only way to tell which "O" is best is trying and measuring running times.
Remember that the one who really knows about the code is you. Write it with part of your brain thinking about how fast it will run.
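One way to do that measuring is a small timing harness that you build once per optimization level. This is only a minimal sketch, assuming GCC- or Clang-style flags such as -O1, -O2 and -O3; `sum_of_squares` and `time_call_ms` are hypothetical names made up for illustration, and the workload stands in for whatever code you actually care about.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical workload: replace with the code you actually want to measure.
std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) {
        total += i * i;
    }
    return total;
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_call_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Build the same file at each level you want to compare, call `time_call_ms` on the workload several times, and compare the numbers. Keep the result of the workload alive (for example by storing it in a `volatile` variable), otherwise the optimizer may delete the very work you are trying to time.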

For precisely two reasons:
There is no one sequence of compiler optimizations that can simultaneously optimize all possible program characteristics of interest, such as execution time, compilation time, code size, energy consumption, binary-portability, etc. In compiler optimizations research, this is known as the phase ordering problem.
Most developers do not want to bother with figuring out which compiler optimizations to use and in what order; they just want to use whatever is generally recommended in a small number of common scenarios.
That's why compiler developers have decided to offer a small collection of optimization levels from which developers can easily choose in general, yet offering hundreds of fine-grained optimization options for advanced scenarios.
The term "optimization levels" is really a misnomer, since they are not exactly "levels" with respect to each other. A better term would be something like "optimization groups".
Designing optimization levels is a complicated matter for compilers that target a broad range of programs and architectures, such as GCC, Clang, icc, and VC++. Many research papers have been published in the past decade that show that the optimization levels offered by compilers are far from being the best for a particular program, target architecture, or specific collection thereof. This motivated a line of research on compiler auto-tuning, which can be considered as an approach that falls somewhere in between offering few optimization levels and offering fine-grained control over compiler optimizations.
In summary, optimization levels provide an important convenience for developers, which will be required for many decades to come.

Related

Compiling via C/C++ and compiler bugs/limits

I'm looking at the possibility of implementing a high-level (Lisp-like) language by compiling via C (or possibly C++ if exceptions turn out to be useful enough); this is a strategy a number of projects have previously used. Of course, this would generate C code unlike, and perhaps in some dimensions exceeding the complexity of, anything you would write by hand.
Modern C compilers are highly reliable in normal usage, but it's hard to know what bugs might be lurking in edge cases under unusual stresses, particularly if you go over some "well no programmer would ever write an X bigger than Y" hidden limit.
It occurs to me the coincidence of these facts might lead to unhappiness.
Are there any known cases, or is there a good way to find cases, of generated code tripping over edge case bugs/limits in reasonably recent versions of major compilers (e.g. GCC, Microsoft C++, Clang)?
This may not quite be the answer you were looking for, but quite a few years ago I worked on a project where parts of the system were written in a higher-level language that made it easy to build state machines in processes. This language produced C code that was then compiled. The compiler we were using for this project was gcc (version around 2.95 - don't quote me on that, but pre-3.0 for sure). We did run into a couple of code generation bugs, but that was, from my memory, more to do with using a not-so-popular processor [revealing which processor may reveal something I shouldn't about the project, so I'd rather not say what it was, even if it was a very long time ago].
A colleague close to me was investigating one of those code generation bugs, which was in a function of around 200k lines, all of the function a big switch-statement, with each case in the switch statement being around 50-1000 lines each (with several layers of sub-switch statements inside it).
From my memory of it, the code was crashing because it produced an invalid operation or stored something in a register already occupied by something else, so once you hit the right bit of code, it would fail due to an illegal memory access. It had nothing to do with the size of the code, because my colleague eventually managed to get it down to about 30 lines (after a lot of "let's cut this out and see if it still goes wrong"), and after a few days we had a new version of the compiler with a fix. Nice to know that the many thousands of dollars paid for the compiler service contract are worth it, at least sometimes...
My point here is that modern compilers tolerate a lot of large code. There are also minimum limits that a compliant compiler must support. For example, I believe (from memory, again) that the compiler needs to support 127 levels of nested statements (that is, a combination of 127 if, switch, while and do-while) within a function. And, from a discussion somewhere (which is where the "the compiler should support 127 levels of nested statements" comes from), we found that MSVC and GCC both support a whole lot more (enough that we gave up on finding the limit...)
Short answer:
You have no choice if you need the performance of a compiler rather than the ease of life of an interpreter (or pre-compiler + interpreter). You will be compiling into some lower level language, and C is the assembly language of today, with C++ being about as available and as apt for the task as C. There is no reason why you should fear this route. It is actually, in some sense, a very common route.
Long answer:
Generated code is in no way unusual. Even relatively modest uses of generated code are resulting in C source that is "unlike what any programmer would ever write", in terms of quantity (trivial code patterns repeated with small variations millions of times), or quality (combinations of language features that a human would never use but that are still, say, legal C++). There are also quite a few compilers that compile into C or C++, some famous and well known to people who wrote the C and C++ language standards.
The most common C and C++ compilers cope well with generated code. Yes, they are never perfect.
Compilers may have various simple limitations, such as maximum code line length; these tend to be documented and easy to comply with in your generator once you run into them.
Compilers may have defects.
Some kinds of defects are actually less of concern for generated code than for hand written code. Your code generator actually gives you a degree of freedom to deal with many situations in a systematic way as soon as you start understanding the pattern and caring about the problem.
Defects that actually result in the compiler not compiling valid code correctly are quickly discovered, given enough users of the compiler. They are treated by compiler vendors as particularly high priority defects. Even if the compiler is essentially dead or serving a niche market, so that no fix is available, plenty of information tends to be available on the Internet, including people's existing experience of working around the defect (different compiler, different code construction, different compiler switches... the solution varies a lot and may feel awkward, but nobody gives up on their job just because some software is buggy, right?). So there is usually a choice of searchable solutions.
It's often good to strive for portability across compilers, while also knowing and tracking the limits of that portability. If you have not tested a particular C or C++ compiler very well, do not claim that it would work as part of your toolset.
You are asking an implied question between C and C++. Well, there are more shades of grey here. C++ is a very rich language. You can use almost all C++ features for good purposes within your generator, but in some cases you should ask yourself whether a particular major feature could become a liability, costing you more than it brings to you. For example, different compilers use different strategies for template instantiation. Implicit instantiation can lead to unforeseen complexity with portable generated code. If templates do help you design the generator tremendously, don't hesitate to use them; but if you only have a marginal use case for them, remember that you have a better reason than most people to restrict the language you generate into.
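One way a generator can limit its exposure to differing instantiation strategies is to spell out exactly which specializations it needs. The following is only a sketch of that idea; `twice` is a hypothetical template invented for illustration, and whether this helps depends on what your generator actually emits.

```cpp
// Function template that a generator might emit once and reuse.
template <typename T>
T twice(T value) {
    return value + value;
}

// Explicit instantiation definitions: the generated code states exactly which
// specializations exist instead of relying on implicit instantiation.
template int twice<int>(int);
template double twice<double>(double);
```

With this layout the set of instantiations is visible in the generated source, which makes it easier to reason about what each compiler has to produce.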
There are all sorts of implementation defined limits in C. Some well defined and visible to the programmer (think numeric limits), others are not. In my copy of the draft standard, section 5.2.4.1 details the lower bounds on these limits:
5.2.4 Environmental limits
Both the translation and execution environments constrain the implementation of
language translators and libraries. The following summarizes the language-related
environmental limits on a conforming implementation; the library-related limits are
discussed in clause 7.
5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:18)
— 127 nesting levels of blocks
— 63 nesting levels of conditional inclusion
— 12 pointer, array, and function declarators (in any combinations) modifying an arithmetic, structure, union, or void type in a declaration
[...]
18) Implementations should avoid imposing fixed translation limits whenever possible.
I can't say for sure if your translator is likely to hit any of these or if the C compiler(s) you're targeting will have a problem even if you do, but I think you'll probably be fine.
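If you want to probe such limits rather than guess, you can generate a stress input and feed it to the compilers you care about. A rough sketch, assuming you write the returned text to a .c file and compile it yourself; `make_nested_blocks` and the emitted function `probe` are hypothetical names, not part of any tool.

```cpp
#include <string>

// Builds a C translation unit whose function body contains `depth` blocks
// nested inside one another. Hypothetical helper for stress-testing a compiler.
std::string make_nested_blocks(int depth) {
    std::string src = "int probe(void)\n{\n    int x = 0;\n";
    for (int i = 0; i < depth; ++i) {
        src += "{\n";  // open one more nesting level
    }
    src += "x = 1;\n";  // statement at the innermost level
    for (int i = 0; i < depth; ++i) {
        src += "}\n";  // close the levels again
    }
    src += "    return x;\n}\n";
    return src;
}
```

Raising `depth` until a compiler rejects the file (or gets very slow) gives you a feel for where its real limit sits compared with the minimum the standard asks for.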
As for bugs:
Clang - http://llvm.org/bugs/buglist.cgi?bug_status=all&product=clang
GCC - http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=all&product=gcc
Visual Studio - https://connect.microsoft.com/VisualStudio/feedback
It may sound obvious but the only way to really know is by testing. If you can't do all the effort yourself, at least make your product cross-platform so that people can easily test for you! If people like your project, they are usually willing to submit bug reports or even patches for free :)

How do I find how C++ compiler implements something except inspecting emitted machine code?

Suppose I crafted a set of classes to abstract something and now I worry whether my C++ compiler will be able to peel off those wrappings and emit really clean, concise and fast code. How do I find out what the compiler decided to do?
The only way I know is to inspect the disassembly. This works well for simple code, but there're two drawbacks - the compiler might do it different when it compiles the same code again and also machine code analysis is not trivial, so it takes effort.
How else can I find how the compiler decided to implement what I coded in C++?
I'm afraid you're out of luck on this one. You're trying to find out "what the compiler did". What the compiler did is to produce machine code. The disassembly is simply a more readable form of the machine code, but it can't add information that isn't there. You can't figure out how a meat grinder works by looking at a hamburger.
I was actually wondering about that.
I have been quite interested, for the last few months, in the Clang project.
One of Clang's particular points of interest, wrt optimization, is that you can emit the optimized LLVM IR code instead of machine code. The IR is a high-level assembly language, with the notion of structure and type.
Most of the optimization passes in the Clang compiler suite are indeed performed on the IR (the last round is of course architecture specific and performed by the backend depending on the available operations). This means that you could actually see, right in the IR, whether the object creation (as in your linked question) was optimized out or not.
I know it is still assembly (though of higher level), but it does seem more readable to me:
far fewer opcodes
typed objects / pointers
no "register" things or "magic" knowledge required
Would that suit you :) ?
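To try this on a wrapper of the kind the question describes, something like the following could serve as a test case. It is only a sketch: `Meters` and `add_lengths` are hypothetical names, and the file name in the command is an assumption. With Clang you can ask for the optimized IR with `clang++ -O2 -S -emit-llvm wrapper.cpp -o wrapper.ll`.

```cpp
// Hypothetical thin wrapper around a double.
class Meters {
public:
    explicit Meters(double value) : value_(value) {}
    double value() const { return value_; }
    Meters operator+(Meters other) const { return Meters(value_ + other.value_); }

private:
    double value_;
};

// If the wrapper is peeled off completely, one would expect the optimized IR
// for this function to contain little more than a floating-point add.
double add_lengths(double a, double b) {
    return (Meters(a) + Meters(b)).value();
}
```

If the IR for `add_lengths` still shows calls to the constructor or to `operator+`, the abstraction was not fully optimized away at that optimization level.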
Timing the code will directly measure its speed and can avoid looking at the disassembly entirely. This will detect when compiler, code modifications or subtle configuration changes have affected the performance (either for better or worse). In that way it's better than the disassembly which is only an indirect measure.
Things like code size can also serve as possible indicators of problems. At the very least they suggest that something has changed. It can also point out unexpected code bloat when the compiler should have boiled down a bunch of templates (or whatever) into a concise series of instructions.
Of course, looking at the disassembly is an excellent technique for developing the code and helping decide if the compiler is doing a sufficiently good translation. You can see if you're getting your money's worth, as it were.
In other words, measure what you expect and then dive in if you think the compiler is "cheating" you.
You want to know if the compiler produced "clean, concise and fast code".
"Clean" has little meaning here. Clean code is code which promotes readability and maintainability -- by human beings. Thus, this property relates to what the programmer sees, i.e. the source code. There is no notion of cleanliness for binary code produced by a compiler that will be looked at by the CPU only. If you wrote a nice set of classes to abstract your problem, then your code is as clean as it can get.
"Concise code" has two meanings. For source code, this is about saving the scarce programmer eye and brain resources, but, as I pointed out above, this does not apply to compiler output, since there is no human involved at that point. The other meaning is about code which is compact, thus having lower storage cost. This can have an impact on execution speed, because RAM is slow, and thus you really want the innermost loops of your code to fit in the CPU level 1 cache. The size of the functions produced by the compiler can be obtained with some developer tools; on systems which use GNU binutils, you can use the size command to get the total code and data sizes in an object file (a compiled .o), and objdump to get more information. In particular, objdump -x will give the size of each individual function.
"Fast" is something to be measured. If you want to know whether your code is fast or not, then benchmark it. If the code turns out to be too slow for your problem at hand (this does not happen often) and you have some compelling theoretical reason to believe that the hardware could do much better (e.g. because you estimated the number of involved operations, delved into the CPU manuals, and mastered all the memory bandwidth and cache issues), then (and only then) is it time to have a look at what the compiler did with your code. Barring these conditions, cleanliness of source code is a much more important issue.
All that being said, it can help quite a lot if you have a priori notions of what a compiler can do. This requires some training. I suggest that you have a look at the classic dragon book; but otherwise you will have to spend some time compiling some example code and looking at the assembly output. C++ is not the easiest language for that, you may want to begin with plain C. Ideally, once you know enough to be able to write your own compiler, then you know what a compiler can do, and you can guess what it will do on a given code.
You might find a compiler that had an option to dump a post-optimisation AST/representation - how readable it would be is another matter. If you're using GCC, there's a chance it wouldn't be too hard, and that someone might have already done it - GCCXML does something vaguely similar. Of little use if the compiler you want to build your production code on can't do it.
After that, some compilers (e.g. gcc with -S) can output assembly language, which might be usefully clearer than reading a disassembly: for example, some compilers alternate high-level source as comments with the corresponding assembly.
As for the drawbacks you mentioned:
the compiler might do it different when it compiles the same code again
absolutely, only the compiler docs and/or source code can tell you the chance of that, though you can put some performance checks into nightly test runs so you'll get alerted if performance suddenly changes
and also machine code analysis is not trivial, so it takes effort.
Which raises the question: what would be better? I can imagine some process where you run the compiler over your code and it records when variables are cached in registers at points of use, which function calls are inlined, even the maximum number of CPU cycles an instruction might take (where knowable at compile time) etc. and produces some record thereof, then a source viewer/editor that colour codes and annotates the source correspondingly. Is that the kind of thing you have in mind? Would it be useful? Perhaps some parts more than others - e.g. black-and-white info on register usage ignores the utility of the various levels of CPU cache (and utilisation at run-time); the compiler probably doesn't even try to model that anyway. Knowing where inlining was really being done would give me a warm fuzzy feeling. But profiling seems more promising and useful generally. I fear the benefits are more intuitive than actual, and compiler writers are better off pursuing C++0x features, run-time instrumentation, introspection, or writing D "on the side" ;-).
The answer to your question was pretty much nailed by Karl. If you want to see what the compiler did, you have to start going through the assembly code it produced--elbow grease is required. As to discovering the "why" behind the "how" of how it implemented your code...every compiler (and every build, potentially), as you mentioned, is different. There are different approaches, different optimizations, etc. However, I wouldn't worry about whether it's emitting clean, concise machine code--cleanliness and concision should be left to the source code. Speed, on the other hand, is pretty much the programmer's responsibility (profiling ftw). More interesting concerns are correctness, maintainability, readability, etc. If you want to see if it made a specific optimization, the compiler docs might help (if they're available for your compiler). You can also just try searching to see if the compiler implements a known technique for optimizing whatever. If those approaches fail, though, you're right back to reading assembly code. Keep in mind that the code that you're checking out might have little to no impact on performance or executable size--grab some hard data before diving into any of this stuff.
Actually, there is a way to get what you want, if you can get your compiler to
produce DWARF debugging information. There will be a DWARF description for each
out-of-line function and within that description there will (hopefully) be entries
for each inlined function. It's not trivial to read DWARF, and sometimes compilers
don't produce complete or accurate DWARF, but it can be a useful source of information
about what the compiler actually did, that's not tied to any one compiler or CPU.
Once you have a DWARF reading library there are all sorts of useful tools you can
build around it.
Don't expect to use it with Visual C++ as that uses a different debugging format.
(But you might be able to do similar queries through the debug helper library
that comes with it.)
If your compiler manages to translate your "wrappings and emit really clean, concise and fast code" the effort to follow-up the emitted code should be reasonable.
Contrary to another answer I feel that emitted assembly code may well be "clean" if it is (relatively) easily mappable to the original source code, if it doesn't consist of calls all over the place and that the system of jumps is not too complex. With code scheduling and re-ordering an optimized machine code which is also readable is, alas, a thing of the past.

Mixing assembler code with c/c++

Why is assembly language code often needed along with C/C++ ?
What can't be done in C/C++, which is possible when assembly language code is mixed?
I have some source code of some 3D computer games. There are a lot of assembler code in use.
Things that pop to mind, in no particular order:
Special instructions. In an embedded application, I need to invalidate the cache after a DMA transfer has filled the memory buffer. The only way to do that on an SH-4 CPU is to execute a special instruction, so inline assembly (or a free-standing assembly function) is the only way to go.
Optimizations. Once upon a time, it was common for compilers to not know every trick that was possible to do. In some of those cases, it was worth the effort to replace an inner loop with a hand-crafted version. On the kinds of CPUs you find in small embedded systems (think 8051, PIC, and so forth) it can be valuable to push inner loops into assembly. I will emphasize that for modern processors with pipelines, multi-issue execution, extensive caching and more, it is often exceptionally difficult for hand coding to even approach the capabilities of the optimizer.
Interrupt handling. In an embedded application it is often needed to catch system events such as interrupts and exceptions. It is often the case that the first few instructions executed by an interrupt have special responsibilities and the only way to guarantee that the right things happen is to write the outer layer of a handler in assembly. For example, on a ColdFire (or any descendant of the 68000) only the very first instruction is guaranteed to execute. To prevent nested interrupts, that instruction must modify the interrupt priority level to mask out the priority of the current interrupt.
Certain portions of an OS kernel. For example, task switching requires that the execution state (at least most registers including PC and stack pointer) be saved for the current task and the state loaded for the new task. Fiddling with execution state of the CPU is well outside of the feature set of the language, but can be wrapped in a small amount of assembly code in a way that allows the rest of the kernel to be written in C or C++.
Edit: I've touched up the wording about optimization. Let me emphasize that for targets with large user populations and well supported compilers with decent optimization, it is highly unlikely that an assembly coder can beat the performance of the optimizer.
Before attempting, start by careful profiling to determine where the bottlenecks really lie. With that information in hand, examine assumptions and algorithms carefully, because the best optimization of all is usually to find a better way to handle the larger picture. Then, if all else fails, isolate the bottleneck in a test case, benchmark it carefully, and begin tweaking in assembly.
Why is assembly language code often
needed along with C/C++ ?
Competitive advantage. Like, if you are writing software for the (soon-to-be) #1 gaming company in the world.
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Nothing, unless some absolute performance level is needed, say, X frames per second or Y billions of polygons per second.
Edit: based on other replies, it seems the consensus is that embedded systems (iPhone, Android etc) have hardware accelerators that certainly require the use of assembly.
I have some source code of some 3D
computer games. There are a lot of
assembler code in use.
They are either written in the 80's-90's, or they are used sparingly (maybe 1% - 5% of total source code) inside a game engine.
Edit: to this date, compiler auto-vectorization quality is still poor. So, you may see programs that contain vectorization intrinsics, and since it's not really much different from writing in actual assembly (most intrinsics have one-one mapping to assembly instructions) some folks might just decide to write in assembly.
Update:
According to anecdotal evidence, RollerCoaster Tycoon is written in 99% assembly.
http://www.chrissawyergames.com/faq3.htm
In the past, compilers used to be pretty poor at optimizing for a particular architecture, and architectures used to be simpler. Now the reverse is true. These days, it's pretty hard for a human to write better assembly than an optimizing compiler, for deeply-pipelined, branch-predicting processors. And so you won't see it much. What there is will be short, and highly targeted.
In short, you probably won't need to do this. If you think you do, profile your code to make sure you've identified a hotspot - don't optimize something just because it's slow, if you're only spending 0.1% of your execution time there. See if you can improve your design or algorithm. If you don't find any improvement there, or if you need functionality not exposed by your higher-level language, look into hand-coding assembly.
There are certain things that can only be done in assembler and cannot be done in C/C++.
These include:
generating software interrupts (SWI or INT instructions)
Use of instructions like SWP for creating mutexes
specialist coprocessor instructions (such as those needed to program the MMU and manage RAM caches)
Access to carry and overflow flags.
You may also be able to optimize code better in assembler than C/C++ (eg memcpy on Android is written in assembler)
There may be new instructions that your compiler cannot yet generate, or the compiler does a bad job, or you may need to control the CPU directly.
Why is assembly language code often
needed along with C/C++ ?
It isn't
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Accessing system registers or IO ports on the CPU.
Accessing BIOS functions.
Using specialized instructions that don't map directly to the programming language,
e.g. SIMD instructions.
Provide optimized code that's better than the compiler produces.
The first two points you usually don't need unless you're writing an operating system, or code
running without an operating system.
Modern CPUs are quite complex, and you'll be hard pressed to find people who can actually write better assembly than what the compiler produces. Many compilers come with libraries giving you access
to more advanced features, like SIMD instructions, so nowadays you often don't need to fall back to
assembly for that.
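As an illustration of that last point, here is a rough sketch using the SSE intrinsics from `<xmmintrin.h>`, which x86 compilers such as GCC, Clang and MSVC ship. `add_arrays` is a hypothetical function made up for this example, and the `__SSE__` check is an assumption about how your compiler advertises SSE support; where the macro is not defined the code simply falls back to a plain loop.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, only when the compiler defines __SSE__
#endif

// Element-wise sum of two arrays (up to the length of the shorter one).
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<float> out(n);
    std::size_t i = 0;
#if defined(__SSE__)
    // Process four floats per iteration with unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a.data() + i);
        const __m128 vb = _mm_loadu_ps(b.data() + i);
        _mm_storeu_ps(out.data() + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or everything when SSE is unavailable.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}
```

No assembler is involved, yet each intrinsic call is expected to map closely to a single SIMD instruction, and the rest of the function stays ordinary C++.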
One more thing worth mentioning is:
C & C++ do not provide any convenient way to setup stack frames when one needs to implement a binary level interop with a script language - or to implement some kind of support for closures.
In certain situations, hand-written assembly can be more optimal than what any compiler can generate.

How to know what optimizations are done automatically by my compiler

I was going through this link Will it optimize and wondered how can we know what optimizations are done by a particular compiler.
Like does VC8.0 convert if-else statements to switch-case?
Is such information available on msdn?
As everyone seems to be bent on telling the OP that he shouldn't worry about it, here is some useful (although not as specific as the OP requested) information about compiler optimization options.
You'll have to figure out what flags you're using, especially for MSVC and Intel (GCC release build should default to -O2), but here are the links:
GCC
MSVC
Intel
This is about as close as you'll get before disassembling your binary after compilation.
It depends on the level of optimization you choose for the compiler.
you can find a very nice article about it here
First of all, if optimization took place then your program should work faster usually. After that you could inspect disassembly code to find out what kind of optimizations were performed.
I don't know anything about VC8.0, so I'm not sure how you would access that information. However, if you are generally interested in the kinds of optimisations that go on and want to experiment, I recommend you use LLVM. You can look at the unoptimised, disassembled byte code generated from the default C front end, and then run various optimiser passes over it to see what the effect is each time. Because it's a nicer, abstract assembly code, it tends to be a little easier to figure out what is an optimisation derivable from the code and what is a machine-specific optimisation.
Like does VC8.0 convert if-else statements to switch-case?
Compilers do not magically rewrite your source code. And even if they did, what would that tell you? What you really would want to know is if the compiler compiled it into a jump table or into multiple compare operations. Any disassembler will tell you that.
To clarify my point: Writing a switch-case statement does not necessarily imply that there will be a jump table in the binary. Not needing to worry about this is the whole point of having compilers.
Instead of figuring out which optimizations are done by the compiler in general, it's probably better to NOT have any dependencies on such compiler-specific knowledge.
Instead start out with a good design and algorithm, writing (as much as possible) portable code that's easy to follow. Then profile the code if it's too slow and fix the actual hotspots. Compiler optimizations are useful no doubt, but better is to apply some investigation to what's actually happening in the code. Algorithmic/design improvements at the source level will typically help performance more than the presence or absence of optimizations like transforming if/else into switch-case.
I'm not sure what "convert if/else to switch/case" means. My processor doesn't have a hardware switch/case instruction.
Typical compilers have several different ways to implement switch/case. A well-known one is using a jump table, but this is only done if appropriate.
For if/else, certainly it is normal for compilers to analyse a digraph of execution flow. I would expect a compiler to notice if each condition references the same variable, and I would expect the compiler to treat equivalent forms of conditionals the same way in general. But this isn't something I'd worry about.
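If you are curious anyway, a small experiment is to write the same mapping both ways and compare what the compiler emits. A minimal sketch; `classify_if` and `classify_switch` are hypothetical names, and with GCC you could look at the output of `g++ -O2 -S` for each function.

```cpp
// The same mapping written as an if/else chain...
int classify_if(int code) {
    if (code == 0) return 10;
    else if (code == 1) return 20;
    else if (code == 2) return 30;
    else if (code == 3) return 40;
    else return -1;
}

// ...and as a switch. Whether either one becomes a jump table, a lookup
// table or a series of compares is up to the compiler and its settings.
int classify_switch(int code) {
    switch (code) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        default: return -1;
    }
}
```

Whatever you see only describes that compiler version with those flags, which is the reason not to depend on it.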
IIRC, the general policy in GCC is that regressions in optimisation are tolerable so long as preferred improvements result. Optimisation is complex and what is "generally" a good optimisation isn't always that great. Plus for perfect optimisation, the compiler would have to know things it can't know (e.g. what inputs it will encounter in real life).
The point is that it really isn't worthwhile knowing that much about specific optimisations unless you happen to be a compiler developer. If you depend on something being optimised by V8, that particular optimisation might not happen in V9 or V10.

C/C++ compiler feedback optimization

Has anyone seen any real world numbers for different programs which are using the feedback optimization that C/C++ compilers offer to support the branch prediction, cache preloading functions etc.
I searched for it and amazingly not even the popular interpreter development groups seem to have checked the effect. And increasing ruby, python, php etc. performance by 10% or so should be considered useful.
Is there really no benefit or is the whole developer community just too lazy to use it?
10% is a good ballpark figure. That said, ...
You have to REALLY care about the performance to go this route. The product I work on (DB2) uses PGO and other invasive and aggressive optimizations. Among the costs are significant build time (triple on some platforms) and development and support nightmares.
When something goes wrong it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down also usually show up with these sorts of optimizations. You have the additional fun of having non-deterministic builds (an aliasing problem can show up in monday's build, vanish again till thursday's, ...).
The line between what is correct or incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in house (literally) the optimization issues (either in our source or the compiler) are still not easy to understand and resolve.
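A typical example of the aliasing problems mentioned above is type punning through a pointer cast, which may appear to work at low optimization levels and break at higher ones. The sketch below shows the usual `memcpy` alternative; `float_bits` is a hypothetical helper, and the values in the usage note assume a 32-bit IEEE 754 `float`.

```cpp
#include <cstdint>
#include <cstring>

// Undefined behaviour under the strict aliasing rule, shown only as a comment:
//   std::uint32_t bits = *reinterpret_cast<std::uint32_t*>(&f);

// Well-defined alternative: copy the object representation byte by byte.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a platform with IEEE 754 floats, `float_bits(1.0f)` yields `0x3f800000`. Optimizing compilers commonly turn the `memcpy` into a plain register move, so the safe form usually costs nothing.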
From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed heterogeneous group of people. Heavy optimization also creates difficult to debug heisenbugs. It's less work to give the compiler explicit hints for the performance critical parts.
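For the "explicit hints" route, GCC and Clang provide `__builtin_expect`, which tells the compiler which way a branch usually goes. A minimal sketch; the `UNLIKELY` macro and `checked_divide` are hypothetical names, and on other compilers the macro here degrades to a plain condition.

```cpp
#if defined(__GNUC__) || defined(__clang__)
// GCC/Clang builtin: the condition is expected to be false most of the time.
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
// Fallback for compilers without the builtin: no hint, same behaviour.
#define UNLIKELY(x) (x)
#endif

// Hypothetical hot function whose error path is marked as rarely taken.
int checked_divide(int numerator, int denominator) {
    if (UNLIKELY(denominator == 0)) {
        return 0;  // rare error path
    }
    return numerator / denominator;
}
```

The hint only influences code layout and branch ordering; the function returns the same results with or without it.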
That said, I expect significant performance increases from runtime profile guided optimization. JIT'ing allows the optimizer to cope with the profile of data changing across the execution of a program and do many extremely runtime data specific optimizations that would explode the code size for static compilation. Especially dynamic languages need good runtime data based optimization to perform well. With dynamic language performance getting significant attention lately (JavaScript VM's, MS DLR, JSR-292, PyPy and so on) there's a lot of work being done in this area.
Traditionally, compiler efficiency is improved via profiling done with performance analysis tools. However, how useful the data from those tools is for optimization still depends on the compiler you use. For example, GCC is a framework used to produce compilers for different domains. Providing a profiling mechanism in such a compiler framework is extremely difficult.
We can rely on statistical data to do certain optimization. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How it fixes up the constant will be based on statistical result of the code size generated for different target architecture.
Profile-guided optimizations track the special areas of the source. Details of previous run results need to be stored, which is an overhead. The input, on the other hand, needs to be statistically representative of the target applications that may use the compiler. So the complexity rises with the number of different inputs and outputs. In short, deciding on profile-guided optimization requires extensive data collection. Automating or embedding such profiling into the source needs careful monitoring. If not, the entire result will go awry, and in our effort to swim we will actually drown.
However, experimentation on this regard is ongoing. Just have a look at POGO.