Visual Studio Profile Guided Optimization - c++

I have a native C++ application which performs heavy calculations and consumes a lot of memory. My goal is to optimize it, mainly reduce its run time.
After several cycles of profiling-optimizing, I tried the Profile Guided Optimization which I never tried before.
I followed the steps described on MSDN under Profile-Guided Optimizations, changing the compile (/GL) and link (/LTCG) flags. After adding /GENPROFILE, I ran the application to create the .pgc and .pgd files, then switched the linker option to /USEPROFILE and watched additional linker messages reporting that the profiling data was used:
3> 0 of 0 ( 0.0%) original invalid call sites were matched.
3> 0 new call sites were added.
3> 116 of 27096 ( 0.43%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
3> 63583 of 345025 inline instances were from dead/cold paths
3> 27096 of 27096 functions (100.0%) were optimized using profile data
3> 608324578581 of 608324578581 instructions (100.0%) were optimized using profile data
3> Finished generating code
Everything looked promising, until I measured the program's performance.
The results were absolutely counterintuitive to me: performance went down instead of up! The program ran 4% to 5% slower than without Profile Guided Optimization (comparing builds with and without the /USEPROFILE option).
Even when running the exact same scenario that was used with /GENPROFILE to create the profiling data files, it ran 4% slower.
What is going on?

Looking at the sparse doc here, the optimizer doesn't seem to include any memory optimizations.
If your program takes 2 GiB of memory, I'd speculate that execution speed is limited by memory access and not by the CPU itself. (You also stated something about maps being used; those are memory-limited as well.)
Memory access is difficult for a profile-guided optimizer to improve, because it can't change your malloc calls to, for example, put frequently used data into the same pages, or make sure that data ends up in the same CPU cache line.
In addition, the optimizer may introduce extra memory accesses when trying to optimize the bare CPU performance of your program.
The doc mentions "virtual call speculation"; I would speculate that this (and maybe other features like inlining) could introduce additional memory traffic, thus degrading overall performance, because memory bandwidth is already the limiting factor.
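To make that concrete, here is a minimal sketch (my own illustration, not from the answer above) of the kind of thing no /USEPROFILE pass can do for you after the fact: a bump arena that places objects allocated together onto the same pages, instead of scattering them across individual malloc calls. The class name and sizes are invented.

#include <cstddef>
#include <cstdint>
#include <vector>

// Bump arena: objects allocated back-to-back share pages and cache lines.
class BumpArena {
    std::vector<uint8_t> buf_;
    size_t used_ = 0;
public:
    explicit BumpArena(size_t bytes) : buf_(bytes) {}
    void* alloc(size_t n, size_t align = alignof(std::max_align_t)) {
        size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > buf_.size()) return nullptr;        // exhausted: caller decides
        used_ = p + n;
        return buf_.data() + p;
    }
    void reset() { used_ = 0; }  // release everything at once; no per-object free
};

Allocating the hot data of one computation phase from one arena keeps it dense in memory, which is exactly the locality the profile-guided optimizer cannot recover for you.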

Don't look at it as a black box. If the program can be sped up, it's because it is doing things it doesn't need to do.
Those things will hide from the profile-guided or any other optimizer, and they will certainly hide from your guesser.
They won't hide from randomly pausing the program under a debugger and reading the call stack. Many people use that method.
I'm trying to resist the temptation to guess, but I'm failing.
Here's what I see in big C++ apps, no matter how well-written they are.
When people could use a simple data structure like an array, instead they use an abstract container class, with iterators and whatnot. Where does the time go?
That's where it goes.
Another thing they do is write "powerful functions and methods". The writer of the function is so proud of it, and it does so much, that he/she expects it to be called reverently and sparingly.
The user of the function (which could be the same person) thinks "Look how useful this function is! See how much I can get done in a single line of code? The more I use it the more productive I will be."
See how this can easily do needless work?
There's another thing that happens in software - layers of abstraction.
If the pattern above is repeated over several layers of abstraction, the slowdown factors multiply.
The good news is that if you can find those, and if you can fix them, you can get enormous speedups. The bad news is that you could be branded "not a team player".
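As a contrived sketch of the pattern described above (all names are my own invention), consider a "powerful" lookup helper called from a hot loop:

#include <cctype>
#include <map>
#include <string>

std::map<std::string, int> lookup_table;

// "Powerful" helper: copies, normalizes, and searches on every single call.
int get_value(const std::string& key) {
    std::string k = key;                                         // copy, again
    for (auto& c : k) c = (char)std::tolower((unsigned char)c);  // normalize, again
    auto it = lookup_table.find(k);                              // tree walk, again
    return it == lookup_table.end() ? 0 : it->second;
}

long sum_all(const std::string& key, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += get_value(key);   // n identical lookups: pure needless work
    return s;                  // hoisting one get_value call outside removes it all
}

Each layer that wraps get_value in yet another convenient helper multiplies the waste, which is exactly the multiplication of slowdown factors described above.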

How to measure the time spent managing memory

This reddit thread drew my attention to custom memory allocators. User Rohmboid says, for instance:
People wouldn't be writing their own pool allocators if there wasn't a clear benefit.
How do they know there is one?
I don’t want to waste my time/money/energy on writing a custom allocator if the time spent managing memory only accounts for less than 1% of the duration of my program. Neither do I want to switch to a custom allocator and be unable to tell the speedup. So I am wondering: how can I measure (or at least, estimate) the time spent allocating/freeing/fetching memory?
How do I know there is one?
Profile your code.
There's no point optimizing something that isn't a hot path in your code.
If the allocator (A) takes 5% of your CPU time, and the rest of your app (B) takes the other 95%, then speeding the allocator up by a factor of two saves you only 5%/2 = 2.5% of total run time. Now compare that with speeding up B by even a fraction.
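The arithmetic above is just Amdahl's law. In general form, if a fraction p of the run time is spent in the part you accelerate by a factor s, the overall speedup is

S = 1 / ((1 - p) + p/s)

so for the allocator example, S = 1 / (0.95 + 0.05/2) ≈ 1.026, i.e. about a 2.5% reduction in run time.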
How?
The easiest way is to use your IDE's built-in profiler; the MSVS one is rather decent, although I use Intel VTune; its ease of use is really great, and it just shows you: optimize here.
Using a profiling program has an additional benefit: you don't have to modify your code at all, and you don't have to recompile when you want to change profiling options and run again. That said, timers in your application can also give good results, although they rarely need to be placed directly inside the allocator. It's better to progressively narrow down the places where the program spends the most time.
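If you do want a number rather than profiler samples, one hedged sketch (my own, not from the answer above) is to wrap the global allocator and accumulate the time spent inside it; the counter names are invented:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <new>

static std::atomic<long long> g_alloc_ns{0};     // total time inside new/delete
static std::atomic<long long> g_alloc_calls{0};  // number of allocations

void* operator new(std::size_t size) {
    auto t0 = std::chrono::steady_clock::now();
    void* p = std::malloc(size);
    g_alloc_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - t0).count();
    ++g_alloc_calls;
    if (!p) throw std::bad_alloc();
    return p;
}

void operator delete(void* p) noexcept {
    auto t0 = std::chrono::steady_clock::now();
    std::free(p);
    g_alloc_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - t0).count();
}

At exit, compare g_alloc_ns against total wall-clock time: if it's under 1% of the run, the custom allocator isn't worth writing.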

Pipeline optimization, is there any point to doing this?

A very experienced programmer from another company told me about some low-level code-optimization tips targeting specific CPUs, including pipeline optimization, which means arranging the code (in inline assembly, apparently) in a special order so that it fits the target hardware's pipeline better.
With the presence of out-of-order and speculative execution, I wonder whether there is any point in doing this kind of low-level work. We are mostly involved in high performance computing, so we really can focus our optimization on one very specific CPU type, but I don't know whether this specific optimization is ever worthwhile. Does anyone have experience here? Where should I begin? Are there any code examples for this kind of optimization? Many thanks!
I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this, provided your high-level code and algorithms are optimized. In general, manual optimization should only happen if you have hard evidence of an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.
Pipeline optimization can improve your program's performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes some time. This time could be devoted to data processing instructions.
Some platform-independent methods for pipeline optimization:
Reduce the number of branches.
Use boolean arithmetic (see the sketch after this list).
Set up code to allow for conditional execution of instructions.
Unroll loops.
Make loops have short bodies (so they fit in the processor's instruction cache without reloading).
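As a sketch of the "boolean arithmetic" item (my own example, assuming a two's-complement target), a data-dependent select can be made branchless so it never disturbs the pipeline:

// Branchy version: a mispredicted branch flushes the pipeline.
int select_branchy(int cond, int a, int b) { return cond ? a : b; }

// Branchless version: the same result computed with masks, no jump needed.
int select_branchless(int cond, int a, int b) {
    int mask = -(cond != 0);           // all ones if cond is true, else zero
    return (a & mask) | (b & ~mask);
}

(Measure before committing to this: compilers often emit a conditional move for the ternary anyway.)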
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have provided little performance improvement; optimize only the code that is executed 80% of the time; find out by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization: load data (into registers), operate on the data, then write it all back to memory. The idea here is to trigger the processor's data cache loading circuitry before it processes the data.
If you can, organize the data to fit in one "line" of your processor's cache. Sequential accesses require less time than random accesses.
There are always things that "help" vs. "hinder" execution in the pipeline, but for most general-purpose code that isn't highly specialized, I would expect the performance of compiled code to be about as good as the best you can get without writing highly specialized code for each processor model. If you have a controlled system, where all of your machines use the same (or a small number of similar) processor models, and you know that 99% of the time is spent in one particular function, then there may be a benefit in optimizing that function to be more efficient.
In your case, it being HPC, it may well be beneficial to hand-write some of the low-level code (e.g. matrix multiplication) optimized for the processor you are running on. This does take a reasonable amount of understanding of the processor, however, so you need to study the optimization guides for that processor model and, if you can, talk to people who've worked on that processor before.
One of the things you'd look at is "register-to-register dependencies": where you need the result of c = a + b to calculate x = c + d, you try to separate those two calculations with some other useful work, so that the calculation of x doesn't get held up by the c = a + b calculation.
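A hedged illustration of breaking such a dependency chain (my own sketch): summing with two independent accumulators lets the CPU overlap additions that would otherwise wait on each other.

// Every addition waits for the previous one: one long register dependency chain.
double sum_dependent(const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// Two independent chains: each addition no longer holds up the other.
double sum_interleaved(const double* v, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) { s0 += v[i]; s1 += v[i + 1]; }
    if (i < n) s0 += v[i];
    return s0 + s1;
}

(The two versions can differ in the last bits of the result, because floating-point addition isn't associative.)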
Cache prefetching, and generally caring about how the caches are used, is also worth looking at: not kicking out useful cached data that you'll need 100 instructions later, while storing a resulting 1 MB array that won't be used again for several seconds, can be worth a lot of processor time.
It's hard(er) to control these things when the compiler shuffles code around during its own optimisation, so handwritten assembler is pretty much the only way to go.
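For the cache side, here is a small sketch using the GCC/Clang-specific __builtin_prefetch intrinsic (the look-ahead distance of 64 elements is my guess and would need tuning per processor):

// Hint the cache circuitry to pull future data in while current data is processed.
void scale(const float* in, float* out, int n, float k) {
    for (int i = 0; i < n; ++i) {
        __builtin_prefetch(&in[i + 64]);  // only a hint; an out-of-range address is harmless
        out[i] = in[i] * k;
    }
}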

Function size vs execution speed

I remember hearing somewhere that "large functions might have higher execution times" because of code size, and CPU cache or something like that.
How can I tell whether function size is imposing a performance hit on my application? How can I optimize against this? I have a CPU-intensive computation that I have split across as many threads as there are CPU cores. The main thread waits until all of the worker threads are finished before continuing.
I happen to be using C++ on Visual Studio 2010, but I'm not sure that's really important.
Edit:
I'm running a ray tracer that shoots about 5,000 rays per pixel. I create (cores-1) threads (1 per extra core), split the screen into rows, and give each row to a CPU thread. I run the trace function on each thread about 5,000 times per pixel.
I'm actually looking for ways to speed this up. It is possible for me to reduce the size of the main tracing function by refactoring, and I want to know if I should expect to see a performance gain.
A lot of people seem to be answering the wrong question here. I'm looking for an answer to this specific question: even if you think I could probably do better by optimizing the contents of the function, I want to know whether there is a function size/performance relationship.
It's not really the size of the function, it's the total size of the code that gets cached when it runs. You aren't going to speed things up by splitting code into a greater number of smaller functions, unless some of those functions aren't called at all in your critical code path, and hence don't need to occupy any cache. Besides, any attempt you make to split code into multiple functions might get reversed by the compiler, if it decides to inline them.
So it's not really possible to say whether your current code is "imposing a performance hit". A hit compared with which of the many, many ways that you could have structured your code differently? And you can't reasonably expect changes of that kind to make any particular difference to performance.
I suppose that what you're looking for is instructions that are rarely executed (your profiler will tell you which they are), but are located in the close vicinity of instructions that are executed a lot (and hence will need to be in cache a lot, and will pull in the cache line around them). If you can cluster the commonly-executed code together, you'll get more out of your instruction cache.
Practically speaking, though, this is not a very fruitful line of optimization. It's unlikely you'll make much difference. If nothing else, your commonly-executed code is probably quite small and adjacent already; it'll be some small number of tight loops somewhere (your profiler will tell you where). And cache lines at the lowest levels are typically small (on the order of 32 or 64 bytes), so you'd need some very fine rearrangement of code. C++ puts a lot between you and the object code that obstructs careful placement of instructions in memory.
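One low-effort way to approximate that clustering, sketched here assuming a GCC/Clang-style attribute (MSVC spells it __declspec(noinline)), is to push rarely-executed code out of the hot function so it doesn't share cache lines with the tight loop:

#include <cstdio>

// Cold path: forced out of line so the hot code stays dense in the icache.
[[gnu::noinline]] static void report_error(int code) {
    std::fprintf(stderr, "error %d\n", code);
}

// Hot path: the rare branch costs only a compare and a call that never happens.
static inline int checked_div(int a, int b) {
    if (b == 0) { report_error(1); return 0; }
    return a / b;
}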
Tools like perf can give you information on cache misses - most of those won't be for executable code, but on most systems it really doesn't matter which cache misses you're avoiding: if you can avoid some then you'll speed your code up. Perhaps not by a lot, unless it's a lot of misses, but some.
Anyway, in what context did you hear this? The most common one I've heard it come up in is the idea that function inlining is sometimes counter-productive, because sometimes the overhead of the code bloat is greater than the call overhead avoided. I'm not sure, but profile-guided optimization might help with that, if your compiler supports it. A fairly plausible profile-guided optimization is to preferentially inline at call sites that are executed a large number of times, leaving colder code smaller, with less overhead to load and fix up in the first place, and (hopefully) less disruptive to the instruction cache when it is pulled in. Somebody with far more knowledge of compilers than me will have thought hard about whether that's a good profile-guided optimization, and therefore decided whether or not to implement it.
Unless you're going to hand-tune to the assembly level, to include locking specific lines of code in cache, you're not going to see a significant execution difference between one large function and multiple small functions. In both cases, you still have the same amount of work to perform and that's going to be your bottleneck.
Breaking things up into multiple smaller functions will, however, be easier to maintain and easier to read -- especially 6 months later when you've forgotten what you did in the first place.
Function size is unlikely to be a bottleneck in your application. What you do in the function is much more important than its physical size. There are some things your compiler can do with small functions that it cannot do with large ones (namely inlining), but usually this isn't a huge difference anyway.
You can profile the code to see where the real bottleneck is. I suspect calling a large function is not the problem.
You should, however, break the function up into smaller functions for code readability reasons.
It's not really about function size, but about what you do in it. Depending on what you do, there is possibly some way to optimize it.

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check the results, there are some routines present that aren't part of my code. I assume they were inserted by Intel. They include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, it doesn't spend much time in them, except that the results show __powr8i4 taking 3.3% of the time. However, when I build with optimization, that number goes way up to about 35%. I can't find out what these routines are, but they are confusing my results, because I want to know where to look to optimize my code.
(As for what __powr8i4 is: the name suggests the Intel runtime's routine for raising a REAL*8 value to an INTEGER*4 power; the compiler emits calls to it for ** expressions, and its share of the time grows under optimization because the surrounding code gets faster while the library call does not.) Beyond that: most programs spend a lot of their cycles calling subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but deep inside some subroutine layers deep, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.

Optimizing for space instead of speed in C++

When you say "optimization", people tend to think "speed". But what about embedded systems where speed isn't all that critical, but memory is a major constraint? What are some guidelines, techniques, and tricks that can be used for shaving off those extra kilobytes in ROM and RAM? How does one "profile" code to see where the memory bloat is?
P.S. One could argue that "prematurely" optimizing for space in embedded systems isn't all that evil, because you leave yourself more room for data storage and feature creep. It also allows you to cut hardware production costs because your code can run on smaller ROM/RAM.
P.P.S. References to articles and books are welcome too!
P.P.P.S. These questions are closely related: 404615, 1561629
My experience from an extremely constrained embedded memory environment:
Use fixed size buffers. Don't use pointers or dynamic allocation because they have too much overhead.
Use the smallest int data type that works.
Don't ever use recursion. Always use looping.
Don't pass lots of function parameters. Use globals instead. :)
There are many things you can do to reduce your memory footprint; I'm sure people have written books on the subject, but a few of the major ones are:
Compiler options to reduce code size (including -Os and packing/alignment options)
Linker options to strip dead code
If you're loading from flash (or ROM) to ram to execute (rather than executing from flash), then use a compressed flash image, and decompress it with your bootloader.
Use static allocation: a heap is an inefficient way to allocate limited memory, and it can fail due to fragmentation if memory is constrained.
Tools to find the stack high-watermark (typically they fill the stack with a pattern, execute the program, then see where the pattern remains), so you can set the stack size(s) optimally; see the sketch after this list.
And of course, optimising the algorithms you use for memory footprint (often at expense of speed)
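Here is a sketch of the stack high-watermark trick from the list above. It assumes __stack_start/__stack_end symbols from your linker script (names vary by toolchain), a stack that grows downward from __stack_end, and that the painting runs from startup code before the stack is in use:

#include <cstddef>
#include <cstdint>

extern uint8_t __stack_start[];  // assumed linker-script symbols (names vary)
extern uint8_t __stack_end[];    // stack assumed to grow down from __stack_end

void paint_stack(void) {         // call from startup, before the stack is in use
    for (uint8_t* p = __stack_start; p < __stack_end; ++p) *p = 0xAA;
}

size_t stack_unused(void) {      // call late in the run
    size_t n = 0;
    for (uint8_t* p = __stack_start; p < __stack_end && *p == 0xAA; ++p) ++n;
    return n;                    // bytes the stack never grew into
}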
A few obvious ones
If speed isn't critical, execute the code directly from flash.
Declare constant data tables using const. This avoids the data being copied from flash to RAM.
Pack large data tables tightly using the smallest data types, and in the correct order to avoid padding.
Use compression for large sets of data (as long as the compression code doesn't outweigh the data)
Turn off exception handling and RTTI.
Did anybody mention using -Os? ;-)
Folding knowledge into data
One of the rules of Unix philosophy can help make code more compact:
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
I can't count how many times I've seen elaborate branching logic, spanning many pages, that could've been folded into a nice compact table of rules, constants, and function pointers. State machines can often be represented this way (State Pattern). The Command Pattern also applies. It's all about the declarative vs imperative styles of programming.
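A minimal sketch of folding branching logic into data (the state and event names are invented for illustration):

enum State { IDLE, RUN, DONE, NUM_STATES };
enum Event { START, STOP, NUM_EVENTS };

// The whole transition logic is one small const table instead of pages of ifs.
static const State next_state[NUM_STATES][NUM_EVENTS] = {
    /* IDLE */ { RUN,  IDLE },
    /* RUN  */ { RUN,  DONE },
    /* DONE */ { DONE, DONE },
};

State step(State s, Event e) { return next_state[s][e]; }

Because the table is const, on an embedded target it can also live in flash rather than RAM, which ties in with the earlier point about const data tables.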
Log codes + binary data instead of text
Instead of logging plain text, log event codes and binary data. Then use a "phrasebook" to reconstitute the event messages. The messages in the phrasebook can even contain printf-style format specifiers, so that the event data values are displayed neatly within the text.
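A hedged sketch of the phrasebook idea (the event codes and messages are invented): the target stores only a code and a raw value, and a host-side table reconstitutes the text.

#include <cstdint>
#include <cstdio>

enum EventCode : uint8_t { EV_TEMP = 1, EV_VOLT = 2 };

// Host-side phrasebook, indexed by event code; never stored on the target.
static const char* const phrasebook[] = {
    nullptr,
    "temperature reached %d C",
    "voltage dropped to %d mV",
};

// Decode one logged (code, value) pair back into readable text.
void decode(EventCode code, int value) {
    std::printf(phrasebook[code], value);
    std::printf("\n");
}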
Minimize the number of threads
Each thread needs its own memory block for a stack and a TSS. Where you don't need preemption, consider making your tasks execute co-operatively within the same thread (cooperative multi-tasking).
Use memory pools instead of hoarding
To avoid heap fragmentation, I've often seen separate modules hoard large static memory buffers for their own use, even when the memory is only occasionally required. A memory pool could be used instead so that the memory is only used "on demand". However, this approach may require careful analysis and instrumentation to make sure pools are not depleted at runtime.
Dynamic allocation only at initialization
In embedded systems where only one application runs indefinitely, you can use dynamic allocation in a sensible way that doesn't lead to fragmentation: just dynamically allocate once in your various initialization routines, and never free the memory. reserve() your containers to the correct capacity and don't let them auto-grow. If you need to frequently allocate/free buffers of data (say, for communication packets), then use memory pools. I once even extended the C/C++ runtime so that it would abort my program if anything tried to dynamically allocate memory after the initialization sequence.
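A minimal sketch of that allocate-once discipline (the type and sizes are assumptions):

#include <vector>

struct Packet { unsigned char bytes[64]; };

std::vector<Packet> packet_pool;

// One allocation, up front; the container is never allowed to grow afterwards.
void init() {
    packet_pool.reserve(1024);
}

// Reusing capacity instead of growing keeps the heap untouched after init().
Packet* acquire() {
    if (packet_pool.size() == packet_pool.capacity()) return nullptr;  // pool exhausted
    packet_pool.emplace_back();
    return &packet_pool.back();
}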
As with all optimization, first optimize algorithms, second optimize the code and data, finally optimize the compiler.
I don't know what your program does, so I can't advise on algorithms. Many others have written about the compiler. So here's some advice on code and data:
Eliminate redundancy in your code. Any repeated code that's three or more lines long, repeated three times in your code, should be changed to a function call.
Eliminate redundancy in your data. Find the most compact representation: merge read-only data, and consider using compression codes.
Run the code through a regular profiler; eliminate all code that isn't used.
Generate a map file from your linker. It will show how the memory is allocated. This is a good start when optimizing for memory usage. It also will show all the functions and how the code-space is laid out.
Here's a book on the subject Small Memory Software: Patterns for systems with limited memory.
Compile in VS with /Os (favor small code). Often this is even faster than optimizing for speed anyway, because smaller code size == less paging.
COMDAT folding should be enabled in the linker (it is by default in release builds).
Be careful about data structure packing; often this results in the compiler generating more code (== more memory) to access unaligned memory. Using 1 bit for a boolean flag is a classic example.
Also, be careful when choosing a memory-efficient algorithm over an algorithm with a better runtime. This is where premature optimizations come in.
Ok most were mentioned already, but here is my list anyway:
Learn what your compiler can do. Read compiler documentation, experiment with code examples. Check settings.
Check generated code at your target optimization level. Sometimes the results are surprising, and often it turns out the optimization actually slows things down (or just takes too much space).
Choose a suitable memory model. If you target a really small, tight system, the large or huge memory model might not be the best choice (but is usually the easiest to program for...).
Prefer static allocation. Use dynamic allocation only at startup, or over a statically allocated buffer (a pool, or a static buffer sized for the maximum number of instances).
Use C99-style data types. Use the smallest sufficient data type for storage. Local variables like loop variables are sometimes more efficient with "fast" data types.
Select inline candidates. Some parameter-heavy functions with relatively simple bodies are better off inlined. Or consider passing a structure of parameters. Globals are also an option, but be careful: tests and maintenance can become difficult if anyone using them isn't disciplined enough.
Use the const keyword well, and be aware of array initialization implications.
Map file, ideally with module sizes too. Check also what gets included from the CRT (is it really necessary?).
Recursion: just say no (limited stack space).
Floating point numbers: prefer fixed-point math. Floating point tends to pull in and call a lot of code (even for a simple addition or multiplication).
C++: you should know C++ VERY WELL. If you don't, program constrained embedded systems in C, please. Those who dare must be careful with all advanced C++ constructs (inheritance, templates, exceptions, overloading, etc.). Consider keeping close-to-HW code in a kind of Super-C, and using C++ where it counts: in high-level logic, GUI, etc.
Disable whatever you don't need in compiler settings (be it parts of libraries, language constructs, etc.)
Last but not least: while hunting for the smallest possible code size, don't overdo it. Watch out also for performance and maintainability. Over-optimized code tends to decay very quickly.
Firstly, tell your compiler to optimize for code size. GCC has the -Os flag for this.
Everything else is at the algorithmic level - use similar tools that you would for finding memory leaks, but instead look for allocs and frees that you could avoid.
Also take a look at packing commonly used data structures; if you can shave a byte or two off them, you can cut down memory use substantially.
If you're looking for a good way to profile your application's heap usage, check out valgrind's massif tool. It will let you take snapshots of your app's memory usage profile over time, and you can then use that information to better see where the "low hanging fruit" is, and aim your optimizations accordingly.
Profiling code or data bloat can be done via map files: for gcc see here, for VS see here.
I have yet to see a useful tool for size profiling though (and don't have time to fix my VS AddIn hack).
On top of what others have suggested:
Limit use of C++ features; write as in ANSI C with minor extensions. Standard (std::) templates use a large system of dynamic allocation. If you can, avoid templates altogether. While not inherently harmful, they make it far too easy to generate lots and lots of machine code from just a couple of simple, clean, elegant high-level instructions. This encourages writing in a way that, despite all the "clean code" advantages, is very memory hungry.
If you must use templates, write your own or use ones designed for embedded use, pass fixed sizes as template parameters, and write a test program so you can test your template AND check your -S output to ensure the compiler is not generating horrible assembly code to instantiate it.
Align your structures by hand, or use #pragma pack
{ char a; long b; char c; long d; char e; char f; } // 18 bytes of members plus padding (4-byte longs)
{ char a; char c; char e; char f; long b; long d; } // 12 bytes
For the same reason, use a centralized global data storage structure instead of scattered local static variables.
Intelligently balance usage of malloc()/new and static structures.
If you need a subset of functionality of given library, consider writing your own.
Unroll short loops.
for (i = 0; i < 3; i++) { transform(vector[i]); }
is longer than
transform(vector[0]);
transform(vector[1]);
transform(vector[2]);
Don't do that for longer ones.
Pack multiple files together to let the compiler inline short functions and perform various optimizations Linker can't.
Don't be afraid to write 'little languages' inside your program. Sometimes a table of strings and an interpreter can get a LOT done. For instance, in a system I've worked on, we have a lot of internal tables which have to be accessed in various ways (loop through, whatever). We've got an internal system of commands for referencing the tables that forms a sort of halfway language that's quite compact for what it gets done.
But, BE CAREFUL! Know that you are writing such things (I wrote one accidentally, myself), and DOCUMENT what you are doing. The original developers do NOT seem to have been conscious of what they were doing, so it's much harder to manage than it should be.
Optimizing is a popular term but often technically incorrect. It literally means to make optimal. Such a condition is never actually achieved for either speed or size. We can simply take measures to move toward optimization.
Many (but not all) of the techniques used to move toward minimum time to a computing result sacrifices memory requirement, and many (but not all) of the techniques used to move toward minimum memory requirement lengthens the time to result.
Reduction of memory requirements amounts to a fixed number of general techniques. It is difficult to find a specific technique that does not neatly fit into one or more of these. If you did all of them, you'd have something very close to the minimal space requirement for the program if not the absolute minimum possible. For a real application, it could take a team of experienced programmers a thousand years to do it.
1. Remove all redundancy from stored data, including intermediates.
2. Remove all need for storing data that could be streamed instead.
3. Allocate only the number of bytes needed, never a single one more.
4. Remove all unused data.
5. Remove all unused variables.
6. Free data as soon as it is no longer possibly needed.
7. Remove all unused algorithms and branches within algorithms.
8. Find the algorithm that is represented in the minimally sized execution unit.
9. Remove all unused space between items.
This is a computer science view of the topic, not a developer's one.
For instance, packing a data structure is an effort that combines (3) and (9) above. Compressing data is a way to at least partly achieve (1) above. Reducing overhead of higher level programming constructs is a way to achieve some progress in (7) and (8). Dynamic allocation is an attempt to exploit a multitasking environment to employ (3). Compilation warnings, if turned on, can help with (5). Destructors attempt to assist with (6). Sockets, streams, and pipes can be used to accomplish (2). Simplifying a polynomial is a technique to gain ground in (8).
Understanding the meaning of these nine items, and the various ways to achieve them, is the result of years of learning and of checking the memory maps resulting from compilation. Embedded programmers often learn them more quickly because of the limited memory available.
Using the -Os option on a GNU compiler asks the compiler to attempt to find patterns that can be transformed to accomplish these; -Os is an aggregate flag that turns on a number of optimization features, each of which attempts transformations toward one of the nine tasks above.
Compiler directives can produce results without programmer effort, but automated processes in the compiler rarely correct problems created by lack of awareness in the writers of the code.
Bear in mind the implementation cost of some C++ features, such as virtual function tables and overloaded operators that create temporary objects.
Along with what everyone else said, I'd just like to add: don't use virtual functions, because with virtual functions a vtable must be created, which can take up who knows how much space.
Also watch out for exceptions. With gcc, I don't believe the size grows for each try/catch block (apart from two function calls per try/catch), but there is a fixed-size support routine that must be linked in, which could be wasting precious bytes.