I decided to try out the performance analyzer thingy in VS 2012. To my surprise, the test code (way too big to post) runs about 15% faster under the analyzer than in the default Release configuration, measured over roughly a minute. What could be the reason for this? Is it using different compiler flags or something?
To elaborate a bit more on the code: it's a specialized spatial sorting algorithm (most similar to counting sort) that operates on relatively simple POD classes and is looped 10k times; I/O is excluded from the timing.
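To give a rough idea of the shape of the code (this is a stand-in sketch, not the real thing; the Particle struct and cell field are made up for illustration), the sort is essentially a counting-sort pass keyed on a precomputed cell index:

```cpp
#include <cstdint>
#include <vector>

// Stand-in POD; the real fields don't matter for the question.
struct Particle {
    float x, y, z;
    std::uint32_t cell;  // precomputed spatial cell index, < cellCount
};

// Counting-sort-style pass: stable bucket-by-cell reorder, O(n + cells).
void spatialSort(const std::vector<Particle>& in,
                 std::vector<Particle>& out,
                 std::size_t cellCount)
{
    std::vector<std::size_t> count(cellCount + 1, 0);
    for (const Particle& p : in)                     // 1) histogram of cells
        ++count[p.cell + 1];
    for (std::size_t i = 1; i <= cellCount; ++i)     // 2) prefix sum -> start offsets
        count[i] += count[i - 1];
    out.resize(in.size());
    for (const Particle& p : in)                     // 3) scatter into sorted order
        out[count[p.cell]++] = p;
}
```

The benchmark loops something like this 10k times and only times the sort itself, not the file I/O around it.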
Well, I think I finally got a clue. When I run the program from the IDE, the debugger hooks that allow for breakpoints are slowing it down; you can't set breakpoints under the analyzer. To confirm, I ran the program from the .exe directly, and sure enough it was even faster than under the analyzer (not by much, but still), probably because it didn't have the sampler poking at it. Mystery solved!
I'm using an A8-5600K, so I'm used to waiting a bit for CPU tasks to complete - nothing too bad, but as far as CPUs go it's not the greatest. 12 GB RAM, running the very latest Visual Studio 2017 (15.6.6, iirc) as of this morning.
I want to run static analysis simply because I believe it could save me time chasing edge cases and bugs while also making the code look nicer. I've got a relatively small project (single file, 5000 lines) which uses wxWidgets, OpenCV, the C++ standard library (including chrono and threads), the Windows API, and probably a few other things. I'm only running a single check (Style), as I figured it would be faster than running every check. It's been running for an hour, and in the meantime my CPU is more or less tapped out and I can't compile anything.
I presume that by this point the Checker got through my code in no time and is spending ages trying to tell me how to fix OpenCV or something. I understand that without knowing how OpenCV works, the Checker might find it difficult to understand some of what's going on in my code.
What can I do? Is the Checker simply not meant for consumer-grade CPUs? If I ran this overnight, would it finish? Does the number of rules have that much of an effect, or should I just turn them all on (i.e. is it 95% crawling my project and 5% analysis)?
Also, the posts I've seen regarding third-party libraries seem content with simply filtering out the warnings after the check (the words on the blackboard still get written, they're merely erased afterward, so to speak) - that likely won't speed things up for me.
I have some heavily templated C++ code that I am working with. I can compile and profile it with the AMD tools and Sleepy in debug mode. However, without optimisation most of the time is spent in the templated code and the STL. With optimised compilation, all the profiling tools that I know produce garbage information. Does anybody know a good way to profile optimised native code?
PS1:
The code that I am writing is also heavily templated. Most of the time spent in the unoptimised code will be optimised away. I am talking about 96-97% of the run time being spent in templated code when compiled without optimisation, which is going to corrupt the accuracy of the profiling. And yes, I can change a lot of the templated code, or at least see which part of it is introducing the most trouble, and do better in those places.
You should focus on the code you wrote, because that is what you can change. Time spent in the STL is irrelevant; just ignore it and focus on the callers of that code. If too much time is spent in the STL, you can probably call some other STL primitive instead of the current one.
Profiling unoptimized code is less interesting, but you can still get some information. If the algorithms used in some parts of the code are totally flawed, that will show up even there. But you should be able to get useful information from any good profiling tool on optimized code. What tools do you use exactly, and why do you call their output garbage?
Also, it's usually easy enough to instrument your code by hand and find out exactly which parts are efficient and which are not. It's just a matter of calling timer functions (or reading the processor's cycle counter, if possible) at well-chosen points. I usually do that from unit tests to get reproducible results, but it all depends on the specifics of your program.
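For example, a minimal hand-rolled measurement might look like the sketch below; doWork is just a placeholder for whatever you actually want to time:

```cpp
#include <chrono>
#include <cstdio>

// Placeholder for the code under test.
volatile double sink = 0.0;
void doWork()
{
    for (int i = 0; i < 10000; ++i)
        sink = sink + i * 0.5;
}

int main()
{
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < 1000; ++i)   // repeat enough times for a stable reading
        doWork();
    const auto stop = clock::now();

    const long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("doWork: %lld us total, %.2f us per call\n", us, us / 1000.0);
}
```

Wrapping a few candidate hot spots like this quickly tells you where the time really goes, independent of what any profiler claims.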
Tools or instrumenting code are the easy part of optimization. The hard part is finding ways to get faster code where it's needed.
What do you mean by "garbage information"?
Profiling is only really meaningful on optimized builds, so tools are designed to work with them -- thus if you're getting meaningless results, it's probably due to the profiler not finding the right symbols, or needing to instrument the build.
In the case of Intel VTune, for example, I found I got impossible results from the sampler unless I explicitly told it where to find the PDBs for the executable I was tuning. In the instrumented version, I had to fiddle with the settings until it was reliably putting probes into the function calls.
When @kriss says "You should focus on the code you wrote, because that is what you can change", that's exactly what I was going to say.
I would add that, in my opinion, it is easier to do performance tuning first on code compiled without optimization, and then later turn on the optimizer, for the same reason. If something you can fix is costing excess time, it will cost proportionally excess time regardless of what the compiler does, and it's easier to find in code that hasn't been scrambled by the optimizer.
I don't look for such code by measuring time. If the excess time is, say, 20%, then what I do is randomly pause the program several times. As soon as I see something that could obviously be improved on two or more samples, I've found it. It's an oddball method, but it doesn't really miss anything. I do measure the overall time before and after to see how much I saved. This can be repeated until you can't find anything left to fix. (BTW, if you're on Linux, Zoom is a more automated way to do this.)
Then you can turn on the optimizer and see how much it gives you, but when you see what changes you made, you can see there's really no way the compiler could have done it for you.
When building projects in Visual Studio (I'm using 2008 SP1) there is an optimization option called Enable link-time code generation. As far as I understand, this allows specific inlining techniques to be used, and that sounds pretty cool.
Still, using this option dramatically increases the size of the static libraries built. In my case it went from something like 40 MB to 250 MB, and obviously the build process becomes REALLY slow if you have even 5-6 libraries that huge.
So my question is: is it worth it? Is the effect of link-time code generation measurable enough that I should leave it turned on and suffer the slooooooooooooow builds?
Thank you.
How are we supposed to know? You're the one suffering the slower link times.
If you can live with the slower builds, then it speeds up your code, which is good.
If you want faster builds, you lose out on optimizations, making your code run slower.
Is it worth it? That depends on you and nothing else. How patient are you? How long can you wait for a build?
It can significantly speed up your code though. If you need the speed, it is a very valuable optimization.
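To make the potential gain concrete: the headline benefit of link-time code generation is cross-module optimization - objects compiled with /GL are only fully code-generated at the /LTCG link, so the linker can inline a function defined in one .cpp (or in a static library) into a caller in another. A contrived sketch (file and function names made up):

```cpp
// math_utils.cpp -- compiled into a static library with /GL
long long addOne(long long x)   // trivial, but lives in a different translation unit
{
    return x + 1;
}

// main.cpp -- linked against that library with /LTCG
long long addOne(long long x);  // only the declaration is visible here

int main()
{
    long long total = 0;
    for (long long i = 0; i < 100000000; ++i)
        total += addOne(i);     // without LTCG: an out-of-line call every iteration
                                // with LTCG: the call can be inlined away
    return static_cast<int>(total & 0xff);
}
```

Whether that kind of inlining translates into a measurable win for your particular code is exactly what you would have to benchmark.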
It's up to you. This is rather a subjective question. Here are a few things to go over to help you make that determination:
Benchmark the performance with and without this feature. Sometimes smaller code runs faster, sometimes more inlining wins. It's not always clear-cut.
Is performance critical? Will your client reject your application at its current speed unless you find a way to improve things on that front?
How slow is acceptable in the build process? Do you have to keep this on for your own local builds, or can you push it off to the test environment / continuous build machine?
Personally, I'd go with whatever helped me develop faster and then worry about the optimizations later. Make sure that it does what it needs to do first.
Are there any tools that give some sort of histogram of where most of the execution time of the program is spent?
This is for a project using C++ in Visual Studio 2008.
The name you're after is a profiler. Try "Find Application Bottlenecks with Visual Studio Profiler".
You need a profiler.
Visual Studio Team edition includes a profiler (which is what you are looking for) but you may only have access to the Professional or Express editions. Take a look at these threads for alternatives:
What's your favorite profiling tool (for C++)
What are some good profilers for native C++ on Windows?
You really shouldn't optimize ANY parts of your application until you've measured how long they take to run. Otherwise you may be directing effort in the wrong place, and you may be making things worse, not better.
I have used a profiler called AQtime, which gives every detail you want to know about the performance of your code. It's not free though.
You could get a histogram of the program counter, but it is practically useless unless you are doing something dumb like spending time in a bubble sort of a big array of ints or doubles.
If you do something as simple as a bubble sort of an array of strings, the PC histogram will only tell you that you have a hotspot in the string compare routine.
That's not much help, is it?
I know you wouldn't do such a bubble sort, but just for fun, let's assume you did, and it was taking 90% of your time. (i.e. if you fixed it, it could go up to 10 times faster.)
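Purely for illustration (this is the hypothetical bad code, not anything from the question), such a loop would look like:

```cpp
#include <string>
#include <utility>
#include <vector>

// Deliberately bad: O(n^2) bubble sort over strings. A flat PC histogram
// pins the time on std::string's comparison; the fixable problem is this
// loop, one level up the call stack.
void bubbleSort(std::vector<std::string>& v)
{
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        for (std::size_t j = 0; j + 1 < v.size() - i; ++j)
            if (v[j + 1] < v[j])              // the "hotspot" lands in operator<
                std::swap(v[j], v[j + 1]);
}
```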
It's actually a very easy thing to find, because if you just hit the pause button in the debugger, you will almost certainly see that it stops in the string compare routine. Then if you look up the stack one level, you will be looking directly at the bubble sort loop which is your bug. If you're not sure you've really spotted the problem, just pause it several times. The number of times you see the problem tells you how costly it is.
Any line of code that appears on the call stack on multiple pauses is something that is begging you to fix it. Some you can't, like "call _main", but if you can, you will get a good speedup, guaranteed.
Then do it again, and again.
When you run out of things you can fix, then you've really tuned the program within an inch of its life.
It's that simple.
You could also use the profiler in Visual Studio. It is a nice tool but be aware of these shortcomings:
Confusing you with "exclusive time", which, if you concentrate on line-level information, is almost meaningless.
If your program is wasting time doing I/O, it won't see that, because when it stops to do I/O, the samples stop, unless you use instrumentation.
But if you use instrumentation, you won't get line-level information, only function-level. That's OK if your functions are all small.
Confusing you with the "call tree". What matters for a line of code is how many stack samples it is on. If it is in many branches of the call tree, the call tree won't show you what it really costs.
If it tells you a line is costly, it cannot tell you why. For that you want to see as much state information on each sample as you need, rather than just summaries.
It's hard to tell it when you want to do samples, and when not. You want it to sample when you're waiting for the app, not when it's waiting for you.
So now that you know you need a profiler, you might not have the Visual Studio one, so Very Sleepy might be of help.
Does anyone know this compiler feature (profile-guided optimization)? It seems GCC supports it. How does it work? What is the potential gain? In which cases is it good? Inner loops?
(this question is specific, not about optimization in general, thanks)
It works by placing extra code to count the number of times each code path is taken. When you compile a second time, the compiler uses the knowledge it gained about the execution of your program that it could previously only guess at. There are a couple of things PGO can work toward:
Deciding which functions should be inlined or not depending on how often they are called.
Deciding how to place hints about which branch of an "if" statement should be predicted, based on the percentage of calls going one way or the other.
Deciding how to optimize loops based on how many iterations get taken each time that loop is called.
You never really know how much these things can help until you test it.
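For reference, here is a sketch of what the GCC round trip looks like; the toy program is made up, but -fprofile-generate and -fprofile-use are the standard flag pair:

```cpp
// hot.cpp
//
// 1) Instrumented build:  g++ -O2 -fprofile-generate hot.cpp -o hot
// 2) Training run:        ./hot          (writes .gcda profile data)
// 3) Optimized rebuild:   g++ -O2 -fprofile-use hot.cpp -o hot
//
// On the rebuild, GCC knows from the training run that the x < 0 branch
// below is essentially never taken, and can bias branch layout, inlining,
// and loop optimization accordingly.

#include <cstdlib>

long long process(long long x)
{
    if (x < 0)                  // a guess at compile time; a measured fact after PGO
        return -x;
    return x * 2 + 1;
}

int main(int argc, char** argv)
{
    const long long bias = (argc > 1) ? std::atoll(argv[1]) : 0;
    long long sum = 0;
    for (long long i = 0; i < 10000000; ++i)
        sum += process(i % 1000 - bias);   // with no argument, always >= 0
    return static_cast<int>(sum & 0xff);
}
```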
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). It's a nice free speed boost in some cases, and it probably helps more in applications that, unlike x264, are less made up of handwritten assembly.
Jason's advice is right on. The best speedups you are going to get come from "discovering" that you let an O(n^2) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization, PGO might be able to help. We never had much luck with it though - the cost of the instrumentation was such that our application became unusably slow (by several orders of magnitude).
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gprof if you're using GCC) and just poking around the results of running your application through some normal operations.