C++ Asymptotic Profiling

I have a performance issue where I suspect one standard C library function is taking too long and causing my entire system (a suite of processes) to basically "hiccup". Sure enough, if I comment out the library function call, the hiccup goes away. This prompted me to investigate: what standard methods are there to prove this type of thing? What would be the best practice for testing a function to see if it causes an entire system to hang for a second (causing other processes to be momentarily starved)?
I would at least like to definitively correlate the function being called and the visible freeze.
Thanks

The best way to determine this stuff is to use a profiling tool to get information on how much time is spent in each function call.
Failing that, set up a function that reserves a block of memory. Then, at various points in your code, write a string including the current time into that memory. (This avoids the delays associated with writing to the display.)
After you have run your code, pull out the memory and parse it to determine how long parts of your code are taking.
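A minimal sketch of that idea (the names are only illustrative, and the clock and buffer sizing would need to fit your platform):
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>
// Pre-allocated in-memory trace: append timestamped markers during the run,
// then dump them to a file afterwards, so the display never slows you down.
struct TraceBuffer {
    std::vector<std::string> entries;
    explicit TraceBuffer(std::size_t expected) { entries.reserve(expected); }
    void mark(const char* label) {
        auto now = std::chrono::steady_clock::now().time_since_epoch();
        auto us  = std::chrono::duration_cast<std::chrono::microseconds>(now).count();
        entries.push_back(std::to_string(us) + " " + label);
    }
    void dump(const char* path) const {
        if (std::FILE* f = std::fopen(path, "w")) {
            for (const auto& e : entries) std::fprintf(f, "%s\n", e.c_str());
            std::fclose(f);
        }
    }
};
// Usage: TraceBuffer trace(100000); trace.mark("before suspect_call");
// suspect_call(); trace.mark("after suspect_call"); ... trace.dump("trace.log");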

I'm trying to figure out what you mean by "hiccup". I'm imagining your program does something like this:
while (...){
// 1. do some computing and/or file I/O
// 2. print something to the console or move something on the screen
}
and normally the printed or graphical output hums along in a subjectively continuous way, but sometimes it appears to freeze, while the computing part takes longer.
Is that what you meant?
If so, I suspect that in the running state it is almost always in step 2, but in the hiccup state it is spending time in step 1.
I would comment out step 2, so it would spend nearly all its time in the hiccup state, and then just pause it under the debugger to see what it's doing.
That technique tells you exactly what the problem is with very little effort.

Related

Change a currently unused class implementation at runtime

I have C++ code executing on a big file (~15 GB). The code has two phases, and the first phase takes a long time to finish. In the meantime I have come up with a better implementation technique for phase 2, and I don't want to restart the whole execution from the start. The two phases are characterized by the two classes actually being used. To give an idea:
Parser.parse(filePath); // phase one
Processor.processAndLog(); // phase two
So, is there some way to change the implementation of the Processor class before it starts executing? The end of phase 1 (or even how much of it has completed) can be determined from the messages (logs) I print from time to time.
If Processor.processAndLog is a member function pointer, then you can change it any time before it is called.
An alternative is to have Processor.processAndLog be a wrapper function for other functions - a dispatch function.
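A minimal sketch of what that dispatch indirection could look like (hypothetical names; note that it only helps if the indirection was compiled in before the long run started):
#include <functional>
#include <iostream>
class Processor {
public:
    // The real work is dispatched through this callable, which can be swapped at runtime.
    std::function<void()> processAndLogImpl = [] { std::cout << "old implementation\n"; };
    void processAndLog() { processAndLogImpl(); }
};
int main() {
    Processor p;
    // ... phase one runs here ...
    // Before phase two begins, re-point the dispatch target, e.g. at a function
    // loaded from a plugin with dlopen()/GetProcAddress().
    p.processAndLogImpl = [] { std::cout << "new, faster implementation\n"; };
    p.processAndLog();   // phase two now runs the new code
}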
There is also the option of hooking a function; there is a library called Detours for that. Use this only if you can't change the source code of the program.
So if I understand this correctly: You have a program that is currently running, but which hasn't yet gotten to executing the code in a particular class. And you want to find a way to update it to use a new version of the code for that class without stopping the program.
In theory that could be done. But in practice, it's likely to be far more trouble than it's worth, especially if this is a one-time need. C++ was not designed for this kind of thing. It's not like there is simply human-readable source code sitting in the process's memory that could be overwritten easily.
Doing this correctly would almost certainly take a significant amount of time and effort, most likely involving a lot of experimenting and trial and error. If you get something about it wrong (which is likely the first time), then you've probably just corrupted your process and your results, and would thus need to restart it anyway.
I don't know how long your process is currently taking, but trying to figure out how to accomplish this idea might very well take more time than just restarting the process after building the new version of the program.

Profiling algorithm

I need to implement execution time measuring functionality. I thought about two possibilities.
First, regular time() calls: just remember the time each execution step starts and the time it completes. The Unix time shell command works this way.
The second method is sampling. Every execution step sets some sort of flag before execution begins (for example, by creating some object in the stack frame) and destroys it when it completes. A timer periodically scans all the flags and generates an execution-time profile. If some execution step takes more time than the others, it will be scanned more often. Many profilers work like this.
I need to add some profiling functionality to my server application. Which method is better, and why? I think the second method is less accurate, and the first method adds a dependency on profiling library code.
The second method is essentially stack sampling.
You can try to do it yourself, by means of some kind of entry/exit event capture, or better, use a utility that actually reads the stack.
The latter has an advantage in that you get line-of-code resolution, rather than just method-level.
There's something that a lot of people don't get about this, which is that precision of timing measurement is far less important than precision of problem identification.
It is important to take samples even during I/O or other blocking, so you are not blind to needless I/O. If you are worried that competition with other processes could inflate the time, don't be, because what really matters is not absolute time measurements, but percentages.
For example, if one line of code is on the stack 50% of the wall-clock time, and thus responsible for that much, getting rid of it would double the speed of the app, regardless of whatever else is going on.
Profiling is about more than just getting samples.
Often people are pretty casual about what they do with them, but that's where the money is.
First, inclusive time is the fraction of time a method or line of code is on the stack. Forget "self" time - it's included in inclusive time.
Forget invocation counting - its relation to inclusive percent is, at best, very indirect.
If you are summarizing, the best way to do it is to have a "butterfly view" whose focus is on a single line of code.
To its left and right are the lines of code appearing immediately above it and below it on the stack samples.
Next to each line of code is a percent - the percent of stack samples containing that line of code.
(And don't worry about recursion. It's simply not an issue.)
Even better than any kind of summary is to just let the user see a small random selection of the stack samples themselves.
That way, the user can get the whole picture of why each snapshot in time was being spent.
Any avoidable activity appearing on more than one sample is a chance for some serious speedup, guaranteed.
People often think "Well, that could just be a fluke, not a real bottleneck".
Not so. Fixing it will pay off, maybe a little, maybe a lot, but on average - significant.
People should not be ruled by risk-aversion.
More on all that.
When Boost is an option, you can use its timer library.
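For example (a sketch using Boost.Timer's cpu_timer; check the documentation for your Boost version):
#include <boost/timer/timer.hpp>
#include <cmath>
#include <iostream>
int main() {
    boost::timer::cpu_timer timer;               // starts timing immediately
    double sum = 0.0;                            // stand-in for the code under test
    for (int i = 1; i < 10000000; ++i) sum += std::sqrt(static_cast<double>(i));
    timer.stop();
    std::cout << timer.format();                 // wall, user and system time
    std::cout << "checksum: " << sum << '\n';    // keep the loop from being optimized away
}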
Make sure that you really know what you're looking for in the profiler you're writing. Whenever you collect the total execution time of a certain piece of code, it will include the time spent in all its children, and it may be hard to find the real bottleneck in your system, as the top-level function will always bubble up as the most expensive one - main(), for instance.
What I would suggest is to hook into every function's prologue and epilogue (if your application is a CLR application, you can use ICorProfilerInfo::SetEnterLeaveFunctionHooks to do that; you can also use macros at the beginning of every method, or any other mechanism that would inject your code at the beginning and end of every function) and collect your timings in the form of a tree for each thread that you're profiling.
The algorithm would look something like this:
For each thread that you're monitoring create a stack-like data structure.
Whenever you're notified about a function that began execution, push something that would identify the function into that stack.
If that function is not the only function on the stack, then you know that the previous function that did not return yet was the one that called your function.
Keep track of those callee-called relationships in your favorite data structure.
Whenever a method returns, its identifier will be on top of its thread's stack. Its total execution time is equal to (current time - time when its identifier was pushed onto the stack). Pop that identifier off the stack.
This way you'll have a tree-like breakdown of what eats up your execution time where you can see what child calls account for the total execution time of a function.
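A rough sketch of that bookkeeping (single-threaded here for brevity; the identifier type and the clock are placeholders for whatever your enter/leave hooks provide):
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>
using FunctionId = std::string;
using Clock      = std::chrono::steady_clock;
struct Frame { FunctionId id; Clock::time_point entered; };
struct Stats {
    Clock::duration inclusive{};   // total time this (caller, callee) pair was active
    long long       calls = 0;
};
std::vector<Frame> callStack;                                    // one per thread in real code
std::map<std::pair<FunctionId, FunctionId>, Stats> callTree;     // (caller, callee) -> stats
void onEnter(const FunctionId& fn) {
    callStack.push_back({fn, Clock::now()});
}
void onLeave() {
    Frame frame = callStack.back();
    callStack.pop_back();
    FunctionId caller = callStack.empty() ? "<root>" : callStack.back().id;
    Stats& s = callTree[{caller, frame.id}];
    s.inclusive += Clock::now() - frame.entered;
    ++s.calls;
}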
Have fun!
In my profiler I used an extended version of the first approach you mentioned.
I have a class which provides context objects. You define them in your code as automatic objects, so they are freed as soon as execution flow leaves the context where they were defined (for example, a function or a loop). The constructor calls GetTickCount (it was a Windows project; you can choose an analogous function appropriate to your target platform) and stores this value, while the destructor calls GetTickCount again and calculates the difference between that moment and the start. Each object has a unique context ID (it can be autogenerated as a static object inside the same context), so the profiler can sum up all timings with the same ID, which means that the same context has been passed through several times. The number of executions is also counted.
Here is a macro for preprocessor, which helps to profile a function:
#define PROFILEFUNC static ProfilerLocator locator(__FUNCTION__); ProfilerObject obj(locator);
When I want to profile a function, I just insert PROFILEFUNC at the beginning of the function. This generates a static locator object which identifies the context and stores the function name as its name (you may choose another naming scheme). Then an automatic ProfilerObject is created on the stack; it "traces" its own creation and deletion, reporting them to the profiler.
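For reference, a portable sketch of what those two classes might look like (std::chrono instead of GetTickCount; the details are guesses, only the overall shape follows the description above):
#include <chrono>
#include <cstdio>
#include <string>
// One static instance per profiled context: accumulates total time and call count.
class ProfilerLocator {
public:
    explicit ProfilerLocator(const char* name) : name_(name) {}
    ~ProfilerLocator() {                         // report the totals at shutdown
        std::printf("%s: %lld us over %lld calls\n",
                    name_.c_str(), static_cast<long long>(total_.count()), calls_);
    }
    void add(std::chrono::microseconds elapsed) { total_ += elapsed; ++calls_; }
private:
    std::string               name_;
    std::chrono::microseconds total_{0};
    long long                 calls_ = 0;
};
// Automatic object: measures its own lifetime and reports it to the locator.
class ProfilerObject {
public:
    explicit ProfilerObject(ProfilerLocator& loc)
        : loc_(loc), start_(std::chrono::steady_clock::now()) {}
    ~ProfilerObject() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        loc_.add(std::chrono::duration_cast<std::chrono::microseconds>(elapsed));
    }
private:
    ProfilerLocator&                      loc_;
    std::chrono::steady_clock::time_point start_;
};
#define PROFILEFUNC static ProfilerLocator locator(__FUNCTION__); ProfilerObject obj(locator);
void someWork() {
    PROFILEFUNC
    // ... body being measured ...
}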

How do I detect where the program is stuck in an infinite loop?

I am working on a (relatively complex) game. The game freezes in release mode. The freeze happens after 1-2 minutes of game-play. The current release-mode configuration I have allows me to break (that is, go into the debugger), which is good; it may give me wrong information, but that is fine for this particular case (I can turn off optimization for a single file/function/piece of code).
The problem is, I (we, since we are a team) don't know where it is hanging. It is not as simple as one relatively small infinite loop hanging: other things (graphics, sound) are still being updated; it is just that the game-play has stalled. The main game loop (an infinite loop) is always running and is very long/complex, so stepping through it is going to be a pain (but it is one of the options).
The first thing I tried is Visual Studio's Break All, but it always breaks in code that is not mine and consequently shows me assembly output. Eventually, with enough persistence, SVN history checking, and commenting out code, I will be able to figure out where it is hanging, but there has to be a better way... hopefully?
Note: There is a Visual Studio option I am aware of that allows debugging user code only, but that is managed code only.
EDIT: I was able to solve the problem via the stack trace and many hours of keeping track of various things to see where the game is hanging. I will select Sjoerd's answer as the correct one; however, if someone has a suggestion for a tool/technique that automates such a task, by all means add your answer!
If you break and you encounter native code that is not yours, check the call stack. The call stack is the list of functions that got called to reach the current point in the code. Go up some levels in the stack until you encounter the method which is currently running.
Hit the pause button in Visual Studio while the program is hung.
This should break the debugger at the current line. You can then step through and see what is happening.
As an alternative to debugging symbols and breaks (which are the tool of choice when possible), add logging: it is not uncommon for games (and other apps) to have a large logging system they can turn on and off with a compiler flag, so they can still do some kind of debugging/tracing in "release builds". If your logging works fine, you should see what is and is not happening and get at least some idea of where things go wrong.
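A sketch of that pattern (the macro name and flag here are made up):
#include <cstdio>
#include <ctime>
#ifdef GAME_TRACE
    // Compiled with -DGAME_TRACE: print a timestamped line with file and line number.
    #define TRACE(msg) \
        do { std::fprintf(stderr, "[%ld] %s:%d %s\n", \
                          static_cast<long>(std::time(nullptr)), __FILE__, __LINE__, msg); } while (0)
#else
    // Compiled without the flag: the calls disappear entirely.
    #define TRACE(msg) do { } while (0)
#endif
void updateGameplay() {
    TRACE("entering gameplay update");
    // ... game logic ...
    TRACE("leaving gameplay update");
}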
You might well never be able to catch the problem via an interrupt if the code that should be executing isn't executing. There are lots of ways this can happen. Just a few:
You have some parameter that indicates the time at which the next update is to be performed. If this somehow gets set to some big number, the code that does the update will happily see that nothing needs to be done. Next! This can give all the appearances of a hung program even though it isn't really hung at all. The state update and the graphics functions are still being called at their prescribed rate.
You may have some counter that represents time and some rounding mechanism for incrementing it. If the counter is a 32 bit signed int and the granularity of your counter is 0.1 microseconds, you will hit INT32_MAX after just 3.6 minutes. Now time is frozen, so once again you have a situation where updates may not be performed.
You are using a single precision floating point number to represent time and update time via time += delta_t; This will stop working after a couple of minutes if your delta_t is 10 microseconds. This is yet another mechanism by which time can be frozen.
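A tiny self-contained demonstration of that last failure mode (illustrative only; on a typical float implementation the accumulator stops advancing after a few simulated minutes, once delta_t falls below half a unit in the last place of time):
#include <cstdio>
int main() {
    float time = 0.0f;
    const float delta_t = 10e-6f;                // 10 microsecond update step
    for (long long step = 0; step < 100000000LL; ++step) {
        float next = time + delta_t;
        if (next == time) {                      // adding delta_t no longer changes the value
            std::printf("time froze at %.2f simulated seconds\n", time);
            return 0;
        }
        time = next;
    }
    std::printf("time kept advancing\n");
}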
Edit
Have you looked at the CPU usage in your various threads? The above problems might cause the physics or game-playing thread to exhibit a drastic drop in CPU usage after a couple of minutes. You might also get this behavior if the game playing thread is perpetually locked, but here you might (with the right tool) get an indication that that thread is always asleep.

Profiling serialization code

I ran my app twice (in the VS IDE). The first time it took 33 seconds. Then I un-commented obj.save, which calls a lot of code, and it took 87 seconds. That's some slow serialization code! I suspect two problems. The first is that I do the below:
template<class T> void Save_IntX(ostream& o, T v){ o.write((char*)&v,sizeof(T)); }
I call this template hundreds of thousands of times (well, maybe not quite that many). Does each .write() use a lock that may be slowing it down? Maybe I could use a memory stream, which doesn't require a lock, and dump that instead? Which ostream could I use that doesn't lock, given that it is only used from a single thread?
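For example, I imagine buffering everything into an in-memory stream and writing it out once at the end, something like this:
#include <fstream>
#include <sstream>
template<class T> void Save_IntX(std::ostream& o, T v) { o.write((char*)&v, sizeof(T)); }
int main() {
    std::ostringstream buffer;                   // the many small writes now only touch memory
    for (int i = 0; i < 100000; ++i)
        Save_IntX(buffer, i);
    std::ofstream file("out.bin", std::ios::binary);
    const std::string data = buffer.str();
    file.write(data.data(), static_cast<std::streamsize>(data.size()));
}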
The other suspected problem is that I use dynamic_cast a lot, but I am unsure whether I can work around this.
Here is a quick profiling session after converting it to use fopen instead of ostream. I wonder why I don't see the majority of my functions in this list, but as you can see, write is still taking the longest. Note: I just realized my output file is half a gig. Oops. Maybe that is why.
I'm glad you got it figured out, but the next time you do profiling, you might want to consider a few points:
The VS profiler in sampling mode does not sample during I/O or any other time your program is blocked, so it's only really useful for CPU-bound profiling. For example, if it says a routine has 80% inclusive time, but the app is actually computing only 10% of the time, that 80% is really only 8%. Because of this, for any non-CPU-bound work, you need to use the profiler's instrumentation mode.
Assuming you did that, of all those columns of data, the one that matters is "Inclusive %", because that is the routine's true cost, in the sense that if it could be avoided, that is how much the overall time would be reduced.
Of all those rows of data, the ones likely to matter are the ones containing your routines, because your routines are the only ones you can do anything about. It looks like "Unknown Frames" are maybe your code, if your code is compiled without debugging info. In general, it's a good idea to profile with debugging info, make it fast, and then remove the debugging info.

Using gprof with sockets

I have a program I want to profile with gprof. The problem (seemingly) is that it uses sockets. So I get things like this:
::select(): Interrupted system call
I hit this problem a while back, gave up, and moved on. But I would really like to be able to profile my code, using gprof if possible. What can I do? Is there a gprof option I'm missing? A socket option? Is gprof totally useless in the presence of these types of system calls? If so, is there a viable alternative?
EDIT: Platform:
Linux 2.6 (x64)
GCC 4.4.1
gprof 2.19
The socket code needs to handle interrupted system calls regardless of the profiler, but under a profiler it's unavoidable. This means having code like:
if ( errno == EINTR ) { ...
after each system call.
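For example, a retry wrapper along these lines (a sketch; a production version should also recompute the remaining timeout before retrying):
#include <cerrno>
#include <sys/select.h>
// Restart select() when it is interrupted by a signal (e.g. gprof's SIGPROF).
int select_retry(int nfds, fd_set* readfds, fd_set* writefds,
                 fd_set* exceptfds, struct timeval* timeout) {
    int rc;
    do {
        rc = select(nfds, readfds, writefds, exceptfds, timeout);
    } while (rc == -1 && errno == EINTR);
    return rc;
}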
Take a look, for example, here for the background.
gprof (here's the paper) is reliable, but it was only ever intended to measure changes, and even for that, it only measures CPU-bound issues. It was never advertised to be useful for locating problems. That is an idea that other people layered on top of it.
Consider this method.
Another good option, if you don't mind spending some money, is Zoom.
Added: If I can just give you an example. Suppose you have a call-hierarchy where Main calls A some number of times, A calls B some number of times, B calls C some number of times, and C waits for some I/O with a socket or file, and that's basically all the program does. Now, further suppose that the number of times each routine calls the next one down is 25% more times than it really needs to. Since 1.25^3 is about 2, that means the entire program takes twice as long to run as it really needs to.
In the first place, since all the time is spent waiting for I/O, gprof will tell you nothing about how that time is spent, because it only looks at "running" time.
Second, suppose (just for argument) it did count the I/O time. It could give you a call graph, basically saying that each routine takes 100% of the time. What does that tell you? Nothing more than you already know.
However, if you take a small number of stack samples, you will see on every one of them the lines of code where each routine calls the next.
In other words, it's not just giving you a rough percentage time estimate, it is pointing you at specific lines of code that are costly.
You can look at each line of code and ask if there is a way to do it fewer times. Assuming you do this, you will get the factor of 2 speedup.
People get big factors this way. In my experience, the number of call levels can easily be 30 or more. Every call seems necessary, until you ask if it can be avoided. Even small numbers of avoidable calls can have a huge effect over that many layers.