I searched for this after seeing it's the top rated item when profiling using Very Sleepy, and it seems everyone gets the answer "it's a system function, ignore it". But Sleepy's hint for the function says:
Hint: KiFastSystemCallRet often means
the thread was waiting for something
else to finish. Possible causes
might be disk I/O, waiting for an
event, or maybe just calling Sleep().
Now, my app is absolutely thrashing the CPU, so it's a bit odd that 33% of the time is spent waiting for something to happen.
Do I really just ignore it?
EDIT: Apparently, 77% of the calls to this come from QueryOglResource (?), which is in the module nvd3dnum. I think that might be NVIDIA Direct3D stuff, i.e. rendering.
Don't ignore it. Find out how it's being called.
If you look back up the call stack to where it gets into your code,
that will tell you where the problem is.
It's important to halt it at random (not with a breakpoint), so that the stack traces that are actually costing a lot of time will be most likely to appear.
That function is pretty meaningless for a profiler; it's basically the logical end point for a whole range of system functions. What functions do you have calling it? WaitForMultipleObjects? Async reads?
It's pretty obvious how to visualize a regular call stack and count internal and external execution times. However, once coroutines are involved, the call stack can look pretty messy. I mean, a coroutine may yield execution not to its parent but to another coroutine (e.g. greenlet). Are there some common ways to produce consistent profiling output for such scenarios?
Think about a single sample, of the stack for all threads at the same time.
What you need to know is - who's waiting for whom, and why.
Normally if function A is above B on a stack, it means A is waiting for B to return, and the reason is that A wanted B to do something.
If you look at a whole stack, for one thread, you get a chain of reasons why that particular nanosecond is being spent, by that thread.
If you're looking for speed, you're looking for chains of reasons that, altogether, you don't really need (because there is a weak link).
This works even if the chain ends in I/O.
If it is user input it's simply waiting for the user.
But if it's output, or disk I/O, or plain old CPU cranking, you might be able to do something to reduce it, and get a performance gain (if you see the same problem on 2 or more samples).
What if thread A is waiting for thread B?
Then what you see at the bottom of A's stack is a function that waits for the other thread.
You need to figure out which is thread B, and look at its stack, because the longer it takes, the longer A takes.
So this is more difficult, but surely you're not afraid of that.
I'm talking about manual profiling here, where you take samples yourself, in a debugger, and apply your full attention to each sample.
Profiling tools tend to assume you're lazy and only want numbers, and if nothing jumps out of those numbers you will be happy because you found nothing.
In fact, if some silly needless activity is taking 30% of time, then on average the number of samples you require to see it twice is 2/0.3 = 6.67 samples (not a big number), and it is quite likely that you will see it and the profiler will not.
That's random pausing.
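(For the arithmetic behind that: if the needless activity is on the stack for a fraction p of the time, each random sample shows it with probability p, so the expected number of samples needed to see it r times is r/p; with p = 0.3 and r = 2 that is 2/0.3, about 6.7 samples.)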
I'm writing a music player. I have already done a lot to improve the performance of my audio callback:
All decoding etc. is of course done in a separate thread. That will fill up some buffers.
To avoid any locking, I don't use mutexes; all the relevant structures are coded purely with atomic operations. It's basically a lock-less FIFO (roughly like the sketch after this list).
I try to avoid page faults by using mlock on all the allocated memory.
I set my thread to real-time constraints via thread_policy_set (similar to here).
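For reference, the lock-less FIFO is roughly of this shape (an illustrative sketch with made-up names, not my actual code): a single-producer/single-consumer ring buffer where the decoder thread only advances the write index and the callback only advances the read index.

#include <atomic>
#include <cstddef>
#include <vector>

// Single-producer/single-consumer ring buffer built only on atomics.
// One slot is kept empty so "full" and "empty" can be told apart,
// so capacity must be at least 2.
class SpscFifo {
public:
    explicit SpscFifo(std::size_t capacity) : buf_(capacity) {}

    // Called only by the decoder thread.
    bool push(float sample) {
        std::size_t w = write_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % buf_.size();
        if (next == read_.load(std::memory_order_acquire))
            return false;                 // full: caller retries later
        buf_[w] = sample;
        write_.store(next, std::memory_order_release);
        return true;
    }

    // Called only by the audio callback; never blocks, never allocates.
    bool pop(float& sample) {
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;                 // empty: this is the underflow case
        sample = buf_[r];
        read_.store((r + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<float> buf_;
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
};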
It sometimes still happens that I get underflows. I wonder how to debug that, because I want to know what is causing them.
I was thinking about maybe a way to trace the current execution of the audio callback if it took longer than 2ms or so. But how could I do that?
Also, maybe it still reads some memory which results in page faults. How can I debug those?
All the code in the callback is still somewhat complex. Maybe it's just too complicated. I could work around that by introducing another indirection and making the code really minimal, using just a simple ring buffer. That would introduce some more latency, though, and I'm not sure whether complexity is really the problem.
What I would try: if I have a procedure that is supposed to complete in less than 2 ms, I would set a 2 ms alarm-clock interrupt when entering the procedure and clear it when exiting.
Even if the overtime occurs rarely, this is sure to get it.
So when the interrupt occurs, I can catch it in the debugger and examine the stack.
Would this catch it in the act of doing what is taking extra time?
Maybe, maybe not, but doing it several times is bound to reveal something interesting.
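A minimal sketch of that alarm-clock idea, assuming a POSIX system (the names and the 2 ms budget are just illustrative; a real system might prefer a watchdog thread or a different signal):

#include <csignal>
#include <cstdlib>
#include <sys/time.h>

// If the callback runs past its budget, SIGALRM fires while the callback is
// still on the stack, so an attached debugger (or the core dump from abort)
// shows exactly what it was doing at that moment.
static void overtime_handler(int) {
    abort();                              // or raise(SIGTRAP) under a debugger
}

static void arm_watchdog_us(long usec) {
    struct itimerval t = {};
    t.it_value.tv_usec = usec;            // one-shot timer, no repeat interval
    setitimer(ITIMER_REAL, &t, nullptr);
}

static void disarm_watchdog() {
    struct itimerval t = {};              // a zero value cancels the timer
    setitimer(ITIMER_REAL, &t, nullptr);
}

void audio_callback(/* ... */) {
    signal(SIGALRM, overtime_handler);    // in real code, install this once
    arm_watchdog_us(2000);                // 2 ms budget
    // ... the actual callback work ...
    disarm_watchdog();
}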
The other thing I would do is simply look for speedups in the callback itself.
To do this, I would just randomly pause it manually a number of times while it's running, and each time examine the stack.
I would simply ignore any samples where the callback was not on the stack.
For the remaining samples, where the callback is on the stack, it will be at a random position in the state sequence of the callback, so chances are excellent that if there's anything it's doing whose optimization would save much time, I will see it doing it.
I ran my app twice (in the VS IDE). The first time it took 33 seconds. Then I uncommented obj.save, which calls a lot of code, and it took 87 seconds. That's some slow serialization code! I suspect two problems. The first is that I do the below:
template<class T> void Save_IntX(ostream& o, T v){ o.write((char*)&v,sizeof(T)); }
I call this template hundreds of thousands of times (well, maybe not quite that much). Does each .write() use a lock that may be slowing it down? Maybe I can use a memory stream, which doesn't require a lock, and dump that instead? Which ostream could I use that doesn't lock, perhaps on the understanding that it's only used from a single thread?
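The memory-stream idea I have in mind is roughly this (untested sketch; MemWriter is a made-up name): append raw bytes to a plain vector with no locking or virtual dispatch per value, then hand the whole buffer to the stream in one write at the end.

#include <ostream>
#include <vector>

// Accumulates raw bytes in memory; only valid for trivially copyable types,
// which is all Save_IntX handles anyway.
struct MemWriter {
    std::vector<char> bytes;

    template<class T>
    void put(const T& v) {
        const char* p = reinterpret_cast<const char*>(&v);
        bytes.insert(bytes.end(), p, p + sizeof(T));
    }

    void flush_to(std::ostream& o) {
        o.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    }
};

So instead of calling Save_IntX(file, v) for every value, I'd call put(v) on a MemWriter and do one big write at the end.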
The other suspected problem is that I use dynamic_cast a lot, but I am unsure whether I can work around this.
Here is a quick profiling session after converting it to use fopen instead of ostream. I wonder why I don't see the majority of my functions in this list, but as you can see, write is still taking the longest. Note: I just realized my output file is half a gig. Oops. Maybe that is why.
I'm glad you got it figured out, but the next time you do profiling, you might want to consider a few points:
The VS profiler in sampling mode does not sample during I/O or any other time your program is blocked, so it's only really useful for CPU-bound profiling. For example, if it says a routine has 80% inclusive time, but the app is actually computing only 10% of the time, that 80% is really only 8%. Because of this, for any non-CPU-bound work, you need to use the profiler's instrumentation mode.
Assuming you did that, of all those columns of data, the one that matters is "Inclusive %", because that is the routine's true cost, in the sense that if it could be avoided, that is how much the overall time would be reduced.
Of all those rows of data, the ones likely to matter are the ones containing your routines, because your routines are the only ones you can do anything about. It looks like "Unknown Frames" are maybe your code, if your code is compiled without debugging info. In general, it's a good idea to profile with debugging info, make it fast, and then remove the debugging info.
I have a program I want to profile with gprof. The problem (seemingly) is that it uses sockets. So I get things like this:
::select(): Interrupted system call
I hit this problem a while back, gave up, and moved on. But I would really like to be able to profile my code, using gprof if possible. What can I do? Is there a gprof option I'm missing? A socket option? Is gprof totally useless in the presence of these types of system calls? If so, is there a viable alternative?
EDIT: Platform:
Linux 2.6 (x64)
GCC 4.4.1
gprof 2.19
The socket code needs to handle interrupted system calls regardless of the profiler, but under a profiler it's unavoidable. This means having code like
if (errno == EINTR) { ... }
after each blocking system call.
Take a look, for example, here for the background.
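For example, a select() call that tolerates being interrupted might be wrapped roughly like this (select_retry is just an illustrative name, not anything from gprof or the asker's code):

#include <cerrno>
#include <sys/select.h>

// Retry select() when it returns early because a signal (such as the
// profiler's SIGPROF) interrupted it.
int select_retry(int nfds, fd_set* rd, fd_set* wr, fd_set* ex, timeval* tv) {
    int rc;
    do {
        rc = select(nfds, rd, wr, ex, tv);
    } while (rc == -1 && errno == EINTR);
    return rc;                            // note: Linux updates *tv, so the
                                          // remaining timeout shrinks on retry
}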
gprof (here's the paper) is reliable, but it was only ever intended to measure changes, and even for that, it only measures CPU-bound issues. It was never advertised as useful for locating problems; that is an idea other people layered on top of it.
Consider this method.
Another good option, if you don't mind spending some money, is Zoom.
Added: Let me give you an example. Suppose you have a call hierarchy where Main calls A some number of times, A calls B some number of times, B calls C some number of times, and C waits for some I/O on a socket or file, and that's basically all the program does. Now further suppose that the number of times each routine calls the next one down is 25% more than it really needs to be. Since 1.25^3 is about 2, that means the entire program takes twice as long to run as it really needs to.
In the first place, since all the time is spent waiting for I/O, gprof will tell you nothing about how that time is spent, because it only looks at "running" time.
Second, suppose (just for argument) it did count the I/O time. It could give you a call graph, basically saying that each routine takes 100% of the time. What does that tell you? Nothing more than you already know.
However, if you take a small number of stack samples, you will see on every one of them the lines of code where each routine calls the next.
In other words, it's not just giving you a rough percentage time estimate, it is pointing you at specific lines of code that are costly.
You can look at each line of code and ask if there is a way to do it fewer times. Assuming you do this, you will get the factor of 2 speedup.
People get big factors this way. In my experience, the number of call levels can easily be 30 or more. Every call seems necessary, until you ask if it can be avoided. Even small numbers of avoidable calls can have a huge effect over that many layers.
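(Rough arithmetic behind that last claim, extrapolating the example above: if each of n call levels does its work a fraction f more often than necessary, the whole run scales roughly as (1+f)^n. The example is 1.25^3, about 2; over 30 levels, even f = 0.1 compounds to 1.1^30, about 17.)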
I have a performance issue where I suspect one standard C library function is taking too long and causing my entire system (a suite of processes) to basically "hiccup". Sure enough, if I comment out the library function call, the hiccup goes away. This prompted me to investigate: what standard methods are there to prove this type of thing? What would be the best practice for testing a function to see whether it causes an entire system to hang for a second (causing other processes to be momentarily starved)?
I would at least like to definitively correlate the function being called and the visible freeze.
Thanks
The best way to determine this stuff is to use a profiling tool to get the information on how long is spent in each function call.
Failing that, set up a function that reserves a block of memory. Then, at various points in your code, write a string to that memory that includes the current time. (This avoids the delays associated with writing to the display.)
After you have run your code, pull out the memory and parse it to determine how long the parts of your code are taking.
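A minimal version of that in-memory log might look like this (sketch; TraceLog and its members are made-up names):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Preallocated in-memory log: recording a marker is just an append to
// reserved storage, so it adds far less delay than printing to the console.
struct TraceLog {
    struct Entry { std::chrono::steady_clock::time_point t; const char* tag; };
    std::vector<Entry> entries;

    explicit TraceLog(std::size_t capacity) { entries.reserve(capacity); }

    void mark(const char* tag) {
        if (entries.size() < entries.capacity())
            entries.push_back({std::chrono::steady_clock::now(), tag});
    }

    // After the run, dump the gaps between consecutive markers.
    void dump() const {
        for (std::size_t i = 1; i < entries.size(); ++i) {
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(
                               entries[i].t - entries[i - 1].t).count();
            std::printf("%s -> %s : %lld us\n",
                        entries[i - 1].tag, entries[i].tag, us);
        }
    }
};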
I'm trying to figure out what you mean by "hiccup". I'm imagining your program does something like this:
while (...){
// 1. do some computing and/or file I/O
// 2. print something to the console or move something on the screen
}
and normally the printed or graphical output hums along in a subjectively continuous way, but sometimes it appears to freeze, while the computing part takes longer.
Is that what you meant?
If so, I suspect that in the running state it is almost always in step 2, but in the hiccup state it is spending time in step 1.
I would comment out step 2, so it would spend nearly all its time in the hiccup state, and then just pause it under the debugger to see what it's doing.
That technique tells you exactly what the problem is with very little effort.