Linux time-based sampling profiler - C++

short version:
Is there a good time-based sampling profiler for Linux?
long version:
I generally use OProfile to optimize my applications. I recently found a shortcoming that has me wondering.
The problem was a tight loop spawning c++filt to demangle a C++ name. I only stumbled upon the code by accident while chasing down another bottleneck. OProfile didn't show anything unusual about the code, so I almost ignored it, but my code sense told me to optimize the call and see what happened. I changed the popen of c++filt to abi::__cxa_demangle, and the runtime went from more than a minute to a little over a second: about a 60x speedup.
Is there a way I could have configured OProfile to flag the popen call? As the profile data sits now, OProfile thinks the bottleneck was the heap and std::string calls (which, BTW, once optimized, dropped the runtime to less than a second, more than a 2x speedup).
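For reference, the in-process demangling change looks roughly like this (a minimal sketch; the helper name and the fall-back-to-input error handling are my own):

#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle a C++ symbol in-process instead of popen()ing c++filt.
std::string demangle(const char* mangled) {
    int status = 0;
    // __cxa_demangle() malloc()s the output buffer; the caller must free() it.
    char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || demangled == nullptr)
        return mangled;               // fall back to the mangled name on failure
    std::string result(demangled);
    std::free(demangled);
    return result;
}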
Here is my OProfile configuration:
$ sudo opcontrol --status
Daemon not running
Event 0: CPU_CLK_UNHALTED:90000:0:1:1
Separate options: library
vmlinux file: none
Image filter: /path/to/executable
Call-graph depth: 7
Buffer size: 65536
Is there another profiler for Linux that could have found the bottleneck?
I suspect the issue is that OProfile only logs its samples to the currently running process. I'd like it to always log its samples to the process I'm profiling, so if the process is currently switched out (blocking on I/O or a popen call), OProfile would just place its sample at the blocked call.
If I can't fix this, OProfile will only be useful when the executable is pushing near 100% CPU. It can't help with executables that have inefficient blocking calls.

Glad you asked. I believe OProfile can be made to do what I consider the right thing, which is to take stack samples on wall-clock time when the program is being slow and, if it won't let you examine individual stack samples, at least summarize, for each line of code that appears on samples, the percent of samples the line appears on. That is a direct measure of what would be saved if that line were not there. Here's one discussion. Here's another, and another. And, as Paul said, Zoom should do it.
If your time went from 60 sec to 1 sec, that implies every single stack sample would have had a 59/60 probability of showing you the problem.

Try Zoom; I believe it will let you profile all processes. It would be interesting to know whether it highlights your problem in this case.

I wrote this a long time ago, only because I couldn't find anything better: https://github.com/dicej/profile
I just found this, too, though I haven't tried it: https://github.com/oliver/ptrace-sampler

A quickly hacked-up, trivial sampling profiler for Linux: http://vi-server.org/vi/simple_sampling_profiler.html
It appends a backtrace(3) to a file on SIGUSR1 and then converts the result to annotated source.
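The core of such a hack is small enough to sketch here (the sample-file path is made up; backtrace_symbols_fd is used because, unlike backtrace_symbols, it does not call malloc and so is safe inside a signal handler):

#include <execinfo.h>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

// On SIGUSR1, append the current call stack to a sample file.
static void sample_handler(int) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    int fd = open("/tmp/samples.txt", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd >= 0) {
        backtrace_symbols_fd(frames, depth, fd);
        (void)write(fd, "----\n", 5);   // separator between samples
        close(fd);
    }
}

int main() {
    void* warmup[1];
    backtrace(warmup, 1);               // force libgcc to load outside the handler
    signal(SIGUSR1, sample_handler);
    // ... real work; sample externally with: kill -USR1 <pid> ...
}

Resolving the logged addresses back to annotated source is then a job for addr2line or a small script.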

After trying everything suggested here (except for the now-defunct Zoom, which is still available as a huge file on Dropbox), I found that NOTHING does what Mr. Dunlavey recommends. The "quick hacks" listed in some of the answers above wouldn't build for me, or didn't work for me either. I spent all day trying things... and nothing could find fseek as a hotspot in an otherwise simple test program that was I/O-bound.
So I coded up yet another profiler, this time with no build dependencies, based on GDB, so it should "just work" for almost any debuggable code. A single CPP file.
https://github.com/jasonrohrer/wallClockProfiler
It automates the manual process suggested by Mr. Dunlavey: interrupting the target process with GDB periodically, harvesting a stack trace, and then printing a report at the end about which stack traces are the most common. Those are your true wall-clock hotspots. And it actually works.

Related

Analyzing function memory and CPU usage

I am making a video game, a pretty small 2D shooter. Recently I noticed that the frame rate drops dramatically when there are about 9 or more bullets in the scene. My laptop can handle advanced 3D games, and my game is very simple, so hardware should not be the problem.
So now I have a pretty big codebase (at least for one person), and I am confused about where I should look. There are too many functions and classes related to bullets, and I don't know, for example, how to tell whether the problem is in the rendering function or the update function. I could use the MVS 2015 debugging tools for other programs, but for a game it is not practical: if I put a breakpoint before the render function, it would be hit 60 times a second, and I can't provide input while stopped, so I would never have bullets with which to test the render function! I tried the Task Manager, and I realized that CPU usage climbs quickly with each bullet, but when the game slows down, only 10 percent of the CPU is used!
So my questions are:
How can I analyze functions when I can't use a debugging tool?
And why does the game slow down while it can still use more system resources?
To see which part consumes the most processing power, you should use a function profiler. It doesn't "debug"; it produces a report when it finishes.
Valgrind (specifically its Callgrind tool) is a good choice for that.
Why does the game slow down? That depends on your implementation. I could write a program that divides two numbers and make it take 5 minutes to calculate the result.
We're in the video-game industry as well, and we use a very simple tool on PC for CPU profiling: Very Sleepy.
http://www.codersnotes.com/sleepy/
It is simple, but it has really helped me out a lot of times. Just fire up the program from the IDE, let Very Sleepy collect a few thousand samples, and off you go!
When it comes to memory leaks, Valgrind is a good tool, as The Quantum Physicist already noted.
For timing, I would write my own small tracing/profiling tool (if my IDE did not already have one): use debug output to write short messages to a log file, something like this:
#include <cstdio>

void HandleBullet() {
    printf("HandleBullet START: %i\n", GetSysTime());  // GetSysTime() stands in for your timer function
    // ... do your function stuff ...
    printf("HandleBullet END: %i\n", GetSysTime());    // or calculate the time of the function directly
}
Put those debug messages in all of the functions where you think they could be taking too long.
After some execution time, you can look into that file and see whether something obvious happened (blocking somewhere, for example).
If not, use a high-level language of your choice to write a small parser for the log file that tidies up and analyzes the output. Calculate things like the overall time spent in each function, or chart which functions took the longest. It should not be too difficult if you stick to a log-message format that is easy for you to parse.
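If you prefer the "calculate the time of the function directly" variant, here is a minimal sketch using std::chrono (the ScopeTimer name is made up for illustration):

#include <chrono>
#include <cstdio>

// Logs how long the enclosing scope took, labelled with a name.
struct ScopeTimer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit ScopeTimer(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}
    ~ScopeTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        printf("%s took %lld us\n", name, static_cast<long long>(us));
    }
};

void HandleBullet() {
    ScopeTimer timer("HandleBullet");  // prints when HandleBullet returns
    // ... function body ...
}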

Visual Studio - Program runs faster when profiling

I have been doing some profiling of a physics application I wrote, and I've noticed when I profile it, it runs faster and perhaps smoother than without the profiler. Note that I am NOT running the program in the debug configuration or with the debugger attached.
I measured the difference and found the program runs ~50% faster under the profiler. I don't consider this a duplicate, because the other question doesn't make it clear whether he/she was running with the debugger attached, and the top answer assumes that's the case (and the 20x speedup strongly indicates it would be the correct answer in most cases).
Another answer suggests a "heisenbug", but that's kind of a catch-all hypothesis (I'm still going to investigate along this line).
Is it possible that Visual Studio does something that prevents other applications from interfering with my application's compute or memory resources (in order to get a "fairer" result)?
Visual Studio's "CPU Usage" profiler appears to disregard the laptop's power settings, so if you run an application on a laptop that is trying to conserve battery power, it will run slower than it does under the profiler.
I discovered this when I got home from work: I noticed the speed difference had disappeared. On a hunch, I unplugged my laptop and ran the test several more times, and the speed difference returned. What's more, under the profiler the application runs at about the same speed whether plugged in or not.
I was not able to find any sources on this, but I'll be happy to edit them in if someone can find some.
If you use threading in your code, this can be caused by the system timer resolution in Windows.
The default Windows timer resolution is 15.6 ms.
When you run the profiler, the resolution is reduced to 1 ms, and your program runs faster.
Check out this answer.
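You can test this theory without the profiler by raising the timer resolution yourself with the Windows multimedia timer API. A sketch (note the setting is system-wide and costs battery, so undo it when done):

#include <windows.h>
#pragma comment(lib, "winmm.lib")  // timeBeginPeriod / timeEndPeriod live in winmm

int main() {
    timeBeginPeriod(1);   // request 1 ms system timer resolution
    // ... run the workload: Sleep(), waits, and thread wakeups now resolve
    //     at ~1 ms instead of the default ~15.6 ms ...
    timeEndPeriod(1);     // restore; the argument must match timeBeginPeriod's
    return 0;
}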

How to find where the program is waiting

I am working on a big code base. It is heavily multithreaded.
After the Linux-based application runs for a few hours, at the end, right before reporting, it goes silent. It doesn't die, it doesn't crash, it just waits. Joins, mutexes, condition variables... any of these could be the culprit.
If it had crashed, I would at least have a chance of finding the source with a debugger. As it is, I have no clue which tool to use to find the bug. I can't even post a code sample for you. The only thing that could possibly help is to tap MANY places with cout to get a picture of where the application is.
Have you been in such a situation? What do you recommend?
If you're running under Linux, then just use gdb to run the program. When the application 'silences', interrupt it with Ctrl+C, then type backtrace to see the call stack. With this you will find out the function where your application is blocked.
In the case of Linux, gdb will be a great help. Another tool that can be of great help is strace (it can also be used when there are problems with a program whose source is not readily available, because strace does not need recompilation to trace it).
strace intercepts and records the system calls a process makes and the signals it receives. It can show the order of events and all the return/resumption paths of the calls. This can take you very close to the problem area.
iotop, LTTng, and ftrace are a few other tools that may be helpful in this scenario.

gdb: How do I pause during loop execution?

I'm writing a software renderer in g++ under mingw32 in Windows 7, using NetBeans 7 as my IDE.
I've been needing to profile it of late, and this need has reached critical mass now that I'm past laying down the structure. I looked around, and to me this answer shows the most promise in being simultaneously cross-platform and keeping things simple.
The gist of that approach is that possibly the most basic (and in many ways, the most accurate) way to profile/optimise is to simply sample the stack directly every now and then by halting execution... Unfortunately, NetBeans won't pause. So I'm trying to find out how to do this sampling with gdb directly.
I don't know a great deal about gdb. What I can tell from the man pages, though, is that you set breakpoints before running your executable. That doesn't help me.
Does anyone know of a simple approach to getting gdb (or other GNU tools) to either:
Sample the stack when I say so (preferable), or
Take a whole bunch of samples at random intervals over a given period,
...given my stated configuration?
Have you tried simply running your executable in gdb and then just hitting ^C (Ctrl+C) when you want to interrupt it? That should drop you to gdb's prompt, where you can simply run the where command to see where you are, and then carry on execution with continue.
If you find yourself in an irrelevant thread (e.g. a looping UI thread), use info threads to list the threads and thread n to switch to the correct one, then execute where.

Accurate benchmarking under Windows

I am writing a program that needs to run a set of executables and find their execution times.
My first approach was just to run the process, start a timer, and take the difference between the start time and the moment the process returned its exit value.
Unfortunately, this program will not run on a dedicated machine, so many other processes/threads can greatly change the execution time.
I would like to get the time in milliseconds/clock ticks that was actually given to the process by the OS. I hope Windows stores that information somewhere, but I cannot find anything useful on MSDN.
Sure, one solution is to run the process multiple times and calculate the average time, but I want to avoid that.
Thanks.
You can take a look at the GetProcessTimes API.
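A sketch of what that looks like, assuming you started the executable with CreateProcess and still hold its handle:

#include <windows.h>
#include <cstdio>

// Print how much CPU time a process actually received, in milliseconds.
void PrintCpuTime(HANDLE process) {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(process, &creationTime, &exitTime, &kernelTime, &userTime)) {
        // FILETIME is a count of 100-nanosecond intervals split across two DWORDs.
        ULARGE_INTEGER k, u;
        k.LowPart = kernelTime.dwLowDateTime; k.HighPart = kernelTime.dwHighDateTime;
        u.LowPart = userTime.dwLowDateTime;   u.HighPart = userTime.dwHighDateTime;
        double ms = (k.QuadPart + u.QuadPart) / 10000.0;  // 10,000 ticks per millisecond
        printf("CPU time used: %.1f ms (kernel + user)\n", ms);
    }
}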
The "High-Performance Counter" might be what you're looking for.
I've used QueryPerformanceCounter/QueryPerformanceFrequency for high-resolution timing in things like 3D programming, where the stock functionality just doesn't cut it.
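A minimal sketch of that pattern (note it measures elapsed wall-clock time, not the CPU time the OS actually granted the process):

#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);   // counts per second; fixed at boot
    QueryPerformanceCounter(&start);
    // ... workload to be timed ...
    QueryPerformanceCounter(&end);
    double ms = (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    printf("elapsed: %.3f ms\n", ms);
    return 0;
}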
You could also try the RDTSC x86 instruction.
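A sketch using the compiler intrinsic rather than inline assembly (it returns raw cycle counts; on modern CPUs the TSC ticks at a constant rate independent of frequency scaling, so convert to seconds with care):

#include <intrin.h>   // MSVC; GCC/Clang provide __rdtsc in <x86intrin.h>
#include <cstdio>

int main() {
    unsigned long long start = __rdtsc();
    // ... workload to be timed ...
    unsigned long long elapsed = __rdtsc() - start;
    printf("elapsed: %llu cycles\n", elapsed);
    return 0;
}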