Profiling: How to discover where/why my program is sleeping - c++

My company's codebase has a unit test that takes unreasonably long (5 minutes wall clock time, 200ms cpu-time). It is not I/O bound, so it is probably sleeping/waiting somewhere. How would one go about discovering where?
I can't do a textual search. The actual repository is huge and has legitimate reasons to sleep. I'd get too many false positives
Profilers I tried (perf, valgrind, gprof) all seem to focus on finding functions that use a lot of cpu time, not wall clock time.
I ended up doing strace -r -k and then looking for slow system calls, but surely there are more convenient approaches?

Related

Best event counter to use for measuring wall clock time using perf tools

Simple but yet complicated question:
What counter to use to get perf tools to measure wall clock time?
As a base line the first thing when profiling code I think I need to measure is just wall clock time to get an first idea where the code takes most of the time.
I don’t care if it’s IO or bandwidth limited or something else I just want to know where it is slow.
Sounds simple requirement, but with all the many tricks modern CPUs do to work efficient (like frequency scaling etc.) and the hell lot of different (not so well documented) performance counters available in perf, it’s not easy to be sure measuring the right thing.
Currently I do:
perf record -g -e ref-cycles -F 999 -- <cmd>
I think this is unscaled CPU frequency and thus proportional to the amount of wall clock time that part of the code is running. But who the hell knows?
You can use task-clock.
This is explicitly wall clock time while the process is running and as a bonus is portable because it doesn't rely on any PMU event.

Are there tools/ methods to objectively measure performance?

I'm writing a high performance application (a raytracer) in C++ using Visual Studio, and I just spent two days trying to root out a performance drop I witnessed after refactoring the code. The reason it took so long was because the performance drop was smaller than the normal variation in execution time I witnessed from run to run.
Not sure if this is normal, but sometimes the program may run at around 33fps pretty consistently, then if you close and rerun, it may run at 37fps. This means that in order to test any new change, I had to manually run and rerun until I witnessed peak performance (And this could require up to like 10 runs). Simply running it for some large number of frames, and measuring the time doesn't fix this variability. For example, if the program runs for 40 seconds on average, it will nevertheless vary by over 1-2 seconds, which makes this test nearly useless for detecting the 1 millisecond per frame performance loss I was dealing with.
Visual Studio's profiling tools also didn't help find this small of an issue, because they also were subject to variation, and in any case, its not necessarily going to tell me the exact offending line, so I have to test solutions, and the profiler is not very effective at confirming a proposed solution's efficacy.
I realize this all may sound like premature optimization, but I don't think it is because I'm optimizing only after finishing complete features; I'm just trying to monitor changes in performance regularly so that issues like the above don't slip in and just get added to the apparent cost of the new feature.
Anyways, my question is simply whether there's a way to objectively determine the "real" speed of an application, discounting the effect of variation. Or, failing that, how do developers deal with such issues? I doubt that my current process is the ideal one.
There are lots of profilers for both c++ and openGL. For those who just need the links, here are they.
OpenGL debugger-profiler
C++ profilers but I recommend Google orbit because it has dark theme.
My eyes stopped at
Objectively measure performance
As you mentioned the speed varies from run to run because it's too complex system. It helps if the scope is small and it only tests some key algorithms. It worth to automatize and collect some reference data. As every scientist say one test is not a test, you should rely on regular tests with controlled environments.
And here comes some tricks that can be used to measure performance.
In the comments others said, an average based on several runs may help you. It softens the noise from the outside.
Process priority or processor affinity could help you control the environment. By giving low priority to other processes your program gains more resource.
Measuring the whole execution of a test and compare it against processor time. As several processes runs at the same time, processor time may differs from execution time.
Update your reference values if you do a software update. Perhaps one update comes with performance boost while other with security patch.
Give a performance range for your program instead of one specific number. Perhaps the temperature messed up your measurement and the clock speed was decreased.
If a test runs too fast to measure, execute the most critical part several times in a test case. Too fast depend on how accurate you can measure. On ms basis it's really hard to decide if a test executed in 2 ms instead of 1 ms is a failure or not. However, if executed 1000 times - 1033 ms compared to 1000 ms gives you better insight.
Only test what is the critical section. Set up the environment and start the stopwatch when everything is ready. The system startup could be another test.

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefits of helping your CPU prefetch, it also showed that prefetching also speeds up the process during runtime. After about 100 program cycles, the CPU seems to have figured it out and has optimized the cache accordingly. This saves me up to 200.000 ticks per cycle, the number drops from around 750.000 to 550.000. I got these Numbers using the qTestLib.
Now to the Question: Is there a safe way to use this runtime-speedup, letting it warm up, so to speak? Or should one not calculate this in at all and just build faster code from the start up?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: That would only speed up the first 100 program cycles in your case, gaining a total of less than 20000 ticks. That's much less than the 75000 ticks you would have to invest in the warming up.
Second, all these gains from warming up a process/cache/whatever are rather brittle. There is a number of events that destroy the warming effect that you generally do not control. Mostly these come from your process not being alone on the system. A process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since the factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce results of any reliability. Apart from that, these effects are mostly ignored.
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

What is the most accurate way to estimate the time performance of a code in C++

I am writing a C++ code that runs on Ubuntu. I use pthreads as well. I am doing my research in an algorithm performance area.
I have this algorithm that I have improved on, it could run for 6~ 10 hours. But the time measurements I am taking also include things that are very minute, like in ms.
Also, the computer I am running it on is also running other processes, so how do I make sure that the time measured does not include the time processing for other processes.
There is no single "most accurate way". As with any measurement, you will have to define what you want to measure first. If you just want to measure execution time, repeatedly doing the same task and stopping the time is appropriate.
If you want to measure the time the CPU(s) is (are) actually busy doing your task, top might be a program of interest to you.
If you need to know how much time is spend in what subroutine, the linux-utils package contains perf, which can measure individual call times.
Often, non-CPU latencies dominate runtime. It, for example, doesn't make sense to just measure the time spent by the CPU when you're waiting on network input or hard drive data.
So: your question is actually a question for yourself: What do you want to measure? "Algorithm Performance" suggests you're doing computer science, in which case you should have access to extensive literature that explains what might be of interest. There's no single "solution" to the question "what is the most accurate measurement", unless you define "measurement" more closely; that's your job, and usually it's the harder part of measuring things.

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there anyway to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instructions counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather sufficient number of samples to get detailed results.