Timing in OpenMPI - c++

Quick question about timing in OpenMPI. I see that with qstat -a I can show the wall time, and with qstat I can see the CPU time. Is it possible to have these two values written to the output file when the job is done so I can check the performance of my code?
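This doesn't answer the qstat side of the question, but one workaround is to record both values from inside the program itself so they end up in the job's output file anyway. A minimal sketch (assumptions: MPI_Wtime for wall time, std::clock for per-process CPU time; the work section is a placeholder):

#include <mpi.h>
#include <cstdio>
#include <ctime>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    double       wall_start = MPI_Wtime();   // wall-clock seconds
    std::clock_t cpu_start  = std::clock();  // CPU time of this process

    // ... the actual computation goes here ...

    double wall_s = MPI_Wtime() - wall_start;
    double cpu_s  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        std::printf("wall time: %.3f s, CPU time (rank 0): %.3f s\n", wall_s, cpu_s);

    MPI_Finalize();
    return 0;
}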

Related

perf samples do not add up to 100%

When analyzing with Linux perf:
perf record -g -e cycles my-executable
perf report
... the outermost invocation is shown to only use 50.6% of execution time, although the manual asserts it should always be 100%. Where do these samples go?
I'm aware a similar question exists; I think my problem is different because my application runs only for ~5s, and I do not expect the core to idle during this time.

attaching callgrind/valgrind to program in mid-way of its execution

I use Valgrind to detect issues by running a program from the beginning.
Now I have memory/performance issues at a very specific moment in the program. Unfortunately, there is no feasible way to fast-forward to that point from the start.
Is there a way to instrument the C++ program (Valgrind/Callgrind) partway through its execution, for example by attaching to the running process?
Already answered here:
How use callgrind to profiling only a certain period of program execution?
There is no way to use valgrind on an already running program.
For callgrind, you can however somewhat speed it up by only recording data later during execution.
For this, you might e.g. look at callgrind options
--instr-atstart=no|yes Do instrumentation at callgrind start [yes]
--collect-atstart=no|yes Collect at process/thread start [yes]
--toggle-collect=<func> Toggle collection on enter/leave function
You can also control such aspects from within your program.
Refer to valgrind/callgrind user manual for more details.
There are two things that Callgrind does that slow down execution.
Counting operations (the collect part)
Simulating the cache and branch predictor (the instrumentation part). This depends on the --cache-sim and --branch-sim options, which both default to "no". If you are using these options and you disable instrumentation then I expect that there will be some impact on the accuracy of the modelling as the cache and predictor won't be "warm" when they are toggled.
There are a few other approaches that you could use.
Use the client request mechanisms. This would require you to include a Valgrind header and add a few lines that use the Valgrind macros CALLGRIND_START_INSTRUMENTATION/CALLGRIND_STOP_INSTRUMENTATION and CALLGRIND_TOGGLE_COLLECT to start and stop instrumentation/collection (a minimal sketch appears at the end of this answer). See the manual for details. Then just run your application under Valgrind with --instr-atstart=no --collect-atstart=no
Use the gdb monitor commands. In this case you would have two terminals. In the first you would run Valgrind with --instr-atstart=no --collect-atstart=no --vgdb=yes. In the second terminal run gdb yourapplication then from the gdb prompt (gdb) target remote | vgdb. Then you can use the monitor commands as described in the manual, which include collect and instrumentation control.
callgrind_control, part of the Valgrind distribution. To be honest I've never used this.
I recently did some profiling using the first technique without cache/branch sim. When I used Callgrind on the whole run I saw a 23x runtime increase compared to running outside of Callgrind. When I profiled only the one function that I wanted to analyze, this fell to about a 5x slowdown. Obviously this will vary greatly from case to case.
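For reference, a minimal sketch of the client-request approach described above (run it under valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no; the function name is just a placeholder, and the macros are no-ops when the program runs outside Valgrind):

#include <valgrind/callgrind.h>

void region_of_interest() {
    // ... the code you actually want to profile ...
}

int main() {
    CALLGRIND_START_INSTRUMENTATION;  // turn instrumentation on (the slow part)
    CALLGRIND_TOGGLE_COLLECT;         // start collecting event counts
    region_of_interest();
    CALLGRIND_TOGGLE_COLLECT;         // stop collecting
    CALLGRIND_STOP_INSTRUMENTATION;   // turn instrumentation back off
    return 0;
}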
Thanks all for the help.
It seems this has already been answered here: How use callgrind to profiling only a certain period of program execution?
but not marked as answered for some reason.
Summary:
Starting callgrind with instrumentation off
valgrind --tool=callgrind --instr-atstart=no <PROG>
Controlling instrumentation (can be executed in another shell)
callgrind_control -i on/off

Benchmarking Code Runtime with Trace32

I have an embedded system with code that I'd like to benchmark. In this case, there's a single line I want to know the time spent on (it's the creation of a new object that kicks off the rest of our application).
I'm able to open Trace->Chart->Symbols and see the time taken for the region selected with my cursor, but this is cumbersome and not as accurate as I'd like. I've also found Perf->Function Runtime, but I'm benchmarking the assignment of a new object, not of any particular function call (new is called in multiple places, not just the line of interest).
Is there a way to view the real-world time taken on a line of code with Trace32? Going further than a single line: would there be a way to easily benchmark the time between two breakpoints?
The solution by codehearts, which uses the RunTime commands, is just fine if you don't have a real-time trace. It works with any Lauterbach tool and any target CPU.
However, if you have a real-time trace (e.g. a CPU with ETM and Lauterbach PowerTrace hardware), I recommend using the command Trace.STATistic.AddressDURation <start-addr> <end-addr> instead. This command opens a window which shows the average time between two addresses. You get the best results if you execute the code between the two addresses several times.
If you are using an ARM Cortex CPU, which supports cycle-accurate timing information (usually all Cortex-A, Cortex-R and Cortex-M7) you can improve the accuracy of the result dramatically by using the setting ETM.TImeMode.CycleAccurate (together with ETM.CLOCK <core-frequency>).
If you are using a Lauterbach CombiProbe or uTrace (and you can't use ETM.TImeMode.CycleAccurate), I recommend the setting Trace.PortFilter.ON. (By default the port filter is set to PACK, which allows more data and program flow to be recorded, but with slightly worse timing accuracy.)
Opening the Misc->Runtime window shows you the total time taken since "laststart." By setting a breakpoint on the first line of your code block and another after the last line, you can see the time taken from the first breakpoint to the second under the "actual" column.

Accurate benchmark under windows

I am writing a program that needs to run a set of executables and find their execution times.
My first approach was just to run the process, start a timer, and measure the difference between the start time and the moment the process returns its exit value.
Unfortunately, this program will not run on a dedicated machine, so many other processes/threads can greatly change the execution time.
I would like to get the time in milliseconds/clocks that was actually given to the process by the OS. I hope Windows stores that information somewhere, but I cannot find anything useful on MSDN.
Sure, one solution is to run the process multiple times and calculate the average time, but I want to avoid that.
Thanks.
You can take a look at the GetProcessTimes API.
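A minimal sketch of reading those values after the child has exited (it assumes the handle comes from CreateProcess; the helper name is just illustrative):

#include <windows.h>
#include <cstdio>

// Print kernel and user CPU time actually charged to a finished process.
void PrintProcessCpuTimes(HANDLE hProcess) {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(hProcess, &creationTime, &exitTime, &kernelTime, &userTime)) {
        ULARGE_INTEGER k, u;
        k.LowPart = kernelTime.dwLowDateTime; k.HighPart = kernelTime.dwHighDateTime;
        u.LowPart = userTime.dwLowDateTime;   u.HighPart = userTime.dwHighDateTime;
        // FILETIME intervals are in 100-nanosecond units.
        std::printf("kernel: %.3f ms, user: %.3f ms\n",
                    k.QuadPart / 10000.0, u.QuadPart / 10000.0);
    }
}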
The "High-Performance Counter" might be what you're looking for.
I've used QueryPerformanceCounter/QueryPerformanceFrequency for high-resolution timing in stuff like 3D programming where the stock functionality just doesn't cut it.
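Roughly, the pattern looks like this (a sketch; the timed section is a placeholder, and note that this measures elapsed wall-clock time, not CPU time charged to the process):

#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // counter ticks per second
    QueryPerformanceCounter(&start);

    // ... code being timed ...

    QueryPerformanceCounter(&stop);
    double ms = (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    std::printf("elapsed: %.3f ms\n", ms);
    return 0;
}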
You could also try the RDTSC x86 instruction.
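On MSVC the __rdtsc intrinsic avoids inline assembly; a sketch (it yields raw cycle counts only, and the counter is not serializing, so treat the result as approximate):

#include <intrin.h>
#include <cstdio>

int main() {
    unsigned __int64 start = __rdtsc();  // read the time-stamp counter

    // ... code being timed ...

    unsigned __int64 stop = __rdtsc();
    std::printf("elapsed cycles: %llu\n", stop - start);
    return 0;
}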

Linux time sample based profiler

short version:
Is there a good time-based sampling profiler for Linux?
long version:
I generally use OProfile to optimize my applications. I recently found a shortcoming that has me wondering.
The problem was a tight loop spawning c++filt to demangle a C++ name. I only stumbled upon the code by accident while chasing down another bottleneck. OProfile didn't show anything unusual about the code, so I almost ignored it, but my code sense told me to optimize the call and see what happened. I changed the popen of c++filt to abi::__cxa_demangle. The runtime went from more than a minute to a little over a second, about a 60x speed-up.
Is there a way I could have configured OProfile to flag the popen call? As the profile data sits now, OProfile thinks the bottleneck was the heap and std::string calls (which, BTW, once optimized dropped the runtime to less than a second, more than a 2x speed-up).
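For reference, the in-process replacement looks roughly like this (a sketch; the wrapper name is illustrative):

#include <cxxabi.h>
#include <cstdlib>
#include <string>

// In-process demangling instead of popen("c++filt ...").
// abi::__cxa_demangle allocates the result with malloc, so it must be freed.
std::string demangle(const char* mangled) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && readable) ? readable : mangled;
    std::free(readable);
    return result;
}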
Here is my OProfile configuration:
$ sudo opcontrol --status
Daemon not running
Event 0: CPU_CLK_UNHALTED:90000:0:1:1
Separate options: library
vmlinux file: none
Image filter: /path/to/executable
Call-graph depth: 7
Buffer size: 65536
Is there another profiler for Linux that could have found the bottleneck?
I suspect the issue is that OProfile only logs its samples to the currently running process. I'd like it to always log its samples to the process I'm profiling. So if the process is currently switched out (blocking on IO or a popen call) OProfile would just place its sample at the blocked call.
If I can't fix this, OProfile will only be useful when the executable is pushing near 100% CPU. It can't help with executables that have inefficient blocking calls.
Glad you asked. I believe OProfile can be made to do what I consider the right thing, which is to take stack samples on wall-clock time when the program is being slow and, if it won't let you examine individual stack samples, at least summarize, for each line of code that appears on samples, the percentage of samples the line appears on. That is a direct measure of what would be saved if that line were not there. Here's one discussion. Here's another, and another. And, as Paul said, Zoom should do it.
If your time went from 60 sec to 1 sec, that implies every single stack sample would have had a 59/60 probability of showing you the problem.
Try Zoom - I believe it will let you profile all processes - it would be interesting to know if it highlights your problem in this case.
I wrote this a long time ago, only because I couldn't find anything better: https://github.com/dicej/profile
I just found this, too, though I haven't tried it: https://github.com/oliver/ptrace-sampler
Quickly hacked up trivial sampling profiler for linux: http://vi-server.org/vi/simple_sampling_profiler.html
It appends backtrace(3) to a file on SIGUSR1, and then converts it to annotated source.
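The idea is roughly this (a sketch of the same technique, not the linked code; the file path and helper name are illustrative, and backtrace is called once up front because its first call may allocate):

#include <execinfo.h>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

static int sample_fd = -1;

// On SIGUSR1, dump the current call stack; an external loop sending
// kill -USR1 periodically turns this into wall-clock stack sampling.
static void on_sigusr1(int) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, sample_fd);  // write symbolized frames
    write(sample_fd, "----\n", 5);                   // separator between samples
}

void install_sampler(const char* path) {
    void* warmup[1];
    backtrace(warmup, 1);  // force lazy initialization outside the handler
    sample_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    signal(SIGUSR1, on_sigusr1);
}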
After trying everything suggested here (except for the now-defunct Zoom, which is still available as a huge file from Dropbox), I found that NOTHING does what Mr. Dunlavey recommends. The "quick hacks" listed above in some of the answers either wouldn't build for me or didn't work when they did. I spent all day trying things... and nothing could find fseek as a hotspot in an otherwise simple test program that was I/O bound.
So I coded up yet another profiler, this time with no build dependencies, based on GDB, so it should "just work" for almost any debuggable code. A single CPP file.
https://github.com/jasonrohrer/wallClockProfiler
It automates the manual process suggested by Mr. Dunlavey, interrupting the target process with GDB periodically and harvesting a stack trace, and then printing a report at the end about which stack traces are the most common. Those are your true wall-clock hotspots. And it actually works.