Profiling for wall-time on Linux

I have an application that I want to profile with respect to how much time is spent in various activities. Since this application is I/O intensive, I want to get a report that will summarize how much time is spent in every library/system call (wall time).
I've tried oprofile, but it seems to report time in terms of unhalted CPU cycles (that's CPU time, not real time).
I've tried strace -T, which gives wall time, but the data generated is huge and producing a summary report from it is difficult (do awk/py scripts exist for this?).
Now I'm looking into SystemTap, but I can't find any script that is close enough to be adapted, and the onsite tutorial didn't help much either. I am not sure whether what I am looking for can be done.
I need someone to point me in the right direction.
Thanks a lot!

Judging from this commit, the recently released strace 4.9 supports this with:
strace -w -c
They call it "syscall latency" (it's hard to tell from the manpage alone that this is what -w does).
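For example (the program name is a placeholder), the following should print a per-syscall wall-clock summary when the program exits, following child processes as well:
strace -f -w -c ./myprogram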

Are you doing this just out of measurement curiosity, or because you want to find time-drains that you can fix to make it run faster?
If your goal is to make it run as fast as possible, then try random-pausing.
It doesn't measure anything, except very roughly.
It may be counter-intuitive, but what it does is pinpoint the code that will result in the greatest speed-up.

See the fntimes.stp systemtap sample script. https://sourceware.org/systemtap/examples/index.html#profiling/fntimes.stp
The fntimes.stp script monitors the execution time history of a given function family (assumed non-recursive). Each time (beyond a warmup interval) is then compared to the historical maximum. If it exceeds a certain threshold (250%), a message is printed.
# stap fntimes.stp 'kernel.function("sys_*")'
or
# stap fntimes.stp 'process("/path/to/your/binary").function("*")'
The last line of that .stp script demonstrates the way to track time consumed in a given family of functions:
probe $1.return { elapsed = gettimeofday_us()-@entry(gettimeofday_us()) }


Can you profile a single call of a function with perf?

I have a C++ function that I want to profile and only that function. One possible way is to use chrono and just measure the time it takes to run that function and print it out, run the program a few times and then do stats on the samples.
I am wondering if I can skip having to explicitly code time measurements and just ask perf to focus on the time spent in a specified function.
Have a look at Google's benchmarking library to micro-benchmark the function of interest.
You can then profile the resulting executable as usual using perf.
For example, let's say that, following the basic usage, you generated an executable named mybenchmark. Then you can run perf on the binary as usual:
$ perf stat ./mybenchmark
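For illustration, a minimal benchmark along those lines might look like the following sketch, where my_function is a stand-in for whatever you actually want to measure:

#include <benchmark/benchmark.h>

void my_function();  // stand-in for the function under test, defined elsewhere

static void BM_MyFunction(benchmark::State& state) {
    // the library chooses how many iterations to run for stable statistics
    for (auto _ : state)
        my_function();
}
BENCHMARK(BM_MyFunction);

BENCHMARK_MAIN();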
You can build a flame graph of the whole application in SVG format. With a flame graph you can quickly see the functions that consume most of the CPU time. The SVG flame graph is interactive: you can click any function and see a detailed flame graph for just that function. From the description of flame graphs:
It is also interactive: mouse over the SVGs to reveal details, and click to zoom.
You can try it in action for sample bash flame graph:
http://www.brendangregg.com/FlameGraphs/cpu-bash-flamegraph.svg
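To build such a graph for your own program, the usual recipe (assuming perf and a checkout of Brendan Gregg's FlameGraph scripts) is roughly:
$ perf record -g ./myprogram
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > myprogram.svg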

Using log4cxx as input counter

I want to add a counter that records how much data comes in per hour or per day.
Since there is no timer in my code, I hope that log4cxx, which can handle daily log rotation, could help me. For example, every midnight, print a log entry showing how much data came in yesterday.
Does anyone know the trick, or have any reference?
Thanks.
This is a late answer, but maybe it's going to be useful to other people.
No, log4cxx cannot do this (print a log entry at a given time) by itself. Log4cxx is not about timers: the roll-over detection routine is checked with every log statement processed by the library, or more specifically, by the appender. There are no watchdog threads to trigger any behaviour.
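If you are willing to add a thread of your own, here is a minimal sketch (hypothetical names, a fixed hourly interval instead of exact midnight, and log4cxx assumed to be configured already):

#include <atomic>
#include <chrono>
#include <thread>
#include <log4cxx/logger.h>

std::atomic<long> g_inputCount{0};  // increment this wherever data arrives

void counterReporter() {
    auto logger = log4cxx::Logger::getLogger("input.counter");
    for (;;) {
        std::this_thread::sleep_for(std::chrono::hours(1));
        long n = g_inputCount.exchange(0);  // read and reset the counter
        LOG4CXX_INFO(logger, "inputs received in the last hour: " << n);
    }
}

// at startup: std::thread(counterReporter).detach();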

CFZip Issue - Timing out before reaching Timeout Limit

I am using cfzip to zip folders on my server, anywhere from 2mb to 5gb.
It's timing out on a folder that is 1.25GB and I get the following error:
The request has exceeded the allowable time limit Tag: cfoutput
It errors after 11 minutes and I have the following tag at the top of the page <cfsetting requesttimeout="99999">. So technically it should be waiting 1666.65 minutes before timing out, right?
It's a dedicated server, so I can push it to the max.
Any help with this would be very much appreciated.
Thanks :)
Zipping something that size is probably going to take a loooong time. With a file 5GB in size, I would also think you would start to get out-of-memory exceptions as well.
I'd be inclined to step out of the Java process, and use cfexecute to run it at a native level using the command line (should be easy enough with whatever platform you are on).
Dropping that into a cfthread is probably a good idea as well, and then working out some sort of alert for when it completes.
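On Linux, for instance, the command line that cfexecute hands off could be as simple as (paths are placeholders):
zip -r /path/to/archive.zip /path/to/folder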
You could try shoving the process into a thread. Those things rock out forever.

How to measure the amount of data transmitted by my MPI program?

I'm experimenting my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD) that can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times, varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm, so I can see how the amount of data changes when varying the previously mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I found some difficulties in this approach.
First, I don't know how to call tcpdump in my batch program (which is written in C++, using the system command for making calls) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump on another computer, since the network uses a switch. So I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was going on and I couldn't figure out which port MPI uses. It seems to use many ports. I wanted to know that in order to filter the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file would be huge if I followed this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the logfile would be much smaller than if I captured whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computers' disks, so I only have the RAM and my 4GB USB flash drive. That means I can't keep huge logfiles.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. I have only tested Sun Performance Analyzer so far. The problem is that I guess it will be difficult, maybe even impossible, to install those tools on BCCD. In addition, such a tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks one of those tools is a good choice, please let me know.
I hope someone has a solution.
Approaches like tcpdump won't work anyway if there are multi-core nodes that use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call, and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the bccd website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that counts messages and sizes; you should be able either to parse the XML output, or to use the postprocessing tools and just manually integrate the graphs (say, either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and mostly comes after the program has finished running, to do the file I/O.
If you were really super worried about the overhead of either of these approaches, you could always write your own MPI wrappers using the profiling interface that wrap MPI_Send, MPI_Recv, etc., just count the number of bytes sent and received for each process, and output only that total at the end.
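A minimal sketch of such a wrapper via the standard PMPI profiling interface (names like bytes_sent are made up; the receive figure is an upper bound based on the posted buffer size, not the bytes actually received):

#include <mpi.h>
#include <stdio.h>

static long long bytes_sent = 0, bytes_recv = 0;

/* Intercept MPI_Send, count the bytes, then forward to the real call. */
int MPI_Send(const void *buf, int count, MPI_Datatype type, int dest,
             int tag, MPI_Comm comm) {
    int size;
    MPI_Type_size(type, &size);
    bytes_sent += (long long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src,
             int tag, MPI_Comm comm, MPI_Status *status) {
    int size;
    MPI_Type_size(type, &size);
    bytes_recv += (long long)count * size;  /* posted buffer size, not actual bytes */
    return PMPI_Recv(buf, count, type, src, tag, comm, status);
}

/* Print the per-process totals just before MPI shuts down. */
int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: sent %lld bytes, received (at most) %lld bytes\n",
           rank, bytes_sent, bytes_recv);
    return PMPI_Finalize();
}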

Programmatically getting per-process disk io statistics on Windows?

I would like to display a list of processes (Windows, C++) and how much each is reading from and writing to the disk in KB/sec.
The Resource Monitor of Windows 7 has this ability, so it should be possible to do the same.
However, I have been unable to find a relevant API call or anything relevant in the perfmon counters. Could anyone point me in the right direction?
You can call GetProcessIoCounters to get overall disk I/O data per process - you'll need to keep track of the deltas and convert them to a time-based rate yourself.
This API will tell you the total number of I/O operations as well as the total bytes.
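A minimal sketch of that approach (the pid is a placeholder and most error handling is omitted):

#include <windows.h>
#include <cstdio>

int main() {
    DWORD pid = 1234;  // placeholder: the process you want to watch
    HANDLE h = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!h) return 1;

    IO_COUNTERS before, after;
    GetProcessIoCounters(h, &before);
    Sleep(1000);  // sample interval: one second
    GetProcessIoCounters(h, &after);

    // deltas over the one-second interval give bytes/sec; divide by 1024 for KB/sec
    double readKB  = (after.ReadTransferCount  - before.ReadTransferCount)  / 1024.0;
    double writeKB = (after.WriteTransferCount - before.WriteTransferCount) / 1024.0;
    printf("read: %.1f KB/s, write: %.1f KB/s\n", readKB, writeKB);

    CloseHandle(h);
    return 0;
}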
WMI can do it, as long as you periodically snapshot it to get differential stats for some "recent" slice of time. This post presents a peculiarly mixed solution, with VBScript reading the info from WMI and Perl continually presenting the information in a Windows console. Despite the strange language mix, I think it stands as a good example of how to get at the kind of information you require (it should be quite possible to recode all of it in C++, of course).